Volume 9, Issue 4


On the Numerical Accuracy of Mathematica 5.0 for Doing Linear and Nonlinear Regression

# 3. Methods for Linear and Nonlinear Regression

## (a) Linear Regression

The general statistical model assumed for the linear least-squares regression problems is

$$y = X\beta + \varepsilon,$$

where $y$ denotes the response (dependent) variable, $\beta$ denotes the vector of unknown parameters to be estimated, $n$ is the number of observations, and $X$ denotes the $n \times p$ matrix of predictor (independent) variables. The specific functional form for each dataset is different from set to set and is specified in the certified results for that set, summarized in Appendix A. Unless the constant term in the regression is suppressed, $X$ is assumed to contain a column of ones, the coefficient of which is the nonzero intercept. If we want to make inferences about $\beta$ or how well the relationship fits the data, we have to make some assumptions about the joint distribution of $X$ and $\varepsilon$. The usual minimal assumptions are that $\varepsilon$ has a distribution with mean zero and variance-covariance matrix $\sigma^2 I$ independently of $X$. In this case, the least-squares estimates of $\beta$,

$$\hat{\beta} = (X'X)^{-1}X'y,$$

have certain optimal properties, and the estimated standard errors of $\hat{\beta}$ are

$$\widehat{\operatorname{se}}(\hat{\beta}_j) = \sqrt{s^2 \left[(X'X)^{-1}\right]_{jj}},$$

where

$$s^2 = \frac{e'e}{n-p}$$

is an estimate of $\sigma^2$ and

$$e = y - X\hat{\beta}$$

are the calculated residuals from the regression. The measure of goodness-of-fit $R^2$ is defined as

$$R^2 = 1 - \frac{e'e}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

when an intercept (frequently represented by including a column of ones in the matrix $X$) is included and defined as

$$R^2 = 1 - \frac{e'e}{\sum_{i=1}^{n} y_i^2}$$

when no intercept is included (e.g., NoInt1 and NoInt2 [8]). Here is what the NIST says about the method employed for linear regression.

"For all datasets, multiple precision calculations (accurate to 500 digits) were made using the preprocessor and FORTRAN subroutine package of Bailey (1995). Data were read in exactly as multiple precision numbers and all calculations were made with this very high precision. The results were output in multiple precision, and only then rounded to fifteen significant digits. These multiple precision results are an idealization. They represent what would be achieved if calculations were made without roundoff or other errors. Any typical numerical algorithm (i.e., not implemented in multiple precision) will introduce computational inaccuracies, and will produce results which differ slightly from these certified values." (See the certification methodology description for each dataset at www.itl.nist.gov/div898/strd/lls/lls.shtml, accessed 07/12/03.)

It is perfectly possible to simply write out these equations in Mathematica and solve a particular problem, although doing so is not generally advisable: standard methods for inverting well-conditioned matrices may not give accurate answers for ill-conditioned matrices. This, in fact, is what went wrong with most of the extant computer packages for doing ordinary least-squares regression when Longley [10] did his famous study. Mathematica, however, is very sophisticated about the way it inverts matrices and, indeed, gets this problem right, albeit with a warning message in the program exhibited for the computation of invX1.
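To make the "simply write out these equations" route concrete, here is a minimal sketch in Python rather than Mathematica (the toy dataset, the helper functions, and all names are invented for illustration; the article's own program reads the Longley data instead):

```python
import math

# Invented toy data (for illustration only): roughly y = 1 + 2 x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
n, p = len(x), 2

X = [[1.0, xi] for xi in x]                      # column of ones gives the intercept

def matmul(A, B):
    return [[sum(a * b for a, b in zip(ra, cb)) for cb in zip(*B)] for ra in A]

def inv2(M):
    # Closed-form inverse of a 2 x 2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

Xt = [list(col) for col in zip(*X)]
XtX_inv = inv2(matmul(Xt, X))                    # (X'X)^{-1}
beta = [row[0] for row in matmul(XtX_inv, matmul(Xt, [[yi] for yi in y]))]
# beta ≈ [1.09, 1.97]

e = [yi - (beta[0] + beta[1] * xi) for xi, yi in zip(x, y)]   # residuals
s2 = sum(ei * ei for ei in e) / (n - p)                       # estimate of sigma^2
se = [math.sqrt(s2 * XtX_inv[j][j]) for j in range(p)]        # standard errors
ybar = sum(y) / n
R2 = 1 - sum(ei * ei for ei in e) / sum((yi - ybar) ** 2 for yi in y)
```

This is exactly the explicit normal-equations route that the text warns can fail on ill-conditioned data; robust packages instead reach the same quantities through orthogonalization (QR or SVD) rather than through an explicit inverse.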

Daniel Lichtblau comments on the difference between these two methods of doing OLS linear regression as follows: "What we [Mathematica] actually work with [in the add-on package, Statistics`LinearRegression`] are singular values. In effect, this is a method to compute a pseudo-inverse (which we do not actually do, unless necessitated by the various report option settings). In contrast, matrix inversion will be done using Gaussian elimination, with appropriate warnings if the matrix is found to be ill conditioned."
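Why the choice of method matters can be demonstrated with a notoriously ill-conditioned family of matrices. The pure-Python sketch below is an invented illustration (the Hilbert matrix stands in for an ill-conditioned $X'X$; it has no connection to the NIST data): it solves the same linear system once in exact rational arithmetic and once with double-precision Gaussian elimination with partial pivoting, then compares.

```python
from fractions import Fraction

def hilbert(n, exact=True):
    # H[i][j] = 1 / (i + j + 1), famously ill conditioned as n grows
    one = Fraction(1) if exact else 1.0
    return [[one / (i + j + 1) for j in range(n)] for i in range(n)]

def solve(A, b):
    """Gaussian elimination with partial pivoting; works for Fractions or floats."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

n = 12
b_exact = [sum(row) for row in hilbert(n, exact=True)]   # true solution is all ones
b_float = [float(v) for v in b_exact]

x_exact = solve(hilbert(n, exact=True), b_exact)         # rational arithmetic: exact
x_float = solve(hilbert(n, exact=False), b_float)        # double precision

err = max(abs(float(xe) - xf) for xe, xf in zip(x_exact, x_float))
# In double precision, the computed solution of the 12 x 12 Hilbert system
# departs noticeably from the exact all-ones solution.
```

The same elimination code is exact with rationals and badly inaccurate in doubles; the failure is a property of the conditioning, not of the algorithm's logic, which is why singular-value methods carry a practical advantage here.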

Here is my "direct" program.

The reader, of course, will have to fill in the exact location of the Longley.txt file, which is copied as a .txt file in Appendix C.

The Mathematica parameter estimates and standard errors compared with the benchmark results are:

And those of $s^2$ and $R^2$:

If I were to exhibit the results to 15 digits, the Mathematica and benchmark results would also coincide exactly. Section 3.1.4 of The Mathematica Book [31] gives a good discussion of the way in which Mathematica handles the problem of numerical precision in general and floating-point arithmetic in particular. So when doing linear least-squares regression it does not appear to matter whether I use my own code or Mathematica's Regress function.

I have not been able to find documentation as to exactly how Mathematica performs a linear least-squares regression with its Regress function, other than the full code for LinearRegression in AddOns/StandardPackages/Statistics`LinearRegression`. (See Daniel Lichtblau's earlier comments.) Needless to say, it is not so easy for a nonprogrammer to read this code. It appears that Mathematica uses the same computationally efficient and accurate methods to solve the usual least-squares normal equations as it does to solve any system of linear equations or to invert a numerically specified square matrix. See Section A.9.4 of The Mathematica Book [31].

Comparisons of the results from Regress with the NIST benchmarks for each of the 11 NIST StRD linear regression examples are presented in summary form in Appendix A. The results presented for Mathematica are simply those Mathematica prints out. The reader, however, should be aware that these are not the actual results of Mathematica's computation, which are all carried out to 16 digits of precision, but rather a rounded form for ease of visual presentation. The full 16 digits can easily be obtained by using NumberForm[expr, {n, f}], which prints approximate real numbers having n digits, with f digits to the right of the decimal point.
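The distinction between displayed digits and stored digits is not peculiar to Mathematica. As a rough analogue (in Python, purely for illustration; NumberForm itself is the Mathematica function), a double-precision value can be displayed to a few significant digits or to all 16:

```python
x = 2.0 / 3.0

# The displayed form is itself a rounding of the stored binary value.
short = f"{x:.6g}"    # 6 significant digits
full = f"{x:.16g}"    # all 16 significant digits carried by a double
print(short, full)
```

Comparing a benchmark against the short form can therefore make two computations look identical (or different) when the full-precision values say otherwise.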

## (b) Nonlinear Regression

Following the NIST description, the statistical model assumed for nonlinear regression in the benchmark calculations is the so-called generic univariate case:

$$y = f(x; \beta) + \varepsilon,$$

where $y$ denotes the response (dependent) variable, $x$ denotes the predictor (independent) variables, and $\beta$ (unsubscripted) the unknown parameters to be estimated. The specific functional forms assumed differ from one benchmark case to the next. They are reported in Appendix B, in which comparisons between the results obtained by Mathematica and those obtained by the NIST are presented. The NIST methodology is described as follows.

"The certified values for the nonlinear least-squares regression problems were obtained using 128-bit precision with the reported results confirmed by at least two different algorithms and software packages using analytic derivatives.

"The certified values for the estimates, of the true model parameters are those produced by the smallest sum of squares, i.e.,

where denotes the number of observations. Under the assumption that

it follows that these are the maximum-likelihood estimators.

"The certified values for the standard deviations of the estimates of the model parameters are the square roots of the diagonal elements of the asymptotic covariance matrix,

where

denotes the Jacobian matrix with ijth element

evaluated at the current values of the parameters , and and denote the number of observations and the number of parameters, respectively.

"The certified value of the residual sum of squares is defined by

where denotes the number of observations."

The benchmark estimate of $\sigma$ is thus $\hat{\sigma} = \sqrt{\mathrm{RSS}/(n-p)}$ (www.itl.nist.gov/div898/strd/nls/nls_info.shtml).

Under the assumption $\varepsilon \sim N(0, \sigma^2 I)$, it is thus possible to derive all of these estimates by maximum likelihood. In this case, the standard errors of the estimates are obtained asymptotically as the square roots of the diagonal elements of the inverse of the information matrix (the Hessian of the likelihood function evaluated at the maximizing point). (See the standard econometric presentations of nonlinear regression methods in Davidson and MacKinnon [32] and Greene [33].) Indeed, it is in some ways a more general method of analysis, since ML may be applied in the case of more general models.

Before proceeding, it is worth noting a few limitations of the standard univariate nonlinear regression model, $y = f(x; \beta) + \varepsilon$. According to Davidson and MacKinnon [32, 42], "The feature that distinguishes regression models from all other statistical models is that the only way in which randomness affects the dependent variable is through an additive error term or disturbance." However, this may be less of a restriction than you might at first think. For example, suppose a multiplicative disturbance, $y = f(x; \beta)\,\eta$; then a simple logarithmic transformation suffices to return things to additivity: $\log y = \log f(x; \beta) + \log \eta$. Of course, any distributional assumptions we may have made concerning $\eta$ may have to be modified. Independence of the regressors, $x$, and the disturbance, $\varepsilon$, or at least lack of correlation, is an assumption common to both linear and nonlinear regression. A problem arises, however, if the appropriate transformation depends on parameters to be estimated. For example, a common generalization is a transformation of the dependent variable involving another parameter $\lambda$ to be estimated:

$$\frac{y^{\lambda} - 1}{\lambda} = f(x; \beta) + \varepsilon.$$

In this case, the distribution of the $y$s, given the $x$s, will no longer be the same as the distribution of the $\varepsilon$s, and least squares may be problematic and generally will no longer yield the same estimates as maximum likelihood. Maximum likelihood is still an effective method, however, since the distribution of the $y$s, given the $x$s, may still be found from the distribution of the $\varepsilon$s by multiplying by the Jacobian of the transformation, $y^{\lambda - 1}$ in this example. A more serious limitation is the necessity of specifying the functional form of $f$. Whereas, in the case of linear regression, we hardly ever think about alternative functional forms, except perhaps to worry about interactions, the very fact that we are considering nonlinear regression underscores a concern about the appropriate functional form. In recent years, there has been an efflorescence of work on nonparametric methods explicitly designed to cope with this problem. (See, especially, Haerdle [34] and Pagan and Ullah [35].)

There is an extensive literature on nonlinear regression, including standard texts such as Gallant [36], Bates and Watts [37], and Seber and Wild [38], each with a somewhat different emphasis. Most emphasis falls on computational issues and problems of parametric inference. A very good brief expository introduction is Gallant [39].

Most algorithms for computing nonlinear regression parameters are based on Hartley's [40] modified Gauss-Newton method (Hartley and Booker, [41]) or on its extension and modification by Marquardt [42] following a suggestion by Levenberg [43]. There is an important difference between maximizing or minimizing functions in general and finding the minimum of an analytically specified sum of squares (see the preceding section). In general, we can evaluate only the function itself analytically and perhaps its gradient. Using only these two pieces of information is the basis for the method of steepest descent (the gradient method). But, in the case of nonlinear least squares, we know a great deal more: the structure of the problem yields a good approximation to the Hessian matrix directly from the Jacobian. Hartley's insight was to exploit this information explicitly, rather than build it up piecemeal during the iterative process of descent. His modified Gauss-Newton method is called the inverse Hessian method. Levenberg's insight was to recognize that the inverse Hessian method is much faster and more accurate when you are close to the minimum, whereas steepest descent is more reliable far from it, and to suggest separating the problem of getting into the neighborhood of the minimum from that of "zipping in." Marquardt's contribution was an elegant algorithm designed to vary smoothly between the two methods as the minimum point is approached. Mathematica's NonlinearRegress uses the Levenberg-Marquardt algorithm by default but also allows you to choose among several other algorithms for function minimization.
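The interplay between the two methods can be made concrete. The following is a minimal sketch of the Levenberg-Marquardt idea in Python rather than Mathematica (the two-parameter exponential model, the toy data, and the simple damping schedule are all invented for illustration; this is not Mathematica's implementation):

```python
import math

def model(b, x):
    return b[0] * math.exp(b[1] * x)

def residuals(b, xs, ys):
    return [y - model(b, x) for x, y in zip(xs, ys)]

def jacobian(b, xs):
    # df/db1 = exp(b2 x),  df/db2 = b1 x exp(b2 x)
    return [[math.exp(b[1] * x), b[0] * x * math.exp(b[1] * x)] for x in xs]

def lm_fit(xs, ys, b, lam=1e-3, iters=100):
    """Levenberg-Marquardt with the classic damping schedule (illustrative sketch)."""
    ssr = sum(r * r for r in residuals(b, xs, ys))
    for _ in range(iters):
        J = jacobian(b, xs)
        r = residuals(b, xs, ys)
        # Pieces of J'J and J'r for the two-parameter case
        a11 = sum(j[0] * j[0] for j in J)
        a12 = sum(j[0] * j[1] for j in J)
        a22 = sum(j[1] * j[1] for j in J)
        g1 = sum(j[0] * ri for j, ri in zip(J, r))
        g2 = sum(j[1] * ri for j, ri in zip(J, r))
        # Damped system: (J'J + lam * diag(J'J)) delta = J'r
        d11, d22 = a11 * (1 + lam), a22 * (1 + lam)
        det = d11 * d22 - a12 * a12
        delta = [(d22 * g1 - a12 * g2) / det, (d11 * g2 - a12 * g1) / det]
        trial = [b[0] + delta[0], b[1] + delta[1]]
        ssr_trial = sum(r * r for r in residuals(trial, xs, ys))
        if ssr_trial < ssr:
            b, ssr, lam = trial, ssr_trial, lam / 10   # accept: behave more like Gauss-Newton
        else:
            lam *= 10                                   # reject: behave more like steepest descent
    return b, ssr

# Invented toy data, roughly y = 2 exp(0.5 x)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.02, 2.57, 3.28, 4.25, 5.41]
b_hat, ssr = lm_fit(xs, ys, [1.0, 1.0])

# Asymptotic standard errors from (RSS/(n - p)) * (J'J)^{-1}, as in the NIST description
n, p = len(xs), 2
J = jacobian(b_hat, xs)
a11 = sum(j[0] * j[0] for j in J)
a12 = sum(j[0] * j[1] for j in J)
a22 = sum(j[1] * j[1] for j in J)
det = a11 * a22 - a12 * a12
s2 = ssr / (n - p)
se = [math.sqrt(s2 * a22 / det), math.sqrt(s2 * a11 / det)]
```

The damping parameter is the whole story: small `lam` gives nearly undamped Gauss-Newton steps for zipping in, large `lam` shrinks and rotates the step toward steepest descent for safety far from the minimum.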

In [30, 44], Hiebert did a careful evaluation of existing programs for doing nonlinear least squares and for solving systems of nonlinear equations (a harder problem). One surprising conclusion he drew [30, 15] was that, "On the basis of this testing and these specific implementations of the Levenberg-Marquardt and the augmented Gauss-Newton methods, one method does not appear to be superior to the other." Perhaps for this reason, the NIST uses a variety of methods, each considered appropriate for the problem at hand. However, the NIST is not more explicit about the methods used in specific cases than the statements quoted earlier.

Standard-error calculations and inference for nonlinear least-squares estimates are all based on asymptotic results (Hartley and Booker, [41]; Jennrich, [45]), deftly summarized by Gallant [39]. These are the basis for the NIST calculations detailed earlier. Mathematica does not say how its calculations are carried out in the description of the NonlinearRegress function; however, the code is available in the Statistics`NonlinearRegress` package, and it appears that the same formulae are used.

Comparisons of the results from NonlinearRegress with the NIST benchmarks for each of the 27 NIST StRD nonlinear regression examples with their published starting values are presented in summary form in Appendix B. The NIST actually ran all computations using not only the two published starting values but also the final benchmark answers as starting values, because some nonlinear regression algorithms have a problem if you start from the right answer. However, in a few cases, I also tried starting at the benchmark answers: in the case of MGH10, as noted earlier, and for those few starting values for which Mathematica did not converge: Lanczos2, MGH17, and MGH09, as well as MGH10. However, as noted by the NIST, MGH10 is a particularly difficult problem: "This problem was found to be difficult for some very good algorithms. There is a local minimum at (+inf, -14.07 ... , -inf, -inf) with final sum of squares 0.00102734." [46, 17] Lanczos3 is also a rather peculiar function, but Mathematica generally performs quite well for the chosen starting values.

The crucial problem of choosing good starting values is discussed extensively by Bates and Watts [37, 72-76] and, more briefly, by Seber and Wild [38, 665-666]. Bates and Watts suggest using a linearized form of the expectation function, while Seber and Wild suggest grid search or randomized starting values. Ratkowsky [26] gives a much more extensive discussion. When only one or two parameters are involved, a graphical approach works wonders. When three or more parameters are involved, however, special techniques are required and the difficulties are formidable. I have been unable to ascertain how the NIST chose their starting values, which always comprise one set quite close to the final answer and another far from it, but Mathematica does offer an alternative. Before invoking NonlinearRegress, you could use NMinimize, which assumes no starting values (and tolerates ones known to cause problems), to find the minimum of the appropriate sum of squares. NMinimize is an elaborate and well-documented Mathematica function for finding global minima. Its more sophisticated methods may not be very efficient compared to the algorithm Mathematica employs in NonlinearRegress, but they may work in such difficult cases. In computing the results reported here, I avoid the problem of choosing starting values by simply using those reported by the NIST.
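Seber and Wild's grid-search suggestion can be sketched generically. The toy problem below is entirely invented: a one-parameter sinusoidal model whose sum of squares has many local minima, so a local search from a single bad start can stall. Running the same local search from a grid of starts and keeping the smallest sum of squares is a crude stand-in for what a global optimizer like NMinimize does far more carefully:

```python
import math

# Invented toy problem: data generated from y = sin(1.7 x). Viewed as a
# function of b, the sum of squares for fitting y = sin(b x) has many
# local minima, so one bad start can trap a local optimizer far from b = 1.7.
xs = [0.1 * i for i in range(40)]
ys = [math.sin(1.7 * x) for x in xs]

def ssr(b):
    return sum((y - math.sin(b * x)) ** 2 for x, y in zip(xs, ys))

def local_min(b, step=1e-3, iters=2000):
    """Crude fixed-step local descent; a stand-in for a real local optimizer."""
    for _ in range(iters):
        b = min((b - step, b, b + step), key=ssr)
    return b

# Grid of starting values; keep the fit with the smallest sum of squares
starts = [0.2, 1.0, 1.6, 2.5, 4.0]
best = min((local_min(b0) for b0 in starts), key=ssr)
```

Only some starts land in the basin of the global minimum, but the multi-start wrapper recovers it regardless; the same pattern scales (expensively) to higher-dimensional parameter spaces.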