Negative binomial regression is implemented using maximum likelihood estimation. The traditional model and the rate model with offset are demonstrated, along with regression diagnostics.

### Traditional Model

Negative binomial regression is a type of generalized linear model in which the dependent variable $Y$ is a count of the number of times an event occurs. A convenient parametrization of the negative binomial distribution is given by Hilbe [1]:

$$P(Y = y) = \frac{\Gamma(y + 1/\alpha)}{\Gamma(y + 1)\,\Gamma(1/\alpha)} \left(\frac{1}{1 + \alpha\mu}\right)^{1/\alpha} \left(\frac{\alpha\mu}{1 + \alpha\mu}\right)^{y}, \qquad y = 0, 1, 2, \ldots \tag{1}$$

where $\mu$ is the mean of $Y$ and $\alpha$ is the heterogeneity parameter. Hilbe [1] derives this parametrization as a Poisson-gamma mixture, or alternatively as the number of failures before the $(1/\alpha)$th success, though we will not require $1/\alpha$ to be an integer.

The traditional negative binomial regression model, designated the NB2 model in [1], is

$$\ln \mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p, \tag{2}$$

where the predictor variables $x_1, x_2, \ldots, x_p$ are given, and the population regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$ are to be estimated.

Given a random sample of $n$ subjects, we observe for subject $i$ the dependent variable $y_i$ and the predictor variables $x_{i1}, x_{i2}, \ldots, x_{ip}$. Utilizing vector and matrix notation, we let $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^{\mathsf T}$, and we gather the predictor data into the design matrix $X$ as follows:

$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}.$$

Designating the $i$th row of $X$ to be $x_i$, and exponentiating (2), we can then write the distribution (1) as

$$P(Y_i = y_i) = \frac{\Gamma(y_i + 1/\alpha)}{\Gamma(y_i + 1)\,\Gamma(1/\alpha)} \left(\frac{1}{1 + \alpha e^{x_i \boldsymbol{\beta}}}\right)^{1/\alpha} \left(\frac{\alpha e^{x_i \boldsymbol{\beta}}}{1 + \alpha e^{x_i \boldsymbol{\beta}}}\right)^{y_i}.$$

We estimate $\alpha$ and $\boldsymbol{\beta}$ using maximum likelihood estimation. The likelihood function is

$$L(\alpha, \boldsymbol{\beta}) = \prod_{i=1}^{n} P(Y_i = y_i),$$

and the log-likelihood function is

$$\mathcal{L}(\alpha, \boldsymbol{\beta}) = \sum_{i=1}^{n} \left\{ y_i \ln\!\left(\frac{\alpha e^{x_i \boldsymbol{\beta}}}{1 + \alpha e^{x_i \boldsymbol{\beta}}}\right) - \frac{1}{\alpha} \ln\!\left(1 + \alpha e^{x_i \boldsymbol{\beta}}\right) + \ln\Gamma\!\left(y_i + \frac{1}{\alpha}\right) - \ln\Gamma(y_i + 1) - \ln\Gamma\!\left(\frac{1}{\alpha}\right) \right\}. \tag{3}$$

The values of $\alpha$ and $\boldsymbol{\beta}$ that maximize $\mathcal{L}(\alpha, \boldsymbol{\beta})$ will be the maximum likelihood estimates we seek, and the estimated variance-covariance matrix of the estimators is $(-H)^{-1}$, where $H$ is the Hessian matrix of second derivatives of the log-likelihood function, evaluated at the estimates. Then the variance-covariance matrix can be used to find the usual Wald confidence intervals and $p$-values of the coefficient estimates.
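The article carries out these computations in *Mathematica*. As an illustration only (a sketch, not the author's code), log-likelihood (3) can be written in Python with NumPy and SciPy; the sanity check at the end verifies that the pmf (1) implied by the log-likelihood sums to 1 for a single observation.

```python
import numpy as np
from scipy.special import gammaln  # ln Gamma, numerically stable

def nb2_loglik(alpha, beta, X, y):
    """Log-likelihood (3) of the NB2 model, with ln mu_i = x_i . beta."""
    mu = np.exp(X @ beta)              # mu_i = e^{x_i beta}
    inv_a = 1.0 / alpha                # 1/alpha appears in every term
    return np.sum(
        y * np.log(alpha * mu / (1.0 + alpha * mu))
        - inv_a * np.log(1.0 + alpha * mu)
        + gammaln(y + inv_a) - gammaln(y + 1.0) - gammaln(inv_a)
    )

# sanity check: for one observation, exponentiating the log-likelihood gives
# the pmf (1), which must sum to 1 over y = 0, 1, 2, ...
X1, b1 = np.array([[1.0]]), np.array([0.0])        # mu = e^0 = 1
total = sum(np.exp(nb2_loglik(0.5, b1, X1, np.array([k]))) for k in range(200))
```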

### Example 1: Traditional Model with Simulated Data

We will use *Mathematica* to replicate some examples given by Hilbe [1], who uses R and Stata. We start with simulated data generated with known regression coefficients, then recover the coefficients using maximum likelihood estimation. We will generate a sample of $n$ observations of a dependent random variable $Y$ that has a negative binomial distribution with mean given by (2), using known values of $\beta_0$, $\beta_1$, $\beta_2$, and $\alpha$. The design matrix will contain independent standard normal variates.

Now we define and maximize the log-likelihood function (3), obtaining the estimates of $\alpha$ and $\boldsymbol{\beta}$. Some experimentation with starting values for the search may be required, and the accuracy goal may need to be lowered; we could obtain good starting values for $\boldsymbol{\beta}$ using Poisson regression via `GeneralizedLinearModelFit`, while $\alpha$ is usually between 0.0 and 4.0 [1].

But we arbitrarily set all starting values to 1.0 and successfully find the correct estimates.
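For readers following along without *Mathematica*, here is an illustrative Python sketch of the same experiment. The sample size and parameter values below are arbitrary stand-ins, not the ones used in the article; only the method (simulate from the NB2 model, then maximize (3) from all-ones starting values) mirrors the text.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(1)

# Illustrative truth (arbitrary values, not the article's): mean mu = e^{X beta}.
n, beta_true, alpha_true = 5000, np.array([1.0, 0.5, -0.25]), 0.75
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
mu = np.exp(X @ beta_true)
# NB2 sampling: negative_binomial(r, p) with r = 1/alpha, p = 1/(1 + alpha mu)
y = rng.negative_binomial(1.0 / alpha_true, 1.0 / (1.0 + alpha_true * mu))

def negloglik(params):
    """Negative of log-likelihood (3); params = (alpha, beta_0, beta_1, beta_2)."""
    if params[0] <= 0:                 # alpha must be positive
        return np.inf
    alpha, beta = params[0], params[1:]
    m = np.exp(X @ beta)
    inv_a = 1.0 / alpha
    return -np.sum(y * np.log(alpha * m / (1 + alpha * m))
                   - inv_a * np.log(1 + alpha * m)
                   + gammaln(y + inv_a) - gammaln(y + 1) - gammaln(inv_a))

# as in the text, all starting values are arbitrarily set to 1.0
fit = minimize(negloglik, np.ones(4), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
alpha_hat, beta_hat = fit.x[0], fit.x[1:]
```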

Define two helper functions.

Next, we find the standard errors of the estimates. The standard errors are the square roots of the diagonal elements of the variance-covariance matrix $(-H)^{-1}$, where, as mentioned above, $H$ is the Hessian matrix of second derivatives of the log-likelihood function. First, define the Hessian for any function.

Then we find the Hessian and the variance-covariance matrix $(-H)^{-1}$ at the values of our parameter estimates.

Finally, these are our standard errors.
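A generic finite-difference Hessian, shown here in Python as a stand-in for the article's *Mathematica* helper (a sketch, not the author's code), is all that is needed for the standard errors. The check at the end uses a quadratic log-likelihood whose curvature, and hence standard error, is known exactly.

```python
import numpy as np

def hessian(f, x, h=1e-5):
    """Numerical Hessian of f at x via central finite differences."""
    k = len(x)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.zeros(k), np.zeros(k)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return H

def std_errors(loglik, estimates):
    """Square roots of the diagonal of (-H)^{-1}, evaluated at the MLEs."""
    H = hessian(loglik, np.asarray(estimates, dtype=float))
    return np.sqrt(np.diag(np.linalg.inv(-H)))

# check: one observation y = 0 from N(theta, sigma^2 = 4) has log-likelihood
# -theta^2/8 (up to a constant), so the standard error of theta-hat is 2
se = std_errors(lambda v: -v[0] ** 2 / 8.0, [0.0])
```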

We can now print a table of the results: the estimates of the coefficients, their standard errors, and the Wald statistics, $p$-values, and confidence intervals.

We see that in each case the confidence interval has captured the population parameter.

### Traditional Model for Rates, Using Offset

If the dependent variable $y$ counts the number of events during a specified time interval $t$, then the observed rate $y/t$ can be modeled by using the traditional negative binomial model above, with a slight adjustment. We note that $t$ can also be thought of as area or subpopulation size, among other interpretations that lead to considering a rate.

Since the mean rate is $\mu/t$, we make the following adjustment to model (2) above:

$$\ln\!\left(\frac{\mu}{t}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,$$

which can also be written as:

$$\ln \mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \ln t. \tag{4}$$

This last term, $\ln t$, is called the *offset*. So in our log-likelihood function, instead of replacing $\mu_i$ with $e^{x_i \boldsymbol{\beta}}$, we replace $\mu_i$ with $e^{x_i \boldsymbol{\beta} + \ln t_i}$, resulting in the following:

$$\mathcal{L}(\alpha, \boldsymbol{\beta}) = \sum_{i=1}^{n} \left\{ y_i \ln\!\left(\frac{\alpha e^{x_i \boldsymbol{\beta} + \ln t_i}}{1 + \alpha e^{x_i \boldsymbol{\beta} + \ln t_i}}\right) - \frac{1}{\alpha} \ln\!\left(1 + \alpha e^{x_i \boldsymbol{\beta} + \ln t_i}\right) + \ln\Gamma\!\left(y_i + \frac{1}{\alpha}\right) - \ln\Gamma(y_i + 1) - \ln\Gamma\!\left(\frac{1}{\alpha}\right) \right\}. \tag{5}$$

Then we proceed as before, maximizing the new log-likelihood function in order to estimate the parameters.
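As an illustrative Python sketch (not the article's *Mathematica* code), the offset enters (5) only through the linear predictor. The check at the end uses the fact that multiplying $t$ by a constant $c$ is equivalent to adding $\ln c$ to the intercept.

```python
import numpy as np
from scipy.special import gammaln

def nb2_loglik_offset(alpha, beta, X, y, t):
    """Log-likelihood (5): the linear predictor x_i beta gains the offset ln t_i."""
    mu = np.exp(X @ beta + np.log(t))      # mu_i = e^{x_i beta + ln t_i}
    inv_a = 1.0 / alpha
    return np.sum(y * np.log(alpha * mu / (1 + alpha * mu))
                  - inv_a * np.log(1 + alpha * mu)
                  + gammaln(y + inv_a) - gammaln(y + 1) - gammaln(inv_a))

# equivalence check: scaling t by c matches shifting the intercept by ln c
X = np.array([[1.0, 2.0]]); y = np.array([3.0]); c = 2.5
v_offset = nb2_loglik_offset(0.5, np.array([0.1, 0.2]), X, y, np.array([c]))
v_shifted = nb2_loglik_offset(0.5, np.array([0.1 + np.log(c), 0.2]),
                              X, y, np.array([1.0]))
```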

### Example 2: Traditional Model with Offset for the *Titanic* Data

The *Titanic* survival data, available from [2] and analyzed in [1] using R and Stata, is summarized in Table 1, with crew members deleted.

Why did fewer first-class children survive than second-class or third-class children? Was it because first-class children were at extra risk? No, it was because there were fewer first-class children on board the *Titanic* in the first place. So we do not want to model the raw number ($y$) of survivors; instead, we want to model the proportion ($y/t$) of survivors, which is the survival rate. So in (4) we need $t$ to be the number of cases.

We set up the design matrix $X$, with indicators 1 for adults and males, and using indicator variables for second class and third class, which means first class will be the reference category.

Then we set up the dependent variable and the offset.

We define the log-likelihood (5).

Now we maximize it to find the coefficients.

Then we find the standard errors of the coefficients.

And again we can print a table of the results.

But perhaps more useful for interpretation of the coefficients would be the incidence rate ratio (IRR) for each variable, which is obtained by exponentiating the coefficient. For example, out of a sample of $t$ adults, we expect from our model (4) that the survival rate $\mu/t$ will be $e^{\beta_0 + \beta_{\text{age}} + \beta_{\text{sex}} x_{\text{sex}} + \cdots}$, while for an identical number of children we expect their survival rate to be $e^{\beta_0 + \beta_{\text{sex}} x_{\text{sex}} + \cdots}$. So by dividing the two rates, we obtain the ratio of rates (IRR) to be

$$\mathrm{IRR}_{\text{age}} = e^{\beta_{\text{age}}},$$

which we estimate as $e^{\hat\beta_{\text{age}}}$, approximately one-half. Thus, our interpretation is that adults survived at roughly half the rate at which children survived, among those of the same sex and class. The standard error of the IRR is found by multiplying the estimated IRR by the standard error of the coefficient (see [1]), while a confidence interval for the IRR is found by exponentiating the confidence interval for the coefficient. Thus we obtain the following.

We do not need an IRR for the constant term or for $\alpha$, so we drop them and then print the resulting table.

The confidence interval for the variable `class2` contains 1.0, consistent with the lack of significance of its coefficient, and indicating that the survival rate of second-class passengers was not significantly different from that of first-class passengers. We will address this after computing some model assessment statistics and residuals.
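The IRR arithmetic can be sketched numerically as follows; the coefficient estimate and standard error below are made-up placeholders, not the fitted *Titanic* values.

```python
import numpy as np

# hypothetical coefficient estimate and standard error for the adult indicator
beta_age, se_age = -0.7, 0.1       # placeholder values, not the article's fit
z = 1.959964                       # 97.5th percentile of the standard normal

irr = np.exp(beta_age)                     # IRR = e^beta
irr_se = irr * se_age                      # delta-method SE of the IRR [1]
irr_ci = np.exp([beta_age - z * se_age,    # exponentiate the coefficient's CI
                 beta_age + z * se_age])
```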

### Model Assessment

Various types of model fit statistics and residuals are readily computed. We use definitions given in [1]; alternate definitions exist and would require only minor changes.

Commonly used model fit statistics include the log-likelihood, deviance, Pearson chi-square dispersion, Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC).

We already have the log-likelihood as a byproduct of the maximization process. The deviance is defined as

$$D = 2\left[\mathcal{L}(y; y) - \mathcal{L}(\mu; y)\right],$$

where $\mathcal{L}(\mu; y)$ is our log-likelihood function (5), and $\mathcal{L}(y; y)$ is the log-likelihood function with $y$ replacing $\mu$. For our NB2 model, this simplifies to $D = \sum_{i=1}^{n} d_i^2$, where

$$d_i^2 = 2\left\{ y_i \ln\!\left(\frac{y_i}{\mu_i}\right) - \left(y_i + \frac{1}{\alpha}\right) \ln\!\left(\frac{1 + \alpha y_i}{1 + \alpha \mu_i}\right) \right\}. \tag{6}$$

The Pearson chi-square dispersion statistic is given by $\chi^2/(n - k)$, where $\chi^2 = \sum_{i=1}^{n} (y_i - \mu_i)^2 / (\mu_i + \alpha \mu_i^2)$ and $k$ is the number of estimated parameters, while AIC and BIC are defined as

$$\mathrm{AIC} = -2\mathcal{L} + 2k$$

and

$$\mathrm{BIC} = -2\mathcal{L} + k \ln n.$$
We compute these for the *Titanic* data above and display them.

These model assessment statistics are most useful when compared to those of a competing model, which we pursue in the next section after computing residuals.
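These statistics are straightforward to compute in any environment; here is an illustrative Python sketch (not the article's code). The convention that $k$ counts all estimated parameters, including $\alpha$, is an assumption, since, as noted above, alternate definitions exist. The example at the end uses the fact that a fit with $\hat\mu_i = y_i$ exactly has zero deviance.

```python
import numpy as np

def nb2_fit_stats(y, mu, alpha, loglik, k, n):
    """Deviance (6), Pearson dispersion, AIC, and BIC for an NB2 fit.
    k counts all estimated parameters, including alpha (an assumption;
    definitions of these statistics vary between texts)."""
    # y ln(y/mu) with the convention 0 ln 0 = 0
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    deviance = 2.0 * np.sum(
        ylogy - (y + 1.0 / alpha) * np.log((1 + alpha * y) / (1 + alpha * mu)))
    chi2 = np.sum((y - mu) ** 2 / (mu + alpha * mu ** 2))
    return {"deviance": deviance,
            "dispersion": chi2 / (n - k),
            "AIC": -2 * loglik + 2 * k,
            "BIC": -2 * loglik + k * np.log(n)}

# a perfect fit (mu_i = y_i) has zero deviance and zero dispersion
stats = nb2_fit_stats(np.array([1.0, 4.0, 2.0]), np.array([1.0, 4.0, 2.0]),
                      0.5, -12.0, k=2, n=3)
```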

### Residuals

The raw residuals are of course $y_i - \hat\mu_i$, while the Pearson residuals are $(y_i - \hat\mu_i)/\sqrt{\hat\mu_i + \hat\alpha \hat\mu_i^2}$, and the deviance residuals are $\operatorname{sign}(y_i - \hat\mu_i)\sqrt{d_i^2}$, with $d_i^2$ as defined in (6).

These residuals can be standardized by dividing by $\sqrt{1 - h_i}$, where the $h_i$ are the leverages obtained from the diagonal of the hat matrix $\hat H = W^{1/2} X (X^{\mathsf T} W X)^{-1} X^{\mathsf T} W^{1/2}$, for $W$ equal to the diagonal matrix with $\hat\mu_i/(1 + \hat\alpha \hat\mu_i)$ as the $i$th element of the diagonal.

Here are the unstandardized residuals for the *Titanic* data.

And here are the leverages and the standardized residuals.
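The residual and leverage computations can be sketched in Python as follows (illustrative data; the working-weight choice $W = \operatorname{diag}(\hat\mu/(1+\hat\alpha\hat\mu))$ is an assumption following the NB2 variance function). A useful property for checking the leverages is that they sum to the number of columns of $X$, the trace of the hat matrix.

```python
import numpy as np

def nb2_residuals(X, y, mu, alpha):
    """Raw, Pearson, and standardized Pearson residuals, plus leverages h_i.
    Assumes hat matrix W^{1/2} X (X'WX)^{-1} X' W^{1/2} with working
    weights W = diag(mu / (1 + alpha mu))."""
    raw = y - mu
    pearson = raw / np.sqrt(mu + alpha * mu ** 2)   # Var(Y) = mu + alpha mu^2
    w = mu / (1.0 + alpha * mu)                     # diagonal of W
    A = np.linalg.inv(X.T @ (X * w[:, None]))       # (X'WX)^{-1}
    h = w * np.sum((X @ A) * X, axis=1)             # leverages: diag of hat matrix
    return raw, pearson, pearson / np.sqrt(1.0 - h), h

# small worked example with arbitrary illustrative numbers
X = np.column_stack([np.ones(5), np.arange(5.0)])
mu = np.exp(0.1 * np.arange(5.0))
y = np.array([1.0, 1.0, 2.0, 1.0, 3.0])
raw, pearson, std_pearson, h = nb2_residuals(X, y, mu, 0.5)
```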

Hilbe recommends plotting the standardized Pearson residuals against the leverages $h_i$, with a poor model fit indicated by residuals that are outside the interval $[-2, 2]$ when the leverage is high.

We have two standardized Pearson residuals that are not within the range $[-2, 2]$, one of which has a high leverage. We also recall that the variable `class2` was not significant. Perhaps the model will be improved if we remove `class2`. All that is required is to remove `class2` from the design matrix $X$, remove the corresponding starting value from the maximizing command, and run the model again. We obtain the following assessment statistics and standardized residuals for the revised model with `class2` removed.

We set up the design matrix and find the coefficients.

Comparing to the full model, we see that the assessment statistics have improved (they are smaller, indicating a better fit), and the Standardized Pearson residuals with high leverages are within the recommended boundaries. It appears that the model has been improved by dropping `class2`.

### Conclusion

The traditional negative binomial regression model (NB2) was implemented by maximum likelihood estimation without much difficulty, thanks to the maximization command and especially to the automatic computation of the standard errors via the Hessian.

Other negative binomial models, such as the zero-truncated, zero-inflated, hurdle, and censored models, could likewise be implemented by merely changing the likelihood function.

### Acknowledgments

The author acknowledges suggestions and assistance by the editor and the referee that helped to improve this article.

### References

[1] J. Hilbe, *Negative Binomial Regression*, 2nd ed., New York: Cambridge University Press, 2011.

[2] "JSE Data Archive." *Journal of Statistics Education*. (Nov 19, 2012) www.amstat.org/publications/jse/jse_data_archive.htm.

M. L. Zwilling, "Negative Binomial Regression," *The Mathematica Journal*, 2013. dx.doi.org/10.3888/tmj.15-6.

### About the Author

**Michael L. Zwilling**

*Department of Mathematics
University of Mount Union
1972 Clark Avenue
Alliance, OH 44601*

*zwilliml@mountunion.edu*