Logistic regression is a member of the generalized linear models (GLMs), a generalization of linear regression to cases where the residuals (errors) are not normally distributed. Furthermore, the linear model is related to the actual response through a link function. In the logistic regression we study here, the link function is the logit function, and the residuals are binomially distributed.
We will use two data sets in this part. First, a larger version of a familiar data set, the seg data set from Section 6. This data can be found at http://coltekin.net/cagri/R/data/seg_large.csv.
Second, we will use another data set related to language acquisition. The data contains information on ‘overregularized’ past tense verb forms in CHILDES recordings. The observations (lines in the data) correspond to recording sessions. For each session, the data contains the age of the child in months (age), the number of times the child utters irregular verbs in past tense form (n.past) and the number of times that form is overregularized (n.or). We are interested in predicting the probability (or rate) of overregularizations from a child’s age. You can get the full data set at http://coltekin.net/cagri/R/data/past_tense_or.csv.
Again, the problem is real, but the data is fake (the counts of past tense forms are from CHILDES, but the error rates have no empirical basis). You should not take the conclusions of this analysis seriously.
Make sure that the variables utterance and phoneme are factor variables. ▷
Add a new variable correct to the or data frame that contains the rate of correct (not overregularized) production of the past tense forms. ▷
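One way this could be done is sketched below, using a small made-up data frame standing in for the real or data (the real data comes from the URL above):

```r
# Toy stand-in for the 'or' data frame; the counts are invented for illustration
or <- data.frame(age    = c(24, 30, 36),
                 n.past = c(10, 20, 25),
                 n.or   = c( 4,  5,  3))

# rate of correct (not overregularized) past tense productions
or$correct <- (or$n.past - or$n.or) / or$n.past
or$correct
```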
Check the model assumptions through diagnostic plots, and refit a model excluding the most influential data point. Save the resulting model as or.ols.
Repeat the diagnostic plots. Do you see any other problems with the model diagnostics? ▷
If you were careful in Exercise 11.4, you should have realized that the model predicts a probability value greater than one. One of the first problems with using ordinary regression to predict probability values is that probabilities are bounded between $0$ and $1$, while our linear model is happy to predict any value between $-\infty $ and $+\infty $. The first step in logistic regression is to relate the regression prediction to our response variable through a function. Instead of predicting the probability value $p$, we predict $logit\left(p\right)$, which is

$$logit\left(p\right) = \log\left(\frac{p}{1-p}\right)$$
Since probabilities are between $0$ and $1$, the quantity in the parentheses above, the odds, transforms them to values between $0$ and $+\infty $, and taking the logarithm of this value expands the range to between $-\infty $ and $+\infty $.
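In R, the logit transform can be written directly (the built-in qlogis() function computes the same quantity):

```r
# logit maps probabilities in (0, 1) to the whole real line
logit <- function(p) log(p / (1 - p))

logit(0.5)          # 0: even odds
logit(c(0.1, 0.9))  # symmetric around 0
```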
Transforming the response variable with logit is just part of the solution, and we do not normally do the transformation manually. However, just for demonstration, we will transform the data and fit an ordinary regression model in the next exercise.
Fit a linear regression model predicting the transformed variable log.odds from age. Exclude the outlier we identified earlier. Save the model as or.logit.
Produce the diagnostic plots and inspect the results. How do you interpret the coefficients? ▷
The exercise above aims to give you a sense of how logistic regression proceeds. Specifying a non-linear function of our response variable (the logit transform) is only part of the solution. The model fit in Exercise 11.6 still assumes that the residuals are normally distributed, which is incorrect for binary or binomial response data. We cannot fix this with lm(), which fits models with normally distributed residuals. To fit a proper logistic regression, as well as other GLMs, we use the glm() function.
The syntax of glm() is similar to the syntax of lm(). The differences are the specification of the response variable, and the type of GLM (e.g., logistic or count regression), given with the family parameter. The value of the family parameter we use for logistic regression is binomial. If no family parameter is specified, the default gaussian is used, in which case glm() is equivalent to lm(). In logistic regression, we specify our response variable either as binary $0$ and $1$ values, or as a two-column matrix of ‘successes’ and ‘failures’ for each observation. Since the or data fits the second format, Listing 11 presents the specification and summary of a logistic regression model fit with glm() using the success/failure notation.
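The success/failure notation looks as follows; this is a sketch on simulated counts standing in for the real or data used in Listing 11:

```r
set.seed(1)
# simulated stand-in for the 'or' data
age    <- 20:40
n.past <- rpois(length(age), 20) + 5
n.or   <- rbinom(length(age), n.past, plogis(2 - 0.15 * age))
or     <- data.frame(age, n.past, n.or)

# two-column response: successes (overregularizations), failures (correct forms)
or.glm <- glm(cbind(n.or, n.past - n.or) ~ age,
              family = binomial, data = or)
summary(or.glm)
```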
The output should mostly be familiar. The first important difference to keep in mind is that the coefficient estimates are not estimates of the probabilities, but of the logit of the probability values. The hypothesis test performed is a z-test (the z-value here is also called Wald’s statistic), but the interpretation is the same. The other differences you should have noticed include the frequent use of the term deviance, and no sign of ${r}^{2}$ in the output. These are due to the fact that the logistic regression model is estimated using maximum likelihood estimation (MLE) rather than the least squares estimation used in ordinary linear regression. The deviance measures the difference, as measured by log likelihood, between the model of interest and a model that fits the data perfectly. The deviance in GLMs is similar to the residual sum of squares in ordinary regression: smaller deviance indicates better model fit. The AIC we introduced in Section 8 is also reported here. Remember that this is a combination of the model fit (likelihood) and the number of parameters in the model, and lower values of AIC indicate better models.
A problem with logistic regression (as well as other GLMs) is overdispersion. Overdispersion occurs when the errors (residuals) are more variable than expected from the theorized distribution. In the case of logistic regression, the theorized error distribution is the binomial distribution, whose variance is a function of its mean (or of the parameter $p$). If there is overdispersion, the coefficient estimates will be more confident (have smaller standard errors) than they should be. One can detect overdispersion by comparing the residual deviance with the degrees of freedom. If these two numbers are close, there is no overdispersion. Residual deviance much larger than the degrees of freedom indicates overdispersion. Underdispersion, indicated by residual deviance lower than the degrees of freedom, is also possible but rarer. In our example in Listing 11, there is no indication of overdispersion.
The last line in Listing 11 indicates the number of iterations used in the MLE. The MLE, unlike least squares estimation, is not an analytic procedure but an iterative optimization procedure. All else being equal, a large number of iterations means that the model was difficult to fit, and in some cases glm() will give up and announce that the model could not be fit within the defined maximum number of iterations. Sometimes you will get a result (probably accompanied by a warning), but the estimates will be far from optimal. This is generally evidenced by very large standard errors in comparison to the coefficient estimates.
We will first deal with the overdispersion problem. A simple solution for overdispersion is to estimate an additional parameter indicating the amount of overdispersion. With glm(), this is done with the so-called ‘quasi’ families, i.e., in logistic regression we specify family=quasibinomial instead of binomial.
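A sketch of the quasibinomial variant, again on simulated counts standing in for the real or data:

```r
set.seed(1)
# simulated stand-in for the 'or' data
age    <- 20:40
n.past <- rpois(length(age), 20) + 5
n.or   <- rbinom(length(age), n.past, plogis(2 - 0.15 * age))
or     <- data.frame(age, n.past, n.or)

# same model as before, but an extra dispersion parameter is estimated
or.qglm <- glm(cbind(n.or, n.past - n.or) ~ age,
               family = quasibinomial, data = or)
summary(or.qglm)$dispersion   # near 1 when there is no overdispersion
```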
The predict() function for GLMs can give you either the value of the linear predictor (the logit value in this case), or the actual response value (the probabilities). The first one is the default; if you want the response values, you should specify the type="response" option of the predict() function.
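The two prediction scales are related by the inverse logit, as this sketch (on simulated counts standing in for the or data) shows:

```r
set.seed(1)
age    <- 20:40
n.past <- rpois(length(age), 20) + 5
n.or   <- rbinom(length(age), n.past, plogis(2 - 0.15 * age))
m <- glm(cbind(n.or, n.past - n.or) ~ age, family = binomial)

lp <- predict(m)                     # default: linear predictor (logit) scale
pr <- predict(m, type = "response")  # probability scale
all.equal(plogis(lp), pr)            # inverse logit links the two scales
```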
Another case of binary or binomial response is when we have binary ‘success/failure’ information for each observation. In the or data set we worked with above, the unit of observation has been a recording session. Alternatively, we could consider every single past tense verb uttered by any child as the unit of observation, and indicate whether it was correct or overregularized (why is this a bad idea?). As a tricky exercise, you can try to expand the or data set as binary responses, fit a GLM and observe the differences. However, we will switch to the seg data set for experimenting with binary-response data.
To fit a GLM with a binary response, we simply use our binary response variable (the response variable can be of different data types, but it should take only two values), and use the binomial family with glm().
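For instance, using the built-in mtcars data as a stand-in (its binary variable am plays the role of a response like seg$boundary, and wt plays the role of a numeric predictor):

```r
# am (0 = automatic, 1 = manual) is a binary response; wt (weight) a predictor
m <- glm(am ~ wt, family = binomial, data = mtcars)
coef(m)   # coefficients are on the logit scale
```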
Do you observe overdispersion?
Plot the observed data and the prediction curve on the same graph. ▷
A measure of fit for a logistic regression model is how accurately the model predicts the data. Note that since the predictions of the model are probabilities rather than direct success/failure indications, we need to set a threshold value. Using a threshold of $0.5$, i.e., assuming probabilities over $0.5$ indicate success, is common in practice. However, there may be cases where a different threshold is more appropriate, or cases where we are not interested in converting probabilities to decisions at all, as in the or data set above.
TIP: You can compare seg$boundary with predict(seg.pmi, type="response") > 0.5. Note that type="response" is important here. ▷
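The accuracy computation could look like this; mtcars again stands in for the seg data, with am in the role of the binary response boundary:

```r
m <- glm(am ~ wt, family = binomial, data = mtcars)
pred <- predict(m, type = "response") > 0.5   # threshold the probabilities
accuracy <- mean(pred == mtcars$am)           # proportion predicted correctly
accuracy
```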
What is the majority-class baseline accuracy for the seg data set? ▷
In the exercises above, we have worked with only one predictor for simplicity. Logistic regression, and GLMs in general, are an extension of the general linear models we studied earlier. As a result, we can use multiple numeric or categorical predictors with logistic regression as well.
The AIC values (in the model summaries) also indicate that the big model is better. However, some of the variables in the big model may not be necessary.
Perform a stepwise selection starting from the full model model.full. Save the model identified by step() as seg.model. ▷
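The general shape of the solution is sketched below; a hypothetical full model on mtcars stands in for model.full on the seg data:

```r
# hypothetical 'full' model with several predictors
model.full <- glm(am ~ wt + hp + qsec, family = binomial, data = mtcars)
seg.model  <- step(model.full, trace = 0)   # stepwise selection by AIC
formula(seg.model)
```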
Is the difference statistically significant? TIP: You can use either Fisher’s exact test or the chi-square test of independence for this purpose. The names of the functions are fisher.test() and chisq.test(), respectively. ▷
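Both functions accept a contingency table; the counts below are made up purely for illustration (correct/wrong predictions of two hypothetical models):

```r
# made-up 2x2 table: rows are models, columns are prediction outcomes
tab <- matrix(c(45, 5, 30, 20), nrow = 2, byrow = TRUE,
              dimnames = list(model   = c("model.A", "model.B"),
                              outcome = c("correct", "wrong")))
chisq.test(tab)$p.value
fisher.test(tab)$p.value
```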
In this tutorial we only discuss the binary (or binomial) case. Logistic regression can also be used with a multinomial response, where you have more than two categories, e.g., noun, adjective or adverb. If the response is ordered, multinomial logistic regression can be fit using polr() from the MASS package. Unordered multinomial logistic regression can be fit using multinom() from the nnet package.
The lrm() function from the rms package also provides functions for fitting logistic regression models, with some additional options and output with additional statistics (e.g., pseudo ${r}^{2}$ values). The package also contains other useful utilities for fitting and diagnosing GLMs.