### 11 Logistic Regression

Logistic regression is a member of the family of generalized linear models (GLMs),$4$ a generalization of linear regression to cases where the residuals (errors) are not normally distributed. Furthermore, the linear model is related to the actual response through a link function. In the logistic regression we study here, the link function is the logit function, and the residuals are binomially distributed.

We will use two data sets in this part. The first is a larger version of a familiar data set, the seg data set from Section 6. This data set can be found at http://coltekin.net/cagri/R/data/seg_large.csv.

Second, we will use another data set related to language acquisition. The data contains information on ‘overregularized’ past tense verb forms in CHILDES recordings. The observations (lines in the data) correspond to recording sessions. For each session, the data contains the age of the child in months (age), the number of times the child utters irregular verbs in past tense form (n.past), and the number of times that form is overregularized (n.or). We are interested in predicting the probability (or rate) of overregularizations from a child’s age. You can get the full data set at http://coltekin.net/cagri/R/data/past_tense_or.csv.

Again, the problem is real, but the data is fake (the counts of past tense forms are from CHILDES, but the error rates have no empirical basis). You should not take the conclusions of this analysis seriously.

Exercise 11.1. Read the CSV ﬁle http://coltekin.net/cagri/R/data/seg_large.csv into your R environment. Name the resulting data set seg (overwriting the earlier one if it exists).

Make sure that the variables utterance and phoneme are factor variables. $⊳$

Exercise 11.2. Read the CSV ﬁle http://coltekin.net/cagri/R/data/past_tense_or.csv into your R environment. Name the resulting data set or.

Add a new variable correct to the or data frame that contains the rate of correct (not overregularized) production of the past tense forms. $⊳$

#### 11.1 Regression and binomial response variables

Exercise 11.3. Fit an ordinary least squares regression model predicting the rate of correct past tense production from the age of the child using the data set or.

Check the model assumptions through diagnostic plots, and reﬁt a model excluding the most inﬂuential data point. Save the resulting model as or.ols.

Repeat the diagnostic plots. Do you see any other problems with the model diagnostics? $⊳$

Exercise 11.4.  What is the predicted correct past-tense rate for children at ages $1.5$ and $8$? Note that age in our data is in months. $⊳$

If you were careful in Exercise 11.4, you should have realized that the model predicts a probability value greater than one. One of the ﬁrst problems with using ordinary regression to predict probabilities is that probability values are bounded between $0$ and $1$, while our linear model is happy to predict any value between $-\infty$ and $+\infty$. The ﬁrst step in logistic regression is to relate the regression prediction to our response variable through a function. Instead of predicting the probability value $p$, we predict $logit\left(p\right)$, which is

$logit\left(p\right) = log\left(\frac{p}{1-p}\right)$

Since probabilities are between $0$ and $1$, the quantity in the parentheses above, the odds, transforms them to values between $0$ and $+\infty$, and taking the logarithm expands the range to between $-\infty$ and $+\infty$.
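As a quick sanity check of these ranges, the logit can be sketched in a couple of lines of R (the function name logit here is our own definition, not a base R function):

```r
# Map a probability in (0, 1) to the whole real line.
logit <- function(p) log(p / (1 - p))

logit(0.5)  # 0: even odds
logit(0.9)  # log(9), about 2.2
logit(0.1)  # -log(9), about -2.2: symmetric around p = 0.5
```

Note that the transform is undefined at exactly $0$ and $1$, which matters if some observations have no errors at all.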

Exercise 11.5. Plot the curve of the logit function deﬁned above. $⊳$

Transforming the response variable with the logit is just part of the solution, and we do not normally do the transformation manually. However, just for demonstration, we will transform the data and ﬁt an ordinary regression model in the next exercise.

Exercise 11.6.  Add a new variable, log.odds, to the or data frame which contains the log odds for the probability of correct responses.

Fit a linear regression model predicting the transformed variable log.odds from age. Exclude the outlier we identiﬁed earlier. Save the model as or.logit.

Produce the diagnostic plots and inspect the results. How do you interpret the coeﬃcients? $⊳$

Exercise 11.7. What are the expected rate of overregularization for children at ages $1$ to $10$ (years) according to the model ﬁt in Exercise 11.6? $⊳$

#### 11.2 Binomial data and generalized linear models

The exercises above aim to give you a sense of how logistic regression proceeds. Transforming the response variable with a non-linear function (the logit) is only part of the solution. The model ﬁt in Exercise 11.6 still assumes that the residuals are normally distributed, which is incorrect for binary or binomial response data. We cannot ﬁx this with lm(), which ﬁts models with normally distributed residuals. To ﬁt a proper logistic regression, as well as other GLMs, we use the glm() function.

The syntax of glm() is similar to the syntax of lm(). The diﬀerences are the speciﬁcation of the response variable, and the type of GLM (e.g., logistic or count regression) given with the family parameter. The value of the family parameter we use for logistic regression is binomial. If no family parameter is speciﬁed, the default gaussian is used, in which case glm() is equivalent to lm(). In logistic regression, we specify our response variable either as binary $0$ and $1$ values, or as a two-column matrix of ‘successes’ and ‘failures’ for each observation. Since the or data ﬁts the second format, Listing 11 presents the speciﬁcation and summary of a logistic regression model ﬁt with glm() using the success/failure notation.

Listing 11: Logistic regression example. The output is edited for clarity.

```r
> or.glm <- glm(cbind(n.past - n.or, n.or) ~ age,
+               family = binomial, data = or, subset = -68)
> summary(or.glm)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.93872  -0.54742   0.01848   0.75018   2.31268

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.558637   0.153005   3.651 0.000261
age         0.037836   0.003878   9.756  < 2e-16
---
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 190.765  on 90  degrees of freedom
Residual deviance:  93.036  on 89  degrees of freedom
AIC: 385.24

Number of Fisher Scoring iterations: 4
```

The output should mostly be familiar. The ﬁrst important diﬀerence to keep in mind is that the coeﬃcient estimates are not estimates of the probabilities, but of the logit of the probability values. The hypothesis test performed is a z-test (the z-value here is also called Wald’s statistic), but the interpretation is the same. The other diﬀerences you should have noticed are the frequent use of the term deviance, and no sign of ${r}^{2}$ in the output. This is because the logistic regression model is estimated using maximum likelihood estimation (MLE) rather than the least squares estimation used in ordinary linear regression. The deviance measures the diﬀerence, in log likelihood, between the model of interest and a model that ﬁts the data perfectly.$5$ The deviance in GLMs is similar to the residual sum of squares in ordinary regression: smaller deviance indicates a better model ﬁt. The AIC we introduced in Section 8 is also reported here. Remember that this is a combination of the model ﬁt (likelihood) and the number of parameters in the model, and lower values of AIC indicate better models.
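These quantities can also be extracted programmatically rather than read off the summary. The sketch below uses a small made-up data frame with the same columns as or (the values are hypothetical, chosen only so the code runs standalone):

```r
# Hypothetical miniature version of the `or` data set.
d <- data.frame(age    = c(24, 30, 36, 42, 48),
                n.past = c(10, 12, 15, 20, 25),
                n.or   = c( 6,  6,  6,  5,  4))
m <- glm(cbind(n.past - n.or, n.or) ~ age, family = binomial, data = d)

deviance(m)     # residual deviance, as reported by summary()
df.residual(m)  # residual degrees of freedom
logLik(m)       # the log likelihood the deviance is based on
AIC(m)          # lower values indicate better models
```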

A problem with logistic regression (as well as other GLMs) is overdispersion. Overdispersion occurs when the errors (residuals) are more variable than expected from the theorized distribution. In the case of logistic regression, the theorized error distribution is the binomial distribution, whose variance is a function of its mean (or the parameter $p$). If there is overdispersion, the coeﬃcient estimates will appear more conﬁdent (smaller standard errors) than they should be. One can detect overdispersion by comparing the residual deviance with the residual degrees of freedom. If these two numbers are close, there is no overdispersion. A residual deviance much larger than the degrees of freedom indicates overdispersion. Underdispersion, indicated by a residual deviance much lower than the degrees of freedom, is also possible but rarer. In our example in Listing 11, there is no indication of overdispersion.
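A rough numeric version of this check is the ratio of the two quantities (sketched here on the same kind of small hypothetical data frame as above, since only the ratio matters):

```r
# Hypothetical miniature version of the `or` data set.
d <- data.frame(age    = c(24, 30, 36, 42, 48),
                n.past = c(10, 12, 15, 20, 25),
                n.or   = c( 6,  6,  6,  5,  4))
m <- glm(cbind(n.past - n.or, n.or) ~ age, family = binomial, data = d)

# Residual deviance over residual degrees of freedom:
# values close to 1 suggest no overdispersion.
dispersion <- deviance(m) / df.residual(m)
dispersion
```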

The last line in Listing 11 indicates the number of iterations used in the MLE. Unlike least squares estimation, MLE is not an analytic procedure but an iterative optimization procedure. All else being equal, a large number of iterations means that the model was diﬃcult to ﬁt, and in some cases glm() will give up and announce that the model could not be ﬁt within the deﬁned maximum number of iterations. Sometimes you will get a result (probably accompanied by a warning), but the estimates will be far from the optimum. This is generally evidenced by very large standard errors in comparison to the coeﬃcient estimates.

We will ﬁrst deal with the overdispersion problem. A simple solution for overdispersion is to estimate an additional parameter indicating the amount of overdispersion. With glm(), this is done with the so-called ‘quasi’ families; i.e., in logistic regression we specify family=quasibinomial instead of binomial.
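As a sketch of this fix (again on a small hypothetical data frame), note that only the family argument changes; the coefficient estimates stay the same, but the standard errors are scaled by the estimated dispersion:

```r
# Hypothetical miniature version of the `or` data set.
d <- data.frame(age    = c(24, 30, 36, 42, 48),
                n.past = c(10, 12, 15, 20, 25),
                n.or   = c( 6,  6,  6,  5,  4))
m.quasi <- glm(cbind(n.past - n.or, n.or) ~ age,
               family = quasibinomial, data = d)

# The dispersion parameter is now estimated rather than fixed at 1.
disp <- summary(m.quasi)$dispersion
disp
```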

Exercise 11.8. Although we concluded that there is no overdispersion with the model presented in Listing 11, ﬁt the same model with quasibinomial family, and compare the summary with the model or.glm above. $⊳$

The predict() function for GLMs can give you either the value of the linear predictor (the logit value in this case), or the actual response value (the probabilities). The ﬁrst is the default; if you want the response values, you should specify the type="response" option of the predict() function.
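For example, the two prediction scales look like this (a small hypothetical data frame again stands in for or):

```r
# Hypothetical miniature version of the `or` data set.
d <- data.frame(age    = c(24, 30, 36, 42, 48),
                n.past = c(10, 12, 15, 20, 25),
                n.or   = c( 6,  6,  6,  5,  4))
m <- glm(cbind(n.past - n.or, n.or) ~ age, family = binomial, data = d)
ages <- data.frame(age = 12 * (1:3))  # ages 1 to 3 years, in months

predict(m, newdata = ages)                     # default: logit scale
p <- predict(m, newdata = ages, type = "response")  # probability scale
p
```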

Exercise 11.9. What are the expected rate of overregularization for children at ages $1$ to $10$ (years) according to the model or.glm from Listing 11? $⊳$
Exercise 11.10.  Plot the scatter plot of observed correct past tense rates against age in the or data. Plot the curve representing the expected correct past tense rate against the age range in the data using the model or.glm. $⊳$

#### 11.3 Binary data

Another case of binary or binomial response is when we have a binary ‘success/failure’ value for each observation. In the or data set we worked with above, the unit of observation has been a recording session. Alternatively, we could consider every single past tense verb uttered by any child as the unit of observation, and indicate whether it was correct or overregularized (why is this a bad idea?). As a tricky exercise, you can try to expand the or data set into binary responses, ﬁt a GLM, and observe the diﬀerences. However, we will switch to the seg data set for experimenting with binary-response data.
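For concreteness, the expansion could be sketched as follows on a small hypothetical data frame; a binary-response fit on the expanded data gives the same coefficient estimates as the aggregated success/failure fit:

```r
# Hypothetical miniature version of the `or` data set.
d <- data.frame(age    = c(24, 30, 36, 42, 48),
                n.past = c(10, 12, 15, 20, 25),
                n.or   = c( 6,  6,  6,  5,  4))

# One row per past tense token: 1 = correct, 0 = overregularized.
binary <- data.frame(
  age     = rep(d$age, d$n.past),
  correct = unlist(mapply(function(ok, err) rep(c(1, 0), c(ok, err)),
                          d$n.past - d$n.or, d$n.or))
)

m.agg <- glm(cbind(n.past - n.or, n.or) ~ age, family = binomial, data = d)
m.bin <- glm(correct ~ age, family = binomial, data = binary)
all.equal(coef(m.agg), coef(m.bin), tolerance = 1e-6)  # TRUE
```

The deviances of the two fits differ (they are measured against different saturated models), but the estimates and standard errors agree.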

To ﬁt a GLM with a binary response, we simply use our binary response variable (the response variable can be of diﬀerent data types, but it should have only two values), and use the binomial family with glm().
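In outline, the fit looks like this (seg.toy is a hypothetical miniature stand-in for the seg data set, with made-up boundary and pmi values):

```r
# Hypothetical miniature stand-in for the `seg` data set.
seg.toy <- data.frame(boundary = c(0, 0, 1, 1, 0, 1),
                      pmi      = c(2.1, 3.0, -0.5, -1.2, 0.1, 1.8))

# Binary response on the left-hand side, binomial family.
seg.pmi.toy <- glm(boundary ~ pmi, family = binomial, data = seg.toy)
coef(seg.pmi.toy)
```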

Exercise 11.11.  Using the seg data set, ﬁt a logistic regression model predicting whether there is a boundary or not for a given pmi value. Save the model as seg.pmi.

Do you observe overdispersion?

Plot the observed data, and the prediction curve on the same graph. $⊳$

A measure of ﬁt for a logistic regression model is how accurately it predicts the data. Note that since the predictions of the model are probabilities rather than direct success/failure indications, we need to set a threshold value. Using a threshold of $0.5$, i.e., assuming probabilities over $0.5$ indicate success, is common in practice. However, there may be cases where a diﬀerent threshold is more appropriate, and in some cases we are not interested in converting probabilities to decisions at all, as with the or data set above.
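The thresholding step can be sketched as follows (on the same kind of hypothetical miniature data frame as before, standing in for seg):

```r
# Hypothetical miniature stand-in for the `seg` data set.
seg.toy <- data.frame(boundary = c(0, 0, 1, 1, 0, 1),
                      pmi      = c(2.1, 3.0, -0.5, -1.2, 0.1, 1.8))
m.bin <- glm(boundary ~ pmi, family = binomial, data = seg.toy)

p.hat <- predict(m.bin, type = "response")  # probabilities, not decisions
pred  <- p.hat > 0.5                        # threshold at 0.5
acc   <- mean(pred == (seg.toy$boundary == 1))
acc
```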

Exercise 11.12.  Calculate the accuracy (ratio of successful predictions divided by total number of predictions) of boundary predictions by the model ﬁtted in Exercise 11.11.

TIP: You can compare seg\$boundary with predict(seg.pmi, type="response") > 0.5. Note that type="response" is important here. $⊳$

Exercise 11.13. We can view logistic regression as a classiﬁcation method. For a classiﬁer, we often want to know whether it achieves better accuracy than a trivial baseline. In most cases the trivial baseline is not choosing the response randomly, but guessing the majority class. In our segmentation problem, the majority class is non-boundary; that is, there are more word-internal phoneme pairs than word boundaries.

What is the majority-class baseline accuracy for the seg data set? $⊳$
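On a hypothetical miniature data set, the majority-class baseline could be computed like this (the numbers for the real seg data will of course differ):

```r
# Hypothetical miniature stand-in for the `seg` data set.
seg.toy <- data.frame(boundary = c(0, 0, 0, 1, 0, 1))

# Accuracy of always guessing the most frequent class.
baseline <- max(table(seg.toy$boundary)) / nrow(seg.toy)
baseline  # 4 of 6 observations are non-boundary, so 2/3
```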

#### 11.4 Further exercises in model selection

In the exercises above, we have worked with only one predictor for simplicity. Logistic regression, and GLMs in general, are an extension of the general linear models we studied earlier. As a result, we can use multiple numeric or categorical predictors with logistic regression as well.

Exercise 11.14. Fit a logistic regression model predicting boundaries from all variables in the seg data frame. Make sure that the variables utterance and phoneme are processed as categorical variables. Name the resulting model seg.full. Summarize the results. $⊳$

Exercise 11.15.  Compare the models seg.pmi and seg.full using anova(). Note that if you would like to perform hypothesis testing, you should specify test="Chisq", since the diﬀerence between the deviances of the two models follows an approximate ${\chi }^{2}$ distribution (as opposed to the F-distribution we use with lm() residuals). $⊳$
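The shape of such a comparison can be sketched on hypothetical miniature data (the real comparison uses the models from Exercises 11.11 and 11.14):

```r
# Hypothetical miniature stand-in for the `seg` data set.
seg.toy <- data.frame(
  boundary = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
  pmi = c(2.0, -1.0, 1.5, -0.5, 0.5, 1.0, -0.2, 0.8, 2.5, -1.5),
  h   = c(3.0, 4.0, 3.5, 4.2, 3.8, 3.9, 4.1, 3.6, 3.3, 4.4)
)
m.small <- glm(boundary ~ pmi,     family = binomial, data = seg.toy)
m.big   <- glm(boundary ~ pmi + h, family = binomial, data = seg.toy)

# The drop in deviance is tested against a chi-squared distribution.
a <- anova(m.small, m.big, test = "Chisq")
a
```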

Exercise 11.16.  You should see in Exercise 11.15 that the diﬀerence between the models seg.pmi and seg.full is statistically signiﬁcant, but you should also have realized that the smaller model uses far fewer degrees of freedom.

The AIC values (in the model summaries) also indicate that the big model is better. However, some of the variables in the big model may not be necessary.

Perform a stepwise model selection starting from the full model seg.full. Save the model identiﬁed by step() as seg.model. $⊳$

Exercise 11.17. Calculate the accuracy of the model seg.model from Exercise 11.16. Compare it with the accuracy of the model seg.pmi you calculated in Exercise 11.12.

Is the diﬀerence statistically signiﬁcant? TIP: You could either use Fisher’s exact test, or Chi-Square Test for Independence for this purpose. The names of the functions are fisher.test() and chisq.test() respectively. $⊳$

Exercise 11.18.  A model’s predictions are most useful outside the data used to ﬁt the model. Fit another logistic regression model that predicts the boundaries from pmi, h and rh using the ﬁrst half of the data (utterance numbers $0$ through $49$). Calculate the accuracy of the model on the ﬁrst and second half of the data, and compare the results. TIP: converting utterance back to numeric may be useful here. $⊳$

Exercise 11.19. Repeat Exercise 11.18 with only a single predictor, pmi. $⊳$

#### 11.5 More on logistic regression and GLMs

In this tutorial we only discuss the binary (or binomial) case. Logistic regression can also be used with a multinomial response, where you have more than two categories, e.g., noun, adjective or adverb. If the response is ordered, a multinomial logistic regression model can be ﬁt using polr() from the MASS package. Unordered multinomial models can be ﬁt using, for example, multinom() from the nnet package or the mnp package (multinomial probit).

The lrm() function from the rms package also provides functions for ﬁtting logistic regression models, with some additional options and output with additional statistics (e.g., pseudo ${r}^{2}$ values). The package also contains other useful utilities for ﬁtting and diagnosing GLMs.