Multiple regression

8 Multiple regression

In our many passes over the regression so far, we have only worked with a single explanatory variable. In this part extend it to multiple predictors. Although it does not change the model ﬁtting drastically, there are a number of issues that come with multiple predictors.

The data we will use in this part is another hypothetical study on child language acquisition. This time we want to investigate the eﬀects of amount of time spend in front of TV to two-year-old children’s language development. The data can be found as an R data ﬁle at http://coltekin.net/cagri/R/data/tv.rda. The response variable in this data set, cdi, is a standard measure of children’s language abilities based on parental reports. The predictor we are mainly interested in is tv.hours, which is the weekly hours of TV time for each child. We have some other predictors that we will explain in the relevant exercises below. As in most cases in this tutorial, the problem/design is simpliﬁed and the data is randomly generated.

: Exercise 8.1. Read the data from http://coltekin.net/cagri/R/data/tv.rda. If all goes ﬁne, you should have a new data frame called tv. $⊳$

8.1 Revisiting single regression (for the last time)

Multiple regression is not very diﬀerent than simple regression we studied piece-by-piece in multiple steps so far. In this subsection, we will ﬁt two simple regression models for later comparison with the multiple regression.

Exercise 8.2. Fit a simple regression model where tv.hours is the only predictor of the cdi score. Store the model object in variable m1.

Summarize the results, what does the model predict about the eﬀect of watching TV?
Produce the diagnostic plots. Make sure that all four diagnostic plots produced are shown on the same graph.
Plot a scatter plot of cdi against tv.hours, and the regression line over it.
Plot 95% ‘conﬁdence’ and ‘prediction’ bands around the regression line. Use distinct colors and line types.

⊳

: Exercise 8.3. Fit a simple regression model where mot.education is the only predictor of the cdi score. Store the model object in variable m2. Summarize the results, and compare it with m1.
You are encouraged to repeat Exercise 8.2 fully for m2. $⊳$

8.2 Multiple regression

To ﬁt a regression model with multiple predictors, we add each predictor to the right side of the tilde ‘~’ in our model speciﬁcation (formula). Note that for numeric predictors, the interaction terms are diﬃcult to interpret. As a result we almost exclusively use + between the predictors. $1$

Exercise 8.4. Fit a multiple regression model predicting cdi from tv.hours and mot.education using the tv data set.

Save the new model object as m3.

Summarize the new model, and compare it with the earlier models m1 (from Exercise 8.2) and m2 (from Exercise 8.2).

In particular, you should be checking the changes in the p-values, and the $r^{2}$ . Can you explain the diﬀerences that you observe? $⊳$

The scatter plots that were so useful in single regression becomes rather useless in the context of multiple regression. For two predictors, you can plot so-called 3D scatter plots. A 3D scatter plot is hardly digestible except for simple/demonstrative data sets. We will not experiment with them here, but note that a number of R packages, including scatterplot3d and rgl and Rcmdr have functions to plot 3D graphs.

Exercise 8.5. Plots two side-by-side scatter plots using the tv data set. One cdi against tv.hours and another cdi against mot.education.

On both plots, draw the single regression line, and the resulting line from the multiple regression m3 (the intersection of the regression plane and the plane deﬁned by the relevant axes).

TIP: You can use coef() to extract coeﬃcients from m3. For example, coef(m3)[1] will give you the estimated intercept. $⊳$

Exercise 8.6. A typical visualization trick for representing high dimensional data is to use colors (or shades) to represent one of the dimensions. In this (somewhat tricky) exercise, we want to see for which values of the predictors tv.hours and mot.education we have high cdi values.

Plot squares ﬁlled with shades of gray that represent the predicted cdi scores for each pair of tv.hours and mot.education values within the range of data.

The maximum CDI value should be plotted in black, and the minimum should be plotted in white. Put tv.hours to x axis, and mot.education to the y axis. For both, use the integer values within the range in the tv data set.

TIP1: you can initialize a new graph, but do not plot anything on it (except axes, labels etc.) using the option type=’n’.

TIP2: the function gray() returns color values that can be used with plotting functions. Alternatively, you can use rgb() produce a colorful graph instead. $⊳$

8.3 Multicollinearity

One of the problems that aﬀect the parameter estimates and interpretation of a multiple regression analysis is multicollinearity. If two predictors are collinear, part of the variance in the response variable is explained by both. Multicollinearity does not aﬀect the predictive power of the model. However, the coeﬃcient estimates will become less certain, and diﬃcult to understand. If a predictor which is collinear with the existing predictors is added to a regression model, the earlier estimates may change. A sign of multicollinearity is the high linear correlation between the predictors.

: Exercise 8.7. Check the correlation between the predictors of m3 from Exercise 8.4, using cor(). $⊳$

An extreme case of multicollinearity is when one of the predictors is a linear combination of one or more other predictors. If this is the case, the least-squares estimation will not be possible.

: Exercise 8.8. Fit another model, where as well as tv.hours and mot.education, their sum is also a predictor. Note that to make sure that + works as an arithmetic operator in a model formula, you need the function I(). In this case the new predictor should be speciﬁed as I(tv.hours + mot.education).
Summarize the result. $⊳$

A well known measure of collinearity is variance inﬂation factor (VIF). The VIF represents the increased uncertainty of a coeﬃcient (slope) estimate of a predictor due to the predictors in the model. The details of the calculation is, as usual, not we are interested here, but the following formula may give you some insights. Variance inﬂation factor for the $j^{th}$ predictor, $x_{j}$ , is

V I F_{j} = \frac{1}{1 - r_{j}^{2}}

where $r_{j}^{2}$ is the $r^{2}$ value obtained by ﬁtting a model predicting $x_{j}$ from the rest of the predictors. For example, if our aim is to check VIF of $x 1$ in the model speciﬁed by y ~ x1 + x2 + x3, we use the $r^{2}$ from the regression x1 ~ x2 + x3. Clearly, the higher the VIF, the higher the variability on estimates of coeﬃcients. Although there is no clear-cut rule, a common suggestion to consider values above $5$ to high VIF. You should have already realized that each predictor has an associated VIF.

You already know how to calculate VIF without any additional help. However, you can also use the function vif() from car package [Fox and Weisberg, 2011]. The [Fox and Weisberg, 2011] package contains quite a few useful/convenient functions for regression and ANOVA.

Exercise 8.9. Calculate VIF values for the predictors in the largest model we studied so far, m3.

Reminder: to be able use the functions from the library car, you need to ﬁrst run

 
> library(’car’)

If your installation does not have the ’car’ library, you can install it (with proper rights on a computer connected to Internet) using the command

 
> install.packages(’car’)

⊳

8.4 $r^{2}$ and adjusted $r^{2}$

We already know that $r^{2}$ is the square of the Pearson’s correlation coeﬃcient, and interpreted as the ‘the ratio or percentage of the variance in the outcome variable explained by the predictor’. With multiple regression, there is no straightforward correlation coeﬃcient anymore. However, the ‘Multiple R-squared’ reported in the regression summary still has the interpretation of ‘the ratio of the variance in the outcome variable explained by the predictors’.

Despite this nice interpretation, $r^{2}$ is inﬂated by adding more predictors, even if the new predictors have no explanatory power. The adjusted version, noted as ${\bar{r}}^{2}$ , corrects for this behavior. The following formulation of the ${\bar{r}}^{2}$ helps relating it to $r^{2}$ (it looks crowded but is is quite easy to digest if you make the attempt).

{\bar{r}}^{2} = 1 - [\frac{n - 1}{n - k - 1} \times (1 - r^{2})]

where $n$ is the number of observations, and $k$ is the number of predictors (not including the intercept term). You should realize that ${\bar{r}}^{2}$ is always smaller than $r^{2}$ , and as $k$ gets larger, the diﬀerence will be bigger. For large data sets, on the other hand, the diﬀerence will be smaller ( $k$ will have little eﬀect on the adjustment). One can also view ${\bar{r}}^{2}$ as the estimate of $r^{2}$ for the population (this will be more apparent if you rearrange the formula in terms of the ‘model sums of squares’ and the ‘total sums of squares’).

Exercise 8.10. The function runif() returns a random sample from a uniform distribution. For example, runif(80, 0, 1) returns a vector of

80

random numbers between

0

and

1

, which are distributed uniformly.

Create two additional columns in tv data set, and ﬁll it with random numbers from a uniform distribution (the range of the numbers does not matter).

Fit a new model where these two ‘random’ columns are added as predictors besides tv.hours and mot.education.

Compare the ‘R-squared’ and ‘Adjusted R-squared’ values with and without the new irrelevant (random) predictors. $⊳$

8.5 Model selection

In ﬁtting models with multiple predictors, we typically experiment with including or excluding predictors. Deciding for an adequate model depends on many factors, such as the predictors that are measured/collected and aim of the model. For example the choice of a model over another can be very diﬀerent if the aim of the modeling is to make predictions as opposed to understand the coeﬃcients (eﬀects). The ground rule is, all else being equal, we prefer simpler models. As well as the Occam’s razor, for the purpose of understandable coeﬃcient estimates, simplicity is also motivated by the fact that with many predictors, the parameter estimates become too variable to be interpretable. For the purposes of prediction, large number of predictors will typically be costly to be measured, and they will cause overﬁtting: the model will perform well for the know data, but will perform poorly on unseen data. Here, we will not make any strong suggestions about model selection, but try to introduce some common tools that help comparing models.

The function update() in R allows you to modify an existing model gradually. Once we have a base model, instead of specifying a new model from scratch, we can deﬁne a model that adds or removes variables to or from the base model. For example,

 
> update(m, . ~ . + x1)

adds a new predictor x1 to a model m that is ﬁtted earlier, for example, with lm(). Similarly,

 
> update(m, . ~ . - x2)

returns a new model where the predictor x2 is removed from the original model m. The dot ‘.’ in these formulas represents the set of values from the original model (m). In both cases, the result is a new model object.

Exercise 8.11. The tv data set includes another predictor daycare.hours which speciﬁes the number of hours per week the child spends in a daycare.

Using update() create a new model that adds this predictor to the model m3 from Exercise 8.4. Name the new variable m4.

Does the new predictor aﬀect the model ﬁt? Is the eﬀect statistically signiﬁcant? $⊳$

Any additional predictor, even a completely random one, will result in a better ﬁt: the residual variance will be lower (and $r^{2}$ will be higher). The question we often ask is whether the reduced residual variance can be a chance eﬀect or not. This leads us to compare residual variances of two models. You should already be familiar with the concept of comparing two variances: this is what we do in ANOVA. If you run anova() on two model objects, it will perform an F-test, and tell you whether the residual diﬀerences are statistically signiﬁcant.

If a model (with all the predictors) is doing something useful, it should be doing better than predicting the mean of the response variable for any predictor value. Such a test is automatically done when we summarize linear regression results. The F-test result on the last line of a linear regression summary is from such a test.

Exercise 8.12. The formula ‘response ~ 1’ speciﬁes a model with no predictor, or more correctly a model with a constant prediction.

Using the tv data set, ﬁt such a model predicting cdi. Assign the resulting model to variable m0.

Check whether the estimated intercept is the same as mean of the response variable.

What does the standard error of the estimated intercept correspond to? $⊳$

Exercise 8.13. Perform an ANOVA comparing m0 (Exercise 8.12) and m1 (Exercise 8.2).

Do the improvement (reduction) in residual variation in m1 statistically signiﬁcant (at level $0.05$ )?

Compare your result with the F-test reported in the summary of model m1. $⊳$

Exercise 8.14. The ANOVA we have performed in Exercise 8.13 serves only as a demonstration. The same test is already included in the model summary. It is more interesting to compare two diﬀerent models with predictors.

Compare m1 vs. m3 and m3 vs. m4 using anova().

Note: conventionally, the smaller model is listed ﬁrst.

What conclusions do you get out of these comparisons? $⊳$

If you run anova() on a single model object, it will perform F-tests incrementally for each predictor. The order of tests performed depend on the order the predictors in the regression formula. anova() ﬁrst tests the model with the ﬁrst predictor against the model of the mean, then the model with ﬁrst two predictors against the model only with the ﬁrst predictor, and so on. As a result, changing the order of predictors will result in diﬀerent results.

: Exercise 8.15. Repeat the tests performed in Exercise 8.14 by a single call to anova(). $⊳$

Akaike information criterion (AIC) is a measure used for model selection. The AIC is

AIC = 2 k - 2 log (L)

Where $k$ is the number of predictors, and the $L$ is the likelihood (of the data given the estimated model), which is another measure of model’s ﬁt to the data (we will return to discussion of likelihood).

Given two models, the model with lower AIC value is preferred. Notice that the AIC rewards good ﬁt (higher $L$ ), $2$ and penalizes the modes with large number of predictors (lower $k$ ).

The function AIC() calculates the AIC value for a given model.

: Exercise 8.16. Compare the models you compared with anova() in Exercise 8.14 with AIC(). Which model is the best according to AIC? $⊳$

When there are many predictors, one may want to employ an automatic method for model selection. Note that the number of models to compare when you have $k$ predictors is $2^{k}$ . For example, for $k = 10$ this amounts to $1024$ comparisons, and for $k = 20$ the number of comparisons required is over a million. With increasing $k$ , comparing all possible models becomes intractable.

Even if comparing all possible combinations of coeﬃcients is possible, you should not expect any mechanical procedure to ﬁnd you the best model. Normally your model selection should be guided by the knowledge relevant to the problem at hand. However, a few search procedures that look for an optimal model are commonly used in the literature. The function step() provides the well-known backward (starting with the full model, and eliminating variables that are deemed not useful), forward (starting with no or a few predictors, and adding ones that improve the model most) procedures, or a combination of both. If you want larger models to be explored, you will need to use the scope option to specify which predictors step should consider adding.

: Exercise 8.17. Apply step() to the full-model (m4 from Exercise 8.11) on tv data set. $⊳$

[next] [prev] [prev-tail] [front] [up]