8 Multiple regression

In our many passes over regression so far, we have only worked with a single explanatory variable. In this part we extend the model to multiple predictors. Although this does not change the model fitting drastically, a number of issues come with multiple predictors.

The data we will use in this part comes from another hypothetical study on child language acquisition. This time we want to investigate the effect of the amount of time spent in front of the TV on two-year-old children’s language development. The data can be found as an R data file at http://coltekin.net/cagri/R/data/tv.rda. The response variable in this data set, cdi, is a standard measure of children’s language abilities based on parental reports. The predictor we are mainly interested in is tv.hours, the weekly hours of TV time for each child. There are some other predictors that we will explain in the relevant exercises below. As in most cases in this tutorial, the problem/design is simplified and the data is randomly generated.

Exercise 8.1. Read the data from http://coltekin.net/cagri/R/data/tv.rda. If all goes well, you should have a new data frame called tv.
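
One way to do this (a sketch, assuming the URL above is reachable from your R session) is to load the file directly from the web:

> load(url("http://coltekin.net/cagri/R/data/tv.rda"))
> str(tv)   # a quick check of the variables in the new data frame

Alternatively, you can download the file first and load() the local copy.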

8.1 Revisiting single regression (for the last time)

Multiple regression is not very different from the simple regression we have studied piece by piece in multiple steps so far. In this subsection, we will fit two simple regression models for later comparison with the multiple regression model.

Exercise 8.2.  Fit a simple regression model where tv.hours is the only predictor of the cdi score. Store the model object in variable m1.
  1. Summarize the results. What does the model predict about the effect of watching TV?
  2. Produce the diagnostic plots. Make sure that all four diagnostic plots produced are shown on the same graph.
  3. Plot a scatter plot of cdi against tv.hours, and the regression line over it.
  4. Plot 95% ‘confidence’ and ‘prediction’ bands around the regression line. Use distinct colors and line types.
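
One possible way to work through this exercise is sketched below (the object name new.x and the particular colors and line types are arbitrary choices made here, not part of the exercise):

> m1 <- lm(cdi ~ tv.hours, data = tv)
> summary(m1)
> par(mfrow = c(2, 2))   # show all four diagnostic plots on the same graph
> plot(m1)
> par(mfrow = c(1, 1))
> plot(cdi ~ tv.hours, data = tv)
> abline(m1)
> # evaluate the confidence and prediction bands on a grid of tv.hours values
> new.x <- data.frame(tv.hours = seq(min(tv$tv.hours), max(tv$tv.hours), length.out = 100))
> conf <- predict(m1, new.x, interval = "confidence")   # 95% is the default level
> pred <- predict(m1, new.x, interval = "prediction")
> lines(new.x$tv.hours, conf[, "lwr"], col = "blue", lty = 2)
> lines(new.x$tv.hours, conf[, "upr"], col = "blue", lty = 2)
> lines(new.x$tv.hours, pred[, "lwr"], col = "red", lty = 3)
> lines(new.x$tv.hours, pred[, "upr"], col = "red", lty = 3)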

Exercise 8.3.  Fit a simple regression model where mot.education is the only predictor of the cdi score. Store the model object in variable m2. Summarize the results, and compare it with m1.

You are encouraged to repeat Exercise 8.2 fully for m2.
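
The model fit itself is a minimal sketch:

> m2 <- lm(cdi ~ mot.education, data = tv)
> summary(m2)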

8.2 Multiple regression

To fit a regression model with multiple predictors, we add each predictor to the right side of the tilde ‘~’ in our model specification (formula). Note that for numeric predictors, interaction terms are difficult to interpret. As a result, we almost exclusively use + between the predictors.

Exercise 8.4.  Fit a multiple regression model predicting cdi from tv.hours and mot.education using the tv data set.

Save the new model object as m3.

Summarize the new model, and compare it with the earlier models m1 (from Exercise 8.2) and m2 (from Exercise 8.3).

In particular, you should check the changes in the p-values and in the r2 values. Can you explain the differences that you observe?
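
A minimal sketch of the model fit:

> m3 <- lm(cdi ~ tv.hours + mot.education, data = tv)
> summary(m3)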

The scatter plots that were so useful in simple regression become rather useless in the context of multiple regression. For two predictors, you can plot so-called 3D scatter plots, but a 3D scatter plot is hardly digestible except for simple/demonstrative data sets. We will not experiment with them here, but note that a number of R packages, including scatterplot3d, rgl, and Rcmdr, have functions to plot 3D graphs.

Exercise 8.5. Plot two side-by-side scatter plots using the tv data set: one of cdi against tv.hours, and another of cdi against mot.education.

On both plots, draw the single regression line, and the resulting line from the multiple regression m3 (the intersection of the regression plane and the plane defined by the relevant axes).

TIP: You can use coef() to extract coefficients from m3. For example, coef(m3)[1] will give you the estimated intercept.
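
A possible approach is sketched below (the line types are arbitrary; setting the other predictor to zero gives the intersection line mentioned above):

> par(mfrow = c(1, 2))
> plot(cdi ~ tv.hours, data = tv)
> abline(m1)                                            # single regression line
> abline(coef(m3)[1], coef(m3)["tv.hours"], lty = 2)    # line from m3 (mot.education = 0)
> plot(cdi ~ mot.education, data = tv)
> abline(m2)
> abline(coef(m3)[1], coef(m3)["mot.education"], lty = 2)
> par(mfrow = c(1, 1))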

Exercise 8.6. A typical visualization trick for representing high dimensional data is to use colors (or shades) to represent one of the dimensions. In this (somewhat tricky) exercise, we want to see for which values of the predictors tv.hours and mot.education we have high cdi values.

Plot squares filled with shades of gray that represent the predicted cdi scores for each pair of tv.hours and mot.education values within the range of data.

The maximum predicted cdi value should be plotted in black, and the minimum in white. Put tv.hours on the x axis, and mot.education on the y axis. For both, use the integer values within the range of the tv data set.

TIP1: you can initialize a new graph without plotting anything on it (except the axes, labels, etc.) using the option type="n".

TIP2: the function gray() returns color values that can be used with plotting functions. Alternatively, you can use rgb() to produce a colorful graph instead.
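
One possible solution is sketched below (the object names and the use of rect() to draw the squares are choices made here; gray(0) is black and gray(1) is white, so the predictions are rescaled accordingly):

> x <- ceiling(min(tv$tv.hours)):floor(max(tv$tv.hours))
> y <- ceiling(min(tv$mot.education)):floor(max(tv$mot.education))
> grid <- expand.grid(tv.hours = x, mot.education = y)
> grid$cdi.hat <- predict(m3, grid)
> # rescale predictions to [0, 1]: highest predicted cdi -> black, lowest -> white
> shade <- 1 - (grid$cdi.hat - min(grid$cdi.hat)) / (max(grid$cdi.hat) - min(grid$cdi.hat))
> plot(grid$tv.hours, grid$mot.education, type = "n",
+      xlab = "tv.hours", ylab = "mot.education")
> rect(grid$tv.hours - 0.5, grid$mot.education - 0.5,
+      grid$tv.hours + 0.5, grid$mot.education + 0.5,
+      col = gray(shade), border = NA)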

8.3 Multicollinearity

One of the problems that affect the parameter estimates and the interpretation of a multiple regression analysis is multicollinearity. If two predictors are collinear, part of the variance in the response variable is explained by both. Multicollinearity does not affect the predictive power of the model. However, the coefficient estimates become less certain and more difficult to interpret. If a predictor which is collinear with the existing predictors is added to a regression model, the earlier estimates may change. A sign of multicollinearity is high linear correlation between the predictors.

Exercise 8.7.  Check the correlation between the predictors of m3 from Exercise 8.4, using cor().
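
A one-line sketch:

> cor(tv$tv.hours, tv$mot.education)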

An extreme case of multicollinearity is when one of the predictors is a linear combination of one or more other predictors. If this is the case, the least-squares estimation will not be possible.

Exercise 8.8. Fit another model where, besides tv.hours and mot.education, their sum is also a predictor. Note that to make sure that + works as an arithmetic operator in a model formula, you need the function I(). In this case the new predictor should be specified as I(tv.hours + mot.education).

Summarize the result.
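
A sketch (the model name is an arbitrary choice; note how R reports the redundant coefficient):

> m.sum <- lm(cdi ~ tv.hours + mot.education + I(tv.hours + mot.education), data = tv)
> summary(m.sum)   # the coefficient of the redundant predictor is not defined (NA)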

A well-known measure of collinearity is the variance inflation factor (VIF). The VIF represents the increase in the uncertainty of a coefficient (slope) estimate of a predictor due to the other predictors in the model. The details of the calculation are, as usual, not what we are interested in here, but the following formula may give you some insight. The variance inflation factor for the jth predictor, $x_j$, is

\[ \mathrm{VIF}_j = \frac{1}{1 - r_j^2} \]

where $r_j^2$ is the r2 value obtained by fitting a model predicting $x_j$ from the rest of the predictors. For example, if our aim is to check the VIF of x1 in the model specified by y ~ x1 + x2 + x3, we use the r2 from the regression x1 ~ x2 + x3. Clearly, the higher the VIF, the higher the variability of the coefficient estimate. Although there is no clear-cut rule, a common suggestion is to consider values above 5 as high. You should have already realized that each predictor has its own associated VIF.

You already know how to calculate VIF without any additional help. However, you can also use the function vif() from the car package [Fox and Weisberg, 2011]. The car package contains quite a few useful/convenient functions for regression and ANOVA.

Exercise 8.9. Calculate VIF values for the predictors in the largest model we studied so far, m3.

Reminder: to be able to use the functions from the car package, you first need to run

 
> library(car)  

If your installation does not have the car package, you can install it (with proper rights, on a computer connected to the Internet) using the command

 
> install.packages("car")  
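
Once the package is available, a sketch of Exercise 8.9 (the first line computes the VIF of tv.hours ‘by hand’ using the formula above; with only two predictors, both VIF values are the same):

> 1 / (1 - summary(lm(tv.hours ~ mot.education, data = tv))$r.squared)
> library(car)
> vif(m3)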

8.4 r2 and adjusted r2

We already know that r2 is the square of Pearson’s correlation coefficient, and is interpreted as ‘the ratio (or percentage) of the variance in the outcome variable explained by the predictor’. With multiple regression, there is no straightforward correlation coefficient anymore. However, the ‘Multiple R-squared’ reported in the regression summary still has the interpretation of ‘the ratio of the variance in the outcome variable explained by the predictors’.

Despite this nice interpretation, r2 is inflated by adding more predictors, even if the new predictors have no explanatory power. The adjusted version, denoted r̄2, corrects for this behavior. The following formulation of r̄2 helps relate it to r2 (it looks crowded, but it is quite easy to digest if you make the attempt).

\[ \bar{r}^2 = 1 - \frac{n - 1}{n - k - 1} (1 - r^2) \]

where n is the number of observations, and k is the number of predictors (not including the intercept term). You should realize that r̄2 is always smaller than r2, and as k gets larger, the difference will be bigger. For large data sets, on the other hand, the difference will be smaller (k will have little effect on the adjustment). One can also view r̄2 as the estimate of r2 for the population (this will be more apparent if you rearrange the formula in terms of the ‘model sums of squares’ and the ‘total sums of squares’).

Exercise 8.10. The function runif() returns a random sample from a uniform distribution. For example, runif(80, 0, 1) returns a vector of 80 random numbers between 0 and 1, which are distributed uniformly.

Create two additional columns in the tv data set, and fill them with random numbers from a uniform distribution (the range of the numbers does not matter).

Fit a new model where these two ‘random’ columns are added as predictors besides tv.hours and mot.education.

Compare the ‘R-squared’ and ‘Adjusted R-squared’ values with and without the new irrelevant (random) predictors.
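
A possible sketch (the column and model names are arbitrary choices made here):

> tv$rand1 <- runif(nrow(tv))
> tv$rand2 <- runif(nrow(tv))
> m.rand <- lm(cdi ~ tv.hours + mot.education + rand1 + rand2, data = tv)
> summary(m.rand)   # compare the R-squared values with those in summary(m3)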

8.5 Model selection

In fitting models with multiple predictors, we typically experiment with including or excluding predictors. Deciding on an adequate model depends on many factors, such as the predictors that were measured/collected and the aim of the model. For example, the choice of one model over another can be very different if the aim of the modeling is to make predictions as opposed to understanding the coefficients (effects). The ground rule is that, all else being equal, we prefer simpler models. Besides Occam’s razor, simplicity is also motivated, for the purpose of understandable coefficient estimates, by the fact that with many predictors the parameter estimates become too variable to be interpretable. For the purposes of prediction, a large number of predictors will typically be costly to measure, and they will cause overfitting: the model will perform well on the known data, but poorly on unseen data. Here, we will not make any strong suggestions about model selection, but try to introduce some common tools that help compare models.

The function update() in R allows you to modify an existing model gradually. Once we have a base model, instead of specifying a new model from scratch, we can define a model that adds or removes variables to or from the base model. For example,

 
> update(m, . ~ . + x1)  

adds a new predictor x1 to a model m that is fitted earlier, for example, with lm(). Similarly,

 
> update(m, . ~ . - x2)  

returns a new model where the predictor x2 is removed from the original model m. The dot ‘.’ in these formulas stands for the corresponding part of the original model’s formula: the response on the left of the ‘~’, and the predictors on the right. In both cases, the result is a new model object.

Exercise 8.11.  The tv data set includes another predictor daycare.hours which specifies the number of hours per week the child spends in a daycare.

Using update(), create a new model that adds this predictor to the model m3 from Exercise 8.4. Name the new model object m4.

Does the new predictor affect the model fit? Is the effect statistically significant?
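
A minimal sketch:

> m4 <- update(m3, . ~ . + daycare.hours)
> summary(m4)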

Any additional predictor, even a completely random one, will result in a better fit: the residual variance will be lower (and r2 will be higher). The question we often ask is whether the reduction in residual variance can be a chance effect or not. This leads us to compare the residual variances of two models. You should already be familiar with the concept of comparing two variances: this is what we do in ANOVA. If you run anova() on two model objects, it will perform an F-test, and tell you whether the difference in the residual variances is statistically significant.

If a model (with all the predictors) is doing something useful, it should be doing better than predicting the mean of the response variable for any predictor value. Such a test is automatically done when we summarize linear regression results. The F-test result on the last line of a linear regression summary is from such a test.

Exercise 8.12.  The formula ‘response ~ 1’ specifies a model with no predictor, or more correctly a model with a constant prediction.

Using the tv data set, fit such a model predicting cdi. Assign the resulting model to variable m0.

Check whether the estimated intercept is the same as mean of the response variable.

What does the standard error of the estimated intercept correspond to?
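
A sketch (the last two lines relate the estimates to the sample mean and its standard error):

> m0 <- lm(cdi ~ 1, data = tv)
> summary(m0)
> mean(tv$cdi)                  # should match the estimated intercept
> sd(tv$cdi) / sqrt(nrow(tv))   # the standard error of the mean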

Exercise 8.13.  Perform an ANOVA comparing m0 (Exercise 8.12) and m1 (Exercise 8.2).

Is the improvement (reduction) in residual variation in m1 statistically significant (at level 0.05)?

Compare your result with the F-test reported in the summary of model m1.
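
A one-line sketch; compare the F value and p-value with the last line of summary(m1):

> anova(m0, m1)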

Exercise 8.14.  The ANOVA we have performed in Exercise 8.13 serves only as a demonstration. The same test is already included in the model summary. It is more interesting to compare two different models with predictors.

Compare m1 vs. m3 and m3 vs. m4 using anova().

Note: conventionally, the smaller model is listed first.

What conclusions do you get out of these comparisons?
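
A sketch:

> anova(m1, m3)
> anova(m3, m4)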

If you run anova() on a single model object, it will perform F-tests incrementally for each predictor. The order of the tests depends on the order of the predictors in the regression formula: anova() first tests the model with the first predictor against the model of the mean, then the model with the first two predictors against the model with only the first predictor, and so on. As a result, changing the order of the predictors may produce different results.

Exercise 8.15. Repeat the tests performed in Exercise 8.14 by a single call to anova().
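
Depending on how you read the exercise, either of the following single calls will do (a sketch):

> anova(m4)           # sequential tests for each predictor of the full model
> anova(m1, m3, m4)   # or: compare the three models in one call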

The Akaike information criterion (AIC) is a measure used for model selection. It is defined as

\[ \mathrm{AIC} = 2k - 2\log(L) \]

where k is the number of predictors, and L is the likelihood (of the data given the estimated model), which is another measure of the model’s fit to the data (we will return to the discussion of likelihood later).

Given two models, the model with the lower AIC value is preferred. Notice that the AIC rewards good fit (a higher L lowers the AIC), and penalizes models with a large number of predictors (a higher k increases it).

The function AIC() calculates the AIC value for a given model.

Exercise 8.16. Compare the models you compared with anova() in Exercise 8.14 with AIC(). Which model is the best according to AIC?
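
A sketch (AIC() accepts several fitted models at once and returns a small table):

> AIC(m1, m3, m4)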

When there are many predictors, one may want to employ an automatic method for model selection. Note that the number of models to compare when you have k predictors is 2^k. For example, for k = 10 this amounts to 1024 comparisons, and for k = 20 the number of comparisons required is over a million. With increasing k, comparing all possible models becomes intractable.

Even when comparing all possible combinations of predictors is feasible, you should not expect any mechanical procedure to find the best model for you. Normally, your model selection should be guided by the knowledge relevant to the problem at hand. However, a few search procedures that look for an optimal model are commonly used in the literature. The function step() provides the well-known backward procedure (starting with the full model, and eliminating variables that are deemed not useful), the forward procedure (starting with no or only a few predictors, and adding the ones that improve the model most), or a combination of both. If you want larger models to be explored, you will need to use the scope option to specify which predictors step() should consider adding.

Exercise 8.17. Apply step() to the full model (m4 from Exercise 8.11) fit on the tv data set.
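
A minimal sketch (with no scope argument, step() uses AIC and only considers dropping, and possibly re-adding, the predictors already in m4):

> step(m4)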