In our many passes over regression so far, we have only worked with a single explanatory variable. In this part we extend the models to multiple predictors. Although this does not change the model fitting drastically, there are a number of issues that come with multiple predictors.
The data we will use in this part come from another hypothetical study on child language acquisition. This time we want to investigate the effect of the amount of time spent in front of the TV on two-year-old children's language development. The data can be found as an R data file at http://coltekin.net/cagri/R/data/tv.rda. The response variable in this data set, cdi, is a standard measure of children's language abilities based on parental reports. The predictor we are mainly interested in is tv.hours, the weekly hours of TV time for each child. We have some other predictors that we will explain in the relevant exercises below. As in most cases in this tutorial, the problem/design is simplified and the data is randomly generated.
Multiple regression is not very different from the simple regression we have studied piece by piece in multiple steps so far. In this subsection, we will fit two simple regression models for later comparison with the multiple regression.
You are encouraged to repeat Exercise 8.2 fully for m2. ⊳
To fit a regression model with multiple predictors, we add each predictor to the right side of the tilde '~' in our model specification (formula). Note that for numeric predictors, the interaction terms are difficult to interpret. As a result we almost exclusively use '+' between the predictors.
Save the new model object as m3.
Summarize the new model, and compare it with the earlier models m1 (from Exercise 8.2) and m2 (from Exercise 8.2).
In particular, you should check the changes in the p-values and the ${r}^{2}$. Can you explain the differences that you observe? ⊳
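A minimal sketch of this exercise, with randomly generated data standing in for tv.rda (the column names cdi, tv.hours and mot.education are taken from the text; the numbers below are made up, so the actual estimates will differ):

```r
# Simulated stand-in for the tv data set (not the real data).
set.seed(1)
n <- 100
tv <- data.frame(tv.hours = runif(n, 0, 30), mot.education = runif(n, 8, 20))
tv$cdi <- 100 - 0.8 * tv$tv.hours + 1.5 * tv$mot.education + rnorm(n, sd = 5)

m1 <- lm(cdi ~ tv.hours, data = tv)                  # simple regression
m2 <- lm(cdi ~ mot.education, data = tv)             # simple regression
m3 <- lm(cdi ~ tv.hours + mot.education, data = tv)  # multiple regression
summary(m3)  # compare with summary(m1) and summary(m2)
```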
The scatter plots that were so useful in simple regression become rather useless in the context of multiple regression. For two predictors, you can draw so-called 3D scatter plots, but a 3D scatter plot is hardly digestible except for simple/demonstrative data sets. We will not experiment with them here, but note that a number of R packages, including scatterplot3d, rgl and Rcmdr, have functions to plot 3D graphs.
On both plots, draw the simple regression line, and the line resulting from the multiple regression m3 (the intersection of the regression plane and the plane defined by the relevant axes).
TIP: You can use coef() to extract coefficients from m3. For example, coef(m3)[1] will give you the estimated intercept. ⊳
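A sketch of one of the two plots (simulated data again; the abline() call for m3 assumes the line is drawn in the plane mot.education = 0, as described in the exercise):

```r
set.seed(1)
n <- 100
tv <- data.frame(tv.hours = runif(n, 0, 30), mot.education = runif(n, 8, 20))
tv$cdi <- 100 - 0.8 * tv$tv.hours + 1.5 * tv$mot.education + rnorm(n, sd = 5)
m1 <- lm(cdi ~ tv.hours, data = tv)
m3 <- lm(cdi ~ tv.hours + mot.education, data = tv)

plot(cdi ~ tv.hours, data = tv)
abline(m1)                                         # simple-regression line
abline(a = coef(m3)[1], b = coef(m3)["tv.hours"],  # m3 line in the plane
       lty = 2)                                    #   mot.education = 0
```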
Plot squares ﬁlled with shades of gray that represent the predicted cdi scores for each pair of tv.hours and mot.education values within the range of data.
The maximum CDI value should be plotted in black, and the minimum should be plotted in white. Put tv.hours on the x axis, and mot.education on the y axis. For both, use the integer values within the range in the tv data set.
TIP1: you can initialize a new graph without plotting anything on it (except axes, labels, etc.) using the option type='n'.
TIP2: the function gray() returns color values that can be used with plotting functions. Alternatively, you can use rgb() to produce a colorful graph instead. ⊳
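One possible sketch of this plot (simulated data again): predictions are computed with predict() on the integer grid, then mapped to shades with gray() so that the largest prediction is black and the smallest is white.

```r
set.seed(1)
n <- 100
tv <- data.frame(tv.hours = runif(n, 0, 30), mot.education = runif(n, 8, 20))
tv$cdi <- 100 - 0.8 * tv$tv.hours + 1.5 * tv$mot.education + rnorm(n, sd = 5)
m3 <- lm(cdi ~ tv.hours + mot.education, data = tv)

# integer values within the range of each predictor
xs <- seq(ceiling(min(tv$tv.hours)), floor(max(tv$tv.hours)))
ys <- seq(ceiling(min(tv$mot.education)), floor(max(tv$mot.education)))
grid <- expand.grid(tv.hours = xs, mot.education = ys)
p <- predict(m3, grid)
shade <- gray((max(p) - p) / (max(p) - min(p)))  # max -> black, min -> white

plot(range(xs), range(ys), type = "n",
     xlab = "tv.hours", ylab = "mot.education")
rect(grid$tv.hours - 0.5, grid$mot.education - 0.5,
     grid$tv.hours + 0.5, grid$mot.education + 0.5,
     col = shade, border = NA)
```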
One of the problems that affect the parameter estimates and the interpretation of a multiple regression analysis is multicollinearity. If two predictors are collinear, part of the variance in the response variable is explained by both. Multicollinearity does not affect the predictive power of the model. However, the coefficient estimates become less certain, and more difficult to interpret. If a predictor which is collinear with the existing predictors is added to a regression model, the earlier estimates may change. A sign of multicollinearity is a high linear correlation between the predictors.
An extreme case of multicollinearity is when one of the predictors is a linear combination of one or more other predictors. If this is the case, the least-squares estimation will not be possible.
Summarize the result. ⊳
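The extreme case is easy to demonstrate with made-up data: when one predictor is an exact linear combination of the others, lm() cannot estimate its coefficient, and reports NA for it.

```r
set.seed(2)
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + 2 * x2             # exact linear combination of x1 and x2
y  <- x1 + x2 + rnorm(50)
m  <- lm(y ~ x1 + x2 + x3)
coef(m)                       # the coefficient of x3 is NA
```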
A well known measure of collinearity is the variance inflation factor (VIF). The VIF represents the increased uncertainty of a coefficient (slope) estimate of a predictor due to the other predictors in the model. The details of the calculation are, as usual, not what we are interested in here, but the following formula may give you some insight. The variance inflation factor for the ${j}^{\mathrm{th}}$ predictor, ${x}_{j}$, is

$\mathrm{VIF}_{j} = \frac{1}{1 - {r}_{j}^{2}}$
where ${r}_{j}^{2}$ is the ${r}^{2}$ value obtained by fitting a model predicting ${x}_{j}$ from the rest of the predictors. For example, if our aim is to check the VIF of x1 in the model specified by y ~ x1 + x2 + x3, we use the ${r}^{2}$ from the regression x1 ~ x2 + x3. Clearly, the higher the VIF, the higher the variability of the coefficient estimates. Although there is no clear-cut rule, a common suggestion is to consider values above $5$ to be high VIF. You should have already realized that each predictor has an associated VIF.
You already know how to calculate the VIF without any additional help. However, you can also use the function vif() from the car package [Fox and Weisberg, 2011]. The car package contains quite a few useful/convenient functions for regression and ANOVA.
Reminder: to be able to use the functions from the car library, you need to first run
> library('car')
If your installation does not have the 'car' library, you can install it (with proper rights on a computer connected to the Internet) using the command
> install.packages('car')
⊳
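Following the formula above, the VIF can also be computed by hand, without the car package. A sketch with made-up predictors, where x1 is constructed to be collinear with x2:

```r
set.seed(3)
x2 <- rnorm(100)
x3 <- rnorm(100)
x1 <- x2 + rnorm(100, sd = 0.5)           # x1 is collinear with x2
r2 <- summary(lm(x1 ~ x2 + x3))$r.squared # r2 of x1 ~ x2 + x3
vif.x1 <- 1 / (1 - r2)                    # VIF of x1, as vif() would report
vif.x1
```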
We already know that ${r}^{2}$ is the square of Pearson's correlation coefficient, and is interpreted as 'the ratio or percentage of the variance in the outcome variable explained by the predictor'. With multiple regression, there is no straightforward correlation coefficient anymore. However, the 'Multiple R-squared' reported in the regression summary still has the interpretation of 'the ratio of the variance in the outcome variable explained by the predictors'.
Despite this nice interpretation, ${r}^{2}$ is inflated by adding more predictors, even if the new predictors have no explanatory power. The adjusted version, denoted ${\bar{r}}^{2}$, corrects for this behavior. The following formulation of ${\bar{r}}^{2}$ helps relating it to ${r}^{2}$ (it looks crowded but is quite easy to digest if you make the attempt):

${\bar{r}}^{2} = 1 - (1 - {r}^{2})\frac{n - 1}{n - k - 1}$
where $n$ is the number of observations, and $k$ is the number of predictors (not including the intercept term). You should realize that ${\bar{r}}^{2}$ is always smaller than ${r}^{2}$, and as $k$ gets larger, the difference will be bigger. For large data sets, on the other hand, the difference will be smaller ($k$ will have little effect on the adjustment). One can also view ${\bar{r}}^{2}$ as the estimate of ${r}^{2}$ for the population (this will be more apparent if you rearrange the formula in terms of the 'model sums of squares' and the 'total sums of squares').
Create two additional columns in the tv data set, and fill them with random numbers from a uniform distribution (the range of the numbers does not matter).
Fit a new model where these two ‘random’ columns are added as predictors besides tv.hours and mot.education.
Compare the 'R-squared' and 'Adjusted R-squared' values with and without the new irrelevant (random) predictors. ⊳
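The inflation of ${r}^{2}$, and the correction applied by the adjusted version, can be seen in a quick simulation (random stand-in data throughout; column names as in the text):

```r
set.seed(4)
n <- 100
tv <- data.frame(tv.hours = runif(n, 0, 30), mot.education = runif(n, 8, 20))
tv$cdi <- 100 - 0.8 * tv$tv.hours + 1.5 * tv$mot.education + rnorm(n, sd = 5)
tv$rand1 <- runif(n)    # two irrelevant (random) predictors
tv$rand2 <- runif(n)

m3 <- lm(cdi ~ tv.hours + mot.education, data = tv)
m.rand <- lm(cdi ~ tv.hours + mot.education + rand1 + rand2, data = tv)
c(summary(m3)$r.squared, summary(m.rand)$r.squared)          # never decreases
c(summary(m3)$adj.r.squared, summary(m.rand)$adj.r.squared)  # may decrease
```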
In fitting models with multiple predictors, we typically experiment with including or excluding predictors. Deciding on an adequate model depends on many factors, such as which predictors are measured/collected and the aim of the model. For example, the choice of one model over another can be very different if the aim of the modeling is to make predictions as opposed to understanding the coefficients (effects). The ground rule is: all else being equal, we prefer simpler models. Besides Occam's razor, for the purpose of understandable coefficient estimates, simplicity is also motivated by the fact that with many predictors, the parameter estimates become too variable to be interpretable. For the purposes of prediction, a large number of predictors will typically be costly to measure, and they will cause overfitting: the model will perform well on the known data, but will perform poorly on unseen data. Here, we will not make any strong suggestions about model selection, but try to introduce some common tools that help in comparing models.
The function update() in R allows you to modify an existing model gradually. Once we have a base model, instead of specifying a new model from scratch, we can deﬁne a model that adds or removes variables to or from the base model. For example,
> update(m, . ~ . + x1)
adds a new predictor x1 to a model m that is ﬁtted earlier, for example, with lm(). Similarly,
> update(m, . ~ . - x2)
returns a new model where the predictor x2 is removed from the original model m. The dot '.' in these formulas represents the corresponding side of the original model formula (m). In both cases, the result is a new model object.
Using update(), create a new model that adds this predictor to the model m3 from Exercise 8.4. Name the new model m4.
Does the new predictor affect the model fit? Is the effect statistically significant? ⊳
Any additional predictor, even a completely random one, will result in a better ﬁt: the residual variance will be lower (and ${r}^{2}$ will be higher). The question we often ask is whether the reduced residual variance can be a chance eﬀect or not. This leads us to compare residual variances of two models. You should already be familiar with the concept of comparing two variances: this is what we do in ANOVA. If you run anova() on two model objects, it will perform an F-test, and tell you whether the residual diﬀerences are statistically signiﬁcant.
If a model (with all the predictors) is doing something useful, it should be doing better than predicting the mean of the response variable for any predictor value. Such a test is automatically done when we summarize linear regression results. The F-test result on the last line of a linear regression summary is from such a test.
Using the tv data set, ﬁt such a model predicting cdi. Assign the resulting model to variable m0.
Check whether the estimated intercept is the same as mean of the response variable.
What does the standard error of the estimated intercept correspond to? ⊳
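A sketch of the intercept-only model with made-up response values; the formula cdi ~ 1 fits the 'model of the mean':

```r
set.seed(5)
cdi <- rnorm(60, mean = 100, sd = 15)   # made-up response values
m0 <- lm(cdi ~ 1)                       # intercept-only model
coef(m0)                                # equals mean(cdi)
coef(summary(m0))[1, "Std. Error"]      # equals sd(cdi) / sqrt(length(cdi)),
                                        #   the standard error of the mean
```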
Is the improvement (reduction) in residual variation in m1 statistically significant (at level $0.05$)?
Compare your result with the F-test reported in the summary of model m1. ⊳
Compare m1 vs. m3 and m3 vs. m4 using anova().
Note: conventionally, the smaller model is listed ﬁrst.
What conclusions do you draw from these comparisons? ⊳
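A sketch of one such comparison, with simulated data and the smaller model listed first:

```r
set.seed(6)
n <- 100
tv <- data.frame(tv.hours = runif(n, 0, 30), mot.education = runif(n, 8, 20))
tv$cdi <- 100 - 0.8 * tv$tv.hours + 1.5 * tv$mot.education + rnorm(n, sd = 5)
m1 <- lm(cdi ~ tv.hours, data = tv)
m3 <- lm(cdi ~ tv.hours + mot.education, data = tv)
anova(m1, m3)  # F-test: is the reduction in residual variance significant?
```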
If you run anova() on a single model object, it will perform F-tests incrementally for each predictor. The order of the tests depends on the order of the predictors in the regression formula. anova() first tests the model with the first predictor against the model of the mean, then the model with the first two predictors against the model with only the first predictor, and so on. As a result, changing the order of the predictors will produce different results.
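The order dependence is easy to see with two correlated made-up predictors:

```r
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # x2 is correlated with x1
y  <- x1 + x2 + rnorm(n)
a1 <- anova(lm(y ~ x1 + x2))    # x1 tested first
a2 <- anova(lm(y ~ x2 + x1))    # x2 tested first
a1["x1", "Sum Sq"]              # sequential sums of squares for x1 ...
a2["x1", "Sum Sq"]              # ... differ between the two orders
```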
Akaike information criterion (AIC) is a measure used for model selection. The AIC is

$\mathrm{AIC} = 2k - 2\ln L$
where $k$ is the number of predictors, and $L$ is the likelihood (of the data given the estimated model), which is another measure of the model's fit to the data (we will return to the discussion of likelihood).
Given two models, the model with the lower AIC value is preferred. Notice that the AIC rewards good fit (higher $L$), and penalizes models with a large number of predictors (higher $k$).
The function AIC() calculates the AIC value for a given model.
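For example (simulated data), a model including a genuinely useful predictor gets a lower AIC than the intercept-only model:

```r
set.seed(8)
n <- 100
x <- runif(n)
y <- 2 * x + rnorm(n)   # y depends on x
AIC(lm(y ~ 1))          # intercept-only model
AIC(lm(y ~ x))          # lower AIC: the predictor is worth its cost
```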
When there are many predictors, one may want to employ an automatic method for model selection. Note that the number of models to compare when you have $k$ predictors is ${2}^{k}$. For example, for $k=10$ this amounts to $1024$ comparisons, and for $k=20$ the number of comparisons required is over a million. With increasing $k$, comparing all possible models becomes intractable.
Even if comparing all possible combinations of predictors were possible, you should not expect any mechanical procedure to find the best model for you. Normally your model selection should be guided by knowledge relevant to the problem at hand. However, a few search procedures that look for an optimal model are commonly used in the literature. The function step() provides the well-known backward (starting with the full model, and eliminating variables that are deemed not useful) and forward (starting with no or a few predictors, and adding the ones that improve the model most) procedures, or a combination of both. If you want larger models to be explored, you will need to use the scope option to specify which predictors step() should consider adding.
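A sketch of backward elimination with step() on simulated data; x3 carries no information about y, so it is a natural candidate for removal:

```r
set.seed(9)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n))
d$y <- 2 * d$x1 + d$x2 + rnorm(n)   # x3 plays no role
full <- lm(y ~ x1 + x2 + x3, data = d)
m.step <- step(full, trace = 0)     # backward elimination guided by AIC
formula(m.step)                     # the useful predictors are retained
```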