9 General Linear Models

The main distinction we have made so far between the models (or analysis methods) we discussed has been based on the type of the predictors. This follows the historical distinction made in the literature: if the predictors are categorical, we use ANOVA; if they are numeric, we use regression. This distinction turns out to be superficial. Both types of analyses can be carried out with a generalization of linear regression, the general linear model. We already made an excursion into general linear models in Section 4.6, by analyzing a categorical predictor with lm(). In this part, we will build on this, working with categorical variables with more than two levels, and also mixing categorical and numeric predictors.

The new data set we use comes from another hypothetical study where we try to assess the factors that affect school children’s success in learning a second language (L2). We assume that all our participants are children of the same age and schooling, but from different backgrounds. The response variable is a test score for second language learners (l2score), and we will use a variety of predictors, including the amount of L2 exposure outside the class (l2exposure), a standardized score of overall achievement in school (gpa), proficiency in the first language (l1score), the socio-economic status of the parents (ses) and gender (gender). The data set is available as a CSV file at http://coltekin.net/cagri/R/data/l2.csv.

Exercise 9.1. Load the CSV file from http://coltekin.net/cagri/R/data/l2.csv. Name the resulting data frame l2. Make sure that the variables gender, ses and subject are factor variables.
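
A minimal sketch of one way to do this (read.csv() can read directly from a URL; the variable names are the ones listed above):

> l2 <- read.csv("http://coltekin.net/cagri/R/data/l2.csv")
> l2$gender  <- factor(l2$gender)
> l2$ses     <- factor(l2$ses)
> l2$subject <- factor(l2$subject)
> str(l2)   # check that the variable types are as expected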

Exercise 9.2.  Using the l2 data set, test whether gender affects school success (gpa) using a linear model fit by lm().

What do the intercept and the slope represent in this model?

Verify your conclusions with an equivalent t-test.
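
If you want to check your approach, the following sketch shows one way (m.gender is just an illustrative name; var.equal=TRUE makes the t-test match the equal-variance assumption of the linear model):

> m.gender <- lm(gpa ~ gender, data = l2)
> summary(m.gender)
> t.test(gpa ~ gender, data = l2, var.equal = TRUE)   # equal-variance t-test, equivalent to the lm() fit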

9.1 Categorical variables in regression

A categorical variable with k levels (groups) is represented by k − 1 separate variables in linear regression. The most common way of representing it is called indicator, treatment or dummy coding. In this type of coding, one of the levels is taken to be the base or reference level (e.g., the control group in an experimental study). For the reference level, all k − 1 variables are set to 0. For each of the other levels, one of the k − 1 variables is set to 1 and the rest are set to 0.

In our gender example above, since we have only two levels, we have only one indicator variable (it should be named genderM in Exercise 9.2). The indicator variable is set to 0 for female subjects, and it is set to 1 for male subjects. If we write the linear predictor as

gpa = a + b × genderM

it should be clear that for girls the prediction of the GPA is a + b × 0 = a, the intercept. For male subjects, the prediction is a + b × 1 = a + b. As a result, the slope b represents the difference between the means of the male and female participants. Then, the t-test performed for the slope indicates whether there is a difference between the means of male and female participants.
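
You can verify this correspondence numerically (a quick check, assuming the l2 data frame from Exercise 9.1):

> coef(lm(gpa ~ gender, data = l2))   # intercept and slope (genderM)
> tapply(l2$gpa, l2$gender, mean)     # group means: their difference equals the slope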

Extending this idea to a variable with more than two levels is easy. For example, if we perform a similar analysis predicting gpa from ses, the estimated linear predictor is (assuming low is the base level)

gpa = a + b1 × ses.mid + b2 × ses.high

where ses.mid and ses.high are two indicator variables, each set to 1 only for the corresponding level of ses. Table 1 shows the relationship between the levels of ses and the indicator variables.


Table 1: Indicator variables for three-level factor ses.

        indicator variables
ses     ses.mid   ses.high
low        0         0
mid        1         0
high       0         1
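
A quick way to see this coding in R is to look at the treatment contrast matrix itself, or at the design matrix R builds (a sketch; the reference level and the exact column names depend on the order of the factor levels in your data):

> contr.treatment(3)                     # the 0-1 coding of Table 1 for any three-level factor
> head(model.matrix(~ ses, data = l2))   # the indicator columns R builds for ses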

Exercise 9.3.  Fit a linear regression model predicting l2score from ses using the l2 data set.

How do you interpret each coefficient and the corresponding significance value?

Exercise 9.4. We already know that we can get a classical ANOVA summary using summary.aov(). Summarize the model in Exercise 9.3 using summary.aov(). Which quantities are the same in both summaries?
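
A sketch of the two summaries side by side (m.ses is an illustrative name):

> m.ses <- lm(l2score ~ ses, data = l2)
> summary(m.ses)       # regression summary: one t-test per coefficient
> summary.aov(m.ses)   # classical ANOVA table for the same model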

Interactions of categorical variables are represented in a straightforward way in indicator coding: we simply include the products of the indicator variables as additional predictors. For example, for the gender and ses factors we have been working with, the resulting linear predictor is

gpa = a + b1 × genderM + b2 × ses.mid + b3 × ses.high + b4 × genderM × ses.mid + b5 × genderM × ses.high

In this case, the intercept represents the case where both variables are at their reference levels (gender = F and ses = low in our example). The slope values should now be interpreted with care. For example, if gender = M and ses = mid, the expected response is a + b1 + b2 + b4.
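
You can inspect the indicator and product columns R constructs for such a model (a quick check, assuming the factors from Exercise 9.1; in R's formula notation gender * ses expands to gender + ses + gender:ses):

> head(model.matrix(~ gender * ses, data = l2))   # main-effect and interaction (product) columns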

Exercise 9.5. Fit a linear model predicting l2score from ses, gender and their interactions.

Try to interpret each coefficient.

Specification of the intercept is implicit in R formulas. We can force the intercept to be 0 by adding -1 to the right-hand side of a formula. For linear models with continuous predictors this does not make much sense. For linear models with categorical predictors, on the other hand, forcing the intercept to be 0 makes the coefficient estimates the estimated means of each level. In combination with the standard errors of these estimates, this parametrization of a linear model may be more insightful in some cases.
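
For example (a sketch using the gender factor; the ses case is left for the next exercise):

> lm(gpa ~ gender - 1, data = l2)   # the two coefficients are the means for girls and boys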

Exercise 9.6. Fit a linear model predicting l2score from ses with no intercept.

9.2 Other ways of coding categorical variables: contrasts

The indicator coding of categorical variables we have worked with so far is easy to interpret. However, there is nothing special about this 0-1 coding scheme. A categorical variable with k levels can be coded as k − 1 variables in many different and useful ways. We will not go into the details of contrast coding in this tutorial, but just mention a few ways to do it.

The default contrasts for a factor variable, as we saw above, are the so-called treatment contrasts. However, if you create a factor variable as an ordered factor (either using the ordered() function or the ordered=T option of factor()), the default contrasts become the so-called polynomial contrasts. Instead of testing differences between means, the polynomial contrasts test whether there is a polynomial trend (e.g., linear or quadratic) along the ordered levels.
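
For a three-level factor, the polynomial contrast matrix can be displayed with contr.poly(); its columns correspond to the linear (.L) and quadratic (.Q) trends:

> contr.poly(3)   # polynomial contrasts for three ordered levels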

Exercise 9.7.  Add a new variable named ses.ordered to the l2 data set, which is an ordered copy of the ses variable.

First plot side-by-side box plots of l2score for each ses.ordered group. Do you observe a trend based on SES?

Now, fit a linear regression model predicting l2score from ses.ordered.

Compare the summary of the model with the one from Exercise 9.3.

Exercise 9.8.  Change the order of levels of the ses.ordered created in Exercise 9.7 as "high"<"low"<"mid", repeat the model fit, and compare your results with Exercise 9.7.

So far, we have worked with the default contrasts for the different types of factor variables (ordered and unordered). You can display and set the contrasts of a factor variable using the function contrasts().

Exercise 9.9. Inspect the contrast matrices of variables gender, ses and ses.ordered in the l2 data set.

You can create well-known contrasts using a set of functions whose names start with contr.. For example, contr.poly(5) will create the polynomial contrast matrix (a matrix similar in structure to the one presented in Table 1) for a factor with five levels. These matrices can then be assigned to a factor variable through the contrasts() function. For example,

 
> contrasts(l2$ses) <- contr.poly(3)  

sets the contrasts for the factor ses to polynomial contrasts.

In fact, you can use any k × (k − 1) matrix as contrasts. Of course, it will be useful only if the resulting comparisons make sense. As said at the beginning, we will not cover contrasts in much detail here. However, before concluding, we will experiment with another set of well-known contrasts, Helmert contrasts, which compare each level of a factor with the mean of the preceding levels. Helmert contrasts in R can be obtained with the function contr.helmert().
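
For a three-level factor, the Helmert contrast matrix looks as follows:

> contr.helmert(3)   # Helmert contrasts for a three-level factor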

Exercise 9.10. Re-order the ses.ordered factor from low to high.

Set the contrasts for the ses.ordered to Helmert contrasts. Fit a linear regression model predicting l2score from ses.ordered with the new contrasts.

Compare your results with the model with the treatment contrasts.

In closing, we note that in all the contrast experiments we did, the main (ANOVA) results are the same. The reason for setting contrasts is to tailor the hypothesis tests performed with a linear model to different purposes. Contrasts that obey certain criteria are called orthogonal contrasts. Orthogonal contrasts are safe against the multiple comparisons problem we discussed in Section 4.

For the rest of this tutorial we stick to the default (treatment) contrasts.

9.3 Mixing categorical and numeric predictors

As we just saw, we can convert a categorical variable with k levels to k − 1 indicator variables, and use multiple linear regression to estimate a linear predictor. This also opens up the possibility of using both categorical and numeric predictors in the same model. Traditionally, this sort of analysis would be called analysis of covariance (ANCOVA).

Given R's formula notation, there is nothing new to learn for fitting mixed-predictor models: we simply specify all predictors, categorical or numeric, on the right-hand side of the formula.
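
For example, a model with one categorical and one numeric predictor (using ses and l1score purely for illustration) is specified as:

> lm(l2score ~ ses + l1score, data = l2)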

Exercise 9.11.  Fit a linear model predicting l2score from gender and gpa (without interaction). Summarize and interpret your results.

Compare the results with a model where gender is the only predictor. How do you explain the change in the coefficient estimates of gender between these two models?

We can conceptualize the model we fit in Exercise 9.11 above as having two intercepts, one for girls and one for boys, resulting in two regression lines with the same slope.
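
The two intercepts and the common slope can be read directly off the coefficient estimates (a sketch; m.add is an illustrative name for the model from Exercise 9.11):

> m.add <- lm(l2score ~ gender + gpa, data = l2)
> coef(m.add)   # (Intercept): girls' intercept; genderM: shift for boys; gpa: common slope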

Exercise 9.12.  Plot l2score against gpa, using the gender symbols ♀ and ♂, and the colors red and blue, for points corresponding to girls and boys respectively (use other symbols if your R environment does not support Unicode characters).

Plot two lines over the scatter plot with the same slope and the appropriate intercepts for girls and boys, based on the estimates from Exercise 9.11.

If we specify interaction between a categorical and a numeric variable, we can conceptualize it as fitting separate intercepts and slopes for all levels of the factor variable.
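
In R's formula notation, such a model is specified with * (a sketch; gender * gpa expands to gender + gpa + gender:gpa):

> lm(l2score ~ gender * gpa, data = l2)   # separate intercepts and slopes per gender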

Exercise 9.13.  Fit a linear model predicting l2score from gender, gpa, and their interaction. Summarize and interpret your results.

Exercise 9.14.  Repeat Exercise 9.12 with the varying intercepts and slopes estimated in Exercise 9.13.

Exercise 9.15.  Fit two separate linear models predicting l2score from gpa, one using only the data corresponding to gender=F, and the other using only the data for gender=M.

Compare the coefficient estimates with the ones from Exercise 9.13.

Exercise 9.16.  Fit a linear model predicting l2score from all other variables in the l2 data set, except subject. Include all interaction terms (also interactions between the numeric predictors). We will refer to this model for the rest of this section as m.full.

  1. Does the model fit to the data well?
  2. Which coefficient estimates are statistically significant?
  3. Is the overall model fit statistically significant? How do you interpret the inference for the overall model, considering the inference for the individual coefficients?
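
One way to specify such a model is sketched below; whether "all interaction terms" means all two-way interactions or the full crossing is a choice you need to make — the version shown here uses the full crossing via *:

> m.full <- lm(l2score ~ l2exposure * gpa * l1score * ses * gender, data = l2)
> summary(m.full)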

Exercise 9.17.  Starting from the m.full from Exercise 9.16, do a stepwise elimination of the variables.

What does step() return as the best model?

How many variables are attempted for removal at the first step? (and why?)
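
A sketch of the call (when only a fitted model is given, step() performs backward elimination by default; m.step is an illustrative name):

> m.step <- step(m.full)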

Exercise 9.18.  Do another stepwise analysis, but this time start from a model with no predictors (the model of the mean), and allow addition and removal of variables at each step. The scope should be equal to the predictors of m.full defined in Exercise 9.16.

Compare the AIC value with the one from the model suggested by step() in Exercise 9.17. Which procedure returns the better model? Is this the best possible model with these predictors?
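
A sketch of one way to set this up (m.null and m.both are illustrative names; formula(m.full) supplies the upper scope):

> m.null <- lm(l2score ~ 1, data = l2)   # the model of the mean
> m.both <- step(m.null, scope = formula(m.full), direction = "both")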

Exercise 9.19. Produce the standard diagnostic plots for the best model identified by step() in Exercise 9.18.

  1. Do you see any sign of non-linearity?
  2. Is there any evidence for non-constant variance?
  3. Are the residuals distributed approximately normally?
  4. Are there any influential observations?
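
The standard diagnostic plots are produced by calling plot() on the fitted model (a sketch, assuming the model selected in Exercise 9.18 is stored as m.both, an illustrative name):

> par(mfrow = c(2, 2))   # arrange the four default diagnostic plots in a 2-by-2 grid
> plot(m.both)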

The following exercise is not useful in practice. However, understanding the model fit and its consequences will help you understand the models we will discuss in later sections.

Exercise 9.20. Fit a model predicting l2score using subject as the only predictor. What do the coefficient estimates mean in this model? Can you explain the r² value and the inference about the coefficient estimates and the overall model?