The main distinction we made between the models (or analysis methods) we discussed so far has been based on the type of the predictors. This follows the historical distinction made in the literature. If the predictors are categorical, we use ANOVA. If they are numeric, we use regression. These distinction turns out to be superﬁcial. Both type of analyses can be carried out a generalization of the linear regression, general linear models. We have already made an excursion into general linear models in Section 4.6, by analyzing a categorical predictor with lm(). In this part, we will built on this, working with categorical variables with more than two levels, and also mixing the categorical and numeric predictors.
The new data set we use comes from another hypothetical study where we try to assess the factors that aﬀect school children’s success in learning a second language (L2). We assume that all our participants are kids at the same age and schooling but from diﬀerent backgrounds. The response variable is a test score for second language learners (l2score), and we will use a variety of predictors including the amount of L2 exposure outside the class (l2exposure), a standardized score of overall achievement in school (gpa), proﬁciency in their ﬁrst language (l1score), socio-economic status of the parents (ses) and the gender (gender). The data set is available as a CSV ﬁle at http://coltekin.net/cagri/R/data/l2.csv.
What does the intercept and slope represent in this model?
Verify your conclusions with an equivalent t-test. $\u22b3$
A categorical variable with $k$ levels (groups) is represented with $k-1$ separate variables in linear regression. The most common way of representation is called indicator, treatment or dummy coding. In this type of coding, one of the levels are taken to be the base or the reference level (e.g., control group in a experimental study). For the reference level, all $k-1$ variables are set to $0$. For the other levels, one of the $k-1$ variables are set to $1$, the rest are set to $0$.
In our gender example above, since we have only two levels, we have only one indicator variable (should be named genderM in Exercise 9.2). The indicator variable is set to $0$ for female subject, and it is set to $1$ for male subjects. If we write the linear predictor as
it should be clear that for girls, we prediction of the GPA to be $a+b\times 0=a$, the intercept. For male subjects, the prediction will be $a+b\times 1$. As a result the slope, $b$, will represent the diﬀerence between the means of male and female participants. Then, the t-test performed for the slope indicates whether there is a diﬀerence between the means of male and female participants.
Extending this idea to a variable with more than two levels is easy. For example, if we perform a similar analysis predicting gpa from ses, the estimated linear predictor is (assuming low is the base level)
where $ses.mid$ and $ses.high$ are two indicator variables, that are set to 1 only for the corresponding level of the ses. Table 1 shows the relationship between the levels of the ses and the indicator variables.
How do you interpret each coeﬃcient and the corresponding signiﬁcance value? $\u22b3$
The interactions of categorical variables are represented in a straightforward way in indicator coding. What we do is simply include the product of each of the indicator variables. For example, for the gender and ses factors we were working with, the resulting predictor would be like
In this case, the intercept represents the case where both variables are in their reference level (gender = F and ses = low in our example). The slope values should now be interpreted with care. For example, if gender = M and ses = mid, the expected response is $a+{b}_{1}+{b}_{2}+{b}_{4}$.
Try to interpret each coeﬃcient. $\u22b3$
Speciﬁcation of intercept is implicit in R formulas. We can force intercept to be $0$ by adding -1 to the right side of a formula. For linear models with continuous predictors this does not make much sense. For linear models with categorical predictors, on the other hand, forcing intercept to be $0$ makes the coeﬃcient estimates the estimates of means for each level. In combination with the standard errors of each estimate, this parametrization of a linear model may be more insightful in some cases.
The indicator coding of categorical variables we worked with so far is easy to interpret. However, there is nothing special about this $0$-$1$ coding scheme. A categorical variable with $k$ levels canbe coded as $k-1$ variables many diﬀerent and useful ways. We will not go into details of contrast coding in this tutorial, but just mention a few ways to do it.
The default contrasts for a factor variable, as we saw above, is so-called treatment contrasts. However, if you create a factor variable as an ordered factor (either using the ordered() function or ordered=T option of factor()),$3$ the default contrasts become so-called polynomial contrasts. Instead of diﬀerences between means, the polynomial contrasts test whether there is a polynomial trend of some sort along the ordered levels.
First plot side-by-side box plots of l2score for each ses.ordered group. Do you observe of trend based on SES?
Now, ﬁt a linear regression model predicting l2score from ses.ordered.
Compare the summary of the model with the one from Exercise 9.3. $\u22b3$
So, far we worked with the default contrasts for diﬀerent types of factor variables (ordered and unordered). You can display and set contrasts for a factor variables using the function contrasts().
You can create well-known contrasts using a set of functions that start with contr.. For example, contr.poly(5) will create the contrast matrix (similar to the one presented in Table 1) for a factor with ﬁve levels. These matrices then can be assigned to a factor variable through the contrast() function. For example,
> contrast(ses) <- contr.poly(3)
sets the contrast for the factor ses to polynomial contrasts.
In fact, you can use any $k$ by $k-1$ matrix as contrasts. Of course, it will be useful only if the resulting comparisons makes sense. As said at the beginning, we will not cover contrasts in much detail here. However, before concluding, we will experiment with another well-known contrasts, Helmert contrasts, which are used to compare each level of a factor to the previous one. Helmert contrasts in R can be obtained with the function contr.helmert().
Set the contrasts for the ses.ordered to Helmert contrasts. Fit a linear regression model predicting l2score from ses.ordered with the new contrasts.
Compare your results with the model with the treatment contrasts. $\u22b3$
In closing, we note that in all the contrast experiments we did, the main (ANOVA) results are the same. The reason for setting contrasts is to facilitate the hypothesis tests performed with a linear model for diﬀerent purposes. Contrasts that obey certain criteria are called orthogonal contrasts. Orthogonal contrasts are safe against the multiple comparisons problem we discussed in Section 4.
For the rest of this tutorial we stick to the default (treatment) contrasts.
As we just saw, we can convert a categorical variable with $k$ levels to $k-1$ indicator variables, and use multiple linear regression to estimate a linear predictor. This also opens up the possibility of using both categorical and numeric predictors in the same model. Traditionally this sort of analyses would be called analysis of covariance (ANCOVA).
Given our formula notation, there is nothing interesting for ﬁtting mixed-predictor models. We just specify all predictors, categorical or numeric, on the right hand side of the formula.
Compare the results with a model where gender is the only predictor. How do you explain the change in the coeﬃcient estimates of gender between these two models? $\u22b3$
We can conceptualize the model we ﬁt in Exercise 9.11 above as having two intercepts, one for girls and one for boys, resulting in two regression lines with the same slope.
Plot two lines over the scatter plot with the same slope and appropriate intercepts for girls and boys based on the estimations from Exercise 9.11. $\u22b3$
If we specify interaction between a categorical and a numeric variable, we can conceptualize it as ﬁtting separate intercepts and slopes for all levels of the factor variable.
Compare the coeﬃcient estimates with the ones from Exercise 9.13. $\u22b3$
What does step() return as the best model?
How many variables are attempted for removal at the ﬁrst step? (and why?) $\u22b3$
Compare the AIC value with the one suggested by the step() in Exercise 9.17. Which procedure returns you the best model? Is this the best possible model with these predictors? $\u22b3$
The following exercise is useless in practice. However, understanding the model ﬁt, and the consequences will help you understand the models we will discuss in later sections.