4 Linear models with categorical predictors

In this set of exercises we will study methods for analyzing data where the response variable is continuous (e.g., pitch, duration, reaction time) and the predictor is categorical (e.g., gender, language, part-of-speech tag). Crucially, all methods we discuss here work with independent samples (e.g., no multiple measurements from the same individual). Analyzing ‘repeated measures’ or non-independent data will be addressed in the next section.

In Section 2, we have already seen an example where we investigated whether ‘talkativeness’ can be predicted by gender, and analyzed the data using the t test.

For the first part of this section, we revisit the t-test exercises from Section 2, so you will need the words data set used there. If you do not have the variables from Section 2 in your R environment (remember, you can check this using ls()), use the commands given at the beginning of Section 2 to create them.

Now that we know that related sets of variables should be stored in data frames instead of individual vector variables, we would like to put these variables together in a data frame.

Exercise 4.1.  Create a data frame containing only the variables words and age. Name your data frame ‘talk’.

The talk data frame does not yet contain the information on the genders of the participants. From the way we constructed the words vector earlier, we know that the first 10 values correspond to the female participants, and the last 10 values correspond to the male participants. In a properly constructed data frame, we typically do not want unrelated values on a row. So, a data frame that contains word counts for male and female users in different columns is not a good option (although this data format, called ‘wide format’, is used by some other statistical software by default). Instead, we will keep the data frame as created in Exercise 4.1, and add a new column with the categorical variable gender.

Besides numeric data types, R supports a ‘factor’ data type that is suitable for categorical or nominal data (factors can optionally be ordinal, but this is not a concern for us in this section). Here is how to add the gender column to the talk data frame.

 
> talk$gender <- factor(c(rep("F", 10), rep("M", 10))) 
> talk$gender 
 [1] F F F F F F F F F F M M M M M M M M M M 
Levels: F M  

The first line has two new R functions that we have not seen yet. rep() returns a vector in which its first argument is repeated as many times as its second argument specifies. As we have seen before, c() concatenates its arguments. And factor() converts the resulting vector into a factor. If we display it, as in the second line of the listing above, R will also show the possible values, or ‘levels’, the factor variable can take. Note that the quotes around F and M are required: without them, R would interpret F as the logical value FALSE and complain that M is an undefined variable.
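
For example, here is what rep() and c() produce on their own (a quick check you can run in the R console):

 
> rep("F", 3) 
[1] "F" "F" "F" 
> c(rep("F", 2), rep("M", 2)) 
[1] "F" "F" "M" "M" 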

We will return to some of the details regarding factor variables. For now, it suffices to understand that we have a factor variable gender with the value F for the values in words that were collected from the female participants, and M for those from the male participants.

Besides the talk data we reorganized, we will use another hypothetical data set where we investigate early language development of children, given their parents’ socio-economic status (SES). Our fake data set includes mean length of utterance (MLU) values from 10 girls and 10 boys for each of the three levels of SES we consider (low, medium, high). You can load the data with the following commands.

 
> mlu.ses <- read.csv( 
    "http://coltekin.net/cagri/R/data/mlu-ses.csv") 
> mlu.ses$ses <- factor( 
    mlu.ses$ses, levels=c("low", "medium", "high"))  

If you are working offline, you should use the local file name instead of the URL in the first command. Note that in R strings the backslash ‘\’ has a special function. When using local path names on Windows, you should either use double backslashes (as in ‘C:\\My files\\R\\mlu-ses.csv’), or use forward slashes ‘/’ instead. The second command in the above listing corrects the ordering of the SES levels (and yes, we will return to this one too).

If all goes well, you will have a data frame with four variables (subject, ses, gender, mlu) and 60 rows (observations, cases). Note that the talk data frame we created earlier also has a variable with the same name (gender). For each exercise that follows, we will state explicitly which data set to use.

Exercise 4.2.  The function str() prints a readable summary of a variable (an R object). Inspect the data frame using str(). Does the summary match what you expect from the description above?

Exercise 4.3. You should have seen in Exercise 4.2 that R took the variable subject (which corresponds to the arbitrary ID of each participant in the study) as a numeric value. Convert the subject variable in mlu.ses to a factor variable.

Earlier, we saw how to extract the value of a particular cell, column or row from a data frame. Often, we want to extract values that meet a certain condition. The row and column indices that we used earlier can contain arbitrary conditional expressions. For example, the following listing demonstrates how to extract all rows where mlu is greater than four and ses is high.

 
> mlu.ses[mlu.ses$mlu > 4 & mlu.ses$ses == "high",] 
   subject  ses gender  mlu 
54      54 high female 4.20 
59      59 high female 4.62  

Note that equality in a comparison is expressed with a double equal sign ‘==’, and the sign ‘&’ means the boolean ‘and’ operation. Other common operators you can use are ‘!=’ (not equal), ‘<’, ‘>’, ‘<=’, ‘>=’, ‘|’ (boolean ‘or’) and ‘!’ (boolean ‘not’).
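
For instance, the following lines sketch how some of these operators can be combined (try them in your console and inspect the results):

 
> mlu.ses$mlu[mlu.ses$ses != "low"]              # mlu values for medium and high SES 
> mlu.ses[mlu.ses$mlu <= 2 | mlu.ses$mlu >= 4,]  # rows with extreme mlu values 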

Exercise 4.4. Choose only the mlu values from mlu.ses where the gender is male.

4.1 Comparing two means

The classical method for comparing two means is the Student’s t test that we experimented with in the previous section. Here we will repeat some of the earlier t-test exercises in a slightly different way. An alternative notation for running the t test in Exercise 2.16 is


Listing 6: T test with formula notation.
> t.test(talk$words ~ talk$gender, alternative="greater")

The output will be the same. The reason for this repetition is to familiarize you with categorical variables and the formula notation. Both of these concepts will be used throughout this section, and for the rest of the tutorial. Note that you need to prefix the variable names with ‘talk$’ to specify which data frame you are referring to. If you do not want to type the data frame name, you can use the command

 
> attach(talk)  

after which the columns in the talk data frame will be accessible from your default R environment without referring to the data frame name. You can use detach() if you want to go back to the state before the attach() command.

The notation words ~ gender can be translated into plain words as ‘words is explained by gender’, or more technically ‘words is the response variable, and gender is the predictor’. We will use this notation frequently, and Appendix B explains the formula notation in some detail.
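
A third option, which avoids both the ‘talk$’ prefix and attach(), is to pass a data argument to functions that accept a formula. For example, the t test above can equivalently be written as follows (this is the style used in the listings below):

 
> t.test(words ~ gender, data=talk, alternative="greater") 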

4.2 Checking assumptions of t test and ANOVA

All statistical methods (or models) come with a set of assumptions. For the independent-samples t test and the ANOVA that we will work on shortly, there are three important assumptions:

  1. The observations are independent of each other.
  2. The data in each group are (approximately) normally distributed.
  3. The variances of the groups are (approximately) equal.

The first assumption above is related to how you collect your data (or conduct your experiments). We will discuss methods for analyzing non-independent data later. For now we will work with a few graphical and analytic tools for investigating the last two assumptions.

Exercise 4.5. One of the ways of visualizing the distribution of a set of data points is to plot a histogram. Plot necessary histograms to inspect whether the talk data meets the normality assumption required by the t test in Listing 6. Note that you will need to check both subsets.

Exercise 4.6. Use normal Q-Q plots to inspect whether the data meets the normality requirement.

A non-graphical method of testing for normality is the ‘Shapiro-Wilk test of normality’. The R function shapiro.test() performs this test.
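
As a sketch of its use, the following line tests the words of the female participants only (you will need a similar call for the male participants):

 
> shapiro.test(talk$words[talk$gender == "F"]) 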

Exercise 4.7. Verify whether Shapiro-Wilk test of normality agrees with your earlier judgments from the histograms and Q-Q plots.

By default, t.test() in R applies a correction for unequal variances. This is the reason why the t tests we ran earlier were reported as ‘Welch Two Sample t-test’. If you want to override this behavior, you can pass the option var.equal=TRUE to t.test().
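
For example, a two-sided test assuming equal variances would look like the following sketch (the alternative option can still be combined with it as before):

 
> t.test(words ~ gender, data=talk, var.equal=TRUE) 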

Exercise 4.8. Run an independent-samples t test of the alternative hypothesis that women talk more than men, using the words data as before, but instruct t.test() to assume equal variances, and use the formula notation.

Do the t-test results with or without corrections differ substantially?

Can you reason about the change in the ‘degrees of freedom’ reported by the t tests with and without the Welch correction?

Exercise 4.9. A good way of visualizing whether the variances of two (or more) samples are equal is to use box plots. We have already produced the necessary plot in Exercise 2.4. Produce the same box plot, but use the formula notation this time. Do the variances of the male and female subsets look equal?

Exercise 4.10. Another (non-visual) way of checking equality of variances is to use var.test() (which basically performs an F-test). Do the variances of male and female subsets differ according to var.test()?

We have already seen that t.test() includes a correction for the non-homogeneity of variances by default. If, on the other hand, divergence from normality is a concern, we prefer a non-parametric test. The non-parametric alternative of the independent two-sample t test is called the Mann-Whitney test (equivalently, the Wilcoxon rank sum test).

Exercise 4.11. In R, the non-parametric alternatives of all flavors of the t test are performed using the function wilcox.test(). Perform a Mann-Whitney test, and compare the results with the results from the t tests performed earlier.

4.3 Single ANOVA

If we have more than two levels or ‘groups’ in our predictor variable, the classical test to perform is the single, or one-way, ANOVA (remember that in this section we only work with independent samples). Traditionally, ANOVA and linear regression have been considered very different analyses. In this subsection and the next, we will follow this approach and study these methods as ‘hypothesis testing’ methods. However, the single and multiple ANOVA we discuss in this section and linear regression are, in fact, part of a general framework called general linear models. We will make a first attempt to demonstrate this connection in Section 4.6.

To demonstrate ANOVA in R, we investigate the effect of socio-economic status (SES) on children’s mean length of utterance (MLU) using the hypothetical data set introduced at the beginning of this section. Before doing the actual analysis, we will inspect our data and check whether it meets the requirements of ANOVA.

Exercise 4.12. Generate side-by-side box plots of MLU for each level of SES.

Remember that you can access the individual columns (variables) in a data frame using the dollar ‘$’ operator. For example, mlu.ses$mlu will give you the mlu column of the data frame. Alternatively, you can give the option data=mlu.ses to boxplot() and use the column names directly.
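
With the data option, the box plots can be produced with a one-liner like this sketch:

 
> boxplot(mlu ~ ses, data=mlu.ses) 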

Based on the box plots,

  1. Do the variances look similar?
  2. Can you see any clear divergences from normality?
  3. Do you expect any significant differences in MLU due to SES?

Exercise 4.13. Check normality of each group using graphical methods, and shapiro.test().

All (independent-samples) linear models can be fit using the function lm(), which was introduced briefly in Section 3. Our use here differs in two ways: first, our predictor is a categorical variable, and second, we summarize the model fit by lm() in a different way. Listing 7 shows how to run a single ANOVA investigating the effect of ses on mlu. We use the lm() function to fit a general linear model, and summarize it using the function summary.aov(), which prints out the information relevant for a classical ANOVA.


Listing 7: Single ANOVA.
1 > summary.aov(lm(mlu ~ ses, data=mlu.ses)) 
2             Df Sum Sq Mean Sq F value   Pr(>F) 
3 ses          2  20.71  10.357   21.07 1.41e-07 *** 
4 Residuals   57  28.02   0.492

If you remember ANOVA from your basic statistics course, the output requires little explanation. The sum of squared differences and the ‘mean squared difference’ you see on the third line represent the variation between ses groups. The degrees of freedom are 2 since we have three groups. The quantities on the fourth line are due to ‘within group’ or ‘error’ variation. The degrees of freedom value is 57 since we have 60 data points and three groups. The F value is the result of dividing the ses ‘Mean Sq’ by the Residuals ‘Mean Sq’, which results in a very small p value, more precisely 1.41 × 10⁻⁷ = 0.000000141. Note that R typically uses scientific notation when presenting very small numbers. If you are convinced that ANOVA is the correct analysis to run on this data, the result tells us that SES plays a role in the MLU of a three-year-old child, and this role is (very) unlikely to be due to chance. Note, however, that whether this effect has practical importance is another question that requires more information than the p value, and also depends on the particular problem studied.
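
As a quick sanity check, you can recompute these quantities from the table yourself (a sketch; the small discrepancy in the F value is due to rounding of the displayed table values):

 
> 10.357 / 0.492                      # F = ses 'Mean Sq' / Residuals 'Mean Sq' 
[1] 21.05081 
> pf(21.07, 2, 57, lower.tail=FALSE)  # p value from the F(2, 57) distribution 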

Exercise 4.14. When you have only two groups, ANOVA is equivalent to the t test. Using the data frame talk, analyze the effect of gender on words using ANOVA. Compare your ANOVA results with the equivalent t-test results from Section 4.1.

Can you test ‘directional’ hypotheses with ANOVA as we did earlier with the t-test?

At this point, you may wonder why we do ANOVA at all, or why we do not perform multiple t tests instead. One answer to this question is that you may not have a specific hypothesis about differences between individual groups (e.g., that ‘low’ SES and ‘medium’ SES differ), but you may expect the categorical predictor to have an overall effect. This is probably reasonable for our example analysis of the effect of SES on MLU.

The second reason has to do with the logic of hypothesis testing in general. In an exploratory study looking for differences, every pairwise difference you test increases the probability of finding a difference by chance. As the number of groups increases, the number of possible (pairwise) comparisons increases, and so does your chance of getting a significant difference where there is none. That said, if you have a set of specific hypotheses, you should go ahead and test them. The concern is valid when you would take ‘any’ pairwise difference to be interesting.
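
To make this concrete: if each test has a false-positive probability of 0.05 and the tests are independent, the probability of at least one false positive among k tests is 1 − 0.95^k. With three groups there are already three pairwise comparisons:

 
> 1 - 0.95^3   # chance of at least one false positive in 3 comparisons 
[1] 0.142625 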

For an amusing story of multiple comparisons causing false positives, see http://www.wired.com/wiredscience/2009/09/fmrisalmon/, which reports on an fMRI study that finds “significant evidence” that a dead salmon shows emotional responses. For another fun but less ‘scientific’ demonstration, see http://xkcd.com/882/.

If you want to explore the data for potential pairwise differences and also want to perform valid hypothesis tests, then you should apply appropriate corrections (typically by lowering your significance level).

There are a number of different corrections that differ in their assumptions. Some of these corrections are implemented by the R function pairwise.t.test().
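
Its basic use takes the response vector and the grouping factor; the correction method is chosen with the p.adjust.method option (if none is given, R applies the Holm correction):

 
> pairwise.t.test(mlu.ses$mlu, mlu.ses$ses) 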

Exercise 4.15. Perform all pairwise SES comparisons of MLU values in the data set mlu.ses using pairwise.t.test(). Note that pairwise.t.test() uses by default a different correction method than the well-known Bonferroni correction. However, it can perform the Bonferroni correction if the appropriate options are given. Also perform the pairwise comparisons with the Bonferroni correction, and compare the results with the R default.

Which correction is more conservative?

4.4 Factorial ANOVA

Often there are multiple categorical predictors that may affect the response variable of interest. In such cases, it is more economical (in terms of effort spent on experimentation or data collection) to perform a combined analysis. At least as importantly, analyzing multiple effects together also allows investigating their interactions. The extension of ANOVA to multiple groupings of the response variable is called factorial, or n-way, ANOVA.

The way to run a factorial ANOVA is not much different from the single ANOVA example in Listing 7. The only difference is in the model specification, that is, the way we write the model formula. We will first start with an example where we ignore possible interactions. To specify a factorial ANOVA that investigates the independent effects of ses and gender on mlu, the formula to be used is ‘mlu ~ ses + gender’. Note that you should not read the sign ‘+’ here as an arithmetic operation.

Exercise 4.16. Perform a factorial ANOVA that investigates the effects of ses and gender on mlu. Compare your results with the single ANOVA results in Listing 7.

If we also want to include the interactions, our formula becomes ‘mlu ~ ses * gender’ (with a ‘*’ instead of the ‘+’; again, you should refrain from reading this as arithmetic multiplication).

Exercise 4.17.  Perform a factorial ANOVA to investigate the effects of ses and gender and their interactions on mlu.

The conventional visualization of the interaction between two categorical variables can be displayed using the function interaction.plot(). An interaction plot puts one of the factors (categorical variables) on the x-axis and the (numeric) response variable on the y-axis. It plots a dot for the mean of the response variable for each combination of the two factor variables, and draws a line between all dots that have the same level of the second factor. As a result, you need to give interaction.plot() two factor variables and the response variable as arguments. It will calculate the mean of each ‘cell’ from the data and plot it according to the description above.
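
For example, the following sketch puts ses on the x-axis and draws one line per gender (swapping the first two arguments gives the complementary view):

 
> interaction.plot(mlu.ses$ses, mlu.ses$gender, mlu.ses$mlu) 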

Exercise 4.18. Plot the interactions of ses and gender, and interpret the resulting graph with reference to the findings in Exercise 4.17.

4.5 If ANOVA assumptions are not met

If the normality and homogeneity of variances assumptions of ANOVA are violated by the data at hand, you typically have two sensible ways to go:

  1. Transform the data so that the assumptions are (approximately) met.
  2. Use a non-parametric alternative, such as the Kruskal-Wallis rank sum test.

Finding the correct transformation is not always easy. We will work with some sensible transformations later. For complicated methods, including most configurations of factorial ANOVA, there are no straightforward non-parametric alternatives. As a result, trying to tame the data (e.g., by transformations) might be the only option at hand.

The function to perform the Kruskal-Wallis rank sum test in R is called kruskal.test(). The Kruskal-Wallis test does not assume normality; however, you should be aware that it requires the samples to come from identically shaped distributions, which is not the case if the variances are not similar.

Exercise 4.19. Perform the alternative non-parametric test for the ANOVA presented in Listing 7. Is the test more conservative (less sensitive) than the ANOVA?

4.6 T-test as a linear model

It has been hinted along the way that there is a strong connection between the t test (and also ANOVA) and linear regression. In this subsection, we will demonstrate this connection for the t test, and extend it to ANOVA(s) later.

Exercise 4.20. Perform (yet another) t test of the difference in the number of words spoken per day between male and female speakers. Force t.test() to assume equal variances, and do not use a directional alternative hypothesis.

Exercise 4.21. Fit a least squares regression model using lm(), where the response is words and the predictor is gender. Produce a summary of the model with the function summary() (not with summary.aov()).

Which numbers match the t-test results from Exercise 4.20? Can you explain what is happening?

The trick is in the way categorical variables are used in regression. Remember the usual equation for a straight line:

y = a + bx

Now, consider that x is a categorical variable, e.g., gender, and that we set it to 0 for women and 1 for men. This means that the expected value of y is equal to a for women (substitute 0 for x in the formula above and calculate), and the expected value of y is a + b for men. In other words, the difference between the expected values of y for men and women is b. Hence, the t test performed on the slope amounts to testing whether the mean difference between women and men is 0.
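
You can verify this correspondence on the talk data with a sketch like the following (the exact numbers depend on your data):

 
> m <- lm(words ~ gender, data=talk) 
> coef(m)[1]   # the intercept a: the mean of the F (reference) group 
> mean(talk$words[talk$gender == "F"]) 
> coef(m)[2]   # the slope b: the difference between the group means 
> mean(talk$words[talk$gender == "M"]) - 
    mean(talk$words[talk$gender == "F"]) 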

The above description is only a special case of what is called indicator, or dummy, coding. This coding turns analyses like the t test or ANOVA into linear regression. The concept is also directly related to the so-called ‘contrasts’ used in ANOVA. The particular way the variable x above is coded is called a treatment contrast, which is the default in R.

We will return to this topic after we work on multiple regression.