4 Linear models with categorical predictors

In this set of exercises we will study methods for analyzing data where the response variable is continuous (e.g., pitch, duration, reaction time) and the predictor is categorical (e.g., gender, language, part-of-speech tag). Crucially, all methods we discuss here work with independent samples (e.g., no multiple measurements from the same individual). Analyzing ‘repeated measures’ or non-independent data will be addressed in the next section.

In Section 2, we have already seen an example where we investigated whether ‘talkativeness’ can be predicted by gender, and analyzed the data using the t test.

For the first part of this section, we revisit the t-test exercises from Section 2, so you will need the words data set used there. If you do not have the variables from Section 2 in your R environment (remember, you can check this using ls()), use the commands given at the beginning of Section 2 to create them.

Now that we know that related sets of variables should be stored in data frames instead of individual vector variables, we would like to put these variables together in a data frame.

Exercise 4.1.  Create a data frame containing only the variables words and age. Name your data frame ‘talk’.

The talk data frame does not yet contain the information on the genders of the participants. From the way we constructed the words vector earlier, we know that the first 10 values correspond to the female participants, and the last 10 values correspond to the male participants. In a properly constructed data frame, we typically do not want unrelated values on a row. So, a data frame that contains word counts for male and female users in different columns is not a good option (although this data format, called ‘wide format’, is used by some other statistical software by default). Instead, we will keep the data frame as created in Exercise 4.1, and add a new column with the categorical variable gender.

Besides numeric data types, R supports a ‘factor’ data type that is suitable for categorical or nominal data (factors can optionally be ordinal, but this is not a concern for us in this section). Here is how to add the gender column to the talk data frame.

 
> talk$gender <- factor(c(rep("F", 10), rep("M", 10))) 
> talk$gender 
 [1] F F F F F F F F F F M M M M M M M M M M 
Levels: F M  

The first line has two new R functions that we have not seen yet. rep() returns a vector in which its first argument is repeated as many times as its second argument specifies. As we have seen before, c() concatenates its arguments. And factor() converts the resulting vector into a factor. If we display it, as in the second line of the listing above, R will also show the possible values, or ‘levels’, the factor variable can take. Note that the quotes around F and M are required: without them, R would interpret F as the logical value FALSE and complain that M is an undefined variable.
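
For example, here is what rep() and c() produce on their own (a quick check you can run in the R console):

 
> rep("F", 3) 
[1] "F" "F" "F" 
> c(rep("F", 2), rep("M", 2)) 
[1] "F" "F" "M" "M" 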

We will return to some of the details regarding factor variables. For now, it suffices to understand that we have a factor variable gender with the value F for the values in words that were collected from the female participants, and M for those from the male participants.

Besides the talk data we reorganized, we will use another hypothetical data set where we investigate early language development of children, given their parents’ socio-economic status (SES). Our fake data set includes mean length of utterance (MLU) values from 10 girls and 10 boys for each of the three levels of SES we consider (low, medium, high). You can load the data with the following commands.

 
> mlu.ses <- read.csv( 
    "http://coltekin.net/cagri/R/data/mlu-ses.csv") 
> mlu.ses$ses <- factor( 
    mlu.ses$ses, levels=c("low", "medium", "high"))  

If you are working offline, you should use the local file name instead of the URL in the first command. Note that in R strings the backslash ‘\’ has a special function. When using local path names on Windows, you should either use double backslashes (as in ‘C:\\My files\\R\\mlu-ses.csv’), or use forward slashes ‘/’ instead. The second command in the above listing corrects the ordering of the SES levels (and yes, we will return to this one too).

If all goes well, you will have a data frame with four variables (subject, ses, gender, mlu) and 60 rows (observations, cases). Note that the talk data frame we created earlier also has a variable with the same name (gender). For each exercise that follows, we will state explicitly which data set to use.

Exercise 4.2.  The function str() prints a readable summary of a variable (an R object). Inspect the data frame using str(). Does the summary match what you expect from the description above?

Exercise 4.3. You should have seen in Exercise 4.2 that R took the variable subject (which corresponds to the arbitrary ID of each participant in the study) as a numeric value. Convert the subject variable in mlu.ses to a factor variable.

Earlier, we saw how to extract the value of a particular cell, column or row from a data frame. Often, we want to extract values that meet a certain condition. The row and column indices that we used earlier can contain arbitrary conditional expressions. For example, the following listing demonstrates how to extract all rows where mlu is greater than four and ses is high.

 
> mlu.ses[mlu.ses$mlu > 4 & mlu.ses$ses == "high",] 
   subject  ses gender  mlu 
54      54 high female 4.20 
59      59 high female 4.62  

Note that equality in a comparison is expressed with a double equal sign ‘==’, and the sign ‘&’ means the boolean ‘and’ operation. Other common operators you can use are ‘!=’ (not equal), ‘<’, ‘>’, ‘<=’, ‘>=’, ‘|’ (boolean ‘or’) and ‘!’ (boolean ‘not’).
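
For instance, the following lines sketch how some of these operators can be combined (try them in your console and inspect the results):

 
> mlu.ses$mlu[mlu.ses$ses != "low"]              # mlu values for medium and high SES 
> mlu.ses[mlu.ses$mlu <= 2 | mlu.ses$mlu >= 4,]  # rows with extreme mlu values 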

Exercise 4.4. Choose only the mlu values from mlu.ses where the gender is male.

4.1 Comparing two means

The classical method for comparing two means is the Student’s t test that we experimented with in the previous section. Here we will repeat some of the earlier t-test exercises in a slightly different way. An alternative notation for running the t test in Exercise 2.16 is


Listing 6: T test with formula notation.
> t.test(talk$words ~ talk$gender, alternative="greater")

The output will be the same. The reason for this repetition is to familiarize you with categorical variables and the formula notation. Both of these concepts will be used throughout this section, and for the rest of the tutorial. Note that you need to prefix the variable names with ‘talk$’ to specify which data frame you are referring to. If you do not want to type the data frame name, you can use the command

 
> attach(talk)  

after which the columns in the talk data frame will be accessible from your default R environment without referring to the data frame name. You can use detach() if you want to go back to the state before the attach() command.

The notation words ~ gender can be translated into plain words as ‘words is explained by gender’, or more technically ‘words is the response variable, and gender is the predictor’. We will use this notation frequently, and Appendix B explains the formula notation in some detail.
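
A third option, which avoids both the ‘talk$’ prefix and attach(), is to pass a data argument to functions that accept a formula. For example, the t test above can equivalently be written as follows (this is the style used in the listings below):

 
> t.test(words ~ gender, data=talk, alternative="greater") 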

4.2 Checking assumptions of t test and ANOVA

All statistical methods (or models) come with a set of assumptions. For the independent-samples t test and the ANOVA that we will work on shortly, there are three important assumptions:

  1. The observations are independent of each other.
  2. The data in each group are (approximately) normally distributed.
  3. The variances of the groups are (approximately) equal.

The first assumption above is related to how you collect your data (or conduct your experiments). We will discuss methods for analyzing non-independent data later. For now we will work with a few graphical and analytic tools for investigating the last two assumptions.

Exercise 4.5. One of the ways of visualizing the distribution of a set of data points is to plot a histogram. Plot necessary histograms to inspect whether the talk data meets the normality assumption required by the t test in Listing 6. Note that you will need to check both subsets.

Exercise 4.6. Use normal Q-Q plots to inspect whether the data meets the normality requirement.

A non-graphical method of testing for normality is the ‘Shapiro-Wilk test of normality’. The R function shapiro.test() performs this test.
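
As a sketch of its use, the following line tests the words of the female participants only (you will need a similar call for the male participants):

 
> shapiro.test(talk$words[talk$gender == "F"]) 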

Exercise 4.7. Verify whether Shapiro-Wilk test of normality agrees with your earlier judgments from the histograms and Q-Q plots.

By default, t.test() in R applies a correction for unequal variances. This is the reason why the t tests we ran earlier were reported as ‘Welch Two Sample t-test’. If you want to override this behavior, you can pass the option var.equal=TRUE to t.test().
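
For example, a two-sided test assuming equal variances would look like the following sketch (the alternative option can still be combined with it as before):

 
> t.test(words ~ gender, data=talk, var.equal=TRUE) 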

Exercise 4.8. Run an independent-samples t test of the alternative hypothesis that women talk more than men, using the words data as before, but instruct t.test() to assume equal variances, and use the formula notation.

Do the t-test results with or without corrections differ substantially?

Can you reason about the change in the ‘degrees of freedom’ reported by the t tests with and without the Welch correction?

Exercise 4.9. A good way of visualizing whether the variances of two (or more) samples are equal is to use box plots. We have already produced the necessary plot in Exercise 2.4. Produce the same box plot, but use the formula notation this time. Do the variances of the male and female subsets look equal?

Exercise 4.10. Another (non-visual) way of checking equality of variances is to use var.test() (which basically performs an F-test). Do the variances of male and female subsets differ according to var.test()?

We have already seen that t.test() includes a correction for the non-homogeneity of variances by default. If, on the other hand, divergence from normality is a concern, we prefer a non-parametric test. The non-parametric alternative of the independent two-sample t test is called the Mann-Whitney test (equivalently, the Wilcoxon rank sum test).

Exercise 4.11. In R, the non-parametric alternatives of all flavors of the t test are performed using the function wilcox.test(). Perform a Mann-Whitney test, and compare the results with the results from the t tests performed earlier.

4.3 Single ANOVA

If we have more than two levels or ‘groups’ in our predictor variable, the classical test to perform is the single, or one-way, ANOVA (remember that in this section we only work with independent samples). Traditionally, ANOVA and linear regression have been considered very different analyses. In this subsection and the next, we will follow this approach and study these methods as ‘hypothesis testing’ methods. However, the single and multiple ANOVA we discuss in this section and linear regression are, in fact, part of a general framework called general linear models. We will make a first attempt to demonstrate this connection in Section 4.6.

To demonstrate ANOVA in R, we investigate the effect of socio-economic status (SES) on children’s mean length of utterance (MLU) using the hypothetical data set introduced at the beginning of this section. Before doing the actual analysis, we will inspect our data and check whether it meets the requirements of ANOVA.

Exercise 4.12. Generate side-by-side box plots of MLU for each level of SES.

Remember that you can access the individual columns (variables) in a data frame using the dollar ‘$’ operator. For example, mlu.ses$mlu will give you the mlu column of the data frame. Alternatively, you can give the option data=mlu.ses to boxplot() and use the column names directly.
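
With the data option, the box plots can be produced with a one-liner like this sketch:

 
> boxplot(mlu ~ ses, data=mlu.ses) 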

Based on the box plots,

  1. Do the variances look similar?
  2. Can you see any clear divergences from normality?
  3. Do you expect any significant differences in MLU due to SES?

Exercise 4.13. Check normality of each group using graphical methods, and shapiro.test().

All (independent-samples) linear models can be fit using the function lm(), which was introduced briefly in Section 3. Our use here differs in two ways: first, our predictor is a categorical variable, and second, we summarize the model fit by lm() in a different way. Listing 7 shows how to run a single ANOVA investigating the effect of ses on mlu. We use the lm() function to fit a general linear model, and summarize it using the function summary.aov(), which prints out the information relevant for a classical ANOVA.


Listing 7: Single ANOVA.
1 > summary.aov(lm(mlu ~ ses, data=mlu.ses)) 
2             Df Sum Sq Mean Sq F value   Pr(>F) 
3 ses          2  20.71  10.357   21.07 1.41e-07 *** 
4 Residuals   57  28.02   0.492

If you remember ANOVA from your basic statistics course, the output requires little explanation. The sum of squared differences and the ‘mean squared difference’ you see on the third line represent the variation between ses groups. The degrees of freedom are 2 since we have three groups. The quantities on the fourth line are due to ‘within group’ or ‘error’ variation. The degrees of freedom value is 57 since we have 60 data points and three groups. The F value is the result of dividing the ses ‘Mean Sq’ by the Residuals ‘Mean Sq’, which results in a very small p value, more precisely 1.41 × 10⁻⁷ = 0.000000141. Note that R typically uses scientific notation when presenting very small numbers. If you are convinced that ANOVA is the correct analysis to run on this data, the result tells us that SES plays a role in the MLU of a three-year-old child, and this role is (very) unlikely to be due to chance. Note, however, that whether this effect has practical importance is another question that requires more information than the p value, and also depends on the particular problem studied.
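
As a quick sanity check, you can recompute these quantities from the table yourself (a sketch; the small discrepancy in the F value is due to rounding of the displayed table values):

 
> 10.357 / 0.492                      # F = ses 'Mean Sq' / Residuals 'Mean Sq' 
[1] 21.05081 
> pf(21.07, 2, 57, lower.tail=FALSE)  # p value from the F(2, 57) distribution 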

Exercise 4.14. When you have only two groups, ANOVA is equivalent to the t test. Using the data frame talk, analyze the effect of gender on words using ANOVA. Compare your ANOVA results with the equivalent t-test results from Section 4.1.

Can you test ‘directional’ hypotheses with ANOVA as we did earlier with the t-test?

At this point, you may wonder why we do ANOVA at all, or why we do not perform multiple t tests instead. One answer to this question is that you may not have a specific hypothesis about differences between individual groups (e.g., that ‘low’ SES and ‘medium’ SES differ), but you may expect the categorical predictor to have an overall effect. This is probably reasonable for our example analysis of the effect of SES on MLU.

The second reason has to do with the logic of hypothesis testing in general. In an exploratory study looking for differences, every pairwise difference you test increases the probability of finding a difference by chance. As the number of groups increases, the number of possible (pairwise) comparisons increases, and so does your chance of getting a significant difference where there is none. That said, if you have a set of specific hypotheses, you should go ahead and test them. The concern is valid when you would take ‘any’ pairwise difference to be interesting.
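
To make this concrete: if each test has a false-positive probability of 0.05 and the tests are independent, the probability of at least one false positive among k tests is 1 − 0.95^k. With three groups there are already three pairwise comparisons:

 
> 1 - 0.95^3   # chance of at least one false positive in 3 comparisons 
[1] 0.142625 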

For an amusing story of multiple comparisons causing false positives, see http://www.wired.com/wiredscience/2009/09/fmrisalmon/, which reports on an fMRI study that finds “significant evidence” that a dead salmon shows emotional responses. For another fun but less ‘scientific’ demonstration, see http://xkcd.com/882/.

If you want to explore the data for potential pairwise differences and also want to perform valid hypothesis tests, then you should apply appropriate corrections (typically by lowering your significance level).

There are a number of different corrections that differ in their assumptions. Some of these corrections are implemented by the R function pairwise.t.test().
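
Its basic use takes the response vector and the grouping factor; the correction method is chosen with the p.adjust.method option (if none is given, R applies the Holm correction):

 
> pairwise.t.test(mlu.ses$mlu, mlu.ses$ses) 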

Exercise 4.15. Perform all pairwise SES comparisons of MLU values in the data set mlu.ses using pairwise.t.test(). Note that pairwise.t.test() uses by default a different correction method than the well-known Bonferroni correction. However, it can perform the Bonferroni correction if the appropriate options are given. Also perform the pairwise comparisons with the Bonferroni correction, and compare the results with the R default.

Which correction is more conservative?

4.4 Factorial ANOVA

Often there are multiple categorical predictors that may affect the response variable of interest. In such cases, it is more economical (in terms of effort spent on experimentation or data collection) to perform a combined analysis. At least as importantly, analyzing multiple effects together also allows investigating their interactions. The extension of ANOVA to multiple groupings of the response variable is called factorial, or n-way, ANOVA.

The way to run a factorial ANOVA is not much different from the single ANOVA example in Listing 7. The only difference is in the model specification, that is, the way we write the model formula. We will first start with an example where we ignore possible interactions. To specify a factorial ANOVA that investigates the independent effects of ses and gender on mlu, the formula to be used is ‘mlu ~ ses + gender’. Note that you should not read the sign ‘+’ here as an arithmetic operation.

Exercise 4.16. Perform a factorial ANOVA that investigates the effects of ses and gender on mlu. Compare your results with the single ANOVA results in Listing 7.

If we also want to include the interactions, our formula becomes ‘mlu ~ ses * gender’ (with a ‘*’ instead of the ‘+’; again, you should refrain from reading this as arithmetic multiplication).

Exercise 4.17.  Perform a factorial ANOVA to investigate the effects of ses and gender and their interactions on mlu.

The conventional visualization of the interaction between two categorical variables can be displayed using the function interaction.plot(). An interaction plot puts one of the factors (categorical variables) on the x-axis and the (numeric) response variable on the y-axis. It plots a dot for the mean of the response variable for each combination of the two factor variables, and draws a line between all dots that have the same level of the second factor. As a result, you need to give interaction.plot() two factor variables and the response variable as arguments. It will calculate the mean of each ‘cell’ from the data and plot it according to the description above.
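
For example, the following sketch puts ses on the x-axis and draws one line per gender (swapping the first two arguments gives the complementary view):

 
> interaction.plot(mlu.ses$ses, mlu.ses$gender, mlu.ses$mlu) 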

Exercise 4.18. Plot the interactions of ses and gender, and interpret the resulting graph with reference to the findings in Exercise 4.17.

4.5 If ANOVA assumptions are not met

If the normality and homogeneity of variances assumptions of ANOVA are violated by the data at hand, you typically have two sensible ways to go:

  1. Transform the data so that the assumptions are (approximately) met.
  2. Use a non-parametric alternative, such as the Kruskal-Wallis rank sum test.

Finding the correct transformation is not always easy. We will work with some sensible transformations later. For complicated methods, including most configurations of factorial ANOVA, there are no straightforward non-parametric alternatives. As a result, trying to tame the data (e.g., by transformations) might be the only option at hand.

The function to perform the Kruskal-Wallis rank sum test in R is called kruskal.test(). The Kruskal-Wallis test does not assume normality; however, you should be aware that it requires the samples to come from identically shaped distributions, which is not the case if the variances are not similar.

Exercise 4.19. Perform the alternative non-parametric test for the ANOVA presented in Listing 7. Is the test more conservative (less sensitive) than the ANOVA?

4.6 T-test as a linear model

It has been hinted along the way that there is a strong connection between the t test (and also ANOVA) and linear regression. In this subsection, we will demonstrate this connection for the t test, and extend it to ANOVA(s) later.

Exercise 4.20. Perform (yet another) t test of the difference in the number of words spoken per day between male and female speakers. Force t.test() to assume equal variances, and do not use a directional alternative hypothesis.

Exercise 4.21. Fit a least squares regression model using lm(), where the response is words and the predictor is gender. Produce a summary of the model with the function summary() (not with summary.aov()).

Which numbers match the t-test results from Exercise 4.20? Can you explain what is happening?

The trick is in the way categorical variables are used in regression. Remember the usual equation for a straight line:

y = a + bx

Now, consider that x is a categorical variable, e.g., gender, and that we set it to 0 for women and 1 for men. This means that the expected value of y is equal to a for women (substitute 0 for x in the formula above and calculate), and the expected value of y is a + b for men. In other words, the difference between the expected values of y for men and women is b. Hence, the t test performed on the slope amounts to testing whether the mean difference between women and men is 0.
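
You can verify this correspondence on the talk data with a sketch like the following (the exact numbers depend on your data):

 
> m <- lm(words ~ gender, data=talk) 
> coef(m)[1]   # the intercept a: the mean of the F (reference) group 
> mean(talk$words[talk$gender == "F"]) 
> coef(m)[2]   # the slope b: the difference between the group means 
> mean(talk$words[talk$gender == "M"]) - 
    mean(talk$words[talk$gender == "F"]) 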

The above description is only a special case of what is called indicator, or dummy, coding. This coding turns analyses like the t test or ANOVA into linear regression. The concept is also directly related to the so-called ‘contrasts’ used in ANOVA. The particular way the variable x above is coded is called a treatment contrast, which is the default in R.

We will return to this topic after we work on multiple regression.