Except for repeated-measures ANOVA, all methods we have studied so far assume independent observations. Often, this is not correct for the data at hand. And sometimes, we do break the independence assumption in data collection, because some questions are answered more directly or naturally by data with some systematic dependence between the observations. In this section, we will study so-called mixed-effect or multilevel models. The first name implies that some of the effects, the coefficients, which we always assumed to be fixed quantities so far, can be random, distributed according to a probability distribution. The second name emphasizes that the (some of the) coefficients are also ‘modeled’. That is, coefficients vary according to a model at a different ‘level’. The term ‘level’ here does not necessarily include a strict hierarchy. The use of the first name mixed-effect models is more common in (psycho)linguistics literature. However, there is also a large number of valuable resources that refer to this type of models as multilevel models. Both, at the end, refer to the same sort of modeling practice we will work on in this section.
We will use three different data sets in this section. Besides two familiar data sets, bilingual and newborn from Section 5, we will also introduce a new data set.
In our somewhat hypothetical investigation here, would like to know whether parenthetical expressions, expressions like this one, behave similar to intonational phrases (IP), or phonological phrases (PP). The proper definitions of these phrase types and the theoretical motivation behind this investigation are not important for our purposes. We will use the rate at which these different classes of phrases uttered for the comparison here. The data is available as an R data file containing multiple data frames at http://coltekin.net/cagri/R/data/par.rda. The problem is inspired by Güneş [2014], but heavily simplified, and the data is modified for these exercises.
You probably realized that the data set loaded in Exercise 12.1 consists of three data frames. The data frame par.data contains the individual observations. par.item contains information regarding ‘items’, the different phrases used in the study, and par.subj contains information about the participants. This sort of organization is more suitable for storing and organizing your data. It avoids replication of the same information. For example, if we have stored everything in a large data frame (or spreadsheet), we would need to replicate age of a participant for each observation from the same participant. The problem is not the storage (it is plenty and cheap nowadays). However, a data organization with replicated values calls for inconsistency, and it should be avoided while storing and updating the data.
Merge all three data frames into a single data frame with name par. Remember from Section 6 that par is also a built in function name. This is fine. Different types of objects (a function and a data frame in our case) can have the same name in R. We do this here for demonstration, but you should avoid using the names of built-in R objects to prevent potential confusion.
We start with a(nother) reminder of the simple regression expressed by the following formula:
where is the response variable, is the predictors, indexes each observation or data point, and are the coefficients of the model, and is the variation that cannot be explained by the model. In general linear models, we assume this error to be normally distributed with zero mean. Generalized linear models (GLMs) allow us to work with different error distributions. For example, in logistic regression, is distributed binomially. However, a crucial assumption of all these models is that is the only source of random variation. The rest of the model is deterministic. Hence the coefficients and are unknown but fixed quantities.
The mixed-effect models we will study here allow other (more systematic) random sources of variation. This additional source of variation typically occurs due to multiple correlated measurements. For example, in the newborn data set we worked with in Section 5, we had multiple measurements of sucking rates from the same baby. Besides the random variation that we cannot account for, this includes a systematic variation due to individuals: some babies will be faster, and some will be slower. In repeated-measures designs, we make use of this variation to estimate the effects of interest more precisely. The mixed-effect models we will study here allow more flexible modeling of additional sources of variation. For example, extending the linear equation above to include variation due to each subject would result in the following set of equations:
Although this can be written in other forms, the notation necessarily gets complicated with mixed-effect models. Here we introduce a random variation due to each subject that only affects the intercept. In other words, now we have as many intercept terms as the subjects. Intercept vary between the subjects. The term means the intercept that corresponding to the subject that the observation came from. Crucially, the are estimated from all the data, not only from the data that belongs to subject . As a result, the estimation of will be pulled towards the . This means for subjects whose intercept is not estimated precisely from their own data (because of small number of observations or high variation), the population estimates will play a role.
The equation above defines a rather simple form of mixed-effect model. In this model we only vary the intercept for each subject. One can also define varying slope(s), in which case our assumption would be that the subjects not only have a different base rate (intercept) but they also have different slopes. In this case, we assume, for example, that the subjects react to experimental manipulation differently. Furthermore, we are not limited to a single source of variation. The mixed-effect models can incorporate multiple sources of variation. For example, subjects and the items (the experimental material, e.g., words, sentences, stories) can be defined as two different sources of variation each having systematic effect on the intercept, slope(s) or both. The mixed-effect models are a generalization of generalized linear models, which means you can also fit mixed-effect logistic regression, or any other GLM.
In this section, we will use the lmer() function from the lme4 package for fitting mixed-effect models. To match the lmer() output better, we rewrite the equation for the mixed-effect model above as
The model is the same, but lmer() syntax follows a more flat representation as in here. Writing it this way allows us to understand lmer() output better. Now we have a common intercept for all subjects, but two error terms: one is due to subjects () and the other () is the unexplained ‘residual’ variation. Both errors are normally distributed with zero mean for the models we will discuss here, but as noted earlier for mixed-effect GLMs that can have non-normal error terms.
This much of explanation is already too much for a ‘hands-on’ tutorial. Hopefully, the above gave you a general sense of the theory behind the exercises we will do in this section. You will learn the details while working on the exercises below. At the end of the section we list a few books that cover multilevel or mixed-effect models in a comprehensive but also accessible way.
As usual, we will start with a simple case. In Section 5, we worked with a hypothetical experiment where we wanted to see if newborn babies react to their mother’s native language differently than a foreign language. The response variable was the sucking rate, and the predictor was the input language. Earlier, we used paired t-test and repeated measures ANOVA for analyzing the same data set. It will also be our first example in mixed-effect models.
Note that the default plot method for numeric vs. factor is a box plot. For a scatter plot, you need to convert the factor language to integer values between and (to match the indicator coding). TIP: adding small amount of noise to the data points in the x-axis will make the individual data points more visible. For this, you can use the function jitter(), or alternatively, you can add your own noise by random numbers sampled from a uniform distribution.
We know that this analysis is not correct (the observations are not independent). This would only be correct if there would be no differences between the subjects. In other words, our analysis would be correct if all the variation we observe was due to random (or randomized) factors outside the experimental design.
We already hinted that there are multiple mixed-effect models that can be specified for the same the problem. The simplest form of the mixed-effects specification includes a single, ‘intercept-only’ random effect. In this type of model, we assume that the individuals measured have different ‘base rates’, but all react to the different conditions (e.g., experimental manipulation) the same way, except for the completely random variation. Hence, the intercept varies per subject, but slope is the same.
To specify a random intercept term in lmer() we use a notation like (1|subject). Here we assume that the source of additional variation (the random effect) is the variable subject. This term needs to be added to the right side of our model formula. Listing 12 shows the way to fit a random-intercept-only model to the newborn data.
The first line in Listing 12 loads the relevant functions from the lme4 library, and it is needed only once in an R session. The lmer() call is similar to lm(). Only big difference is the specification of the random intercept term.
Summary of the model fit first includes information about the ‘random effects’. Here, the line labeled participant refers to the variation due to participants. It corresponds to the error term in the formula above. The column Name indicates that this random effect only affects the intercept. The line labeled residuals is the general variation that cannot be accounted for by the model. The lmer() summary presents both the estimated variance and the standard deviation (square root of the variance), since standard deviation is easier to interpret. In essence the summary tels us that the estimated intercept varies according to a normal distribution with zero mean and a standard deviation of . The residual variation is a lot smaller. This is a good case for using mixed-effect modeling (or repeated measures). One last thing note here is that in comparison to the ordinary regression fit from Exercise 12.3, residual variation is reduced a lot.
Returning to the fixed effects report in Listing 12, we see that the intercept is estimated as and the slope of the language effect is estimated as . These correspond to mean sucking rate while listening to stories in the foreign language, and the rate difference between foreign and native languages, respectively. These estimates are almost the same as the estimates from the ordinary regression fit from Exercise 12.3. This is a coincidence, due to simple (and fabricated) the data set. The important difference lies in the standard error of the slope estimate. Compared to the ordinary regression estimate, we now have a much smaller standard error. Using the rough approximation SE, the % confidence interval for the slope estimate is . The difference from zero will certainly be statistically significant. However, lmer() summary does not include a p-value. We also do not see the or an F-test for the overall model fit. As in generalized linear models, due to the estimation method(s) used to fit mixed-effect models, we do not have straightforward ways to calculate these. For now, we will use approximate SE confidence intervals to asses the reliability of the coefficient estimates, and we will return to these problems later.
We said earlier that for the random-intercept model in Listing 12, each subject is assigned to a different intercept. However, the summary of the lmer() fit gives us only one intercept estimate in the fixed effects, which corresponds to the mean of the estimated intercepts. We only get the standard deviation of the estimated subject intercepts in the summary() output. In most cases, this is what we are interested in, we are interested in overall trends in the data, not the individual behavior. In some cases, on the other hand, investigating random effect estimates may be crucial, and without any doubt, it is instrumental for learning what the mixed-effect models do.
To obtain the estimated random coefficients we use the function ranef(). The intercept value corresponding to a particular participant is the sum of the estimated random intercept and the mean intercept value reported in the fixed-effects section of the summary. To obtain only the fixed effects, you can use the function fixef(). The coef() function we used earlier with lm() returns both fixed and random effects.
TIP: if you specify condVar=T to ranef(), it will return (co)variance information together with the random effects which will be picked up by dotplot() and it will plot 95% confidence intervals along with the estimated random effects.
And a note: the lattice library includes many useful, easy-to-use plotting commands multivariate data that you may want to explore further [see Sarkar, 2008, for more information].
Use different colors, and re-plot (without jitter) the data points belonging to these two participants as filled circles with the color of the corresponding regression line.
Make sure that the x-axis does not contain numeric labels, but two labels foreign and native at the corresponding values. TIP: for this you can first prevent drawing the x-axis with xaxt=’n’ during plot(), and then, you can use the function axis() to create a custom one.
The least squares regression we fitted in Exercise 12.3 ignores the subject variation completely. We will now study another extreme example.
Compare your results to earlier models nb.lmer1 and nb.lm.
Although it does not make much sense in our example, the model of the sorts we fit in Exercise 12.9 is not uncommon in practice. The variables that are entered into analyses to reduce the unknown variation are sometimes called blocking variables.
The point of the exercises 12.8 and 12.9 for us is to demonstrate that the estimates of a mixed-effect model will fall in between a model that ignores the subject variation completely, and the one that includes individual subjects as predictors. Including subject and language interaction in the model would be another instructive exercise. However, since our newborn data does not contain replication, such a model will be a saturated model, and it will not be useful in practice (you are encouraged to try it, though).
So far, we worked only with random intercepts. If there is interaction between the random and fixed effects, e.g., if we expect the subjects to react to the experimental manipulation differently, the way to express this is to allow slopes to vary. If we model random slopes, we will almost certainly model random intercepts as well. Varying-slopes-only models are conceivable but uncommon. When we talk about a model with ‘random slopes’ in this tutorial, it means we also include random intercepts.
Specification of random slopes for lmer() is easy. We extend the random effect expression like (1 + language | participant). Note that as in other parts of the formula notation, the intercept is implicit in random effects term as well. The expression (language | participant) have the same meaning. If you happen to fit a model that only includes random slopes, you need to specify it as either (language - 1 | participant), or (language + 0| participant).
We will not be able to fit a random-slopes model with our newborn data. If we include both intercepts and slopes, the number of random coefficients will be equal to number of observations. Hence, the model cannot be fitted. We fit our first random-slopes model using the bilingual data. Remember that in this data set we are interested in differences in linguistic skills of bilingual children between the language spoken only at home, and the language which is also spoken in the language community, particularly at school. Listing 13 presents lmer() summary for predicting the language skills only from the type of language (home.only or school), using random slopes and intercepts.
The model specification does not need much explanation: we want to predict mlu based on language, but we also specify subj as a source of variation that affects both the slope and the intercept. The resulting model can be described mathematically as follows:
Before going into the interesting part, we quickly observe that the fixed effects indicate an MLU of for the home.only language for an average kid, and on average they show an MLU of for the school language. These correspond to the and in our formula above. Our rule of thumb of SE confidence interval indicates that the MLU difference between languages are statistically significant.
More interesting differences are in the random effects, since we now have both random intercepts and slopes. The individual values of and are not visible in the summary, but the random effects part indicate that the standard deviations of estimated random effects are and respectively. The residual standard deviation is in this model fit. lmer() also reports the correlation between the two random effects. The correlation between the random intercepts and slopes are reported to be . In general, there is nothing wrong with high correlation between random intercepts or slopes. In our case, that would mean that the kids who are more proficient in their home language also show the higher difference between home and school languages (it may be difficult to reason about, but it can happen). However, a perfect or near perfect correlation (close to or ) indicates problems with fitting the model. One option is to specify a model without random effect correlations, which we will revisit later. Another option, especially after observing that the variance of random slope is close to zero is to check whether it is worth opting for this more complex model instead of random-slopes-only model.
So far, we have been fitting mixed-effect models with so-called residual or restricted maximum likelihood (REML) which is lmer() default. This method has some advantages, particularly for estimating variance components, over the maximum likelihood estimation (MLE) while fitting mixed-effect models. However, MLE fit provides some additional information such as AIC that we have used for model selection earlier. To fit a mixed-effect model with MLE, you need to specify the option REML=FALSE. In most cases, the differences between MLE and REML estimates should be small.
Another way to compare two models is to test them using anova(). In case of models fit by MLE, anova() will do a test for the differences between log likelihoods of two models. The test result will be statistically significant if the reduction in the log like likelihood is supported by the data.
With mixed-effect models, it is no more straightforward to calculate the p-values for the model coefficients. The difficulty has to do with calculating degrees of freedom for the t-values calculated from the coefficient standard errors. In most cases, especially if you have enough data, SE will give a crude but useful approximation to 95% confidence intervals for a coefficient estimate. Other more precise alternatives include,
We will work only with the first two.
Reminder: To fit a model without any predictors (i.e., only with an intercept), we use the syntax response ~ 1. Of course, you will need to specify the random-effects term for your new model to be comparable to the model nb.lmer1 fitted earlier.
TIP: to make sure the likelihood-ratio test performed makes sense, you should normally (re)fit the models with the MLE. However, anova() will do it for you if your models were fit with REML.
You can obtain ‘profile’ of a model with profile() command. The model profile can be inspected for inspecting the coefficient estimates. Here, we will only use the profiles to get the confidence intervals. To get the confidence intervals from a profile, all you need to do is to run the function confint() on the profile object. Although you can perform hypothesis tests at any significance level, it makes more sense to report the confidence intervals directly. This is particularly true for random effects whose confidence intervals tends to be asymmetric.
Does the random effect due to the participants statistically significant at the same levels?
TIP: 95% confidence intervals are the default for confint(). To obtain 99% confidence intervals, you need to specify level=0.99.
As in factorial repeated measures ANOVA, we can have multiple fixed effects in an mixed-effect model. The specification of the fixed effects are straightforward in the model formula.
Fit a mixed effect model with random intercepts and slopes. (The repeated measure ANOVA corresponds to a mixed-effect model with both random intercepts and slopes.)
In our bilingual data, the bilingual kids were measured on both languages, and different ages. In repeated measures ANOVA terms, both language and age are within-subjects variables. As in ANOVA, we can also analyze ‘mixed’ data between- and within- subjects predictors with mixed-effect models. In multilevel modeling literature, a between-subjects predictor, such as gender in our bilingual data set, is associated with a different level in the hierarchy. In the formula notation used with lmer() you do not need to distinguish the between- or within-subject variables. This will be apparent in the data organization. Of course, you do not specify random slopes for such predictor if you are fitting a model with random slopes.
In the data sets we have been working with in this section, we have only one sensible random effect (due to the participants). Mixed-effect models has been advocated in (psycho)linguistic research, particularly for their capability of handling crossed, non-hierarchical random effects. The problem, named ‘language-as-a-fixed-effect fallacy’ after Clark [1973], arises because in a typical ‘repeated measures’ psycholinguistics study one does not only repeat the measurements over the participants, but also over the items, e.g., phrases, words, sentences and maybe pictures, that are used as experimental material. A typical repeated measures ANOVA generalizes to the population the participants are sampled from, but not to the language where the items are sampled from. In such an experiment, the subjects and the items are crossed. That is, all subject react to all items. Although there are some approximate solutions, including one suggested by Clark [1973], classical repeated measures methods cannot deal with this sort of crossed random effects in a straightforward way [see Raaijmakers et al., 1999, Baayen, 2008, for further discussion].
There is nothing special in specification of multiple random effects for lmer(). We simply add the additional random effects as before to the formula notation. For example specification of random intercepts both due to subject and item would be specified like (1|subject)+(1|item).
Compare the coefficient estimates with par.m4. Which fixed effect estimates (and their inferences) change? Why?
Exercise 12.20 points to a common problem with the estimation of mixed-effect models with numeric predictors. In Exercise 12.20, besides the fact that we have an intercept term with no straightforward interpretation, you should have noticed that the now the estimate is less certain, and there is a large correlation between the intercept and the slope of the lenght. These problems are due to the fact that the intercept is now far away from the observations, and can be remedied by centering or scaling the numeric predictor.
How do you interpret the intercept term now? Do you observe improvements in the estimation of the intercept term?
In the models we fitted in the last two exercises, we observe another interesting aspect of the mixed-effect models. In a normal linear model, the phrase id item and the phrase length would be collinear. As a result, the coefficient of one of these predictors would not be estimable. In mixed-effect models this is fine, since the predictor length is only used for estimating the intercept due to items. In other words, the length is a predictor in a different ‘level’. You should also observe this in the reduction of the estimated variance of the random intercept due to item (but almost no other changes) with the addition of length as a predictor.
What is the new interpretation of the intercept?
Do you observe the reduction in the subject variation in the graphs?
TIP: The argument which="subject" to ranef() causes to extract only the random effects due to subject from the model.
TIP2: dotplot() resets the values set by par(). So, plotting these side-by-side is a bit tricky (you will not be tortured with this).
Although we try to include some introduction to mixed-effect models in this section, our aim in this tutorial, as usual, is not to be comprehensive. Here we mention a few books that you may want to consult for learning more about multilevel models, or possibly, study along with the exercises here, especially if you are using this tutorial for self-study. A rather accessible and comprehensive textbook on mixed-effect or multilevel modeling is Gelman and Hill [2007]. Another ‘must-read’ book, if you will be using mixed-effect models in practice is Bates [2010]. This book is written by the author of the lme4 package, includes many practical examples with clear explanations, and also available online (see the references at the end for the URL). The focus of Baayen [2008, ch. 7] is also mixed-effect models, discussed through examples from linguistic data.