Except for repeated-measures ANOVA, all methods we have studied so far assume independent observations. Often, this is not correct for the data at hand. And sometimes we deliberately break the independence assumption in data collection, because some questions are answered more directly or naturally by data with some systematic dependence between the observations. In this section, we will study so-called mixed-effect or multilevel models. The first name implies that some of the effects, the coefficients, which we have always assumed to be fixed quantities so far, can be random, distributed according to a probability distribution. The second name emphasizes that (some of) the coefficients are themselves ‘modeled’. That is, coefficients vary according to a model at a different ‘level’. The term ‘level’ here does not necessarily imply a strict hierarchy. The first name, mixed-effect models, is more common in the (psycho)linguistics literature. However, there is also a large number of valuable resources that refer to this type of model as multilevel models. In the end, both refer to the same sort of modeling practice we will work on in this section.
We will use three different data sets in this section. Besides two familiar data sets, bilingual and newborn from Section 5, we will also introduce a new data set.
In our somewhat hypothetical investigation here, we would like to know whether parenthetical expressions, expressions like this one, behave similarly to intonational phrases (IP) or phonological phrases (PP). The proper definitions of these phrase types and the theoretical motivation behind this investigation are not important for our purposes. We will use the rate at which these different classes of phrases are uttered for the comparison here. The data is available as an R data file containing multiple data frames at http://coltekin.net/cagri/R/data/par.rda. The problem is inspired by Güneş [2014], but heavily simplified, and the data is modified for these exercises.
You probably realized that the data set loaded in Exercise 12.1 consists of three data frames. The data frame par.data contains the individual observations. par.item contains information regarding ‘items’, the different phrases used in the study, and par.subj contains information about the participants. This sort of organization is more suitable for storing and organizing your data. It avoids replication of the same information. For example, if we stored everything in one large data frame (or spreadsheet), we would need to replicate the age of a participant for each observation from that participant. The problem is not storage (it is plentiful and cheap nowadays). Rather, a data organization with replicated values invites inconsistency, and it should be avoided while storing and updating the data.
Merge all three data frames into a single data frame with the name par. Remember from Section 6 that par is also a built-in function name. This is fine. Different types of objects (a function and a data frame in our case) can have the same name in R. We do this here for demonstration, but you should normally avoid using the names of built-in R objects to prevent potential confusion. ⊳
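As a sketch, the merging can be done with the base R merge() function. The key column names used below (subj linking par.data to par.subj, and item linking it to par.item) are assumptions and should be checked against the actual data frames:

```r
# merge the observations with subject information, then with item
# information (assumes par.data shares a 'subj' column with par.subj
# and an 'item' column with par.item; check names() of each first)
par <- merge(par.data, par.subj, by = "subj")
par <- merge(par, par.item, by = "item")
str(par)   # verify the result still has one row per observation
```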
We start with a(nother) reminder of the simple regression expressed by the following formula:

$$ y_i = a + b x_i + e_i $$
where $y$ is the response variable, $x$ is the predictor, $i$ indexes each observation or data point, $a$ and $b$ are the coefficients of the model, and $e$ is the variation that cannot be explained by the model. In general linear models, we assume this error to be normally distributed with zero mean. Generalized linear models (GLMs) allow us to work with different error distributions. For example, in logistic regression the response follows a binomial distribution. However, a crucial assumption of all these models is that $e$ is the only source of random variation. The rest of the model, $a + bx$, is deterministic. Hence the coefficients $a$ and $b$ are unknown but fixed quantities.
The mixed-effect models we will study here allow other (more systematic) random sources of variation. This additional source of variation typically occurs due to multiple correlated measurements. For example, in the newborn data set we worked with in Section 5, we had multiple measurements of sucking rates from the same baby. Besides the random variation that we cannot account for, this data includes systematic variation due to individuals: some babies will be faster, and some will be slower. In repeated-measures designs, we make use of this variation to estimate the effects of interest more precisely. The mixed-effect models we will study here allow more flexible modeling of additional sources of variation. For example, extending the linear equation above to include variation due to each subject $j$ would result in the following set of equations:

$$ y_i = a_{j[i]} + b x_i + e_i, \qquad a_j \sim N(\mu_a, \sigma_a^2) $$
Although this can be written in other forms, the notation necessarily gets complicated with mixed-effect models. Here we introduce a random variation due to each subject $j$ that only affects the intercept. In other words, we now have as many intercept terms as subjects: the intercept varies between subjects. The term $a_{j[i]}$ means the intercept corresponding to the subject $j$ that observation $i$ came from. Crucially, the $a_j$ are estimated from all the data, not only from the data that belongs to subject $j$. As a result, the estimate of $a_j$ will be pulled towards $\mu_a$. This means that for subjects whose intercept is not estimated precisely from their own data (because of a small number of observations or high variation), the population estimates will play a role.
The equation above defines a rather simple form of mixed-effect model. In this model we only vary the intercept for each subject. One can also define varying slope(s), in which case our assumption would be that the subjects not only have a different base rate (intercept) but also have different slopes. In that case, we assume, for example, that the subjects react to the experimental manipulation differently. Furthermore, we are not limited to a single source of variation. Mixed-effect models can incorporate multiple sources of variation. For example, subjects and items (the experimental material, e.g., words, sentences, stories) can be defined as two different sources of variation, each having a systematic effect on the intercept, slope(s), or both. Mixed-effect models are a generalization of generalized linear models, which means you can also fit mixed-effect logistic regression, or any other GLM.
In this section, we will use the lmer() function from the lme4 package for fitting mixed-effect models. To match the lmer() output better, we rewrite the equation for the mixed-effect model above as

$$ y_i = a + b x_i + \epsilon_{j[i]} + e_i, \qquad \epsilon_j \sim N(0, \sigma_\epsilon^2) $$
The model is the same, but the lmer() syntax follows a flatter representation like this one. Writing it this way allows us to understand lmer() output better. Now we have a common intercept for all subjects, but two error terms: one due to subjects ($\epsilon$) and the other ($e$) being the unexplained ‘residual’ variation. Both errors are normally distributed with zero mean for the models we will discuss here, but as noted earlier, mixed-effect GLMs can have non-normal error distributions.
This much explanation is already a lot for a ‘hands-on’ tutorial. Hopefully, the above gave you a general sense of the theory behind the exercises we will do in this section. You will learn the details while working on the exercises below. At the end of the section we list a few books that cover multilevel or mixed-effect models in a comprehensive but accessible way.
As usual, we will start with a simple case. In Section 5, we worked with a hypothetical experiment where we wanted to see whether newborn babies react to their mother’s native language differently than to a foreign language. The response variable was the sucking rate, and the predictor was the input language. Earlier, we used a paired t-test and repeated-measures ANOVA for analyzing the same data set. It will also be our first example of mixed-effect models.
Note that the default plot method for numeric vs. factor is a box plot. For a scatter plot, you need to convert the factor language to integer values between $0$ and $1$ (to match the indicator coding). TIP: adding a small amount of noise to the data points on the x-axis will make the individual data points more visible. For this, you can use the function jitter(), or alternatively, you can add your own noise with random numbers sampled from a uniform distribution. ⊳
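A sketch of the tip above (assuming the newborn data frame has a factor language and a numeric rate, as in Section 5; the order of the factor levels should be checked):

```r
# convert the factor to 0/1 indicator coding and add horizontal jitter
x <- as.integer(newborn$language) - 1
plot(jitter(x, amount = 0.05), newborn$rate,
     xlab = "language", ylab = "sucking rate")
```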
We know that this analysis is not correct (the observations are not independent). It would only be correct if there were no differences between the subjects. In other words, our analysis would be correct if all the variation we observe were due to random (or randomized) factors outside the experimental design.
We already hinted that there are multiple mixed-effect models that can be specified for the same problem. The simplest form of mixed-effect specification includes a single, ‘intercept-only’ random effect. In this type of model, we assume that the individuals measured have different ‘base rates’, but all react to the different conditions (e.g., experimental manipulation) in the same way, except for completely random variation. Hence, the intercept varies per subject, but the slope is the same.
To specify a random intercept term in lmer() we use a notation like (1|subject). Here we assume that the source of additional variation (the random eﬀect) is the variable subject. This term needs to be added to the right side of our model formula. Listing 12 shows the way to ﬁt a random-intercept-only model to the newborn data.
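If Listing 12 is not in front of you, a random-intercept-only fit along these lines might look as follows (the column names rate, language, and participant are assumed from the surrounding discussion):

```r
library(lme4)

# fixed effect of language, random intercept per participant
nb.lmer1 <- lmer(rate ~ language + (1 | participant), data = newborn)
summary(nb.lmer1)
```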
The first line in Listing 12 loads the relevant functions from the lme4 library, and it is needed only once per R session. The lmer() call is similar to lm(). The only big difference is the specification of the random intercept term.
The summary of the model fit first includes information about the ‘random effects’. Here, the line labeled participant refers to the variation due to participants. It corresponds to the error term $\epsilon$ in the formula above. The column Name indicates that this random effect only affects the intercept. The line labeled residuals is the general variation that cannot be accounted for by the model. The lmer() summary presents both the estimated variance and the standard deviation (square root of the variance), since the standard deviation is easier to interpret. In essence, the summary tells us that the estimated intercept varies according to a normal distribution with zero mean and a standard deviation of $9.72$. The residual variation is a lot smaller. This is a good case for using mixed-effect modeling (or repeated measures). One last thing to note here is that, in comparison to the ordinary regression fit from Exercise 12.3, the residual variation is reduced considerably.
Returning to the fixed effects report in Listing 12, we see that the intercept is estimated as $31.84$ and the slope of the language effect is estimated as $4.52$. These correspond to the mean sucking rate while listening to stories in the foreign language, and the rate difference between foreign and native languages, respectively. These estimates are almost the same as the estimates from the ordinary regression fit from Exercise 12.3. This is a coincidence, due to the simple (and fabricated) data set. The important difference lies in the standard error of the slope estimate. Compared to the ordinary regression estimate, we now have a much smaller standard error. Using the rough approximation $\pm 2\times$SE, the $95$% confidence interval for the slope estimate is $\left[2.8211, 6.2263\right]$. The difference from zero will certainly be statistically significant. However, the lmer() summary does not include a p-value. We also do not see the $r^2$ or an F-test for the overall model fit. As in generalized linear models, due to the estimation method(s) used to fit mixed-effect models, we do not have straightforward ways to calculate these. For now, we will use approximate $\pm 2\times$SE confidence intervals to assess the reliability of the coefficient estimates, and we will return to these problems later.
We said earlier that in the random-intercept model in Listing 12, each subject is assigned a different intercept. However, the summary of the lmer() fit gives us only one intercept estimate in the fixed effects, which corresponds to the mean of the estimated intercepts. We only get the standard deviation of the estimated subject intercepts in the summary() output. In most cases this is what we want: we are interested in overall trends in the data, not in individual behavior. In some cases, on the other hand, investigating the random-effect estimates may be crucial, and without any doubt, it is instrumental for learning what mixed-effect models do.
To obtain the estimated random coefficients we use the function ranef(). The intercept value corresponding to a particular participant is the sum of the estimated random intercept and the mean intercept value reported in the fixed-effects section of the summary. To obtain only the fixed effects, you can use the function fixef(). The coef() function we used earlier with lm() returns the per-subject coefficients, combining the fixed and random effects.
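For example, with the random-intercept model fitted earlier (stored as nb.lmer1):

```r
fixef(nb.lmer1)   # fixed effects: the common intercept and slope
ranef(nb.lmer1)   # per-participant deviations from the common intercept
coef(nb.lmer1)    # per-participant coefficients: fixef() + ranef()
```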
TIP: if you specify condVar=TRUE to ranef(), it will return the (co)variance information together with the random effects; this will be picked up by dotplot(), which will then plot 95% confidence intervals along with the estimated random effects.
And a note: the lattice library includes many useful, easy-to-use plotting commands for multivariate data that you may want to explore further [see Sarkar, 2008, for more information]. ⊳
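Combining the two tips above, a sketch (assuming the fitted model object nb.lmer1):

```r
library(lattice)   # provides dotplot()

# condVar = TRUE attaches conditional (co)variances, which dotplot()
# uses to draw 95% intervals around each random-effect estimate
dotplot(ranef(nb.lmer1, condVar = TRUE))
```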
Are these estimates different from the observed values? ⊳
Use different colors, and re-plot (without jitter) the data points belonging to these two participants as filled circles with the color of the corresponding regression line.
Make sure that the x-axis does not contain numeric labels, but the two labels foreign and native at the corresponding values. TIP: for this you can first prevent drawing the x-axis with xaxt='n' during plot(), and then use the function axis() to create a custom one. ⊳
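A sketch of the custom axis (again assuming indicator-coded language on the x-axis, with column names as before):

```r
x <- as.integer(newborn$language) - 1          # 0/1 indicator coding
plot(x, newborn$rate, xaxt = "n",              # suppress the default x-axis
     xlab = "language", ylab = "sucking rate")
axis(1, at = c(0, 1), labels = c("foreign", "native"))
```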
The least squares regression we fitted in Exercise 12.3 ignores the subject variation completely. We will now study another extreme example.
Compare your results to earlier models nb.lmer1 and nb.lm.
Although it does not make much sense in our example, models of the sort we fit in Exercise 12.9 are not uncommon in practice. The variables that are entered into analyses to reduce the unexplained variation are sometimes called blocking variables.
The point of Exercises 12.8 and 12.9 for us is to demonstrate that the estimates of a mixed-effect model fall in between those of a model that ignores the subject variation completely and those of one that includes individual subjects as predictors. Including a subject-by-language interaction in the model would be another instructive exercise. However, since our newborn data does not contain replication, such a model would be saturated, and it would not be useful in practice (you are encouraged to try it, though).
So far, we have worked only with random intercepts. If there is an interaction between the random and fixed effects, e.g., if we expect the subjects to react to the experimental manipulation differently, the way to express this is to allow slopes to vary. If we model random slopes, we will almost certainly model random intercepts as well. Varying-slopes-only models are conceivable but uncommon. When we talk about a model with ‘random slopes’ in this tutorial, it means we also include random intercepts.
Specification of random slopes for lmer() is easy. We extend the random-effect expression to (1 + language | participant). Note that, as in other parts of the formula notation, the intercept is implicit in the random-effects term as well. The expression (language | participant) has the same meaning. If you happen to fit a model that only includes random slopes, you need to specify it as either (language - 1 | participant) or (language + 0 | participant).
We will not be able to fit a random-slopes model with our newborn data. If we include both intercepts and slopes, the number of random coefficients will be equal to the number of observations. Hence, the model cannot be fitted. We fit our first random-slopes model using the bilingual data. Remember that in this data set we are interested in differences in the linguistic skills of bilingual children between the language spoken only at home and the language which is also spoken in the larger community, particularly at school. Listing 13 presents the lmer() summary for predicting language skills only from the type of language (home.only or school), using random slopes and intercepts.
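A sketch of a fit along the lines of Listing 13, with the variable names (mlu, language, subj) taken from the discussion below; the object name bl.lmer1 is our own choice:

```r
library(lme4)

# random intercept and random slope for language, per subject
bl.lmer1 <- lmer(mlu ~ language + (1 + language | subj), data = bilingual)
summary(bl.lmer1)
```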
The model specification does not need much explanation: we want to predict mlu based on language, but we also specify subj as a source of variation that affects both the slope and the intercept. The resulting model can be described mathematically as follows:

$$ y_i = \mu_a + \mu_b x_i + \epsilon_{j[i]} + \delta_{j[i]} x_i + e_i, \qquad \epsilon_j \sim N(0, \sigma_\epsilon^2), \quad \delta_j \sim N(0, \sigma_\delta^2) $$
Before going into the interesting part, we quickly observe that the fixed effects indicate an MLU of $4.3123$ for the home.only language for an average kid; on average, the kids show an MLU of $4.3123 + 0.4898 = 4.8021$ for the school language. These estimates correspond to $\mu_a$ and $\mu_b$ in our formula above. Our rule-of-thumb $\pm 2\times$SE confidence interval indicates that the MLU difference between the languages is statistically significant.
More interesting differences are in the random effects, since we now have both random intercepts and slopes. The individual values of $\epsilon_{j[i]}$ and $\delta_{j[i]}$ are not visible in the summary, but the random-effects part indicates that the standard deviations of the estimated random effects are $0.8704$ and $0.2005$ respectively. The residual standard deviation (of $e$) is $1.0873$ in this model fit. lmer() also reports the correlation between the two random effects. The correlation between the random intercepts and slopes is reported to be $1$. In general, there is nothing wrong with a high correlation between random intercepts and slopes. In our case, it would mean that the kids who are more proficient in their home language also show a larger difference between the home and school languages (it may be difficult to reason about, but it can happen). However, a perfect or near-perfect correlation (close to $1$ or $-1$) indicates problems with fitting the model. One option is to specify a model without random-effect correlations, which we will revisit later. Another option, especially after observing that the variance of the random slope is close to zero, is to check whether it is worth opting for this more complex model instead of a random-intercepts-only model.
So far, we have been fitting mixed-effect models with so-called residual or restricted maximum likelihood (REML), which is the lmer() default. This method has some advantages over maximum likelihood estimation (MLE) when fitting mixed-effect models, particularly for estimating variance components. However, an MLE fit provides some additional information, such as AIC, that we have used for model selection earlier. To fit a mixed-effect model with MLE, you need to specify the option REML=FALSE. In most cases, the differences between MLE and REML estimates should be small.
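For example, refitting the earlier random-intercept model with MLE (data frame and column names assumed as before):

```r
# REML = FALSE switches from REML to maximum likelihood estimation
nb.lmer1.ml <- lmer(rate ~ language + (1 | participant),
                    data = newborn, REML = FALSE)
AIC(nb.lmer1.ml)   # AIC is meaningful for MLE fits
```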
Which model does AIC prefer? ⊳
Another way to compare two models is to test them using anova(). For models fit by MLE, anova() will perform a $\chi^2$ test on the difference between the log likelihoods of the two models. The test result will be statistically significant if the difference between the log likelihoods is unlikely to have arisen by chance.
What is your conclusion? ⊳
With mixed-effect models, it is no longer straightforward to calculate p-values for the model coefficients. The difficulty has to do with calculating the degrees of freedom for the t-values computed from the coefficient standard errors. In most cases, especially if you have enough data, $\pm 2\times$SE will give a crude but useful approximation to the 95% confidence interval for a coefficient estimate. Other, more precise alternatives include likelihood-ratio tests, profile confidence intervals, and bootstrap-based confidence intervals.
We will work only with the first two.
Reminder: To fit a model without any predictors (i.e., only with an intercept), we use the syntax response ~ 1. Of course, you will need to specify the random-effects term for your new model to be comparable to the model nb.lmer1 fitted earlier.
TIP: to make sure the likelihood-ratio test makes sense, you should normally (re)fit the models with MLE. However, anova() will do this for you if your models were fit with REML. ⊳
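Putting the reminder and the tip together, a sketch of the comparison (model and column names assumed as before):

```r
# intercept-only model vs. the model with the language predictor,
# both fit with MLE so the likelihood-ratio test is meaningful
nb.lmer0    <- lmer(rate ~ 1 + (1 | participant),
                    data = newborn, REML = FALSE)
nb.lmer1.ml <- lmer(rate ~ language + (1 | participant),
                    data = newborn, REML = FALSE)
anova(nb.lmer0, nb.lmer1.ml)   # chi-squared test on the log-likelihood difference
```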
You can obtain a ‘profile’ of a model with the profile() command. The profile can be used to inspect the coefficient estimates in detail. Here, we will only use profiles to get confidence intervals. To get the confidence intervals from a profile, all you need to do is run the function confint() on the profile object. Although you can perform hypothesis tests at any significance level, it makes more sense to report the confidence intervals directly. This is particularly true for random effects, whose confidence intervals tend to be asymmetric.
Is the random effect due to the participants statistically significant at the same levels?
TIP: 95% confidence intervals are the default for confint(). To obtain 99% confidence intervals, you need to specify level=0.99. ⊳
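A sketch of profiling and the resulting confidence intervals (assuming the model object nb.lmer1):

```r
pp <- profile(nb.lmer1)     # may take a while for larger models
confint(pp)                 # 95% profile confidence intervals (default)
confint(pp, level = 0.99)   # 99% intervals
```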
As in factorial repeated-measures ANOVA, we can have multiple fixed effects in a mixed-effect model. The specification of the fixed effects is straightforward in the model formula.
Fit a mixed-effect model with random intercepts and slopes. (The repeated-measures ANOVA corresponds to a mixed-effect model with both random intercepts and slopes.)
In our bilingual data, the bilingual kids were measured on both languages, and at different ages. In repeated-measures ANOVA terms, both language and age are within-subjects variables. As in ANOVA, we can also analyze ‘mixed’ data with between- and within-subjects predictors using mixed-effect models. In the multilevel modeling literature, a between-subjects predictor, such as gender in our bilingual data set, is associated with a different level in the hierarchy. In the formula notation used with lmer(), you do not need to distinguish the between- or within-subject variables. This will be apparent in the data organization. Of course, you do not specify random slopes for such a predictor if you are fitting a model with random slopes.
Does gender have a significant effect (at $\alpha = 0.05$)? ⊳
In the data sets we have been working with in this section, we have only one sensible random effect (due to the participants). Mixed-effect models have been advocated in (psycho)linguistic research particularly for their capability of handling crossed, non-hierarchical random effects. The problem, named the ‘language-as-a-fixed-effect fallacy’ after Clark [1973], arises because in a typical ‘repeated measures’ psycholinguistics study one does not only repeat the measurements over the participants, but also over the items, e.g., phrases, words, sentences, and maybe pictures, that are used as experimental material. A typical repeated-measures ANOVA generalizes to the population the participants are sampled from, but not to the language from which the items are sampled. In such an experiment, the subjects and the items are crossed. That is, all subjects react to all items. Although there are some approximate solutions, including one suggested by Clark [1973], classical repeated-measures methods cannot deal with this sort of crossed random effects in a straightforward way [see Raaijmakers et al., 1999, Baayen, 2008, for further discussion].
There is nothing special in the specification of multiple random effects for lmer(). We simply add the additional random effects to the formula notation as before. For example, random intercepts due to both subject and item would be specified as (1|subject)+(1|item).
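For the par data, a crossed random-effects model might be sketched as follows. The response name rate, a hypothetical phrase-type predictor ptype, and the grouping columns subj and item are all assumptions to be checked against the actual data frame:

```r
# random intercepts for subjects and items are crossed, not nested:
# every subject responds to every item
par.m1 <- lmer(rate ~ ptype + (1 | subj) + (1 | item), data = par)
summary(par.m1)
```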
How do you interpret the fixed and random effects? ⊳
Pick the best model suggested by AIC(). ⊳
Compare the coefficient estimates with par.m4. Which fixed effect estimates (and their inferences) change? Why? ⊳
Exercise 12.20 points to a common problem with the estimation of mixed-effect models with numeric predictors. In Exercise 12.20, besides the fact that we have an intercept term with no straightforward interpretation, you should have noticed that the estimate is now less certain, and there is a large correlation between the intercept and the slope of length. These problems are due to the fact that the intercept is now far away from the observations, and they can be remedied by centering or scaling the numeric predictor.
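A sketch of centering and scaling (the column name length in the merged par data frame is an assumption):

```r
# center length on its mean so the intercept refers to a phrase of
# average length rather than a (non-existent) phrase of length zero
par$length.c <- par$length - mean(par$length)

# alternatively, standardize: scale() centers and divides by the SD
par$length.z <- scale(par$length)
```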
How do you interpret the intercept term now? Do you observe improvements in the estimation of the intercept term? ⊳
In the models we fitted in the last two exercises, we observe another interesting aspect of mixed-effect models. In an ordinary linear model, the phrase id (item) and the phrase length would be collinear. As a result, the coefficient of one of these predictors would not be estimable. In mixed-effect models this is fine, since the predictor length is only used for estimating the intercepts due to items. In other words, length is a predictor at a different ‘level’. You should also observe this in the reduction of the estimated variance of the random intercept due to item (but almost no other changes) with the addition of length as a predictor.
What is the new interpretation of the intercept?
Which random effects have changed? Why? ⊳
Do you observe the reduction in the subject variation in the graphs?
TIP: The argument which="subject" to ranef() causes it to extract only the random effects due to subject from the model.
TIP2: dotplot() resets the values set by par(). So, plotting these side-by-side is a bit tricky (you will not be tortured with this). ⊳
Although we have tried to include some introduction to mixed-effect models in this section, our aim in this tutorial, as usual, is not to be comprehensive. Here we mention a few books that you may want to consult for learning more about multilevel models, or possibly study alongside the exercises here, especially if you are using this tutorial for self-study. A rather accessible and comprehensive textbook on mixed-effect or multilevel modeling is Gelman and Hill [2007]. Another ‘must-read’ book, if you will be using mixed-effect models in practice, is Bates [2010]. This book, written by the author of the lme4 package, includes many practical examples with clear explanations, and is also available online (see the references at the end for the URL). The focus of Baayen [2008, ch. 7] is also mixed-effect models, discussed through examples from linguistic data.