A Answers

A.1 Starting R and finding your way around

Answer of 1.1. 

 
> ??multivariate anova 
stats::manova           Multivariate Analysis of Variance 
stats::summary.manova   Summary Method for Multivariate 
                        Analysis of Variance  

So, manova() should be the function we are looking for. Now it’s time to type ?manova and learn how to use it. Note that part of the output is trimmed for readability.

Answer of 1.2. 

The answer may change depending on the version of R and the packages loaded in the R environment. On my system, if I press TAB after the third letter (the initial segment sha), the full function name is displayed. And after sh, pressing TAB twice gives me a list of 9 commands.

Answer of 1.3. 

Here is a way to do it:

 
1> x = 20 
2> m = 10 
3> s = 5 
4> t <- x - m 
5> t / s -> z 
6> z 
7[1] 2  

Note the different assignment operators, <- and ->, on lines 4 and 5. We have already mentioned <-. -> is similar, but you place the target variable to the right of ->, which is sometimes handy when you start writing a complex expression and want to store the result in a variable.

Answer of 1.4. 

 
z = (x - m) / s  

The thing to note is that you need parentheses around the subtraction for correct interpretation. What is the result without parentheses?
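
If you try it yourself (assuming x, m and s still hold the values from Exercise 1.3), you should see why:

 
> x - m / s 
[1] 18  

Division binds more tightly than subtraction, so R computes x - (m/s) = 20 - 2 = 18, not (x - m) / s = 2.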

Answer of 1.5. 

First we search using the keyword ‘logarithm’

 
> ??logarithm 
base::log      Logarithms and Exponentials 
nlme::logDet   Extract the Logarithm of the Determinant  

Some additional information is excluded from this listing. In this case R finds two matches. We are interested in the first one. The notation base::log tells us that there is a log function in the package base. We will study the use of packages later. It is reassuring that the function is in the base package, which means it is directly accessible.

Now we know the function name, we calculate the logarithm of 2.7:

 
> log(2.7) 
[1] 0.9932518  

Answer of 1.6. 

If you have tried help(log) or ?log (or ?base::log), you see that you can give a second parameter as the base of the logarithm. Furthermore, R gives you ‘shortcut’ functions for common bases, like log2() and log10(). We will use the first method:

 
> log(2.7, base=2) 
[1] 1.432959  

We could just type log(2.7, 2), since the value is the first and the base is the second argument. The notation we used specifies the base parameter explicitly, so that you do not need to remember the order of the parameters. This notation is particularly useful for well-known parameters (like mean or sd) of other commonly used functions as well.
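
For example, with named arguments the order does not matter; the following two calls are equivalent (rnorm() draws random samples from a normal distribution with the given mean and sd):

 
> rnorm(3, mean=100, sd=10) 
> rnorm(3, sd=10, mean=100)  

Both draw three values from a normal distribution with mean 100 and standard deviation 10.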

Answer of 1.7. 

 
> ls() 
[1] "m"  "s" "t"  "x"  "z" 
> rm(t,x) 
> ls() 
[1] "m"  "s" "z"  

Here is a tip for deleting all variables without needing to list them all (use with caution!):

 
> rm(list=ls()) 
> ls() 
character(0)  

Answer of 1.8. 

 
> sd(nwords)/sqrt(length(nwords)) 
[1] 7.485096  

The sqrt(x) function we used here is an alternative to x^0.5.

Answer of 1.9. 

1> znwords <- (nwords - mean(nwords)) /sd(nwords) 
2> znwords; mean(znwords); sd(znwords) 
3[1] -0.06759625 -0.15209156 -1.84199776 0.35488030 
4[5]  0.18588968  0.56611858 -0.27883452 0.31263265 
5[9]  1.96029119 -1.03929231 
6[1] 3.852463e-15 
7[1] 1

On line 2, we specified multiple commands on a single line, separating them with semicolons ‘;’. You could, of course, run each command on a separate line.

Note that R displays the mean on line 6 using scientific notation. This means 3.85 × 10^-15, a very small number (but should it not be 0?).
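
This is a consequence of floating-point arithmetic: most decimal fractions cannot be represented exactly in binary, so tiny rounding errors accumulate instead of canceling exactly. A classic demonstration:

 
> 0.1 + 0.2 == 0.3 
[1] FALSE 
> 0.1 + 0.2 - 0.3 
[1] 5.551115e-17  

For all practical purposes, a number this close to zero should be treated as zero.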

Answer of 1.10. 

 
> sort(c(36,35,8,71), decreasing=T) 
[1] 71 36 35  8  

Another side note here: T is a shorthand for TRUE. It is unlikely to make a difference for most purposes, but TRUE is a reserved keyword of the language, while T is just a convenience variable. If you are interested, the following proves the point:

 
> T 
[1] TRUE 
> T = FALSE 
> T 
[1] FALSE 
> TRUE = FALSE 
Error in TRUE = FALSE : 
        invalid (do_set) left-hand side to assignment  

If you try these, do not forget to rm(T) at the end of your trial. Otherwise, at some point later, you may have a hard time understanding why things stopped working as they used to.

Answer of 1.11. 

 
> seats = c(36,35,8,71) 
> (seats / sum(seats)) * 100 
[1] 24.000000 23.333333  5.333333 47.333333 
> round((sort(seats, dec=T) / sum(seats)) * 100, 2) 
[1] 47.33 24.00 23.33  5.33  

The second command gives the answer, and the last command uses the sorted list and rounds the result to two digits after the decimal point, which makes it easier to read.

Answer of 1.12. 

 
> wdiff <- c(6,8,6,5,7,5,9,7,10,9)  

Answer of 1.13. 

 
> nwords2 = nwords 
> nwords = nwords - wdiff 
> nwords; nwords2 
 [1] 3497 3495 3456 3506 3503 3515 3486 3504 3541 3469 
 [1] 3510 3508 3468 3520 3516 3525 3505 3519 3558 3487  

Note, again, the use of multiple commands on the same line. This time it is particularly useful, since it allows easier comparisons of the (short) vectors.

Answer of 1.14. 

 
> mean(wdiff) 
[1] 7.2 
> mean(nwords2) - mean(nwords) 
[1] 7.2  

A.2 Basic data exploration and inference

Answer of 2.1. 

 
> stem(words) 
  The decimal point is 4 digit(s) to the right of the | 
 
  0 | 255666 
  1 | 0113445778 
  2 | 011 
  3 | 2  

The value displayed on the last line (corresponding to the person who spoke 31955 words) seems to be quite far away from the others.

Answer of 2.2. 

 
> hist(words)  

You should see the histogram in a separate window. For now, we will only display our graphs on screen without much makeup. R allows you to control every aspect of your graphs, and you can save graphs from R in many different formats. We will frequently revisit graphs later in this tutorial.

Answer of 2.3. 

 
> boxplot(words)  

Answer of 2.4. 

 
> boxplot(words.f, words.m)  

Answer of 2.5. 

 
> cor(words, age) 
[1] -0.2604937  

We have a negative correlation, indicating that people speak less with age. The correlation is not strong. You should interpret the strength of a correlation with respect to the problem at hand, but typically absolute values less than 0.5 would not be regarded as a strong correlation.

Answer of 2.6. 

 
> cor(words.m, words.f) 
[1] -0.3212523  

There is no reason for these two values to be correlated (unless sampling is done horribly wrong, and the order of the numbers somehow reflects it). The best explanation for the above correlation coefficient is ‘chance’. We will discuss inference for correlation later.

Answer of 2.7. 

 
> plot(age, words)  

The convention is to put the predictor on the x-axis and the response variable on the y-axis. Presumably, we are interested in the effect of age on talkativeness (unless you hypothesize that talking keeps you young). So we put age on the x-axis and words on the y-axis.

Answer of 2.8.  Here is the interpretation of the result (the scatter plot itself is not reproduced here).

The regression line indeed agrees: it indicates that as age increases, people become less talkative. In fact, the correlation coefficient, Pearson’s r, is simply a normalized simple regression slope. However, as we will stress later, least squares regression (and also correlation) is sensitive to extreme values. In our example, it seems the regression line is affected (possibly substantially) by the youngest and most talkative participant.
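
If you want to verify the relation between the correlation coefficient and the regression slope, the following sketch (assuming the words and age vectors are still in your environment) should print the same value, -0.2604937, twice:

 
> cor(words, age) 
> coef(lm(words ~ age))[2] * sd(age) / sd(words)  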

You may want to remove this data point (or replace it with a more moderate value) and try the last three exercises again. You are encouraged to experiment with removing or replacing data points as you like here. However, when you remove data points in ‘real’ research, you should be convinced (and be ready to convince others) that it is a sensible thing to do.

Answer of 2.9. 

 
> plot(words.m, words.f) 
> abline(lm(words.f ~ words.m))  

Answer of 2.10. 

 
> se.words <- sd(words)/sqrt(length(words)) 
> se.words 
[1] 1620.478  

Answer of 2.11. 

 
> mean(words) - 2 * se.words 
[1] 10007.14 
> mean(words) + 2 * se.words 
[1] 16489.06  

or a more practical notation:

 
> mean(words) - c(-2, 2) * se.words 
[1] 16489.06 10007.14  

The second calculation may be difficult to grasp at this point. However, it shows clearly how R treats the vectors.

Answer of 2.12. 

 
> mean(words) + qnorm(0.025) * se.words 
[1] 10072.02 
> mean(words) + qnorm(0.975) * se.words 
[1] 16424.18  

Or simplifying it similar to the above, both for normal and t distributions:

 
> mean(words) + qnorm(c(0.025,0.975)) * se.words 
[1] 10072.02 16424.18 
> mean(words) + qt(c(0.025,0.975), df=19) * se.words 
[1]  9856.401 16639.799  

The confidence interval calculated using the t distribution is larger (less certain, or more conservative if testing hypotheses). As expected, the ‘multiply by ±2’ approximation yields a slightly larger interval than the one calculated using the normal distribution.

Answer of 2.13.  The confidence intervals calculated in Exercise 2.12 include 16000. As a result, we cannot reject the null hypothesis (that the population mean is 16000).

Answer of 2.14. 

 
> t.test(words, mu=16000) 
        One Sample t-test 
data:  words 
t = -1.6982, df = 19, p-value = 0.1058 
alternative hypothesis: true mean is not equal to 16000 
95 percent confidence interval: 
  9856.401 16639.799 
sample estimates: 
mean of x 
  13248.1  

Not surprisingly, the difference is not significant at 0.05 significance level (p-value = 0.1058). Also note that the confidence interval reported by t.test() is the same as the confidence interval you calculated using the t distribution above.

Answer of 2.15.  Again, we simply do a single-sample t test.

 
> t.test(words.f, mu=20000) 
        One Sample t-test 
data:  words.f 
t = -4.2857, df = 9, p-value = 0.002033 
alternative hypothesis: true mean is not equal to 20000 
95 percent confidence interval: 
  9590.798 16783.202 
sample estimates: 
mean of x 
    13187  

This time we have a small p-value. So we can reject the hypothesis that population mean is 20000.

Answer of 2.16.  We can already tell by looking at the mean values (in our sample, mean value for men is larger than that of women). However, we do the test nevertheless:

 
> t.test(words.f, words.m, alternative="greater") 
        Welch Two Sample t-test 
data:  words.f and words.m 
t = -0.0367, df = 13.889, p-value = 0.5144 
alternative hypothesis: true difference in means 
            is greater than 0 
95 percent confidence interval: 
 -5990.058       Inf 
sample estimates: 
mean of x mean of y 
  13187.0   13309.2  

The above tests whether we can reject the null hypothesis that μf − μm ≤ 0 in the population. Since the p-value is (much) greater than 0.05, we cannot reject the null hypothesis and cannot conclude that ‘women talk more’.

Two notes on the above output: First, the confidence interval reported is for the difference of the means (as expected, it includes 0). Second, R reports this as a ‘Welch Two Sample t-test’ instead of calling it simply an ‘independent samples t-test’. This is because the t.test() function by default applies a correction for unequal variances. You can force t.test() to assume equal variances, as shown below (see also help(t.test)).
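
For example, the pooled-variance version of the same test would be:

 
> t.test(words.f, words.m, var.equal=TRUE, alternative="greater")  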

A.3 Linear regression: a first introduction

Answer of 3.1. 

 
> plot(mot~chi, data=mlu)  

or, alternatively

 
> plot(mlu$chi, mlu$mot)  

Answer of 3.2.  No answer yet.

Answer of 3.3.  Assuming you have decided the best line is with intercept 5.5 and slope 0.25,

 
> abline(5.5, 0.25, col="red", lwd=2)  

Answer of 3.4.  You should get an intercept of 5.7133 and a slope of 0.1503 from the lm() output. You can either draw it using these numbers,

 
> abline(5.7133, 0.1503, col="blue")  

or let abline() extract a and b from the lm() output:

 
> abline(lm(mot ~ chi, data = mlu), col="blue")  

Answer of 3.5. 

 
> cor(mlu$chi, mlu$mot)^2 
[1] 0.1272399  

Not surprisingly this is the R2 value reported by lm().

Answer of 3.6.  Histogram can be produced by:

 
> hist(resid(lm(mot~chi, data=mlu)))  

and Q-Q plot, including the theoretical line:

 
 > qqnorm(resid(lm(mot~chi, data=mlu))) 
 > qqline(resid(lm(mot~chi, data=mlu)))  

Interpreting these diagnostic plots requires some experience. In summary: we see some deviation from normality, but in general it is normal to see this in small data sets.

Answer of 3.7.  First the summary of the linear regression analysis:

 
1> summary(lm(age~chi, data=mlu)) 
2Call: lm(formula = age ~ chi, data = mlu) 
3Residuals: 
4    Min      1Q  Median      3Q     Max 
5-3.2881 -0.8938 -0.0091  0.8140  2.9785 
6Coefficients: 
7            Estimate Std. Error t value Pr(>|t|) 
8(Intercept)  13.2892     1.1795   11.27 1.32e-10 
9chi           8.7177     0.4254   20.49 8.00e-16 
10--- 
11Residual standard error: 1.613 on 22 degrees of freedom 
12Multiple R-squared:  0.9502, Adjusted R-squared:  0.948 
13F-statistic:   420 on 1 and 22 DF,  p-value: 7.999e-16  

  1. Since we take age as an unquestionable measure of linguistic competence and want to predict it using MLU, our response variable is age and the predictor is chi.
  2. The intercept is 13.2892, and the slope is 8.7177. The intercept corresponds to the age (remember, in months) of a child who has 0 MLU. Although you should approach it cautiously, it does not seem to be a very bad estimate. The slope means that each unit increase in MLU is associated with an increase of about 8.7 months in age.
  3. The commands plot(age~chi, data=mlu) and abline(lm(age~chi, data=mlu)) should do the trick.
  4. The p-value presented on line 9 in the listing above is very small, and it is definitely statistically significant.
  5. The plot (not given here, you should really produce it) does not indicate any outliers.
  6. As before, you should use resid() to extract the residuals, and check them using qqnorm(), and possibly with shapiro.test(), as sketched below. If we are looking at the same pictures, the graph and the test should not indicate non-normality.
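
A sketch of that last check (output not shown):

 
> r <- resid(lm(age ~ chi, data=mlu)) 
> qqnorm(r); qqline(r) 
> shapiro.test(r)  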

A.4 Linear models with categorical predictors

Answer of 4.1. 

 
> talk = data.frame(age=age, words=words)  

Answer of 4.2. 

 
> str(mlu.ses) 
'data.frame':   60 obs. of  4 variables: 
 $ subject: int  1 2 3 4 5 6 7 8 9 10 ... 
 $ ses    : Factor w/ 3 levels "low","medium",..: 1 1 1 ... 
 $ gender : Factor w/ 2 levels "female","male": 2 2 2 ... 
 $ mlu    : num  1.81 1.91 1.35 2.84 2.54 2.92 1.2 ...  

The output looks fine, except that, as we will see in the next exercise, the subject variable is identified as int (integer), although it is a categorical variable and should have been a factor.

Answer of 4.3. 

 
> mlu.ses$subject = factor(mlu.ses$subject)  

We convert the integer values in mlu.ses$subject to a factor, and store the result back in mlu.ses$subject.

Answer of 4.4.  Here are a few ways to do it:

 
> mlu.ses[mlu.ses$gender == "male", 4] 
> mlu.ses[mlu.ses$gender == "male", "mlu"] 
> mlu.ses$mlu[mlu.ses$gender == "male"]  

Answer of 4.5. 

 
> hist(talk$words[talk$gender == "F"]) 
> hist(talk$words[talk$gender == "M"])  

Answer of 4.6. 

 
> qqnorm(talk$words[talk$gender == "F"]) 
> qqline(talk$words[talk$gender == "F"]) 
> qqnorm(talk$words[talk$gender == "M"]) 
> qqline(talk$words[talk$gender == "M"])  

Answer of 4.7.  Here is the result only for men:

 
> shapiro.test(talk$words[talk$gender == "M"]) 
        Shapiro-Wilk normality test 
data:  talk$words[talk$gender == "M"] 
W = 0.929, p-value = 0.438  

which indicates that we have no evidence for non-normality (remember that the null hypothesis is ‘the distribution is normal’). The same should hold for women.

Answer of 4.8. 

 
> t.test(words~gender, data=talk, var.equal=T, 
+        alternative="greater") 
        Two Sample t-test 
data:  words by gender 
t = -0.0367, df = 18, p-value = 0.5144 
alternative hypothesis: true difference in means is 
                        greater than 0 
95 percent confidence interval: 
 -5896.008       Inf 
sample estimates: 
mean in group F mean in group M 
        13187.0         13309.2  

The p-value is the same up to four decimal places. So, the correction for unequal variances does not seem to affect the test results.

The degrees of freedom above are as expected: we have 20 subjects and we calculate two mean values from them, leaving us with 18 degrees of freedom. The interesting bit is the DF with the Welch correction, which was 13.889. Knowing that a t distribution with smaller degrees of freedom has heavier tails, the correction seems to reduce the DF to force a more conservative test (although it does not make much difference in our case).

Answer of 4.9. 

 
> boxplot(words ~ gender, data=talk)  

You should see a rather large difference (the variation for men is larger).

Answer of 4.10.  We could use a command like
var.test(talk$words[talk$gender == "F"], talk$words[talk$gender == "M"]),
but var.test() accepts the formula notation, which is neater:

 
> var.test(words ~ gender, data=talk) 
        F test to compare two variances 
data:  words by gender 
F = 0.2953, num df = 9, denom df = 9, 
p-value = 0.08358 
alternative hypothesis: true ratio of variances is not equal to 1 
95 percent confidence interval: 
 0.07333846 1.18871596 
sample estimates: 
ratio of variances 
         0.2952602  

Despite the fact that the variance for men is about three times larger than the variance for women, we get a p-value of 0.08358: we cannot rule out the possibility that the variances differ by chance at conventional significance (α) levels.

Answer of 4.11. 

 
> wilcox.test(words ~ gender, data=talk, alternative="greater") 
        Wilcoxon rank sum test 
data:  words by gender 
W = 53, p-value = 0.4267 
alternative hypothesis: true location shift is greater than 0  

Nothing surprising here. Similar to the t tests, we cannot reject the null hypothesis.

Answer of 4.12.  Here is how to generate the box plot:

 
> boxplot(mlu ~ ses, data=mlu.ses)  

  1. Compared to the other two groups, the high-SES group seems to have higher variation, but it does not look extremely different. (You may want to verify this with var.test().)
  2. All box plots show some skew, but again, this may not necessarily indicate non-normality.
  3. From the box plots, the high and medium groups seem similar to each other, but different from low. Box plots cannot answer significance questions reliably, but there seems to be a large difference between low and the others, and depending on the sample size, it is likely to be statistically significant.

Answer of 4.13.  You can run three separate commands, namely

 
> shapiro.test(mlu.ses$mlu[mlu.ses$ses == "low"]) 
> shapiro.test(mlu.ses$mlu[mlu.ses$ses == "medium"]) 
> shapiro.test(mlu.ses$mlu[mlu.ses$ses == "high"])  

However, R provides mechanisms to ease repetitions like this one. Here is an example:

 
> by(mlu.ses$mlu, mlu.ses$ses, shapiro.test) 
mlu.ses$ses: low 
        Shapiro-Wilk normality test 
data:  dd[x, ] 
W = 0.9242, p-value = 0.1192 
------------------------------------- 
mlu.ses$ses: medium 
        Shapiro-Wilk normality test 
data:  dd[x, ] 
W = 0.942, p-value = 0.2614 
------------------------------------- 
mlu.ses$ses: high 
        Shapiro-Wilk normality test 
data:  dd[x, ] 
W = 0.9721, p-value = 0.798  

by() partitions its first argument (mlu here) based on the levels of its second argument (ses here), and applies the function given in the last argument to each group. None of the above tests suggests a significant divergence from normality.
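
The same mechanism works with any function that takes a vector; for example, a quick way to get the per-group means:

 
> by(mlu.ses$mlu, mlu.ses$ses, mean)  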

Answer of 4.14.  Here are the ANOVA results on this data:

 
> summary.aov(lm(words ~ gender, data=talk)) 
            Df    Sum Sq  Mean Sq F value Pr(>F) 
gender       1     74664    74664   0.001  0.971 
Residuals   18 997785410 55432523  

And the t test, repeated here for ease of comparison:

 
> t.test(words ~ gender, data=talk, var.equal=T) 
        Two Sample t-test 
data:  words by gender 
t = -0.0367, df = 18, p-value = 0.9711 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval: 
 -7117.515  6873.115 
sample estimates: 
mean in group F mean in group M 
        13187.0         13309.2  

No surprises: the p-values found by both tests are the same.

And no, directional tests are not possible in the main ANOVA (one can do directional post-hoc comparisons).

Answer of 4.15.  Here are multiple comparisons using both methods:

 
> pairwise.t.test(mlu.ses$mlu, mlu.ses$ses) 
        Pairwise comparisons using t tests with pooled SD 
data:  mlu.ses$mlu and mlu.ses$ses 
       low     medium 
medium 1.2e-06 - 
high   1.8e-06 0.83 
P value adjustment method: holm 
> pairwise.t.test(mlu.ses$mlu, mlu.ses$ses, p.adjust.method="bonf") 
        Pairwise comparisons using t tests with pooled SD 
data:  mlu.ses$mlu and mlu.ses$ses 
       low     medium 
medium 1.2e-06 - 
high   2.7e-06 1 
P value adjustment method: bonferroni  

Notice that all p-values, except the one for the largest difference, are larger with the Bonferroni correction (which indicates that it is more conservative).

Answer of 4.16. 

 
> summary.aov(lm(mlu ~ ses + gender, data=mlu.ses)) 
            Df Sum Sq Mean Sq F value   Pr(>F) 
ses          2 20.714  10.357  22.159 8.14e-08 
gender       1  1.852   1.852   3.961   0.0514 
Residuals   56 26.173   0.467  

Answer of 4.17. 

 
> summary.aov(lm(mlu ~ ses * gender, data=mlu.ses)) 
            Df Sum Sq Mean Sq F value  Pr(>F) 
ses          2 20.714  10.357  23.047 5.8e-08 *** 
gender       1  1.852   1.852   4.120  0.0473 * 
ses:gender   2  1.907   0.954   2.122  0.1297 
Residuals   54 24.266   0.449  

Answer of 4.18. 

 
> interaction.plot(mlu.ses$gender, mlu.ses$ses, mlu.ses$mlu)  

We see some interaction patterns. The lines for high and medium cross, and girls perform better within these two groups, while boys perform better in the low-SES group. However, the interaction term ses:gender above is not statistically significant. We do not have enough evidence in support of any of the interaction patterns we observe.

Answer of 4.19. 

 
> kruskal.test(mlu~ses, data=mlu.ses) 
        Kruskal-Wallis rank sum test 
data:  mlu by ses 
Kruskal-Wallis chi-squared = 24.3514, 
               df = 2, p-value = 5.154e-06  

We again find a statistically significant effect of ses on mlu. As expected, the p-value found here is larger than the one found using ANOVA, which indicates that the non-parametric test is likely to be more conservative.

Answer of 4.20. 

 
> t.test(words ~ gender, data=talk, var.equal=T) 
        Two Sample t-test 
data:  words by gender 
t = -0.0367, df = 18, p-value = 0.9711 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval: 
 -7117.515  6873.115 
sample estimates: 
mean in group F mean in group M 
        13187.0         13309.2  

Answer of 4.21. 

 
> summary(lm(words ~ gender, data=talk)) 
Call: lm(formula = words ~ gender, data = talk) 
Residuals: 
     Min       1Q   Median       3Q      Max 
-11157.2  -7184.8    374.4   4084.7  18645.8 
Coefficients: 
            Estimate Std. Error t value Pr(>|t|) 
(Intercept)  13187.0     2354.4   5.601 2.58e-05 
genderM        122.2     3329.6   0.037    0.971 
Residual standard error: 7445 on 18 degrees of freedom 
Multiple R-squared:  7.482e-05, Adjusted R-squared:  -0.05548 
F-statistic: 0.001347 on 1 and 18 DF,  p-value: 0.9711  

Note that the p-value for the slope is the p-value we found in the t test.

A.5 Repeated measures

Answer of 5.1. 

> bilingual <- read.delim("http://coltekin.net/cagri/R/data/bilingual.txt")

Answer of 5.2.  You should realize that the variable subj is of type integer rather than a factor. To convert it:

 
> bilingual$subj = factor(bilingual$subj)  

Another useful but not strictly necessary conversion is re-creating the age factor so that its levels are ordered in a meaningful way:

 
> bilingual$age <- factor(bilingual$age, 
    levels=c("preschool", "firstgrade", "secondgrade"))  

R orders the levels alphabetically by default. The ordering here makes sure that the levels show up in a reasonable order when displayed in graphics or as text.
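
You can see the default alphabetical ordering in a minimal example:

 
> levels(factor(c("secondgrade", "preschool", "firstgrade"))) 
[1] "firstgrade"  "preschool"   "secondgrade"  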

Answer of 5.3.  Here is the full listing of the test.

 
> t.test(rate ~ language, data=newborn) 
        Welch Two Sample t-test 
data:  rate by language 
t = -1.7074, df = 57.994, p-value = 0.0931 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval: 
 -9.8271059  0.7797726 
sample estimates: 
mean in group foreign  mean in group native 
             31.84367              36.36733  

Two normal Q-Q plots, one for each language, and var.test() for the equality of variances would be ways to test the normality and homogeneity of variance assumptions. A boxplot() is also useful for visualizing the distributions of both groups. However, as in this case, it may be very difficult or impossible to spot violations of the independence assumption.

Answer of 5.4. 

 
> t.test(rate ~ language, data=newborn, paired=T) 
        Paired t-test 
data:  rate by language 
t = -5.3138, df = 29, p-value = 1.06e-05 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval: 
 -6.264775 -2.782559 
sample estimates: 
mean of the differences 
              -4.523667  

Nothing surprising: the paired test is a lot more sensitive, resulting in a smaller p-value.

Answer of 5.5. 

 
> qqnorm(newborn$rate[newborn$language == "native"] - 
          newborn$rate[newborn$language == "foreign"])  

The point to remember is that the paired test compares the ‘mean of the differences’ to 0, not the difference of the means as in the independent samples test.
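
In fact, you can reproduce the paired test as a one-sample test on the differences. A sketch (assuming, as in the Q-Q plot above, that the rows of the two language conditions are ordered by participant):

 
> diffs <- newborn$rate[newborn$language == "native"] - 
+          newborn$rate[newborn$language == "foreign"] 
> t.test(diffs, mu=0)  

This should give the same t value, df and p-value as the paired test above.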

Answer of 5.6. 

 
> boxplot(rate ~ language, data=newborn)  

You should see some difference, but it is rather small, and the distributions largely overlap. What we see includes the individual variation, which clouds the real paired differences. To demonstrate, we can plot:

 
> boxplot(newborn$rate[newborn$language == "native"] 
        - newborn$rate[newborn$language == "foreign"])  

Note that here we are interested in whether the mean of the distribution visualized by the single box plot is 0 or not.

Answer of 5.7. 

 
> summary(aov(rate ~ language, data=newborn))  

Answer of 5.8. 

 
> summary(aov(rate ~ language + 
        Error(participant/language), data=newborn))  

Answer of 5.9. 

 
> summary(aov(mlu ~ language*age + 
        Error(subj/(language*age)), data=bilingual))  

Answer of 5.10. 

 
> interaction.plot(bilingual$age, 
    bilingual$language, 
    bilingual$mlu)  

Answer of 5.11. 

 
> summary(aov(mlu ~ language*age*gender + 
    Error(subj/(language*age)), data=bilingual))  

Note that the only difference from the purely within-subjects syntax is that the between-subjects variable does not go into the Error() term.

A.6 Graphics

Answer of 6.1.  There is nothing special with loading this CSV file:

> seg <- read.csv("http://coltekin.net/cagri/R/data/seg.csv")

The data types you see (e.g., when you check with str()) should in general be sensible, but two of them are questionable. Depending on your analysis, utterance may be treated as integer (numeric) or categorical (factor). Similarly, having phoneme as a factor variable is fine, but in some cases you may want to convert it to a character string. For the rest of the exercises here there is no need to convert these data types, but for the sake of exercise, here is how we convert them:

 
> seg$utterance <- factor(seg$utterance) 
> seg$phoneme <- as.character(seg$phoneme)  

Answer of 6.2. 

 
> plot(seg$h[seg$utterance == 1])  

Answer of 6.3. 

 
> plot(h ~ pmi, data=seg, col="red") 
> abline(lm(h ~ pmi, data=seg), col="blue")  

Answer of 6.4. 

 
> plot(seg[,4:7])  

or, a clearer notation at the expense of some typing:

 
> plot(seg[, c("pmi", "h", "rh", "sv")])  

Answer of 6.5. 

 
> hist(seg$pmi) 
> hist(seg$h)  

Answer of 6.6. 

 
> qqnorm(seg$pmi);qqline(seg$pmi) 
> qqnorm(seg$h);qqline(seg$h)  

The PMI values should look approximately normal, while the H values should show a serious divergence from normality.

Note that you can run two or more commands at once by separating each command by a semicolon ‘;’ on the same line.

Answer of 6.7. 

 
> boxplot(pmi ~ boundary, data=seg)  

Answer of 6.8.  Assuming the variable x from the above listing is in your R environment,

 
> plot(x, dt(x, df=5), type="l", col="green", lwd=3)  

Answer of 6.9. 

 
> x <- seq(-4,4,by=0.1) 
> plot(x, dnorm(x), type="l", col="blue") 
> lines(x, dt(x, df=5), col="green")  

Answer of 6.10. 

 
> plot(x, dnorm(x), type="l") 
> lines(x, dt(x, df=1), col=2, lty=2) 
> lines(x, dt(x, df=5), col=3, lty=3) 
> lines(x, dt(x, df=20), col=4, lty=4)  

Answer of 6.11. 

 
> plot(seg$h[seg$utterance == 1], pch=19)  

Answer of 6.12.  It may take a bit of experimenting, but here is an example:

 
> plot(x, dnorm(x), type="l", col="blue") 
> lines(x, dt(x, df=5), col="green") 
> grid() 
> text(0, 0.4, "standard normal", pos=3, col="blue", offset=-0.05) 
> text(0, 0.35, "t(5)", col="green")  

Note that the grid() function plots a grid, which is helpful for deciding the coordinates of the points on the graph.

Answer of 6.13. 

 
> y <- seg$h[seg$utterance == 1] 
> x <- 1:length(y) 
> plot(x, y, type="l", lty="dotted") 
> text(x, y, 
       labels=seg$phoneme[seg$utterance == 1], 
       pos=3, 
       col=c("blue", "red")[1 + seg$boundary[seg$utterance == 1]])  

Answer of 6.14. 

 
> legend("topright", 
         c("normal", "t(1)", "t(5)", "t(20)"), 
         lty=1:4, 
         col=1:4)  

Answer of 6.15.  Only listing the first plot() command, the others are not affected:

 
> plot(x, dnorm(x), type="l", 
       main="normal and t distributions", 
       ylab="density", 
       xlab="")  

Answer of 6.16. 

 
> plot(h ~ pmi, data=seg, xlim=c(0,max(pmi)))  

should do. However, if you want to be certain without peeking into the data:

 
> plot(h ~ pmi, data=seg, 
    xlim=c(min(c(0, pmi)), max(c(0, pmi))), 
    ylim=c(min(c(0, h)), max(c(0, h))))  

Answer of 6.17.  Again, only listing the first plot() command, the others are not affected:

 
> plot(pmi ~ h, data=seg, subset=(boundary == T), 
       pch="+", col="red", 
       main="PMI vs. H", 
       xlab="PMI", 
       ylab="H") 
> points(pmi ~ h, data=seg, subset=(boundary == F), 
       pch="-", col="blue") 
> legend("bottomleft", c("boundary", "word internal"), 
        pch=c("+", "-"), 
        col=c("red", "blue"))  

Answer of 6.18. 

 
> par(mfrow=c(2,2)) 
> boxplot(pmi ~ boundary, main="PMI", data=seg) 
> boxplot(h ~ boundary, main="H", data=seg) 
> boxplot(rh ~ boundary, main="RH", data=seg) 
> boxplot(sv ~ boundary, main="SV", data=seg)  

Answer of 6.19. 

 
> pdf(file="normality-check.pdf", width=6.27, height=3) 
> par(mfrow=c(1,2)) 
> hist(seg$pmi, main="Histogram of PMI", xlab="PMI") 
> qqnorm(seg$pmi, pch=24, col="blue", bg="blue") 
> qqline(seg$pmi) 
> dev.off()  

Answer of 6.20. 

Here is only the sequence of commands to produce the PNG with resolution 1024x512.

 
> png("mlu-ttr-1024x512.png", width=1024, height=512) 
> par(mfrow=c(1,2)) 
> boxplot(chi.mlu, mot.mlu, names=c("child", "mother"), main="Mean length of utterance") 
> boxplot(chi.ttr, mot.ttr, names=c("child", "mother"), main="Type/token ratio") 
> dev.off()

If you use bitmap graphics a lot, you will realize that they scale poorly on higher-resolution media, particularly on paper. If you create bigger images and scale them down when necessary, the text on the graphs will typically become poorly readable, or even unreadable. To optimize for the intended medium, you also need to specify the resolution (res). By default, bitmap graphics are screen-optimized (not for fancy new high-DPI displays, but for good old 72dpi screens). You should change this to match the printer when you need to use them on paper (although for printing, you should really use vector graphics, unless there is a good reason for using bitmap).
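
For example, a print-oriented version of the PDF-sized figure above could be created with something like the following sketch (the file name is arbitrary; units="in" keeps the physical size while raising the resolution):

 
> png("normality-check-print.png", width=6.27, height=3, 
+     units="in", res=300) 
> # ... plotting commands as before ... 
> dev.off()  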

Answer of 6.21. 

 
> x <- c(0,1,2,4) 
> y <- c(0,1,3,4) 
> plot(x, y, type="l")  

Answer of 6.22. 

 
> pie(c(36,35,8,71), 
    labels=c("A", "B", "C", "D"))  

Answer of 6.23. 

 
> barplot(c(36,35,8,71), 
          names.arg=c("A", "B", "C", "D"))  

Answer of 6.24. 

 
> x <- seq(-pi, pi, by=0.1) 
> plot(x, sin(x), type="l", col="blue") 
> lines(x, cos(x), col="red")  

Answer of 6.25.  First, a not-so-correct attempt:

 
> abline(a=0, b=0) 
> abline(a=0, b=10^50)  

The horizontal line drawn first is straightforward: intercept and slope are both 0. However, the way we drew the vertical line is a bit of a cheat. The slope of a vertical line is undefined (well, if infinity is easier for you to deal with, we can also agree on positive infinity). As a result, we just put in a very big number hoping the approximation is good enough. However, for horizontal and vertical lines, abline() provides a simpler and neater alternative:

 
> abline(h=0) 
> abline(v=0)  

Answer of 6.26.  Although the actual code that generates Figure 1 is slightly different, the following produces a very similar figure.

 
> par(mfrow=c(1,2), mar=c(1,1,3,1)) 
> plot(0, 
      type="n", 
      xlim=c(0,2), ylim=c(0,2), 
      xaxt="n", yaxt="n", 
      xlab="", ylab="", 
      main="(a) par(mfrow=c(2,2))", 
      col.main="blue") 
> abline(h=1) 
> abline(v=1) 
> text(c(0.5,1.5,0.5,1.5), c(1.5,1.5,0.5,0.5), c("1","2","3","4"), cex=9) 
> plot(0, 
      type="n", 
      xlim=c(0,2), ylim=c(0,2), 
      xaxt="n", yaxt="n", 
      xlab="", ylab="", 
      main="(b) par(mfcol=c(2,2))", 
      col.main="blue") 
> abline(h=1) 
> abline(v=1) 
> text(c(0.5,0.5,1.5,1.5), c(1.5,0.5,1.5,0.5), c("1","2","3","4"), cex=9)  

Two things worth noticing above: (1) the mar graphical option, which adjusts (here, shrinks compared to the defaults) the graphics margins, and (2) the cex option, which adjusts the font size.

A.7 Regression again

Answer of 7.1. 

 
> print(load( 
  url("http://coltekin.net/cagri/R/data/ling-geo.rda") 
  ))  

Answer of 7.2. 

 
> cor(seg$pmi, seg$h) 
> cor(seg$pmi, seg$h, method="kendall") 
> cor(seg$pmi, seg$h, method="spearman")  

Nothing surprising here: all indicate that the measures are negatively correlated (pmi is a measure of co-occurrence, while h is a measure of surprise), and the non-parametric tests indicate a lower correlation. To check whether we can reliably use Pearson’s r, we need to check its assumptions, which we will leave to the regression analysis.

Answer of 7.3.  Again, nothing special.

 
> cor.test(seg$pmi, seg$h) 
> cor.test(seg$pmi, seg$h, method="kendall") 
> cor.test(seg$pmi, seg$h, method="spearman")  

Note, however, that the non-parametric tests give you approximate p-values because of ties (data points with pmi or h values resulting in the same rank).

Answer of 7.4. 

 
> cor(seg[,-c(1,2)])  

Answer of 7.5.  It is easy to obtain the summary, although it may take a while since our data set is rather large:

 
> summary(lm(ling ~ geo, data=lg))  

  1. The slope is highly significant, indicating that, although it is very small on the original scale, the effect is very unlikely to be due to chance.
  2. The intercept is the expected linguistic difference in the same location (when geographic distance is 0km).
  3. The r2 indicates that about 37% of the variance in the linguistic differences is explained by geographic distance.

Note, however, that these conclusions are not reliable unless we check the assumptions of the model.

Answer of 7.6.  The first one is easy:

 
> m <- lm(mot ~ chi, data=mlu, subset=-20)  

For the second one, you can calculate the mean() or median() of mlu$chi, and find the row index of the closest value in the data frame. The following solution does it in a slightly subtle way that you should try to understand:

 
> m <- lm(mot ~ chi, data=mlu, 
     subset=(chi != sort(chi)[sort(chi) > median(chi)][1]))  

When the outlier is removed, the model coefficients show a visible change. In our example you should see an intercept estimate of 5.54 and a slope estimate of 0.23, as opposed to 5.71 and 0.15 respectively in the full model. When we remove the non-influential (middle) data point, we get almost the same results as with the complete data. Furthermore, the model fit (r2) becomes much better when you remove an influential data point.
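
A quick way to compare the fits side by side (assuming m is the model without the outlier, as above):

 
> coef(lm(mot ~ chi, data=mlu))  # full data 
> coef(m)                        # outlier removed 
> summary(m)$r.squared  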

Answer of 7.7. 

 
> par(mfrow=c(2,2)) 
> plot(lm(ling ~ geo, data=lg), pch=".", col="gray")  

  1. The ‘residuals vs. fitted’ graph should indicate a clear sign of non-linearity.
  2. You should also observe in the ‘residuals vs. fitted’ graph that the variance is reduced as the fitted values increase.
  3. The normal Q-Q plot of the residuals also shows a clear pattern of divergence at the tails. For a smaller data set this might not be a clear indication, but for large data sets, like the one we use here, this is clearly a concern.
  4. Not really: the ‘residuals vs. leverage’ plot does not even include the 0.5 Cook’s distance contour line we observed earlier. Although there are points with higher leverage and large residuals, in large data sets an individual observation can seldom be very influential.

Answer of 7.8. 

 
> png(file="diagnostics.png", width=1024, height=768) 
> par(mfrow=c(2,2)) 
> plot(lm(ling ~ geo, data=lg), pch=".", col="gray") 
> dev.off()  

Answer of 7.9. 

 
> lg$log.geo <- log(lg$geo)  

Answer of 7.10.  The commands to use are the same as before. You should find an intercept of about 0.03, and a slope of 0.06, again with highly significant estimates.

  1. The conclusions do not change drastically. However, you should observe an increase in the r2, which is an indication of a better model fit.
  2. In all graphs you should see some improvement. In particular, there is no clear sign of non-linearity, and the variance should look more constant across the range of the predicted values.

Answer of 7.11. 

 
> plot(mot ~ chi, data=mlu) 
> m <- lm(mot ~ chi, data=mlu) 
> abline(m, col="red") 
> seq(min(mlu$chi), max(mlu$chi), length.out=20) -> x 
> y <- predict(m, newdata=data.frame(chi=x), interval="conf") 
> lines(x, y[,"lwr"], col="red", lty="dashed") 
> lines(x, y[,"upr"], col="red", lty="dashed")  

Answer of 7.12. 

 
> plot(ling ~ geo, data=lg, col="gray", pch=".", 
    xlab="Geographic distance (km)", 
    ylab="Linguistic difference") 
> m <- lm(ling ~ geo, data=lg) 
> mlog <- lm(ling ~ log.geo, data=lg) 
> lines(lg$geo, predict(m), col="red") 
> lines(lg$geo, predict(mlog), col="blue", lty="dashed") 
> legend("bottomright", 
    c("no transformation", "log transformed"), 
    col=c("red", "blue"), 
    lty=c("solid", "dashed"))  

Note that the line and the curve plotted by the lines() commands above work fine since the lg$geo values in this data set are sorted. Otherwise, we would either need to order the x and y values (see the sketch below), or alternatively, we could get predictions for new data points along the x axis.
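
A sketch of the ordering solution, in case you work with an unsorted predictor:

 
> ox <- order(lg$geo) 
> lines(lg$geo[ox], predict(m)[ox], col="red")  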

Answer of 7.13.  The answer is left as an exercise, noting that this exercise is almost the same as Exercise 7.11, except that you need to be careful not to use the same x values for prediction and plotting.

What you should observe is a confidence interval so tight around the prediction curve that it is not possible to see three separate lines.

A.8 Multiple Regression

Answer of 8.1. 

 
> load( 
url("http://coltekin.net/cagri/R/data/tv.rda"), 
  verbose=T 
  )  

Answer of 8.2.  Only the commands (without the output) are given below.

 
> lm(cdi ~ tv.hours, data=tv) -> m1 
> summary(m1) 
> par(mfrow=c(2,2)) 
> plot(m1) 
> par(mfrow=c(1,1)) 
> plot(cdi ~ tv.hours, data=tv) 
> abline(m1) 
> x <- min(tv$tv.hours):max(tv$tv.hours) 
> y <- predict(m1, newdata=data.frame(tv.hours=x), interval="conf") 
> lines(x, y[,"lwr"], lty="dashed", col="blue") 
> lines(x, y[,"upr"], lty="dashed", col="blue") 
> y <- predict(m1, newdata=data.frame(tv.hours=x), interval="pred") 
> lines(x, y[,"upr"], lty="dotted", col="green") 
> lines(x, y[,"lwr"], lty="dotted", col="green")  

Answer of 8.4. 

 
> m3 <- lm(cdi ~ tv.hours + mot.education, data = tv)  

The slope estimates of the multiple regression model are different from those of the corresponding single-predictor models. We also see that the estimates are slightly less certain in the multiple regression model (smaller t-values, larger p-values). Furthermore, r2, the variation explained by the multiple regression model, is less (by about 10%) than the sum of the variations explained by the individual predictors.

The explanation for all of this is collinearity. The predictors are correlated; hence, they share part of the variation explained by each other (so the r2’s do not sum up). This also means that the regression estimation cannot assign the credit (or blame) for some of the explained variation to a particular predictor (hence the higher variability and lower confidence in the parameter estimates).
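
You can see the overlap in the numbers. Assuming m1 and m2 are the single-predictor models fit earlier, the sum of their r2 values should be visibly larger than the r2 of the multiple regression model:

 
> summary(m1)$r.squared + summary(m2)$r.squared 
> summary(m3)$r.squared  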

Answer of 8.5. 

 
> par(mfrow=c(1,2)) 
> plot(cdi ~ tv.hours, data=tv) 
> abline(m1) 
> abline(a=coef(m3)[1], b=coef(m3)[2], col="red") 
> plot(cdi ~ mot.education, data=tv) 
> abline(m2) 
> abline(a=coef(m3)[1], b=coef(m3)[3], col="red")  

Answer of 8.6. 

 
> x <- seq(min(tv$tv.hours),max(tv$tv.hours), 
           length.out=20) 
> y <- seq(min(tv$mot.education),max(tv$mot.education), 
           length.out=20) 
> plot(0, type="n", 
       xlim=range(x), xlab="TV time", 
       ylim=range(y), ylab="Mother's education") 
> new.data=data.frame(tv.hours=rep(x,20), 
                      mot.education=rep(y, each=20)) 
> z <- predict(m3, newdata=new.data) 
> gray.levels <- 1 - ((z-min(z)) / (max(z)-min(z))) 
> points(new.data$tv.hours, 
         new.data$mot.education, 
         pch=22, bg=gray(gray.levels), cex=3)  

Answer of 8.7. 

 
> cor(tv$tv.hours, tv$mot.education) 
[1] -0.2920446  

This is not an alarmingly large correlation. In fact, there is no clear threshold after which you should be alarmed. In general, the larger the correlation, the larger the expected effect of collinearity. Furthermore, with more predictors, correlation is not an adequate tool for detecting multicollinearity, since the predictor of interest may correlate with a linear combination of more than one of the other predictors. We will later discuss other ways of detecting multicollinearity.

It is not a complete coincidence that the square of the above correlation coefficient is close to the difference between the sum of the r2’s from m1 and m2 and the r2 of the multiple regression model m3.

Answer of 8.8. 

 
> summary(lm(cdi ~  tv.hours + mot.education + I(tv.hours + mot.education), data=tv))  

In the output, you should see that the coefficient for the predictor I(tv.hours + mot.education) cannot be estimated and is reported as NA, which is a special value used for missing data.

Answer of 8.9. 

 
> vif(m3) 
     tv.hours mot.education 
     1.093243      1.093243  

Note that since we have only two predictors, both have the same VIF value. This is simply because r2 is the square of the correlation coefficient, and correlation is symmetric.

For the two-predictor case, we could alternatively calculate it with

 
> 1/(1 - cor(tv$tv.hours, tv$mot.education)^2) 
[1] 1.093243  

Answer of 8.10. 

 
> tv$rnd.1 <- runif(80, 0, 1) 
> tv$rnd.2 <- runif(80, 0, 1) 
> summary(lm(cdi ~ tv.hours + mot.education 
                 + rnd.1 + rnd.2, data=tv)) 
[...] 
Coefficients: 
              Estimate Std. Error t value Pr(>|t|) 
(Intercept)   99.29090    0.82075 120.976  < 2e-16 
tv.hours      -0.11671    0.03047  -3.831 0.000264 
mot.education  0.13000    0.03475   3.741 0.000357 
rnd.1         -0.32504    0.39126  -0.831 0.408758 
rnd.2          0.37200    0.44865   0.829 0.409645 
 
Residual standard error: 1.057 on 75 degrees of freedom 
Multiple R-squared: 0.3557, Adjusted R-squared:  0.3213 
F-statistic: 10.35 on 4 and 75 DF,  p-value: 9.951e-07  

As expected, the effects of the new variables are non-significant. The values you get will be different since our predictors are randomly generated.

Remember that the model without rnd.1 and rnd.2 had r2 = 0.3444 and r̄2 = 0.3274.

Answer of 8.11.  The update command is:

 
> update(m3, . ~ . + daycare.hours) -> m4  

The model fit is slightly better. The increase in r2 is expected, since it will increase with any additional predictor. However, since r̄2 is also (slightly) higher, the increase in r2 is probably not just a chance effect.
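
A quick way to inspect both quantities for the two models (r.squared and adj.r.squared are components of the summary object):

 
> summary(m3)$r.squared; summary(m4)$r.squared 
> summary(m3)$adj.r.squared; summary(m4)$adj.r.squared  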

The summary (not presented above) should tell you that the effect of the new predictor is not statistically significant in this model (given other predictors).

Answer of 8.12. 

 
> m0 <- lm(cdi ~ 1, data=tv)  

The single coefficient estimated (the intercept) should be the same as mean(tv$cdi).

The standard error of the intercept that you can see in the summary of the model is the standard error of the mean. (You are encouraged to calculate this manually and compare.)

Answer of 8.13. 

 
> anova(m0, m1) 
Analysis of Variance Table 
 
Model 1: cdi ~ 1 
Model 2: cdi ~ tv.hours 
  Res.Df    RSS Df Sum of Sq      F    Pr(>F) 
1     79 129.99 
2     78 100.33  1    29.662 23.061 7.438e-06 ***  

The addition of the predictor causes a statistically significant reduction in the residual sum of squares. The F-test reported in the summary of a linear regression model is effectively the same test.

Answer of 8.14. 

 
> anova(m1, m3) 
Analysis of Variance Table 
Model 1: cdi ~ tv.hours 
Model 2: cdi ~ tv.hours + mot.education 
  Res.Df     RSS Df Sum of Sq      F    Pr(>F) 
1     78 100.325 
2     77  85.222  1    15.104 13.647 0.0004106 *** 
> anova(m3, m4) 
Analysis of Variance Table 
Model 1: cdi ~ tv.hours + mot.education 
Model 2: cdi ~ tv.hours + mot.education + daycare.hours 
  Res.Df    RSS Df Sum of Sq      F Pr(>F) 
1     77 85.222 
2     76 83.971  1    1.2503 1.1316 0.2908  

The first ANOVA indicates that the amount of reduction in the residual sum of squares (15.104) is unlikely to be by chance; the F-test yields a very low p-value. The second one, on the other hand, indicates that the small reduction in the residual sum of squares due to the new variable daycare.hours is not statistically significant.

Answer of 8.15. 

 
> anova(m4)  

You should check which F-values match with the F-values from the earlier exercises.

Answer of 8.16. 

 
> AIC(m1,m3,m4) 
   df      AIC 
m1  3 251.1416 
m3  4 240.0884 
m4  5 240.9060  

We observe a decrease in AIC when adding mot.education to m1 (m3 is better). However, adding daycare.hours increases the AIC value. The best model among the ones we compared, according to AIC, is m3.

Answer of 8.17.  step(m4) will present you with the steps taken and the choice made at the end. Note that the direction of the search (backward, forward, or both) may change the result.

A.9 Probability distributions

Answer of 10.1. 

 
> pnorm(3000, mean=3500, sd=200) 
[1] 0.006209665 
> 1 - pnorm(4000, mean=3500, sd=200) 
[1] 0.006209665 
> pnorm(4000, mean=3500, sd=200) - pnorm(3000, mean=3500, sd=200) 
[1] 0.9875807 
> dnorm(3400, mean=3500, sd=200) 
[1] 0.001760327 
> qnorm(0.01, mean=3500, sd=200) 
[1] 3034.73 
> qnorm(0.005, mean=3500, sd=200); qnorm(1-0.005, mean=3500, sd=200) 
[1] 2984.834 
[1] 4015.166  

Answer of 10.2. 

 
> x <- seq(-4, 4, by=0.1) 
> plot(x, pnorm(x), type="l", ylab="CDF(x)") 
> lines(x, pt(x, df=3), col="red") 
> lines(x, pt(x, df=10), col="blue") 
> lines(x, pt(x, df=30), col="orange") 
> legend("bottomright", 
    c("normal", "t(3)", "t(10)", "t(30)"), 
    col=c("black", "red", "blue", "orange"), 
    lty="solid")  

Answer of 10.3. 

 
> sample <- rbinom(200, size=100, p=0.55) 
> hist(sample, probability=T, 
+      xlim=c(0,100), ylim=c(0,0.1)) 
> curve(dbinom(x, size=100, p=0.55), 
+       add=T, xlim=c(0,100))  

A few points to note here:

Answer of 10.4.  You should be drawing histograms with

 
> hist(rbinom(N,size=20, p=0.5))  

for increasing N. Your results will differ even for the same value of N, since rbinom() produces random samples. But you should clearly observe that as you increase N, the histogram resembles the bell curve more and more.

Answer of 10.5.  Compared to Exercise 10.4, you should find it more difficult to convince yourself that the distribution looks normal. Binomial distributions with extreme p parameters are more skewed, and more difficult to approximate with the normal distribution.

Nevertheless, here is an example with sample size 100000:

 
> x <- rbinom(100000,size=20, p=0.9) 
> hist(x, probability=T) 
> lines(min(x):max(x), dnorm(min(x):max(x), mean=mean(x), sd=sd(x)))  

Answer of 10.6. 

 
> plot(1:50, dpois(1:50, lambda=3), type="l") 
> lines(1:50, dpois(1:50, lambda=10), type="l") 
> lines(1:50, dpois(1:50, lambda=30), type="l")  

Answer of 10.7. 

 
> N=10000 
> sample.n <- rnorm(N) 
> sample.t <- rt(N, df=3) 
> sample.f <- rf(N, df1=3, df2=10) 
> sample.l <- rlnorm(N) 
> sample.p <- rpois(N, lambda=3) 
> sample.b <- rbinom(N, size=10, p=0.1) 
> qqnorm(sample.n, main="normal"); qqline(sample.n) 
> qqnorm(sample.t, main="t"); qqline(sample.t) 
> qqnorm(sample.f, main="F"); qqline(sample.f) 
> qqnorm(sample.l, main="log normal"); qqline(sample.l) 
> qqnorm(sample.p, main="Poisson"); qqline(sample.p) 
> qqnorm(sample.b, main="binomial"); qqline(sample.b)  

A.10 Logistic Regression

Answer of 11.1. 

 
> seg <- read.csv("http://coltekin.net/cagri/R/data/seg-large.csv") 
> seg$utterance <- factor(seg$utterance) 
> seg$phoneme <- factor(seg$phoneme)  

The last line is most probably not necessary.

Answer of 11.2. 

 
> or <- read.csv("http://coltekin.net/cagri/R/data/past-tense-or.csv") 
> or$correct <- 1 - (or$n.or / or$n.past)  

Answer of 11.3.  The diagnostic plots you get with

 
> plot(lm(correct ~ age, data=or))  

should indicate that data point 68 is the most influential observation.

The command

 
> or.ols <- lm(correct ~ age, data=or, subset=-68)  

fits the model without the outlier.

If you plot the diagnostics again, you should still observe that the variance is not constant. The Q-Q plot indicates some non-normality, and we can also observe some slight non-linearity in the ‘Scale-location’ graph.

Answer of 11.4. 

 
> predict(or.ols, 
          newdata=data.frame(age=c(1.5*12, 8*12))) 
        1         2 
0.8072069 1.0024616  

Answer of 11.5. 

 
> p <- seq(0,1, 0.01) 
> plot(p, log(p/(1-p)), type="l")  

Answer of 11.6. 

 
> or$log.odds <- log(or$correct/(1 - or$correct)) 
> or.logit <- lm(log.odds ~ age, data=or, subset=-68) 
> summary(or.logit) 
> plot(or.logit)  

The coefficients now reflect the linear equation predicting log odds. If we want to predict probabilities, we need to calculate the log odds for the given age and take the exponent (to cancel the log); then the expected probability of correct responses for the given age can be calculated as p = odds / (1 + odds) (it is a simple arithmetic exercise to get this equation from the definition of the odds).
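
Incidentally, R has this inverse-logit transformation built in: plogis() computes exp(x)/(1 + exp(x)), so the odds arithmetic can be avoided. A sketch (the age value here is arbitrary):

 
> plogis(0) 
[1] 0.5 
> plogis(predict(or.logit, newdata=data.frame(age=60)))  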

In this model the relation between the probability and the age is non-linear.

The transformation seems to improve the model diagnostics. In particular, the variance looks more constant now (although you should see clear differences between the residuals below 0 and above 0). We still see the effects of non-linearity and non-normality in the graphs.

Answer of 11.7.  We do it piece-by-piece for demonstration:

 
> logit <- predict(or.logit, 
                  newdata=data.frame(age=12*(1:10))) 
> odds <- exp(logit) 
> odds / (1 + odds)  

The first line gets the predictions (log odds) for each age, the second line takes the exponent (undoes the log), and the last line converts the odds to probabilities.

Answer of 11.8.  The command is almost the same.

 
> summary(glm(cbind(n.past-n.or, n.or) ~ age, 
            family=quasibinomial, data=or, subset=-68))  

In the summary, you should notice that an estimated ‘dispersion’ parameter is used instead of 1. This affects our model only marginally, since we do not have real overdispersion here. Since the estimated dispersion parameter is less than one, the standard errors of the coefficients are slightly tighter. In most cases (when there is overdispersion), quasibinomial will result in less certain coefficient estimates.

Answer of 11.9. 

 
The command

 
> 1 - predict(or.glm, type="response", 
        newdata=data.frame(age=12*(1:10))) 

should give you the result as a vector.

Answer of 11.10. 

 
> plot(correct ~ age, data = or) 
> x <- min(or$age):max(or$age) 
> y <- predict(or.glm, type="response", 
        newdata=data.frame(age=x)) 
> lines(x, y)  

Note that the curve you see is only part of the logistic curve. Although it is not useful in this case, you are encouraged to try the plot with a larger age range, e.g., -200:200, to see the complete curve.

Answer of 11.11. 

 
> seg.pmi <- glm(boundary ~ pmi, data=seg, family=binomial) 
> plot(boundary ~ pmi, data = seg) 
> x <- seq(min(seg$pmi), max(seg$pmi), by=0.2) 
> y <- predict(seg.pmi, type="response", newdata=data.frame(pmi=x)) 
> lines(x, y)  

Again, the difference between the residual deviance and the degrees of freedom is not large (we even have a bit of underdispersion). However, in general, overdispersion does not occur with binary response variables.

Answer of 11.12.  Here are three ways to get the accuracy (all result in the same value):

 
> success <- seg$boundary == (predict(seg.pmi, type="response") > 0.5) 
> length(success[success == TRUE])/length(success) 
[1] 0.772229 
> length(success[success])/length(success) 
> sum(success)/length(success)  

The first method (the length() version on the second line) is more descriptive, but understanding the tricks used in the more compact notations is worth the effort.
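
Yet another equivalent shortcut relies on the fact that logical values count as 1 and 0 when averaged:

 
> mean(success) 
[1] 0.772229  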

Answer of 11.13. 

 
> length(seg$boundary[seg$boundary == F]) / 
  length(seg$boundary) 
[1] 0.6394641  

and the trick to compact it:

 
> sum(!seg$boundary)/length(seg$boundary)  

The result is again the same. Indeed, our model performs better than the baseline (we will soon test our confidence in this).

Answer of 11.14. 

 
> seg.full <- glm(boundary ~ ., data=seg, family=binomial) 
> summary(seg.full)  

The first thing to note here is that we get a warning about fitted probabilities that are numerically equal to 0 or 1. By itself, this is not necessarily alarming. In this case, it only means that some of the estimated probabilities are equal to 0 or 1 within the available numeric precision. Here we will not worry about this, but if you get such warnings from R, you should understand what is going on. We will shortly see an example where the warning is important.

The GLM summary, otherwise, is too crowded to try to interpret. We will first simplify the model.

Answer of 11.15. 

 
> anova(seg.pmi, seg.full, test="Chisq")  

Answer of 11.16. 

 
> step(seg.full)  

should present you with the stepwise elimination beginning from the full model. The result indicates that utterance and h were dropped. The first one is a clear choice, as the utterance number has nothing to do with the word boundaries. The second one is due to high multicollinearity (mainly the high correlation between h and sv; you are encouraged to check this with cor() and/or vif() from the car package, as sketched below).
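
A sketch of the suggested checks (vif() comes from the car package, which may need to be installed first):

 
> cor(seg[, c("pmi", "h", "rh", "sv")]) 
> library(car) 
> vif(seg.full)  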

Answer of 11.17. 

 
> success.model <- seg$boundary == 
                (predict(seg.model, type="response") > 0.5) 
> sum(success.model)/length(success.model) 
[1] 0.9123021 
> success.pmi <- seg$boundary == 
                (predict(seg.pmi, type="response") > 0.5) 
> fisher.test(success.model, success.pmi, 
              alternative="greater")  

Fisher’s test indicates that the difference between these two models is very unlikely to be due to chance.

Note that you should not interpret this as an indication that the larger model is better outside this data set. It just means that the correct/incorrect decisions made by the two models on this data set are unlikely to be the same.

Answer of 11.18. 

 
> seg.half <- glm(boundary ~ pmi + h + rh, 
                 family = binomial, 
                 data=seg, subset=(utterance < 50)) 
> success.training <- seg$boundary[seg$utterance < 50] == 
               (predict(seg.half, type="response") > 0.5) 
> sum(success.training)/length(success.training) 
[1] 0.8495146 
> success.test <- seg$boundary[seg$utterance >= 50] == 
    (predict(seg.half, 
             newdata=seg[seg$utterance >= 50,], 
            type="response") > 0.5) 
> sum(success.test)/length(success.test) 
[1] 0.799511  

As expected, the accuracy is better on the training data (the data we used for fitting the model) compared to the novel data. This is a sign of overfitting: the model does not only capture the systematic relationship, but also some of the noise in the data.

Variations of this procedure are common in the statistical machine learning literature to protect against overfitting.

Answer of 11.19.  The R commands do not change much, so they are not included here.

The accuracy values you find should be surprisingly close (the accuracy on the test data is even higher). This means that the model did not overfit at all (thanks to the small number of predictors).

A.11 Multilevel / mixed-effect models

Answer of 12.1. 

> load(url("http://coltekin.net/cagri/R/data/par.rda"), verbose=T)

Answer of 12.2. 

 
> par <- merge(merge(par.subj, par.item), par.data)  

Answer of 12.3.  There is nothing special about the syntax, we have done this many times:

 
> summary(nb.lm <- lm(rate ~ language, data=newborn))  

We also know the relation between the t-test and ordinary linear regression, e.g., from Exercise 9.2. The intercept in the linear regression will be the mean of one of the groups, the slope will indicate the difference between them, and the p-values will be the same (to make sure that the analyses are equivalent, you should disable the ‘Welch correction’ in t.test() with var.equal=T).

Noting the residual variance (or standard deviation) reported in the lm() output is important for comparing this exercise to the ones that follow.

Answer of 12.4. 

 
> plot(rate ~ jitter(c(0,1)[language], amount=0.05), 
       data=newborn) 
> abline(nb.lm)  

Answer of 12.5.  You will find that the sum of the participant and residual variances in nb.lmer1 is exactly the same as the residual variance of the model nb.lm (remember that the variance is the square of the standard deviation).
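One way to check this (a sketch; it assumes lme4 is loaded and nb.lmer1 was fit as in the exercise):

 
> print(VarCorr(nb.lmer1), comp="Variance")   # participant and residual variances 
> sum(as.data.frame(VarCorr(nb.lmer1))$vcov)  # their sum ... 
> sigma(nb.lm)^2                              # ... should equal lm's residual variance  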

Answer of 12.6. 

 
> library(lattice) 
> dotplot(ranef(nb.lmer1, condVar=T))  

Answer of 12.7. 

 
> fixef(nb.lmer1)[1] + ranef(nb.lmer1)$participant[1,] 
    22.7941  

This returns the estimated intercept for participant number 1, which should correspond to the estimated sucking rate of this baby while listening to the foreign-language input. We add the fixed slope to this value to obtain the rate for the native-language input.

 
> fixef(nb.lmer1)[1] + fixef(nb.lmer1)[2] + ranef(nb.lmer1)$participant[1,] 
   27.31777  

Repeating it for participant 3 in a somewhat compact notation, we get:

 
> c(fixef(nb.lmer1)[1], sum(fixef(nb.lmer1))) + ranef(nb.lmer1)$participant[3,] 
   49.07146    53.59512  

To compare it with the observed values, one way to get the information we need is:

 
> newborn[newborn$participant %in% c(1,3),] 
  participant language  rate 
1           1   native 29.01 
2           1  foreign 20.06 
5           3   native 50.92 
6           3  foreign 53.73  

The values are definitely not the same. The differences in the native condition (the slope) are expected, since we estimate a single slope for all babies. The difference between the intercept estimates and the observed values is more interesting here. The important thing to notice is that participant 1 has a rather small observed intercept (20.06), while participant 3 has a rather large one (53.73). The estimated values, 22.79 and 49.07, are not too far from the observed values. However, they are pulled towards the common mean.
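You can check what the estimates are pulled towards: with these data, the fixed intercept is (approximately) the mean foreign-language rate over all babies (a quick check, using the same objects as above):

 
> mean(newborn$rate[newborn$language == "foreign"])  # the common mean 
> fixef(nb.lmer1)[1]                                 # cf. the fixed intercept  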

Answer of 12.8. 

 
> plot(rate ~ jitter(c(0,1)[language], amount=0.05), 
        xaxt="n", xlab="language", data=newborn) 
> axis(1, at=c(0,1), labels=c("foreign", "native")) 
> abline(nb.lm) 
> abline(a=fixef(nb.lmer1)[1] + 
           ranef(nb.lmer1)$participant[1,], 
         b=fixef(nb.lmer1)[2], col="red") 
> abline(a=fixef(nb.lmer1)[1] + 
           ranef(nb.lmer1)$participant[3,], 
         b=fixef(nb.lmer1)[2], col="blue") 
> x <- c(0,1,0,1) 
> y <- with(newborn, 
     c(rate[participant == 1 & language == "foreign"], 
       rate[participant == 1 & language == "native"], 
       rate[participant == 3 & language == "foreign"], 
       rate[participant == 3 & language == "native"])) 
> points(x, y, pch=21, 
         bg=c("red", "red", "blue", "blue"))  

(the last part, plotting the colored points, can be done with a more compact syntax)

Answer of 12.9. 

 
> nb.lm2 <- lm(rate ~ language + participant, data=newborn) 
> summary(nb.lm2)  

The slope should be the same as in nb.lm, and almost the same as in nb.lmer1. The intercept now is the ‘foreign’ rate for the first participant. The new (slope) coefficients for participant are the differences between the respective participants and the intercept term.

Adding the subject variable reduces the residual variation. In fact, it is now the same as the residual variance in nb.lmer1.

As in the other models, the slope estimate of language is the same for all subjects. The intercept is now associated with the first subject, and to find the ‘intercept’ of the third subject we need to add the corresponding coefficients.

 
> coef(nb.lm2)[1]; coef(nb.lm2)[1] + coef(nb.lm2)["participant3"]  

Answer of 12.10. 

 
> abline(a=coef(nb.lm2)[1], 
         b=coef(nb.lm2)["languagenative"], 
         col="red", lty="dotted") 
> abline(a=coef(nb.lm2)[1]+coef(nb.lm2)["participant3"], 
         b=coef(nb.lm2)["languagenative"], 
         col="blue", lty="dotted")  

You should notice that the lines from the mixed-effect model are closer to the lm() estimate (which ignores the subject variation) than the lines from the model that includes participants as fixed effects.

Answer of 12.11. 

 
> bl.lmer2 <- lmer(mlu ~ language + (language|subj), data=bilingual, REML=F) 
> bl.lmer3 <- lmer(mlu ~ language + (1|subj), data=bilingual, REML=F)  

Summaries of the model fits, or running AIC(bl.lmer3, bl.lmer2), should indicate that bl.lmer3 has the smaller AIC value (remember that we prefer models with smaller AIC values).

Answer of 12.12. 

 
> anova(bl.lmer3, bl.lmer2)  

Remember that the convention is to specify the smaller model first, but the order does not change the results. The test above returns a p-value of 0.6604, indicating that there is not enough evidence in the data in support of the more complex model.

Answer of 12.13. 

 
> dotplot(ranef(bl.lmer2, condVar=T))  

Answer of 12.14. 

 
> nb.lmer2 <- lmer(rate ~ 1 + (1|participant), data=newborn) 
> anova(nb.lmer2, nb.lmer1)  

You should see that the reported p-value is really small (6.293 × 10⁻⁶), indicating that the predictor has a statistically significant effect.

Answer of 12.15. 

 
> confint(profile(nb.lmer1)) 
> confint(profile(nb.lmer1), level=0.99)  

In both cases you should observe that the confidence interval for languagenative does not include 0. Hence, you can reject the null hypothesis (that the babies react to both stimuli the same way) at level p < .01 (and of course at p < .05). It is not surprising that you get this value given the p-value found in Exercise 12.14.

The standard deviation of the random effect, participant, is reported as .sig01. You should observe that the calculated intervals do not include 0 for the random effect either.
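If you want only that interval, you can index the result by its row name (the name .sig01 is taken from the output above):

 
> # extract just the random-effect interval 
> confint(profile(nb.lmer1))[".sig01",]  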

Answer of 12.16.  First we fit a series of models:

 
> bl.la <- lmer(mlu ~ language*age + (language+age|subj), data=bilingual, REML=F) 
> bl.l <- lmer(mlu ~ language*age + (language|subj), data=bilingual, REML=F) 
> bl.a <- lmer(mlu ~ language*age + (age|subj), data=bilingual, REML=F) 
> bl.i <- lmer(mlu ~ language*age + (1|subj), data=bilingual, REML=F)  

The first line fits random intercepts and slopes for both language and age, the second line fits random slopes only for language, the third line fits random slopes only for age, and the last line fits a random-intercepts-only model. Random intercepts are implicitly included in the first three models.

Checking AIC values

 
> AIC(bl.i, bl.a, bl.l, bl.la) 
      df      AIC 
bl.i   8 370.7229 
bl.a  13 375.8502 
bl.l  10 373.0892 
bl.la 17 375.9423  

indicates that there is no point in using the larger models. The simplest, random-intercepts-only model is the best.

For the hypothesis testing part,

 
> confint(profile(bl.i))  

indicates that the confidence intervals of the intercept, agesecondgrade, and the interaction term languageschool:agesecondgrade do not include 0; hence these effects are statistically significant at the given level.

We will skip the full interpretation here, but you should try to interpret the result.

Answer of 12.17. 

 
> bl.i2 <- update(bl.i, . ~ . + gender) 
> confint(profile(bl.i2))  

The confidence interval for genderM contains 0. Hence we conclude that gender is not significant.

Answer of 12.18. 

 
> par.m1 <- lmer(rate ~ context + (context|subject) + (context|item), data=par) 
> summary(par.m1) 
... 
Random effects: 
 Groups   Name        Variance Std.Dev. Corr 
 subject  (Intercept) 1.34353  1.1591 
          contextpp   0.01033  0.1017   0.76 
          contextip   0.01895  0.1377   0.63  0.98 
 item     (Intercept) 6.91181  2.6290 
          contextpp   0.01416  0.1190   -0.41 
          contextip   0.05385  0.2321   -0.19  0.97 
 Residual             0.97588  0.9879 
Number of obs: 900, groups: subject, 30; item, 10 
Fixed effects: 
            Estimate Std. Error t value 
(Intercept)  4.26505    0.85978   4.961 
contextpp    0.79375    0.09092   8.730 
contextip   -1.00695    0.11190  -8.998 
...  

In the listing above, only part of the output is included. Note that we have random intercepts and two random slopes for both grouping factors. The correlations between the random effects are also modeled.

In the fixed-effects listing, (Intercept) indicates the average speech rate for an average speaker (subject) and an average phrase (item) in our base-level context, in this case a parenthetical (par). The slopes for pp and ip indicate the differences from the base level. Again, the slope estimates here are the average estimates over speakers and phrases.
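Putting these together, the predicted average rate for each context can be read off the fixed effects (a sketch, using the coefficient names from the summary above):

 
> fixef(par.m1)[1]                                 # parenthetical (base level) 
> fixef(par.m1)[1] + fixef(par.m1)["contextpp"]    # pp 
> fixef(par.m1)[1] + fixef(par.m1)["contextip"]    # ip  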

Answer of 12.19. 

 
par.m1 <- lmer(rate ~ context + (context|subject) + (context|item), data=par, REML=F) 
par.m2 <- lmer(rate ~ context + (context|subject) + (1|item), data=par, REML=F) 
par.m3 <- lmer(rate ~ context + (1|subject) + (context|item), data=par, REML=F) 
par.m4 <- lmer(rate ~ context + (1|subject) + (1|item), data=par, REML=F) 
par.m5 <- lmer(rate ~ context + (1|subject) , data=par, REML=F) 
par.m6 <- lmer(rate ~ context + (1|item), data=par, REML=F) 
AIC(par.m1, par.m2, par.m3, par.m4, par.m5, par.m6)  

The best model suggested by the lowest AIC value is par.m4.
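Since par.m4 is nested in par.m1, you can corroborate this conclusion with a likelihood-ratio test (a sketch; both models were fit with ML above):

 
> anova(par.m4, par.m1)  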

Answer of 12.20.  Fitting the model is simple:

 
> par.m4a <- lmer(rate ~ context + length + (1|subject) + (1|item), data=par, REML=F)  

or

 
> par.m4a <- update(par.m4, . ~ . + length)  

Now we have a new fixed effect for length, which indicates a higher speech rate for longer phrases. However, the intercept estimate also changes drastically, since the intercept now corresponds to a parenthetical with length 0. Furthermore, the intercept estimate becomes less certain.

Answer of 12.21. 

 
> par.m4a <- update(par.m4, . ~ . + scale(length))  

The intercept now corresponds to a parenthetical phrase of average length. You should observe that the standard error of the intercept (in comparison to the model par.m4) is smaller now, indicating that our estimate of the intercept has improved.
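If you want to see what scale() does, it is just centering and dividing by the standard deviation (a quick check; scale() returns a one-column matrix, hence the as.vector()):

 
> z <- (par$length - mean(par$length)) / sd(par$length) 
> all.equal(z, as.vector(scale(par$length)))  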

Answer of 12.22. 

 
> par.m4b <- lmer(rate ~ context + scale(age) + sex + scale(length) + (1|subject) + (1|item), data=par, REML=F)  

The intercept now corresponds to the speech rate of an average-length parenthetical phrase for an average-aged female subject. The intercept estimate should be more certain than in the earlier models, and the subject variation (the standard deviation of the random intercept due to subject) should also be smaller than before.

Answer of 12.23. 

 
> dotplot(ranef(par.m4, whichel="subject", condVar=T)) 
> dotplot(ranef(par.m4b, whichel="subject", condVar=T))  

If you pay attention to the x-axis, you will see that the overall subject variation is reduced considerably by the predictors age and sex.

Answer of 12.24. 

 
> profile.m4b <- profile(par.m4b) 
> confint(profile.m4b, level=0.99)  

You should observe that none of the confidence intervals include 0. Hence, all effects are significant at α-level 0.01.