3 Linear regression: a first introduction

This section gives a brief introduction to simple linear regression. We will return to the topic in almost all of the sections that follow.

The typical application of linear regression is when you have a continuous outcome and one or more continuous predictors. In this section, we consider the case with only one predictor. First, we describe the data that we will work with.

During language acquisition, caregivers are claimed to adapt their language to the language abilities of their children. We follow a particular child and record an hour-long conversation between the child and the mother once every month between the child’s second and fourth birthdays. For each recording session, we calculate a well-known measure of children’s language complexity/competence, the Mean Length of Utterance (MLU), for both the child and her mother. The data is real (from the CHILDES database). However, it is still a ‘toy’ data set, and you should be careful not to make generalizations based on this data.

Here is how to create our data set (you can copy & paste):

 
mlu <- data.frame(
    age=seq(25, 48),    # the child's age in months
    chi=c(1.46, 1.41, 1.66, 1.74, 1.90, 1.91, 1.85, 2.06,    # the child's MLU
          2.27, 2.43, 2.70, 2.81, 2.69, 2.72, 2.64, 3.05,
          3.22, 3.42, 3.70, 3.90, 3.57, 3.49, 3.66, 3.64),
    mot=c(5.42, 5.69, 6.27, 6.10, 6.06, 5.98, 6.10, 6.09,    # the mother's MLU
          6.10, 6.14, 6.42, 6.35, 6.21, 6.07, 5.84, 6.17,
          5.74, 6.11, 6.41, 5.50, 6.00, 6.90, 6.65, 6.40)
)

This time we have created three vectors and wrapped them into a data frame. Data frames are the data structure we will use most of the time, although we will rarely create them by hand as we did above. The way we specified the chi and mot variables (vectors) should be familiar. We created age using seq(), which in this example returns a vector of the integer values from 25 to 48 (the age of the child in months). The resulting data frame has the corresponding values of each variable (or vector) on the same row.
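Once the data frame is created, it is often useful to get a quick overview of it. Here is a minimal sketch, assuming the mlu data frame defined above; head(), str(), summary() and nrow() are standard R functions for a first look at a data set:

head(mlu, 3)   # the first three rows of the data frame
str(mlu)       # variable names, types, and a preview of the values
summary(mlu)   # a numeric summary of each column
nrow(mlu)      # the number of rows (observations); 24 in this case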

The following listing gives some examples of how to access individual columns, rows and items in a data frame.

 
> mlu$age 
  [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 
 [16] 40 41 42 43 44 45 46 47 48 
> mlu[2,2] 
  [1] 1.41 
> mlu[,1] 
  [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 
 [16] 40 41 42 43 44 45 46 47 48 
> mlu[2,] 
  age  chi  mot 
2  26 1.41 5.69 
> mlu$chi[2] 
[1] 1.41  

In a nutshell, you can extract individual variables (or columns) from a data frame using the $ notation. You can also select any individual cell with the notation mlu[r,c], where r refers to the row number and c to the column number. If you leave either the row or the column number unspecified, you get the complete column or the complete row, respectively (note that you keep the comma ‘,’ in both cases). You can also mix and match the dollar notation and vector indexing, as the short example below illustrates.

You should study each line and make sure you understand what it means. We will practice more complex ways of accessing and manipulating data in data frames later. However, the basics above will be used rather frequently.
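For example, here is a minimal sketch of mixing the two indexing styles (assuming the mlu data frame created above; the row ranges and the condition are arbitrary choices for illustration):

mlu$chi[2:5]              # the child's MLU for rows 2 to 5
mlu[1:3, c("age", "chi")] # the first three rows of the age and chi columns
mlu[mlu$age > 45, ]       # all rows where the child is older than 45 months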

3.1 Some preliminaries

The basic way of visualizing two related sets of numerical data is to plot them using a scatter plot.

Exercise 3.1.  Produce a scatter plot visualizing the effect of the child’s MLU on the mother’s MLU. Remember that we put the predictor (or independent variable) on the x-axis.

In simple linear regression, our aim is to find the best linear equation that fits the data points. The general form of the equation is

y_i = a + b x_i + 𝜖_i

where y and x are the variables of interest, the index i ranges over the observations, and a and b are called the coefficients, or individually the intercept and the slope, respectively. The term 𝜖 reflects the fact that our best estimates of a and b will not result in perfect prediction of y_i from x_i for every observation. In other words, 𝜖_i is the error made by the model for observation i. Our best estimate is the one that makes the least error (for some definition of ‘least error’). We will come back to the proper estimation of the regression equation shortly, after some exercises with drawing lines in R.
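As a small numeric illustration (the values of a and b here are arbitrary guesses, not estimates): for the first observation in our data the child’s MLU is x_1 = 1.46 and the mother’s MLU is y_1 = 5.42. With a = 5 and b = 0.5, the model predicts 5 + 0.5 × 1.46 = 5.73, so the error for this observation is 𝜖_1 = 5.42 − 5.73 = −0.31.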

We already used the function abline() to draw the line estimated by lm(). If you give two numeric arguments to the abline() function (instead of the lm() result), it takes the first argument as the intercept (a) and the second one as the slope (b), and draws the corresponding line (hence the name ‘abline’). For example, abline(0, 1) will draw a line with intercept 0 and slope 1 (a line that passes through the origin at a 45-degree angle).
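A minimal sketch, assuming the scatter plot from Exercise 3.1 is already drawn; the second call below uses an arbitrary intercept and slope, purely for illustration:

abline(0, 1)     # the line with intercept 0 and slope 1, as in the text
abline(5, 0.5)   # another arbitrary candidate: intercept 5, slope 0.5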

Exercise 3.2.  Using the scatter plot you produced in Exercise 3.1, try to draw the line that you think fits the data best. Do not use lm() to estimate the line yet; instead, draw multiple straight lines using abline() until you are convinced that you have found the best line.

Exercise 3.3.  Many plotting commands in R accept a ‘col’ argument for specifying the color of the objects drawn. Similarly, you can specify the width of a line using the argument ‘lwd’. Redraw the best-fitting line from Exercise 3.2. Make sure the line is ‘red’ and it is twice as thick as the standard lines.

We already know that the function lm() in R finds the best line using least-squares regression. In fact, this function can estimate any general linear model, and we will use it in most of the sections that follow for different analyses. First, here is a simple call to lm() (the output is slightly edited to save space):

 
> lm(mlu$mot ~ mlu$chi) 
Call: lm(formula = mlu$mot ~ mlu$chi) 
Coefficients: 
(Intercept)      mlu$chi 
     5.7133       0.1503  

The estimated intercept is 5.7133 and the slope is 0.1503. The intercept represents the expected value of the mother’s MLU when the child’s MLU is 0. Although this is a reasonable quantity to predict, our sample would not allow us to predict it reliably (why?). The slope, on the other hand, tells us that for every unit increase in the child’s MLU, we expect the mother’s MLU to increase by about 0.15.
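To make the interpretation concrete, here is a small worked example using the estimated coefficients; the child MLU values of 3.0 and 2.0 are arbitrary choices:

# predicted mother's MLU when the child's MLU is 3.0
5.7133 + 0.1503 * 3.0   # about 6.16
# predicted mother's MLU when the child's MLU is 2.0
5.7133 + 0.1503 * 2.0   # about 6.01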

Exercise 3.4. Draw the estimated regression line in blue over the lines you drew in Exercise 3.2 and Exercise 3.3. Compare it with the line you estimated.

The estimated regression line tells us what we found in our data. However, it does not tell us anything about the generalizability of these results beyond our sample. The lm() function does more than what we see when we simply run it: it returns an R object that we can investigate further, and the inferential question we have just raised can be answered by asking for a summary(), as shown in Listing 5.


Listing 5: Summary of a linear regression fit.
 1  > m <- lm(mlu$mot ~ mlu$chi)
 2  > summary(m)
 3  Call: lm(formula = mlu$mot ~ mlu$chi)
 4  Residuals:
 5       Min       1Q   Median       3Q      Max
 6  -0.79928 -0.14665  0.06142  0.14003  0.66232
 7  Coefficients:
 8              Estimate Std. Error t value Pr(>|t|)
 9  (Intercept)   5.7133     0.2326  24.559   <2e-16
10  mlu$chi       0.1503     0.0839   1.791   0.0871
11
12  Residual standard error: 0.3182 on 22 degrees of freedom
13  Multiple R-squared:  0.1272,    Adjusted R-squared:  0.08757
14  F-statistic: 3.207 on 1 and 22 DF,  p-value: 0.08708

First, instead of using the lm() output directly, we save the ‘model object’ it returns in a variable. In line 2, we ask for a summary of the model using the variable ‘m’. The same could be achieved with the command ‘summary(lm(mlu$mot ~ mlu$chi))’, without storing the intermediate result. The first line of the output reminds us of the way we ran lm(). Lines 4–6 present the five-point summary of the residuals, the 𝜖 in our formula above. For now we skip these, but it will soon be clear why they are important for interpreting linear regression results.

Lines 8–10 present the estimated coefficients (the intercept and the slope) along with some inferential statistics about them. The standard error reported is similar to the standard error of the mean we discussed earlier: it represents the standard deviation of the coefficient estimates that would be obtained from similar samples. The t-tests presented have the null hypothesis that the coefficient tested is 0. For the intercept, this test is not very useful. The other problems regarding the estimation and interpretation of the intercept in this problem aside, we do not really believe that mothers start speaking to their children only when their children start talking. The inference for the slope, however, is generally something we are interested in. Remember that the slope represents the expected increase in the mother’s MLU for a unit increase in the child’s MLU; in other words, this is the effect of the child’s MLU on the mother’s MLU. If we cannot reject the null hypothesis that the slope is 0 (no effect), we cannot be confident that the effect we estimated is not a chance effect.

Lines 12–14 present further statistics that are useful in our interpretation of the linear regression results. We will return to these later. For now, we note that the reported ‘R-squared’ value is the standardized effect size for linear regression, and is interpreted as the ‘amount of variance in the response variable that is explained by the predictor(s)’.
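These quantities can also be extracted programmatically rather than read off the printed output. A minimal sketch, assuming the model object m from Listing 5 is still available:

s <- summary(m)
s$coefficients   # the table of estimates, standard errors, t and p values
s$r.squared      # multiple R-squared (0.1272 for this model)
confint(m)       # 95% confidence intervals for the intercept and the slope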

Exercise 3.5. Find the Pearson correlation coefficient between the mother’s and the child’s MLU. Compare the square of the correlation coefficient with the R-squared value reported in Listing 5.

3.2 Some model diagnostics

The lm() function in R finds the regression equation with the minimum sum of squared errors (Σ_i 𝜖_i²). However, the results may be misleading if the following modeling assumptions do not hold:
  1. the relationship between the predictor and the response is linear,
  2. the residuals are independent of each other,
  3. the residuals have constant variance (homoscedasticity),
  4. the residuals are (approximately) normally distributed.

Besides these, least-squares regression estimation is sensitive to extreme values, or outliers.

We will return to all of these assumptions, and how to check them, later. For now, we will only check the residuals for normality, as we already know how to do that. Note that almost all assumptions of linear regression are about the residuals; now you know why the summary presented in Listing 5 includes a five-point summary of the residuals. To extract the residuals from a fitted model, you can use the resid() function.
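As a minimal sketch of this kind of check (fit stands for a hypothetical fitted-model object; hist() and qqnorm()/qqline() are standard R functions for inspecting a distribution):

fit <- lm(y ~ x)       # placeholder: any model returned by lm()
r <- resid(fit)        # one residual per observation
hist(r)                # a rough visual check of the distribution
qqnorm(r); qqline(r)   # a normal Q-Q plot; points near the line suggest normality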

Exercise 3.6. Extract the residuals from the model fitted in Listing 5.

Exercise 3.7. We wonder whether MLU is a good measure of a child’s language ability. For a normally developing child, we expect the age of the child to be a good predictor of his/her language ability. Using age as a proxy for the child’s language ability, investigate the relation between MLU and language ability.
  1. What is the best choice of predictor and response variables for this problem?
  2. What are the estimates of the intercept and the slope, and how do you interpret them?
  3. Produce a scatter plot of the data, and draw the regression line.
  4. Is the estimated slope statistically significant at level 0.05?
  5. Do you observe any clear outliers in the scatter plot?
  6. Extract the residuals from the model, and check whether they are normally distributed or not.