7 Regression again

Linear regression is one of the basic tools in data analysis. In its classical application, linear regression is used when the response variable and the predictor(s) are both numeric (interval or ratio scale) variables. However, as exemplified in Section 4.6, and as we will demonstrate further, many statistical analysis techniques can be cast as variations (or extensions) of linear regression. In this section we will revisit correlation and regression with a single predictor variable, and continue with multiple regression and model selection. First, we will again prepare the data we will use.

The new data set we will use in this part contains geographic distances and aggregate pronunciation differences for 613 locations in the Netherlands. It has two variables: linguistic distance, measured as the average string edit distance over a set of words, and geographic distance in kilometers. The data is a subset of the Dutch data set analyzed by Nerbonne [2010]. Note that (to save myself from some difficult questions) the distances corresponding to the sites very close to each other are not included in the data set we will use here. The data can be found at http://coltekin.net/cagri/R/data/ling_geo.rda.

Exercise 7.1.  Load the R data file from http://coltekin.net/cagri/R/data/ling_geo.rda into your R environment. If all goes right, you should have a new data frame named lg.

We will also make use of the data set seg from Section 6, and the mlu data set from Section 3. If you do not have the data in your R environment, you should return to Exercise 6.1 and to the beginning of Section 3 to load or create these data sets.

7.1 Correlation

We have introduced how to calculate some of the correlation coefficients earlier in Section 2.2. Here, we will revisit this, and introduce how to make inferences regarding correlation. We will not dwell on the assumptions of correlation. Pearson’s correlation coefficient shares all its assumptions with least-squares regression with a single predictor, and we will check these assumptions in the next subsection.
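As a quick reminder, here is a minimal sketch of computing the three coefficients supported by cor(), using two short hypothetical numeric vectors x and y (not part of our data sets):

> x <- c(1.2, 2.3, 3.1, 4.8, 5.0)    # hypothetical example values
> y <- c(2.0, 2.9, 3.7, 4.1, 5.5)
> cor(x, y)                          # Pearson's r (the default)
> cor(x, y, method="spearman")       # Spearman's rho
> cor(x, y, method="kendall")        # Kendall's tau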

Exercise 7.2.  Using the seg data set, find Pearson’s, Spearman’s, and Kendall’s correlation coefficients between pmi and h.

Remember that all can be calculated using the same function cor().

The correlation coefficient calculated by cor() is a point estimate of the correlation in the population. However, it does not include any inferential information. The function cor.test() gives you a confidence interval and a test against the null hypothesis ‘there is no correlation (the coefficient is equal to 0)’.
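For example, a minimal sketch using the hypothetical vectors x and y defined above:

> cor.test(x, y)                      # Pearson's r, with a t-test and a 95% CI
> cor.test(x, y, method="spearman")   # Spearman's rho (no confidence interval is reported)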

Exercise 7.3. Perform cor.test() for the values you have calculated in Exercise 7.2.

If you pass a data frame to cor(), it will return a matrix of all pairwise correlations between the variables in the data frame. This is handy if you have a large number of variables, and you’d like to inspect correlations between them.

Exercise 7.4.  Calculate pairwise r values between pmi, h, rh, sv and boundary columns of the seg data frame.

TIP: In this exercise, we want to take a subset of 5 columns from a 7-column data frame. It is easier to exclude the remaining two columns than to list all 5 we need. In R, a negative index excludes the corresponding element. For example, an index value of -1 will exclude the first item, and likewise, a vector index of the form -c(1,2) will exclude the first two items.
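For instance, a minimal sketch of this approach, assuming (hypothetically) that the two columns we do not need are in positions 6 and 7 of seg; check names(seg) for the actual positions in your data frame:

> names(seg)            # find the positions of the columns to exclude
> cor(seg[, -c(6, 7)])  # pairwise correlations of the remaining columns

If cor() complains about a non-numeric column, you may need to convert it first, for example with as.numeric().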

One thing to note in Exercise 7.4 is that one of the variables we have included in the correlation analysis (boundary) is not a numeric variable. The correlation coefficient between a numeric and a binary categorical variable can be calculated using the so-called point-biserial correlation coefficient, which gives yet another hint that linear models can handle both numeric and categorical data types.

7.2 Least-squares linear regression

In Section 3, we already did some exercises with linear regression. In summary, we fit a linear model using lm(), and extract the information we need in most cases using summary(). We worked through these steps earlier in Listing 5, predicting the mother’s MLU from the child’s MLU (repeated here as Listing 9 for convenience).


Listing 9: Summary of a linear regression analysis (repeated from Listing 5).
> m <- lm(mlu$mot ~ mlu$chi)
> summary(m)

Call: lm(formula = mlu$mot ~ mlu$chi)

Residuals:
     Min       1Q   Median       3Q      Max
-0.79928 -0.14665  0.06142  0.14003  0.66232

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.7133     0.2326  24.559   <2e-16
mlu$chi       0.1503     0.0839   1.791   0.0871

Residual standard error: 0.3182 on 22 degrees of freedom
Multiple R-squared:  0.1272,    Adjusted R-squared:  0.08757
F-statistic: 3.207 on 1 and 22 DF,  p-value: 0.08708

You should already be able to interpret most of the information presented in Listing 9. If not, you are encouraged to revise Section 3.

Exercise 7.5.  Using the lg data set, fit a linear regression model predicting the linguistic difference from geographic distance. Produce the summary of linear regression fit.

Assuming the modeling assumptions are sound (we’ll check these shortly),

  1. Is geographic distance reliably reflected in linguistic difference, i.e., is the slope of the model statistically significant?
  2. What does the intercept represent in this model?
  3. How much of the linguistic difference is explained by the geographic distance?

7.3 Model diagnostics

Here, we will first focus on model diagnostics, checking whether the least-squares regression assumptions hold or not. Previously we extracted the residuals and tried to check whether they are normally distributed. R provides an easier way of checking various assumptions of linear regression. All you need to do is to type plot(m) (assuming you saved your model as m). This produces the four graphs presented in Figure 2.
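By default the four graphs are drawn one after another; a minimal sketch for placing them all in a single 2-by-2 figure (again assuming your model is stored in m):

> par(mfrow=c(2, 2))   # arrange plots in 2 rows and 2 columns
> plot(m)              # the four diagnostic plots
> par(mfrow=c(1, 1))   # restore the default layout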



Figure 2: Diagnostic plots for linear regression.


All graphs in Figure 2 expose possible violations of modeling assumptions. The top-left graph, residuals vs. fitted values, gives an indication of whether your residuals are correlated or not (i.e., whether they are independent). What you do not want to see here is any sort of pattern. The graph at the bottom left, fitted values vs. |standardized residuals|, is basically the same graph on a different scale. It helps to see some patterns that are difficult to see in the first graph. In both graphs, you can detect nonlinearities and non-constant variance across the predicted values. For a regression analysis with a single predictor, you can see these problems in a simple scatter plot (you may need to tilt your head a bit to see the resemblance). However, as the number of predictors increases, scatter plots will not be as useful, but fitted vs. residuals graphs will still be a useful diagnostic.

In both graphs, although it is difficult to decide due to the small number of data points, it seems the variance decreases around the region that corresponds to fitted values of 6.0 to 6.1 for the mother’s MLU. The fitted vs. |standardized residuals| graph also indicates some mild non-linearity. The cases that may be causing the problems are labeled (by their row numbers in the data frame). These graphs identify data points 1, 20 and 22 as potential causes of poor model fit.

The top-right graph is the familiar Q-Q plot. For least-squares regression, we want our residuals to be normally distributed. Again, it is difficult to judge due to the small number of cases, but our example seems to diverge slightly from the ideal line in both tails.

The last, bottom-right graph gives an indication of influential data points. The leverage of a data point is its distance from an average data point on the x-axis (or in the multidimensional space defined by all predictors). Points that have high leverage and/or large residuals are likely to be very influential. Cook’s distance is a measure of the influence of a particular data point. You can think of Cook’s distance as measuring the effect of removing a data point from the regression analysis: if the resulting parameter estimates change drastically, the point will have a larger Cook’s distance. The contour lines drawn (at Cook’s distance 0.5, and 1.0 if you have such points) indicate influential data points. The points that appear outside the contour lines need particular attention. The data point 20 is obviously the most influential case, and 1 and 22 may also need some attention. Although opinions differ, a rough guideline is that a Cook’s distance above 1 should be a cause for concern.
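If you want the numbers behind this graph, R can report leverages and Cook’s distances directly; a minimal sketch, again assuming the model is stored in m:

> hatvalues(m)                      # leverage of each observation
> cooks.distance(m)                 # Cook's distance of each observation
> which(cooks.distance(m) > 0.5)    # observations beyond the first contour line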

Once you are convinced that a particular observation is an outlier, you may want to remove it from the model fit. lm() and similar functions provide a handy notation to select or exclude certain observations. The argument subset may be used to specify the rows of the data to be used. Alternatively, you can specify the data option with already selected rows, e.g., data=mlu[c(1:19,21:24),], or simpler, data=mlu[-20,].
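For instance, a minimal sketch of the subset argument, using the column-name form of the formula so that the row selection applies cleanly (this refits the model from Listing 9 without observation 20):

> m.sub <- lm(mot ~ chi, data=mlu, subset=-20)   # drop row 20 before fitting
> summary(m.sub)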

Exercise 7.6. Repeat the analysis in Listing 9 twice. First, without the most influential observation according to Figure 2. And second, without an observation (approximately) at the center with respect to the child’s MLU values (such an observation is unlikely to be influential, why?).

Compare the change in the coefficient estimates and the r² value with respect to the summary displayed in Listing 9.

Exercise 7.7.  Produce the diagnostic plots for the model you fit in Exercise 7.5. Make sure all four diagnostic plots are plotted on the same graph, and use gray ‘dots’ for the points plotted to reduce the clutter you get due to the large number of points (this will also make the lines without a specified color gray, but that’s fine).
  1. Do you see any non-linearity?
  2. Do the residuals look normally distributed?
  3. Is the variance of the residuals constant?
  4. Are there very influential data points?
Exercise 7.8.  Repeat the plot you have produced in Exercise 7.7, this time save the result into a PNG file.

7.4 An example transformation

If you were careful enough in Exercise 7.7, you should not be content with the assumptions of the model we fit in Exercise 7.5. When regression assumptions, especially the linearity assumption, are violated, one of the ways to fix the problem is to transform the predictor or the response variable. Normally, inspecting your data is helpful for detecting non-linearities and deciding on a possibly useful transformation. Here, we will do the transformation first, and then justify it.

Exercise 7.9. Add a new variable log.geo to the lg data frame which contains the log-transformed geographic distance.

Exercise 7.10.  Fit a linear regression model using log.geo as the predictor of the linguistic differences.
  1. Compare the output with the earlier output without log transformation.
  2. Plot the diagnostic graphs, and compare them with the one you saved in Exercise 7.8.

7.5 Predictions of a linear model

The inference for the slope indicates whether there is a reliable relationship between the outcome variable and the predictor. In other words, if the modeling assumptions are correct, and the t-test performed for the slope is statistically significant, we conclude that the predictor has an effect on the response variable at the chosen level of significance. Besides this ‘hypothesis testing’ interpretation, a linear regression fit can also be used for predicting the outcome variable for new values of the predictor. Given a predictor value x, the intercept a and the slope b, the value of the outcome variable, ŷ, can simply be calculated using the formula

ŷ = a + bx

Note that this prediction does not include the uncertainty associated with the estimate.
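For example, a minimal sketch of calculating ŷ by hand from the estimated coefficients of the model in Listing 9 (the value 2.5 is an arbitrary example for the child’s MLU):

> coef(m)              # the intercept a and the slope b
> a <- coef(m)[1]
> b <- coef(m)[2]
> a + b * 2.5          # predicted mother's MLU for a child's MLU of 2.5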

In R, you can use predict() to calculate model predictions (and confidence intervals) for unseen data points. For example, assuming the variable mlog contains the model that predicts linguistic difference from the logarithm of the geographic distance (from Exercise 7.10), Listing 10 demonstrates the use of the predict() function to obtain the predicted linguistic differences between sites whose distances are 10, 20, 50 and 100 km, along with the lower and upper values of the 95% confidence interval for the estimated regression line.


Listing 10: Example use of predict().
> predict(mlog,
          newdata=data.frame(
                log.geo=log(c(10, 20, 50, 100))),
          interval="confidence")
        fit       lwr       upr
1 0.1053994 0.1045929 0.1062059
2 0.1462061 0.1456081 0.1468042
3 0.2001497 0.1998106 0.2004888
4 0.2409565 0.2407551 0.2411578

Note that the newdata option of predict() requires a data frame where the predictor variable(s) are named the same as the original data frame used for fitting the model. This may seem unnecessary for simple regression, but it will be clear why a data frame is required when we work with multiple regression. If you do not provide the newdata option, predict will use the data used for fitting the model.

The option interval="confidence" in Listing 10 results in the confidence interval for the estimated regression line: the expected value of the response variable at a given predictor value. If you specify interval="prediction" instead, you will get an interval where 95% of the future observations are expected to fall. You should think of the former as your estimate of the mean in the population, and the latter as the expected range of individual observations. If the number of observations used for fitting the model is large, the former (the confidence interval of the estimated mean) will be tight. However, the latter (the range of individual values sampled from the population) is unlikely to be affected by an increased number of observations during the model fit. If you skip the interval option, predict() returns only the predicted values without any interval.
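For example, the call in Listing 10 can be repeated with a prediction interval instead (output not shown); the resulting intervals will be much wider than the confidence intervals above:

> predict(mlog,
          newdata=data.frame(log.geo=log(c(10, 20, 50, 100))),
          interval="prediction")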

Exercise 7.11.  Using the mlu data set (see the beginning of Section 3) perform the following:
  1. Plot a scatter plot of the mother’s MLU vs. the child’s MLU.
  2. Fit a linear regression model predicting the mother’s MLU from the child’s MLU, assign the result to a variable.
  3. Draw the linear regression line over the scatter plot using a red solid line.
  4. Draw lines for the lower and upper bounds of the 95% confidence interval for the linear regression estimate over the scatter plot and the regression line. Use a different color and line type than the regression line.

    TIP: You can create a number of, e.g., 20, equally spaced values that span the x axis by
    seq(min(mlu$chi), max(mlu$chi), length.out=20),
    and use it with predict to obtain the lower and upper bounds of the confidence intervals.

Exercise 7.12.  Plot the scatter plot of geographic distance vs. linguistic difference. Use gray dots for the points plotted.

Plot geo vs. predicted values from the linear model without log transformation (the model in Exercise 7.5) using a red solid line.

Plot geo vs. predicted values from the linear model with log transformation (the model in Exercise 7.10) using a blue dashed line.

Use sensible axis titles, and include an appropriate legend in the resulting figure.

Exercise 7.13. We did not plot confidence bands for the lg data set since, due to the large number of data points, the confidence intervals are very narrow and not visible on the graph.

For the sake of demonstration, draw the confidence bands as in Exercise 7.11 for the model predicting linguistic differences from the logarithm of geographic distances. While plotting the confidence bands, use the newdata option of predict() to obtain 20 equally spaced points over the x axis.