7 Regression again

Linear regression is one of the basic tools in data analysis. In its classical application, linear regression is used when the response variable and the predictor(s) are both numeric (interval or ratio scale) variables. However, as exemplified in Section 4.6, and as we will demonstrate further, many statistical analysis techniques can be cast as variations (or extensions) of linear regression. In this section we will revisit correlation and regression with a single predictor variable, and continue with multiple regression and model selection. First, we will again prepare the data we will use.

The new data set we will use in this part contains geographic distances and aggregate pronunciation differences for 613 locations in the Netherlands. It has two variables: linguistic distance, measured as the average string edit distance over a set of words, and geographic distance in kilometers. The data is a subset of the Dutch data set analyzed by Nerbonne [2010]. Note that (to save myself from some difficult questions) the distances corresponding to the sites very close to each other are not included in the data set we will use here. The data can be found at http://coltekin.net/cagri/R/data/ling_geo.rda.

Exercise 7.1.  Load the R data file from http://coltekin.net/cagri/R/data/ling_geo.rda into your R environment. If all goes right, you should have a new data frame named lg.

We will also make use of the data set seg from Section 6, and the mlu data set from Section 3. If you do not have the data in your R environment, you should return to Exercise 6.1 and to the beginning of Section 3 to load or create these data sets.

7.1 Correlation

We have introduced how to calculate some of the correlation coefficients earlier in Section 2.2. Here, we will revisit this, and introduce how to make inferences regarding correlation. We will not dwell on the assumptions of correlation. Pearson’s correlation coefficient shares all its assumptions with least-squares regression with a single predictor, and we will check these assumptions in the next subsection.
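As a quick reminder, here is a minimal sketch of computing the three coefficients supported by cor(), using two short hypothetical numeric vectors x and y (not part of our data sets):

> x <- c(1.2, 2.3, 3.1, 4.8, 5.0)    # hypothetical example values
> y <- c(2.0, 2.9, 3.7, 4.1, 5.5)
> cor(x, y)                          # Pearson's r (the default)
> cor(x, y, method="spearman")       # Spearman's rho
> cor(x, y, method="kendall")        # Kendall's tau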

Exercise 7.2.  Using the seg data set, find Pearson’s, Spearman’s, and Kendall’s correlation coefficients between pmi and h.

Remember that all can be calculated using the same function cor().

The correlation coefficient calculated by cor() is a point estimate of the correlation in the population. However, it does not include any inferential information. The function cor.test() gives you a confidence interval and a test against the null hypothesis ‘there is no correlation (the coefficient is equal to 0)’.
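For example, a minimal sketch using the hypothetical vectors x and y defined above:

> cor.test(x, y)                      # Pearson's r, with a t-test and a 95% CI
> cor.test(x, y, method="spearman")   # Spearman's rho (no confidence interval is reported)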

Exercise 7.3. Perform cor.test() for the values you have calculated in Exercise 7.2.

If you pass a data frame to cor(), it will return a matrix of all pairwise correlations between the variables in the data frame. This is handy if you have a large number of variables, and you’d like to inspect correlations between them.

Exercise 7.4.  Calculate pairwise r values between pmi, h, rh, sv and boundary columns of the seg data frame.

TIP: In this exercise, we want to take a subset of 5 columns from a 7-column data frame. It is easier to exclude the remaining two columns than to list all 5 we need. In R, a negative index excludes the corresponding element. For example, an index value of -1 will exclude the first item, and likewise, a vector index of the form -c(1,2) will exclude the first two items.
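For instance, a minimal sketch of this approach, assuming (hypothetically) that the two columns we do not need are in positions 6 and 7 of seg; check names(seg) for the actual positions in your data frame:

> names(seg)            # find the positions of the columns to exclude
> cor(seg[, -c(6, 7)])  # pairwise correlations of the remaining columns

If cor() complains about a non-numeric column, you may need to convert it first, for example with as.numeric().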

One thing to note in Exercise 7.4 is that one of the variables we have included in the correlation analysis (boundary) is not a numeric variable. The correlation coefficient between a numeric and a binary categorical variable can be calculated using the so-called point-biserial correlation coefficient, which gives yet another hint that linear models can handle both numeric and categorical data types.

7.2 Least-squares linear regression

In Section 3, we already did some exercises with linear regression. In summary, we fit a linear model using lm(), and extract the information we need in most cases using summary(). We worked through these steps earlier in Listing 5, predicting the mother’s MLU from the child’s MLU (repeated here as Listing 9 for convenience).


Listing 9: Summary of a linear regression analysis (repeated from Listing 5).
> m <- lm(mlu$mot ~ mlu$chi)
> summary(m)

Call: lm(formula = mlu$mot ~ mlu$chi)

Residuals:
     Min       1Q   Median       3Q      Max
-0.79928 -0.14665  0.06142  0.14003  0.66232

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.7133     0.2326  24.559   <2e-16
mlu$chi       0.1503     0.0839   1.791   0.0871

Residual standard error: 0.3182 on 22 degrees of freedom
Multiple R-squared:  0.1272,    Adjusted R-squared:  0.08757
F-statistic: 3.207 on 1 and 22 DF,  p-value: 0.08708

You should already be able to interpret most of the information presented in Listing 9. If not, you are encouraged to revise Section 3.

Exercise 7.5.  Using the lg data set, fit a linear regression model predicting the linguistic difference from geographic distance. Produce the summary of linear regression fit.

Assuming the modeling assumptions are sound (we’ll check these shortly),

  1. Is geographic distance reliably reflected in linguistic difference, i.e., is the slope of the model statistically significant?
  2. What does the intercept represent in this model?
  3. How much of the linguistic difference is explained by the geographic distance?

7.3 Model diagnostics

Here, we will first focus on model diagnostics, checking whether the least-squares regression assumptions hold or not. Previously we extracted the residuals and tried to check whether they are normally distributed. R provides an easier way of checking various assumptions of linear regression. All you need to do is to type plot(m) (assuming you saved your model as m). This produces the four graphs presented in Figure 2.
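By default the four graphs are drawn one after another; a minimal sketch for placing them all in a single 2-by-2 figure (again assuming your model is stored in m):

> par(mfrow=c(2, 2))   # arrange plots in 2 rows and 2 columns
> plot(m)              # the four diagnostic plots
> par(mfrow=c(1, 1))   # restore the default layout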



Figure 2: Diagnostic plots for linear regression.


All graphs in Figure 2 expose possible violations of modeling assumptions. The top-left graph, residuals vs. fitted values, gives an indication of whether your residuals are correlated or not (i.e., whether they are independent). What you do not want to see here is any sort of pattern. The graph at the bottom left, fitted values vs. |standardized residuals|, is basically the same graph on a different scale. It helps to see some patterns that are difficult to see in the first graph. In both graphs, you can detect nonlinearities and non-constant variance across the predicted values. For a regression analysis with a single predictor, you can see these problems in a simple scatter plot (you may need to tilt your head a bit to see the resemblance). However, as the number of predictors increases, scatter plots will not be as useful, but fitted vs. residuals graphs will still be a useful diagnostic.

In both graphs, although it is difficult to decide due to the small number of data points, it seems the variance decreases around the region that corresponds to fitted values of 6.0 to 6.1 for the mother’s MLU. The fitted vs. |standardized residuals| graph also indicates some mild non-linearity. The cases that may be causing the problems are labeled (by their row numbers in the data frame). These graphs identify data points 1, 20 and 22 as potential causes of poor model fit.

The top-right graph is the familiar Q-Q plot. For least-squares regression, we want our residuals to be normally distributed. Again, it is difficult to judge due to the small number of cases, but our example seems to diverge slightly from the ideal line in both tails.

The last, bottom-right graph gives an indication of influential data points. The leverage of a data point is its distance from an average data point on the x-axis (or in the multidimensional space defined by all predictors). Points that have high leverage and/or large residuals are likely to be very influential. Cook’s distance is a measure of the influence of a particular data point. You can think of Cook’s distance as measuring the effect of removing a data point from the regression analysis: if the resulting parameter estimates change drastically, the point will have a larger Cook’s distance. The contour lines drawn (at Cook’s distance 0.5, and 1.0 if you have such points) indicate influential data points. The points that appear outside the contour lines need particular attention. The data point 20 is obviously the most influential case, and 1 and 22 may also need some attention. Although opinions differ, a rough guideline is that a Cook’s distance above 1 should be a cause for concern.
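If you want the numbers behind this graph, R can report leverages and Cook’s distances directly; a minimal sketch, again assuming the model is stored in m:

> hatvalues(m)                      # leverage of each observation
> cooks.distance(m)                 # Cook's distance of each observation
> which(cooks.distance(m) > 0.5)    # observations beyond the first contour line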

Once you are convinced that a particular observation is an outlier, you may want to remove it from the model fit. lm() and similar functions provide a handy notation to select or exclude certain observations. The argument subset may be used to specify the rows of the data to be used. Alternatively, you can specify the data option with already selected rows, e.g., data=mlu[c(1:19,21:24),], or simpler, data=mlu[-20,].
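For instance, a minimal sketch of the subset argument, using the column-name form of the formula so that the row selection applies cleanly (this refits the model from Listing 9 without observation 20):

> m.sub <- lm(mot ~ chi, data=mlu, subset=-20)   # drop row 20 before fitting
> summary(m.sub)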

Exercise 7.6. Repeat the analysis in Listing 9 twice. First, without the most influential observation according to Figure 2. And second, without an observation (approximately) at the center with respect to the child’s MLU values (such an observation is unlikely to be influential, why?).

Compare the change in the coefficient estimates and the r² value with respect to the summary displayed in Listing 9.

Exercise 7.7.  Produce the diagnostic plots for the model you fit in Exercise 7.5. Make sure all four diagnostic plots are plotted on the same graph, and use gray ‘dots’ for the points plotted to reduce the clutter you get due to the large number of points (this will also make the lines without a specified color gray, but that’s fine).
  1. Do you see any non-linearity?
  2. Do the residuals look normally distributed?
  3. Is the variance of the residuals constant?
  4. Are there very influential data points?
Exercise 7.8.  Repeat the plot you have produced in Exercise 7.7, this time save the result into a PNG file.

7.4 An example transformation

If you were careful enough in Exercise 7.7, you should not be content with the assumptions of the model we fit in Exercise 7.5. When regression assumptions, especially the linearity assumption, are violated, one of the ways to fix the problem is to transform the predictor or the response variable. Normally, inspecting your data is helpful for detecting non-linearities and deciding on a possibly useful transformation. Here, we will do the transformation first, and then justify it.

Exercise 7.9. Add a new variable log.geo to the lg data frame which contains the log-transformed geographic distance.

Exercise 7.10.  Fit a linear regression model using log.geo as the predictor of the linguistic differences.
  1. Compare the output with the earlier output without log transformation.
  2. Plot the diagnostic graphs, and compare them with the one you saved in Exercise 7.8.

7.5 Predictions of a linear model

The inference for the slope indicates whether there is a reliable relationship between the outcome variable and the predictor. In other words, if the modeling assumptions are correct, and the t-test performed for the slope is statistically significant, we conclude that the predictor has an effect on the response variable at the chosen level of significance. Besides this ‘hypothesis testing’ interpretation, a linear regression fit can also be used for predicting the outcome variable for new values of the predictor. Given a predictor value x, the intercept a and the slope b, the value of the outcome variable, ŷ, can simply be calculated using the formula

ŷ = a + bx

Note that this prediction does not include the uncertainty associated with the estimate.
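For example, a minimal sketch of calculating ŷ by hand from the estimated coefficients of the model in Listing 9 (the value 2.5 is an arbitrary example for the child’s MLU):

> coef(m)              # the intercept a and the slope b
> a <- coef(m)[1]
> b <- coef(m)[2]
> a + b * 2.5          # predicted mother's MLU for a child's MLU of 2.5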

In R, you can use predict() to calculate model predictions (and confidence intervals) for unseen data points. For example, assuming the variable mlog contains the model that predicts linguistic difference from the logarithm of the geographic distance (from Exercise 7.10), Listing 10 demonstrates the use of the predict() function to obtain the predicted linguistic differences between sites whose distances are 10, 20, 50 and 100 km, along with the lower and upper values of the 95% confidence interval for the estimated regression line.


Listing 10: Example use of predict().
> predict(mlog,
          newdata=data.frame(
                log.geo=log(c(10, 20, 50, 100))),
          interval="confidence")
        fit       lwr       upr
1 0.1053994 0.1045929 0.1062059
2 0.1462061 0.1456081 0.1468042
3 0.2001497 0.1998106 0.2004888
4 0.2409565 0.2407551 0.2411578

Note that the newdata option of predict() requires a data frame where the predictor variable(s) are named the same as the original data frame used for fitting the model. This may seem unnecessary for simple regression, but it will be clear why a data frame is required when we work with multiple regression. If you do not provide the newdata option, predict will use the data used for fitting the model.

The option interval="confidence" in Listing 10 results in the confidence interval for the estimated regression line: the expected value of the response variable at a given predictor value. If you specify interval="prediction" instead, you will get an interval where 95% of the future observations are expected to fall. You should think of the former as your estimate of the mean in the population, and the latter as the expected range of individual observations. If the number of observations used for fitting the model is large, the former (the confidence interval of the estimated mean) will be tight. However, the latter (the range of individual values sampled from the population) is unlikely to be affected by an increased number of observations during the model fit. If you skip the interval option, predict() returns only the predicted values without any interval.
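For example, the call in Listing 10 can be repeated with a prediction interval instead (output not shown); the resulting intervals will be much wider than the confidence intervals above:

> predict(mlog,
          newdata=data.frame(log.geo=log(c(10, 20, 50, 100))),
          interval="prediction")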

Exercise 7.11.  Using the mlu data set (see the beginning of Section 3) perform the following:
  1. Plot a scatter plot of the mother’s MLU vs. the child’s MLU.
  2. Fit a linear regression model predicting the mother’s MLU from the child’s MLU, assign the result to a variable.
  3. Draw the linear regression line over the scatter plot using a red solid line.
  4. Draw lines for the lower and upper bounds of the 95% confidence interval for the linear regression estimate over the scatter plot and the regression line. Use a different color and line type than the regression line.

    TIP: You can create a number of, e.g., 20, equally spaced values that span the x axis by
    seq(min(mlu$chi), max(mlu$chi), length.out=20),
    and use it with predict to obtain the lower and upper bounds of the confidence intervals.

Exercise 7.12.  Plot the scatter plot of geographic distance vs. linguistic difference. Use gray dots for the points plotted.

Plot geo vs. predicted values from the linear model without log transformation (the model in Exercise 7.5) using a red solid line.

Plot geo vs. predicted values from the linear model with log transformation (the model in Exercise 7.10) using a blue dashed line.

Use sensible axis titles, and include an appropriate legend in the resulting figure.

Exercise 7.13. We did not plot confidence bands for the lg data set since, due to the large number of data points, the confidence intervals are very narrow and not visible on the graph.

For the sake of demonstration, draw the confidence bands as in Exercise 7.11 for the model predicting linguistic differences from the logarithm of geographic distances. While plotting the confidence bands, use the newdata option of predict() to obtain 20 equally spaced points over the x axis.