2 Basic data exploration and inference

This section provides a brief glance at some of the basic summarization, visualization and inference techniques. We will return to most of the tools and topics introduced here in the rest of the tutorial.

In this section we will use data from a hypothetical experiment in which the number of words spoken per day by 10 female and 10 male English speakers was measured. Note, again, that the data we will use is fictional; for a real investigation of the problem, see Mehl et al. [2007].

The following R commands create four vector variables.

 
words.f <- c(17667, 15347, 14401, 5037, 20845, 
             11211, 6008, 17140, 13284, 10930) 
words.m <- c(5599, 19776, 13961, 10144, 6107, 
             16776, 31955, 21140, 5482, 2152) 
words <- c(words.f, words.m) 
age <- c(24, 31, 28, 21, 29, 29, 25, 32, 30, 31, 
         33, 26, 22, 24, 23, 23, 20, 21, 29, 27)

The vectors words.f and words.m hold the number of words measured from female and male participants, respectively. The vector words holds their combination, and age contains the ages of the participants. Organizing this data in separate vector variables is not what we normally do. We will later use data frames for storing a related set of vectors such as the ones above.

2.1 Summarizing and visualizing one-dimensional data

Even for a data set as simple and small as words above, it is difficult to draw conclusions only by looking at the raw data. We generally want to summarize the data at hand to understand it better. In Listing 4, we have already seen how to get some useful summaries in R.

For one-dimensional data, here are a few functions that produce common summaries:

mean() mean of the given data set.
median() median of the given data set.
min() minimum value.
max() maximum value.
summary() So-called 5-point summary (minimum, lower quartile, median, upper quartile and maximum) and mean.
sd() standard deviation.
mad() median absolute deviation from the median (by default scaled by the constant 1.4826).

You are encouraged to try these functions (again) on the words data set.
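As a small illustration of how these summaries behave, the sketch below applies them to a made-up vector x (hypothetical values, not part of the tutorial's data) that contains one deliberately extreme value. It is only meant to show which summaries are sensitive to an outlier.

```r
# Hypothetical toy sample (not the tutorial's data); 40 is a deliberate outlier.
x <- c(2, 4, 4, 5, 7, 9, 40)

mean(x)      # pulled upwards by the outlier
median(x)    # 5: does not depend on how extreme the largest value is
summary(x)   # minimum, quartiles, median, mean and maximum
sd(x)        # standard deviation, also sensitive to the outlier

# mad() is the *median* absolute deviation from the median. By default R
# multiplies it by 1.4826 so that it is comparable to sd() for normal data.
mad(x)                 # 1.4826 * 2
mad(x, constant = 1)   # 2: the unscaled median of abs(x - median(x))
```

Comparing sd(x) with mad(x) here gives a first impression of why robust summaries can be preferable when outliers are present.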

These summaries are often useful, indicating the center and spread of the data at hand. However, we often want to understand the data in more detail. In that case graphical summaries are more helpful.

One of the ways to visualize small data sets is stem-and-leaf plots. Stem-and-leaf plots can be produced with the function stem() in R.

: Exercise 2.1. Produce a stem-and-leaf plot of words. Does the plot indicate an outlier? Can you say more about the distribution of the word counts? $⊳$

One of the better ways of inspecting your data is producing a histogram. You can display a histogram in R using the hist() function.

: Exercise 2.2. Produce a histogram of words. $⊳$

Yet another method of visualizing your data is the box-and-whisker plot (or box plot). Box plots display a summary similar to the five-point summary in a graphical way. In a box plot, the box covers the range between the first and third quartiles (the interquartile range, IQR). The middle bar represents the median. Each whisker extends to the most extreme data point that lies within 1.5 times the IQR from the box, so it reaches the minimum or maximum whenever that value is within this range. Data points more extreme than the whiskers are considered outliers and are plotted separately.
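The quantities that a box plot displays can also be computed by hand. The following is a minimal sketch on a made-up vector x (hypothetical values, not the tutorial's data). The names lower.fence, upper.fence and inside are our own. Note that boxplot() itself works with 'hinges' from fivenum(), which can differ slightly from the quartiles returned by quantile() for some sample sizes.

```r
# Hypothetical toy sample with one extreme value.
x <- c(2, 4, 4, 5, 7, 9, 40)

q   <- unname(quantile(x, c(0.25, 0.75)))  # first and third quartiles: 4 and 8
iqr <- IQR(x)                              # interquartile range: 8 - 4 = 4

# Limits beyond which a point is drawn separately as an outlier.
lower.fence <- q[1] - 1.5 * iqr            # -2
upper.fence <- q[2] + 1.5 * iqr            # 14

inside <- x >= lower.fence & x <= upper.fence
range(x[inside])       # whisker ends: 2 and 9
x[!inside]             # the outlier: 40

boxplot.stats(x)$out   # the outliers boxplot() would draw: 40
```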

: Exercise 2.3. Display a box plot of words. $⊳$

: Exercise 2.4. Normally, box plots are more useful for comparing two or more samples, or groups. Plot box plots for words.f, and words.m side by side. Do women talk more than men? TIP: you can use help() if you do not know how to use boxplot() to display two groups instead of one. $⊳$

2.2 Summarizing and visualizing two-dimensional data

The basic measure of relatedness of two continuous variables is their correlation. The correlation of two variables can be calculated using the function cor().

: Exercise 2.5. Find the correlation between words and age. Do people speak more as they get older? Is this a strong correlation? $⊳$

: Exercise 2.6. Do you expect any correlation between the variables words.f and words.m? Calculate the correlation coefficient between these two variables. Can you explain your findings? $⊳$

The function cor() by default calculates the Pearson product-moment correlation coefficient, known as Pearson’s $r$. We will return to the topic of correlation later, and discuss its interpretation and inference in more detail.
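To see what cor() computes, the sketch below calculates Pearson’s $r$ directly from its definition for two made-up vectors x and y (hypothetical values, not the tutorial's data) and compares the result with cor(). The name r.manual is our own.

```r
# Hypothetical toy data.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r: the sum of products of deviations from the means, divided by
# the square root of the product of the two sums of squared deviations.
dx <- x - mean(x)
dy <- y - mean(y)
r.manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r.manual    # 0.8
cor(x, y)   # the same value
cor(y, x)   # correlation is symmetric: the order of the arguments is irrelevant
```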

To visualize the relationship between two numeric variables, we can use scatter plots. Given two vector variables of the same size, the function plot() creates a scatter plot.

: Exercise 2.7. Create a scatter plot for visualizing the relationship between words and age. Which variable should be plotted along the x-axis? $⊳$

The relationship between two variables can also be summarized and visualized by a straight line. The equation for a straight line is

y = a + b x

This equation forms the basis for nearly all statistical methods in modern science. $y$ and $x$ in this equation are the variables of interest, and $a$ and $b$ are called the intercept and the slope, respectively. The standard method of estimating such a line that fits the data is called least squares regression. To estimate a least squares regression line from related sets of data points, we use

 
    lm(y ~ x)

You should note that unlike correlation, regression is asymmetric (lm(y~x) and lm(x~y) will produce different results). We put our response variable (outcome or dependent variable) on the left side of the tilde ‘~’, and the predictor (explanatory or independent variable) on the right side. Similar to the sign of the correlation coefficient, the sign of the slope indicates the direction of the relationship. The magnitude of the slope indicates the magnitude of the effect of the predictor on the response variable. Here is how we fit a regression line that reflects the effect of age on the number of words spoken per day:

 
> lm(words ~ age) 
Call: 
lm(formula = words ~ age) 
Coefficients: 
(Intercept)          age 
    25610.1       -468.3

The output indicates that the intercept is $25610.1$ and the slope is $-468.3$. In other words, the fitted regression line can be expressed as

$\mathrm{words} = 25610.1 - 468.3 \times \mathrm{age}$

Like in many other cases we will study, there is no meaningful interpretation of the intercept (according to this equation, one is expected to speak 25610.1 words per day at age $0$). The slope indicates that we expect 468.3 fewer words spoken per day with every additional year of age. To arrive at these conclusions we need to make sure that the ‘model’ above meets certain criteria.
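The fitted line can also be used programmatically. The sketch below repeats the data definitions so that it runs on its own, extracts the coefficients with coef(), and computes the expected number of words at an arbitrarily chosen age of 25. It then fits the regression in the opposite direction to illustrate the asymmetry. The names fit, rev.fit and b are our own, and the example is purely descriptive.

```r
# Data as defined at the beginning of this section.
words.f <- c(17667, 15347, 14401, 5037, 20845,
             11211, 6008, 17140, 13284, 10930)
words.m <- c(5599, 19776, 13961, 10144, 6107,
             16776, 31955, 21140, 5482, 2152)
words <- c(words.f, words.m)
age <- c(24, 31, 28, 21, 29, 29, 25, 32, 30, 31,
         33, 26, 22, 24, 23, 23, 20, 21, 29, 27)

fit <- lm(words ~ age)
b <- coef(fit)   # named vector with elements "(Intercept)" and "age"
b

# Expected words per day at age 25: by hand and with predict().
b["(Intercept)"] + b["age"] * 25
predict(fit, newdata = data.frame(age = 25))

# Swapping response and predictor gives a different line, not its inverse.
rev.fit <- lm(age ~ words)
coef(rev.fit)

# The two slopes are nevertheless linked: their product is the squared
# correlation coefficient.
b["age"] * coef(rev.fit)["words"]
cor(words, age)^2
```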

Exercise 2.8. The command

 
abline(lm(words ~ age))

plots the least-squares regression line over the existing graph. Plot the regression line. Does the regression line agree with the correlation coefficient you have calculated earlier? $⊳$

: Exercise 2.9. Create a scatter plot of words.f against words.m, and also draw the corresponding regression line. $⊳$

Note that this section includes a rather quick and dirty introduction to correlation and regression as exploratory/descriptive tools. Both topics will be revisited in more detail later. Similarly, the above graphs are a very first introduction to making graphics in R. We will explore the graphical capabilities of R as we go, and dedicate a special section to producing informative and pretty graphics.

2.3 Simple inference

The summaries and graphs we have discussed in this section so far help us understand the data at hand better. Often, however, our questions are not about the particular sample we have at hand. We would like to know whether ‘women talk more than men’ in general, not only in this particular sample. We use our sample to estimate some quantities of the population that the sample comes from, e.g., the mean number of words spoken per day for all humans. Naturally, we do not expect two samples taken from the same population to be exactly the same, and our estimates will include some uncertainty due to not having all the information about the population. Inferential statistics is about quantifying this uncertainty and making sure that the estimates we have reflect the population values within certain bounds.

It is worth mentioning a very important aspect of statistical analysis here: the sample you take should be representative of the population you are interested in studying. No statistical technique can fix the effects of wrong sampling. For example, from our ‘words per day’ data, you cannot generalize anything about the number of words spoken per day by very young or very old people, or by people speaking another language.

The simplest inference we can make is about the mean. The standard error of the mean (SE) we have calculated in Exercise 1.8 is an important quantity for assessing the uncertainty of the mean value estimated from a sample.

: Exercise 2.10. Calculate the SE for the complete words data set. Store the result in a variable and display it. $⊳$

In practice, it is more common to report confidence intervals. 95% confidence intervals are the most commonly reported ones. A quick way to calculate an approximate 95% confidence interval that works fine for large samples is $\hat{\mu} \pm 2 \times SE$, where $\hat{\mu}$ stands for the estimated mean.
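The arithmetic of this approximation can be sketched as follows. The vector x holds made-up values (not the tutorial's data), and the names m, se and ci.approx are our own. We assume the usual definition of the standard error of the mean, the sample standard deviation divided by the square root of the sample size. The sample is far too small for the approximation to be trusted; it only shows the steps.

```r
# Hypothetical toy sample.
x <- c(12, 15, 9, 14, 10)

m  <- mean(x)                    # estimated mean: 12
se <- sd(x) / sqrt(length(x))    # standard error of the mean

# Approximate 95% confidence interval: mean plus/minus two standard errors.
ci.approx <- c(lower = m - 2 * se, upper = m + 2 * se)
ci.approx
```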

: Exercise 2.11. Calculate the 95% confidence interval for the words data set with the above approximation. $⊳$

If the population standard deviation is known, or if the data set is large, we can use the normal distribution to calculate confidence intervals. For a 95% confidence interval we need the 2.5% and 97.5% quantiles. These values can be looked up in the tables that most statistics textbooks include or, more easily, calculated using the qnorm() function in R. The number $2$ used in the above approximation is based on the fact that, for the normal distribution, these quantiles lie $1.96$ standard deviations below and above the mean. In most cases, however, we estimate the population standard deviation from the sample. To correct for the uncertainty this introduces, we use the t distribution with $n - 1$ degrees of freedom, whose quantiles are given by the function qt(). For example, qt(0.025, 9) will give you the value corresponding to the lower 2.5% quantile of the t distribution with 9 degrees of freedom.
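The quantile functions can be tried out directly. The first part of the sketch below only calls qnorm() and qt(). The second part applies them to a made-up sample x (hypothetical values, not the tutorial's data) to show how the two kinds of interval are assembled.

```r
# Quantiles of the standard normal distribution.
qnorm(c(0.025, 0.975))             # about -1.96 and 1.96

# Quantiles of the t distribution with 9 degrees of freedom.
qt(c(0.025, 0.975), df = 9)        # about -2.26 and 2.26: wider than the normal

# The t quantile approaches the normal one as the degrees of freedom grow.
qt(0.975, df = c(4, 9, 29, 99))

# Intervals for a hypothetical toy sample.
x  <- c(12, 15, 9, 14, 10)
se <- sd(x) / sqrt(length(x))
mean(x) + qnorm(c(0.025, 0.975)) * se                  # normal-based interval
mean(x) + qt(c(0.025, 0.975), df = length(x) - 1) * se # t-based interval, wider
```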

: Exercise 2.12. Calculate the 95% confidence interval for the words data set using values obtained with qnorm() and qt(). Compare the results to each other and to the values obtained with the approximate calculation above. $⊳$

Confidence intervals are closely related to classical hypothesis testing. If a particular value does not fall within the $95$% confidence interval, we can reject the null hypothesis that the mean is equal to this value. For example, if we calculated a $95$% confidence interval of $[10000, 16000]$, we can reject the (null) hypothesis that the mean of the population is $20000$ at the significance level of $0.05$.
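The correspondence between the two views can be checked numerically. The following sketch uses a made-up sample x and a made-up null value mu0 (both hypothetical, not the tutorial's data). It compares the decision based on the t-based 95% confidence interval with the one based on a two-sided p-value computed by hand with pt(). The names ci, outside, t.stat and p.value are our own.

```r
# Hypothetical toy sample and null value.
x   <- c(12, 15, 9, 14, 10)
mu0 <- 16

n  <- length(x)
se <- sd(x) / sqrt(n)

# t-based 95% confidence interval for the mean.
ci <- mean(x) + qt(c(0.025, 0.975), df = n - 1) * se
ci

# Decision 1: does the null value fall outside the interval?
outside <- mu0 < ci[1] || mu0 > ci[2]

# Decision 2: two-sided p-value from the t statistic.
t.stat  <- (mean(x) - mu0) / se
p.value <- 2 * pt(-abs(t.stat), df = n - 1)

outside          # TRUE
p.value < 0.05   # TRUE: the same decision at the 0.05 level
```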

: Exercise 2.13. Assume that it is known that the average number of words spoken per day by a Dutch speaker is $16000$. Using the confidence intervals you have calculated above for words, test the alternative hypothesis that the number of words spoken per day is different for Dutch and English speakers. $⊳$

The standard way of testing hypotheses like the one above is to perform a t test; in this particular setting, a one-sample t test. The function t.test() in R performs one- and two-sample t tests.

: Exercise 2.14. Perform the same test as in Exercise 2.13 using the function t.test(). TIP: see the built-in help for the specification of the null hypothesis. $⊳$

: Exercise 2.15. A popular book claims that, on average, women speak $20000$ words per day. Can you reject this hypothesis using the data (words.f) above? $⊳$

Exercise 2.16. Finally: do women speak more than men?

Try to answer this question using the data sets in words.f and words.m. Note that our hypothesis is directional, hence, you should use a one-sided test.

What is your conclusion based on the R output? $⊳$
