This section provides a brief glance at some of the basic summarization, visualization and inference techniques. We will return to most of the tools and topics introduced here in the rest of the tutorial.
In this section we will use data from a hypothetical experiment in which the number of words spoken per day was measured for 10 female and 10 male English speakers. Note, again, that the data we will use is fictional; for a real investigation of this question, see Mehl et al. [2007].
The following R commands create four vector variables.
words.f <- c(17667, 15347, 14401, 5037, 20845,
11211, 6008, 17140, 13284, 10930)
words.m <- c(5599, 19776, 13961, 10144, 6107,
16776, 31955, 21140, 5482, 2152)
words <- c(words.f, words.m)
age <- c(24, 31, 28, 21, 29, 29, 25, 32, 30, 31,
33, 26, 22, 24, 23, 23, 20, 21, 29, 27)
The vectors words.f and words.m hold the number of words measured from the female and male participants, respectively. The vector words holds their combination, and age contains the ages of the participants. Organizing data in separate vector variables like this is not what we normally do; we will later use data frames for storing related sets of vectors such as the ones above.
Even for a data set as small and simple as words above, it is difficult to draw conclusions only by looking at the raw data. We generally want to summarize the data at hand to understand it better. In Listing 4, we have already seen how to get some useful summaries in R.
For one-dimensional data, functions such as mean(), median(), var(), sd(), range(), and quantile() produce common summaries.
You are encouraged to try these functions (again) on the words data set.
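For example, the following calls compute common numerical summaries of the words vector defined above (the comments describe what each returns):
mean(words)      # arithmetic mean
median(words)    # middle value
var(words)       # sample variance
sd(words)        # sample standard deviation
range(words)     # smallest and largest values
quantile(words)  # quartiles
summary(words)   # minimum, quartiles, mean, and maximum at once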
These summaries are often useful, indicating the center and spread of the data at hand. However, we frequently want to understand the data in more detail, and in that case graphical summaries are more helpful.
One way to visualize small data sets is with stem-and-leaf plots, which can be produced with the function stem() in R.
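For example, the following prints a stem-and-leaf plot of the words data on the console:
stem(words)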
One of the better ways of inspecting your data is to produce a histogram. You can display a histogram in R using the hist() function.
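For example, the following displays a histogram of the words data:
hist(words)   # frequencies of words-per-day values in equal-width bins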
Yet another method of visualizing your data is plotting box-and-whisker plots (or box plots). Box plots display a summary similar to the five-number summary in a graphical way. In a box plot, the box covers the range between the first and third quartiles (the interquartile range, IQR). The bar in the middle of the box represents the median. The whiskers extend up to 1.5 times the IQR from the box, or to the maximum or minimum values if these fall within this range. Data points more extreme than the whiskers are considered outliers and are plotted separately.
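For example, the following draws side-by-side box plots of the female and male subsets (the names argument only labels the groups):
boxplot(words.f, words.m, names = c("female", "male"))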
The basic measure of relatedness of two continuous variables is their correlation. The correlation of two variables can be calculated using the function cor(). By default, cor() calculates the Pearson product-moment correlation coefficient, known as Pearson's $r$. We will return to the topic of correlation later, and discuss its interpretation and related inference in more detail.
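For example, the correlation between age and the number of words spoken per day can be calculated as follows:
cor(words, age)   # Pearson's r by default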
To visualize the relationship between two numeric variables, we can use scatter plots. Given two vector variables of the same length, the function plot() creates a scatter plot.
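For example, the following plots the number of words per day against age (the first argument goes on the horizontal axis):
plot(age, words)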
The relationship between two variables can also be summarized and visualized by a straight line. The equation for a straight line is
$y = a + bx$
This equation forms the basis of nearly all statistical methods in modern science. $x$ and $y$ in this equation are the variables of interest, and $a$ and $b$ are called the intercept and the slope, respectively. The standard method of estimating a line that fits the data is called least-squares regression. To estimate a least-squares regression line from two related sets of data points, we use
lm(y ~ x)
You should note that, unlike correlation, regression is asymmetric: lm(y ~ x) and lm(x ~ y) will produce different results. We put the response variable (the outcome or dependent variable) before the tilde ‘~’, and the predictor (the explanatory or independent variable) on the right side. Similar to the sign of the correlation coefficient, the sign of the slope indicates the direction of the relationship, and the magnitude of the slope indicates the magnitude of the effect of the predictor on the response variable. Here is how we fit a regression line that reflects the effect of age on the number of words spoken per day:
> lm(words ~ age)
Call:
lm(formula = words ~ age)
Coefficients:
(Intercept) age
25610.1 -468.3
The output indicates that the intercept is 25610.1 and the slope is -468.3. In other words, the fitted regression line can be expressed as
$words = 25610.1 - 468.3 \times age$
As in many other cases we will study, there is no meaningful interpretation of the intercept (according to this equation, one is expected to speak 25610.1 words per day at age 0). The slope indicates that we expect 468.3 fewer words spoken per day with every year of age. To arrive at these conclusions, we need to make sure that the ‘model’ above meets certain criteria.
abline(lm(words ~ age))
plots the least-squares regression line over the existing graph. Plot the regression line. Does the regression line agree with the correlation coefficient you calculated earlier?
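Putting the pieces together, one way to do this is:
plot(age, words)          # draw the scatter plot first
abline(lm(words ~ age))   # then overlay the fitted least-squares line
cor(words, age)           # compare the sign of r with the slope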
Note that this section includes a rather quick and dirty introduction to correlation and regression as exploratory/descriptive tools. Both topics will be revisited in more detail later. Similarly, the above graphs are a very first introduction to producing graphics in R. We will explore the graphical capabilities of R as we go, and dedicate a special section to producing informative and pretty graphics.
The summaries and graphs we have discussed in this section help us understand the data at hand better. Often, however, our questions are not about the particular sample at hand. We would like to know whether ‘women talk more than men’ in general, not only in this particular sample. We use our sample to estimate some quantities of the population that the sample comes from, e.g., the mean number of words spoken per day for all humans. Naturally, we do not expect two samples taken from the same population to be exactly the same, and our estimates will include some uncertainty because we do not have all the information about the population. Inferential statistics is about quantifying this uncertainty and making sure that the estimates we have reflect the population values within certain bounds.
It is worth mentioning a very important aspect of statistical analysis here: the sample you take should be representative of the population you intend to study. No statistical technique can fix the effects of wrong sampling. For example, from our ‘words per day’ sample, you cannot generalize anything about the number of words spoken per day by very young or very old people, nor by people speaking another language.
The simplest inference we can make is about the mean. The standard error of the mean (SE), which we calculated in Exercise 1.8, is an important quantity for assessing the uncertainty of the mean value estimated from a sample.
In practice, it is more common to report confidence intervals, and 95% confidence intervals are the most commonly reported ones. A quick way of calculating approximate 95% confidence intervals that works fine for large samples is $\bar{x} \pm 2 \times SE$, where $\bar{x}$ stands for the estimated mean.
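For example, assuming the words vector from above, the standard error and the quick approximate interval can be computed as:
se <- sd(words) / sqrt(length(words))   # standard error of the mean
mean(words) + c(-2, 2) * se             # approximate 95% confidence interval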
If the population standard deviation is known, or if the data set is large, we can use the normal distribution to calculate the confidence intervals. For 95% confidence intervals, we need to know the lower 2.5% and upper 97.5% quantiles. These values can be looked up in tables that most statistics textbooks include or, more easily, calculated using the qnorm() function in R. The number 2 used in the above approximation is based on the fact that these quantiles fall about 1.96 standard deviations away from the mean for the normal distribution. In most cases, we estimate the population standard deviation from the sample; to correct for the uncertainty introduced by this, we use the t distribution with $n - 1$ degrees of freedom, and we can use the function qt() for this purpose. For example, qt(0.025, 9) will give you the value corresponding to the lower 2.5% quantile of the t distribution with 9 degrees of freedom.
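For example, the following computes the relevant quantiles and, using the se value from above, a t-based 95% confidence interval for the mean of words (n = 20, so 19 degrees of freedom):
qnorm(c(0.025, 0.975))   # normal quantiles: approximately -1.96 and 1.96
qt(0.025, 9)             # lower 2.5% quantile of the t distribution, 9 df
mean(words) + qt(c(0.025, 0.975), length(words) - 1) * se   # t-based 95% CI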
Confidence intervals are related to classical hypothesis testing. If a particular value does not fall into the $(1 - \alpha) \times 100\%$ confidence interval, we can reject the null hypothesis that the mean is equal to this value at significance level $\alpha$. For example, if a value falls outside the 95% confidence interval we calculated, we can reject the (null) hypothesis that the population mean is equal to that value at the significance level of 0.05.
The standard way of testing a hypothesis like the one above is to perform a t test; in this particular setting, a one-sample t test. The function t.test() in R performs one- and two-sample t tests.
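For example, a one-sample t test of the null hypothesis that the population mean is 16000 words per day (an arbitrary value chosen here only for illustration) would look like this:
t.test(words, mu = 16000)   # mu is the hypothesized mean; 16000 is illustrative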
Try to answer this question (whether women talk more than men) using the data sets in words.f and words.m. Note that our hypothesis is directional; hence, you should use a one-sided test.
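A sketch of one way to set up this test, taking ‘women talk more than men’ as the alternative hypothesis (alternative = "greater" tests whether the mean of the first sample is larger):
t.test(words.f, words.m, alternative = "greater")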