This section walks you through a set of utilities R provides for working with probability distributions. Probability distributions underlie all statistical analyses we perform. In most cases you will not use these utilities directly, because R provides functions that do the analysis you are interested in without requiring you to work with the distribution functions. Nevertheless, it is important to understand the concepts behind the analyses you are performing. If you read this section and do the exercises here, you will refresh your memory about probability distributions and pick up a few additional tips and tricks about using R.
The probability distributions we have already mentioned, directly or indirectly, include the normal (or Gaussian) distribution, Student’s t distribution, and the F distribution. There are many other theoretical distributions that are interesting for statistical analysis. Most real-life data follows one distribution or another, and in statistics we often assume that the data comes from a certain distribution, typically the normal distribution. Even in non-parametric tests, where we do not assume that the data is distributed according to a theoretical distribution, we use the fact that a relevant statistic is distributed (roughly) according to a well-known probability distribution.
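For each distribution it knows about, R provides four functions that may come in handy at times. Their names share the distribution's abbreviation and differ only in a one-letter prefix: d for the density function (e.g., dnorm()), p for the cumulative distribution function, or CDF (e.g., pnorm()), q for the quantile function, which is the inverse of the CDF (e.g., qnorm()), and r for generating random numbers from the distribution (e.g., rnorm()).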
For example, dnorm(1.96) will give you the value of the density function of the standard normal distribution at x = 1.96. You should remember that the values of density functions of continuous distributions, like those returned by dnorm() or dt(), are not probabilities.
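The point about density values can be checked directly. The following is a minimal sketch; the standard deviation of 0.1 and the integration limits are assumptions chosen only for illustration.

```r
# Density of the standard normal distribution at x = 1.96
d_std <- dnorm(1.96)

# A density value is not a probability: for a narrow normal
# distribution (sd = 0.1, chosen for illustration) the density
# at the mean is larger than 1
d_narrow <- dnorm(0, mean = 0, sd = 0.1)

# Probabilities correspond to areas under the density curve,
# here the area between -1.96 and 1.96
area <- integrate(dnorm, lower = -1.96, upper = 1.96)$value

print(c(d_std = d_std, d_narrow = d_narrow, area = area))
```

The area between -1.96 and 1.96 comes out very close to 0.95, which is why 1.96 appears so often in statistics textbooks.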
The CDFs are how we obtain p-values. You can use these functions instead of the p-value tables that decorate the inner covers or appendices of statistics textbooks.
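As a rough sketch of how a CDF replaces a lookup table, the code below computes tail probabilities for a z statistic. The value of z is a hypothetical observed statistic, not something taken from the text.

```r
# Hypothetical observed z statistic (an assumption for illustration)
z <- 1.96

# Lower-tail probability P(Z <= z) from the standard normal CDF
p_lower <- pnorm(z)

# Upper-tail probability P(Z > z)
p_upper <- pnorm(z, lower.tail = FALSE)

# Two-tailed p-value: probability of a value at least as extreme
# as z in either direction
p_two <- 2 * pnorm(-abs(z))

print(c(p_lower = p_lower, p_upper = p_upper, p_two = p_two))
```

The same pattern applies to other distributions, for example pt() for the t distribution, which additionally needs a df argument.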
If you are lost with the explanation, don’t worry (yet); the following example may help. Assume that we want a p-value of 0.05 for values distributed according to the standard normal distribution. If you run the command qnorm(0.025) in R, you will get the value of the variable (about -1.96) for which you would obtain a p-value of 0.05 (in a two-tailed test).
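The relationship between the quantile function and the CDF can be sketched as follows. The variable names are illustrative, and the significance level of 0.05 is an assumption matching the qnorm(0.025) call above.

```r
# Significance level for a two-tailed test (assumed for illustration)
alpha <- 0.05

# Critical values: each tail holds alpha / 2 of the probability mass
lower <- qnorm(alpha / 2)
upper <- qnorm(1 - alpha / 2)

# qnorm() is the inverse of pnorm(): feeding the quantile back into
# the CDF recovers the tail probability we started with
back <- pnorm(lower)

print(c(lower = lower, upper = upper, back = back))
```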
A probability distribution is specified using a number of parameters. For example, the normal distribution is typically parametrized by its mean and standard deviation, and the t distribution has a single degrees-of-freedom parameter. For example, rnorm(10, mean=10, sd=5) will produce 10 random numbers from a normal distribution with mean 10 and standard deviation 5. If a distribution has standard parameter values, R will use them if you do not specify the parameters. For the normal distribution these are mean=0 and sd=1.
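A short sketch of how parameters and defaults behave across the four function families follows. The seed and the 15 degrees of freedom are arbitrary choices made for illustration.

```r
# Arbitrary seed, used only to make the sketch reproducible
set.seed(1)

# 10 random numbers from a normal distribution with mean 10, sd 5
x <- rnorm(10, mean = 10, sd = 5)

# Without parameters, rnorm() falls back to mean = 0 and sd = 1
z <- rnorm(10)

# The d, p and q functions accept the same parameters:
# P(X <= 15) for mean 10, sd 5 equals P(Z <= 1) for the standard normal
p <- pnorm(15, mean = 10, sd = 5)

# The t distribution has no default for df, so it must be given
# (15 degrees of freedom is an arbitrary example value)
t_crit <- qt(0.975, df = 15)

print(x)
print(c(p = p, t_crit = t_crit))
```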
Note that this is similar to Exercise 6.10, but this time we plot the CDFs instead of the density functions.
The binomial distribution characterizes n trials of an event with one of two outcomes. One of the outcomes occurs with probability p. For example, the binomial distribution with n=10 and p=0.5 characterizes the number of heads (or tails) you get in 10 flips of a fair coin. Every 10 flips you perform will produce a number between 0 and 10 (more likely 5 than 1 or 9, though). The binomial distribution is not only for coin flips. Many interesting quantities are binomially distributed. To name just a few: whether a sentence is judged ‘grammatical’ or ‘ungrammatical’, whether a student passes the exam or not, whether one is diagnosed with dyslexia or not…
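The coin-flip example can be tried out with R's binomial functions, where the number of trials n is passed as size and the probability p as prob. This is a minimal sketch; the seed and the number of simulated runs are arbitrary.

```r
# Arbitrary seed, used only to make the sketch reproducible
set.seed(42)

# Five simulated runs of 10 fair coin flips; each value is the
# number of heads in one run
heads <- rbinom(5, size = 10, prob = 0.5)

# Exact probabilities of getting 5 heads and 1 head in 10 flips
p5 <- dbinom(5, size = 10, prob = 0.5)
p1 <- dbinom(1, size = 10, prob = 0.5)

# The probabilities of all possible outcomes (0 to 10) sum to 1
p_all <- sum(dbinom(0:10, size = 10, prob = 0.5))

print(heads)
print(c(p5 = p5, p1 = p1, p_all = p_all))
```

Because the binomial distribution is discrete, the values returned by dbinom() are actual probabilities, unlike the density values of continuous distributions.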
For large samples, it is said that the binomial distribution can be approximated by the normal distribution.
Plot histograms of increasing numbers of samples from the binomial distribution with parameters n=20 and p=0.5, and determine visually at what sample size the histogram starts to look like the normal distribution.
Note that in this exercise you are simulating multiple runs of an experiment with a fair coin, where we count the number of heads (or tails) in 20 coin flips.
Once you are convinced that the number gives you an approximately normal distribution, draw the normal distribution with the same mean and standard deviation over the histogram.
TIP: specifying the probability=TRUE option to hist() will produce a histogram with ‘relative frequencies’, making it comparable to the probability density function.
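One possible way to combine the tip with a density curve is sketched below. The number of runs (1000), the seed, and the choice of one histogram bar per possible outcome are assumptions; the sketch also uses the theoretical mean n*p and standard deviation sqrt(n*p*(1-p)) rather than the sample estimates, which is only one of the reasonable readings of the exercise.

```r
# Arbitrary seed, used only to make the sketch reproducible
set.seed(123)

n_flips <- 20    # coin flips per run
p_heads <- 0.5   # fair coin
n_runs  <- 1000  # number of simulated runs (assumed for illustration)

# Number of heads in each simulated run
heads <- rbinom(n_runs, size = n_flips, prob = p_heads)

# Histogram on the density scale, one bar per possible outcome
h <- hist(heads, probability = TRUE,
          breaks = seq(-0.5, n_flips + 0.5, by = 1),
          main = "Heads in 20 flips", xlab = "Number of heads")

# Mean and standard deviation of the binomial distribution
mu    <- n_flips * p_heads
sigma <- sqrt(n_flips * p_heads * (1 - p_heads))

# Overlay the normal density with the same mean and sd
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, lwd = 2)
```

Changing n_runs lets you see how the match between the bars and the curve improves as the number of samples grows.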
Another interesting probability distribution, sometimes used when the data are counts of occurrences of an event (in a fixed time period or location), is the Poisson distribution. It has a single parameter (lambda), which corresponds to the rate of occurrence of the event.
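R's Poisson functions follow the same naming scheme as the others. In the sketch below, the rate lambda = 2, the sample size, and the seed are hypothetical values picked for illustration.

```r
# Hypothetical rate of occurrence (an assumption for illustration)
lambda <- 2

# Arbitrary seed, used only to make the sketch reproducible
set.seed(7)

# 1000 simulated counts
counts <- rpois(1000, lambda = lambda)

# Probability of observing no events at all, which equals exp(-lambda)
p0 <- dpois(0, lambda = lambda)

# Probability of observing at most 3 events
p_le3 <- ppois(3, lambda = lambda)

# For a Poisson distribution both the mean and the variance equal
# lambda, so the sample estimates should be close to it
print(c(mean = mean(counts), var = var(counts)))
print(c(p0 = p0, p_le3 = p_le3))
```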
Plot normal Q-Q plots for each distribution on separate graphs on the same canvas.
Repeat the exercise using only a small number of samples from each distribution.
This exercise will give you a better idea of what non-normally distributed data looks like on a Q-Q plot.
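Placing several Q-Q plots on one canvas can be sketched with par(mfrow = ...), qqnorm() and qqline(). The three distributions, their parameters, the sample size, and the seed below are assumptions chosen for illustration; substitute the distributions and sample sizes the exercise asks for.

```r
# Arbitrary seed, used only to make the sketch reproducible
set.seed(2024)

n <- 1000  # samples per distribution (assumed for illustration)

# Example samples; the choice of distributions is hypothetical
samples <- list(
  normal   = rnorm(n),
  binomial = rbinom(n, size = 20, prob = 0.5),
  poisson  = rpois(n, lambda = 2)
)

# Split the canvas into one row with three panels,
# remembering the previous layout
old_par <- par(mfrow = c(1, 3))

for (name in names(samples)) {
  # Sample quantiles against theoretical normal quantiles
  qqnorm(samples[[name]], main = name)
  # Reference line through the first and third quartiles
  qqline(samples[[name]])
}

# Restore the previous canvas layout
par(old_par)
```

Points that follow the reference line closely suggest approximately normal data; systematic curvature or steps away from the line indicate a departure from normality.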