10 Probability distributions

This section walks you through a set of utilities R provides for working with probability distributions. Probability distributions underlie all statistical analyses we perform. In most cases you will not use these utilities directly: R provides functions to do the analysis you are interested in without requiring you to work with the distribution functions. Nevertheless, it is important to understand the concepts behind the analyses you are performing. If you read this section and do the exercises, you will refresh your memory about probability distributions, and pick up a few additional tips and tricks about using R.

The probability distributions we have already mentioned, directly or indirectly, include the normal (or Gaussian) distribution, Student’s t distribution, and the F distribution. There are many other theoretical distributions that are of interest for statistical analysis. Most real-life data follows one distribution or another, and in statistics we often assume that the data comes from a certain distribution, typically the normal distribution. Even in non-parametric tests, where we do not assume that the data follows a particular theoretical distribution, we use the fact that a relevant statistic is distributed (roughly) according to a well-known probability distribution.

For each distribution it knows about, R provides four functions that may come in handy at times.

The functions that start with d are the probability density functions. The values of these functions indicate the likelihood of observing the given data point.

For example, dnorm(1.96) will give you the value of the density function at 1.96. You should remember that the values of density functions of continuous distributions, like dnorm() or dt(), are not probability values.
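For instance, the following calls illustrate both points (the commented values are rounded):

```r
# Density of the standard normal at a few points.
dnorm(0)     # about 0.399, the peak of the bell curve
dnorm(1.96)  # about 0.058

# Densities are not probabilities: with a small enough sd,
# the density can exceed 1.
dnorm(0, mean = 0, sd = 0.1)  # about 3.989
```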

The functions that start with p are the cumulative distribution functions (CDF). Cumulative distribution functions return the probability of observing a value lower than the given data point. The value of the CDF corresponds to the area under the density function up to the given value. For example, pnorm(-1.96) will give you the total probability of observing a value less than or equal to -1.96, given the data is distributed according to the standard normal distribution.

The CDFs are the way of obtaining p-values. You can use these functions instead of the p-value tables that decorate the inside covers or appendices of statistics textbooks.
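For example (values rounded):

```r
# Probability of a standard normal value less than or equal to -1.96.
pnorm(-1.96)                     # about 0.025

# A two-tailed p-value for an observed z score of 1.96.
2 * pnorm(-abs(1.96))            # about 0.05

# The upper-tail probability can be requested directly.
pnorm(1.96, lower.tail = FALSE)  # same as 1 - pnorm(1.96)
```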

The functions that start with q are the quantile functions. The quantile function of a distribution is the inverse of its CDF. In other words, given a probability p, the quantile function returns the value x for which the probability of observing a value less than or equal to x is p.

If you are lost with the explanation, don’t worry (yet); the following example may help. Assume that we want to find the value that cuts off a probability of 0.025 in the lower tail of the standard normal distribution. If you run the command qnorm(0.025) in R, you will get the value below which 2.5% of the distribution lies; this is the critical value for a p-value of 0.05 in a two-tailed test.
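Concretely:

```r
# The value below which 2.5% of the standard normal lies.
qnorm(0.025)        # about -1.96

# qnorm() is the inverse of pnorm().
qnorm(pnorm(1.3))   # 1.3, up to floating-point precision
```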

The functions that start with r are sampling functions. They return a vector of the specified size, sampled randomly from the given distribution. For example, rnorm(10) will produce 10 random numbers distributed according to the standard normal distribution. The sampling functions are particularly handy for doing simulations.
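A quick check that the samples behave as expected (the seed value is an arbitrary choice, set only to make the example reproducible):

```r
set.seed(42)      # make the random draws reproducible

x <- rnorm(1000)  # 1000 draws from the standard normal
mean(x)           # close to 0
sd(x)             # close to 1
```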

A probability distribution is specified using a number of parameters. For example, the normal distribution is typically parametrized by its mean and standard deviation, and the t distribution has a single degrees-of-freedom parameter. For example, rnorm(10, mean=10, sd=5) will produce 10 random numbers from a normal distribution with mean 10 and standard deviation 5. If a distribution has default parameter values, R will use them when you do not specify the parameters explicitly. For the normal distribution the defaults are mean=0 and sd=1.
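The defaults mean that the following pairs of calls are equivalent, and the same parameters work across all four function families:

```r
# mean = 0 and sd = 1 are the defaults for the *norm functions.
dnorm(1.5) == dnorm(1.5, mean = 0, sd = 1)  # TRUE

# The same parameters are accepted by p, q and r functions too;
# for instance, the mean of a normal distribution is also its median.
pnorm(10, mean = 10, sd = 5)                # 0.5
```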

Exercise 10.1. For the normal distribution with μ = 3500 and σ = 200,

Exercise 10.2. Plot the cumulative distribution functions of the standard normal distribution and the t distributions with 1, 5 and 20 degrees of freedom on the same graph. Make sure all functions are plotted over the range -4 to 4 as smooth curves (not points), and choose a different color for each function. Use sensible axis labels, and include a legend indicating which line belongs to which distribution.

Note that this is similar to Exercise 6.10, but this time we plot the CDFs instead of the density functions.
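TIP: one way to get smooth curves is curve(), which plots an expression in x and can overlay further curves with add=TRUE. The sketch below (the color choices are arbitrary) shows the pattern for the first two curves only; extending it to the remaining t distributions and adding the legend is part of the exercise:

```r
# Standard normal CDF as a smooth curve over -4..4.
curve(pnorm(x), from = -4, to = 4,
      xlab = "x", ylab = "cumulative probability")

# Overlay the CDF of the t distribution with 1 degree of freedom.
curve(pt(x, df = 1), from = -4, to = 4, col = "red", add = TRUE)
```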

The binomial distribution characterizes n trials of an event with one of two outcomes, where one of the outcomes occurs with probability p. For example, the binomial distribution with n=10 and p=0.5 characterizes the number of heads (or tails) you get in 10 flips of a fair coin. Every 10 flips you perform will produce a number between 0 and 10 (more likely 5 than 1 or 9, though). The binomial distribution is not only for coin flips: many interesting quantities are binomially distributed. To name a few: whether a sentence is judged ‘grammatical’ or ‘ungrammatical’, whether a student passes an exam or not, whether one is diagnosed with dyslexia or not…
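The same four function families exist for the binomial distribution, with the number of trials and the success probability passed as size and prob (values rounded):

```r
# Probability of exactly 5 heads in 10 flips of a fair coin.
dbinom(5, size = 10, prob = 0.5)   # about 0.246

# Probability of at most 2 heads.
pbinom(2, size = 10, prob = 0.5)   # about 0.055

# Simulate 20 runs of the 10-flip experiment.
set.seed(7)
rbinom(20, size = 10, prob = 0.5)  # twenty counts between 0 and 10
```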

Exercise 10.3. Produce a random sample of 200 values from a binomial distribution with n = 100 and p = 0.55, and plot its histogram. Make sure that the histogram is ‘normalized’ so that the area under it is 1, and adjust the axis ranges to contain all possible values. Plot the theoretical density (or, more correctly, mass) function over the histogram.

Exercise 10.4. 

For large samples, the binomial distribution is said to be well approximated by the normal distribution.

Plot histograms of increasing numbers of samples from the binomial distribution with parameters p = 0.5 and size = 20, and determine visually at what sample size the histogram starts to look like the normal distribution.

Note that in this exercise you are simulating multiple runs of an experiment with a fair coin, where we count the number of heads (or tails) in 20 coin flips.

Exercise 10.5. Repeat Exercise 10.4 with p = 0.9. Is the number of samples needed similar to what you decided in Exercise 10.4?

Once you are convinced that the number of samples gives you an approximately normal distribution, draw the normal distribution with the same mean and standard deviation over the histogram.
TIP: specifying the probability=TRUE option to hist() will produce a histogram of ‘relative frequencies’, making it comparable to the probability density function.
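A minimal sketch of this overlay technique, assuming an arbitrary sample size of 1000 (note that the sample statistics are computed from the sample vector, while x inside curve() stands for the plotting grid):

```r
set.seed(3)
heads <- rbinom(1000, size = 20, prob = 0.5)

# Histogram on the density scale, so it is comparable to a density curve.
hist(heads, probability = TRUE, xlab = "number of heads")

# Overlay a normal density with the sample's mean and standard deviation.
curve(dnorm(x, mean = mean(heads), sd = sd(heads)), add = TRUE)
```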

Another interesting probability distribution, sometimes used when the data are counts of occurrences of an event (in a fixed time period or location), is the Poisson distribution. It has a single parameter λ (lambda), which corresponds to the rate of occurrence of the event.
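The usual four functions are available, with the rate passed as lambda (values rounded):

```r
# Probability of observing exactly 2 events when the rate is 3.
dpois(2, lambda = 3)    # about 0.224

# Probability of observing at most 2 events.
ppois(2, lambda = 3)    # about 0.423

# Ten simulated counts with rate 3.
set.seed(11)
rpois(10, lambda = 3)
```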

Exercise 10.6. Draw the probability mass functions of the Poisson distribution with rate parameters 3, 10, and 30 for the range 0 to 50 on the same graph.

Exercise 10.7. Create samples of 10000 items from each of the following distributions:

Plot normal Q-Q plots for each distribution on separate graphs on the same canvas.

Repeat the exercise for only 20 (instead of 10000) samples from each distribution.

This exercise will give you a better idea of what non-normally distributed data looks like on a Q-Q plot.