6 Graphics

Graphs are important tools for making sense of our data and communicating our results. R provides many graphical routines to produce graphs that visualize data in useful ways. We have already worked with a few basic graphs in R. In this section, we will work with graphics in some detail.

As usual, we will first introduce some new data to work with. The new data set comes from a corpus study of child language acquisition. In a nutshell, the study is about how children may be extracting words out of a continuous data stream. In particular, we are interested in whether a set of statistics is helpful for identifying word boundaries. The data can be found at http://coltekin.net/cagri/R/data/seg.csv. It contains four different statistics, pointwise mutual information (PMI), successor variety (SV), boundary entropy (H) and reverse boundary entropy (RH), calculated at all potential boundary locations for three child-directed utterances. The details are not that important for our purposes here, but if you would like to know more about it, the data comes from Çöltekin [2011].

Exercise 6.1.  Load the CSV file http://coltekin.net/cagri/R/data/seg.csv into a data frame named seg in your R environment.

Make sure that all columns in the data frame have sensible data types. Do appropriate conversions if necessary.

6.1 Basic graphics

We used the plot() function before for scatter plots. This function behaves differently depending on the type of object to be plotted. In the simplest case, you can give it a single vector of numbers, and plot() will plot them against their index, i.e., the integers from one up to the number of elements in the data provided.
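As a minimal sketch of this behavior, the following plots a small made-up vector (the values in v are illustrative only, not taken from the seg data) against its index.

```r
# A small made-up vector; the values are for illustration only
v <- c(2.5, 3.1, 1.8, 4.2, 3.9)

# With a single numeric vector, plot() puts the indices 1, 2, ..., length(v)
# on the x-axis and the values on the y-axis
plot(v)

# The same graph, with the index given explicitly
plot(seq_along(v), v)
```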

Exercise 6.2.  Plot the h values in the order given in the data (against their index value) for only the first utterance in the data frame seg.

Exercise 6.3.  Using the seg data set, display the relationship between pmi and h using a scatter plot. Make sure that pmi is placed on the x-axis, and the color of the points is red.

Plot the linear regression line over the scatter plot in blue.

If you plot a data frame with plot(), it will result in a matrix of scatter plots where each variable is plotted against the other.
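To see what such a matrix of scatter plots looks like before trying it on seg, here is a small sketch that uses R's built-in iris data set instead. The choice of iris and the variable name num are assumptions made only for this illustration.

```r
# iris ships with R; its first four columns are numeric measurements
num <- iris[, 1:4]

# Plotting a data frame draws every variable against every other one
plot(num)

# Columns can also be selected by (quoted) name
plot(iris[, c("Sepal.Length", "Petal.Length")])
```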

Exercise 6.4.  Plot scatter plots of pmi, h, rh and sv against each other on a single graph.

TIP: remember that you can extract the relevant columns of the data frame with the syntax seg[,4:7], or using symbolic names like seg[,c("pmi", "h", "rh", "sv")].

We will see more examples of plotting different types of objects, and of prettifying plots like the ones in the exercises above. For now, we will first practice with some of the basic graphics we have seen before, and build on them slowly towards more advanced and nicer-looking graphs.

Exercise 6.5.  Plot the histograms of pmi and h values. Do the distributions look similar?

Exercise 6.6.  Using normal Q-Q plots, check whether pmi and h are distributed normally.

Exercise 6.7.  Plot side-by-side box plots of pmi for boundary (where boundary == TRUE) and non-boundary (where boundary == FALSE) locations in the first utterance in the seg data.

So far, we have used plot() for plotting relations between two samples. We can also plot mathematical functions. For example, the following plots the well-known bell curve of the Gaussian function.

 
> x <- seq(-4,4,by=0.1) 
> plot(x, dnorm(x))  

We first create a vector variable that holds numbers between -4 and 4, and plot these numbers against dnorm(), which returns the value of the normal density function (more on density functions later). If you run the above commands, you will see a set of dots tracing the standard Gaussian curve. If you want to see lines connecting these points instead, you can pass the option type="l" to the plot() function. Similarly, the option type="b" plots both (lines and points).
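A short sketch of the type option, reusing the same x as in the commands above:

```r
x <- seq(-4, 4, by = 0.1)

# Lines only: the points are connected and not drawn individually
plot(x, dnorm(x), type = "l")

# Both: points are drawn and connected by line segments
plot(x, dnorm(x), type = "b")
```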

Exercise 6.8.  Another interesting distribution is the Student’s t distribution. The density function for the t distribution in R is dt(). Plot the t distribution with degrees of freedom 5 using lines instead of individual points. Make sure that the curve is plotted in green, and the line is three times as thick as the default line width.

It would be interesting to see both the normal and t distributions on the same graph. However, every time we run a plot() command, R clears the earlier plot and initializes a new ‘canvas’. We have already seen the abline() function, which allowed us to draw on an existing plot. There are more functions that draw over an existing graph. The most commonly used ones include lines(), points() and text(), which are used in the rest of this section.
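As a minimal sketch of drawing over an existing plot, the following starts a graph with plot() and then adds a second curve, a single point and a reference line. The second curve (a normal density with sd = 2) is chosen only for illustration.

```r
x <- seq(-4, 4, by = 0.1)

# plot() initializes a new canvas
plot(x, dnorm(x), type = "l")

# The following calls draw on the existing canvas instead of replacing it
lines(x, dnorm(x, sd = 2), col = "red")   # a second curve
points(0, dnorm(0), col = "blue")         # a single point at the peak
abline(v = 0, col = "gray")               # a vertical reference line at x = 0
```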

Exercise 6.9.  Plot the density curves of the standard normal distribution and the t distribution with degrees of freedom 5 on the same graph. Make sure both are drawn with lines, and use a distinct color for each curve.

The colors are useful for distinguishing different lines in a graph. However, the color distinction will be unreliable in black-and-white print. A common practice for identifying different lines on the same graph is to use different line types, or patterns. The commands that draw lines allow you to specify a different pattern using the option lty. For example, setting lty=2 in plot() or lines() will draw a dashed line (the default is 1, which draws a solid line). Alternatively, you can use symbolic names like "dotted", "dashed", etc.
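The sketch below shows both ways of specifying lty, by number and by name. The three normal densities with different standard deviations are made up for the illustration.

```r
x <- seq(-4, 4, by = 0.1)

plot(x, dnorm(x), type = "l", lty = 1)        # solid line (the default)
lines(x, dnorm(x, sd = 1.5), lty = 2)         # dashed line, numeric code
lines(x, dnorm(x, sd = 2), lty = "dotted")    # dotted line, symbolic name
```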

Exercise 6.10.  Plot the standard normal distribution and t distributions with degrees of freedom 1, 5, and 20 on the same graph. Use different colors and line types for each curve.

Similar to lty, which sets the line pattern, you can also customize the type of the points drawn by plot() or points(). The parameter that decides the shape of the points drawn is pch. If you provide a single-character text string to pch, it will use this character instead of the default. Alternatively, you can provide a numeric value to obtain one of a number of predefined shapes. For example, pch=22 will plot a filled square, whose fill color is set with the option bg (see the help text for points() for other symbols).
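One quick way to get an overview of the predefined shapes is to draw them all in a row. The sketch below assumes the numeric codes 0 to 25; the layout (a single row at y = 1, with a row of character symbols below it) is an arbitrary choice for the illustration.

```r
codes <- 0:25

# One point per pch code; bg only affects the fillable symbols 21 to 25
plot(codes, rep(1, length(codes)), pch = codes, bg = "yellow",
     ylim = c(0.5, 1.5), yaxt = "n", xlab = "pch value", ylab = "")

# A single character can be used as the plotting symbol as well
points(codes, rep(0.8, length(codes)), pch = "x")
```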

Exercise 6.11.  Repeat Exercise 6.3, but use ‘small solid circles’ instead of the default hollow circle.

With the text() command, you can place arbitrary text at any point in the x-y plane. In its typical use, it is called like text(x, y, labels), where all arguments are vectors of equal size. Furthermore, you can adjust the position of the labels with the pos and offset options (see the help text for more information).
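A small sketch of text() with made-up coordinates and labels (px, py and labs are hypothetical names used only here):

```r
px   <- c(1, 2, 3)
py   <- c(2, 4, 3)
labs <- c("a", "b", "c")

# Leave some room around the points so that the labels fit
plot(px, py, xlim = c(0, 4), ylim = c(1, 5))

# pos = 3 places each label above its point (1: below, 2: left, 4: right)
text(px, py, labels = labs, pos = 3, col = "red")
```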

Exercise 6.12.  Repeat Exercise 6.9. Place the text strings ‘standard normal’ and ‘t(5)’ on appropriate places on the graph to identify the curves. Use the same colors for the text as the corresponding curve.

Exercise 6.13.  Repeat Exercise 6.2. However, use dotted lines instead of plotting individual points, and place the corresponding phoneme value above each point.

** Use red for the phonemes that correspond to boundaries and blue for word-internal locations.

This exercise, especially the last part, is rather tricky, but you have all the tools at hand to achieve this.

6.2 Labels, axes, legends …

In graphs like the ones in exercises 6.9 and 6.10, we typically include a legend to explain what colors or patterns mean. The command legend() in R adds a legend to an existing plot.
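A minimal sketch of legend() follows. The two curves and their labels are made up; the important part is that the col and lty vectors passed to legend() list the same values, in the same order, as the curves they describe.

```r
x <- seq(-4, 4, by = 0.1)

plot(x, dnorm(x), type = "l", col = "black", lty = 1)
lines(x, dnorm(x, sd = 2), col = "red", lty = 2)

# The first argument gives the position; a keyword like "topright" works,
# as do explicit x and y coordinates
legend("topright",
       legend = c("sd = 1", "sd = 2"),
       col    = c("black", "red"),
       lty    = c(1, 2))
```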

Exercise 6.14.  Add a legend for the graph you produced in Exercise 6.10. Make sure that both line type and color match the lines on the graph.

We improved the graph in Exercise 6.14 quite a bit, and it is almost ready to be printed. However, the y-axis is labeled ‘dnorm(x)’, which is definitely not a typical axis label in printed material. You can specify the axis labels using the xlab and ylab options, and a title on top with the option main.
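For instance, the following sketch plots the cumulative distribution function of the standard normal distribution with pnorm() (chosen only as an example) and sets the title and both axis labels explicitly.

```r
x <- seq(-4, 4, by = 0.1)

plot(x, pnorm(x), type = "l",
     main = "Standard normal CDF",       # title on top of the graph
     xlab = "value",                     # x-axis label
     ylab = "cumulative probability")    # y-axis label
```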

Exercise 6.15.  Repeat Exercise 6.10 but this time set the main title as ‘normal and t distributions’, set the y-axis label to ‘density’, and remove the axis label of the x-axis.

By default, R determines a reasonable x-y region for your plots when you use plot() and the other functions that initialize a new graph. Sometimes, you may want to change the range of the values on the x- or the y-axis. Often you need to do this to make sure that subsequent points() and lines() calls fit into the canvas prepared by a plotting function. Sometimes you may want to extend one of the axes to leave some space for your legend, and sometimes you may want to include a particular reference point, for example the origin (coordinates 0,0), in the graph no matter what data is plotted. To set the ranges that will be visible on a graph, we pass the xlim and ylim parameters to the plotting functions. For example, xlim=c(0, 10) makes the x-axis cover the range between 0 and 10. Note that any graphics drawn outside the region specified by xlim and ylim will be clipped; they will not be visible.
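The sketch below illustrates the first of these cases with two made-up curves. The second curve is taller than the first, so a canvas sized for the first curve alone would clip it. Computing ylim from both curves with range() is one possible way to avoid this.

```r
x  <- seq(-4, 4, by = 0.1)
y1 <- dnorm(x)             # peak of about 0.4
y2 <- dnorm(x, sd = 0.5)   # peak of about 0.8, taller than y1

# Size the canvas so that both curves fit
ylim_both <- range(c(y1, y2))
plot(x, y1, type = "l", ylim = ylim_both)
lines(x, y2, col = "red")
```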

Exercise 6.16.  Repeat Exercise 6.10 but this time make sure that the origin, the point (0,0), is included in the graph.

Exercise 6.17.  Repeat the scatter plot in Exercise 6.3 in two steps. First, plot only the points that correspond to the boundary locations using a plus ‘+’ sign instead of a circle. Next, plot the points that correspond to the non-boundary locations on the same graph using a minus ‘-’ sign. Use different colors in each step. Include a main title, e.g., ‘PMI vs. H’, and make sure that the axis labels are printed in all capital letters. Place an appropriate legend indicating the meanings of the symbols used.

Make sure all points fit into the graph.

6.3 More than one graph on the same canvas

Often, we would like to display more than one graph in the same figure in a publication or presentation. One way to achieve this is to set one of the graphical parameters mfrow or mfcol. The graphical parameters in R are set using the command par(). Once set, these parameters will be in effect for all subsequent graphics-related commands. For mfrow and mfcol we specify a two-element vector, where the first element specifies the number of rows, and the second one specifies the number of columns. For example, the command par(mfrow=c(4,5)) creates a grid of four rows and five columns where the next 20 plot() commands (or others like hist() and boxplot()) will place their output. The difference between mfrow and mfcol is the order of the plots: mfrow fills the specified grid row by row, while mfcol fills the columns first. Figure 1 shows the order of graphs produced with the mfrow and mfcol options.
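A minimal sketch with a grid of one row and two columns follows. The random sample z is made up for the illustration. Saving the value returned by par() and restoring it afterwards is a common idiom, not something the text above requires.

```r
set.seed(1)        # make the random sample reproducible
z <- rnorm(100)

# par() returns the previous values of the parameters it changes
old <- par(mfrow = c(1, 2))

hist(z)            # goes to the left cell
boxplot(z)         # goes to the right cell

par(old)           # restore the previous layout
```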


Figure 1: Order of graphs in 2x2 plots set up by (a) mfrow and (b) mfcol.


Exercise 6.18. In Exercise 6.7, we created box plots of boundaries and non-boundaries for the pmi values.

Plot four graphs on a 2x2 grid on the same canvas each displaying side-by-side box plots for boundary and non-boundary positions for pmi, h, rh and sv values. Set the main title accordingly to identify the graphs.

The par() command sets many other graphical parameters. Some of these parameters, e.g., pch or lwd, serve as defaults to the later graphical commands, and can be overridden by the later commands like plot(). Some others, like mfrow, affect behavior that you cannot set through the individual commands. You are encouraged to skim through the help text for par() to get an impression of the parts of the R graphics that you can customize.
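The following sketch shows a par() default being overridden by an individual command. The values pch = 19 and lwd = 2 are arbitrary choices for the illustration.

```r
# Set new defaults, keeping the old values so they can be restored
old <- par(pch = 19, lwd = 2)

plot(1:5)            # uses the default set above: solid circles
plot(1:5, pch = 2)   # overrides pch for this call only: triangles

par(old)             # restore the previous defaults
```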

6.4 Writing your graphs to external files

Once you are happy with your graph, you will want to include it in your presentations and/or publications. You can, of course, take a screenshot, but in most cases this method produces less than optimal graphics, especially for publishing. R supports a number of formats that produce ‘publication quality’ graphs. The possible file formats include PostScript, PDF, PNG and JPEG. Typically, if you want to have a bitmap file (for the web and possibly for your presentations), you should use PNG graphics. If it is for a publication, you should pick a vector file format, such as PDF (if you are a LaTeX user, you should definitely check tikzDevice, though).

To plot your graphics to an external file, you first need to use the appropriate function to initialize the output ‘device’. The initialization functions are typically named after the (lowercase) name of the graphics format you are interested in, for example pdf(), postscript(), png() or tiff(). These functions differ somewhat depending on the file type you want to produce, but in almost all cases you need to specify a filename and the width and height of the resulting graphics. For bitmap graphics, the width and height are specified in pixels; for vector graphics, they are specified in physical dimensions, e.g., in inches. You should consult the documentation of the functions you want to use. In general, it is important to specify the correct size, since some properties of the resulting graph, such as font sizes and line thickness, will be determined based on the size of the graphics. Once you have initialized the output, the commands you use for producing graphs are the same. When you are done with plotting your graph(s), you should type dev.off(). The resulting graphics will be written to the file you specified during the initialization.
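The whole sequence can be sketched as follows for a PDF file. The file name example.pdf, the temporary directory and the 6 by 3 inch size are assumptions made only for this illustration.

```r
# Hypothetical output file in R's temporary directory
out <- file.path(tempdir(), "example.pdf")

# 1. Initialize the device; for pdf(), width and height are in inches
pdf(out, width = 6, height = 3)

# 2. Plot as usual; the output goes to the file, not to the screen
x <- seq(-4, 4, by = 0.1)
plot(x, dnorm(x), type = "l", ylab = "density")

# 3. Close the device so that the file is written to disk
dev.off()
```

After dev.off(), the file named in out can be opened with any PDF viewer.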

Exercise 6.19.  Plot the histogram and Q-Q plot (including the theoretical line) of the pmi values in the seg data set on the same canvas next to each other (one row, two columns). Make sure that your graphs have sensible titles and axis labels. Use filled triangles for the Q-Q plot instead of the default circle. Write the results to a PDF file suitable for printing on A4 paper with one-inch margins on both sides. The width of A4 paper is 8.27 inches, and you probably do not want to fill the whole page, so you should use an image height of about half the image width.

Exercise 6.20. Repeat Exercise 6.19 twice to produce PNG graphics of different sizes, 1024x512 (width x height) and 640x320. Display and compare the quality of the resulting graphics.

6.5 Additional exercises

Exercise 6.21. Plot line segments passing through the following X-Y coordinates: (0,0), (1,1), (2,3) and (4,4).

Exercise 6.22.  In R, you can draw a pie chart with the function pie(). Plot a pie chart for the data used in Exercise 1.10. Use capital letters ‘A’ to ‘D’ as labels.

Exercise 6.23.  The function barplot() in R produces a bar plot. Repeat Exercise 6.22, but use a bar plot instead of a pie chart.

Exercise 6.24.  Draw sine, sin(), and cosine, cos(), functions in the range [-π, π]. Use a different color for each curve.
TIP: for smoother curves, you need to use seq() to obtain data points with an interval smaller than one, for example 0.1.
TIP2: R defines a standard variable pi with the value of π.

Exercise 6.25. We already know that abline(a, b) draws a straight line whose intercept is a and slope is b.

Using abline(), add horizontal and vertical lines that pass through the origin (0,0) to the graph you produced in Exercise 6.24.

Exercise 6.26. Replicate the graph in Figure 1.