1 Starting R and finding your way around

How you start R depends on your environment and operating system. Typically you click on the relevant icon or menu item; on UNIX-like systems you can also run the command R at the shell prompt.

When you start R, it will print some default information, and wait for your commands.

The first thing you need to get used to (if you are not already) is that R is controlled through a command-line interface. After the initial information, you should see the cursor next to the command prompt ‘> ’. R presents this prompt when it is ready to accept commands.

The command-line interface may feel awkward or old-fashioned at first, but once you get used to it, you will see that it is not as scary as it may seem at first sight, and that it has its advantages in many cases.

1.1 Getting help

The greeting message you see at startup already gives you a few tips. Now type

 
> help()  

including the parentheses but not the command prompt. In this tutorial we will follow this convention: the commands you should type will be displayed after the command prompt, ‘> ’.

If you type the above command and press enter, R will present the built-in documentation about how to get help. Depending on the R configuration on your system, you may get the help text either in the same window or in another window. If you get help in the same window, you can scroll up and down using the arrow keys or the page up/down keys on your keyboard. Pressing ‘q’ will quit help and give the command prompt back. As the greeting message suggests, you could alternatively type

 
> help.start()  

and get the documentation in an external browser. To get help on a particular command, for example pnorm, you can type

 
> help(pnorm)  

but in case you do not remember the exact command, you can search the documentation for a keyword using help.search(). The keyword is given as a quoted character string. For example, if you were wondering which command performs Student’s t-test, you can try

 
> help.search("student")  

R will list the help topics that match, and you can again use help() to read the documentation. Two shortcuts you may appreciate if you use the help facility frequently are ? and ??, which correspond to help() and help.search() respectively. When using ? and ??, you should just type the keyword(s) without parentheses. If the keyword contains spaces, you need to put double or single quotes around it.
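As a quick illustration of the long forms and their shortcuts, here is a minimal sketch, shown without the ‘> ’ prompt. The search phrase "t test" is only an example keyword, and the list of matches you get depends on which packages are installed on your system.

```r
# Long forms: the topic or keyword is passed as an argument.
help("pnorm")            # documentation for a function whose name you know
help.search("t test")    # search the installed documentation for a phrase

# Shortcuts: no parentheses; quotes are needed only if the keyword
# contains spaces.
?pnorm
??"t test"
```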

A tip that you may be happy to hear is that R remembers your previous commands. You can return to the previous commands using the up arrow key on your keyboard, navigate between them with up and down arrow keys, and you can modify and re-run them if you wish.

There are many small tips and tricks you will collect while working with R. One last tip to mention here is that the R command line allows ‘tab completion’. That is, if you type a unique initial segment of a command (or of a variable or file name in the right context) and press the ‘tab’ key on your keyboard, R will try to complete the rest for you. If more than one command matches the initial string you typed, pressing ‘tab’ twice will list all matching commands.

Besides the documentation built in to R, the official web site of the R project contains the reference manual [R Core Team 2014], and many useful documents and pointers to other sources. If all else fails, you can ask your questions on one of the mailing lists (which you can find through the official R site), or on sites like http://stats.stackexchange.com/. Before asking questions on online lists and groups, you should always make an effort to find the answer in the obvious places.

1.2 Doing simple calculations with R

R can be used as a calculator. Try typing a few arithmetic expressions at R’s prompt and check what happens. Listing 1 demonstrates some of the arithmetic operations.


Listing 1: Simple calculations.
 
1> 1 + 2 
2[1] 3 
3> 3 * 4 
4[1] 12 
5> 6 - 3*4 
6[1] -6 
7> 7/(6 + 2) 
8[1] 0.875 
9> 16^(1/2) # 0.5th power (sq. root) of 16 
10[1] 4

The lines that do not start with a command prompt in Listing 1 are the outputs. In line 5, the multiplication takes precedence: the expression is calculated as 6 - 12, not as 3 * 4 (that is, not as (6 - 3) * 4). In line 7, we used parentheses to make sure that the addition is done before the division. If you are familiar with the usual operator precedence in programming languages, R will not surprise you. However, there is no harm in adding a pair of parentheses to make sure you get the result you want.
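The effect of precedence and parentheses can be checked directly. The following lines are a small sketch, shown without the ‘> ’ prompt; the last two expressions are not part of Listing 1 and are added here only to illustrate that exponentiation is evaluated before a leading minus sign.

```r
# Multiplication and division are evaluated before addition and subtraction.
6 - 3 * 4        # evaluated as 6 - (3 * 4), which gives -6
(6 - 3) * 4      # parentheses force the subtraction first, which gives 12

# Exponentiation is evaluated before the unary minus sign.
-2^2             # evaluated as -(2^2), which gives -4
(-2)^2           # parentheses make the base negative, which gives 4
```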

Another thing to note in this listing is that R regards any text after the hash sign (#) until the end of the line as a comment, and ignores it. Comments are of little use in interactive sessions, but they come in handy when you save command sequences (R scripts or programs) in files for future reference.

1.3 Variables

Under the hood, R provides a complete general-purpose programming language (in fact, R is an implementation of the S language, like the commercial S-PLUS), which may be really handy if you have some programming background. In this set of exercises we will not go into programming. However, we will be using variables frequently.

Using variables may save you quite some typing. R can also save the values of your variables when you exit (by default it asks whether to save the workspace), so that you can access the same values when you restart R.

To assign a value to a variable you can use the assignment operator ‘=’ (or, equivalently, ‘<-’, as R experts do). You can then use the variable in calculations, and if you type a variable name and press enter, R will report its value. Listing 2 demonstrates the basic use of variables.


Listing 2: Variable assignment.
 
1> now = 2010 
2> birth.year <- 1988 
3> birth.month = "February" 
4> age = now - birth.year 
5> age 
6[1] 22 
7> now = now + 2 
8> Age = now - birth.year 
9> Age 
10[1] 24 
11> age 
12[1] 22

In line 1 we store the value 2010 in the variable now (yes, now is relative). In line 2 we use the alternative assignment operator <-, which is equivalent to =. In this tutorial we use both somewhat randomly to remind you that you may see R code using either, and that they are equivalent (see the answer to Exercise 1.3 for one more assignment operator).

In lines 2 and 3 we use a dot ‘.’ instead of a space. R variable names cannot contain space characters, and the dot is the conventional replacement for a space in the R community. There are more rules for variable names. For example, they cannot contain many other special characters (like -, +, /) and they cannot start with a number.
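The naming rules can be tried out with a short sketch, shown without the ‘> ’ prompt. The names birth_year, year2 and bad are made up for this illustration, and parse() and try() are used here only so that the invalid name produces a value we can inspect instead of stopping with an error.

```r
# Valid names: letters, digits, dots and underscores, starting with a letter.
birth.year <- 1988
birth_year <- 1988      # underscores are allowed as well
year2 <- 2012           # digits are fine after the first character

# An invalid name is rejected when R reads the command. try() catches
# the error so that the remaining lines still run.
bad <- try(parse(text = "2nd.year = 1989"), silent = TRUE)
inherits(bad, "try-error")   # TRUE: a name cannot start with a digit
```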

Line 3 demonstrates the use of character strings. Character strings must be enclosed in matching double (") or single (') quotes. R supports a variety of operations on strings, which may come in quite handy while working with language data (e.g., corpora). Apart from numbers and strings, there are other types that your variables can take. For example, booleans, which take the values TRUE or FALSE, and categorical variables (or factor variables, as R calls them) are interesting for many statistical tasks. We will return to these types later.
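To give a first impression of these types, here is a brief sketch, shown without the ‘> ’ prompt. The variables is.adult and season are hypothetical and serve only as illustrations; the functions used are standard R functions, but the tutorial itself has not introduced them at this point.

```r
birth.month <- "February"
nchar(birth.month)            # number of characters: 8
toupper(birth.month)          # "FEBRUARY"
paste(birth.month, 1988)      # "February 1988"

is.adult <- TRUE              # a boolean (R calls this type "logical")
class(is.adult)               # "logical"

# A factor stores categorical data as a fixed set of levels.
season <- factor(c("winter", "summer", "winter"))
levels(season)                # "summer" "winter"
```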

Line 4 subtracts the value of birth.year from now and stores the result in a new variable, age. As demonstrated in line 5, if we type the name of a variable, R tells us the value stored in it.

Line 7 may be confusing for non-programmers. This line adds 2 to variable now, and re-assigns the new value to the same variable now. In other words, we increment now by 2.

In line 8, we (re)calculate the age, but beware: case matters in variable names. Age is not the same as age. As a result, we end up with two variables: lowercase age still contains the result calculated in line 4, and uppercase Age contains the result calculated in line 8. The rest of the lines demonstrate this difference.

You should enter this command sequence in R to check if all works as in the listing.

If you’d like to see the user variables, you can use the function ls(), and if you want to get rid of one, for saving space, for keeping your environment clean and tidy or for any other reason, you can use rm().
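A minimal sketch of ls() and rm() follows, shown without the ‘> ’ prompt. The variable scratch is a throwaway name invented for this example, and the %in% operator is used only to test whether a name appears in the output of ls().

```r
scratch <- 42            # hypothetical variable used only for this sketch
ls()                     # names of the user variables, as character strings
"scratch" %in% ls()      # TRUE: the new variable is listed

rm(scratch)              # remove the variable
"scratch" %in% ls()      # FALSE: the variable is gone
```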

1.4 Vectors in R

In statistics, we are generally interested in a sample, or a list of values. For that purpose, R offers a data structure called a vector. Vectors in R are similar to arrays or lists in other programming languages. The important thing to know is that a vector is an ordered collection of values of the same type.

For the exercises in this section, we will use the following data. For a class, students are asked to submit a report of 3,500 to 4,000 words. Ten students turned in reports with the following word counts:

 
3510,3508,3468,3520,3516,3525,3505,3519,3558,3487  

To enter this data into a vector variable we type

 
> nwords = c(3510,3508,3468,3520,3516,3525, 
             3505,3519,3558,3487) 
> nwords 
 [1] 3510 3508 3468 3520 3516 3525 3505 3519 
 [9] 3558 3487  

This example demonstrates the primary way of assigning a vector to a variable. The function c() (short for concatenate) combines its arguments into a vector. As with simple data types, if we type the name of the variable, we get its value displayed (in fact, the simple variables we have been working with are vectors containing a single element). Entering large datasets this way is, at best, cumbersome, and R provides other ways of entering data, to which we will return later.

At this point you should type the above assignment command to create the vector nwords. We will use this data set in the next few sections.
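Two claims made above, that a vector holds values of a single type and that a single value is itself a vector, can be checked with a short sketch, shown without the ‘> ’ prompt. The variable mixed is a made-up example, and the conversion it shows is standard R behaviour rather than something covered elsewhere in this section.

```r
# A single value is a vector of length one.
now <- 2010
length(now)               # 1
is.vector(now)            # TRUE

# c() combines values. Mixing numbers and strings converts all of them
# to character strings, so the vector keeps a single type.
mixed <- c(1, "two", 3)
class(mixed)              # "character"
mixed                     # "1" "two" "3"
```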

R supports mathematical operations between a vector and a scalar value, as well as between two vectors. Standard R functions that normally take a single value can also take vectors as arguments, in which case the function is applied to every member of the vector.

Elements of a vector can be selected by specifying the position of the element(s) between square brackets after the vector's name. For example, if we want to refer to the fourth element of the vector nwords, we type nwords[4] (in fact, as we will see later, one can also select possibly discontinuous ranges of data with this notation).

Listing 3 demonstrates some of these operations.


Listing 3: Some vector operations.
 
1> nwords2 = 2 * nwords 
2> nwords2 
3 [1] 7020 7016 6936 7040 7032 7050 7010 7038 7116 6974 
4> nwords + nwords2 
5 [1] 10530 10524 10404 10560 10548 10575 10515 10557 
6 [9] 10674 10461 
7> log(nwords + nwords2) 
8 [1] 9.261984 9.261414 9.249946 9.264829 9.263692 
9 [6] 9.266248 9.260558 9.264544 9.275566 9.255409 
10> nwords[1] 
11[1] 3510 
12> nwords[10] 
13[1] 3487

The first line multiplies a vector by a scalar value. In other words, all members of the vector are multiplied by 2. Line 4, on the other hand, sums two vectors element by element. In line 7, the function log() is applied to each member of the resulting vector. Finally, lines 10 and 12 select the first and the last element of nwords.
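The same ideas can be explored a little further with the sketch below, shown without the ‘> ’ prompt. It repeats the nwords data so that it can be run on its own. The colon operator (2:4) and the negative index are standard R notation that this tutorial has not introduced yet; they are included only as a preview of selecting several elements at once.

```r
nwords <- c(3510, 3508, 3468, 3520, 3516, 3525,
            3505, 3519, 3558, 3487)

nwords - 3500            # vector and scalar: applied to every element
sqrt(nwords)             # function applied to every element

nwords[4]                # fourth element: 3520
nwords[2:4]              # elements 2 to 4: 3508 3468 3520
nwords[c(1, 10)]         # first and last elements: 3510 3487
nwords[-1]               # everything except the first element
```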

Besides the arithmetic operations and scalar functions applied to vector elements, there is a set of functions that operate on whole vectors. Listing 4 demonstrates some of these functions. Note that the listing already includes a few statistical functions (finally we are getting closer to the point!).


Listing 4: More vector operations.
 
1> length(nwords) 
2[1] 10 
3> sum(nwords) 
4[1] 35116 
5> min(nwords) 
6[1] 3468 
7> max(nwords) 
8[1] 3558 
9> head(nwords,2) # first two elements 
10[1] 3510 3508 
11> tail(nwords,3) # last three elements 
12[1] 3519 3558 3487 
13> sort(nwords) 
14 [1] 3468 3487 3505 3508 3510 3516 3519 3520 3525 3558 
15> range(nwords) 
16[1] 3468 3558 
17> mean(nwords) 
18[1] 3511.6 
19> median(nwords) 
20[1] 3513 
21> summary(nwords) 
22   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
23   3468    3506    3513    3512    3520    3558
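Several of the functions in Listing 4 are related to each other, which the following sketch makes explicit. It is shown without the ‘> ’ prompt and repeats the nwords data so that it can be run on its own; it is only one way of checking the results by hand.

```r
nwords <- c(3510, 3508, 3468, 3520, 3516, 3525,
            3505, 3519, 3558, 3487)

sum(nwords) / length(nwords)    # same value as mean(nwords): 3511.6
c(min(nwords), max(nwords))     # same values as range(nwords): 3468 3558

# With an even number of values, the median is the mean of the two
# middle values of the sorted vector.
sort(nwords)[5:6]               # 3510 3516
mean(sort(nwords)[5:6])         # same value as median(nwords): 3513
```

Note that summary() in Listing 4 shows the mean as 3512 rather than 3511.6, because it prints fewer significant digits than mean() does.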

Exercises

Exercise 1.1. You want to do a ‘Multivariate ANOVA’, but you do not know the exact command that does it. Use help.search() (or ??) to find the name of the command.

Exercise 1.2. The R function shapiro.test() implements the well-known Shapiro-Wilk normality test.
  1. How many of the initial characters do you need to type before R can complete the function name using tab completion?
  2. How many R functions start with sh?

Exercise 1.3.  Perform the following actions in R:
  1. store 20 in the variable x
  2. store 10 in the variable m
  3. store 5 in the variable s
  4. subtract m from x, store the result in t
  5. divide t by s, store the result in variable z

What is the value of the variable z?

Exercise 1.4. Redo the calculations in Exercise 1.3 steps 4 and 5 without the use of the temporary variable t. Use parentheses if necessary to get the same result.

Exercise 1.5.  R supports a wide range of mathematical functions. A function that is often useful in statistical analysis is the logarithm function. Using the shortcut you learned in Section 1, search for the name of the function that returns the logarithm of a given number, and use it to calculate the logarithm of 2.7.

Exercise 1.6. You should have obtained a value close to 1 in Exercise 1.5. This is because R calculates the natural logarithm (base e=2.718282…) by default. Often we use the base-2 logarithm. The function you used above can calculate the base-2 logarithm as well. Using the shortcut you learned in Section 1, read the help for the logarithm function to learn how to specify which base to use. Calculate the base-2 logarithm of 2.7.

Exercise 1.7. After completing the exercises in this section, you should have a set of variables that we will not use in the future. List the variables in your R session, and delete the variables t and x. (If you wish, you can delete all variables except the vectors nwords and nwords2. We will not use the other variables in the rest of the tutorial.) List your variables again to see if you have achieved the desired result.

Exercise 1.8.  Remember that the standard error of the mean can be calculated using the formula
$$\frac{s}{\sqrt{n}}$$

where s is the (estimated) standard deviation, and n is the size of the sample. Calculate the standard error of the mean for the word count data stored in nwords. You can calculate the standard deviation using the function sd(). It is easy to just count the number of elements in nwords, but you can use the length() function to get the number of elements in a vector.

Exercise 1.9. In Exercise 1.3 you calculated a z-score for a single value using a pre-specified mean and standard deviation. More formally, the z-score is calculated with the following formula:
$$z = \frac{x - \mu}{\sigma}$$

Calculate the z-scores of the values in the vector nwords, and assign them to a new vector variable named znwords. Display the resulting vector, its mean and its standard deviation.

Exercise 1.10.  In a (hypothetical) country, four political parties got 36, 35, 8 and 71 seats in a parliament with 150 seats. Sort the numbers of seats in reverse order, with the largest element first and the smallest last.

Exercise 1.11. Use the seat counts in Exercise 1.10 to calculate the percentage of seats for each party. Use a single expression, and do not hard-code the number of seats in the parliament into your expression. You may want to store the data in a variable for convenience.

Exercise 1.12. You realized that the word counts we used in this section included the essay title and the student’s name by mistake. For each essay, we would like to discount word counts by 6,8,6,5,7,5,9,7,10,9. Store these values in a new vector variable wdiff.

Exercise 1.13. Create a backup copy of the data stored in the vector nwords in the vector nwords2 (this may sound like serious work, but you can simply use the assignment operator). Subtract the vector wdiff from the vector nwords and store the result again in the vector nwords. Display the contents of nwords and nwords2.

Exercise 1.14. Find the difference between the means of the values stored in nwords2 and nwords. Is it the same as the mean of the vector wdiff? (In other words, is the difference of the means the mean of the differences?)