B Model formulas

Various commands in R accept a notation called a model formula, or simply a formula. The simplest form of a formula is

y ~ x

where x and y are two variables. You can read this as ‘y is explained by x’. The dependent or response variable goes to the left of the tilde ‘~’ and the explanatory or independent variables go to the right. This formula roughly corresponds to the linear equation

y = a + bx
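As a minimal sketch of this correspondence, the following fits the formula with lm() on hypothetical data constructed to lie exactly on the line y = 1 + 2x. The data values and the name fit are assumptions made for illustration only.

```r
# Hypothetical data lying exactly on the line y = 1 + 2x
x <- c(1, 2, 3, 4, 5)
y <- 1 + 2 * x

# 'y is explained by x'
fit <- lm(y ~ x)

# The coefficient named "(Intercept)" estimates a,
# and the coefficient named "x" estimates b.
coef(fit)
```

With these data the estimated intercept should be 1 and the slope 2, up to floating-point error.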

The interpretation is slightly different if the variables are categorical. Note that the intercept, a, is implicit in the model formula. If you like, you can make it explicit by adding + 1 to the formula. If you want to exclude it, e.g., to force the regression line to pass through the origin, you can do so by adding - 1.

If you have multiple explanatory variables, you can include them using the same notation. For example, if you have two explanatory variables x1 and x2, you can specify them like this:

y ~ x1 + x2

The linear equation that corresponds to this notation is y = a + b1x1 + b2x2.
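The effect of + 1, - 1 and additional explanatory variables can be checked by looking at which coefficients a fit returns. The sketch below uses simulated data; the data frame d and its values are hypothetical and serve only to illustrate the notation.

```r
set.seed(1)  # hypothetical simulated data, for illustration only
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(20, sd = 0.1)

# The intercept is implicit ...
names(coef(lm(y ~ x1, data = d)))       # "(Intercept)" "x1"

# ... and writing + 1 gives the same model
names(coef(lm(y ~ x1 + 1, data = d)))   # "(Intercept)" "x1"

# - 1 removes the intercept (line through the origin)
names(coef(lm(y ~ x1 - 1, data = d)))   # "x1"

# Two explanatory variables: y = a + b1x1 + b2x2
names(coef(lm(y ~ x1 + x2, data = d)))  # "(Intercept)" "x1" "x2"
```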

As you may have figured out already, arithmetic operators such as + and - have a different meaning inside a formula. So, if the variable you are interested in is a combination of R variables, you need a special notation. For example, you might be interested in fitting a linear model where y is explained by the sum of x1 and x2. That is, the equation you want to describe is y = a + b × (x1 + x2). In such cases you need to use a special function, I(), to protect the arithmetic operation from being interpreted as part of the formula. In our example, the correct formula notation is

y ~ I(x1 + x2)
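To see the difference I() makes, one can compare the coefficients of the two formulas. This is a small sketch on hypothetical data built so that y = 1 + 2 × (x1 + x2) holds exactly; the data frame d and the names fit_sep and fit_sum are assumptions for illustration.

```r
# Hypothetical data with y = 1 + 2 * (x1 + x2) exactly
d <- data.frame(x1 = c(1, 2, 3, 4, 5),
                x2 = c(2, 1, 4, 3, 6))
d$y <- 1 + 2 * (d$x1 + d$x2)

# Without I(): + separates two terms, so each variable
# gets its own coefficient (three coefficients in total)
fit_sep <- lm(y ~ x1 + x2, data = d)
names(coef(fit_sep))   # "(Intercept)" "x1" "x2"

# With I(): the sum is computed first and treated as a
# single explanatory variable (two coefficients in total)
fit_sum <- lm(y ~ I(x1 + x2), data = d)
names(coef(fit_sum))   # "(Intercept)" "I(x1 + x2)"
```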

If your explanatory variables are categorical, as in ANOVA, you may want to fit a model in which the interaction between variables is important. An interaction is expressed in a formula as a term in which the variable names are joined by colon(s). For example, the formula

y ~ x1 + x2 + x1:x2

expresses a model in which the interaction of x1 and x2 is also included in the model fitting. For two variables, there is only one possible interaction. If you have many variables and want to include all interaction terms, it may be a hassle to type them all separately. For example, the interactions of three variables x1, x2 and x3 consist of the two-way interactions x1:x2, x1:x3, x2:x3 and the three-way interaction x1:x2:x3. To include all interactions, you can use ‘*’ instead of ‘+’. For example, to include three variables and all their interactions in a model formula, we simply type y ~ x1 * x2 * x3.
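One way to check which terms a formula contains, without fitting anything, is to ask terms() for its term labels. The sketch below compares the * shorthand with the fully written-out formula; the helper name labels_of is hypothetical.

```r
# Hypothetical helper: list the terms a formula expands to
labels_of <- function(f) attr(terms(f), "term.labels")

# x1 * x2 is shorthand for x1 + x2 + x1:x2
labels_of(y ~ x1 * x2)
# "x1" "x2" "x1:x2"

# Three variables: main effects, all two-way interactions
# and the three-way interaction
labels_of(y ~ x1 * x2 * x3)
# "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3" "x1:x2:x3"

# The shorthand and the written-out formula give the same terms
identical(
  labels_of(y ~ x1 * x2 * x3),
  labels_of(y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3)
)
```

Because terms() only parses the formula, the variables do not need to exist for this check.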

The formula notation is quite flexible and can express many other forms of ‘models’. The above explanation should be enough to get you started. The R documentation available on CRAN is the main reference, and you can find further information in the many books and documents on R.