Introduction

You can download the data set for this exercise here.
This data set contains the questions asked for each class presentation. The filenames follow the format A-YYYYMMDD-P, where A is the author ID, YYYYMMDD is the date the paper was presented in class, and P is the paper/presentation id (an arbitrary 0 or 1) distinguishing papers presented on the same date. The files contain all questions asked by each participant for each paper. The author names are anonymized (listed as letters A through Z).

The files in the directory raw/ are processed minimally (bullets/itemizations are removed, some spacing may have changed, but no spelling correction or additional formatting is introduced).

The files under tcf/ include tokenized, POS-tagged and dependency-parsed versions of the files in the WebLicht TCF format. The processing was done through WebLicht (in fact using WebLicht as a service). Detailed information about the tools used for linguistic processing can be found in the file weblicht-en-parsing-chain.xml (see the WebLicht links above for more information on the file formats and the tools used).

For your convenience, the tokenized versions of the files have been extracted from the TCF files and placed under tokens/. The tokens are converted to lowercase; tokenization errors are not fixed. Similarly, the files under postags/ include a version where each token is replaced with its POS tag.

For the class exercises, you do not need any of the above files directly. The following files contain summaries that will be used during this tutorial:

  • document-counts is a tab-separated file containing counts of sentences, tokens and characters for each file, as well as separate columns for author id, date, lecture, and gender of the author.
  • document-token contains a typical document-term matrix where each column refers to a term (token) and each row refers to a document. The numbers in the individual cells are the raw frequency (count) of the term in the corresponding document.
  • The three files document-token-stopw, document-token-content and document-token-punct contain document-term matrices formed by selecting subsets of the tokens. Note that the subsets are formed by a rather quick-and-dirty procedure; the selection probably contains many errors, but it should suffice for our purposes.
  • document-postag is the same as document-token, but instead of tokens, we have POS tag frequencies.

Reading documents into R

As determined in the last class, we will use R for the task. You can load the data above into R using the read.table() function. Here is how to do it:

d <- read.table('document-counts', header=T, quote="")
d.pos <- read.table('document-postag', header=T, quote="")
d.tok <- read.table('document-token', header=T, quote="")
d.tok.stopw <- read.table('document-token-stopw', header=T, quote="")
d.tok.punct <- read.table('document-token-punct', header=T, quote="")
d.tok.content <- read.table('document-token-content', header=T, quote="")

Note that some of the variable (token) names will be normalized by R. Now you have the data in R, and you can process and analyze it in many ways. For the rest of this document, we will refer to the data using the variable names above.

You can investigate each data frame using functions like summary(), head(), and even plot(). But most of them (except d) will be too big for these methods to be useful for quick summaries. If you want to pick a particular variable from a data frame, you can use the notation d$author, which would only show you the author ids from the data frame d.
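
For example (using the column names from the description of document-counts above), quick overviews can be obtained with:

summary(d)        # per-column summaries of the document counts
head(d)           # first few rows of the data frame
table(d$author)   # number of documents per author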

Regression

We will first try regression. The function that fits regression in R is lm(). The syntax is lm(y ~ x); the formula y ~ x should be read as "y is explained by x". This fits a regression model with a single explanatory variable, corresponding to the equation y = a + bx + noise.
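
As a minimal sketch of the syntax with made-up vectors (not the course data):

x <- 1:20                      # hypothetical predictor
y <- 3 + 0.5 * x + rnorm(20)   # hypothetical outcome: intercept 3, slope 0.5, plus noise
m.toy <- lm(y ~ x)             # fit the model y = a + b*x + noise
m.toy                          # prints the estimated coefficients a (intercept) and b (slope)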

Exercise 1: The instructor of the course has the hypothesis that this particular set of authors tends to get less enthusiastic or busier during the course of the semester. Hence, the length of the questions is expected to drop as the semester progresses. Test this hypothesis by fitting a linear regression model to predict the number of tokens in each document from the lecture number (d$lecture). How do you interpret the output of lm()?

Exercise 2: The summary() function, run on the model created by lm(), gives further details about the regression model. Use summary() and reason about its output. Assuming all the model's assumptions hold, do you see a significant change in the length of the questions during the semester? Tip: it is often convenient to save the model fit by lm() to a variable, e.g., m <- lm(...), and then use the variable for further investigation, e.g., summary(m).
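
Continuing the toy example above:

summary(m.toy)   # coefficient table with estimates, standard errors, t- and p-values, plus R-squared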

Exercise 3: Plot the relevant data and the fitted regression line for the exercises above. You can use plot() for plotting the data, and abline() with a model parameter to draw the regression line.
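
Again with the toy example, the plotting calls look like this:

plot(y ~ x)     # scatter plot of the data
abline(m.toy)   # add the fitted regression line to the plot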

Exercise 4: The above exercises demonstrate the typical use of regression in experimental/observational studies. In machine learning (and in computational linguistics) we are often interested in the predictions. Analytically calculate the predicted number of tokens per document for lecture number 10, using the model fit in Exercise 1. You can also use the predict() function in R for this purpose, which will also give a 'confidence interval' for the prediction. You are encouraged to try it.
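
A sketch of the predict() call, assuming the Exercise 1 model is saved in a variable m:

# hypothetical: m <- lm(tokens ~ lecture, data=d) as in Exercise 1
predict(m, newdata=data.frame(lecture=10), interval='confidence')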

Exercise 5: (optional) Fit a model predicting the lecture number from the number of tokens (reversing the predictor and the outcome variable). Compare the result with the original model. What changes do you observe? What does stay the same?

Exercise 6: (optional) Fit a model adding the number of sentences (d$sentences) as another predictor to the model in Exercise 5. Does the model prediction become better? Do you need both predictors, or is one of them enough? Which 'feature' is a better predictor?
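
The formula syntax extends to additional predictors with +; a sketch for this exercise could look like:

m2 <- lm(lecture ~ tokens + sentences, data=d)   # reversed model from Exercise 5, second predictor added
summary(m2)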

Classification

We will work through another toy example for classification with logistic regression. Logistic regression is a binary classification method (it can be extended to multiple classes). In R, you can use glm(). To fit a logistic regression model, we need to specify the parameter family=binomial, as in glm(y ~ x, family=binomial). Of course, your variable y here should be binary. Note that R will arbitrarily convert any variable with two possible values to 0 and 1 internally while fitting the model.
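
As a sketch with made-up data (a two-level factor outcome and a numeric predictor):

z <- factor(rep(c('no', 'yes'), each=10))   # hypothetical binary outcome
w <- c(rnorm(10, 1), rnorm(10, 3))          # hypothetical numeric predictor
glm(z ~ w, family=binomial)                 # R recodes the two factor levels as 0/1 internally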

Exercise 7: Fit a logistic regression model predicting the gender of the author from the number of tokens per document. Produce the relevant summary, and try to interpret the result. Note that R will map F to 0 and M to 1.

Exercise 8: Plot the data and the prediction curve as instructed below (the R commands are given to save time, but you should make the effort to understand what is going on).

# assuming the model fit in Exercise 7 is saved as 'm'
plot(as.integer(gender)-1 ~ tokens, data=d)
x <- seq(0, max(d$tokens))
y <- predict(m, type='response', newdata=data.frame(tokens=x))
lines(x, y)
abline(h=0.5)

Do you think this model is useful?

Exercise 9: (optional, but the answer is almost given above) Based on the model fit earlier, what is the predicted gender for texts with 250 and 350 tokens?
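
A sketch of the corresponding predict() call (again assuming the Exercise 7 model is saved as m):

# predicted probabilities of the class coded as 1 (here M, since F is mapped to 0)
predict(m, type='response', newdata=data.frame(tokens=c(250, 350)))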

Clustering

R provides many methods for clustering. We will only experiment with hierarchical clustering using hclust(). Before clustering, we need to calculate the distances between the documents, which can be done with dist(). Combining the distance calculation, the clustering and the plotting of the resulting dendrogram in a single step, the R command is plot(hclust(dist(d))), where d is a matrix whose rows are the objects (documents) and whose columns are the variables (terms).

Exercise 10: Cluster the documents belonging to the first four authors in the data set based on stop words (d.tok.stopw), punctuation (d.tok.punct) and content words (d.tok.content). The notation d.tok.punct[1:37,] selects the first 37 rows from the matrix d.tok.punct, which include only the first four authors in the set.
Based on your visual inspection, which data set(s) seems to separate the authors better?
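
A minimal sketch for one of the three matrices, using the subset above:

# hierarchical clustering of the first four authors' documents, punctuation features only
plot(hclust(dist(d.tok.punct[1:37,])))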

This is all we will do for clustering in this short tutorial. You are encouraged to experiment more by:

  • Normalizing the data, e.g., using relative frequencies or TF/IDF instead of raw frequencies (a minimal sketch for relative frequencies follows after this list)
  • Trying to cluster the full data set
  • Visualizing the result better by coloring the documents belonging to the same author, or the same topic
  • Using other features, like POS tags or POS tag n-grams
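
For the first suggestion, a sketch of converting raw counts to relative frequencies (assuming all columns of d.tok are term counts):

d.tok.rel <- d.tok / rowSums(d.tok)   # each row now sums to 1 (relative term frequencies)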

Some of these tasks are made easier by R packages like Stylo.