Gabmap

Gabmap is a web-based application for dialectometry. For the Gabmap part of this exercise, we will follow an existing tutorial.

If you would like more practice, there is also an additional tutorial on creating maps for Gabmap and another one on "cluster determinants", a topic not covered in the tutorial above. If you would like to explore other data, you can find more examples here, as well as another sample data set for Germany.

Stylo R

Stylo R is an R package for performing some common stylometric analyses. There is also a tutorial from the developers of the package on the same page (see the link "stylo_howto.pdf" at the bottom of the page).

We will not repeat the tutorial here, but these are the commands you need to get started:

  • To install the stylo package, type install.packages("stylo") at the R command prompt.
  • Tell R that you will use the stylo package with the command library("stylo").
  • Changing the R working directory with setwd("corpus_directory") saves you from typing the path again in the stylo commands that follow (replace corpus_directory with the location of the corpus on your computer).
  • The command stylo() performs a number of unsupervised analyses. By default, you are guided by a graphical interface. If you would like to specify the options as parameters instead, see help("stylo").
  • The command classify() performs supervised classification using a few different methods.
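
Put together, the startup commands above might look like the following short session. This is only a sketch: "corpus_directory" is the same placeholder as in the list and must be replaced with a real path, and the interactive() check is an addition here so that the graphical dialog is only opened from an R prompt.

```r
# A minimal session sketch. "corpus_directory" is a placeholder path.
corpus_directory <- "corpus_directory"

# The graphical dialog needs an interactive R session, so nothing
# below runs when this file is executed as a batch script.
if (interactive()) {
  # Install stylo only if it is missing.
  if (!requireNamespace("stylo", quietly = TRUE)) {
    install.packages("stylo")
  }
  library("stylo")

  # Make the corpus location the working directory.
  setwd(corpus_directory)

  # Start the unsupervised analyses through the graphical interface.
  stylo()
}
```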

In this exercise, you are asked to use the data set here. The data contains English texts from two authors as well as five texts from "unknown" authors. The texts belonging to the known authors are in files starting with C_ and D_. The files whose authors are unknown start with U_. The corpus is prepared to be used directly with stylo() from the stylo package.
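
Before running any analysis, it can help to check how many files each prefix covers. The base R sketch below assumes that the texts sit in a subdirectory named "corpus" of the working directory (the default location stylo() reads from); the helper name author_label is made up for illustration.

```r
# Extract the label in front of the first underscore, e.g. "C_x.txt" -> "C".
author_label <- function(file_names) {
  sub("_.*$", "", file_names)
}

# Assumption: the texts are in ./corpus; adjust the path if they are not.
corpus_files <- list.files("corpus")

# Count the files per label (C, D and U for this data set).
print(table(author_label(corpus_files)))
```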

The following are the questions that you should try to answer:

  • Using clustering (with the default options), try to determine the authors of the unknown texts (either C or D). Which unknown texts are likely to belong to each author?
  • Try clustering with word trigrams (this may take some time depending on your computer). Does the clustering improve in comparison to unigrams?
  • Analyze the data using MDS and PCA. Do the results agree with the clustering analysis you performed earlier?
  • Rearrange the corpus for use with the classify() command (see stylo_howto.pdf for the corpus format), and classify the unknown texts using SVMs. Does the result agree with the unsupervised methods you tried earlier?
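
For the last question, one possible way to rearrange the files is sketched below. It assumes that the original texts are in a subdirectory named "corpus" and uses classify()'s default directory names primary_set (training texts) and secondary_set (texts to classify); the helper name split_corpus is hypothetical, and stylo_howto.pdf remains the reference for the expected corpus format.

```r
# Copy known-author texts (C_, D_) into primary_set/ and the unknown
# texts (U_) into secondary_set/. Both paths are placeholders.
split_corpus <- function(corpus_dir = "corpus", target_dir = ".") {
  files <- list.files(corpus_dir)
  known <- files[grepl("^[CD]_", files)]
  unknown <- files[grepl("^U_", files)]

  train_dir <- file.path(target_dir, "primary_set")
  test_dir <- file.path(target_dir, "secondary_set")
  dir.create(train_dir, recursive = TRUE, showWarnings = FALSE)
  dir.create(test_dir, recursive = TRUE, showWarnings = FALSE)

  file.copy(file.path(corpus_dir, known), train_dir)
  file.copy(file.path(corpus_dir, unknown), test_dir)
  invisible(list(training = known, test = unknown))
}

# Run the split and the SVM classification only when stylo and the
# assumed ./corpus directory are both available.
if (requireNamespace("stylo", quietly = TRUE) && dir.exists("corpus")) {
  split_corpus("corpus", ".")
  stylo::classify(gui = FALSE, classification.method = "svm")
}
```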

The above should be enough for a start, but you are free to play with the many options provided in the stylo package. Using character n-grams and different frequency ranges are some obvious options that come for free with stylo. You can also try to analyze the data using higher-level NLP features, such as POS tags or dependency relations. For some of these tasks, you can use WebLicht to analyze the data further.
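
The variations mentioned in the questions and options above can also be requested as parameters instead of through the graphical dialog. The sketch below collects a few such settings; it assumes the texts are in a subdirectory named "corpus", and the number of most frequent features (mfw.min, mfw.max) is an illustrative value, not a recommendation. See help("stylo") for the full list of options.

```r
# Each entry is a set of arguments for stylo(); gui = FALSE skips the dialog.
analyses <- list(
  # Cluster analysis on word trigrams.
  word_trigram_clusters = list(gui = FALSE, analysis.type = "CA",
                               analyzed.features = "w", ngram.size = 3),
  # Multidimensional scaling with the default features.
  mds = list(gui = FALSE, analysis.type = "MDS"),
  # PCA based on the correlation matrix ("PCV" uses the covariance matrix).
  pca = list(gui = FALSE, analysis.type = "PCR"),
  # Cluster analysis on character trigrams, 200 most frequent features.
  char_trigram_clusters = list(gui = FALSE, analysis.type = "CA",
                               analyzed.features = "c", ngram.size = 3,
                               mfw.min = 200, mfw.max = 200)
)

# Run the analyses only when stylo and the assumed ./corpus directory exist.
if (requireNamespace("stylo", quietly = TRUE) && dir.exists("corpus")) {
  for (name in names(analyses)) {
    message("Running: ", name)
    do.call(stylo::stylo, analyses[[name]])
  }
}
```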

If you need an introduction to R, there are so many good books and online tutorials that it is impossible to list them here. To find the best option for yourself, your search engine of choice is your friend. But here is a shameless plug for a tutorial on statistics with R using linguistic examples.