Gabmap
Gabmap is a web-based application for dialectometry. For Gabmap, we will follow an existing tutorial.
If you would like more practice, there is also an additional tutorial on creating maps for Gabmap, and another one on "cluster determinants", which are not covered in the tutorial above. If you would like to explore other data, you can find more examples here, as well as another sample data set for Germany.
Stylo R
Stylo R is an R package for performing some common stylometric analyses. There is also a tutorial from the developers of the package on the same page (see the link "stylo_howto.pdf" at the bottom of the page).
We will not repeat the tutorial here, but these are the commands you need to get started:
- To install the stylo package, type install.packages("stylo") at the R command prompt.
- Tell R that you will use the stylo package with the command library("stylo").
- Changing the R working directory with setwd("corpus_directory") saves you from typing the location again in the stylo commands that follow (replace corpus_directory with the location of the corpus on your computer).
- The command stylo() performs a number of unsupervised analyses. By default you are guided by a graphical interface. If you would rather specify the options as parameters, see help("stylo").
- The command classify() performs supervised classification using a few different methods.
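As a rough illustration of specifying the options as parameters instead of using the graphical interface, the sketch below wraps a call to stylo() in a small helper. The helper name run_stylo_analysis and its arguments are hypothetical, introduced only for this example. The stylo() arguments used here (gui, analysis.type, analyzed.features, ngram.size) are described in help("stylo"), where "CA" stands for cluster analysis, "MDS" for multidimensional scaling and "PCR" for PCA on a correlation matrix. The sketch assumes the package is installed and that the working directory is laid out as stylo() expects.

```r
# One-time setup (uncomment to run):
# install.packages("stylo")

# Hypothetical helper: run one stylo() analysis without the graphical interface.
run_stylo_analysis <- function(analysis_type = "CA", ngram_size = 1) {
  if (!requireNamespace("stylo", quietly = TRUE)) {
    stop("Package 'stylo' is not installed; run install.packages(\"stylo\") first.")
  }
  stylo::stylo(
    gui = FALSE,                   # skip the graphical interface
    analysis.type = analysis_type, # e.g. "CA", "MDS" or "PCR"
    analyzed.features = "w",       # use words rather than characters
    ngram.size = ngram_size        # 1 = single words, 3 = word trigrams
  )
}

# Example use (not run); replace corpus_directory with your own location:
# setwd("corpus_directory")
# result <- run_stylo_analysis("CA", ngram_size = 1)
```

Keeping the call in a function makes it easy to repeat the same analysis with a different method or n-gram size, but clicking through the graphical interface works just as well for this exercise.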
In this exercise, you are asked to use the data set here. The data contains English texts from two authors, as well as five texts from "unknown" authors. The texts belonging to the known authors are in files starting with C_ and D_. The files whose authors are unknown start with U_. The corpus is prepared to be used directly with stylo() from the stylo package.
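Before running any analysis, it can help to check how many files fall under each prefix. The sketch below uses only base R and groups file names by the part before the first underscore, which is also the part of the file name that stylo treats as the class label. The file names and the helper group_by_prefix are made up for illustration; point the function at your own corpus directory instead of the temporary demo directory.

```r
# Hypothetical file names, used only to illustrate the C_/D_/U_ naming scheme.
demo_dir <- file.path(tempdir(), "corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c("C_text1.txt", "C_text2.txt", "D_text1.txt", "U_text1.txt")
invisible(file.create(file.path(demo_dir, demo_files)))

# Group the file names in a directory by the part before the first underscore.
group_by_prefix <- function(dir) {
  files <- sort(list.files(dir))
  prefix <- sub("_.*$", "", files)
  split(files, prefix)
}

groups <- group_by_prefix(demo_dir)
print(lengths(groups))  # number of files per prefix
```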
The following are the questions that you should try to answer:
- Using clustering (with the default options), try to determine the authors of the unknown texts (either C or D). Which unknown texts are likely to belong to each author?
- Try clustering with word trigrams (this may take some time depending on your computer). Does the clustering improve in comparison to unigrams?
- Analyze the data using MDS and PCA. Do they agree with the clustering analysis you performed earlier?
- Rearrange the corpus for use with the classify() command (see stylo_howto.pdf for the corpus format), and classify the unknown texts using SVMs. Does the result agree with the unsupervised methods you tried earlier?
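For the last question, one possible way to rearrange the corpus is sketched below in base R. It assumes the default layout documented in help("classify"), where training texts are read from a subdirectory named primary_set and test texts from one named secondary_set; check stylo_howto.pdf for the authoritative description. The helper prepare_classify_dirs is hypothetical and simply copies the C_ and D_ files into the first directory and the U_ files into the second.

```r
# Hypothetical helper: copy known-author texts to "primary_set" and
# unknown-author texts to "secondary_set" under target_dir.
prepare_classify_dirs <- function(corpus_dir, target_dir) {
  files <- list.files(corpus_dir)
  known <- startsWith(files, "C_") | startsWith(files, "D_")
  unknown <- startsWith(files, "U_")

  train_dir <- file.path(target_dir, "primary_set")
  test_dir <- file.path(target_dir, "secondary_set")
  dir.create(train_dir, recursive = TRUE, showWarnings = FALSE)
  dir.create(test_dir, recursive = TRUE, showWarnings = FALSE)

  # The originals are left in place; only copies are made.
  file.copy(file.path(corpus_dir, files[known]), train_dir)
  file.copy(file.path(corpus_dir, files[unknown]), test_dir)

  invisible(list(training = files[known], test = files[unknown]))
}

# Example use (not run), assuming the texts are in "corpus" under the
# working directory and the stylo package is installed:
# prepare_classify_dirs("corpus", ".")
# library("stylo")
# classify(gui = FALSE, classification.method = "svm")
```

Copying rather than moving the files keeps the original corpus usable with stylo() for the earlier questions.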
If you need an introduction to R, there are too many good books and online tutorials to list here. To find the best option for yourself, your search engine of choice is your friend. But here is a shameless plug for a tutorial on statistics with R using linguistic examples.