Machine Learning


Lab 3, Assignment 2 - Decision Trees + Instance-Based Learning

There are three parts in this assignment: one about WEKA input files and feature selection, one about decision trees, and one about instance-based learning (IBL). You need 60% correct answers to pass this assignment. The deadline for the assignment reports is Monday, 23rd September, 23:59. Send the assignments via email to c<DOT>coltekin<AT>rug.nl.

The data sets you will need in this assignment are past-tense-100.dat and past-tense.dat.

0. WEKA File Format (.arff) and Attribute Selection

Extracting features from data, and selecting the relevant features, are important parts of developing a machine learning model. In this exercise, we will first do a very simple conversion of our data set, and then try to find the most valuable features in the list of features we have. Knowledge of a programming language such as Perl or Python, or of standard UNIX tools (grep, cut, paste, sed, etc.), is valuable for this type of task. However, you will only need to use a text editor (such as Notepad on Windows, or Vi or Emacs on UNIX-like systems).

Note that our past tense data is in comma-separated value (CSV) format. The data file contains 12 fields separated by commas. The first field is the orthographic form of the verb, followed by the frequency of the verb. Fields 3 to 11 are representations of the consecutive sounds (phonemes) of each verb. The last field is the class of the verb, e.g. regular or a particular irregular form.

  1. Even though the features are available in the .dat files given above, the files are not in a format recognized by WEKA. Convert past-tense-100.dat to the ARFF data format, and save the new file as past-tense-100.arff. Use only nominal or numeric features. Report (briefly) the modifications you made. (A minimal conversion sketch is given after this list.)

    The lists of verbs, phonemes and classes in the data set are provided for your convenience.

  2. WEKA provides some feature selection methods (see the 'Select attributes' tab in the Explorer GUI, or the command-line routines below). Use the ARFF file generated in the previous step to rank the attributes using:
    1. Information Gain:
          java weka.attributeSelection.InfoGainAttributeEval -s weka.attributeSelection.Ranker -i past-tense-100.arff 
    2. GainRatio:
          java weka.attributeSelection.GainRatioAttributeEval -s weka.attributeSelection.Ranker -i past-tense-100.arff 
    The above commands will rank the features based on how well they classify the given data. You should be familiar with Information Gain and Gain Ratio from decision trees; a short computational sketch of both measures is given after this list.

    Report the differences between the two rankings. Why do the rankings of the 'verb' feature differ for these two measures? In general, the verb feature is not a useful feature for learning. Why?

  3. (optional) For the rest of this assignment we will be working with the full data set, past-tense.dat. In the previous steps, we concluded that the 'verb' feature is useless. Remove this column from the CSV file, and convert the remaining columns to features according to their type. If everything goes well, you should end up with an input file similar to past-tense.arff.
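
The conversion in step 1 can be done entirely in a text editor, but a small script illustrates the idea. Below is a minimal sketch in Python, assuming the field layout described above (verb, frequency, nine phoneme fields, class); the attribute names and the way the nominal value sets are collected are my own choices, not part of the assignment, and values containing special characters may need quoting in a real ARFF file.

    # Hypothetical sketch: convert the past-tense CSV data to ARFF.
    # Assumed field layout (from the description above):
    #   field 1 = verb, field 2 = frequency (numeric),
    #   fields 3-11 = phonemes, field 12 = class; all but frequency are nominal.
    import csv

    with open("past-tense-100.dat") as f:
        rows = list(csv.reader(f))

    names = (["verb", "frequency"]
             + ["phoneme%d" % i for i in range(1, 10)]
             + ["class"])

    with open("past-tense-100.arff", "w") as out:
        out.write("@relation past-tense\n\n")
        for i, name in enumerate(names):
            if name == "frequency":
                out.write("@attribute frequency numeric\n")
            else:
                # Nominal attribute: enumerate the distinct values in this column.
                values = sorted(set(row[i] for row in rows))
                out.write("@attribute %s {%s}\n" % (name, ",".join(values)))
        out.write("\n@data\n")
        for row in rows:
            out.write(",".join(row) + "\n")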
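
For reference, the two measures can also be written down compactly. The following is a minimal, self-contained sketch (not WEKA's code) of how Information Gain and Gain Ratio are computed for a nominal attribute; the toy data at the end is made up purely for illustration.

    # Minimal sketch of Information Gain and Gain Ratio for a nominal attribute.
    from math import log2
    from collections import Counter

    def entropy(labels):
        # Entropy(S) = -sum_c p(c) * log2 p(c) over the class distribution.
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(values, labels):
        # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).
        n = len(labels)
        gain = entropy(labels)
        for v in set(values):
            subset = [l for x, l in zip(values, labels) if x == v]
            gain -= (len(subset) / n) * entropy(subset)
        return gain

    def gain_ratio(values, labels):
        # GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo is
        # the entropy of the attribute's own value distribution.
        split_info = entropy(values)
        return info_gain(values, labels) / split_info if split_info > 0 else 0.0

    # Toy example (made up): one attribute and the class labels of six instances.
    a = ["x", "x", "y", "y", "y", "z"]
    cls = ["+", "+", "-", "+", "-", "-"]
    print(info_gain(a, cls), gain_ratio(a, cls))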

1. Decision Trees

This assignment uses the WEKA implementation of C4.5, a decision tree learner. To invoke this learner, e.g. using a file called "train.arff" as a training set, you can type:

  java weka.classifiers.trees.J48 -t train.arff
This will construct a decision tree from train.arff and then apply it to train.arff, i.e. evaluate it on its own training data. After that, it will perform a 10-fold cross-validation on train.arff.
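
If the 10-fold cross-validation mentioned above is unfamiliar, the following minimal sketch (not WEKA's implementation) shows the general k-fold procedure; the train_fn and predict_fn callables are hypothetical stand-ins for any learner.

    # Minimal sketch of k-fold cross-validation (not WEKA's implementation).
    # train_fn and predict_fn are hypothetical stand-ins for any learner.
    import random

    def k_fold_cv(instances, labels, train_fn, predict_fn, k=10, seed=1):
        data = list(zip(instances, labels))
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]        # k roughly equal parts
        correct = 0
        for i in range(k):
            test = folds[i]                           # hold out one fold
            train = [d for j, f in enumerate(folds) if j != i for d in f]
            model = train_fn([x for x, _ in train], [y for _, y in train])
            correct += sum(predict_fn(model, x) == y for x, y in test)
        return correct / len(data)                    # accuracy over all held-out folds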
  1. Invoke the J48 classifier using past-tense.arff, without pruning. Report the number of correctly and incorrectly classified instances, both for testing on the training set and for 10-fold CV.
    java weka.classifiers.trees.J48 -U -t past-tense.arff
    Note the time it took for building the model and testing it on the training data (available in the WEKA output). We will need this information later.
  2. Repeat the above with pruning (remove the -U option).
    1. Report the correctly and incorrectly classified instances.
    2. Did pruning improve the results in this case? Compare the detailed performance measures (-i option after the classifier) and confusion matrices.
    3. What is pruning used for?
    4. The C4.5 algorithm tries to improve accuracy during post-pruning. Can you relate this to the performance difference with and without pruning, and to the class distribution in the data?
  3. Consider the following data set (from exercise 3.2 in your textbook):
    Instance  Classification  a1  a2
    1         +               T   T
    2         +               T   T
    3         -               T   F
    4         +               F   F
    5         -               F   T
    6         -               F   T
    1. What is the entropy of this collection of training examples with respect to the target function classification?
    2. What is the information gain of a2 relative to these training examples?
    3. Create an input file for WEKA including the training instances above, and train a J48 classifier without pruning. Include in your report a description of the decision tree constructed by the J48.
    4. WEKA uses 10-fold cross-validation by default. With the data above, 10-fold cross-validation is not possible (why?). What is the behavior of WEKA in such cases?
    5. Leave-one-out cross-validation is a special case of k-fold cross-validation. What value of k is needed to perform leave-one-out cross-validation on the data above?

2. Instance-Based (Memory-Based) Learning

In Instance-Based Learning (IBL), a set of training instances is stored, and when classifying a new instance, the k closest instances from the training set are selected and the most common class among them is predicted. The WEKA implementation of IBL can be invoked as follows:
     java weka.classifiers.lazy.IBk -t training_file
The "-K" option of the weka.classifiers.lazy.IBk classifier enables you to specify the number of nearest neighbors used when classifying a novel item. The default value is 1.
  1. Run the weka.classifiers.lazy.IBk classifier with default values, using past-tense.arff as the training file.
    1. Report the number of instances classified correctly and incorrectly, both on the training data and using 10-fold CV.
    2. The kNN classifier depends on a distance metric. When the attributes are numeric, the distance is simply the Euclidean distance. However, in this example most of our attributes take nominal values. How does the weka.classifiers.lazy.IBk classifier calculate the distance between instances with nominal values? Can you think of a better representation of the features for this data set?
    3. Note the time it took for building the model and testing it on the training data. Compare these with the times you noted for the J48 classifier in section 1 above. Explain the differences between the run times of the two algorithms.
    4. Select the 5 top-ranked features according to the GainRatio measure, and discard the rest of the features. Re-run the IBk classifier with the default parameters. Explain the performance difference compared to your results in step 1 above.
    5. Irrelevant attributes in the training data hurt the performance of the kNN classifier. How can you make instance-based learning less sensitive to this problem? (Describe the method and explain why it would help.)
  2. The kNN classifier may give different results depending on the choice of K.
    1. If using a value of K > 1 improves performance what does this tell you about the data?
    2. On what sort of data set would setting K = 1 work best?
    3. For the data set past-tense.arff, find the value of K, 1 ≤ K ≤ 10, that gives the best performance. Note that the weka.classifiers.lazy.IBk implementation can do this for you (hint: check the -X option of the IBk classifier).