This assignment has two parts, one on decision trees and one on instance-based learning (IBL). You need 6 of the 10 points to pass.
Part 1: Decision Trees

This part uses the WEKA implementation of C4.5, a decision tree learner. To invoke this learner with, for example, a file called "train.arff" as the training set, type:

  java weka.classifiers.trees.J48 -t train.arff

This constructs a decision tree from train.arff, applies the tree back to train.arff, and then performs a 10-fold cross-validation on train.arff.

1. (1 point) Copy the files zoo.arff, zoo.train.arff and zoo.test.arff from the directory ~tiedeman/ML06 to your home directory. This data contains instances of animals described by their features (hairy, feathered, etc.) and the classes of those animals (e.g. mammal, bird, reptile). Invoke the J48 classifier using zoo.train.arff and zoo.test.arff as the training and test files respectively; an example invocation is sketched at the end of this part. Note that zoo.{train,test}.arff together contain the same data as zoo.arff, i.e. the latter was split to create the training and test sets.
(a) Report the number of correctly and incorrectly classified instances on the test data.
(b) Include in your report a description of the decision tree constructed by J48 and explain how the tree is used to classify a new instance.

2. (2 points) If you invoke a WEKA classifier with a training set but no test set, WEKA automatically performs a 10-fold cross-validation on the training set and reports how many instances are correctly and incorrectly classified when they serve as test data during the cross-validation.
(a) When you run the J48 classifier on the zoo.arff file alone, a 10-fold cross-validation is performed. Report the number of instances correctly and incorrectly classified during the cross-validation.
(b) The "-x [value]" option of the WEKA classifiers specifies how many folds to use in the cross-validation; e.g. "-x 5" requests a 5-fold cross-validation. Suppose you wish to perform a "leave one out" cross-validation on the zoo.arff data. How many folds must you specify to achieve this?
(c) Run the J48 classifier on the zoo.arff data using "leave one out" cross-validation. Report the entire "stratified cross-validation" section of the J48 output.

3. (2 points) Make a copy of the weather.arff file in ~tiedeman/ML06 and modify it to train a classifier with multiple (discrete) target values: the classifier should assign your favorite sport according to the weather conditions (take, for example, "swimming", "badminton" and "none"). Modify the training data according to your settings and create at least 20 training examples; a sketch of such a file appears at the end of this part.
(a) Train the J48 classifier on the original data in weather.arff and on your modified version, and report the number of correctly and incorrectly classified instances during the 10-fold cross-validation.
(b) Include your training data and the corresponding decision tree in your report and comment on the tree's structure.
(c) The "-U" option turns pruning off. What does pruning do, and why? What happens to the tree learned from your data when pruning is "off"? Comment on the differences or explain why there is no difference.
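As a starting point for questions 1 and 2(c), the invocations below are a minimal sketch. They assume a Unix shell with the WEKA classes on your CLASSPATH, and <folds> is a placeholder you must replace with the fold count you worked out in 2(b):

  # question 1: separate training and test files (-T supplies the test set)
  java weka.classifiers.trees.J48 -t zoo.train.arff -T zoo.test.arff

  # question 2(c): cross-validation on zoo.arff alone
  java weka.classifiers.trees.J48 -t zoo.arff -x <folds>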
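For question 3, a minimal sketch of a modified weather file follows. The attribute lines are illustrative only (keep whatever attributes your copy of weather.arff actually defines); the point is that the two-valued class attribute is replaced by a three-valued "sport" attribute. The data rows are invented examples, not prescribed values:

  @relation weather-sport

  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature real
  @attribute humidity real
  @attribute windy {TRUE, FALSE}
  % class attribute with three values instead of the original two
  @attribute sport {swimming, badminton, none}

  @data
  sunny,85,85,FALSE,swimming
  overcast,64,65,TRUE,badminton
  rainy,70,96,FALSE,none
  % ... extend to at least 20 rows of your own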
Part 2: Instance-Based Learning (IBL)

In instance-based learning (IBL), a set of training instances is stored; to classify a new instance, the k closest instances from the training set are selected and the most common class among them is predicted. The WEKA implementation of IBL can be invoked thus:

  java weka.classifiers.lazy.IBk -t <training file> -T <test file>

For this exercise we will be using the $WEKAHOME/data/labor.arff data set. It describes the outcomes of various labour contract negotiations in Canada, plus whether each outcome was good or bad (page 16 of the Witten and Frank textbook has more detail for those interested).

1. (3 points) Copy the $WEKAHOME/data/labor.arff file into your home directory. The "-K" option of the weka.classifiers.lazy.IBk classifier specifies the number of nearest neighbors used when classifying a novel item; the default value is 1. Run the weka.classifiers.lazy.IBk classifier using labor.arff as the training file, performing 10-fold cross-validation, with values of K from 1 to 10 inclusive; a shell-loop sketch for this appears at the end of this part.
(a) Which value(s) of K produced the best results in the cross-validation? Report the number of correctly classified instances for this (these) value(s) of K.
(b) If a value of K > 1 improves performance, what does this tell you about the data?
(c) On what sort of data set would setting K = 1 work best?
(d) Instance-based learning is a type of lazy learning. What does this mean?

2. (2 points) Copy the files xor.arff, xor-4.arff, xor-8.arff, xor-12.arff and xor-16.arff from ~borges/data to your home directory. The xor.arff file contains 40 instances conforming to the exclusive-or function, i.e. each instance is one of "false,false,false", "false,true,true", "true,false,true" or "true,true,false". The xor-{4,8,12,16}.arff files contain the same 40 instances but with {4,8,12,16} extra random attributes added: the final three attributes on each line are the same as in xor.arff, while the rest are randomly set to true or false.
(a) Report the number of correctly and incorrectly classified instances for leave-one-out cross-validation on each of the files, using K = 1; see the second sketch at the end of this part.
(b) How does the performance on xor-16.arff compare with random guessing?
(c) Why does the performance drop when the random attributes are added?
(d) How can you make instance-based learning less sensitive to this problem? Describe the method and explain why it helps.

[Further reading for those interested: "Memory-Based Shallow Parsing", E.F. Tjong Kim Sang. Journal of Machine Learning Research, Special Issue on Machine Learning Approaches to Shallow Parsing, Hammerton J.A., Osborne M., Daelemans W. and Armstrong S. (eds), 559-594, 2002 -- see www.jmlr.org]
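For question 1 of this part, the loop below is a minimal sketch, again assuming a Unix shell with the WEKA classes on your CLASSPATH; -x 10 requests the 10-fold cross-validation explicitly:

  # run IBk for each value of K from 1 to 10
  for k in 1 2 3 4 5 6 7 8 9 10; do
      echo "=== K = $k ==="
      java weka.classifiers.lazy.IBk -t labor.arff -x 10 -K $k
  done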
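For question 2(a), a similar sketch; since each file contains the same 40 instances, leave-one-out cross-validation here means 40 folds:

  # leave-one-out cross-validation with K=1 on each xor file
  for f in xor.arff xor-4.arff xor-8.arff xor-12.arff xor-16.arff; do
      echo "=== $f ==="
      java weka.classifiers.lazy.IBk -t $f -x 40 -K 1
  done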