Machine Learning


Lab 3, Assignment 2 - Decision Trees + Instance-Based Learning

Note that the instructions below are provided as an example. The final version of the assignment will be available on September 15, 2008.

This assignment has two parts, one on decision trees and one on instance-based learning (IBL). You need 6 (out of 10) points to pass this assignment.

Decision Trees

This assignment uses the WEKA implementation of C4.5, a decision tree
learner. To invoke this learner with, for example, a file called
"train.arff" as the training set, type:

  java weka.classifiers.trees.J48 -t train.arff

This will construct a decision tree from train.arff and then apply it
to train.arff. After that it will perform a 10-fold cross-validation
on train.arff.


1. (1 point) Copy the files zoo.arff, zoo.train.arff and zoo.test.arff
from the directory ~tiedeman/ML06 to your home directory. This data
includes instances of animals described by their features (hairy,
feathered, etc.) and classifications of those animals (e.g. mammal,
bird, reptile).

Invoke the J48 classifier using
zoo.train.arff and zoo.test.arff as the training and testing files
respectively. Note that zoo.{train,test}.arff together contain the
same data as zoo.arff, i.e. the latter was split to create the
training and testing sets.

(a) Report the number of correctly and incorrectly classified
instances on the test data.

(b) Include in your report a description of the decision tree
    constructed by the J48 classifier and explain how the decision tree is
    used to classify a new instance.


2. (2 points) If you invoke a WEKA classifier with a training set but no testing
set, WEKA will automatically perform a 10-fold cross-validation on the
training set and report how many instances are correctly and
incorrectly classified when those instances are used as test data
during the cross-validation.

(a) When you ran the J48 classifier with the zoo.arff file, a
    10-fold cross-validation was performed. Report the number of instances
    correctly and incorrectly classified during the cross-validation.

(b) With the WEKA classifiers, the "-x [value]" option specifies how
    many folds to use in a cross-validation -- e.g. "-x 5" specifies a
    5-fold cross-validation. Suppose you wish to perform a
    "leave one out" cross-validation on the zoo.arff data. How many folds
    must you specify to achieve this?

(c) Run the J48 classifier on the zoo.arff data, using "leave one out"
    cross-validation. Report the entire "stratified cross-validation"
    section from the J48 output.
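As a reminder of what "leave one out" means: it is simply n-fold cross-validation where n equals the number of training instances, so every fold holds out exactly one instance. The sketch below illustrates the fold construction on a toy list of stand-in instances (the actual assignment uses WEKA's "-x" option on zoo.arff, not this code):

```python
# Leave-one-out cross-validation is n-fold CV with one instance per fold.
# Toy stand-in data; the real exercise uses zoo.arff via WEKA's -x option.
instances = [f"animal_{i}" for i in range(7)]

def loo_folds(data):
    """Yield (train, test) pairs, holding out one instance at a time."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:], [data[i]]

folds = list(loo_folds(instances))
print(len(folds))  # one fold per instance
```

Note that the number of folds equals the number of instances, which answers the shape (though not the specific value) of question (b).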

3. (2 points) Make a copy of the weather.arff file in ~tiedeman/ML06 and modify it to train a classifier
with multiple (discrete) target values. The classifier is supposed to assign
your favorite sport according to the weather conditions (take, for example,
"swimming", "badminton" and "none"). Modify the training data to fit your
chosen target values (create at least 20 training examples).
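One possible shape for the modified file is sketched below. The attribute names and values are only an illustration (the weather attributes mirror those in weather.arff; the "sport" attribute and its values are your own choice, and you would of course supply your own 20+ data rows):

```
@relation sport-weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute sport {swimming, badminton, none}

@data
sunny,85,85,FALSE,swimming
rainy,65,70,TRUE,none
```

The key change is that the class attribute (the last one) now enumerates more than two nominal values.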

(a) Train the J48 classifier on the original data in weather.arff and on your
    modified version and report the number of correctly and incorrectly
    classified instances during the 10-fold cross-validation.

(b) Include your training data and the corresponding decision tree in your
    report and comment on its structure.

(c) The "-U" option can be used to turn pruning off.
    What does pruning do and why? What happens to the tree learned from your
    data when pruning is "off"? Comment on the differences or explain why there
    is no difference.


Instance-Based (Memory-Based) Learning

In Instance-Based Learning (IBL), a set of training instances is
stored, and when classifying a new instance, the k closest instances
from the training set are selected and the most common class of those
instances is predicted. The WEKA implementation of IBL can be invoked
thus:

     java weka.classifiers.lazy.IBk -t train.arff -T test.arff

For this exercise we will be using the $WEKAHOME/data/labor.arff data
set. The labor.arff data set describes the outcomes of various labour
contract negotiations in Canada, plus whether the outcomes were good
or bad (page 16 of the Frank and Witten textbook has more detail for
those interested).
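The classification rule described above (find the k nearest stored instances, predict their majority class) can be sketched in a few lines. The overlap distance over nominal attributes used here is an assumption for illustration, not necessarily what WEKA's IBk computes internally:

```python
from collections import Counter

def knn_classify(train, query, k=1):
    """train: list of (features, label) pairs; query: a feature tuple.
    Distance = number of attribute positions where the values differ."""
    dist = lambda a, b: sum(x != y for x, y in zip(a, b))
    nearest = sorted(train, key=lambda inst: dist(inst[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two boolean features; the class happens to track the first one.
train = [((0, 0), "bad"), ((0, 1), "bad"),
         ((1, 0), "good"), ((1, 1), "good")]
print(knn_classify(train, (1, 1), k=3))  # "good"
```

With k=1 the prediction is just the label of the single closest instance; larger k smooths over noisy neighbours by voting.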



1. (3 points) Copy the $WEKAHOME/data/labor.arff file into your home
   directory. The "-K" option of the weka.classifiers.lazy.IBk classifier
   enables you to specify the number of nearest neighbors used when
   classifying a novel item. The default value is 1. Run the
   weka.classifiers.lazy.IBk classifier using "labor.arff" as a training
   file, performing 10-fold cross-validation, and using values of K
   from 1 to 10 inclusive.

   (a) Which of the value(s) used for K produced the best results on
   the cross-validation? Report the number of correctly classified
   instances in the cross-validation for this(these) value(s) of K.

   (b) If using a value of K > 1 improves performance, what does this
   tell you about the data?

   (c) On what sort of data set would setting K = 1 work best? 

   (d) Instance-based learning is a type of lazy learning. What does
   this mean?

2. (2 points) Copy the files xor.arff, xor-4.arff, xor-8.arff,
   xor-12.arff and xor-16.arff from ~borges/data to your home
   directory. The xor.arff file contains 40 instances conforming to
   the exclusive-or function, i.e. each instance is one of
   "false,false,false", "false,true,true", "true,false,true" or
   "true,true,false".

   xor-{4,8,12,16}.arff contain the same 40 instances but with
   {4,8,12,16} extra random attributes added. Thus the final 3
   attributes on each line are the same as the attributes in xor.arff,
   but the rest are randomly set to true or false.
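The effect being probed here can be reproduced in miniature. The sketch below generates xor data with extra random attributes and scores 1-NN under leave-one-out cross-validation; it is an illustrative stand-in, not a reconstruction of the actual xor-*.arff files (boolean values are encoded as 0/1, and Hamming distance is assumed):

```python
import random

random.seed(0)

# The four exclusive-or patterns: (input_a, input_b, class).
XOR = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def make_data(n_noise, copies=10):
    """40 instances: n_noise random bits prepended to the two xor inputs."""
    data = []
    for _ in range(copies):
        for a, b, label in XOR:
            noise = tuple(random.randint(0, 1) for _ in range(n_noise))
            data.append((noise + (a, b), label))
    return data

def loo_1nn_accuracy(data):
    """Leave-one-out accuracy of 1-NN with Hamming distance."""
    hits = 0
    for i, (q, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        pred = min(rest, key=lambda inst: sum(x != y
                                              for x, y in zip(inst[0], q)))[1]
        hits += pred == label
    return hits / len(data)

print(loo_1nn_accuracy(make_data(0)))   # 1.0: each instance has exact duplicates
print(loo_1nn_accuracy(make_data(16)))  # typically much closer to chance
```

On the clean data every held-out instance has identical copies in the training set, so 1-NN is perfect; the random bits dominate the distance and break this.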

   (a) Report the number of correctly and incorrectly classified instances
   for "leave one out" cross-validation on each of the files, using K=1.

   (b) How does the performance on xor-16.arff compare with random
   guessing?

   (c) Why does the performance drop when adding the random
   attributes?

   (d) How can you make instance-based learning less sensitive to
   this problem? (Describe the method and explain why it will help). 

[Further reading for those interested: "Memory-Based Shallow Parsing",
E.F. Tjong Kim Sang. Journal of Machine Learning Research, Special
Issue on Machine Learning Approaches to Shallow Parsing. Hammerton
J.A., Osborne M., Daelemans W. and Armstrong S. (eds), 559--594, 2002
-- see www.jmlr.org]