Lab 3, Assignment 2 - Decision Trees + Instance-Based Learning
There are three parts in this assignment: one about WEKA input
files and feature selection, one about decision trees, and one about
IBL. You need 60% correct answers to pass this assignment. The
deadline for assignment reports is Monday 23rd September, 23:59. Send
the assignments via email to c<DOT>coltekin<AT>rug.nl.
The data sets you will need in this assignment are past-tense-100.dat
and past-tense.dat.
0. WEKA File Format (.arff) and Attribute Selection
Extracting features from data and selecting relevant features are
important parts of developing a machine learning model. In this
exercise, we will first do a very simple conversion of our data set,
and then try to find the valuable features among the features we have.
Knowledge of a programming language such as Perl or Python, or of
standard UNIX tools (grep, cut, paste, sed, etc.), is valuable for this
type of task. However, you will only need a text editor (such as
Notepad on Windows, or Vi or Emacs on UNIX-like systems).
Note that our past tense data is in comma-separated value (CSV)
format. The data file contains 12 fields separated by commas. The
first field is the orthographic form of the verb, followed by the
frequency of the verb. Fields 3 to 11 are representations of the
consecutive sounds (phonemes) of each verb. The last field is the
class of the verb, e.g. regular or a particular irregular form.
- Even though the features are available in the .dat files given
above, the files are not in a format recognized by WEKA. Convert
past-tense-100.dat to the ARFF data format, and save the new file as
past-tense-100.arff. Use only nominal or numeric features. Report
(briefly) the modifications you made. The list of verbs, phonemes and
classes in the data set is provided for your convenience.
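If you prefer to script the conversion rather than edit the file by
hand, the following Python sketch illustrates one possible approach.
It is only a sketch: the attribute names, the choice to treat every
field except the frequency as nominal, and the file names are our own
assumptions, so adapt them to your own conversion decisions.

import csv

# Sketch only: attribute names and types below are assumptions, not
# prescribed by the assignment.
IN_FILE = "past-tense-100.dat"
OUT_FILE = "past-tense-100.arff"

with open(IN_FILE) as f:
    rows = [row for row in csv.reader(f)]

n_fields = len(rows[0])
# Field 1: verb, field 2: frequency, fields 3-11: phonemes, field 12: class.
names = (["verb", "frequency"]
         + ["phoneme%d" % i for i in range(1, n_fields - 2)]
         + ["class"])

with open(OUT_FILE, "w") as out:
    out.write("@relation past-tense-100\n\n")
    for i, name in enumerate(names):
        if name == "frequency":
            out.write("@attribute frequency numeric\n")
        else:
            # Nominal attribute: list the distinct values seen in the data.
            # Values containing spaces or special characters may need quoting.
            values = sorted({row[i] for row in rows})
            out.write("@attribute %s {%s}\n" % (name, ",".join(values)))
    out.write("\n@data\n")
    for row in rows:
        out.write(",".join(row) + "\n")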
- WEKA provides some feature selection methods (see the 'Select attributes'
tab in the Explorer GUI, or the command line routines below). Use the ARFF
file generated in the previous step to rank the attributes using:
- Information Gain:
java weka.attributeSelection.InfoGainAttributeEval -s weka.attributeSelection.Ranker -i past-tense-100.arff
- Gain Ratio:
java weka.attributeSelection.GainRatioAttributeEval -s weka.attributeSelection.Ranker -i past-tense-100.arff
The above commands will rank the features based on how well they
classify the given data. You should be familiar with Information
Gain and Gain Ratio from decision trees.
Report the differences between the two rankings. Why do the rankings
of the 'verb' feature differ for these two measures? In general, the verb
feature is not a useful feature for learning. Why?
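If you want to experiment with the two measures on small examples, the
following Python sketch computes entropy, information gain and gain
ratio for a nominal attribute. The toy data and helper names are our
own illustration, not part of the assignment data.

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, labels, attr_index):
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[attr_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels)
                  if row[attr_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def gain_ratio(rows, labels, attr_index):
    # Split information is the entropy of the attribute's own values.
    split_info = entropy([row[attr_index] for row in rows])
    if split_info == 0:
        return 0.0
    return info_gain(rows, labels, attr_index) / split_info

# Toy example: attribute 0 has a distinct value for every instance,
# attribute 1 takes only two values.
rows = [["v1", "a"], ["v2", "a"], ["v3", "b"], ["v4", "b"]]
labels = ["reg", "reg", "irreg", "irreg"]
for i in (0, 1):
    print(i, info_gain(rows, labels, i), gain_ratio(rows, labels, i))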
- (optional) For the rest of this assignment we will be working
with the full data set, past-tense.dat. In the previous steps, we
concluded that the 'verb' feature is useless. Remove this column from
the CSV file, and convert the remaining columns to features according
to their type. If everything goes fine, you should end up with an
input file similar to past-tense.arff.
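If you script this step, dropping the first column can be done with
the cut tool mentioned above or with a few lines of Python; the file
names below are assumptions, so adapt them to your setup.

import csv

# Sketch only: write a copy of the data with the first field (the verb) removed.
with open("past-tense.dat") as f, \
     open("past-tense-noverb.dat", "w", newline="") as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        writer.writerow(row[1:])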
1. Decision Trees
This assignment uses the WEKA implementation of C4.5, a decision tree
learner. To invoke this learner, e.g. using a file called "train.arff"
as the training set, you can type:
java weka.classifiers.trees.J48 -t train.arff
This will construct a decision tree from train.arff and then apply it
to train.arff. After that it will perform a 10-fold cross-validation
on train.arff.
- Invoke the J48 classifier using past-tense.arff, without pruning.
Report the number of correctly and incorrectly classified
instances both for testing on the training set and for 10-fold CV.
java weka.classifiers.trees.J48 -U -t past-tense.arff
Note the time it took for building the model and testing it
on the training data (available in the WEKA output). We will need this
information later.
- Repeat the above with pruning (remove the -U option).
- Report the correctly and incorrectly classified instances.
- Did pruning improve the results in this case? Compare the
detailed performance measures (-i option after the classifier)
and confusion matrices.
- What is pruning used for?
- The C4.5 algorithm tries to improve accuracy during post-pruning.
Can you relate this to the performance difference with and without
pruning, and to the class distribution in the data?
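For example, the pruned run together with the detailed per-class
measures and the confusion matrix can be produced by combining the
options mentioned above:

java weka.classifiers.trees.J48 -t past-tense.arff -i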
- Consider the following data set (from exercise 3.2 in your textbook):
Instance | Classification | a1 | a2
   1     |       +        | T  | T
   2     |       +        | T  | T
   3     |       -        | T  | F
   4     |       +        | F  | F
   5     |       -        | F  | T
   6     |       -        | F  | T
- What is the entropy of this collection of training examples with respect to
the target function classification?
- What is the information gain of a2 relative to these training examples?
- Create an input file for WEKA including the training instances above, and
train a J48 classifier without pruning. Include in your report a description
of the decision tree constructed by the J48.
- WEKA uses 10-fold cross-validation by default. With the data above, 10-fold
cross-validation is not possible (why?). What is the behavior of WEKA
in such cases?
- Leave-one-out cross-validation is a special case of k-fold cross-validation.
What value of k gives leave-one-out cross-validation for the data above?
2. Instance-Based (Memory-Based) Learning
In Instance-Based Learning (IBL), a set of training instances is
stored, and when classifying a new instance, the k closest instances
from the training set are selected and the most common class among
them is predicted. The WEKA implementation of IBL can be invoked as
follows:
java weka.classifiers.lazy.IBk -t training_file
The "-K" option of the weka.classifiers.lazy.IBk classifier
enables you to specify the number of nearest neighbors used when
classifying a novel item. The default value is 1.
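To make the procedure concrete, here is a minimal k-NN sketch in
Python. It is a generic illustration only: it uses plain Euclidean
distance on numeric feature vectors (how IBk handles nominal
attributes is one of the questions below), and the toy data is our
own.

from collections import Counter
from math import dist

# Sketch only: a tiny numeric training set, not the past-tense data.
train = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"),
         ((5.0, 8.0), "B"), ((6.0, 9.0), "B")]

def knn_classify(x, train, k=1):
    # Sort stored instances by distance to x and vote among the k closest.
    neighbors = sorted(train, key=lambda item: dist(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.2, 1.9), train, k=3))   # prints "A"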
- Run the weka.classifiers.lazy.IBk classifier with default values using
past-tense.arff as the training file.
- Report the number of instances classified correctly and
incorrectly, both on the training data and using 10-fold CV.
- The kNN classifier depends on a distance metric. When
the attributes are numeric, the distance is simply the
Euclidean distance. However, in this example most of our
attributes are nominal. How does the weka.classifiers.lazy.IBk
classifier calculate the distance between instances with
nominal values? Can you think of a better representation of
the features for this data set?
- Compare the time taken to build the model and to test it
on the training data with the times you noted for J48 in part 1
above. Explain the differences between the run times of the two
algorithms.
- Select the 5 top-ranked features according to the Gain Ratio measure,
and discard the rest of the features. Re-run the IBk classifier with
the default parameters. Explain the performance change compared to the
first IBk run above.
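One way to discard the unwanted attributes is WEKA's Remove filter.
The attribute indices below are placeholders only (they are not the
result of the ranking); replace them with the indices of your own
top-5 attributes plus the class attribute. The -V option inverts the
selection so that the listed attributes are kept rather than removed.

java weka.filters.unsupervised.attribute.Remove -V -R 2,3,4,5,6,12 -i past-tense.arff -o past-tense-top5.arff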
- Irrelevant attributes in the training data hurt the performance
of the kNN classifier. How can you make instance-based learning less
sensitive to this problem? (Describe the method and explain why it
will help.)
- The kNN classifier may give different results depending on the choice of K.
- If using a value of K > 1 improves performance, what does this
tell you about the data?
- On what sort of data set would setting K = 1 work best?
- For the data set past-tense.arff,
find the K value, 1 ≤ K ≤ 10, that gives the best
performance. Note that the
weka.classifiers.lazy.IBk implementation can do that for you
(hint: check the -X option of the IBk classifier).
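For example, the following command asks IBk to search for the best
number of neighbours between 1 and the value given with -K, using
hold-one-out evaluation on the training data:

java weka.classifiers.lazy.IBk -K 10 -X -t past-tense.arff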