To evaluate the same classifier on a new batch of test data, you load it back using -l instead of rebuilding it. If the classifier can be updated incrementally, you can provide both a training file and an input file, and Weka will load the classifier and update it with the given training instances.
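For example, assuming a model was previously saved to the file weather.model with the -d option, a command along these lines (the file names are placeholders) evaluates it on a fresh test file:

    java weka.classifiers.trees.J48 -l weather.model -T new_weather.arff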
If you wish only to assess the performance of a learning scheme, use -o to suppress output of the model. Use -i to see the performance measures of precision, recall, and F-measure (Section 5.7). Use -k to compute information-theoretical measures from the probabilities derived by a learning scheme (Section 5.6).
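For example, a command of this form (the file names are placeholders) trains on one file, evaluates on another, and reports these measures while suppressing the model itself:

    java weka.classifiers.trees.J48 -t training.arff -T test.arff -o -i -k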
Weka users often want to know which class values the learning scheme actually predicts for each test instance. The -p option prints each test instance's number, its class, the confidence of the scheme's prediction, and the predicted class value. It also outputs attribute values for each instance and must be followed by a specification of the range (e.g., 1-2); use 0 if you don't want any attribute values. You can also output the cumulative margin distribution for the training data, which shows how the distribution of the margin measure (Section 7.5, page 324) changes with the number of boosting iterations. Finally, you can output the classifier's source representation, and a graphical representation if the classifier can produce one.
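For example, the following command (file names are placeholders) prints a prediction line for every test instance; the 0 indicates that no attribute values should be echoed:

    java weka.classifiers.trees.J48 -t training.arff -T test.arff -p 0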
Scheme-specific options
Table 13.2 shows the options specific to J4.8. You can force the algorithm to use the unpruned tree instead of the pruned one. You can suppress subtree raising, which increases efficiency. You can set the confidence threshold for pruning and the minimum number of instances permissible at any leaf; both parameters were described in Section 6.1 (page 199). As well as C4.5's standard pruning procedure, reduced-error pruning (Section 6.2, pages 202-203) can be performed. The -N option governs the size of the holdout set: the dataset is divided equally into that number of parts and the last is held out (default value 3). You can smooth the probability estimates using the Laplace technique, set the random number seed for shuffling the data when selecting a pruning set, and store the instance information for future visualization. Finally, to build a binary tree instead of one with multiway branches for nominal attributes, use -B.

Table 13.2  Scheme-specific options for the J4.8 decision tree learner.

    Option   Function
    -U       Use unpruned tree
    -C       Specify confidence threshold for pruning
    -M       Specify minimum number of instances in any leaf
    -R       Use reduced-error pruning
    -N       Specify number of folds for reduced-error pruning; use one fold as pruning set
    -B       Use binary splits only
    -S       Don't perform subtree raising
    -L       Retain instance information
    -A       Smooth the probability estimates using Laplace smoothing
    -Q       Seed for shuffling data
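For example, a command of this form (the training file name is a placeholder) grows a J4.8 tree with reduced-error pruning over 5 folds and at least 10 instances per leaf:

    java weka.classifiers.trees.J48 -t training.arff -R -N 5 -M 10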
Chapter 14
Embedded Machine Learning
When invoking learning schemes from the graphical user interfaces or the command line, there is no need to know anything about programming in Java. In this section we show how to access these algorithms from your own code. In doing so, the advantages of using an object-oriented programming language will become clear. From now on, we assume that you have at least some rudimentary knowledge of Java. In most practical applications of data mining the learning component is an integrated part of a far larger software environment. If the environment is written in Java, you can use Weka to solve the learning problem without writing any machine learning code yourself.
14.1 A simple data mining application
We present a simple data mining application for learning a model that classifies text files into two categories, hit and miss. The application works for arbitrary documents: we refer to them as messages. The implementation uses the
StringToWordVector filter mentioned in Section 10.3 (page 399) to convert the messages into attribute vectors in the manner described in Section 7.3. We assume that the program is called every time a new file is to be processed. If the Weka user provides a class label for the file, the system uses the file for training; if not, it classifies the file. The decision tree classifier J48 is used to do the work.
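As a sketch of what the filter step involves, converting a dataset whose string attribute holds the messages into a dataset of word-based attributes takes only a few calls (the variable names here are illustrative):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(data);   // determine the dictionary from the data
    Instances wordVectors = Filter.useFilter(data, filter);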
14.2 Going through the code
Figure 14.1 shows the source code for the application program, implemented in a class called MessageClassifier. The command-line arguments that the main() method accepts are the name of a text file (given by -m), the name of a file holding an object of class MessageClassifier (-t), and, optionally, the classification of the message in the file (-c). If the user provides a classification, the message will be converted into an example for training; if not, the MessageClassifier object will be used to classify it as hit or miss.
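For example, an invocation along these lines (the file names are placeholders, and weka.jar must be on the classpath) adds the given message to the training data as a hit:

    java MessageClassifier -m message.txt -t messagecl.model -c hit

Omitting the -c option would instead classify message.txt using the saved model.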
main()
The main() method reads the message into a Java StringBuffer and checks whether the user has provided a classification for it. Then it reads a MessageClassifier object from the file given by -t and creates a new object of class MessageClassifier if this file does not exist. In either case the resulting object is called messageCl. After checking for illegal command-line options, the program calls the method updateData() to update the training data stored in messageCl if a classification has been provided; otherwise, it calls classifyMessage() to classify it. Finally, the messageCl object is saved back into the file, because it may have changed. In the following sections, we first describe how a new MessageClassifier object is created by the constructor MessageClassifier() and then explain how the two methods updateData() and classifyMessage() work.
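A minimal sketch of the load-and-save logic just described, assuming the variable objectFileName holds the -t argument and that MessageClassifier implements java.io.Serializable (exception handling omitted):

    MessageClassifier messageCl;
    File objectFile = new File(objectFileName);
    if (objectFile.exists()) {
      // Deserialize the previously saved classifier
      ObjectInputStream in =
        new ObjectInputStream(new FileInputStream(objectFile));
      messageCl = (MessageClassifier) in.readObject();
      in.close();
    } else {
      messageCl = new MessageClassifier();
    }
    // ... updateData() or classifyMessage() is called here ...
    // Save the object back because it may have changed
    ObjectOutputStream out =
      new ObjectOutputStream(new FileOutputStream(objectFile));
    out.writeObject(messageCl);
    out.close();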
MessageClassifier()
Each time a new MessageClassifier is created, objects for holding the filter and
classifier are generated automatically. The only nontrivial part of the process is
creating a dataset, which is done by the constructor MessageClassifier(). First the
dataset’s name is stored as a string. Then an Attribute object is created for each
attribute, one to hold the string corresponding to a text message and the other
for its class. These objects are stored in a dynamic array of type FastVector.
(FastVector is Weka’s own implementation of the standard Java Vector class and
is used throughout Weka for historical reasons.)
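A sketch of the kind of setup the constructor performs (the dataset and attribute names here are illustrative):

    FastVector attributes = new FastVector(2);
    // Passing null as the value list creates a string attribute
    attributes.addElement(new Attribute("Message", (FastVector) null));
    // The class attribute is nominal, with the two values miss and hit
    FastVector classValues = new FastVector(2);
    classValues.addElement("miss");
    classValues.addElement("hit");
    attributes.addElement(new Attribute("Class", classValues));
    // Create an initially empty dataset and mark the last attribute as the class
    Instances data = new Instances("MessageClassificationProblem", attributes, 100);
    data.setClassIndex(data.numAttributes() - 1);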
Attributes are created by invoking one of the constructors in the class Attribute. This class has a constructor that takes one parameter, the attribute's name, and creates a numeric attribute. However, the constructor we use here takes a second argument as well: a FastVector of nominal values, or a null reference to create a string attribute.