Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




To evaluate the same classifier on a new batch of test data, you load it back using -l instead of rebuilding it. If the classifier can be updated incrementally, you can provide both a training file and an input file, and Weka will load the classifier and update it with the given training instances.
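
For example, a previously saved J4.8 model might be reloaded and evaluated on a fresh test file along the following lines; the file names j48.model and new-batch.arff are placeholders, and the exact class name and option list should be checked against your Weka installation:

    java weka.classifiers.trees.J48 -l j48.model -T new-batch.arff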

If you wish only to assess the performance of a learning scheme, use -o to suppress output of the model. Use -i to see the performance measures of precision, recall, and F-measure (Section 5.7). Use -k to compute information-theoretical measures from the probabilities derived by a learning scheme (Section 5.6).
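
As an illustration, a command along these lines suppresses output of the model itself and reports the precision/recall and information-theoretic statistics instead; weather.arff stands in for your own training file:

    java weka.classifiers.trees.J48 -t weather.arff -o -i -k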

Weka users often want to know which class values the learning scheme actually predicts for each test instance. The -p option prints each test instance's number, its class, the confidence of the scheme's prediction, and the predicted class value. It also outputs attribute values for each instance and must be followed by a specification of the range (e.g., 1-2); use 0 if you don't want any attribute values. You can also output the cumulative margin distribution for the training data, which shows how the distribution of the margin measure (Section 7.5, page 324) changes with the number of boosting iterations. Finally, you can output the classifier's source representation, and a graphical representation if the classifier can produce one.
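
For instance, a command of the following form prints the prediction for each test instance without listing any attribute values; the file names are placeholders:

    java weka.classifiers.trees.J48 -t training.arff -T test.arff -p 0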



Scheme-specific options

Table 13.2 shows the options specific to J4.8. You can force the algorithm to use the unpruned tree instead of the pruned one. You can suppress subtree raising, which increases efficiency. You can set the confidence threshold for pruning and the minimum number of instances permissible at any leaf; both parameters were described in Section 6.1 (page 199).



Table 13.2   Scheme-specific options for the J4.8 decision tree learner.

Option   Function
-U       Use unpruned tree
-C       Specify confidence threshold for pruning
-M       Specify minimum number of instances in any leaf
-R       Use reduced-error pruning
-N       Specify number of folds for reduced-error pruning; use one fold as pruning set
-B       Use binary splits only
-S       Don't perform subtree raising
-L       Retain instance information
-A       Smooth the probability estimates using Laplace smoothing
-Q       Seed for shuffling data





As well as C4.5's standard pruning procedure, reduced-error pruning (Section 6.2, pages 202–203) can be performed. The -N option governs the size of the holdout set: the dataset is divided equally into that number of parts and the last is held out (default value 3). You can smooth the probability estimates using the Laplace technique, set the random number seed for shuffling the data when selecting a pruning set, and store the instance information for future visualization. Finally, to build a binary tree instead of one with multiway branches for nominal attributes, use -B.
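
Putting a few of these together, a J4.8 run that uses reduced-error pruning with five folds and requires at least five instances per leaf might look as follows; weather.arff is again a placeholder for the training file:

    java weka.classifiers.trees.J48 -t weather.arff -R -N 5 -M 5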





Chapter 14

Embedded Machine Learning

When invoking learning schemes from the graphical user interfaces or the command line, there is no need to know anything about programming in Java. In this section we show how to access these algorithms from your own code. In doing so, the advantages of using an object-oriented programming language will become clear. From now on, we assume that you have at least some rudimentary knowledge of Java. In most practical applications of data mining the learning component is an integrated part of a far larger software environment. If the environment is written in Java, you can use Weka to solve the learning problem without writing any machine learning code yourself.
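
As a first taste, the following minimal sketch loads an ARFF file, builds a J4.8 tree, and prints the resulting model. It uses class names from the Weka 3.4 API that this edition describes; the file name, option settings, and class name BuildTree are illustrative only:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class BuildTree {
      public static void main(String[] args) throws Exception {
        // Load a dataset from an ARFF file (the file name is just an example).
        Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // treat the last attribute as the class

        J48 tree = new J48();                           // the J4.8 decision tree learner
        tree.setOptions(new String[] {"-M", "5"});      // scheme-specific options, as on the command line
        tree.buildClassifier(data);                     // learn the tree from the data

        System.out.println(tree);                       // print the model, much as the CLI would
      }
    }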

14.1 A simple data mining application

We present a simple data mining application for learning a model that classifies text files into two categories: hit and miss. The application works for arbitrary documents: we refer to them as messages. The implementation uses the StringToWordVector filter mentioned in Section 10.3 (page 399) to convert the messages into attribute vectors in the manner described in Section 7.3. We assume that the program is called every time a new file is to be processed. If the Weka user provides a class label for the file, the system uses it for training; if not, it classifies it. The decision tree classifier J48 is used to do the work.
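
The filtering step itself can be sketched as follows. This is not the book's code, just an illustration of how StringToWordVector is typically applied to an Instances object whose string attribute holds the message text; the class name VectorizeMessages is invented for the example:

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class VectorizeMessages {
      // Converts string-valued message instances into word-based attribute vectors.
      public static Instances toWordVectors(Instances data) throws Exception {
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(data);               // determine the output format (the word dictionary)
        return Filter.useFilter(data, filter);     // one attribute per word occurring in the data
      }
    }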

14.2 Going through the code

Figure 14.1 shows the source code for the application program, implemented in a class called MessageClassifier. The command-line arguments that the main() method accepts are the name of a text file (given by -m), the name of a file holding an object of class MessageClassifier (-t), and, optionally, the classification of the message in the file (-c). If the user provides a classification, the message will be converted into an example for training; if not, the MessageClassifier object will be used to classify it as hit or miss.
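
A typical invocation might therefore look like this, where message.txt, messages.model, and the class label hit are all placeholders:

    java MessageClassifier -m message.txt -t messages.model -c hit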

main()

The main() method reads the message into a Java StringBuffer and checks whether the user has provided a classification for it. Then it reads a MessageClassifier object from the file given by -t and creates a new object of class MessageClassifier if this file does not exist. In either case the resulting object is called messageCl. After checking for illegal command-line options, the program calls the method updateData() to update the training data stored in messageCl if a classification has been provided; otherwise, it calls classifyMessage() to classify it. Finally, the messageCl object is saved back into the file, because it may have changed. In the following sections, we first describe how a new MessageClassifier object is created by the constructor MessageClassifier() and then explain how the two methods updateData() and classifyMessage() work.
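
Figure 14.1 itself is not reproduced here, so the following is only a rough sketch of the control flow just described; the argument handling is simplified and the serialization calls follow standard Java rather than the book's exact listing:

    import java.io.*;

    public class MessageClassifierSketch implements Serializable {

      public static void main(String[] args) throws Exception {
        String messageFile = args[0];                             // value of -m in the real program
        String modelFile   = args[1];                             // value of -t
        String classValue  = (args.length > 2) ? args[2] : null;  // optional value of -c

        // Read the message into a StringBuffer.
        StringBuffer message = new StringBuffer();
        BufferedReader in = new BufferedReader(new FileReader(messageFile));
        int ch;
        while ((ch = in.read()) != -1) {
          message.append((char) ch);
        }
        in.close();

        // Load an existing object from the model file, or create a new one.
        MessageClassifierSketch messageCl;
        File f = new File(modelFile);
        if (f.exists()) {
          ObjectInputStream ois = new ObjectInputStream(new FileInputStream(f));
          messageCl = (MessageClassifierSketch) ois.readObject();
          ois.close();
        } else {
          messageCl = new MessageClassifierSketch();
        }

        // Train if a class label was supplied; otherwise classify the message.
        if (classValue != null) {
          messageCl.updateData(message.toString(), classValue);
        } else {
          messageCl.classifyMessage(message.toString());
        }

        // Save the (possibly updated) object back, because it may have changed.
        ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(f));
        oos.writeObject(messageCl);
        oos.close();
      }

      // Placeholders for the methods discussed in the following sections.
      void updateData(String message, String classValue) { /* builds training data */ }
      void classifyMessage(String message) { /* classifies as hit or miss */ }
    }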



MessageClassifier()

Each time a new MessageClassifier is created, objects for holding the filter and classifier are generated automatically. The only nontrivial part of the process is creating a dataset, which is done by the constructor MessageClassifier(). First the dataset's name is stored as a string. Then an Attribute object is created for each of the dataset's two attributes: one to hold the string corresponding to a text message and the other to hold its class. These objects are stored in a dynamic array of type FastVector. (FastVector is Weka's own implementation of the standard Java Vector class and is used throughout Weka for historical reasons.)
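
A minimal sketch of this dataset construction in the old Weka API looks roughly like the following; the attribute names, dataset name, and class DatasetSetupSketch are illustrative and need not match the book's listing exactly:

    import weka.core.Attribute;
    import weka.core.FastVector;
    import weka.core.Instances;

    public class DatasetSetupSketch {
      public static Instances emptyMessageDataset() {
        // A string attribute to hold the message text; passing a null FastVector
        // of values selects the string-attribute constructor.
        Attribute messageAtt = new Attribute("Message", (FastVector) null);

        // A nominal class attribute with the two values "miss" and "hit".
        FastVector classValues = new FastVector(2);
        classValues.addElement("miss");
        classValues.addElement("hit");
        Attribute classAtt = new Attribute("Class", classValues);

        // Store both attributes in a FastVector and create an empty dataset.
        FastVector attributes = new FastVector(2);
        attributes.addElement(messageAtt);
        attributes.addElement(classAtt);
        Instances data = new Instances("MessageClassificationProblem", attributes, 100);
        data.setClassIndex(data.numAttributes() - 1);   // the class is the last attribute
        return data;
      }
    }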

Attributes are created by invoking one of the constructors in the class Attribute. This class has a constructor that takes one parameter—the attribute's name—and creates a numeric attribute. However, the constructor we use here



