To evaluate the same classifier on a new batch of test data, you load it back using -l instead of rebuilding it. If the classifier can be updated incrementally, you can provide both a training file and an input file, and Weka will load the classifier and update it with the given training instances.
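For example, assuming a model was previously saved to the file weather.model with the -d option, a command along these lines (the file names are placeholders) evaluates it on a fresh test file:

    java weka.classifiers.trees.J48 -l weather.model -T new_weather.arff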
If you wish only to assess the performance of a learning scheme, use -o to suppress output of the model. Use -i to see the performance measures of precision, recall, and F-measure (Section 5.7). Use -k to compute information-theoretical measures from the probabilities derived by a learning scheme (Section 5.6).
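For example, a command of this form (the file names are placeholders) trains on one file, evaluates on another, and reports these measures while suppressing the model itself:

    java weka.classifiers.trees.J48 -t training.arff -T test.arff -o -i -k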
Weka users often want to know which class values the learning scheme actually predicts for each test instance. The -p option prints each test instance's number, its class, the confidence of the scheme's prediction, and the predicted class value. It also outputs attribute values for each instance and must be followed by a specification of the range (e.g., 1-2); use 0 if you don't want any attribute values. You can also output the cumulative margin distribution for the training data, which shows how the distribution of the margin measure (Section 7.5, page 324) changes with the number of boosting iterations. Finally, you can output the classifier's source representation, and a graphical representation if the classifier can produce one.
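For example, the following command (file names are placeholders) prints a prediction line for every test instance; the 0 indicates that no attribute values should be echoed:

    java weka.classifiers.trees.J48 -t training.arff -T test.arff -p 0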
Scheme-specific options
Table 13.2 shows the options specific to J4.8. You can force the algorithm to use the unpruned tree instead of the pruned one. You can suppress subtree raising, which increases efficiency. You can set the confidence threshold for pruning and the minimum number of instances permissible at any leaf; both parameters were described in Section 6.1 (page 199). As well as C4.5's standard pruning procedure, reduced-error pruning (Section 6.2, pages 202-203) can be performed. The -N option governs the size of the holdout set: the dataset is divided equally into that number of parts and the last is held out (default value 3). You can smooth the probability estimates using the Laplace technique, set the random number seed for shuffling the data when selecting a pruning set, and store the instance information for future visualization. Finally, to build a binary tree instead of one with multiway branches for nominal attributes, use -B.

Table 13.2  Scheme-specific options for the J4.8 decision tree learner.

    Option   Function
    -U       Use unpruned tree
    -C       Specify confidence threshold for pruning
    -M       Specify minimum number of instances in any leaf
    -R       Use reduced-error pruning
    -N       Specify number of folds for reduced-error pruning; use one fold as pruning set
    -B       Use binary splits only
    -S       Don't perform subtree raising
    -L       Retain instance information
    -A       Smooth the probability estimates using Laplace smoothing
    -Q       Seed for shuffling data
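For example, a command of this form (the training file name is a placeholder) grows a J4.8 tree with reduced-error pruning over 5 folds and at least 10 instances per leaf:

    java weka.classifiers.trees.J48 -t training.arff -R -N 5 -M 10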
Chapter 14
Embedded Machine Learning
When invoking learning schemes from the graphical user interfaces or the command line, there is no need to know anything about programming in Java. In this section we show how to access these algorithms from your own code. In doing so, the advantages of using an object-oriented programming language will become clear. From now on, we assume that you have at least some rudimentary knowledge of Java. In most practical applications of data mining the learning component is an integrated part of a far larger software environment. If the environment is written in Java, you can use Weka to solve the learning problem without writing any machine learning code yourself.
14.1 A simple data mining application
We present a simple data mining application for learning a model that classifies text files into two categories, hit and miss. The application works for arbitrary documents: we refer to them as messages. The implementation uses the
StringToWordVector filter mentioned in Section 10.3 (page 399) to convert the messages into attribute vectors in the manner described in Section 7.3. We assume that the program is called every time a new file is to be processed. If the Weka user provides a class label for the file, the system uses the file for training; if not, it classifies the file. The decision tree classifier J48 is used to do the work.
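As a sketch of what the filter step involves, converting a dataset whose string attribute holds the messages into a dataset of word-based attributes takes only a few calls (the variable names here are illustrative):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(data);   // determine the dictionary from the data
    Instances wordVectors = Filter.useFilter(data, filter);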
14.2 Going through the code
Figure 14.1 shows the source code for the application program, implemented in a class called MessageClassifier. The command-line arguments that the main() method accepts are the name of a text file (given by -m), the name of a file holding an object of class MessageClassifier (-t), and, optionally, the classification of the message in the file (-c). If the user provides a classification, the message will be converted into an example for training; if not, the MessageClassifier object will be used to classify it as hit or miss.
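For example, an invocation along these lines (the file names are placeholders, and weka.jar must be on the classpath) adds the given message to the training data as a hit:

    java MessageClassifier -m message.txt -t messagecl.model -c hit

Omitting the -c option would instead classify message.txt using the saved model.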
main()
The main() method reads the message into a Java StringBuffer and checks whether the user has provided a classification for it. Then it reads a MessageClassifier object from the file given by -t and creates a new object of class MessageClassifier if this file does not exist. In either case the resulting object is called messageCl. After checking for illegal command-line options, the program calls the method updateData() to update the training data stored in messageCl if a classification has been provided; otherwise, it calls classifyMessage() to classify it. Finally, the messageCl object is saved back into the file, because it may have changed. In the following sections, we first describe how a new MessageClassifier object is created by the constructor MessageClassifier() and then explain how the two methods updateData() and classifyMessage() work.
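A minimal sketch of the load-and-save logic just described, assuming the variable objectFileName holds the -t argument and that MessageClassifier implements java.io.Serializable (exception handling omitted):

    MessageClassifier messageCl;
    File objectFile = new File(objectFileName);
    if (objectFile.exists()) {
      // Deserialize the previously saved classifier
      ObjectInputStream in =
        new ObjectInputStream(new FileInputStream(objectFile));
      messageCl = (MessageClassifier) in.readObject();
      in.close();
    } else {
      messageCl = new MessageClassifier();
    }
    // ... updateData() or classifyMessage() is called here ...
    // Save the object back because it may have changed
    ObjectOutputStream out =
      new ObjectOutputStream(new FileOutputStream(objectFile));
    out.writeObject(messageCl);
    out.close();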
MessageClassifier()
Each time a new MessageClassifier is created, objects for holding the filter and
classifier are generated automatically. The only nontrivial part of the process is
creating a dataset, which is done by the constructor MessageClassifier(). First the
dataset’s name is stored as a string. Then an Attribute object is created for each
attribute, one to hold the string corresponding to a text message and the other
for its class. These objects are stored in a dynamic array of type FastVector.
(FastVector is Weka’s own implementation of the standard Java Vector class and
is used throughout Weka for historical reasons.)
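A sketch of the kind of setup the constructor performs (the dataset and attribute names here are illustrative):

    FastVector attributes = new FastVector(2);
    // Passing null as the value list creates a string attribute
    attributes.addElement(new Attribute("Message", (FastVector) null));
    // The class attribute is nominal, with the two values miss and hit
    FastVector classValues = new FastVector(2);
    classValues.addElement("miss");
    classValues.addElement("hit");
    attributes.addElement(new Attribute("Class", classValues));
    // Create an initially empty dataset and mark the last attribute as the class
    Instances data = new Instances("MessageClassificationProblem", attributes, 100);
    data.setClassIndex(data.numAttributes() - 1);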
Attributes are created by invoking one of the constructors in the class Attribute. This class has a constructor that takes one parameter, the attribute's name, and creates a numeric attribute. However, the constructor we use here takes a second argument as well: a FastVector of nominal values, or a null reference to create a string attribute.