Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




14.2  GOING THROUGH THE CODE

      }

      // Check if there are any options left
      Utils.checkForRemainingOptions(options);

      // Process message.
      if (classValue.length() != 0) {
        messageCl.updateData(message.toString(), classValue);
      } else {
        messageCl.classifyMessage(message.toString());
      }

      // Save message classifier object.
      ObjectOutputStream modelOutObjectFile =
        new ObjectOutputStream(new FileOutputStream(modelName));
      modelOutObjectFile.writeObject(messageCl);
      modelOutObjectFile.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }



Figure 14.1 (continued)

takes two parameters: the attribute’s name and a reference to a FastVector. If
this reference is null, as in the first application of this constructor in our
program, Weka creates an attribute of type string. Otherwise, a nominal
attribute is created. In that case it is assumed that the FastVector holds the
attribute values as strings. This is how we create a class attribute with two
values, hit and miss: by passing the attribute’s name (class) and its values,
stored in a FastVector, to Attribute().

To create a dataset from this attribute information, MessageClassifier() must
create an object of the class Instances from the core package. The constructor
of Instances used by MessageClassifier() takes three arguments: the dataset’s
name, a FastVector containing the attributes, and an integer indicating the
dataset’s initial capacity. We set the initial capacity to 100; it is expanded
automatically if more instances are added. After constructing the dataset,
MessageClassifier() sets the index of the class attribute to be the index of
the last attribute.
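The construction steps above can be pictured with a small plain-Java sketch. It does not use Weka: SketchAttribute and SketchDataset are hypothetical stand-ins for Attribute and Instances, and the names "message" and "messages" are assumptions made for illustration. It only mirrors the conventions described in the text: a null value list produces a string attribute, a non-null list produces a nominal one, and the class index is set to the last attribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for the dataset setup described above; these are not Weka's classes. */
final class DatasetSetupSketch {

    /** Hypothetical attribute: a null value list yields a string attribute, otherwise a nominal one. */
    static final class SketchAttribute {
        final String name;
        final boolean isString;
        final List<String> values; // nominal: declared values; string: values added later

        SketchAttribute(String name, List<String> nominalValues) {
            this.name = name;
            this.isString = (nominalValues == null);
            if (isString) {
                this.values = new ArrayList<String>();
            } else {
                this.values = new ArrayList<String>(nominalValues);
            }
        }
    }

    /** Hypothetical dataset header: a name, the attributes, room for rows, and a class index. */
    static final class SketchDataset {
        final String name;
        final List<SketchAttribute> attributes;
        final List<double[]> rows;
        int classIndex = -1; // -1 means "not set yet"

        SketchDataset(String name, List<SketchAttribute> attributes, int initialCapacity) {
            this.name = name;
            this.attributes = attributes;
            // The capacity is only a starting size; the list grows when more rows are added.
            this.rows = new ArrayList<double[]>(initialCapacity);
        }
    }

    /** Builds the empty two-attribute dataset: one string attribute and one nominal class. */
    static SketchDataset buildMessageDataset() {
        List<SketchAttribute> attributes = new ArrayList<SketchAttribute>();
        attributes.add(new SketchAttribute("message", null));                       // string
        attributes.add(new SketchAttribute("class", Arrays.asList("hit", "miss"))); // nominal
        SketchDataset data = new SketchDataset("messages", attributes, 100);
        data.classIndex = attributes.size() - 1; // the class is the last attribute
        return data;
    }

    public static void main(String[] args) {
        SketchDataset data = buildMessageDataset();
        for (SketchAttribute a : data.attributes) {
            System.out.println(a.name + " -> " + (a.isString ? "string" : "nominal " + a.values));
        }
        System.out.println("class index = " + data.classIndex);
    }
}
```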




updateData()

Now that you know how to create an empty dataset, consider how the
MessageClassifier object actually incorporates a new training message. The
method updateData() does this job. It first converts the given message into a
training instance by calling makeInstance(), which begins by creating an object
of class Instance that corresponds to an instance with two attributes. The
constructor of the Instance object sets all the instance’s values to be missing
and its weight to 1. The next step in makeInstance() is to set the value of the
string attribute holding the text of the message. This is done by applying the
setValue() method of the Instance object, providing it with the attribute whose
value needs to be changed, and a second parameter that corresponds to the new
value’s index in the definition of the string attribute. This index is returned
by the addStringValue() method, which adds the message text as a new value to
the string attribute and returns the position of this new value in the
definition of the string attribute.

Internally, an Instance stores all attribute values as double-precision
floating-point numbers regardless of the type of the corresponding attribute.
In the case of nominal and string attributes this is done by storing the index
of the corresponding attribute value in the definition of the attribute. For
example, the first value of a nominal attribute is represented by 0.0, the
second by 1.0, and so on. The same method is used for string attributes:
addStringValue() returns the index corresponding to the value that is added to
the definition of the attribute.
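The index-as-double encoding can be illustrated without Weka. In the following sketch, StringAttributeSketch and encode() are hypothetical names, and the simplified addStringValue() always appends (it makes no attempt to reuse the index of a string it has already seen, which a real implementation might do). The point is only that both the message text and the class label end up as doubles that index into a list of values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of storing string and nominal values as double indices; not Weka code. */
final class ValueEncodingSketch {

    /** Hypothetical string attribute whose definition is a growing list of values. */
    static final class StringAttributeSketch {
        private final List<String> values = new ArrayList<String>();

        /** Appends the text and returns its position in the attribute's definition. */
        int addStringValue(String text) {
            values.add(text);
            return values.size() - 1;
        }

        /** Looks a stored string up again by its index. */
        String value(int index) {
            return values.get(index);
        }
    }

    /** Encodes one (message, class) pair as two doubles: string index and nominal index. */
    static double[] encode(StringAttributeSketch messageAttr, List<String> classValues,
                           String message, String classValue) {
        int classIndex = classValues.indexOf(classValue);
        if (classIndex < 0) {
            throw new IllegalArgumentException("Unknown class value: " + classValue);
        }
        double[] row = new double[2];
        row[0] = messageAttr.addStringValue(message); // index of the new string value
        row[1] = classIndex;                          // first nominal value is 0.0, second 1.0
        return row;
    }

    public static void main(String[] args) {
        StringAttributeSketch messageAttr = new StringAttributeSketch();
        List<String> classValues = Arrays.asList("hit", "miss");
        double[] row = encode(messageAttr, classValues, "an example message", "miss");
        System.out.println(Arrays.toString(row));
        System.out.println(messageAttr.value((int) row[0]) + " / " + classValues.get((int) row[1]));
    }
}
```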

Once the value for the string attribute has been set, makeInstance() gives the
newly created instance access to the data’s attribute information by passing it
a reference to the dataset. In Weka, an Instance object does not store the type
of each attribute explicitly; instead, it stores a reference to a dataset with
the corresponding attribute information.

Returning to updateData(), once the new instance has been returned from
makeInstance(), its class value is set and it is added to the training data. We
also set m_UpToDate to false. This flag records whether the predictive model is
up to date, and it no longer is once the training data has changed.

classifyMessage()

Now let’s examine how MessageClassifier processes a message whose class label
is unknown. The classifyMessage() method first checks whether a classifier has
been built by determining whether any training instances are available. It then
checks whether the classifier is up to date. If not (because the training data
has changed) it must be rebuilt. However, before doing so the data must be
converted into a format appropriate for learning using the StringToWordVector
filter. First, we tell the filter the format of the input data by passing it a
reference to the input dataset using setInputFormat(). Every time this method
is called, the filter is initialized; that is, all its internal settings are
reset. In the next step, the data is transformed by useFilter(). This generic
method from the Filter class applies a filter to a dataset. In this case,
because StringToWordVector has just been initialized, it computes a dictionary
from the training dataset and then uses it to form word vectors. After
returning from useFilter(), all the filter’s internal settings are fixed until
it is initialized by another call of setInputFormat(). This makes it possible
to filter a test instance without updating the filter’s internal settings (in
this case, the dictionary).
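The reset, build-once, then reuse behavior described above can be sketched in plain Java. This is not StringToWordVector: WordVectorFilterSketch and its methods are hypothetical, the tokenizer is a crude split on non-letters, and binary word presence is an assumption of the sketch. It shows only why a test message can be converted without changing the dictionary learned from the training batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of a filter whose dictionary is fixed after the first batch. */
final class WordVectorFilterSketch {
    private final List<String> dictionary = new ArrayList<String>();
    private boolean dictionaryFixed = false;

    /** Forgets all internal settings, as re-initializing the filter would. */
    void reset() {
        dictionary.clear();
        dictionaryFixed = false;
    }

    /** First batch after a reset builds the dictionary; every batch is then converted with it. */
    List<int[]> filterBatch(List<String> messages) {
        if (!dictionaryFixed) {
            for (String message : messages) {
                for (String word : tokenize(message)) {
                    if (!dictionary.contains(word)) {
                        dictionary.add(word);
                    }
                }
            }
            dictionaryFixed = true;
        }
        List<int[]> vectors = new ArrayList<int[]>();
        for (String message : messages) {
            vectors.add(toVector(message));
        }
        return vectors;
    }

    /** Converts one message with the current dictionary; words not in it are ignored. */
    int[] toVector(String message) {
        int[] vector = new int[dictionary.size()];
        for (String word : tokenize(message)) {
            int index = dictionary.indexOf(word);
            if (index >= 0) {
                vector[index] = 1; // binary presence, an assumption of this sketch
            }
        }
        return vector;
    }

    int dictionarySize() {
        return dictionary.size();
    }

    /** Lowercases the text and splits it on anything that is not a letter a-z. */
    private static List<String> tokenize(String text) {
        List<String> words = new ArrayList<String>();
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        WordVectorFilterSketch filter = new WordVectorFilterSketch();
        filter.filterBatch(Arrays.asList("cheap pills", "meeting today"));
        System.out.println(Arrays.toString(filter.toVector("cheap meeting tomorrow")));
        System.out.println("dictionary size = " + filter.dictionarySize());
    }
}
```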

Once the data has been filtered, the program rebuilds the classifier (in our
case a J4.8 decision tree) by passing the training data to its
buildClassifier() method and sets m_UpToDate to true. It is an important
convention in Weka that the buildClassifier() method completely initializes the
model’s internal settings before generating a new classifier. Hence we do not
need to construct a new J48 object before we call buildClassifier().
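The bookkeeping around m_UpToDate is an instance of a lazy-rebuild pattern: adding training data marks the model stale, and the model is rebuilt only when a prediction is next requested. The sketch below illustrates the pattern with a deliberately trivial stand-in model (the majority label), not a decision tree. LazyRebuildSketch, its fields, and the exception message are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of rebuilding a model lazily, guarded by an up-to-date flag; not Weka code. */
final class LazyRebuildSketch {
    private final List<String> trainingLabels = new ArrayList<String>();
    private String majorityLabel;     // trivial stand-in for a learned model
    private boolean upToDate = false; // plays the role described for m_UpToDate
    int rebuildCount = 0;             // exposed only so the effect can be observed

    /** Adds a training example and marks the model as stale. */
    void updateData(String label) {
        trainingLabels.add(label);
        upToDate = false;
    }

    /** Rebuilds the model only if the training data changed since the last build. */
    String classify() {
        if (trainingLabels.isEmpty()) {
            throw new IllegalStateException("No training data, so no model can be built.");
        }
        if (!upToDate) {
            rebuild();
            upToDate = true;
        }
        return majorityLabel;
    }

    /** Recomputes the model from scratch; no state from the previous model is reused. */
    private void rebuild() {
        rebuildCount++;
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String label : trainingLabels) {
            Integer count = counts.get(label);
            counts.put(label, count == null ? 1 : count + 1);
        }
        int best = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > best) { // ties go to the label seen first
                best = entry.getValue();
                majorityLabel = entry.getKey();
            }
        }
    }

    public static void main(String[] args) {
        LazyRebuildSketch sketch = new LazyRebuildSketch();
        sketch.updateData("hit");
        sketch.updateData("miss");
        sketch.updateData("miss");
        System.out.println(sketch.classify());
        System.out.println(sketch.classify());
        System.out.println("rebuilds = " + sketch.rebuildCount);
    }
}
```

Calling classify() twice in a row triggers a single rebuild; a further updateData() call forces another rebuild on the next classification.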

Having ensured that the model stored in m_Classifier is current, we proceed to
classify the message. Before makeInstance() is called to create an Instance
object from it, a new Instances object is created to hold the new instance and
passed as an argument to makeInstance(). This is done so that makeInstance()
does not add the text of the message to the definition of the string attribute
in m_Data. Otherwise, the size of the m_Data object would grow every time a new
message was classified, which is clearly not desirable: it should grow only
when training instances are added. Hence a temporary Instances object is
created and discarded once the instance has been processed. This object is
obtained using the method stringFreeStructure(), which returns a copy of m_Data
with an empty string attribute. Only then is makeInstance() called to create
the new instance.
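The reason for the temporary copy can be shown with a small plain-Java sketch. DataSketch is a hypothetical stand-in that keeps only the string attribute's value list, and its stringFreeStructure() method is a simplified imitation of the behavior described above, not Weka's implementation. Adding a message through the copy leaves the training data's value list unchanged.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of classifying through a temporary, string-free copy of the dataset; not Weka code. */
final class StringFreeCopySketch {

    /** Hypothetical dataset that holds only the values of its string attribute. */
    static final class DataSketch {
        final List<String> messageValues = new ArrayList<String>();

        /** Returns a new dataset with the same (trivial) structure and no string values. */
        DataSketch stringFreeStructure() {
            return new DataSketch();
        }
    }

    /** Adds the text to the given dataset's string attribute and returns its index as a double. */
    static double makeInstance(String text, DataSketch data) {
        data.messageValues.add(text);
        return data.messageValues.size() - 1;
    }

    public static void main(String[] args) {
        DataSketch training = new DataSketch();
        makeInstance("first training message", training);
        makeInstance("second training message", training);

        // Classification goes through a throwaway copy, so training is not touched.
        DataSketch temporary = training.stringFreeStructure();
        double index = makeInstance("message to classify", temporary);

        System.out.println("index in temporary copy = " + index);
        System.out.println("training values = " + training.messageValues.size());
        System.out.println("temporary values = " + temporary.messageValues.size());
    }
}
```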

The test instance must also be processed by the StringToWordVector filter
before being classified. This is easy: the input() method enters the instance
into the filter object, and the transformed instance is obtained by calling
output(). Then a prediction is produced by passing the instance to the
classifier’s classifyInstance() method. As you can see, the prediction is coded
as a double value. This allows Weka’s evaluation module to treat models for
categorical and numeric prediction similarly. In the case of categorical
prediction, as in this example, the double variable holds the index of the
predicted class value. To output the string corresponding to this class value,
the program calls the value() method of the dataset’s class attribute.
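To make the double-valued prediction concrete, here is a minimal plain-Java sketch. ModelSketch and toLabel() are hypothetical names, and the two lambda "models" are arbitrary placeholders rather than trained classifiers. It shows one interface serving both categorical and numeric prediction, with the categorical result decoded by using the double as an index into the class values.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of predictions coded as doubles for both kinds of model; not Weka code. */
final class PredictionAsDoubleSketch {

    /** Hypothetical model interface: every prediction is returned as a double. */
    interface ModelSketch {
        double classifyInstance(double[] instance);
    }

    /** For a categorical class, the double is an index into the list of class values. */
    static String toLabel(double prediction, List<String> classValues) {
        return classValues.get((int) prediction);
    }

    public static void main(String[] args) {
        List<String> classValues = Arrays.asList("hit", "miss");

        // Placeholder models: one returns a class index, the other a numeric value.
        ModelSketch categorical = instance -> instance[0] > 0 ? 0.0 : 1.0;
        ModelSketch numeric = instance -> 2.5 * instance[0];

        double[] instance = {1.0};
        System.out.println(toLabel(categorical.classifyInstance(instance), classValues));
        System.out.println(numeric.classifyInstance(instance));
    }
}
```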

There is at least one way in which our implementation could be improved. The
classifier and the StringToWordVector filter could be combined using the
FilteredClassifier metalearner described in Section 10.3 (page 401). This
classifier would then be able to deal with string attributes directly, without
explicitly calling the filter to transform the data. We didn’t do this because
we wanted to demonstrate how filters can be used programmatically.
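The idea behind such a combination is ordinary composition: a wrapper owns both the filter and the underlying model and applies the filter itself before delegating. The following plain-Java sketch shows that design only. FilteredModel and VectorModel are hypothetical names, the filter and model passed in main() are toy placeholders, and nothing here reflects how FilteredClassifier is actually implemented.

```java
import java.util.function.Function;

/** Sketch of wrapping a filter and a model behind one classify call; not Weka code. */
final class FilteredModelSketch {

    /** Hypothetical model that only understands word vectors. */
    interface VectorModel {
        double classify(int[] vector);
    }

    /** Wrapper that filters raw text itself, so callers never invoke the filter directly. */
    static final class FilteredModel {
        private final Function<String, int[]> filter;
        private final VectorModel model;

        FilteredModel(Function<String, int[]> filter, VectorModel model) {
            this.filter = filter;
            this.model = model;
        }

        /** Transforms the raw message and passes the result to the wrapped model. */
        double classify(String rawMessage) {
            return model.classify(filter.apply(rawMessage));
        }
    }

    public static void main(String[] args) {
        // Toy filter: a one-element vector flagging whether the word "cheap" occurs.
        Function<String, int[]> filter = text -> new int[] {text.contains("cheap") ? 1 : 0};
        // Toy model: returns class index 1.0 when the flag is set, 0.0 otherwise.
        VectorModel model = vector -> vector[0] == 1 ? 1.0 : 0.0;

        FilteredModel combined = new FilteredModel(filter, model);
        System.out.println(combined.classify("cheap pills"));
        System.out.println(combined.classify("meeting today"));
    }
}
```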
