Data Mining: Practical Machine Learning Tools and Techniques, Second Edition






    // Count the instances belonging to each class.
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      classCounts[(int) inst.classValue()]++;
    }

    // Accumulate -sum_j n_j * log2(n_j); dividing by the number of
    // instances and adding log2(N) below turns this into the class
    // entropy -sum_j p_j * log2(p_j).
    double entropy = 0;
    for (int j = 0; j < data.numClasses(); j++) {
      if (classCounts[j] > 0) {
        entropy -= classCounts[j] * Utils.log2(classCounts[j]);
      }
    }
    entropy /= (double) data.numInstances();
    return entropy + Utils.log2(data.numInstances());
  }


  /**
   * Splits a dataset according to the values of a nominal attribute.
   *
   * @param data the data which is to be split
   * @param att the attribute to be used for splitting
   * @return the sets of instances produced by the split
   */
  private Instances[] splitData(Instances data, Attribute att) {

    Instances[] splitData = new Instances[att.numValues()];
    for (int j = 0; j < att.numValues(); j++) {
      splitData[j] = new Instances(data, data.numInstances());
    }
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      splitData[(int) inst.value(att)].add(inst);
    }
    for (int i = 0; i < splitData.length; i++) {
      splitData[i].compactify();
    }
    return splitData;
  }


  /**
   * Outputs a tree at a certain level.
   *
   * @param level the level at which the tree is to be printed
   */


  private String toString(int level) {

    StringBuffer text = new StringBuffer();
    if (m_Attribute == null) {

      // Leaf node: print the class label (or "null" for an empty leaf).
      if (Instance.isMissingValue(m_ClassValue)) {
        text.append(": null");
      } else {
        text.append(": " + m_ClassAttribute.value((int) m_ClassValue));
      }
    } else {

      // Internal node: print one indented branch per attribute value,
      // followed by the corresponding subtree.
      for (int j = 0; j < m_Attribute.numValues(); j++) {
        text.append("\n");
        for (int i = 0; i < level; i++) {
          text.append("|  ");
        }
        text.append(m_Attribute.name() + " = " + m_Attribute.value(j));
        text.append(m_Successors[j].toString(level + 1));
      }
    }
    return text.toString();
  }


  /**
   * Main method.
   *
   * @param args the options for the classifier
   */
  public static void main(String[] args) {

    try {
      System.out.println(Evaluation.evaluateModel(new Id3(), args));
    } catch (Exception e) {
      System.err.println(e.getMessage());
    }
  }




Figure 15.1 (continued)





The index of the attribute with the greatest information gain is passed to the attribute() method from weka.core.Instances, which returns the corresponding attribute.

You might wonder what happens to the array element that corresponds to the class attribute. We need not worry about this because Java automatically initializes all elements in an array of numbers to zero, and the information gain is always greater than or equal to zero. If the maximum information gain is zero, makeTree() creates a leaf. In that case m_Attribute is set to null, and makeTree() computes both the distribution of class probabilities and the class with the greatest probability. (The normalize() method from weka.core.Utils normalizes an array of doubles so that it sums to one.)

When it makes a leaf with a class value assigned to it, makeTree() stores the

class attribute in m_ClassAttribute. This is because the method that outputs the

decision tree needs to access this to print the class label.
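The leaf-creating branch just described can be sketched as follows. This is a reconstruction from the description, not the verbatim code that appears earlier in Figure 15.1; in particular, the local array infoGains and the field m_Distribution are assumed names.

    // Leaf case in makeTree(): no attribute yields any information
    // gain, so turn this node into a leaf.
    if (Utils.eq(infoGains[Utils.maxIndex(infoGains)], 0)) {
      m_Attribute = null;                               // marks the node as a leaf
      m_Distribution = new double[data.numClasses()];   // class counts, then probabilities
      Enumeration instEnum = data.enumerateInstances();
      while (instEnum.hasMoreElements()) {
        Instance inst = (Instance) instEnum.nextElement();
        m_Distribution[(int) inst.classValue()]++;
      }
      Utils.normalize(m_Distribution);                  // make the counts sum to one
      m_ClassValue = Utils.maxIndex(m_Distribution);    // class with greatest probability
      m_ClassAttribute = data.classAttribute();         // kept for printing the class label
    }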

If an attribute with nonzero information gain is found, makeTree() splits the dataset according to the attribute's values and recursively builds subtrees for each of the new datasets. To make the split it calls the method splitData(). This creates as many empty datasets as there are attribute values, stores them in an array (setting the initial capacity of each dataset to the number of instances in the original dataset), then iterates through all instances in the original dataset and allocates each one to the new dataset that corresponds to its value of the attribute. Finally, it reduces memory requirements by compacting the Instances objects. Returning to makeTree(), the resulting array of datasets is used for building subtrees. The method creates an array of Id3 objects, one for each attribute value, and calls makeTree() on each one, passing it the corresponding dataset.
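The subtree-building branch can be sketched in the same spirit; m_Successors is the field that toString() walks over in Figure 15.1, while the rest follows the description above rather than the original listing.

    // Recursive case in makeTree(): grow one Id3 subtree per value
    // of the selected attribute.
    Instances[] splitData = splitData(data, m_Attribute);
    m_Successors = new Id3[m_Attribute.numValues()];
    for (int j = 0; j < m_Attribute.numValues(); j++) {
      m_Successors[j] = new Id3();
      m_Successors[j].makeTree(splitData[j]);   // recurse on the subset for value j
    }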

computeInfoGain()

Returning to computeInfoGain(), the information gain associated with an attribute and a dataset is calculated using a straightforward implementation of the formula in Section 4.3 (page 102). First, the entropy of the dataset is computed. Then splitData() is used to divide it into subsets, and computeEntropy() is called on each one. Finally, the difference between the former entropy and the weighted sum of the latter ones, which is the information gain, is returned. The method computeEntropy() uses the log2() method from weka.core.Utils to obtain the logarithm (to base 2) of a number.
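Expressed as code, the calculation just described amounts to roughly the following; the actual method appears on an earlier page of Figure 15.1, so treat this as an illustrative reconstruction.

  private double computeInfoGain(Instances data, Attribute att)
    throws Exception {

    // Entropy of the full dataset minus the weighted entropies of
    // the subsets produced by splitting on att.
    double infoGain = computeEntropy(data);
    Instances[] splitData = splitData(data, att);
    for (int j = 0; j < att.numValues(); j++) {
      if (splitData[j].numInstances() > 0) {
        infoGain -= ((double) splitData[j].numInstances() /
                     (double) data.numInstances())
          * computeEntropy(splitData[j]);
      }
    }
    return infoGain;
  }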

classifyInstance()

Having seen how ID3 constructs a decision tree, we now examine how it uses the tree structure to predict class values and probabilities. Every classifier must implement the classifyInstance() method or the distributionForInstance() method (or both). The Classifier superclass contains default implementations of both methods.
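For ID3, prediction is simply a walk down the tree that makeTree() built. A minimal sketch of classifyInstance(), using the fields shown in Figure 15.1 and omitting the check for missing attribute values that a complete implementation would need, might read:

  public double classifyInstance(Instance instance) {

    if (m_Attribute == null) {
      // Leaf: return the stored class value.
      return m_ClassValue;
    } else {
      // Internal node: follow the branch for this instance's value of
      // the split attribute and classify recursively.
      return m_Successors[(int) instance.value(m_Attribute)]
        .classifyInstance(instance);
    }
  }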


