Data Mining
for the Masses
predictions for a scoring data set. This can become a powerful addition to k-means models, giving
us the ability to apply our clusters to other data sets that haven’t yet been classified.
Discriminant analysis can be useful where the classification is known for some observations but
not for others. Some classic applications of discriminant analysis are in the fields of
biology and organizational behavior. In biology, for example, discriminant analysis has been
successfully applied to the classification of plant and animal species based on the traits of those
living things. In organizational behavior, this type of data modeling has been used to help workers
identify potentially successful career paths based on personality traits, preferences and aptitudes.
By coupling known past performance with unknown but similarly structured data, we can use
discriminant analysis to effectively train a model that can then score the unknown records for us,
giving us a picture of what categories the unknown observations would likely be in.
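The train-then-score idea described above can be sketched outside of RapidMiner. The Python below is not RapidMiner's implementation; it is a minimal two-attribute linear discriminant model that assumes all classes share one pooled covariance matrix. The function names fit_lda and score, the attribute values, and the class names A and B are all made up for illustration. The per-class confidences it returns are loosely comparable to confidence attributes in a scored data set.

```python
import math

def fit_lda(rows, labels):
    """Fit a two-attribute linear discriminant model with a pooled covariance."""
    classes = sorted(set(labels))
    n = len(rows)
    means, priors = {}, {}
    for c in classes:
        members = [r for r, y in zip(rows, labels) if y == c]
        means[c] = [sum(col) / len(members) for col in zip(*members)]
        priors[c] = len(members) / n
    # Pooled within-class covariance (2x2), shared by every class.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r, y in zip(rows, labels):
        d = [r[0] - means[y][0], r[1] - means[y][1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j] / (n - len(classes))
    # Closed-form inverse of a 2x2 matrix.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    return {"classes": classes, "means": means, "priors": priors, "inv": inv}

def score(model, row):
    """Return (predicted class, confidence per class) for one unlabeled row."""
    inv = model["inv"]
    deltas = {}
    for c in model["classes"]:
        m = model["means"][c]
        w = [inv[0][0] * m[0] + inv[0][1] * m[1],
             inv[1][0] * m[0] + inv[1][1] * m[1]]
        # Linear discriminant: x.w - 0.5 * m.w + log(prior)
        deltas[c] = (row[0] * w[0] + row[1] * w[1]
                     - 0.5 * (m[0] * w[0] + m[1] * w[1])
                     + math.log(model["priors"][c]))
    # Turn discriminant scores into confidences that sum to 1.
    top = max(deltas.values())
    exp = {c: math.exp(v - top) for c, v in deltas.items()}
    total = sum(exp.values())
    conf = {c: e / total for c, e in exp.items()}
    return max(conf, key=conf.get), conf

# Hypothetical training data: two numeric attributes and a known class.
training_rows = [(1, 2), (2, 1), (2, 3), (3, 2), (6, 7), (7, 6), (7, 8), (8, 7)]
training_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
model = fit_lda(training_rows, training_labels)

# Score observations whose class is unknown.
clear_pred, clear_conf = score(model, (2.5, 2.5))    # sits near the class A mean
border_pred, border_conf = score(model, (4.5, 4.5))  # halfway between the class means
print(clear_pred, clear_conf)
print(border_pred, border_conf)
```

An observation close to one class mean is scored with a confidence near 1 for that class, while an observation midway between the two means gets roughly equal confidences, which is the situation where a predicted category deserves the least trust.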
REVIEW QUESTIONS
1) What type of attribute does a data set need in order to conduct discriminant analysis instead
of k-means clustering?
2) What is a ‘label’ role in RapidMiner and why do you need an attribute with this role in order
to conduct discriminant analysis?
3) What is the difference between a training data set and a scoring data set?
4) What is the purpose of the Apply Model operator in RapidMiner?
5) What are confidence percent attributes used for in RapidMiner? What was the likely reason that
we did not find any in this chapter’s example? Are there attributes about young athletes that you
can think of that were not included in our data sets that might have helped us find some
confidence percents? (Hint: think of things that are fairly specific to only one or two sports.)
6) What would be problematic about including both male and female athletes in this chapter’s
example data?
Chapter 7:
Discriminant Analysis
EXERCISE
For this chapter’s exercise, you will compile your own data set based on people you know and the
cars they drive, and then create a linear discriminant analysis of your data in order to predict
categories for a scoring data set. Complete the following steps:
1) Open a new blank spreadsheet in OpenOffice Calc. At the bottom of the spreadsheet there will
be three default tabs labeled Sheet1, Sheet2, Sheet3. Rename the first one Training and the second
one Scoring. You can rename the tabs by double clicking on their labels. You can delete or ignore
the third default sheet.
2) On the Training sheet, starting in cell A1 and going across, create attribute labels for six
attributes: Age, Gender, Marital_Status, Employment, Housing, and Car_Type.
3) Copy each of these attribute names except Car_Type into the Scoring sheet.
4) On the Training sheet, enter values for each of these attributes for several people that you
know who have a car. These could be family members, friends and neighbors, coworkers or fellow
students, etc. Try to do at least 20 observations; 30 or more would be better. Enter husband and
wife couples as two separate observations, so long as each spouse has a different vehicle. Use the
following to guide your data entry:
a. For Age, you could put the person’s actual age in years, or you could put them in buckets. For
example, you could put 10 for people aged 10-19; 20 for people aged 20-29; etc.
b. For Gender, enter 0 for female and 1 for male.
c. For Marital_Status, use 0 for single, 1 for married, 2 for divorced, and 3 for widowed.
d. For Employment, enter 0 for student, 1 for full-time, 2 for part-time, and 3 for retired.
e. For Housing, use 0 for lives rent-free with someone else, 1 for rents housing, and 2 for owns
housing.
f. For Car_Type, you can record data in a number of ways. This will be your label, or the
attribute you wish to predict. You could record each person’s car by make (e.g. Toyota, Honda,
Ford, etc.), or you could record it by body style (e.g. Car, Truck, SUV, etc.). Be consistent in
assigning classifications, and note that depending on the size of the data set you create, you
won’t want to have too many possible classifications, or your predictions in the scoring data set
will be spread out too much. With small data sets containing only 20-30 observations, the number
of categories should be limited to three or four. You might even consider using Japanese,
American, European as your Car_Type values.
5) Once you’ve compiled your Training data set, switch to the Scoring sheet in OpenOffice Calc.
Repeat the data entry process for at least 20 people (more is better) that you know who do not
have a car. You will use the training set to try to predict the type of car each of these people
would drive if they had one.
6) Use the File > Save As menu option in OpenOffice Calc to save your Training and Scoring sheets
as CSV files.
7) Import your two CSV files into your RapidMiner repository. Be sure to give them descriptive
names.
8) Drag your two data sets into a new process window. If you have prepared your data well in
OpenOffice Calc, you shouldn’t have any missing or inconsistent data to contend with, so data
preparation should be minimal. Rename the two Retrieve operators so you can tell the difference
between your training and scoring data sets.
9) One necessary data preparation step is to add a Set Role operator and define the Car_Type
attribute as your label.
10) Add a Linear Discriminant Analysis operator to your Training stream.
11) Apply your LDA model to your scoring data and run your model. Evaluate and report your
results. Did you get any confidence percentages? Do the predicted Car_Types seem reasonable and
consistent with your training data? Why or why not?
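Before importing the two CSV files into RapidMiner, it can help to confirm that they follow the coding scheme from step 4. The Python sketch below is optional and not part of the RapidMiner workflow. It checks an exported sheet against the codes listed in steps 4b through 4e, confirms that only the training sheet carries Car_Type, and flags a training sheet with more than four categories. The function name check_sheet and the sample rows are hypothetical, and the sketch assumes that each CSV file begins with the header row from steps 2 and 3 and that Age is entered as a whole number.

```python
import csv
import io

# Allowed codes, taken from steps 4b-4e of the exercise.
ALLOWED = {
    "Gender": {"0", "1"},
    "Marital_Status": {"0", "1", "2", "3"},
    "Employment": {"0", "1", "2", "3"},
    "Housing": {"0", "1", "2"},
}
PREDICTORS = ["Age", "Gender", "Marital_Status", "Employment", "Housing"]
LABEL = "Car_Type"

def check_sheet(csv_text, is_training):
    """Return a list of human-readable problems found in one exported sheet."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = PREDICTORS + [LABEL] if is_training else PREDICTORS
    if reader.fieldnames != expected:
        return [f"header is {reader.fieldnames}, expected {expected}"]
    problems = []
    labels = set()
    # The header is line 1, so the first observation is on line 2.
    for line_no, row in enumerate(reader, start=2):
        age = (row["Age"] or "").strip()
        if not age.isdigit():
            problems.append(f"line {line_no}: Age '{age}' is not a whole number")
        for name, codes in ALLOWED.items():
            value = (row[name] or "").strip()
            if value not in codes:
                problems.append(f"line {line_no}: {name} '{value}' is not a valid code")
        if is_training:
            label = (row[LABEL] or "").strip()
            if label:
                labels.add(label)
            else:
                problems.append(f"line {line_no}: {LABEL} is missing")
    if is_training and len(labels) > 4:
        problems.append(
            f"{len(labels)} {LABEL} categories; the exercise suggests three or four")
    return problems

# Hypothetical sample data standing in for the two exported files.
training_csv = (
    "Age,Gender,Marital_Status,Employment,Housing,Car_Type\n"
    "34,1,1,1,2,Truck\n"
    "22,0,0,0,0,Car\n"
)
scoring_csv = (
    "Age,Gender,Marital_Status,Employment,Housing\n"
    "19,0,0,0,0\n"
    "45,1,2,5,1\n"  # Employment code 5 is not defined in step 4d
)

training_problems = check_sheet(training_csv, is_training=True)
scoring_problems = check_sheet(scoring_csv, is_training=False)
print(training_problems)
print(scoring_problems)
```

To check real files, the text of each exported CSV could be read from disk and passed to check_sheet in place of the sample strings.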