Data Mining for the Masses

predictions for a scoring data set.  This can become a powerful addition to k-means models, giving
us the ability to apply our clusters to other data sets that haven’t yet been classified.

Discriminant analysis can be useful where the classification for some observations is known and is
not known for others. Some classic applications of discriminant analysis are in the fields of
biology and organizational behavior. In biology, for example, discriminant analysis has been
successfully applied to the classification of plant and animal species based on the traits of those
living things. In organizational behavior, this type of data modeling has been used to help workers
identify potentially successful career paths based on personality traits, preferences and aptitudes.
By coupling known past performance with unknown but similarly structured data, we can use
discriminant analysis to effectively train a model that can then score the unknown records for us,
giving us a picture of what categories the unknown observations would likely be in.
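
As a rough illustration of that train-then-score idea, the sketch below fits a linear discriminant model in plain Python and applies it to unlabeled records. It is a minimal sketch under stated assumptions, not RapidMiner's implementation: the two attributes (strength and agility), the sport labels, and every number are invented for illustration, and the model uses the textbook formulation with one pooled covariance matrix shared by all classes.

```python
import math

def fit_lda(rows, labels):
    """Estimate class means, class priors, and the pooled covariance matrix."""
    classes = sorted(set(labels))
    n, d = len(rows), len(rows[0])
    means, priors = {}, {}
    for c in classes:
        members = [r for r, y in zip(rows, labels) if y == c]
        means[c] = [sum(col) / len(members) for col in zip(*members)]
        priors[c] = len(members) / n
    # Pooled within-class covariance: LDA assumes every class shares it.
    cov = [[0.0] * d for _ in range(d)]
    for r, y in zip(rows, labels):
        diff = [r[j] - means[y][j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - len(classes))
    return {"classes": classes, "means": means, "priors": priors, "cov": cov}

def solve(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    size = len(m)
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(size):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][size] / m[i][i] for i in range(size)]

def score(model, row):
    """Return the predicted class and a confidence value for every class."""
    scores = {}
    for c in model["classes"]:
        mu = model["means"][c]
        w = solve(model["cov"], mu)  # w = inverse(cov) * mu
        scores[c] = (sum(wi * xi for wi, xi in zip(w, row))
                     - 0.5 * sum(wi * mi for wi, mi in zip(w, mu))
                     + math.log(model["priors"][c]))
    # Turn discriminant scores into confidences that sum to 1.
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    conf = {c: e / total for c, e in exp.items()}
    return max(conf, key=conf.get), conf

# Invented training data: [strength, agility] with a known sport (the label).
training_rows = [[80, 30], [85, 35], [78, 28], [90, 33],
                 [50, 70], [55, 75], [48, 68], [53, 80]]
training_labels = ["Football"] * 4 + ["Soccer"] * 4

# Invented scoring data: same attributes, no label.
scoring_rows = [[82, 32], [52, 72], [66, 54]]

model = fit_lda(training_rows, training_labels)
for row in scoring_rows:
    label, conf = score(model, row)
    print(row, label, round(conf[label], 3))
```

The training rows carry a known category and the scoring rows do not, which mirrors the split between known and unknown observations described above. The per-class confidences returned by `score` play a role loosely similar to a tool's confidence attributes, although how any particular tool computes them is not shown here.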

REVIEW QUESTIONS

1) What type of attribute does a data set need in order to conduct discriminant analysis
   instead of k-means clustering?

2) What is a ‘label’ role in RapidMiner and why do you need an attribute with this role in
   order to conduct discriminant analysis?

3) What is the difference between a training data set and a scoring data set?

4) What is the purpose of the Apply Model operator in RapidMiner?

5) What are confidence percent attributes used for in RapidMiner? What was the likely
   reason that we did not find any in this chapter’s example? Are there attributes about young
   athletes that you can think of that were not included in our data sets that might have
   helped us find some confidence percents? (Hint: think of things that are fairly specific to
   only one or two sports.)

6) What would be problematic about including both male and female athletes in this chapter’s
   example data?

Chapter 7: Discriminant Analysis
EXERCISE

For this chapter’s exercise, you will compile your own data set based on people you know and the
cars  they  drive,  and  then  create  a  linear  discriminant  analysis  of  your  data  in  order  to  predict
categories for a scoring data set. Complete the following steps:

1) Open a new blank spreadsheet in OpenOffice Calc. At the bottom of the spreadsheet
   there will be three default tabs labeled Sheet1, Sheet2, Sheet3. Rename the first one
   Training and the second one Scoring. You can rename the tabs by double clicking on their
   labels. You can delete or ignore the third default sheet.

2) On the Training sheet, starting in cell A1 and going across, create attribute labels for six
   attributes: Age, Gender, Marital_Status, Employment, Housing, and Car_Type.

3) Copy each of these attribute names except Car_Type into the Scoring sheet.

4) On the Training sheet, enter values for each of these attributes for several people that you
   know who have a car. These could be family members, friends and neighbors, coworkers
   or fellow students, etc. Try to do at least 20 observations; 30 or more would be better.
   Enter husband and wife couples as two separate observations, so long as each spouse has a
   different vehicle. Use the following to guide your data entry:

   a. For Age, you could put the person’s actual age in years, or you could put them in
      buckets. For example, you could put 10 for people aged 10-19; 20 for people aged
      20-29; etc.

   b. For Gender, enter 0 for female and 1 for male.

   c. For Marital_Status, use 0 for single, 1 for married, 2 for divorced, and 3 for
      widowed.

   d. For Employment, enter 0 for student, 1 for full-time, 2 for part-time, and 3 for
      retired.

   e. For Housing, use 0 for lives rent-free with someone else, 1 for rents housing, and 2
      for owns housing.

   f. For Car_Type, you can record data in a number of ways. This will be your label, or
      the attribute you wish to predict. You could record each person’s car by make (e.g.
      Toyota, Honda, Ford, etc.), or you could record it by body style (e.g. Car, Truck,
      SUV, etc.). Be consistent in assigning classifications, and note that depending on
      the size of the data set you create, you won’t want to have too many possible
      classifications, or your predictions in the scoring data set will be spread out too
      much. With small data sets containing only 20-30 observations, the number of
      categories should be limited to three or four. You might even consider using
      Japanese, American, European as your Car_Type values.

5) Once you’ve compiled your Training data set, switch to the Scoring sheet in OpenOffice
   Calc. Repeat the data entry process for at least 20 people (more is better) that you know
   who do not have a car. You will use the training set to try to predict the type of car each of
   these people would drive if they had one.

6) Use the File > Save As menu option in OpenOffice Calc to save your Training and Scoring
   sheets as CSV files.

7) Import your two CSV files into your RapidMiner repository. Be sure to give them
   descriptive names.

8) Drag your two data sets into a new process window. If you have prepared your data well
   in OpenOffice Calc, you shouldn’t have any missing or inconsistent data to contend with,
   so data preparation should be minimal. Rename the two Retrieve operators so you can tell
   the difference between your training and scoring data sets.

9) One necessary data preparation step is to add a Set Role operator and define the Car_Type
   attribute as your label.

10) Add a Linear Discriminant Analysis operator to your Training stream.

11) Apply your LDA model to your scoring data and run your model. Evaluate and report
    your results. Did you get any confidence percentages? Do the predicted Car_Types seem
    reasonable and consistent with your training data? Why or why not?
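
The coding scheme from step 4 and the two CSV files from step 6 can also be sketched in code, which may help when checking that the spreadsheet entries are consistent. This is only an illustration, not part of the exercise as written: the helper names (`age_bucket`, `encode`, `to_csv`), the plain-language keys such as "rent-free", and the two sample people are assumptions made up here. The numeric codes and column names are the ones listed in the steps above.

```python
import csv
import io

ATTRIBUTES = ["Age", "Gender", "Marital_Status", "Employment", "Housing"]
LABEL = "Car_Type"

# Numeric codes taken from step 4 of the exercise.
GENDER = {"female": 0, "male": 1}
MARITAL_STATUS = {"single": 0, "married": 1, "divorced": 2, "widowed": 3}
EMPLOYMENT = {"student": 0, "full-time": 1, "part-time": 2, "retired": 3}
HOUSING = {"rent-free": 0, "rents": 1, "owns": 2}

def age_bucket(age):
    """Map an age in years to its decade bucket, e.g. 27 becomes 20."""
    return (age // 10) * 10

def encode(person):
    """Turn one plain-language record into the numeric codes from step 4."""
    row = {
        "Age": age_bucket(person["age"]),
        "Gender": GENDER[person["gender"]],
        "Marital_Status": MARITAL_STATUS[person["marital_status"]],
        "Employment": EMPLOYMENT[person["employment"]],
        "Housing": HOUSING[person["housing"]],
    }
    if "car_type" in person:  # only Training records carry the label
        row[LABEL] = person["car_type"]
    return row

def to_csv(people, columns):
    """Return CSV text with a header row followed by one row per person."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for person in people:
        writer.writerow(encode(person))
    return buffer.getvalue()

# Invented sample records, one per sheet, for illustration only.
training_people = [
    {"age": 34, "gender": "female", "marital_status": "married",
     "employment": "full-time", "housing": "owns", "car_type": "SUV"},
]
scoring_people = [
    {"age": 19, "gender": "male", "marital_status": "single",
     "employment": "student", "housing": "rent-free"},
]

training_csv = to_csv(training_people, ATTRIBUTES + [LABEL])
scoring_csv = to_csv(scoring_people, ATTRIBUTES)  # no Car_Type column
print(training_csv)
print(scoring_csv)
```

Keeping the codes in lookup tables means a mistyped category raises an error instead of silently creating an inconsistent value, which is the kind of problem step 8 assumes you have already avoided.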
