Data Mining for the Masses


5) How might the presence of outliers in the attributes of a data set influence the usefulness
of a k-Means clustering model? What could be done to address the problem?
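
One way to start thinking about this question: k-Means places each centroid at the arithmetic
mean of its cluster's members, so a single extreme value can pull a centroid away from where most
of the observations sit. The short Python sketch below illustrates this with made-up numbers (the
values and variable names are hypothetical and do not come from any data set in this text). It is
only an illustration of the arithmetic, not a prescribed answer.

```python
from statistics import mean, median

# Hypothetical values for one attribute within a single cluster.
typical = [4.0, 5.0, 5.0, 6.0, 5.0]
with_outlier = typical + [100.0]  # the same cluster plus one extreme observation

centroid_typical = mean(typical)        # centroid of the ordinary values
centroid_outlier = mean(with_outlier)   # centroid after the outlier joins
median_outlier = median(with_outlier)   # the median is far less affected

print("mean without outlier:", centroid_typical)
print("mean with outlier:   ", round(centroid_outlier, 2))
print("median with outlier: ", median_outlier)
```

Possible responses to explore include removing or capping extreme values during data preparation,
or comparing results against a clustering approach built on medians rather than means.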

EXERCISE

Think of an example of a problem that could be at least partially addressed by being able to group
observations in a data set into clusters. Some examples might be grouping kids who might be at
risk for delinquency, grouping product sale volumes, grouping workers by productivity and
effectiveness, etc. Search the Internet or other resources available to you for a data set that would
allow you to investigate your question using a k-means model. As with all exercises in this text,
please ensure that you have permission to use any data set that might belong to your employer or
another entity. When you have secured your data set, complete the following steps:

1) Ensure that your data set is saved as a CSV file. Import your data set into your
RapidMiner repository and save it with a meaningful name. Drag it into a new process
window in RapidMiner.

2) Conduct any data preparation that you need for your data set. This may include handling
inconsistent data, dealing with missing values, or changing data types. Remember that in
order to calculate means, each attribute in your data set will need to be numeric. If, for
example, one of your attributes contains the values ‘yes’ and ‘no’, you may need to change
these to be 1 and 0 respectively, in order for the k-Means operator to work.

3) Connect a k-Means operator to your data set, configure your parameters (especially set
your k to something meaningful for your question) and then run your model.

4) Investigate your Centroid Table, Folder View, and the other evaluation tools.

5) Report your findings for your clusters. Discuss what is interesting about them and
describe what iterations of modeling you went through, such as experimentation with
different parameter values, to generate the clusters. Explain how your findings are relevant
to your original question.

Chapter 6: k-Means Clustering

Challenge Step!

6) Experiment with the other k-Means operators in RapidMiner, such as Kernel or Fast.
How are they different from your original model? Did the use of these operators change
your clusters, and if so, how?
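
For readers who want to see what steps 2 and 3 do outside of RapidMiner, the following is a
minimal Python sketch. It recodes a ‘yes’/‘no’ attribute as 1 and 0 and then runs a plain k-Means
loop (assign each observation to its nearest centroid, then move each centroid to the mean of its
members). The records, the attribute meanings (hours trained, injured) and the function name
k_means are hypothetical, and seeding the centroids with the first k rows is a simplification
chosen for repeatability. This is not how RapidMiner's operator is implemented.

```python
from math import dist

# Hypothetical records: (hours_trained, injured), where injured is 'yes' or 'no'.
raw_rows = [
    (2.0, "no"), (3.0, "no"), (2.5, "no"),
    (9.0, "yes"), (10.0, "yes"), (9.5, "yes"),
]

# Step 2: make every attribute numeric (yes -> 1, no -> 0).
encode = {"yes": 1.0, "no": 0.0}
rows = [(hours, encode[flag]) for hours, flag in raw_rows]


def k_means(points, k, iterations=20):
    """Plain k-Means; seeds the centroids with the first k points (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # centroids stopped moving, so the clusters are stable
        centroids = new_centroids
    return centroids, labels


# Step 3: choose k and run the model.
centroids, labels = k_means(rows, k=2)
print("centroids:", centroids)
print("cluster labels:", labels)
```

Because the sketch uses Euclidean distance, an attribute measured on a larger scale (here, hours)
will dominate a 0/1 attribute, so it may be worth experimenting with normalized values as well.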


CHAPTER SEVEN:
DISCRIMINANT ANALYSIS

CONTEXT AND PERSPECTIVE

Gill runs a sports academy designed to help high-school-aged athletes achieve their maximum
athletic potential. On the boys’ side of his academy, he focuses on four major sports: Football,
Basketball, Baseball and Hockey. He has found that while many high school athletes enjoy
participating in a number of sports in high school, as they begin to consider playing a sport at the
college level, they would prefer to specialize in one sport. As he’s worked with athletes over the
years, Gill has developed an extensive data set, and he is now wondering if he can use past
performance from some of his previous clients to predict prime sports for up-and-coming high
school athletes. Ultimately, he hopes he can make a recommendation to each athlete as to the
sport in which they should most likely choose to specialize. By evaluating each athlete’s
performance across a battery of tests, Gill hopes we can help him figure out for which sport each
athlete has the highest aptitude.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:

Explain what discriminant analysis is, how it is used and the benefits of using it.

Recognize the necessary format for data in order to perform discriminant analysis.

Explain the differences and similarities between k-Means clustering and discriminant analysis.

Develop a discriminant analysis data mining model in RapidMiner using a training data set.

Interpret the model output and apply it to a scoring data set in order to deploy the model.


ORGANIZATIONAL UNDERSTANDING

Gill’s objective is to examine young athletes and, based upon their performance across a number
of metrics, help them decide which sport offers them the best prospects for success as
specialists. Gill recognizes that all of his clients possess some measure of athleticism, and that
they enjoy participating in a number of sports. Being young, athletic, and adaptive, most of his
clients are quite good at a number of sports, and he has seen over the years that some people are
so naturally gifted that they would excel in any sport they chose for specialization. Thus, he
recognizes, as a limitation of this data mining exercise, that he may not be able to use data to
determine an athlete’s “best” sport. Still, he has seen metrics and evaluations work in the past,
and has seen that some of his previous athletes really were predisposed to a certain sport, and
that they were successful as they went on to specialize in that sport. Based on his industry
experience, he has decided to go ahead with an experiment in mining data for athletic aptitude,
and has enlisted our help.

DATA UNDERSTANDING

In order to begin to formulate a plan, we sit down with Gill to review his data assets. Every athlete
who has enrolled at Gill’s academy over the past several years has taken a battery test, which tested
for a number of athletic and personal traits. The battery has been administered to both boys and
girls participating in a number of different sports, but for this preliminary study we have decided
with Gill that we will look at data only for boys. Because the academy has been operating for
some time, Gill has the benefit of knowing which of his former pupils have gone on to specialize
in a single sport, and which sport it was for each of them. Working with Gill, we gather the results
of the batteries for all former clients who have gone on to specialize, and Gill adds the sport each
person specialized in. The result is a data set composed of 493 observations containing the
following attributes:


Age: This is the age in years (one decimal precision for the part of the year since the
client’s last birthday) at the time that the athletic and personality trait battery test was
administered. Participants ranged in age from 13-19 years old at the time they took the
battery.

Strength: This is the participant’s strength measured through a series of weight lifting
exercises and recorded on a scale of 0-10, with 0 being limited strength and 10 being
