Data Mining for the Masses


Chapter 6: k-Means Clustering

CHAPTER SIX:
K-MEANS CLUSTERING

CONTEXT AND PERSPECTIVE

Sonia is a program director for a major health insurance provider. Recently she has been reading
medical journals and other articles, and has found a strong emphasis on the influence of weight,
gender and cholesterol on the development of coronary heart disease. The research she’s read
confirms time after time that there is a connection between these three variables and heart
disease, and while there is little that can be done about one’s gender, there are certainly life
choices that can be made to alter one’s cholesterol and weight. She begins brainstorming ideas for
her company to offer weight and cholesterol management programs to individuals who receive health
insurance through her employer. As she considers where her efforts might be most effective, she
finds herself wondering if there are natural groups of individuals who are most at risk for high
weight and high cholesterol, and if there are such groups, where the natural dividing lines
between the groups occur.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:

• Explain what k-means clusters are, how they are found and the benefits of using them.
• Recognize the necessary format for data in order to create k-means clusters.
• Develop a k-means cluster data mining model in RapidMiner.
• Interpret the clusters generated by a k-means model and explain their significance, if any.

ORGANIZATIONAL UNDERSTANDING

Sonia’s goal is to identify and then try to reach out to individuals insured by her employer who are
at high risk for coronary heart disease because of high weight and/or high cholesterol. She
understands that those at low risk, that is, those with low weight and cholesterol, are unlikely to
participate in the programs she will offer. She also understands that there are probably policy
holders with high weight and low cholesterol, those with high weight and high cholesterol, and
those with low weight and high cholesterol. She further recognizes there are likely to be a lot of
people somewhere in between. In order to accomplish her goal, she needs to search among the
thousands of policy holders to find groups of people with similar characteristics and craft
programs and communications that will be relevant and appealing to people in these different
groups.

DATA UNDERSTANDING

Using the insurance company’s claims database, Sonia extracts three attributes for 547 randomly
selected individuals. The three attributes are the insured’s weight in pounds as recorded on the
person’s most recent medical examination, their last cholesterol level determined by blood work in
their doctor’s lab, and their gender. As is typical in many data sets, the gender attribute uses 0 to
indicate Female and 1 to indicate Male. We will use this sample data from Sonia’s employer’s
database to build a cluster model to help Sonia understand how her company’s clients, the health
insurance policy holders, appear to group together on the basis of their weights, genders and
cholesterol levels. We should remember as we do this that k-means clustering relies on means, and
means are particularly susceptible to undue influence by extreme outliers, so it is very important
to watch for inconsistent data when using this data mining methodology.
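
To make the point about outliers concrete, here is a minimal Python sketch. It is separate from the RapidMiner workflow used in this chapter, and the weights are invented for illustration rather than taken from Sonia’s data. It compares the mean of a small set of weights with and without one extreme value.

```python
from statistics import mean

# Hypothetical weights in pounds; illustrative values only,
# not drawn from the chapter's data set.
weights = [150, 160, 155, 165, 170]

# The same weights plus one implausible value, such as a data-entry error.
weights_with_outlier = weights + [900]

mean_clean = mean(weights)
mean_skewed = mean(weights_with_outlier)

print(f"Mean without outlier: {mean_clean}")
print(f"Mean with outlier:    {mean_skewed:.1f}")
```

A single bad record pulls the mean far away from every legitimate observation. The same effect would drag a cluster’s center toward the outlier.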

DATA PREPARATION

As with previous chapters, a data set has been prepared for this chapter’s example, and is available
as Chapter06DataSet.csv on the book’s companion web site. If you would like to follow along
with this example exercise, go ahead and download the data set now, and import it into your
RapidMiner data repository. At this point you are probably getting comfortable with importing
CSV data sets into a RapidMiner repository, but remember that the steps are outlined in Chapter 3
if you need to review them. Be sure to designate the attribute names correctly and to check your
data types as you import. Once you have imported the data set, drag it into a new, blank process
window so that you can begin to set up your k-means clustering data mining model. Your process
should look like Figure 6-1.

Figure 6-1. Cholesterol, Weight and Gender data set added to a new process.
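
For readers who want to see what naming the attributes and checking data types amounts to outside RapidMiner, the sketch below reads a CSV file in Python and converts each attribute to a numeric type. The column names Weight, Cholesterol and Gender are assumptions about the file’s header row, the function name load_policy_holders is hypothetical, and the two sample rows are invented stand-ins for the real file.

```python
import csv
import io


def load_policy_holders(csv_file):
    """Read rows from an open CSV file and convert each attribute to a number.

    Assumes a header row with the columns Weight, Cholesterol and Gender;
    adjust the names if the downloaded file labels them differently.
    """
    rows = []
    for record in csv.DictReader(csv_file):
        rows.append({
            "Weight": float(record["Weight"]),            # pounds
            "Cholesterol": float(record["Cholesterol"]),  # last recorded level
            "Gender": int(record["Gender"]),              # 0 = Female, 1 = Male
        })
    return rows


# Tiny in-memory stand-in for the real file; these values are invented.
sample = io.StringIO("Weight,Cholesterol,Gender\n102,111,1\n115,135,0\n")
rows = load_policy_holders(sample)
print(rows[0])

# With the downloaded file, usage would look like:
# with open("Chapter06DataSet.csv", newline="") as f:
#     rows = load_policy_holders(f)
```

A conversion error raised by float() or int() here plays the same role as a wrong data type at import time in RapidMiner: it signals a value that does not match what the attribute is supposed to hold.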

Go ahead and click the play button to run your model and examine the data set. In Figure 6-2 we
can see that we have 547 observations across our three previously defined attributes. We can see
the averages for each of the three attributes, along with their accompanying standard deviations
and ranges. None of these values appears to be inconsistent (remember the earlier comments about
using standard deviations to find statistical outliers). We have no missing values to handle, so our
data appear to be very clean and ready to be mined.

Figure 6-2. A view of our data set’s metadata.
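
The same kind of check can be sketched in Python. The example below computes the count, average, standard deviation and range for one attribute, then flags values that lie more than a chosen number of standard deviations from the mean. The function names summarize and flag_outliers, the threshold of 3, and the cholesterol readings are all assumptions made for illustration. The sketch uses the sample standard deviation, which may differ slightly from the figure RapidMiner reports.

```python
from statistics import mean, stdev


def summarize(values):
    """Return count, mean, sample standard deviation, minimum and maximum."""
    return {
        "count": len(values),
        "mean": mean(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def flag_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > threshold * s]


# Invented cholesterol readings, for illustration only.
cholesterol = [180, 190, 200, 210, 220]

stats = summarize(cholesterol)
print(stats)
print(flag_outliers(cholesterol))
```

An empty list from flag_outliers corresponds to the conclusion drawn from Figure 6-2: nothing in the attribute looks inconsistent enough to distort the means that k-means clustering depends on.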
