Chapter 6:
k-Means Clustering
CHAPTER SIX:
K-MEANS CLUSTERING
CONTEXT AND PERSPECTIVE
Sonia is a program director for a major health insurance provider. Recently she has been reading
medical journals and other articles, and has found a strong emphasis on the influence of weight,
gender and cholesterol on the development of coronary heart disease. The research she’s read
confirms time after time that there is a connection between these three variables, and while there is
little that can be done about one’s gender, there are certainly life choices that can be made to alter
one’s cholesterol and weight. She begins brainstorming ideas for her company to offer weight and
cholesterol management programs to individuals who receive health insurance through her
employer. As she considers where her efforts might be most effective, she finds herself wondering
if there are natural groups of individuals who are most at risk for high weight and high cholesterol,
and if there are such groups, where the natural dividing lines between the groups occur.
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain what k-means clusters are, how they are found and the benefits of using them.
Recognize the necessary format for data in order to create k-means clusters.
Develop a k-means cluster data mining model in RapidMiner.
Interpret the clusters generated by a k-means model and explain their significance, if any.
ORGANIZATIONAL UNDERSTANDING
Sonia’s goal is to identify and then try to reach out to individuals insured by her employer who are
at high risk for coronary heart disease because of high weight and/or high cholesterol. She
understands that those at low risk, that is, those with low weight and cholesterol, are unlikely to
Data Mining
for the Masses
participate in the programs she will offer. She also understands that there are probably policy
holders with high weight and low cholesterol, those with high weight and high cholesterol, and
those with low weight and high cholesterol. She further recognizes there are likely to be a lot of
people somewhere in between. In order to accomplish her goal, she needs to search among the
thousands of policy holders to find groups of people with similar characteristics and craft
programs and communications that will be relevant and appealing to people in these different
groups.
DATA UNDERSTANDING
Using the insurance company’s claims database, Sonia extracts three attributes for 547 randomly
selected individuals. The three attributes are the insured’s weight in pounds as recorded on the
person’s most recent medical examination, their last cholesterol level determined by blood work in
their doctor’s lab, and their gender. As is typical in many data sets, the gender attribute uses 0 to
indicate Female and 1 to indicate Male. We will use this sample data from Sonia’s employer’s
database to build a cluster model to help Sonia understand how her company’s clients, the health
insurance policy holders, appear to group together on the basis of their weights, genders and
cholesterol levels. We should remember as we do this that means are particularly susceptible to
undue influence by extreme outliers, so watching for inconsistent data when using the k-Means
clustering data mining methodology is very important.
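To see why outliers matter here, recall that each k-means cluster is summarized by the mean of its members. The short Python sketch below is only an illustration: the weights are made-up values, not taken from Sonia's data, and the single extreme value stands in for something like a data-entry slip.

```python
from statistics import mean

# Hypothetical weights in pounds; illustrative values only,
# not drawn from Sonia's data set.
weights = [150, 160, 170, 180, 190]

# The same values plus one extreme outlier (for example, 170 mistyped as 1700).
weights_with_outlier = weights + [1700]

clean_mean = mean(weights)
skewed_mean = mean(weights_with_outlier)

print("Mean without outlier:", clean_mean)
print("Mean with outlier:   ", skewed_mean)
```

A single bad value more than doubles the mean in this small example, which is why inconsistent observations can pull a cluster's center well away from the people it is supposed to represent.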
DATA PREPARATION
As with previous chapters, a data set has been prepared for this chapter’s example, and is available
as Chapter06DataSet.csv on the book’s companion web site. If you would like to follow along
with this example exercise, go ahead and download the data set now, and import it into your
RapidMiner data repository. At this point you are probably getting comfortable with importing
CSV data sets into a RapidMiner repository, but remember that the steps are outlined in Chapter 3
if you need to review them. Be sure to designate the attribute names correctly and to check your
data types as you import. Once you have imported the data set, drag it into a new, blank process
window so that you can begin to set up your k-means clustering data mining model. Your process
should look like Figure 6-1.
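If you would like to double-check attribute names and data types outside of RapidMiner, a few lines of Python can do it. This is a minimal sketch under stated assumptions: the column names Weight, Cholesterol and Gender and the three sample rows are made up for illustration, and the actual header in Chapter06DataSet.csv may differ.

```python
import csv
import io

# Assumed layout: the header names and rows below are hypothetical stand-ins;
# the real Chapter06DataSet.csv may use different names or values.
SAMPLE_CSV = """Weight,Cholesterol,Gender
102,111,1
115,135,1
143,190,0
"""

def load_rows(handle):
    """Read CSV rows and convert every attribute value to an integer."""
    reader = csv.DictReader(handle)
    return [{name: int(value) for name, value in row.items()} for row in reader]

rows = load_rows(io.StringIO(SAMPLE_CSV))
attribute_names = list(rows[0].keys())

print("Attributes:", attribute_names)
print("First observation:", rows[0])
```

To try this on the downloaded file instead of the sample text, you could pass open("Chapter06DataSet.csv", newline="") to load_rows, assuming the file sits in your working directory and every value is a whole number.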
Figure 6-1. Cholesterol, Weight and Gender data set added to a new process.
Go ahead and click the play button to run your process and examine the data set. In Figure 6-2 we
can see that we have 547 observations across our three previously defined attributes. We can see
the averages for each of the three attributes, along with their accompanying standard deviations
and ranges. None of these values appear to be inconsistent (remember the earlier comments about
using standard deviations to find statistical outliers). We have no missing values to handle, so our
data appear to be very clean and ready to be mined.
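The same checks that the meta data view supports (count, missing values, average, standard deviation and range) can be sketched in Python. The values below are hypothetical, and the cutoff of two standard deviations is an assumption chosen for illustration; it is not a setting taken from RapidMiner.

```python
from statistics import mean, stdev

def summarize(values):
    """Return count, missing count, mean, standard deviation and range."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "stdev": stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }

def flag_outliers(values, threshold=2.0):
    """List values more than `threshold` standard deviations from the mean."""
    present = [v for v in values if v is not None]
    center, spread = mean(present), stdev(present)
    return [v for v in present if abs(v - center) > threshold * spread]

# Hypothetical cholesterol readings; None marks a missing value.
cholesterol = [150, 160, 170, 180, 190, 200, 210, None]

summary = summarize(cholesterol)
print(summary)
print("Possible outliers:", flag_outliers(cholesterol))
```

A flagged value is not automatically wrong. It is simply an observation worth a second look before it is allowed to influence the cluster means.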
Figure 6-2. A view of our data set’s meta data.