Data Mining for the Masses

MODELING

The ‘k’ in k-means clustering stands for some number of groups, or clusters. The aim of this data
mining methodology is to compare each observation’s attribute values to the means, or averages, of
potential groups of observations in order to find natural groups whose members are similar to one
another. The k-means algorithm accomplishes this by sampling some set of observations in the data
set, calculating the mean of each attribute for the observations in that sample, and then comparing
the other observations in the data set to that sample’s means. The system does this repeatedly in
order to ‘circle in’ on the best matches and then to formulate the groups of observations which
become the clusters. As the means calculated on successive passes become more and more similar,
clusters are formed, and each observation becomes a member of the cluster whose means its
attribute values most closely resemble. Because the process is iterative, k-means clustering models
can sometimes take a long time to run, especially if you indicate a large number of “max runs”
through the data, or if you seek a large number of clusters (k). To build your k-means cluster
model, complete the following steps:

1) Return to design view in RapidMiner if you have not done so already. In the operators
search box, type k-means (be sure to include the hyphen). There are three operators that
conduct k-means clustering work in RapidMiner. For this exercise, we will choose the first,
which is simply named “k-Means”. Drag this operator into your stream, as shown in
Figure 6-3.

Figure 6-3. Adding the k-Means operator to our model.

Chapter 6: k-Means Clustering

2) Because we did not need to add any other operators in order to prepare our data for
mining, our model in this exercise is very simple. We could, at this point, run our model
and begin to interpret the results. This would not be very interesting, however, because
the default for our k, or our number of clusters, is 2, as indicated by the black
arrow on the right hand side of Figure 6-3. This means we are asking RapidMiner to find
only two clusters in our data. If we only wanted to find those with high and low levels of
risk for coronary heart disease, two clusters would work. But as discussed in the
Organizational Understanding section earlier in the chapter, Sonia has already recognized
that there are likely a number of different types of groups to be considered. Simply
splitting the data set into two clusters is probably not going to give Sonia the level of detail
she seeks. Because Sonia felt that there were probably at least four potentially different
groups, let’s change the k value to four, as depicted in Figure 6-4. We could also increase
the number of ‘max runs’, but for now, let’s accept the default and run the model.

Figure 6-4. Setting the desired number of clusters for our model.

3) When the model is run, we find an initial report of the number of items that fell into each
of our four clusters. (Note that the clusters are numbered starting from 0, which follows the
zero-based counting convention of Java, the programming language in which RapidMiner is
written.) In this particular model, our clusters are fairly well balanced. While Cluster 1,
with only 118 observations (Figure 6-5), is smaller than the other clusters, it is not
unreasonably so.

Figure 6-5. The distribution of observations across our four clusters.
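
For readers who want to see the iterative idea described at the start of this section outside of RapidMiner, the following is a minimal sketch in plain Python. It assumes the standard assign-then-update formulation of k-means and is not RapidMiner's implementation; the function names (`k_means`, `squared_distance`, `mean_point`), the `max_iterations` and `seed` parameters, and the tiny (weight, cholesterol) data set are all hypothetical and invented for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Attribute-by-attribute mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def k_means(data, k, max_iterations=100, seed=0):
    """Assign-then-update k-means sketch; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # start from k randomly chosen observations
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each observation joins the cluster with the nearest means.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in data
        ]
        if new_labels == labels:
            break  # memberships stopped changing, so the means are stable
        labels = new_labels
        # Update step: recompute each cluster's means from its current members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical (weight, cholesterol) observations, invented for illustration.
data = [
    (185, 235), (190, 240), (188, 238),
    (160, 200), (158, 198), (162, 205),
    (135, 165), (132, 160), (138, 168),
    (110, 130), (108, 128), (112, 133),
]

labels, centroids = k_means(data, k=4)
# Number of observations in clusters 0..3, comparable to the report in Figure 6-5.
cluster_sizes = [labels.count(c) for c in range(4)]
print(cluster_sizes)
```

Because the starting centroids are chosen at random, a different `seed` can produce different clusters. That is one reason a tool may offer repeated runs with fresh random starting points.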

We could go back at this point and adjust our number of clusters, our number of ‘max runs’, or
even experiment with the other parameters offered by the k-Means operator. There are other
options for measurement type or divergence algorithms. Feel free to try out some of these options
if you wish. As was the case with Association Rules, there may be some back and forth trial-and-
error as you test different parameters to generate model output. When you are satisfied with your
model parameters, you can proceed to…

EVALUATION

Recall that Sonia’s major objective in the hypothetical scenario posed at the beginning of the
chapter was to try to find natural breaks between different types of heart disease risk groups.
Using the k-Means operator in RapidMiner, we have identified four clusters for Sonia, and we can
now evaluate their usefulness in addressing Sonia’s question. Refer back to Figure 6-5. There are a
number of radio buttons which allow us to select options for analyzing our clusters. We will start
by looking at our Centroid Table. This view of our results, shown in Figure 6-6, gives the means
for each attribute in each of the four clusters we created.


Figure 6-6. The means for each attribute in our four (k) clusters.

We see in this view that cluster 0 has the highest average weight and cholesterol. With 0
representing Female and 1 representing Male, a mean of 0.591 indicates that we have more men
than women represented in this cluster. Knowing that high cholesterol and weight are two key
indicators of heart disease risk that policy holders can do something about, Sonia would likely want
to start with the members of cluster 0 when promoting her new programs. She could then extend
her programming to include the people in clusters 1 and 2, which have the next incrementally
lower means for these two key risk factor attributes. You should note that in this chapter’s
example, the clusters’ numeric order (0, 1, 2, 3) corresponds to decreasing means for each cluster.
This is coincidental. Sometimes, depending on your data set, cluster 0 might have the highest
means, but cluster 2 might have the next highest, so it’s important to pay close attention to your
centroid values whenever you generate clusters.
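
A centroid table like the one in Figure 6-6 is simply the mean of each attribute within each cluster. The sketch below assumes observations that already carry a cluster label; the `centroid_table` helper, the six rows, and the labels are hypothetical and are not taken from the chapter's data set. The labels are deliberately out of risk order to show why clusters should be ranked by their centroid values and not by their numbers.

```python
from collections import defaultdict


def centroid_table(rows, labels, attributes):
    """Return {cluster: {attribute: mean}} for observations with cluster labels."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        cluster: {a: sum(r[a] for r in members) / len(members) for a in attributes}
        for cluster, members in grouped.items()
    }


# Hypothetical observations; gender is coded 0 = Female, 1 = Male as in the text.
rows = [
    {"weight": 190, "cholesterol": 240, "gender": 1},
    {"weight": 180, "cholesterol": 230, "gender": 0},
    {"weight": 150, "cholesterol": 190, "gender": 1},
    {"weight": 140, "cholesterol": 180, "gender": 0},
    {"weight": 110, "cholesterol": 130, "gender": 0},
    {"weight": 120, "cholesterol": 140, "gender": 0},
]
labels = [2, 2, 0, 0, 1, 1]  # hypothetical cluster numbers, not in risk order

table = centroid_table(rows, labels, ["weight", "cholesterol", "gender"])

# Rank clusters by mean cholesterol, highest first, instead of by cluster number.
ranked = sorted(table, key=lambda c: table[c]["cholesterol"], reverse=True)
print(ranked)  # [2, 0, 1]
```

In this made-up data, cluster 2 has the highest mean cholesterol, so it would be the place to start even though it is not numbered first. A gender mean above 0.5 in a cluster would indicate more men than women, as described above.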

So we know that cluster 0 is where Sonia will likely focus her early efforts, but how does she know
whom to contact? Who are the members of this highest risk cluster? We can find this
information by selecting the Folder View radio button. Folder View is depicted in Figure 6-7.

Figure 6-7.  Folder view showing the observations included in Cluster 0.
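
Listing the members of one cluster, as Folder View does, amounts to filtering observations by their cluster label. A minimal sketch, assuming each observation has an identifier; the `cluster_members` helper, the IDs, and the labels here are hypothetical:

```python
def cluster_members(ids, labels, cluster):
    """Return the identifiers of the observations assigned to the given cluster."""
    return [obs_id for obs_id, label in zip(ids, labels) if label == cluster]


# Hypothetical observation identifiers and their cluster labels.
ids = ["P001", "P002", "P003", "P004", "P005", "P006"]
labels = [0, 3, 0, 1, 2, 0]

high_risk = cluster_members(ids, labels, cluster=0)
print(high_risk)  # ['P001', 'P003', 'P006']
```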
