Data Mining for the Masses
By clicking the small + sign next to cluster 0 in Folder View, we can see all of the observations that
have means which are similar to the mean for this cluster. Remember that these means are
calculated for each attribute. You can see the details for any observation in the cluster by clicking
on it. Figure 6-8 shows the results of clicking on observation 6 (6.0):
Figure 6-8. The details of an observation within cluster 0.
The means for cluster 0 were just over 184 pounds for weight and just under 219 for cholesterol.
The person represented in observation 6 is heavier and has higher cholesterol than the average for
this highest risk group. Thus, this is a person Sonia is really hoping to help with her outreach
program. But we know from the Centroid Table that there are 154 individuals in the data set who
fall into this cluster. Clicking on each one of them in Folder View probably isn’t the most efficient
use of Sonia’s time. Furthermore, we know from our Data Understanding paragraph earlier in this
chapter that this model is built on only a sample data set of policy holders. Sonia might want to
extract these attributes for all policy holders from the company’s database and run the model again
on that data set. Or, if she is satisfied that the sample has given her what she wants in terms of
finding the breaks between the groups, she can move forward with…
DEPLOYMENT
We can help Sonia extract the observations from cluster 0 fairly quickly and easily. Return to
design perspective in RapidMiner. Recall from Chapter 3 that we can filter out observations in our
data set. In that chapter, we discussed filtering out observations as a Data Preparation step, but we
can use the same operator in our Deployment as well. Using the search field in the Operators tab,
locate the Filter Examples operator and connect it to your k-Means Clustering operator, as is
depicted in Figure 6-9. Note that we have not disconnected the 'clu' (cluster) port from the 'res'
(result set) port; rather, we have connected a second 'clu' port to the 'exa' (example set) port on
the Filter Examples operator, and connected the 'exa' port from Filter Examples to its own 'res' port.
Figure 6-9. Filtering our cluster model’s output for only observations in cluster 0.
As indicated by the black arrows in Figure 6-9, we are filtering our observations using an
attribute filter with the parameter string cluster=cluster_0. This means that only those
observations in the data set that were classified into the cluster_0 group will be retained. Go
ahead and click the play button to run the model again.
You will see that we have not lost our Cluster Model tab. It is still available to us, but now we
have added an ExampleSet tab, which contains only those 154 observations which fell into cluster
0. As with the result of previous models we’ve created, we have descriptive statistics for the
various attributes in the data set.
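Outside RapidMiner, the same filter-then-describe step can be sketched in Python with scikit-learn and pandas. This is only an illustration: the rows below are invented, the column names (Weight, Cholesterol) follow this chapter's attributes, and scikit-learn's KMeans stands in for RapidMiner's k-Means operator.

```python
# Illustrative sketch: cluster policy holders on two attributes, then keep
# only the observations assigned to one cluster and summarize them -- the
# equivalent of Filter Examples with cluster = cluster_0.
# The data rows here are made up for the example.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "Weight":      [185, 150, 120, 200, 145, 130, 190, 155],
    "Cholesterol": [220, 180, 150, 230, 175, 160, 215, 185],
})

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(df[["Weight", "Cholesterol"]])

# Keep only the observations in one cluster -- here, the cluster containing
# the heaviest policy holder (cluster numbers themselves are arbitrary).
high_risk_label = df.loc[df["Weight"].idxmax(), "cluster"]
high_risk = df[df["cluster"] == high_risk_label]

# Descriptive statistics for that cluster, including the min/max ranges
# Sonia would plug into her SQL query.
print(high_risk[["Weight", "Cholesterol"]].describe())
```

The min and max rows of the `describe()` output play the same role as the Range statistics in Figure 6-10.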
Figure 6-10. Filtered results for only cluster 0 observations.
Sonia could use these figures to begin contacting potential participants in her programs. With the
high risk group having weights between 167 and 203 pounds, and cholesterol levels between 204
and 235 (these are taken from the Range statistics in Figure 6-10), she could return to her
company’s database and issue a SQL query like this one:
SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE Weight >= 167
AND Cholesterol >= 204;
This would give her the contact list for every person, male or female, insured by her employer who
would fall into the higher risk group (cluster 0) in our data mining model. She could change the
parameter criteria in our Filter Examples operator to be cluster=cluster_1 and re-run the model to
get the descriptive statistics for those in the next highest risk group, and modify her SQL statement
to get the contact list for that group from her organizational database; something akin to this
query:
SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE (Weight >= 140 AND Weight <= 169)
AND (Cholesterol >= 168 AND Cholesterol <= 204);
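If Sonia wanted to reuse the same query with different bounds per cluster, she could pass the ranges as parameters. The sketch below runs the cluster_1 range query against an in-memory SQLite table; the table and column names follow the queries above, but the rows are invented for the example.

```python
import sqlite3

# Illustrative only: an in-memory table standing in for PolicyHolders_view,
# populated with made-up rows. In practice Sonia would query the company
# database instead.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE PolicyHolders_view (
    First_Name TEXT, Last_Name TEXT, Policy_Num TEXT,
    Address TEXT, Phone_Num TEXT, Weight REAL, Cholesterol REAL)""")
conn.executemany(
    "INSERT INTO PolicyHolders_view VALUES (?, ?, ?, ?, ?, ?, ?)",
    [("Ada",  "Ames", "P001", "1 Elm St", "555-0101", 150, 180),
     ("Ben",  "Bond", "P002", "2 Oak St", "555-0102", 190, 220),
     ("Cara", "Cole", "P003", "3 Fir St", "555-0103", 145, 175)])

# The cluster_1 range query from the text, with the bounds supplied as
# parameters so they can be swapped out for each cluster's ranges.
rows = conn.execute(
    """SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
       FROM PolicyHolders_view
       WHERE Weight BETWEEN ? AND ?
         AND Cholesterol BETWEEN ? AND ?""",
    (140, 169, 168, 204)).fetchall()
print(rows)  # only Ada and Cara fall inside the cluster_1 ranges
```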
If she wishes to also separate her groups by gender, she could add that criterion as well, such as
“AND Gender = 1” in the WHERE clause of the SQL statement. As she continues to develop
her health improvement programs, Sonia would have the lists of individuals that she most wants to
target in the hopes of raising awareness, educating policy holders, and modifying behaviors that
will lead to lower incidence of heart disease among her employer’s clients.
CHAPTER SUMMARY
k-Means clustering is a data mining model that falls primarily on the side of Classification when
referring to the Venn diagram from Chapter 1 (Figure 1-2). For this chapter’s example, it does not
necessarily predict which insurance policy holders will or will not develop heart disease. It simply
takes known indicators from the attributes in a data set, and groups them together based on those
attributes’ similarity to group averages. Because a mean can be calculated for any quantifiable
attribute, k-means clustering provides an effective way of grouping observations together
based on what is typical or normal for that group. It also helps us understand where one group
begins and the other ends, or in other words, where the natural breaks occur between groups in a
data set.
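The assign-to-nearest-mean step at the heart of k-means can be sketched in a few lines of plain Python. The (weight, cholesterol) points and starting centroids below are invented toy values; the sketch shows a single iteration of the algorithm: assign each observation to its nearest centroid, then recompute each centroid as the mean of its assigned observations.

```python
# Toy sketch of one k-means iteration on (weight, cholesterol) pairs.
points = [(185, 220), (150, 180), (120, 150), (200, 230), (145, 175)]
centroids = [(130, 160), (190, 220)]  # arbitrary starting means

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

# Assignment step: group each observation with its nearest centroid.
groups = {i: [] for i in range(len(centroids))}
for pt in points:
    groups[nearest(pt, centroids)].append(pt)

# Update step: each new centroid is the attribute-wise mean of its group.
centroids = [tuple(sum(vals) / len(vals) for vals in zip(*groups[i]))
             for i in range(len(centroids))]
print(centroids)  # the lighter and heavier groups' new means
```

Repeating these two steps until the assignments stop changing is exactly what the k-Means operator does behind the scenes.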
k-Means clustering is very flexible in its ability to group observations together. The k-Means
operator in RapidMiner allows data miners to set the number of clusters they wish to generate, to
dictate the number of sample means used to determine the clusters, and to use a number of
different algorithms to evaluate means. While fairly simple in its set-up and definition, k-Means
clustering is a powerful method for finding natural groups of observations in a data set.
REVIEW QUESTIONS
1) What does the k in k-Means clustering stand for?

2) How are clusters identified? What process does RapidMiner use to define clusters and place observations in a given cluster?

3) What does the Centroid Table tell the data miner? How do you interpret the values in a Centroid Table?

4) How do descriptive statistics aid in the process of evaluating and deploying a k-Means clustering model?