Data Mining for the Masses
By clicking the small + sign next to cluster 0 in Folder View, we can see all of the observations that
have means which are similar to the mean for this cluster. Remember that these means are
calculated for each attribute. You can see the details for any observation in the cluster by clicking
on it. Figure 6-8 shows the results of clicking on observation 6 (6.0):
Figure 6-8. The details of an observation within cluster 0.
The means for cluster 0 were just over 184 pounds for weight and just under 219 for cholesterol.
The person represented in observation 6 is heavier and has higher cholesterol than the average for
this highest risk group. Thus, this is a person Sonia is really hoping to help with her outreach
program. But we know from the Centroid Table that there are 154 individuals in the data set who
fall into this cluster. Clicking on each one of them in Folder View probably isn’t the most efficient
use of Sonia’s time. Furthermore, we know from our Data Understanding paragraph earlier in this
chapter that this model is built on only a sample data set of policy holders. Sonia might want to
extract these attributes for all policy holders from the company’s database and run the model again
on that data set. Or, if she is satisfied that the sample has given her what she wants in terms of
finding the breaks between the groups, she can move forward with…
DEPLOYMENT
We can help Sonia extract the observations from cluster 0 fairly quickly and easily. Return to
design perspective in RapidMiner. Recall from Chapter 3 that we can filter out observations in our
data set. In that chapter, we discussed filtering out observations as a Data Preparation step, but we
can use the same operator in our Deployment as well. Using the search field in the Operators tab,
locate the Filter Examples operator and connect it to your k-Means Clustering operator, as is
depicted in Figure 6-9. Note that we have not disconnected the 'clu' (cluster) port from the 'res'
(result set) port; rather, we have connected a second 'clu' port to the 'exa' (example set) port on
the Filter Examples operator, and connected the 'exa' port from Filter Examples to its own 'res' port.
Figure 6-9. Filtering our cluster model’s output for only observations in cluster 0.
As indicated by the black arrows in Figure 6-9, we are filtering our observations using an
attribute filter with the parameter string cluster=cluster_0. This means that only those
observations in the data set that were classified into the cluster_0 group will be retained. Go
ahead and click the play button to run the model again.
You will see that we have not lost our Cluster Model tab. It is still available to us, but now we
have added an ExampleSet tab, which contains only those 154 observations which fell into cluster
0. As with the result of previous models we’ve created, we have descriptive statistics for the
various attributes in the data set.
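Outside RapidMiner, the same filter-then-describe step can be sketched in Python with scikit-learn and pandas. This is only an illustration: the rows below are invented, the column names (Weight, Cholesterol) follow this chapter's attributes, and scikit-learn's KMeans stands in for RapidMiner's k-Means operator.

```python
# Illustrative sketch: cluster policy holders on two attributes, then keep
# only the observations assigned to one cluster and summarize them -- the
# equivalent of Filter Examples with cluster = cluster_0.
# The data rows here are made up for the example.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "Weight":      [185, 150, 120, 200, 145, 130, 190, 155],
    "Cholesterol": [220, 180, 150, 230, 175, 160, 215, 185],
})

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(df[["Weight", "Cholesterol"]])

# Keep only the observations in one cluster -- here, the cluster containing
# the heaviest policy holder (cluster numbers themselves are arbitrary).
high_risk_label = df.loc[df["Weight"].idxmax(), "cluster"]
high_risk = df[df["cluster"] == high_risk_label]

# Descriptive statistics for that cluster, including the min/max ranges
# Sonia would plug into her SQL query.
print(high_risk[["Weight", "Cholesterol"]].describe())
```

The min and max rows of the `describe()` output play the same role as the Range statistics in Figure 6-10.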
Figure 6-10. Filtered results for only cluster 0 observations.
Sonia could use these figures to begin contacting potential participants in her programs. With the
high risk group having weights between 167 and 203 pounds, and cholesterol levels between 204
and 235 (these are taken from the Range statistics in Figure 6-10), she could return to her
company’s database and issue a SQL query like this one:
SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE Weight >= 167
AND Cholesterol >= 204;
This would give her the contact list for every person, male or female, insured by her employer who
would fall into the higher risk group (cluster 0) in our data mining model. She could change the
parameter criteria in our Filter Examples operator to be cluster=cluster_1 and re-run the model to
get the descriptive statistics for those in the next highest risk group, and modify her SQL statement
to get the contact list for that group from her organizational database; something akin to this
query:
SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE (Weight >= 140 AND Weight <= 169)
AND (Cholesterol >= 168 AND Cholesterol <= 204);
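If Sonia wanted to reuse the same query with different bounds per cluster, she could pass the ranges as parameters. The sketch below runs the cluster_1 range query against an in-memory SQLite table; the table and column names follow the queries above, but the rows are invented for the example.

```python
import sqlite3

# Illustrative only: an in-memory table standing in for PolicyHolders_view,
# populated with made-up rows. In practice Sonia would query the company
# database instead.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE PolicyHolders_view (
    First_Name TEXT, Last_Name TEXT, Policy_Num TEXT,
    Address TEXT, Phone_Num TEXT, Weight REAL, Cholesterol REAL)""")
conn.executemany(
    "INSERT INTO PolicyHolders_view VALUES (?, ?, ?, ?, ?, ?, ?)",
    [("Ada",  "Ames", "P001", "1 Elm St", "555-0101", 150, 180),
     ("Ben",  "Bond", "P002", "2 Oak St", "555-0102", 190, 220),
     ("Cara", "Cole", "P003", "3 Fir St", "555-0103", 145, 175)])

# The cluster_1 range query from the text, with the bounds supplied as
# parameters so they can be swapped out for each cluster's ranges.
rows = conn.execute(
    """SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
       FROM PolicyHolders_view
       WHERE Weight BETWEEN ? AND ?
         AND Cholesterol BETWEEN ? AND ?""",
    (140, 169, 168, 204)).fetchall()
print(rows)  # only Ada and Cara fall inside the cluster_1 ranges
```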
If she wishes to also separate her groups by gender, she could add that criterion as well, such as
“AND Gender = 1” in the WHERE clause of the SQL statement. As she continues to develop
her health improvement programs, Sonia would have the lists of individuals that she most wants to
target in the hopes of raising awareness, educating policy holders, and modifying behaviors that
will lead to lower incidence of heart disease among her employer’s clients.
CHAPTER SUMMARY
k-Means clustering is a data mining model that falls primarily on the side of Classification when
referring to the Venn diagram from Chapter 1 (Figure 1-2). For this chapter’s example, it does not
necessarily predict which insurance policy holders will or will not develop heart disease. It simply
takes known indicators from the attributes in a data set, and groups them together based on those
attributes’ similarity to group averages. Because a mean can be calculated for any quantifiable
attribute, k-means clustering provides an effective way of grouping observations together
based on what is typical or normal for that group. It also helps us understand where one group
begins and the other ends, or in other words, where the natural breaks occur between groups in a
data set.
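The assign-to-nearest-mean step at the heart of k-means can be sketched in a few lines of plain Python. The (weight, cholesterol) points and starting centroids below are invented toy values; the sketch shows a single iteration of the algorithm: assign each observation to its nearest centroid, then recompute each centroid as the mean of its assigned observations.

```python
# Toy sketch of one k-means iteration on (weight, cholesterol) pairs.
points = [(185, 220), (150, 180), (120, 150), (200, 230), (145, 175)]
centroids = [(130, 160), (190, 220)]  # arbitrary starting means

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

# Assignment step: group each observation with its nearest centroid.
groups = {i: [] for i in range(len(centroids))}
for pt in points:
    groups[nearest(pt, centroids)].append(pt)

# Update step: each new centroid is the attribute-wise mean of its group.
centroids = [tuple(sum(vals) / len(vals) for vals in zip(*groups[i]))
             for i in range(len(centroids))]
print(centroids)  # the lighter and heavier groups' new means
```

Repeating these two steps until the assignments stop changing is exactly what the k-Means operator does behind the scenes.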
k-Means clustering is very flexible in its ability to group observations together. The k-Means
operator in RapidMiner allows data miners to set the number of clusters they wish to generate, to
dictate the number of sample means used to determine the clusters, and to use a number of
different algorithms to evaluate means. While fairly simple in its set-up and definition, k-Means
clustering is a powerful method for finding natural groups of observations in a data set.
REVIEW QUESTIONS
1) What does the k in k-Means clustering stand for?

2) How are clusters identified? What process does RapidMiner use to define clusters and place observations in a given cluster?

3) What does the Centroid Table tell the data miner? How do you interpret the values in a Centroid Table?

4) How do descriptive statistics aid in the process of evaluating and deploying a k-Means clustering model?