Data Mining for the Masses
MODELING
The ‘k’ in k-means clustering stands for some number of groups, or clusters. The aim of this data
mining methodology is to look at each observation’s individual attribute values and compare them
to the means, or in other words averages, of potential groups of other observations in order to find
natural groups of observations that are similar to one another. The k-means algorithm accomplishes
this by sampling some set of observations in the data set, calculating the averages, or means, for each
attribute for the observations in that sample, and then comparing the other observations in the data
set to that sample’s means. The system does this repeatedly in order to home in on the best
matches and then to formulate groups of observations which become the clusters. As the means
calculated change less and less from one pass to the next, clusters are formed, and each observation
whose attribute values are most like the means of a cluster becomes a member of that cluster. Because
of this repetition, k-means clustering models can sometimes take a long time to run, especially if you
indicate a large number of “max runs” through the data, or if you seek a large number of
clusters (k). To build your k-means cluster model, complete the following steps:
1) Return to design view in RapidMiner if you have not done so already. In the operators
search box, type k-means (be sure to include the hyphen). There are three operators that
conduct k-means clustering work in RapidMiner. For this exercise, we will choose the first,
which is simply named “k-Means”. Drag this operator into your stream, as shown in
Figure 6-3.
Figure 6-3. Adding the k-Means operator to our model.
Chapter 6: k-Means Clustering
2) Because we did not need to add any other operators in order to prepare our data for
mining, our model in this exercise is very simple. We could, at this point, run our model
and begin to interpret the results. This would not be very interesting, however, because
the default for our k, or our number of clusters, is 2, as indicated by the black
arrow on the right-hand side of Figure 6-3. This means we are asking RapidMiner to find
only two clusters in our data. If we only wanted to find those with high and low levels of
risk for coronary heart disease, two clusters would work. But as discussed in the
Organizational Understanding section earlier in the chapter, Sonia has already recognized
that there are likely a number of different types of groups to be considered. Simply
splitting the data set into two clusters is probably not going to give Sonia the level of detail
she seeks. Because Sonia felt that there were probably at least four potentially different
groups, let’s change the k value to four, as depicted in Figure 6-4. We could also increase
the number of ‘max runs’, but for now, let’s accept the default and run the model.
Figure 6-4. Setting the desired number of clusters for our model.
3) When the model is run, we find an initial report of the number of items that fell into each
of our four clusters. (Note that the clusters are numbered starting from 0, a result of
RapidMiner being written in the Java programming language.) In this particular model, our
clusters are fairly well balanced. While Cluster 1, with only 118 observations (Figure 6-5),
is smaller than the other clusters, it is not unreasonably so.
Figure 6-5. The distribution of observations across our four clusters.
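For readers who want to see the idea behind the operator outside RapidMiner, the steps above can be sketched in a few lines of Python. This is a minimal illustration of the standard Lloyd-style k-means iteration, not RapidMiner's own implementation. The function name `kmeans`, the parameters `max_steps` and `seed`, and the (weight, cholesterol) values are all invented for this sketch.

```python
import random
from collections import Counter


def squared_distance(p, q):
    """Squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to observation p."""
    return min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))


def kmeans(points, k, max_steps=100, seed=0):
    """Lloyd-style k-means sketch; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k observations sampled at random from the data set.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_steps):
        # Compare every observation to the current means.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # memberships stopped changing, so the means are stable
        labels = new_labels
        # Recalculate each cluster's means from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Made-up (weight, cholesterol) observations scattered around four invented centers.
data_rng = random.Random(42)
centers = [(110, 125), (150, 160), (175, 190), (200, 220)]
points = [(data_rng.gauss(w, 5), data_rng.gauss(c, 5)) for w, c in centers for _ in range(25)]

centroids, labels = kmeans(points, k=4)
sizes = Counter(labels)
for cluster in sorted(sizes):
    # Clusters are numbered from 0 here as well.
    print(f"cluster_{cluster}: {sizes[cluster]} items")
```

Because the starting centroids are chosen at random, a different `seed` may produce different cluster sizes, which is one reason to try more than one run before settling on a model.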
We could go back at this point and adjust our number of clusters, our number of ‘max runs’, or
even experiment with the other parameters offered by the k-Means operator. There are other
options for measurement type or divergence algorithms. Feel free to try out some of these options
if you wish. As was the case with Association Rules, there may be some back and forth trial-and-
error as you test different parameters to generate model output. When you are satisfied with your
model parameters, you can proceed to…
EVALUATION
Recall that Sonia’s major objective in the hypothetical scenario posed at the beginning of the
chapter was to try to find natural breaks between different types of heart disease risk groups.
Using the k-Means operator in RapidMiner, we have identified four clusters for Sonia, and we can
now evaluate their usefulness in addressing Sonia’s question. Refer back to Figure 6-5. There are a
number of radio buttons which allow us to select options for analyzing our clusters. We will start
by looking at our Centroid Table. This view of our results, shown in Figure 6-6, gives the means
for each attribute in each of the four clusters we created.
Figure 6-6. The means for each attribute in our four (k) clusters.
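A centroid table is simply the per-cluster average of every attribute. The sketch below shows that calculation on a handful of invented records; the attribute names, the values, and the helper `centroid_table` are assumptions made for illustration and are not taken from the chapter's data set.

```python
from collections import defaultdict

ATTRIBUTES = ("weight", "cholesterol", "gender")

# Invented records: (cluster label, weight, cholesterol, gender),
# with gender coded 0 = Female and 1 = Male.
records = [
    (0, 190, 215, 1), (0, 182, 221, 0), (0, 186, 209, 1),
    (1, 150, 170, 0), (1, 158, 176, 1),
]


def centroid_table(records):
    """Return {cluster: {attribute: mean}} for already-clustered records."""
    groups = defaultdict(list)
    for cluster, *values in records:
        groups[cluster].append(values)
    return {
        cluster: {
            name: sum(column) / len(rows)
            for name, column in zip(ATTRIBUTES, zip(*rows))
        }
        for cluster, rows in sorted(groups.items())
    }


table = centroid_table(records)
for cluster, means in table.items():
    print(cluster, {name: round(value, 3) for name, value in means.items()})
```

Since gender is coded 0 or 1, its mean within a cluster is the share of that cluster's members coded 1.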
We see in this view that cluster 0 has the highest average weight and cholesterol. With 0
representing Female and 1 representing Male, a mean of 0.591 for gender indicates that about 59%
of the members of this cluster are men. Knowing that high cholesterol and weight are two key
indicators of heart disease risk that policy holders can do something about, Sonia would likely want
to start with the members of cluster 0 when promoting her new programs. She could then extend
her programming to include the people in clusters 1 and 2, which have the next incrementally
lower means for these two key risk factor attributes. You should note that in this chapter’s
example, the clusters’ numeric order (0, 1, 2, 3) corresponds to decreasing means for each cluster.
This is coincidental. Sometimes, depending on your data set, cluster 0 might have the highest
means, but cluster 2 might have the next highest, so it’s important to pay close attention to your
centroid values whenever you generate clusters.
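Because cluster numbers carry no meaning of their own, it can help to order the clusters by their centroid values rather than by their labels. The following sketch does this for a hypothetical centroid table; the numbers are invented so that the label order and the risk order disagree, and `rank_clusters` is a helper name made up for this example.

```python
# Hypothetical centroid values, invented for illustration only.
centroids = {
    0: {"weight": 184.3, "cholesterol": 218.9},
    1: {"weight": 127.7, "cholesterol": 154.4},
    2: {"weight": 152.1, "cholesterol": 185.9},
    3: {"weight": 106.9, "cholesterol": 119.5},
}


def rank_clusters(centroids, attribute):
    """Cluster labels ordered from highest to lowest mean of one attribute."""
    return sorted(centroids, key=lambda c: centroids[c][attribute], reverse=True)


by_weight = rank_clusters(centroids, "weight")
by_cholesterol = rank_clusters(centroids, "cholesterol")
print("by weight:", by_weight)
print("by cholesterol:", by_cholesterol)
```

In this invented table, cluster 2 ranks ahead of cluster 1 on both attributes, which is exactly the situation in which reading the labels as a risk order would mislead.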
So we know that cluster 0 is where Sonia will likely focus her early efforts, but how does she know
who to try to contact? Who are the members of this highest risk cluster? We can find this
information by selecting the Folder View radio button. Folder View is depicted in Figure 6-7.
Figure 6-7. Folder view showing the observations included in Cluster 0.
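Listing the members of one cluster amounts to filtering the clustered observations by their cluster label. The sketch below assumes each observation carries an identifier; the `id` field, the other values, and the helper `members_of` are hypothetical and stand in for whatever row identifier your own data provides.

```python
# Invented clustered observations; "id" is a hypothetical row identifier.
clustered = [
    {"id": 1, "cluster": 2, "weight": 151, "cholesterol": 184},
    {"id": 2, "cluster": 0, "weight": 188, "cholesterol": 222},
    {"id": 3, "cluster": 3, "weight": 104, "cholesterol": 118},
    {"id": 4, "cluster": 0, "weight": 181, "cholesterol": 214},
    {"id": 5, "cluster": 1, "weight": 129, "cholesterol": 156},
]


def members_of(rows, cluster):
    """Identifiers of the observations assigned to the given cluster."""
    return [row["id"] for row in rows if row["cluster"] == cluster]


high_risk_ids = members_of(clustered, 0)
print("cluster 0 members:", high_risk_ids)
```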