Data Mining
for the Masses
12) Bummer. No rules found. Did we do all that work for nothing? It seemed like we had
some hope for some associations back in step 9, so what happened? Remember from
Chapter 1 that the CRISP-DM process is cyclical in nature, and sometimes, you have to go
back and forth between steps before you will create a model that yields results. Such is the
case here. We have nothing to consider here, so perhaps we need to tweak some of our
model’s parameters. This may be a process of trial and error, which will take us back and
forth between our current CRISP-DM step of Modeling and…
EVALUATION
13) So we’ve evaluated our model’s first run. No rules found. Not much to evaluate there,
right? So let’s switch back to design perspective, and take a look at those parameters we
highlighted briefly in the previous steps. There are two main factors that dictate whether
or not frequency patterns get translated into association rules: Confidence percent and
Support percent. Confidence percent is a measure of how confident we are that when
one attribute is flagged as true, the associated attribute will also be flagged as true. In the
classic shopping basket analysis example, we could look at two items often associated with
one another: cookies and milk. If we examined ten shopping baskets and found that
cookies were purchased in four of them, and milk was purchased in seven, and that further,
in three of the four instances where cookies were purchased, milk was also in those
baskets, we would have a 75% confidence in the association rule: cookies → milk. This is
calculated by dividing the three instances where cookies and milk coincided by the four
instances where they could have coincided (3/4 = .75, or 75%). The rule cookies → milk
had a chance to occur four times, but it occurred only three times, so our confidence in this rule
is not absolute.
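The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the `baskets` list is a hypothetical data set built to match the counts in the example (cookies in four baskets, milk in seven, both in three), and the variable names are illustrative.

```python
# Ten hypothetical baskets built to match the counts in the text:
# cookies in 4 baskets, milk in 7, both together in 3.
baskets = (
    [{"cookies", "milk"}] * 3   # cookies and milk together
    + [{"cookies"}] * 1         # cookies without milk
    + [{"milk"}] * 4            # milk without cookies
    + [set()] * 2               # neither item
)

# Baskets where the rule cookies -> milk had a chance to occur.
with_cookies = [b for b in baskets if "cookies" in b]
# Baskets where the rule actually did occur.
with_both = [b for b in with_cookies if "milk" in b]

confidence = len(with_both) / len(with_cookies)
print(f"cookies -> milk confidence: {confidence:.0%}")  # prints 75%
```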
Now consider the reciprocal of the rule: milk → cookies. Milk was found in seven of our
ten hypothetical baskets, while cookies were found in four. We know that the coincidence,
or frequency of connection, between these two products is three. So our confidence in
milk → cookies falls to only 43% (3/7 = .429, or 43%). Milk had a chance to be found
with cookies seven times, but it was only found with them three times, so our confidence
in milk → cookies is a good bit lower than our confidence in cookies → milk. If a person
Chapter 5:
Association Rules
comes to the store with the intention of buying cookies, we are more confident that they
will also buy milk than if their intentions were reversed. This concept is referred to in
association rule mining as Premise → Conclusion. Premises are sometimes also referred
to as antecedents, while conclusions are sometimes referred to as consequents. For each
pairing, the confidence percentages will differ based on which attribute is the premise and
which the conclusion. When associations between three or more attributes are found, for
example, cookies, crackers → milk, the confidence percentages are calculated based on how
often the two premise attributes, found together, are also found with the third. This can
become complicated to do manually,
so it is nice to have RapidMiner to find these combinations and run the calculations for us!
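To make the Premise → Conclusion idea concrete, here is a rough sketch of a general confidence calculation that accepts a premise of any size. The `confidence` function and the basket contents are assumptions made for illustration; in particular, the crackers counts are invented here and do not come from the example above. This is not how RapidMiner is implemented internally, only the same arithmetic done by hand.

```python
def confidence(baskets, premise, conclusion):
    """Share of baskets containing the premise that also contain the conclusion."""
    premise, conclusion = set(premise), set(conclusion)
    premise_hits = [b for b in baskets if premise <= b]
    if not premise_hits:
        return 0.0  # the rule never had a chance to occur
    both_hits = [b for b in premise_hits if conclusion <= b]
    return len(both_hits) / len(premise_hits)

# Hypothetical baskets: cookies in 4, milk in 7, both in 3 (as in the text).
# The crackers placements are invented purely for this illustration.
baskets = (
    [{"cookies", "milk", "crackers"}] * 2
    + [{"cookies", "milk"}]
    + [{"cookies", "crackers"}]
    + [{"milk"}] * 4
    + [set()] * 2
)

# Direction matters: swapping premise and conclusion changes the result.
print(round(confidence(baskets, {"cookies"}, {"milk"}), 2))   # 0.75
print(round(confidence(baskets, {"milk"}, {"cookies"}), 2))   # 0.43

# A two-attribute premise: cookies, crackers -> milk.
print(round(confidence(baskets, {"cookies", "crackers"}, {"milk"}), 2))  # 0.67
```

With more attributes, the number of possible premise and conclusion combinations grows quickly, which is why a tool is helpful for enumerating them.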
The support percent is an easier measure to calculate. This is simply the number of times
that the rule did occur, divided by the number of observations in the data set. The number
of observations in the data set is the absolute number of times the association could have occurred,
since every customer could have purchased cookies and milk together in their shopping
basket. The fact is, they didn’t, and such a phenomenon would be highly unlikely in any
analysis. Possible, but unlikely. We know that in our hypothetical example, cookies and
milk were found together in three out of ten shopping baskets, so our support percentage
for this association is 30% (3/10 = .3, or 30%). There is no reciprocal for support
percentages since this metric is simply the number of times the association did occur over
the number of times it could have occurred in the data set.
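The support calculation can be sketched the same way. The `support` function and the `baskets` list below are hypothetical, built only to match the counts in the example. Because support counts baskets containing all of the items regardless of order, swapping the items gives the same answer.

```python
def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

# Ten hypothetical baskets: cookies in 4, milk in 7, both together in 3.
baskets = (
    [{"cookies", "milk"}] * 3
    + [{"cookies"}]
    + [{"milk"}] * 4
    + [set()] * 2
)

print(support(baskets, ["cookies", "milk"]))  # 0.3
print(support(baskets, ["milk", "cookies"]))  # 0.3, order does not matter
```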
So now that we understand these two pivotal parameters in association rule mining, let’s
make a parameter modification and see if we find any association rules in our data. You
should be in design perspective again, but if not, switch back now. Click on your Create
Association Rules operator and change the min confidence parameter to .5 (see Figure 5-10).
This indicates to RapidMiner that any association with at least 50% confidence should be
displayed as a rule. With this as the confidence percent threshold, if we were using the
hypothetical shopping baskets discussed in the previous paragraphs to explain confidence
and support, cookies → milk would return as a rule because its confidence percent was
75%, while milk → cookies would not, due to that association’s 43% confidence percent.
Let’s run our model again with the .5 confidence value and see what we get.
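The effect of a confidence threshold can be illustrated with a short sketch. The `candidate_rules` dictionary and the `min_confidence` variable are hypothetical names that simply reuse the two confidence values worked out for the shopping basket example; this shows the filtering idea only and is not RapidMiner's own code.

```python
# Confidence values from the hypothetical shopping basket example.
candidate_rules = {
    ("cookies", "milk"): 3 / 4,   # cookies -> milk, 75%
    ("milk", "cookies"): 3 / 7,   # milk -> cookies, about 43%
}

min_confidence = 0.5  # same value as the min confidence parameter in this step

# Keep only associations whose confidence meets the threshold.
rules = {
    pair: conf
    for pair, conf in candidate_rules.items()
    if conf >= min_confidence
}

for (premise, conclusion), conf in rules.items():
    print(f"{premise} -> {conclusion}: {conf:.0%}")  # cookies -> milk: 75%
```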