Data Mining for the Masses

12) Bummer.  No rules found.  Did we do all that work for nothing?  It seemed like we had
some hope for some associations back in step 9, so what happened?  Remember from
Chapter 1 that the CRISP-DM process is cyclical in nature, and sometimes, you have to go
back and forth between steps before you will create a model that yields results.  Such is the
case here.  We have nothing to consider here, so perhaps we need to tweak some of our
model’s parameters.  This may be a process of trial and error, which will take us back and
forth between our current CRISP-DM step of Modeling and…

EVALUATION

13) So we’ve evaluated our model’s first run.  No rules found.  Not much to evaluate there,
right?  So let’s switch back to design perspective, and take a look at those parameters we
highlighted briefly in the previous steps.  There are two main factors that dictate whether
or  not  frequency  patterns  get  translated  into  association  rules:  Confidence  percent  and
Support  percent.  Confidence percent is a measure of how confident we are that when
one attribute is flagged as true, the associated attribute will also be flagged as true.  In the
classic shopping basket analysis example, we could look at two items often associated with
one  another:  cookies  and  milk.    If  we  examined  ten  shopping  baskets  and  found  that
cookies were purchased in four of them, and milk was purchased in seven, and that further,
in  three  of  the  four  instances  where  cookies  were  purchased,  milk  was  also  in  those
baskets, we would have a 75% confidence in the association rule: cookies → milk.  This is
calculated  by  dividing  the  three  instances  where  cookies  and  milk  coincided  by  the  four
instances where they could have coincided (3/4 = .75, or 75%).  The rule cookies → milk
had a chance to occur four times, but it only occurred three, so our confidence in this rule
is not absolute.
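
The confidence arithmetic above can be sketched in a few lines of Python.  This is only an illustration, not how RapidMiner computes it internally: the basket list is made up to match the counts in the example (cookies in four baskets, milk in seven, both in three), and the function name `confidence` is a hypothetical helper.

```python
# Ten hypothetical baskets matching the counts in the text:
# cookies in 4, milk in 7, both together in 3.
baskets = (
    [{"cookies", "milk"}] * 3   # both items
    + [{"cookies"}] * 1         # cookies only
    + [{"milk"}] * 4            # milk only
    + [set()] * 2               # neither item
)

def confidence(baskets, premise, conclusion):
    """Share of baskets containing the premise that also contain the conclusion."""
    premise_count = sum(1 for b in baskets if premise <= b)
    both_count = sum(1 for b in baskets if (premise | conclusion) <= b)
    return both_count / premise_count

cookies_to_milk = confidence(baskets, {"cookies"}, {"milk"})
print(f"cookies -> milk: {cookies_to_milk:.0%}")  # 3 of 4 cookie baskets
```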

Now consider the reverse of the rule: milk → cookies.  Milk was found in seven of our
ten hypothetical baskets, while cookies were found in four.  We know that the coincidence,
or  frequency  of  connection  between  these  two  products  is  three.    So  our  confidence  in
milk → cookies falls to only 43% (3/7 = .429, or 43%).  Milk had a chance to be found
with cookies seven times, but it was only found with them three times, so our confidence
in milk → cookies is a good bit lower than our confidence in cookies → milk.  If a person
comes to the store with the intention of buying cookies, we are more confident that they
will  also  buy  milk  than  if  their  intentions  were  reversed.    This  concept  is  referred  to  in
association rule mining as Premise → Conclusion.  Premises are sometimes also referred
to as antecedents, while conclusions are sometimes referred to as consequents.  For each
pairing, the confidence percentages will differ based on which attribute is the premise and
which the conclusion.  When associations between three or more attributes are found, for
example, cookies, crackers → milk, the confidence percentage is the number of times all
three attributes are found together, divided by the number of times the two premise
attributes are found together.  This can become complicated to do manually,
so it is nice to have RapidMiner to find these combinations and run the calculations for us!
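
To see how the premise and conclusion change the result, here is a small Python sketch.  It is an illustration under stated assumptions, not RapidMiner's implementation: the cookies and milk counts match the hypothetical example, but the placement of crackers in the baskets is invented purely to demonstrate a two-attribute premise, and `confidence` is a hypothetical helper name.

```python
# Ten hypothetical baskets: cookies in 4, milk in 7, both in 3 (as in the text).
# Where crackers appear is made up for illustration only.
baskets = [
    {"cookies", "crackers", "milk"},
    {"cookies", "crackers", "milk"},
    {"cookies", "milk"},
    {"cookies", "crackers"},
    {"crackers", "milk"},
    {"milk"},
    {"milk"},
    {"milk"},
    {"crackers"},
    set(),
]

def confidence(baskets, premise, conclusion):
    """Baskets holding premise and conclusion, divided by baskets holding the premise."""
    premise_count = sum(1 for b in baskets if premise <= b)
    both_count = sum(1 for b in baskets if (premise | conclusion) <= b)
    return both_count / premise_count

cookies_to_milk = confidence(baskets, {"cookies"}, {"milk"})   # 3/4
milk_to_cookies = confidence(baskets, {"milk"}, {"cookies"})   # 3/7
pair_to_milk = confidence(baskets, {"cookies", "crackers"}, {"milk"})

print(f"cookies -> milk:           {cookies_to_milk:.1%}")
print(f"milk -> cookies:           {milk_to_cookies:.1%}")
print(f"cookies, crackers -> milk: {pair_to_milk:.1%}")
```

The same function handles both cases because a premise is just a set of attributes: the denominator counts baskets containing every premise attribute, whether there is one or several.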

The support percent is an easier measure to calculate.  This is simply the number of times
that the rule did occur, divided by the number of observations in the data set.  The number
of observations in the data set is the absolute number of times the association could have occurred,
since  every  customer  could  have  purchased  cookies  and  milk  together  in  their  shopping
basket.  The fact is, they didn’t, and such a phenomenon would be highly unlikely in any
analysis.  Possible, but unlikely.  We know that in our hypothetical example, cookies and
milk were found together in three out of ten shopping baskets, so our support percentage
for this association is 30% (3/10 = .3, or 30%).  Unlike confidence, support is the same
in both directions: milk → cookies also has 30% support, since this metric is simply the
number of times the association did occur over the number of times it could have occurred
in the data set.
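
As a rough Python sketch of the support calculation (again using a made-up basket list that matches the counts in the example, with `support` as a hypothetical helper name rather than anything from RapidMiner):

```python
# Ten hypothetical baskets: cookies in 4, milk in 7, both together in 3.
baskets = (
    [{"cookies", "milk"}] * 3   # both items
    + [{"cookies"}] * 1         # cookies only
    + [{"milk"}] * 4            # milk only
    + [set()] * 2               # neither item
)

def support(baskets, items):
    """Share of all baskets that contain every item in `items`."""
    return sum(1 for b in baskets if items <= b) / len(baskets)

# A set has no order, so the direction of the rule cannot matter here.
cookies_and_milk = support(baskets, {"cookies", "milk"})
milk_and_cookies = support(baskets, {"milk", "cookies"})

print(f"support: {cookies_and_milk:.0%}")
```

Because the denominator is always the full number of observations, the result does not depend on which attribute is treated as the premise.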

So now that we understand these two pivotal parameters in association rule mining, let’s
make a parameter modification and see if we find any association rules in our data.  You
should be in design perspective again, but if not, switch back now.  Click on your Create
Association Rules operator and change the min confidence parameter to .5 (see Figure 5-10).
This indicates to RapidMiner that any association with at least 50% confidence should be
displayed  as  a  rule.    With  this  as  the  confidence  percent  threshold,  if  we  were  using  the
hypothetical shopping baskets discussed in the previous paragraphs to explain confidence
and  support,  cookies  →  milk  would  return  as  a  rule  because  its  confidence  percent  was
75%, while milk → cookies would not, due to that association’s 43% confidence percent.
Let’s run our model again with the .5 confidence value and see what we get.
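
The effect of the threshold on the hypothetical shopping baskets can be sketched in Python as follows.  This is a simplified illustration, not the Create Association Rules operator itself: it only considers one-attribute premises and conclusions, the basket list is invented to match the example's counts, and `find_rules` and `min_confidence` are hypothetical names chosen to mirror the min confidence parameter.

```python
from itertools import permutations

# Ten hypothetical baskets: cookies in 4, milk in 7, both together in 3.
baskets = (
    [{"cookies", "milk"}] * 3   # both items
    + [{"cookies"}] * 1         # cookies only
    + [{"milk"}] * 4            # milk only
    + [set()] * 2               # neither item
)

def confidence(baskets, premise, conclusion):
    """Baskets holding premise and conclusion, divided by baskets holding the premise."""
    premise_count = sum(1 for b in baskets if premise <= b)
    both_count = sum(1 for b in baskets if (premise | conclusion) <= b)
    return both_count / premise_count

def find_rules(baskets, min_confidence):
    """Return (premise, conclusion, confidence) for every pair meeting the threshold."""
    items = sorted(set().union(*baskets))
    rules = []
    for premise, conclusion in permutations(items, 2):
        conf = confidence(baskets, {premise}, {conclusion})
        if conf >= min_confidence:
            rules.append((premise, conclusion, conf))
    return rules

rules = find_rules(baskets, min_confidence=0.5)
for premise, conclusion, conf in rules:
    print(f"{premise} -> {conclusion}: {conf:.0%}")
```

With the threshold at .5, only cookies → milk survives; lowering it to .4 would let milk → cookies through as well.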

Figure 5-10. Changing the confidence percent threshold.

Figure 5-11. Four rules found with the 50% confidence threshold.

14) Eureka!  We have found rules, and our hunch that Religious, Family and Hobby
organizations are related was correct (remember Figure 5-7).  Look at rule number four.  At
79.6% confidence, it just barely missed being considered a rule with an 80% confidence threshold.
Our other associations have lower confidence percentages, but are still quite good.  We can
see that for each of these four rules, more than 20% of the observations in our data set
support them.  Remember that support does not depend on which attribute is the premise,
so the support percents for rules 1 and 3 are the same, as they are for rules 2 and 4.
Because the premises and conclusions were reversed, however, their confidence percentages
did vary.  Had we set our confidence percent threshold at .55 (or 55%), rule 1 would drop
out of our results, so Family → Religious would be a rule but Religious → Family would not.
The other calculations to
the right (LaPlace…Conviction) are additional arithmetic indicators of the strength of the
rules’ relationships.  As you compare these values to support and confidence percents, you
will see that they track fairly consistently with one another.
