Chapter 13: Evaluation and Deployment
11) Re-run the model. You do not need to switch back to the main process to re-run the model; you can stay in the sub-process view and run it from there. When you return from the Results perspective to the Design perspective, you will see whichever design window you were last in. When you re-run the model, you will see a new performance matrix showing the model's predictive power with Gini as the underlying algorithm.
Figure 13-10. New cross-validation performance results based on the gini_index decision tree model.
We see in Figure 13-10 that our model's ability to predict is significantly improved when we use Gini for our decision tree model. This should not come as a great surprise. We knew from Chapter 10 that our tree's detail was much more granular under Gini, and greater detail in the predictive tree should result in a more reliably predictive model. Adding more and better observations to the training data set would likely raise the model's reliability even further.
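The book carries out this comparison in RapidMiner's graphical interface, but the same experiment can be approximated in code. The sketch below, using Python and scikit-learn (an assumption; not part of the book's toolset), cross-validates a decision tree under two splitting criteria on a stand-in data set. Note that scikit-learn offers "gini" and "entropy" (information gain); it has no gain_ratio criterion, so "entropy" is used here as a rough stand-in for RapidMiner's gain_ratio.

```python
# Hedged sketch: approximates the RapidMiner cross-validation comparison
# in scikit-learn. The breast-cancer data set is a stand-in for the
# chapter's data; "entropy" stands in for RapidMiner's gain_ratio.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in training data set

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # 10-fold cross-validation, mirroring RapidMiner's default X-Validation
    scores = cross_val_score(tree, X, y, cv=10)
    print(f"{criterion}: mean accuracy = {scores.mean():.3f}")
```

Whether Gini or entropy performs better depends on the data; the point of the sketch is that swapping the underlying algorithm is a one-parameter change whose effect you measure through cross-validation, just as in the chapter.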
CHAPTER SUMMARY: THE VALUE OF EXPERIENCE
So now we have seen one way to statistically evaluate a model's reliability. You have seen that there are a number of cross-validation and performance operators you can use to check how well a model trained on your data set performs. But the bottom line is that there is no substitute for experience and expertise. Use subject matter experts to review your data mining results. Ask them to give you feedback on your model's output. Run pilot tests and use focus groups to try out your model's predictions before rolling them out organization-wide. Do not be offended if someone questions or challenges the reliability of your model's results—be humble enough to take their
questions as an opportunity to validate and strengthen your model. Remember that 'pride goeth before a fall'! Data mining is a process. If you present your data mining results and
recommendations as infallible, you are not participating in the cyclical nature of CRISP-DM, and
you’ll likely end up looking foolish sooner or later. CRISP-DM is such a good process precisely
because of its ability to help us investigate data, learn from our investigation, and then do it again
from a more informed position. Evaluation and Deployment are the two steps in the process
where we establish that more informed position.
REVIEW QUESTIONS
1) What is cross-validation and why should you do it?

2) What is a false positive and why might one be generated?

3) Why would false positives not negate all value for a data mining model?

4) How does a model's overall performance percentage relate to the target attribute's (label's) individual performance percentages?

5) How can changing a data mining methodology's underlying algorithm affect a model's cross-validation performance percentages?
EXERCISE
For this chapter’s exercise, you will create a cross-validation model for your Chapter 10 exercise
training data set. Complete the following steps.
1) Open RapidMiner to a new, blank process and add the training data set you created for your Chapter 10 exercise (the Titanic survival data set).

2) Set roles as necessary.

3) Apply a cross-validation operator to the data set.

4) Configure your sub-process using gain_ratio for the Decision Tree operator's algorithm. Apply the model and run it through a Performance (Classification) operator.
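For readers who want a code-level analogue of these steps, here is a hedged sketch in Python with scikit-learn (an assumption; the exercise itself is meant to be done in RapidMiner). Because your Chapter 10 Titanic training set is your own file, a small synthetic stand-in data set is generated here instead, and scikit-learn's "entropy" criterion stands in for gain_ratio, which it does not implement directly.

```python
# Hedged sketch of the exercise flow. The synthetic data below is a
# placeholder for the reader's own Titanic training set; attribute names
# and encodings are illustrative assumptions only.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.integers(1, 4, n),   # passenger class (stand-in attribute)
    rng.integers(0, 2, n),   # sex, encoded 0/1 (stand-in attribute)
    rng.uniform(1, 80, n),   # age (stand-in attribute)
])
y = rng.integers(0, 2, n)    # survival label (the "role: label" step)

# "entropy" used as a rough substitute for RapidMiner's gain_ratio
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)

# cross_val_predict plays the part of the cross-validation operator;
# the confusion matrix and report play Performance (Classification).
pred = cross_val_predict(tree, X, y, cv=10)
print(confusion_matrix(y, pred))
print(classification_report(y, pred))
```

The confusion matrix printed at the end corresponds to the performance matrix RapidMiner displays in the Results perspective; with the random stand-in labels above, expect performance near chance.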
CHAPTER FOURTEEN:
DATA MINING ETHICS
WHY DATA MINING ETHICS?
It has been said that when you teach someone something, you should save what you most want them to remember for the very end. It will be the last thing they hear from you, the thing they take with them as they depart from your instruction. It is in harmony with this philosophy that the chapter on data mining ethics has been left to the end of this book. Please don't misconstrue this chapter's placement as an afterthought; it is here at the end so that you will take it with you and remember it. And since the last thing you share with your audience tends to be what they remember, especially if you make a big deal out of it, here is our effort at making a big deal about data mining ethics:
Figure 14-1. This just in: DATA MINING ETHICS!! Being an ethical data miner is important.