Chapter 13: Evaluation and Deployment
11) Re-run the model. You do not need to switch back to the main process to re-run the model; you can stay in the sub-process view and run it from there. When you return from the Results perspective to the Design perspective, you will see whichever design window you were last in. When you re-run the model, you will see a new performance matrix showing the model's predictive power with Gini as the underlying algorithm.
Figure 13-10. New cross-validation performance results based on the gini_index decision tree model.
We see in Figure 13-10 that our model's ability to predict is significantly improved when we use Gini for our decision tree model. This should not come as a great surprise. We knew from Chapter 10 that our tree's detail was much more granular under Gini, and greater detail in the predictive tree should result in a more reliably predictive model. Adding more and better observations to the training data set would likely raise the model's reliability even further.
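The book carries out this comparison in RapidMiner's graphical interface, but the same experiment can be approximated in code. The sketch below, using Python and scikit-learn (an assumption; not part of the book's toolset), cross-validates a decision tree under two splitting criteria on a stand-in data set. Note that scikit-learn offers "gini" and "entropy" (information gain); it has no gain_ratio criterion, so "entropy" is used here as a rough stand-in for RapidMiner's gain_ratio.

```python
# Hedged sketch: approximates the RapidMiner cross-validation comparison
# in scikit-learn. The breast-cancer data set is a stand-in for the
# chapter's data; "entropy" stands in for RapidMiner's gain_ratio.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in training data set

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # 10-fold cross-validation, mirroring RapidMiner's default X-Validation
    scores = cross_val_score(tree, X, y, cv=10)
    print(f"{criterion}: mean accuracy = {scores.mean():.3f}")
```

Whether Gini or entropy performs better depends on the data; the point of the sketch is that swapping the underlying algorithm is a one-parameter change whose effect you measure through cross-validation, just as in the chapter.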
CHAPTER SUMMARY: THE VALUE OF EXPERIENCE
So now we have seen one way to statistically evaluate a model's reliability. You have seen that there are a number of cross-validation and performance operators you can use to check how well a model trained on your data set performs. But the bottom line is that there is no substitute for experience and expertise. Use subject matter experts to review your data mining results. Ask them to give you feedback on your model's output. Run pilot tests and use focus groups to try out your model's predictions before rolling them out organization-wide. Do not be offended if someone questions or challenges the reliability of your model's results—be humble enough to take their
questions as an opportunity to validate and strengthen your model. Remember that 'pride goeth before a fall'! Data mining is a process. If you present your data mining results and
recommendations as infallible, you are not participating in the cyclical nature of CRISP-DM, and
you’ll likely end up looking foolish sooner or later. CRISP-DM is such a good process precisely
because of its ability to help us investigate data, learn from our investigation, and then do it again
from a more informed position. Evaluation and Deployment are the two steps in the process
where we establish that more informed position.
REVIEW QUESTIONS
1) What is cross-validation and why should you do it?

2) What is a false positive and why might one be generated?

3) Why would false positives not negate all value for a data mining model?

4) How does a model's overall performance percentage relate to the target attribute's (label's) individual performance percentages?

5) How can changing a data mining methodology's underlying algorithm affect a model's cross-validation performance percentages?
EXERCISE
For this chapter’s exercise, you will create a cross-validation model for your Chapter 10 exercise
training data set. Complete the following steps.
1) Open RapidMiner to a new, blank process and add the training data set you created for your Chapter 10 exercise (the Titanic survival data set).

2) Set roles as necessary.

3) Apply a cross-validation operator to the data set.

4) Configure your sub-process using gain_ratio for the Decision Tree operator's algorithm. Apply the model and run it through a Performance (Classification) operator.
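For readers who want a code-level analogue of these steps, here is a hedged sketch in Python with scikit-learn (an assumption; the exercise itself is meant to be done in RapidMiner). Because your Chapter 10 Titanic training set is your own file, a small synthetic stand-in data set is generated here instead, and scikit-learn's "entropy" criterion stands in for gain_ratio, which it does not implement directly.

```python
# Hedged sketch of the exercise flow. The synthetic data below is a
# placeholder for the reader's own Titanic training set; attribute names
# and encodings are illustrative assumptions only.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.integers(1, 4, n),   # passenger class (stand-in attribute)
    rng.integers(0, 2, n),   # sex, encoded 0/1 (stand-in attribute)
    rng.uniform(1, 80, n),   # age (stand-in attribute)
])
y = rng.integers(0, 2, n)    # survival label (the "role: label" step)

# "entropy" used as a rough substitute for RapidMiner's gain_ratio
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)

# cross_val_predict plays the part of the cross-validation operator;
# the confusion matrix and report play Performance (Classification).
pred = cross_val_predict(tree, X, y, cv=10)
print(confusion_matrix(y, pred))
print(classification_report(y, pred))
```

The confusion matrix printed at the end corresponds to the performance matrix RapidMiner displays in the Results perspective; with the random stand-in labels above, expect performance near chance.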
CHAPTER FOURTEEN:
DATA MINING ETHICS
WHY DATA MINING ETHICS?
It has been said that when you teach someone something, you should save what you most want them to remember for the very end. It will be the last thing they hear from you, the thing they take with them as they depart from your instruction. It is in harmony with this philosophy that the chapter on data mining ethics has been left to the end of this book. Please don't misconstrue this chapter's placement as an afterthought; it is here at the end so that you will take it with you and remember it. And since the last thing you share with your audience tends to be what they remember, especially if you make a big deal out of it, here is our effort at making a big deal about data mining ethics:
Figure 14-1. This just in: DATA MINING ETHICS!! Being an ethical data miner is important.