Chapter 3:
Data Preparation
53
Figure 3-32. Selecting a subset of a data set’s attributes.
2) In the Parameters pane, set the attribute filter type to ‘subset’, then click the Select
Attributes button; a window similar to Figure 3-33 will appear.
Figure 3-33. The attribute subset selection window.
Data Mining
for the Masses
54
3) Using the green right and left arrows, you can select which attributes you would like to
keep. Suppose we were going to study the demographics of Internet users. In this
instance, we might select Birth_Year, Gender, Marital_Status, Race, and perhaps
Years_on_Internet, and move them to the right under Selected Attributes using the right
green arrow. You can select more than one attribute at a time by holding down your
control or shift keys (on a Windows computer) while clicking on the attributes you want to
select or deselect. We could then click OK, and these would be the only attributes we
would see in results perspective when we run our model. All subsequent downstream data
mining operations added to our model will act only upon this subset of our attributes.
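For readers who would like to see the same idea outside of the RapidMiner interface, the following is a minimal Python sketch of attribute subset selection using only the standard library. The attribute names come from the demographics example above. The sample rows, the Other_Attribute column, and the helper name select_attributes are made up for illustration and are not part of the book's data set or of RapidMiner itself.

```python
import csv
import io

# Attribute names taken from the demographics example in the text.
SELECTED = ["Birth_Year", "Gender", "Marital_Status", "Race", "Years_on_Internet"]


def select_attributes(rows, selected):
    """Keep only the named attributes in each observation (hypothetical helper)."""
    return [{name: row[name] for name in selected} for row in rows]


# Made-up sample data; "Other_Attribute" stands in for any column we do not need.
SAMPLE_CSV = """Birth_Year,Gender,Marital_Status,Race,Years_on_Internet,Other_Attribute
1972,M,M,White,8,x
1981,F,S,Hispanic,14,y
"""

# Read every observation as a dictionary keyed by attribute name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Everything downstream sees only the selected attributes.
subset = select_attributes(rows, SELECTED)

for observation in subset:
    print(observation)
```

As in the RapidMiner process, anything that works with subset afterward has access only to the five selected attributes; the original rows are left unchanged.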
CHAPTER SUMMARY
This chapter has introduced you to a number of concepts related to data preparation. Recall that
Data Preparation is the third step in the CRISP-DM process. Once you have established
Organizational Understanding as it relates to your data mining plans, and developed Data
Understanding in terms of what data you need, what data you have, where it is located, and so
forth, you can begin to prepare your data for mining. This has been the focus of this chapter.
The chapter used a small and very simple data set to help you learn to set up the RapidMiner data
mining environment. You have learned about viewing data sets in OpenOffice Base, and learned
some ways that data sets in relational databases can be collated. You have also learned about
comma-separated values (CSV) files.
We then stepped through adding CSV files to a RapidMiner data repository in order to
handle missing data, reduce data through observation filtering, handle inconsistencies in data, and
reduce the number of attributes in a model. All of these methods will be used in future chapters to
prepare data for modeling.
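The four preparation steps named above can also be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the observations, the attribute names (id, uses_email, hours_online, notes), the inconsistent value "99", and all four helper functions are hypothetical, and they are not RapidMiner operators or the book's data.

```python
# Made-up observations; None marks a missing value and "99" an inconsistent one.
observations = [
    {"id": 1, "uses_email": "Y", "hours_online": 5, "notes": "a"},
    {"id": 2, "uses_email": None, "hours_online": 3, "notes": "b"},
    {"id": 3, "uses_email": "99", "hours_online": 12, "notes": "c"},
    {"id": 4, "uses_email": "N", "hours_online": None, "notes": "d"},
]


def replace_missing(rows, attribute, default):
    """Handle missing data by substituting a default value."""
    return [
        {**r, attribute: default if r[attribute] is None else r[attribute]}
        for r in rows
    ]


def filter_observations(rows, keep):
    """Reduce data by keeping only observations that satisfy a condition."""
    return [r for r in rows if keep(r)]


def fix_inconsistent(rows, attribute, bad, good):
    """Handle inconsistent data by replacing one value with another."""
    return [{**r, attribute: good if r[attribute] == bad else r[attribute]} for r in rows]


def reduce_attributes(rows, selected):
    """Reduce the number of attributes by keeping only the selected ones."""
    return [{name: r[name] for name in selected} for r in rows]


step1 = replace_missing(observations, "uses_email", "N")
step2 = filter_observations(step1, lambda r: r["hours_online"] is not None)
step3 = fix_inconsistent(step2, "uses_email", "99", "N")
prepared = reduce_attributes(step3, ["id", "uses_email", "hours_online"])

for row in prepared:
    print(row)
```

Each helper returns a new list rather than changing its input, so the original observations stay available if a later step needs to be revisited.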
Data mining is most successful when conducted upon a foundation of well-prepared data. Recall
the quotation from Chapter 1, from Alice’s Adventures in Wonderland: which way you go does not
matter very much if you don’t know, or don’t care, where you are going. Likewise, the value of
where you arrive when you complete a data mining exercise will largely depend upon how well you
prepared to get there. Sometimes we hear the phrase “It’s better than nothing”. Well, in data
mining, results gleaned from poorly prepared data might be “Worse than nothing”, because they
may be misleading. Decisions based upon them could lead an organization down a detrimental
and costly path. Learn to value the process of data preparation, and you will learn to be a better
data miner.
REVIEW QUESTIONS
1) What are the four main processes of data preparation discussed in this chapter? What do
they accomplish and why are they important?
2) What are some ways to collate data from a relational database?
3) For what kinds of problems might a data set need to be scrubbed?
4) Why is it often better to perform reductions using operators rather than excluding
attributes or observations as data are imported?
5) What is a data repository in RapidMiner and how is one created?
6) How might inconsistent data cause later trouble in data mining activities?
EXERCISE
1) Locate a data set of any number of attributes and observations. You may have access to
data sets through personal data collection or through your employment, although if you
use an employer’s data, make sure to do so only by permission! You can also search the
Internet for data set libraries. A simple search on the term ‘data sets’ in your favorite
search engine will yield a number of web sites that offer libraries of data sets that you can
use for academic and learning purposes. Download a data set that looks interesting to you
and complete the following:
2) Format the data set into a CSV file. It may come in this format, or you may need to open
the data in OpenOffice Calc or some similar software, and then use the File > Save As
feature to save your data as a CSV file.