Chapter 3: Data Preparation
HANDS ON EXERCISE
Starting now, and throughout the next chapters of this book, there will be opportunities for you to
put your hands on your computer and follow along. In order to do this, you will need to be sure
to install OpenOffice and RapidMiner, as was discussed in the section A Note about Tools in
Chapter 1. You will also need to have an Internet connection to access this book’s companion
web site, where copies of all data sets used in the chapter exercises are available. The companion
web site is located at:
https://sites.google.com/site/dataminingforthemasses/
Figure 3-4: Data Mining for the Masses companion web site.
You can download the Chapter 3 data set, which is an export of the view created in OpenOffice
Base, from the web site by locating it in the list of files and then clicking the down arrow to the far
right of the file name, as indicated by the black arrows in Figure 3-4. You may want to consider
creating a folder labeled ‘data mining’ or something similar where you can keep copies of your
data—more files will be required and created as we continue through the rest of the book,
especially when we get into building data mining models in RapidMiner. Having a central place to
keep everything together will simplify things, and upon your first launch of the RapidMiner
software, you’ll be prompted to create a repository, so it’s a good idea to have a space ready. Once
you’ve downloaded the Chapter 3 data set, you’re ready to begin learning how to handle and
prepare data for mining in RapidMiner.
PREPARING RAPIDMINER, IMPORTING DATA, AND HANDLING MISSING DATA
Our first task in data preparation is to handle missing data. However, because this will be our first
time using RapidMiner, the first few steps will involve getting RapidMiner set up. We’ll then move
straight into handling missing data.
Missing data are data that do not exist in a data set. As you
can see in Figure 3-5, missing data are not the same as zero or some other value. They are blank, and
their values are unknown. In the database world, missing data are also sometimes known as null.
Depending on your objective in data mining, you may choose to leave missing data as they are, or
you may wish to replace missing data with some other value.
Figure 3-5: Some missing data within the survey data set.
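The difference between a missing value and a zero can be sketched in a few lines of Python. This is only an illustration, separate from the RapidMiner steps in this chapter: the attribute name Hours_Online and the three sample rows are invented for the example and are not part of the survey data set. Blank fields are read as None (unknown), while a recorded 0 stays a known value.

```python
import csv
import io

# Hypothetical sample data: the attribute name and values are invented for illustration.
SAMPLE_CSV = """Respondent,Hours_Online
1,0
2,
3,4
"""


def read_rows(text):
    """Parse CSV text, turning blank fields into None (unknown) instead of ''."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: (value if value != "" else None) for name, value in row.items()}
        for row in reader
    ]


rows = read_rows(SAMPLE_CSV)
hours = [row["Hours_Online"] for row in rows]

# A recorded zero is a known value; a blank is an unknown one.
known = [int(h) for h in hours if h is not None]

# Skipping the missing value and silently treating it as zero give different averages.
mean_ignoring_missing = sum(known) / len(known)
mean_treating_missing_as_zero = sum(known) / len(hours)

print(hours)
print(mean_ignoring_missing)
print(mean_treating_missing_as_zero)
```

The two averages differ, which is why a blank should not be quietly read as a zero: doing so asserts something about the respondent that the data do not actually say.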
The creation of views is one way that data from a relational database can be collated and organized
in preparation for data mining activities. In this example, our database view has missing data in a
number of its attributes. Black arrows indicate a couple of these attributes in Figure 3-5 above. In
some instances, missing data are not a problem; they are expected. For example, in the Other
Social Network attribute, it is entirely possible that the survey respondent did not indicate that they
use social networking sites other than the ones listed in the survey. Thus, the missing data here are
probably accurate and acceptable. On the other hand, in the Online Gaming attribute, there are
answers of either ‘Y’ or ‘N’, indicating that the respondent either does or does not participate in
online gaming. But what do the missing, or null, values in this attribute indicate? It is unknown to
us. For the purposes of data mining, there are a number of options available for handling missing
data.
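Two of these options, leaving missing data as they are or replacing them with some other value, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a description of how RapidMiner works internally: the three sample rows, the value 'LinkedIn', the replacement marker 'Unknown', and the helper names count_missing and replace_missing are all hypothetical. Only the attribute names come from the survey example above.

```python
# Hypothetical survey rows; None stands for a missing (null) value.
rows = [
    {"Online Gaming": "Y", "Other Social Network": None},
    {"Online Gaming": None, "Other Social Network": "LinkedIn"},
    {"Online Gaming": "N", "Other Social Network": None},
]


def count_missing(rows):
    """Return the number of missing values found in each attribute."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (1 if value is None else 0)
    return counts


def replace_missing(rows, attribute, replacement):
    """Return copies of the rows with missing values in one attribute replaced."""
    return [
        {**row, attribute: replacement} if row[attribute] is None else dict(row)
        for row in rows
    ]


# First, find out how much is missing and where.
missing_counts = count_missing(rows)
print(missing_counts)

# Option 1: leave the missing values as they are (no change to the rows).
# Option 2: replace them with some other value, here an explicit marker.
filled = replace_missing(rows, "Online Gaming", "Unknown")
print([row["Online Gaming"] for row in filled])
```

The replacement is applied to copies, so the original rows keep their missing values and both options remain available. Whether a marker such as 'Unknown' is appropriate depends on your data mining objective.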
To learn about handling missing data in RapidMiner, follow the steps below to connect to your
data set and begin modifying it: