Data Mining
for the Masses
Figure 3-18. Setting data types, roles and import attributes.
14) The final step is to choose a repository to store the data set in, and to give the data set a
name within RapidMiner. In Figure 3-19, we have chosen to store the data set in the
RapidMiner Book repository, and given it the name Chapter3. Once we click Finish, this
data set will become available to us for any type of data mining process we would like to
build upon it.
Figure 3-19. Selecting the repository and setting a data set name for our imported CSV file.
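For readers who also work in code, the import steps can be loosely sketched in Python using only the standard library. This is an analogy and not how RapidMiner works internally: the CSV content, the Age and Years_on_Internet columns, the COLUMN_TYPES mapping and the dictionary standing in for a repository are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical CSV content; the real Chapter 3 file has its own columns and rows.
RAW_CSV = """Age,Online_Gaming,Years_on_Internet
34,Y,12
27,,5
45,N,20
"""

# Assumed column types, mirroring the "set data types" step of an import wizard.
COLUMN_TYPES = {"Age": int, "Online_Gaming": str, "Years_on_Internet": int}


def import_csv(text, column_types):
    """Parse CSV text and convert each column to its declared type."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({name: column_types[name](value) for name, value in record.items()})
    return rows


# A plain dict stands in for a repository: data sets are stored under a name.
repository = {}
repository["Chapter3"] = import_csv(RAW_CSV, COLUMN_TYPES)

print(len(repository["Chapter3"]), "rows stored as 'Chapter3'")
```

Once stored under a name, the data set can be looked up by that name from any later step, which is the role the repository plays in the walkthrough above.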
Chapter 3:
Data Preparation
15) We can now see that the data set is available for use in RapidMiner. To begin using it in a
RapidMiner data mining process, simply drag the data set and drop it in the Main Process
window, as has been done in Figure 3-20.
Figure 3-20. Adding a data set to a process in RapidMiner.
16) Each rectangle in a RapidMiner process is an operator. The Retrieve operator simply
gets a data set and makes it available for use. The small half-circles on the sides of the
operator, and of the Main Process window, are called ports. In Figure 3-20, an output (out)
port from our data set’s Retrieve operator is connected to a result set (res) port via a spline.
The splines, combined with the operators connected by them, constitute a data mining
stream. To run a data mining stream and see the results, click the blue, triangular Play
button in the toolbar at the top of the RapidMiner window. This will change your view
from Design Perspective, which is the view pictured in Figure 3-20 where you can
change your data mining stream, to Results Perspective, which shows your stream’s
results, as pictured in Figure 3-21. When you click the Play button, you may be prompted to
save your process, and you are encouraged to do so. RapidMiner may also ask whether you
wish to overwrite a saved process each time it is run; you can select your preference on
this prompt as well.
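The idea of operators joined into a stream can be illustrated with a minimal Python sketch. It is a deliberately simplified model, not RapidMiner's implementation: the Operator class, the run_stream function and the sample data are hypothetical, and each operator here has just one input and one output, whereas real operators may have several ports.

```python
class Operator:
    """A processing step with a single input and a single output."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def run(self, data):
        return self.func(data)


def run_stream(operators):
    """Pass the output of each operator to the input of the next one."""
    data = None
    for op in operators:
        data = op.run(data)
    return data  # what would arrive at the result (res) port


# Hypothetical stored data set. A Retrieve-like operator ignores its input
# and simply makes a copy of the stored data available.
stored = [{"Online_Gaming": "Y"}, {"Online_Gaming": None}]
retrieve = Operator("Retrieve", lambda _: list(stored))

results = run_stream([retrieve])
print(results)
```

Running the stream plays the part of clicking Play: the list passed to run_stream is the Design view of the process, and the returned value is what the Results view would display.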
implications about the persons represented in each observation of our data set would be
unfair and misrepresentative. Thus, we will change the missing values in the current
example to illustrate how to handle missing values in RapidMiner, recognizing that what we
are about to do won’t always be the right way to handle missing data. To have
RapidMiner change the missing values to ‘N’ for the three observations in our
Online_Gaming variable, click the Design Perspective icon.
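The replacement itself can be sketched in a few lines of Python. This is an illustrative equivalent and not RapidMiner's own code: the replace_missing function and the sample rows are hypothetical, and treating both None and the empty string as missing is an assumption.

```python
def replace_missing(rows, column, replacement):
    """Return copies of the rows with missing values in `column` replaced."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the original observations stay untouched
        if new_row.get(column) in (None, ""):  # assumed markers for "missing"
            new_row[column] = replacement
        cleaned.append(new_row)
    return cleaned


# Hypothetical observations; the real data set has more rows and attributes.
rows = [
    {"Online_Gaming": "Y"},
    {"Online_Gaming": ""},
    {"Online_Gaming": None},
    {"Online_Gaming": "N"},
]

cleaned = replace_missing(rows, "Online_Gaming", "N")
print([row["Online_Gaming"] for row in cleaned])
```

Working on copies keeps the original values available, which matters here because assigning ‘N’ to a missing answer is a judgment call and may need to be revisited.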
Figure 3-22. Finding an operator to handle missing values.
18) To find a tool in the Operators area, you can navigate through the folder tree in
the lower left-hand corner. RapidMiner offers many tools, and finding the one you want
can sometimes be tricky. There is a handy search box, indicated by the black arrow in Figure
3-22, that allows you to type in keywords to find tools that might do what you need. Type
the word ‘missing’ into this box, and you will see that RapidMiner automatically searches
for tools with this word in their name. We want to replace missing values, and we can see
that within the Data Transformation tool area, inside a sub-area called Value Modification,
there is an operator called Replace Missing Values. Let’s add this operator to our stream.
Click and hold on the operator name, and drag it up to your spline. When you point your
mouse cursor at the spline, the spline will turn slightly bold, indicating that when you let
go of your mouse button, the operator will be connected into the stream. If you let go and
the Replace Missing Values operator fails to connect into your stream, you can reconfigure