Chapter 3: Data Preparation
Figure 3-9. Installing updates and adding the Text Mining module.
5) Once the updates and installations are complete, RapidMiner will open and your window
should look like Figure 3-10:
Figure 3-10. The RapidMiner start screen.
Data Mining for the Masses
6) Next, we need to start a new data mining project in RapidMiner. To do this, click
on the ‘New’ icon as indicated by the black arrow in Figure 3-10. The resulting window
should look like Figure 3-11.
Figure 3-11. Getting started with a new project in RapidMiner.
7) Within RapidMiner there are two main areas that hold useful tools: Repositories and
Operators. These are accessed by the tabs indicated by the black arrow in Figure 3-11.
The Repositories area is the place where you will connect to each data set you wish to
mine. The Operators area is where all data mining tools are located. These are used to
build models and otherwise manipulate data sets. Click on Repositories. You will find that
the initial repository we created upon our first launch of the RapidMiner software is
present in the list.
Figure 3-12. Adding a data set to a repository in RapidMiner.
8) Because the focus of this book is to introduce data mining to the broadest possible
audience, we will not use all of the tools available in RapidMiner. At this point, we could
do a number of complicated and technical things, such as connecting to a remote
enterprise database. This, however, would likely be overwhelming and inaccessible to many
readers. For the purposes of this text, we will therefore connect only to comma-separated
values (CSV) files. You should know that most data mining projects
incorporate extremely large data sets encompassing dozens of attributes and thousands or
even millions of observations. We will use smaller data sets in this text, but the
foundational concepts illustrated are the same for large or small data. The Chapter 3 data
set downloaded from the companion web site is very small, consisting of only 15 attributes
and 11 observations. Our next step is to connect to this data set. Click on the Import
icon, which is the second icon from the left in the Repositories area, as indicated by the
black arrow in Figure 3-12.
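To make the terms concrete: in a CSV file, each column is an attribute and each row after the header is an observation. The sketch below is written in Python rather than RapidMiner and counts both for a tiny made-up data set. The column names and values are hypothetical and are not the Chapter 3 data.

```python
import csv
import io

# Hypothetical three-attribute, two-observation data set (not the Chapter 3 file).
SAMPLE_CSV = "age,gender,online_hours\n34,F,12\n51,M,3\n"


def count_attributes_and_observations(csv_text):
    """Return (number of attributes, number of observations) for CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # The first row holds the attribute names; every later row is one observation.
    header, observations = rows[0], rows[1:]
    return len(header), len(observations)


n_attributes, n_observations = count_attributes_and_observations(SAMPLE_CSV)
print(f"{n_attributes} attributes, {n_observations} observations")
```

The same counting applies unchanged to a file with millions of rows; only the time and memory needed to read it grow.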
Figure 3-13. Importing a CSV file.
9) You will see by the black arrow in Figure 3-13 that you can import from a number of
different data sources. Note that by importing, you are bringing your data into a
RapidMiner file, rather than working with data that are already stored elsewhere. If your
data set is extremely large, it may take some time to import the data, and you should be
mindful of disk space that is available to you. As data sets grow, you may be better off
using the first (leftmost) icon to set up a remote repository in order to work with data
already stored in other areas. As previously explained, all examples in this text will be
conducted by importing CSV files that are small enough to work with quickly and easily.
Click on the Import CSV File option.
Figure 3-14. Locating the data set to import.
10) When the data import wizard opens, navigate to the folder where your data set is stored
and select the file. In this example, only one file is visible: the Chapter 3 data set
downloaded from the companion web site. Click Next.
Figure 3-15. Configuring attribute separation.
11) By default, RapidMiner looks for semicolons as attribute separators in our data. We must
change the column separation delimiter to Comma in order to see each attribute separated
correctly. Note: If your data naturally contain commas, take care when collecting or
collating the data to use a delimiter that does not occur in the data itself. A semicolon or a
pipe (|) symbol can often help you avoid unintended column separation.
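The effect of a delimiter that also occurs inside the data can be seen in a few lines of Python. This is only an illustrative sketch using Python's standard csv module, not anything RapidMiner runs internally, and the sample rows are invented for the example.

```python
import csv
import io


def split_rows(text, delimiter):
    """Split delimited text into rows of fields using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical data: the city value itself contains a comma.
comma_rows = split_rows("name,city\nAda,London, UK\n", ",")
pipe_rows = split_rows("name|city\nAda|London, UK\n", "|")

# With a comma delimiter, the data row no longer matches the two-column header.
print(len(comma_rows[0]), len(comma_rows[1]))
# With a pipe delimiter, the comma stays inside the city value.
print(pipe_rows[1])
```

In the comma-delimited version the single city value is broken into two fields, which is the unintended column separation described above. The pipe-delimited version keeps it whole.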
Figure 3-16. A preview of attributes separated into columns with the Comma option selected.
12) Once the preview shows columns for each attribute, click Next. Note that RapidMiner has
treated our attribute names as if they are our first row of data, or in other words, our first
observation. To fix this, click the Annotation dropdown box next to this row and set it to
Name, as indicated in Figure 3-17. With the attribute names designated correctly, click
Next.
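The difference between treating the first row as data and designating it as attribute names can be sketched in Python. The data below are hypothetical, and the code uses Python's csv module as a stand-in for the wizard's behavior rather than reproducing it.

```python
import csv
import io

# Hypothetical file contents: one row of names followed by two observations.
RAW = "age,height\n34,170\n51,165\n"

# Reading every row as data: the names are miscounted as an observation.
all_rows = list(csv.reader(io.StringIO(RAW)))

# Designating the first row as names, similar in spirit to choosing Name
# in the Annotation dropdown: each remaining row becomes one observation.
records = list(csv.DictReader(io.StringIO(RAW)))

print(len(all_rows))   # rows when the names are treated as data
print(len(records))    # observations once the names are designated
print(records[0]["age"])
```

Once the names are designated, each value can be looked up by its attribute name instead of by column position.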
Figure 3-17. Setting the attribute names.
13) In step 4 of the data import wizard, RapidMiner will take its best guess at a data type for
each attribute. The data type is the kind of data an attribute holds, such as numeric, text or
date. These can be changed in this screen, but for our purposes in Chapter 3, we will
accept the defaults. Just below each attribute’s data type, RapidMiner also indicates a Role
for each attribute to play. By default, all columns are imported with the role of
‘attribute’; however, we can change a role here if we know that an attribute is going to play
a specific role in a data mining model that we will create. Since roles can also be set within
RapidMiner’s main process window when building data mining models, we will accept the
default of ‘attribute’ whenever we import data sets in exercises in this text. Also, note
that the check boxes above each attribute in this window allow you to skip attributes you
do not want to import; simply clear the check box for any such attribute. Again,
attributes can be excluded from models later, so for the purposes of this
text, we will always include all attributes when importing data. All of these functions are
indicated by the black arrows in Figure 3-18. Go ahead and accept these defaults as they
stand and click Next.
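As a rough illustration of what guessing a data type can involve, the Python sketch below tries to read every value in a column as a number, then as a date, and otherwise falls back to text. This is an assumption made for illustration and is not RapidMiner's actual procedure. The function name guess_type, the single date format, and the sample columns are all hypothetical.

```python
from datetime import datetime


def guess_type(values):
    """Return 'numeric', 'date', or 'text' for a column of string values."""

    def all_parse(parser):
        # True only if every value in the column parses without error.
        for value in values:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(float):
        return "numeric"
    # Assumption: dates are written as YYYY-MM-DD; real data may differ.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


# Hypothetical columns, each held as the raw strings read from a CSV file.
columns = {
    "age": ["34", "51"],
    "joined": ["2011-03-01", "2012-07-15"],
    "city": ["London", "Paris"],
}
guessed = {name: guess_type(values) for name, values in columns.items()}
# Every column starts with the default role, mirroring the wizard's default.
roles = {name: "attribute" for name in columns}
print(guessed)
```

A guess of this kind can be wrong, for example when a numeric-looking code such as a postal code should be treated as text, which is why the wizard lets you override each type.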