Data Mining for the Masses

Repositories and Operators

Chapter 3: Data Preparation

Figure 3-9. Installing updates and adding the Text Mining module.

5)

Once the updates and installations are complete, RapidMiner will open and your window
should look like Figure 3-10:

Figure 3-10. The RapidMiner start screen.

6)

Next we will need to start a new data mining project in RapidMiner. To do this, click
on the ‘New’ icon as indicated by the black arrow in Figure 3-10. The resulting window
should look like Figure 3-11.

Figure 3-11. Getting started with a new project in RapidMiner.

7)

Within RapidMiner there are two main areas that hold useful tools: Repositories and
Operators. These are accessed by the tabs indicated by the black arrow in Figure 3-11.
The Repositories area is the place where you will connect to each data set you wish to
mine. The Operators area is where all data mining tools are located. These are used to
build models and otherwise manipulate data sets. Click on Repositories. You will find that
the initial repository we created upon our first launch of the RapidMiner software is
present in the list.

Figure 3-12. Adding a data set to a repository in RapidMiner.

8)

Because the focus of this book is to introduce data mining to the broadest possible
audience, we will not use all of the tools available in RapidMiner. At this point, we could
do a number of complicated and technical things, such as connecting to a remote
enterprise database. This, however, would likely be overwhelming and inaccessible to many
readers. For the purposes of this text, we will therefore only be connecting to
comma-separated values (CSV) files. You should know that most data mining projects
incorporate extremely large data sets encompassing dozens of attributes and thousands or
even millions of observations. We will use smaller data sets in this text, but the
foundational concepts illustrated are the same for large or small data. The Chapter 3 data
set downloaded from the companion web site is very small, consisting of only 15 attributes
and 11 observations. Our next step is to connect to this data set. Click on the Import
icon, which is the second icon from the left in the Repositories area, as indicated by the
black arrow in Figure 3-12.
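
For readers who want to see what "attributes" and "observations" mean in a CSV file outside of RapidMiner, the following is a minimal Python sketch. The sample data and the function name describe_csv are hypothetical and are not the book's Chapter 3 data set; the sketch assumes the first row of the file holds the attribute names.

```python
import csv
import io

# Hypothetical stand-in for a small CSV file (NOT the Chapter 3 data set).
SAMPLE_CSV = """Age,Gender,Hours_Online
34,F,12
27,M,30
45,F,5
"""


def describe_csv(text):
    """Return (number of attributes, number of observations) for CSV text.

    Assumes the first row contains attribute names and every later row
    is one observation.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, observations = rows[0], rows[1:]
    return len(header), len(observations)


attributes, observations = describe_csv(SAMPLE_CSV)
print(f"{attributes} attributes, {observations} observations")
# prints "3 attributes, 3 observations"
```

Each column is one attribute and each row after the header is one observation, which is the same layout RapidMiner expects when a CSV file is imported.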

Figure 3-13. Importing a CSV file.

9)

You will see by the black arrow in Figure 3-13 that you can import from a number of
different data sources. Note that by importing, you are bringing your data into a
RapidMiner file, rather than working with data that are already stored elsewhere. If your
data set is extremely large, it may take some time to import the data, and you should be
mindful of disk space that is available to you. As data sets grow, you may be better off
using the first (leftmost) icon to set up a remote repository in order to work with data
already stored in other areas. As previously explained, all examples in this text will be
conducted by importing CSV files that are small enough to work with quickly and easily.
Click on the Import CSV File option.

Figure 3-14. Locating the data set to import.

10)

When the data import wizard opens, navigate to the folder where your data set is stored
and select the file. In this example, only one file is visible: the Chapter 3 data set
downloaded from the companion web site. Click Next.

Figure 3-15. Configuring attribute separation.

11)

By default, RapidMiner looks for semicolons as attribute separators in our data. We must
change the column separation delimiter to be Comma, in order to be able to see each
attribute separated correctly. Note: If your data naturally contain commas, then you
should be careful as you are collecting or collating your data to use a delimiter that does
not naturally occur in the data. A semicolon or a pipe (|) symbol can often help you avoid
unintended column separation.
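
The effect of the delimiter choice can be illustrated with a short Python sketch. It uses Python's standard csv module rather than RapidMiner itself, and the record and the helper name parse are hypothetical. It shows three cases: a comma-delimited line read with a semicolon separator, a value that naturally contains a comma, and the same record written with a pipe delimiter.

```python
import csv
import io

# Hypothetical record; the address value naturally contains a comma.
record = ["1001", "12 Main St, Apt 4", "Springfield"]


def parse(line, delimiter):
    """Split one line of delimited text into fields using the csv module."""
    return next(csv.reader(io.StringIO(line), delimiter=delimiter))


comma_line = ",".join(record)  # unquoted, so the embedded comma is ambiguous
pipe_line = "|".join(record)   # the pipe does not occur in the data

# Wrong separator: the whole line is read as a single attribute.
print(len(parse(comma_line, ";")))  # prints 1

# Right separator, but the comma inside the address splits it in two.
print(len(parse(comma_line, ",")))  # prints 4

# A delimiter that never occurs in the data yields the intended 3 attributes.
print(len(parse(pipe_line, "|")))   # prints 3
```

Quoting values that contain the delimiter is another common remedy, but choosing a delimiter that never occurs in the data, as the note above suggests, avoids the problem at the source.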

Figure 3-16. A preview of attributes separated into columns
with the Comma option selected.

12)

Once the preview shows columns for each attribute, click Next. Note that RapidMiner has
treated our attribute names as if they are our first row of data, or in other words, our first
observation. To fix this, click the Annotation dropdown box next to this row and set it to
Name, as indicated in Figure 3-17. With the attribute names designated correctly, click
Next.
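
The difference between treating the first row as data and treating it as attribute names can be sketched in Python. This is only an analogy to the Annotation setting, using the standard csv module; the sample data is hypothetical.

```python
import csv
import io

# Hypothetical two-observation data set with a header row.
SAMPLE_CSV = "Age,Gender\n34,F\n27,M\n"

# Header treated as data: the attribute names show up as a first "observation".
rows_as_data = list(csv.reader(io.StringIO(SAMPLE_CSV)))
print(len(rows_as_data))  # prints 3
print(rows_as_data[0])    # prints ['Age', 'Gender']

# Header treated as names: each remaining row is keyed by attribute name.
rows_with_names = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(len(rows_with_names))       # prints 2
print(rows_with_names[0]["Age"])  # prints 34
```

Setting the first row's annotation to Name plays the same role as the second reading here: the row stops counting as an observation and labels the columns instead.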

Figure 3-17. Setting the attribute names.

13)

In step 4 of the data import wizard, RapidMiner will take its best guess at a data type for
each attribute. The data type is the kind of data an attribute holds, such as numeric, text or
date. These can be changed in this screen, but for our purposes in Chapter 3, we will
accept the defaults. Just below each attribute’s data type, RapidMiner also indicates a Role
for each attribute to play. By default, all columns are imported simply with the role of
‘attribute’; however, we can change these here if we know that one attribute is going to play
a specific role in a data mining model that we will create. Since roles can be set within
RapidMiner’s main process window when building data mining models, we will accept the
default of ‘attribute’ whenever we import data sets in exercises in this text. Also, you may
note that the check boxes above each attribute in this window allow you to leave out
attributes that you do not want to import. This is accomplished by simply clearing the
checkbox. Again, attributes can be excluded from models later, so for the purposes of this
text, we will always include all attributes when importing data. All of these functions are
indicated by the black arrows in Figure 3-18. Go ahead and accept these defaults as they
stand and click Next.
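
To give a feel for what "taking a best guess at a data type" can involve, here is a minimal Python sketch. It is not RapidMiner's actual logic, which the text does not describe. The function name guess_type, the sample columns, and the assumed date format (YYYY-MM-DD) are all hypothetical.

```python
from datetime import datetime


def guess_type(values):
    """Guess 'numeric', 'date', or 'text' for a column of string values.

    Illustrative only: a column is 'numeric' if every value parses as a
    number, 'date' if every value matches YYYY-MM-DD, otherwise 'text'.
    """
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numeric"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


# Hypothetical columns, not taken from the Chapter 3 data set.
columns = {
    "Age": ["34", "27", "45"],
    "Signup": ["2017-01-05", "2016-11-30", "2017-03-12"],
    "Gender": ["F", "M", "F"],
}

guessed = {name: guess_type(values) for name, values in columns.items()}
print(guessed)
# prints {'Age': 'numeric', 'Signup': 'date', 'Gender': 'text'}

# Clearing a checkbox corresponds to dropping a column before import.
excluded = {"Signup"}
imported = {name: v for name, v in columns.items() if name not in excluded}
print(list(imported))  # prints ['Age', 'Gender']
```

A guess of this kind can be wrong, for example when an identifier column happens to contain only digits, which is why the wizard lets you change the data type before finishing the import.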
