Data Mining for the Masses

Date: 08.10.2017
Chapter 3: Data Preparation
HANDS ON EXERCISE

Starting now, and throughout the next chapters of this book, there will be opportunities for you to
work hands-on at your computer and follow along. To do this, you will need to install OpenOffice
and RapidMiner, as discussed in the section A Note about Tools in Chapter 1. You will also need
an Internet connection to access this book’s companion web site, where copies of all data sets used
in the chapter exercises are available. The companion web site is located at:

https://sites.google.com/site/dataminingforthemasses/

Figure 3-4. Data Mining for the Masses companion web site.

You can download the Chapter 3 data set, which is an export of the view created in OpenOffice
Base, from the web site by locating it in the list of files and then clicking the down arrow to the far
right of the file name, as indicated by the black arrows in Figure 3-4. You may want to create a
folder labeled ‘data mining’ or something similar where you can keep copies of your data. More
files will be required and created as we continue through the rest of the book, especially when we
get into building data mining models in RapidMiner. Having a central place to keep everything
together will simplify things, and upon your first launch of the RapidMiner software, you’ll be
prompted to create a repository, so it’s a good idea to have a space ready. Once you’ve downloaded
the Chapter 3 data set, you’re ready to begin learning how to handle and prepare data for mining
in RapidMiner.

PREPARING RAPIDMINER, IMPORTING DATA, AND
HANDLING MISSING DATA

Our first task in data preparation is to handle missing data. However, because this will be our first
time using RapidMiner, the first few steps will involve getting RapidMiner set up. We’ll then move
straight into handling missing data. Missing data are data that do not exist in a data set. As you
can see in Figure 3-5, missing data are not the same as zero or some other value. The cell is blank,
and the value is unknown. Missing data are also sometimes known in the database world as null.
Depending on your objective in data mining, you may choose to leave missing data as they are, or
you may wish to replace missing data with some other value.

Figure 3-5: Some missing data within the survey data set.
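
The difference between a blank and a zero can also be seen outside of RapidMiner. The short
Python sketch below is only an illustration, not part of the book's exercise: the sample rows and
the Hours_Online column are hypothetical, and parse_value is a helper name invented for this
example. It reads a few comma-separated rows and treats blank cells as null (None), while a
recorded 0 stays a known value.

```python
import csv
import io

# Hypothetical sample rows; the column names and values are illustrative only
# and are not taken from the Chapter 3 data set.
SAMPLE = """Respondent,Online_Gaming,Hours_Online
1,Y,0
2,,5
3,N,
"""


def parse_value(raw):
    """Return None for a blank cell, otherwise the text unchanged."""
    return None if raw.strip() == "" else raw


rows = [
    {name: parse_value(value) for name, value in record.items()}
    for record in csv.DictReader(io.StringIO(SAMPLE))
]

# A recorded zero is a known answer; a blank cell is an unknown one.
print(rows[0]["Hours_Online"])   # the text "0": the respondent answered zero
print(rows[2]["Hours_Online"])   # None: no answer was recorded
print(rows[1]["Online_Gaming"])  # None: neither 'Y' nor 'N'
```

Keeping blanks as None rather than converting them to 0 or 'N' preserves the fact that the value
is unknown, which leaves the choice of how to handle it open.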

The creation of views is one way that data from a relational database can be collated and organized
in preparation for data mining activities. In this example, our database view has missing data in a
number of its attributes. Black arrows indicate a couple of these attributes in Figure 3-5 above. In
some instances, missing data are not a problem; they are expected. For example, in the Other
Social Network attribute, it is entirely possible that the survey respondent did not indicate that they
use social networking sites other than the ones listed in the survey. Thus, missing data are
probably accurate and acceptable. On the other hand, in the Online Gaming attribute, there are
answers of either ‘Y’ or ‘N’, indicating that the respondent either does or does not participate in
online gaming. But what do the missing, or null, values in this attribute indicate? It is unknown to
us. For the purposes of data mining, there are a number of options available for handling missing
data.
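
Before choosing an option, it helps to know how many values are missing in each attribute. The
Python sketch below is a rough illustration under stated assumptions, not the RapidMiner
procedure: the rows are made up, the attribute names only mirror the ones discussed above, and
count_missing and replace_missing are hypothetical helper names. It counts nulls per attribute
and shows one possible replacement, filling missing Online Gaming answers with ‘N’. Whether
that replacement is appropriate depends entirely on your data mining objective.

```python
from collections import Counter

# Hypothetical survey rows; None marks a missing (null) answer.
survey = [
    {"Online_Gaming": "Y", "Other_Social_Network": None},
    {"Online_Gaming": None, "Other_Social_Network": "LinkedIn"},
    {"Online_Gaming": "N", "Other_Social_Network": None},
    {"Online_Gaming": "N", "Other_Social_Network": None},
]


def count_missing(rows):
    """Count how many rows have no value for each attribute."""
    counts = Counter()
    for row in rows:
        for attribute, value in row.items():
            if value is None:
                counts[attribute] += 1
    return dict(counts)


def replace_missing(rows, attribute, replacement):
    """Return new rows with missing values in one attribute replaced.

    The original rows are left untouched so both versions can be compared.
    """
    return [
        {**row, attribute: replacement if row[attribute] is None else row[attribute]}
        for row in rows
    ]


missing_counts = count_missing(survey)

# Assumption for illustration only: treat an unanswered Online Gaming
# question as 'N'. Leaving the value missing is an equally valid choice.
filled = replace_missing(survey, "Online_Gaming", "N")

print(missing_counts)
print(filled[1])
```

Here the Other Social Network blanks are left alone, in line with the reasoning above that they
are probably accurate, while only the ambiguous Online Gaming blank is replaced.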

To learn about handling missing data in RapidMiner, follow the steps below to connect to your
data set and begin modifying it:

1) Launch the RapidMiner application. This can be done by double-clicking your desktop
icon or by finding it in your application menu. The first time RapidMiner is launched, you
will get the message depicted in Figure 3-6. Click OK to set up a repository.

Figure 3-6. The prompt to create an initial data repository for RapidMiner to use.

2) For most purposes (and for all examples in this book), a local repository will be sufficient.
Click OK to accept the default option as depicted in Figure 3-7.

Figure 3-7. Setting up a local data repository.

3) In the example given in Figure 3-8, we have named our repository ‘RapidMinerBook’ and
pointed it to our data folder, RapidMiner Data, which is found on our E: drive. Use the
folder icon to browse and find the folder or directory you created for storing your
RapidMiner data sets. Then click Finish.

Figure 3-8. Setting the repository name and directory.

4) You may get a notice that updates are available. If this is the case, accept the option to
update; you will then be presented with a window similar to Figure 3-9. Take advantage of
the opportunity to add the Text Mining module (indicated by the black arrow), since
Chapter 12 will deal with Text Mining. Double-click the check box to add a green check
mark indicating that you wish to install or update the module, then click Install.
