Data Mining for the Masses




Figure 3-18. Setting data types, roles and import attributes.

14) The final step is to choose a repository in which to store the data set, and to
give the data set a name within RapidMiner. In Figure 3-19, we have chosen to store
the data set in the RapidMiner Book repository and have given it the name Chapter3.
Once we click Finish, this data set will become available to us for any type of data
mining process we would like to build upon it.

Figure 3-19. Selecting the repository and setting a data set name
for our imported CSV file.

Chapter 3: Data Preparation

15) We can now see that the data set is available for use in RapidMiner. To begin
using it in a RapidMiner data mining process, simply drag the data set and drop it in
the Main Process window, as has been done in Figure 3-20.

Figure 3-20. Adding a data set to a process in RapidMiner.

16) Each rectangle in a process in RapidMiner is an operator. The Retrieve operator
simply gets a data set and makes it available for use. The small half-circles on the
sides of the operator, and of the Main Process window, are called ports. In Figure
3-20, an output (out) port from our data set’s Retrieve operator is connected to a
result set (res) port via a spline. The splines, combined with the operators connected
by them, constitute a data mining stream. To run a data mining stream and see the
results, click the blue, triangular Play button in the toolbar at the top of the
RapidMiner window. This will change your view from Design Perspective, which is the
view pictured in Figure 3-20 where you can change your data mining stream, to Results
Perspective, which shows your stream’s results, as pictured in Figure 3-21. When you
click the Play button, you may be prompted to save your process, and you are
encouraged to do so. RapidMiner may also ask whether you wish to overwrite a saved
process each time it is run, and you can select your preference on this prompt as
well.
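
The idea of a stream, in which operators pass a data set from one to the next, can
be sketched outside of RapidMiner in a few lines of Python. This is only a rough
illustration and not how RapidMiner is implemented: the REPOSITORY dictionary, the
retrieve and run_stream functions, and the three sample rows are all hypothetical
names and data invented for this sketch.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
ExampleSet = List[Row]
# An operator takes the output of the previous operator and returns a data set.
Operator = Callable[[Optional[ExampleSet]], ExampleSet]

# Hypothetical stand-in for a repository; the rows are made up for illustration.
REPOSITORY: Dict[str, ExampleSet] = {
    "Chapter3": [
        {"Online_Gaming": "N"},
        {"Online_Gaming": None},
        {"Online_Gaming": "Y"},
    ]
}


def retrieve(name: str) -> Operator:
    """Build an operator that fetches a stored data set and passes it on."""
    def operator(_input: Optional[ExampleSet] = None) -> ExampleSet:
        # Copy the rows so later operators cannot alter the stored data set.
        return [dict(row) for row in REPOSITORY[name]]
    return operator


def run_stream(operators: List[Operator]) -> ExampleSet:
    """Run the operators in order, feeding each one the previous output."""
    if not operators:
        raise ValueError("a stream needs at least one operator")
    data: Optional[ExampleSet] = None
    for op in operators:
        data = op(data)
    return data


# A stream with a single Retrieve-like operator wired straight to the results.
results = run_stream([retrieve("Chapter3")])
print(len(results), "rows delivered to the results")
```

Adding another operator to the stream corresponds to appending another function to
the list passed to run_stream.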


Figure 3-21. Results perspective for the Chapter3 data set.

17) You can toggle between design and results perspectives using the two icons
indicated by the black arrows in Figure 3-21. As you can see, there is a rich set of
information in results perspective. In the meta data view, basic descriptive
statistics are given. It is here that we can also get a sense for the number of
observations that have missing values in each attribute of the data set. The columns
in meta data view can be stretched to make their contents more readable. This is
accomplished by hovering your mouse over the faint vertical gray bars between each
column, then clicking and dragging to make them wider. The information presented here
can be very helpful in identifying where missing data are located, and in deciding
what to do about them. Take for example the Online_Gaming attribute. The results
perspective shows us that we have six ‘N’ responses in that attribute, two ‘Y’
responses, and three missing. We could use the mode, or most common response, to
replace the missing values. This of course assumes that the most common response is
correct for every observation with a missing value, which may not be the case. As
data miners, we must be responsible for thinking about each change we make in our
data, and whether or not we threaten the integrity of our data by making that change.
In some instances the consequences could be drastic. Consider, for instance, if the
mode for an attribute of Felony_Conviction were ‘Y’. Would we really want to convert
all missing values in this attribute to ‘Y’ simply because that is the mode in our
data set? Probably not; the implications about the persons represented in each
observation of our data set would be unfair and misrepresentative. We will
nonetheless change the missing values in the current example to illustrate how to
handle missing values in RapidMiner, recognizing that what we are about to do won’t
always be the right way to handle missing data. In order to have RapidMiner handle
the change from missing to ‘N’ for the three observations in our Online_Gaming
variable, click the design perspective icon.
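
For readers who want to see the arithmetic behind replacing missing values with the
mode, the following is a minimal Python sketch using only the standard library. It
is not RapidMiner's own procedure. The list below is a hypothetical stand-in for the
Online_Gaming attribute, built to match the counts described above (six ‘N’, two
‘Y’, three missing); the order of the values and the function names are assumptions
made for illustration.

```python
from collections import Counter

# Hypothetical Online_Gaming column: 6 'N', 2 'Y', 3 missing (None).
# The ordering is invented; only the counts come from the text.
online_gaming = ["N", "Y", None, "N", "N", None, "Y", "N", "N", None, "N"]


def count_missing(values):
    """Return how many observations have no value."""
    return sum(1 for v in values if v is None)


def mode_of(values):
    """Return the most common non-missing value."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        raise ValueError("no non-missing values to take a mode from")
    return counts.most_common(1)[0][0]


def replace_missing_with_mode(values):
    """Return a new list with each missing value replaced by the mode."""
    fill = mode_of(values)
    return [fill if v is None else v for v in values]


cleaned = replace_missing_with_mode(online_gaming)
print("missing before:", count_missing(online_gaming))
print("mode:", mode_of(online_gaming))
print("counts after:", dict(Counter(cleaned)))
```

Note that the replacement changes the apparent distribution: the three filled-in
observations are indistinguishable from genuine ‘N’ responses afterward, which is
exactly the integrity concern raised above.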

Figure 3-22. Finding an operator to handle missing values.

18) In order to find a tool in the Operators area, you can navigate through the
folder tree in the lower left-hand corner. RapidMiner offers many tools, and
sometimes, finding the one you want can be tricky. There is a handy search box,
indicated by the black arrow in Figure 3-22, that allows you to type in key words to
find tools that might do what you need. Type the word ‘missing’ into this box, and
you will see that RapidMiner automatically searches for tools with this word in their
name. We want to replace missing values, and we can see that within the Data
Transformation tool area, inside a sub-area called Value Modification, there is an
operator called Replace Missing Values. Let’s add this operator to our stream. Click
and hold on the operator name, and drag it up to your spline. When you point your
mouse cursor on the spline, the spline will turn slightly bold, indicating that when
you let go of your mouse button, the operator will be connected into the stream. If
you let go and the Replace Missing Values operator fails to connect into your stream,
you can reconfigure
