Data Mining for the Masses

your splines manually. Simply click on the out port in your Retrieve operator, and then
click on the exa port on the Replace Missing Values operator. Exa stands for example set,
and remember that ‘examples’ is the word RapidMiner uses for observations in a data set.
Be sure the exa port from the Replace Missing Values operator is connected to your result
set (res) port so that when you run your process, you will have output. Your model should
now look similar to Figure 3-23.

Figure 3-23. Adding a missing value operator to the stream.

19) When an operator is selected in RapidMiner, it has an orange rectangle around it. This will
also enable you to modify that operator’s parameters, or properties. The Parameters pane
is located on the right side of the RapidMiner window, as indicated by the black arrow in
Figure 3-23. For this exercise, we have decided to change all missing values in the
Online_Gaming attribute to be ‘N’, since this is the most common response in that
attribute. To do this, change the ‘attribute filter type’ to ‘single’, and you will see that a
dropdown box appears, allowing you to choose the Online_Gaming attribute as the target
for modification. Next, expand the ‘default’ dropdown box, and select ‘value’, which will
cause a ‘replenishment value’ box to appear. Type the replacement value ‘N’ in this box.
Note that you may need to expand your RapidMiner window, or use the vertical scroll bar
on the left of the Parameters pane in order to see all options, as the options change based
on what you have selected. When you are finished, your parameters should look like the
ones in Figure 3-24. Parameter settings that were changed are highlighted with black
arrows.

Figure 3-24. Missing value parameters.
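
For readers who like to see the same idea outside of the RapidMiner interface, here is a minimal Python sketch of what the Replace Missing Values operator is doing in this step. It is not RapidMiner's own code. The sample rows, the function name replace_missing, and the use of None to stand for a missing value are all assumptions made for illustration.

```python
# Hypothetical sample observations; None stands in for a missing value.
rows = [
    {"Online_Gaming": "Y", "Online_Shopping": "N"},
    {"Online_Gaming": None, "Online_Shopping": "Y"},
    {"Online_Gaming": "N", "Online_Shopping": None},
]


def replace_missing(examples, attribute, replacement):
    """Return copies of the examples with missing values in ONE attribute filled in."""
    result = []
    for example in examples:
        fixed = dict(example)  # copy, so the source data is left untouched
        if fixed[attribute] is None:
            fixed[attribute] = replacement
        result.append(fixed)
    return result


# Mirrors the settings in this step: a single attribute, a fixed value of 'N'.
filled = replace_missing(rows, "Online_Gaming", "N")
for example in filled:
    print(example)
```

Notice that only Online_Gaming is touched; the missing Online_Shopping value in the third row stays missing, just as it does in RapidMiner when the attribute filter type is set to ‘single’.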

20) You should understand that there are many other options available to you in the
parameters pane. We will not explore all of them here, but feel free to experiment with
them. For example, instead of changing a single attribute at a time, you could change a
subset of the attributes in your data set. You will learn much about the flexibility and
power of RapidMiner by trying out different tools and features. When you have your
parameters set, click the play button. This will run your process and switch you to results
perspective once again. Your results should look like Figure 3-25.
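
As one illustration of changing a subset of attributes at once, the following Python sketch fills the missing values in several attributes with each attribute's own most common value (its mode). This is only a sketch under assumptions: the sample rows and the function names mode_of and replace_missing_with_mode are hypothetical, and None is used to represent a missing value.

```python
from collections import Counter

# Hypothetical sample observations; None stands in for a missing value.
rows = [
    {"Online_Gaming": "N", "Online_Shopping": "Y"},
    {"Online_Gaming": "N", "Online_Shopping": None},
    {"Online_Gaming": None, "Online_Shopping": "Y"},
    {"Online_Gaming": "Y", "Online_Shopping": "N"},
]


def mode_of(examples, attribute):
    """Most common non-missing value of an attribute."""
    values = [e[attribute] for e in examples if e[attribute] is not None]
    return Counter(values).most_common(1)[0][0]


def replace_missing_with_mode(examples, attributes):
    """Fill missing values in each listed attribute with that attribute's mode."""
    modes = {a: mode_of(examples, a) for a in attributes}
    result = []
    for example in examples:
        fixed = dict(example)
        for attribute, mode in modes.items():
            if fixed[attribute] is None:
                fixed[attribute] = mode
        result.append(fixed)
    return result


filled = replace_missing_with_mode(rows, ["Online_Gaming", "Online_Shopping"])
for example in filled:
    print(example)
```

Whether the mode is a sensible replacement depends on the attribute, so treat this as something to experiment with rather than a rule.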

Figure 3-25. Results of changing missing data.

21) You can see now that the Online_Gaming attribute has been moved to the top of our list,
and that there are zero missing values. Click on the Data View radio button, above and to
the left-hand side of the attribute list, to see your data in a spreadsheet-type view. You will
see that the Online_Gaming variable is now populated with only ‘Y’ and ‘N’ values. We
have successfully replaced all missing values in that attribute. While in Data View, take
note of how missing values are annotated in other variables, Online_Shopping for example.
A question mark (?) denotes a missing value in an observation. Suppose that for this
variable, we do not wish to replace the null values with the mode, but rather, that we wish
to remove those observations from our data set prior to mining it. This is accomplished
through data reduction.
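
The missing-value counts you just checked in results perspective can be thought of as a simple tally per attribute. Here is a small Python sketch of that tally. The sample rows and the function name count_missing are hypothetical, and None stands in for the question mark (?) shown in Data View.

```python
# Hypothetical observations after Online_Gaming has been repaired;
# None stands in for the '?' that Data View shows for a missing value.
rows = [
    {"Online_Gaming": "Y", "Online_Shopping": "N"},
    {"Online_Gaming": "N", "Online_Shopping": None},
    {"Online_Gaming": "N", "Online_Shopping": "Y"},
    {"Online_Gaming": "Y", "Online_Shopping": None},
]


def count_missing(examples):
    """Count the missing values in each attribute."""
    counts = {}
    for example in examples:
        for attribute, value in example.items():
            counts[attribute] = counts.get(attribute, 0) + (1 if value is None else 0)
    return counts


missing = count_missing(rows)
print(missing)
```

An attribute with a count of zero needs no further repair; any attribute with a nonzero count is a candidate for either replacement or removal of the affected observations.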

DATA REDUCTION

Go ahead and switch back to design perspective. The next set of steps will teach you to reduce the
number of observations in your data set through the process of filtering.

1) In the search box within the Operators tab, type in the word ‘filter’. This will help you
locate the ‘Filter Examples’ operator, which is what we will use in this example. Drag the
Filter Examples operator over and connect it into your stream, right after the Replace
Missing Values operator. Your window will look like Figure 3-26.

Figure 3-26. Adding a filter to the stream.

2) In the condition class, choose ‘attribute_value_filter’, and for the parameter_string, type
the following: Online_Shopping=. Be sure to include the period. This parameter string
refers to our attribute, Online_Shopping, and it tells RapidMiner to filter out all
observations where the value in that attribute is missing. This is a bit confusing, because in
Data View in results perspective, missing values are denoted by a question mark (?), but when
entering the parameter string, missing values are denoted by a period (.). Once you’ve typed
these parameter values in, your screen will look like Figure 3-27.

Figure 3-27. Adding observation filter parameters.
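
The effect of this filter can be sketched in a few lines of Python. This is an illustration of the idea only, not of how RapidMiner implements Filter Examples. The eleven sample values below are invented (two of them missing), the function name filter_missing is hypothetical, and None represents a missing value.

```python
# Eleven hypothetical Online_Shopping values, two of them missing (None).
shopping = ["Y", "N", None, "Y", "Y", "N", "Y", None, "N", "Y", "N"]
examples = [{"Online_Shopping": value} for value in shopping]


def filter_missing(rows, attribute):
    """Keep only the observations where the attribute has a value."""
    return [row for row in rows if row[attribute] is not None]


kept = filter_missing(examples, "Online_Shopping")
print(len(examples), "observations before filtering")
print(len(kept), "observations after filtering")
```

The filter builds a new, smaller list and leaves the original list alone, which is the same behavior you should expect from a filter placed in your stream: the source data are not altered.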

Go ahead and run your model by clicking the play button. In results perspective, you will now see
that your data set has been reduced from eleven observations (or examples) to nine. This is
because the two observations where the Online_Shopping attribute had a missing value have been
removed. You’ll be able to see that they’re gone by selecting the Data View radio button. They
have not been deleted from the original source data; they are simply removed from the data set at
the point in the stream where the filter operator is located, and will no longer be considered in any
downstream data mining operations. In instances where the missing value cannot be safely
assumed or computed, removal of the entire observation is often the best course of action. When
attributes are numeric in nature, such as with ages or number of visits to a certain place, an
arithmetic measure of central tendency, such as mean, median or mode, might be an acceptable
replacement for missing values, but in more subjective attributes, such as whether one is an online
shopper or not, you may be better off simply filtering out observations where the datum is missing.
(One cool trick you can try in RapidMiner is to use the Invert Filter option in design perspective.
In this example, if you check that check box in the parameters pane of the Filter Examples
operator, you will keep the missing observations, and filter out the rest.)
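
Two of the ideas in the paragraph above, replacing missing numeric values with a measure of central tendency and inverting a filter, can be sketched in Python as follows. Everything here is assumed for illustration: the Age attribute, the sample rows, and the function names fill_numeric_with_median and keep_only_missing are hypothetical, and None represents a missing value.

```python
from statistics import median

# Hypothetical observations; Age is an assumed numeric attribute.
examples = [
    {"Age": 25, "Online_Shopping": "Y"},
    {"Age": None, "Online_Shopping": "N"},
    {"Age": 41, "Online_Shopping": None},
    {"Age": 33, "Online_Shopping": "Y"},
]


def fill_numeric_with_median(rows, attribute):
    """Replace missing values in a numeric attribute with the median of the known values."""
    known = [row[attribute] for row in rows if row[attribute] is not None]
    center = median(known)
    return [
        {**row, attribute: center if row[attribute] is None else row[attribute]}
        for row in rows
    ]


def keep_only_missing(rows, attribute):
    """Inverted filter: keep the observations a normal missing-value filter would drop."""
    return [row for row in rows if row[attribute] is None]


with_ages = fill_numeric_with_median(examples, "Age")
missing_shoppers = keep_only_missing(examples, "Online_Shopping")
print(with_ages)
print(missing_shoppers)
```

The median is used here only as one possible choice; the mean or the mode could be swapped in, and which one is acceptable depends on the attribute and on your data.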

Data mining can be confusing and overwhelming, especially when data sets get large. It doesn’t
have to be, though, if we manage our data well. The previous example has shown how to filter out
observations containing undesired data (or missing data) in an attribute, but we can also reduce
data to test out a data mining model on a smaller subset of our data. This can greatly reduce