Data Mining for the Masses
your splines manually. Simply click on the out port in your Retrieve operator, and then
click on the exa port on the Replace Missing Values operator. Exa stands for example set,
and remember that ‘examples’ is the word RapidMiner uses for observations in a data set.
Be sure the exa port from the Replace Missing Values operator is connected to your result
set (res) port so that when you run your process, you will have output. Your model should
now look similar to Figure 3-23.
Figure 3-23. Adding a missing value operator to the stream.
19) When an operator is selected in RapidMiner, it has an orange rectangle around it. This will
also enable you to modify that operator’s parameters, or properties. The Parameters pane
is located on the right side of the RapidMiner window, as indicated by the black arrow in
Figure 3-23. For this exercise, we have decided to change all missing values in the
Online_Gaming attribute to be ‘N’, since this is the most common response in that
attribute. To do this, change the ‘attribute filter type’ to ‘single’, and you will see that a
dropdown box appears, allowing you to choose the Online_Gaming attribute as the target
for modification. Next, expand the ‘default’ dropdown box, and select ‘value’, which will
cause a ‘replenishment value’ box to appear. Type the replacement value ‘N’ in this box.
Note that you may need to expand your RapidMiner window, or use the vertical scroll bar
on the left of the Parameters pane in order to see all options, as the options change based
on what you have selected. When you are finished, your parameters should look like the
Chapter 3: Data Preparation
ones in Figure 3-24. Parameter settings that were changed are highlighted with black
arrows.
Figure 3-24. Missing value parameters.
20) You should understand that there are many other options available to you in the
parameters pane. We will not explore all of them here, but feel free to experiment with
them. For example, instead of changing a single attribute at a time, you could change a
subset of the attributes in your data set. You will learn much about the flexibility and
power of RapidMiner by trying out different tools and features. When you have your
parameter set, click the play button. This will run your process and switch you to results
perspective once again. Your results should look like Figure 3-25.
Figure 3-25. Results of changing missing data.
21) You can see now that the Online_Gaming attribute has been moved to the top of our list,
and that there are zero missing values. Click on the Data View radio button, above and to
the left-hand side of the attribute list, to see your data in a spreadsheet-type view. You will
see that the Online_Gaming variable is now populated with only ‘Y’ and ‘N’ values. We
have successfully replaced all missing values in that attribute. While in Data View, take
note of how missing values are annotated in other variables, Online_Shopping for example.
A question mark (?) denotes a missing value in an observation. Suppose that for this
variable, we do not wish to replace the null values with the mode, but rather, that we wish
to remove those observations from our data set prior to mining it. This is accomplished
through data reduction.
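
For readers who want to see the same idea outside of RapidMiner, the replacement performed in steps 19 through 21 can be sketched in plain Python. This is only an illustration, not how RapidMiner implements the operator: the rows below are made up, None stands in for a missing value (the ? shown in Data View), and the function name replace_missing is hypothetical.

```python
from collections import Counter

# Hypothetical rows; None marks a missing value (shown as ? in Data View).
rows = [
    {"Online_Gaming": "N", "Online_Shopping": "Y"},
    {"Online_Gaming": None, "Online_Shopping": "N"},
    {"Online_Gaming": "Y", "Online_Shopping": None},
    {"Online_Gaming": "N", "Online_Shopping": "Y"},
]


def replace_missing(rows, attribute, value):
    """Return copies of the rows with missing entries in one attribute set to value."""
    return [
        {**row, attribute: value} if row[attribute] is None else dict(row)
        for row in rows
    ]


# Find the most common observed response, which justifies the replacement value.
observed = [r["Online_Gaming"] for r in rows if r["Online_Gaming"] is not None]
most_common = Counter(observed).most_common(1)[0][0]

# Only the chosen attribute is changed; other attributes keep their gaps.
cleaned = replace_missing(rows, "Online_Gaming", most_common)
missing_after = sum(r["Online_Gaming"] is None for r in cleaned)
print(most_common, missing_after)
```

As in the RapidMiner process, only the single selected attribute is modified, and the original rows are left as they were.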
DATA REDUCTION
Go ahead and switch back to design perspective. The next set of steps will teach you to reduce the
number of observations in your data set through the process of filtering.
1) In the search box within the Operators tab, type in the word ‘filter’. This will help you
locate the ‘Filter Examples’ operator, which is what we will use in this example. Drag the
Filter Examples operator over and connect it into your stream, right after the Replace
Missing Values operator. Your window will look like Figure 3-26.
Figure 3-26. Adding a filter to the stream.
2) In the condition class, choose ‘attribute_value_filter’, and for the parameter_string, type
‘Online_Shopping=.’ (be sure to include the period after the equals sign). This parameter string
refers to our attribute, Online_Shopping, and it tells RapidMiner to filter out all
observations where the value in that attribute is missing. This is a bit confusing, because in
Data View in results perspective, missings are denoted by a question mark (?), but when
entering the parameter string, missings are denoted by a period (.). Once you’ve typed
these parameter values in, your screen will look like Figure 3-27.
Figure 3-27. Adding observation filter parameters.
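
The filter configured in step 2 can also be sketched in plain Python, for readers who find code easier to follow than parameter settings. This is a minimal sketch under stated assumptions: the eleven rows are invented for illustration, None stands in for a missing value, and filter_examples is a hypothetical function name rather than RapidMiner's own implementation. The invert argument mirrors the Invert Filter check box described later in this section.

```python
def filter_examples(rows, attribute, invert=False):
    """Keep rows where the attribute is present.

    With invert=True, keep only the rows where the attribute is missing.
    """
    return [row for row in rows if (row[attribute] is None) == invert]


# Hypothetical data set: eleven rows, two of them missing Online_Shopping.
rows = [{"id": i, "Online_Shopping": "Y" if i % 2 else "N"} for i in range(1, 12)]
rows[3]["Online_Shopping"] = None
rows[8]["Online_Shopping"] = None

kept = filter_examples(rows, "Online_Shopping")
only_missing = filter_examples(rows, "Online_Shopping", invert=True)

# The source list is untouched; the filter only produces a smaller view.
print(len(rows), len(kept), len(only_missing))
```

The function returns a new list and leaves the input alone, which matches the way the filter operator affects only the data flowing downstream of it.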
Go ahead and run your model by clicking the play button. In results perspective, you will now see
that your data set has been reduced from eleven observations (or examples) to nine. This is
because the two observations where the Online_Shopping attribute had a missing value have been
removed. You’ll be able to see that they’re gone by selecting the Data View radio button. They
have not been deleted from the original source data; they are simply removed from the data set at
the point in the stream where the filter operator is located and will no longer be considered in any
downstream data mining operations. In instances where the missing value cannot be safely
assumed or computed, removal of the entire observation is often the best course of action. When
attributes are numeric in nature, such as with ages or number of visits to a certain place, an
arithmetic measure of central tendency, such as mean, median or mode might be an acceptable
replacement for missing values, but in more subjective attributes, such as whether one is an online
shopper or not, you may be better off simply filtering out observations where the datum is missing.
(One cool trick you can try in RapidMiner is to use the Invert Filter option in design perspective.
In this example, if you check that check box in the parameters pane of the Filter Examples
operator, you will keep the missing observations, and filter out the rest.)
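
For numeric attributes, replacing missing values with a measure of central tendency might look like the following sketch. It uses Python's standard statistics module; the ages are invented, None marks a missing value, and the helper name impute is hypothetical. Which measure is appropriate depends on your data, so treat this as an illustration rather than a recommendation.

```python
import statistics

# Hypothetical numeric attribute with gaps; None marks a missing value.
ages = [23, 35, None, 41, 35, None, 29]


def impute(values, measure):
    """Fill missing entries with a measure computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = measure(observed)
    return [fill if v is None else v for v in values]


# The same gaps filled three different ways.
by_mean = impute(ages, statistics.mean)
by_median = impute(ages, statistics.median)
by_mode = impute(ages, statistics.mode)
```

Each call computes the measure from the observed values only, so the missing entries do not distort the replacement value.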
Data mining can be confusing and overwhelming, especially when data sets get large. It doesn’t
have to be, though, if we manage our data well. The previous example has shown how to filter out
observations containing undesired data (or missing data) in an attribute, but we can also reduce
data to test out a data mining model on a smaller subset of our data. This can greatly reduce