Chapter 3: Data Preparation
processing time while testing a model to see if it will work to answer our questions. Follow the
steps below to take a sample of our data set in RapidMiner.
1) Using the search techniques previously demonstrated, use the Operators search feature to
find an operator called ‘Sample’ and add it to your stream. In the parameters pane, set
the sample to be a ‘relative’ sample, and then indicate that you want to retain 50% of your
observations in the resulting data set by typing .5 into the sample ratio field. Your window
should look like Figure 3-28.
Figure 3-28. Taking a 50% random sample of the data set.
2) When you run your model now, you will find that your results only contain four or five
observations, randomly selected from the nine that were remaining after our filter operator
removed records that had missing Online_Shopping values.
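For readers who would like to see the same idea outside of RapidMiner, the relative sample described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name relative_sample, its seed parameter, and the nine placeholder observations are hypothetical, and are not part of RapidMiner or of the Chapter 3 data set.

```python
import random


def relative_sample(rows, ratio, seed=None):
    """Return a random subset holding about `ratio` of the rows, drawn without replacement."""
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be between 0 and 1")
    keep = round(len(rows) * ratio)  # with nine rows and ratio 0.5, Python rounds 4.5 to 4
    rng = random.Random(seed)        # a fixed seed makes the draw repeatable
    return rng.sample(rows, keep)


# Hypothetical stand-ins for the nine observations left after filtering.
observations = [{"respondent": i} for i in range(1, 10)]
sampled = relative_sample(observations, 0.5, seed=1)
print(len(sampled), "of", len(observations), "observations retained")
```

Because nine observations cannot be split exactly in half, any tool must round the sample size one way or the other, which is why the text above speaks of four or five observations.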
Thus you can see that there are many ways, and various reasons, to reduce data by decreasing the
number of observations in your data set. We’ll now move on to handling inconsistent data, but
before doing so, it is important to reset our data back to its original form. While
filtering, we removed an observation that we will need in order to illustrate what inconsistent data
is, and to demonstrate how to handle it in RapidMiner. This is a good time to learn how to
remove operators from your stream. Switch back to design perspective and click on your
Sample operator. Next, right-click and choose Delete, or simply press the Delete key on your
Data Mining for the Masses
keyboard. Delete the Filter Examples operator at this time as well. Note that the spline that was
connected to the res port is also deleted. This is not a problem: you can reconnect the exa port
from the Replace Missing Values operator to the res port, or you will find that the spline
reappears when you complete the steps under Handling Inconsistent Data.
HANDLING INCONSISTENT DATA
Inconsistent data is different from missing data. Inconsistent data occurs when a value does
exist, but that value is not valid or meaningful. Refer back to Figure 3-25; a close-up version
of that image is shown here as Figure 3-29.
Figure 3-29. Inconsistent data in the Twitter attribute.
What is that 99 doing there? It seems that the only two valid values for the Twitter attribute
should be ‘Y’ and ‘N’. This is a value that is inconsistent and is therefore meaningless. As data
miners, we can decide if we want to filter this observation out, as we did with the missing
Online_Shopping records, or, we could use an operator designed to allow us to replace certain
values with others.
1) Return to design perspective if you are not already there. Ensure that you have deleted
your sampling and filter operators from your stream, so that your window looks like Figure
3-30.
Figure 3-30. Returning to a full data set in RapidMiner.
2) Note that we don’t need to remove the Replace Missing Values operator, because it is not
removing any observations in our data set. It only changes the values in the
Online_Gaming attribute, which won’t affect our next operator. Use the search feature in
the Operators tab to find an operator called Replace. Drag this operator into your stream.
If your splines were disconnected during the deletion of the sampling and filtering
operators, as is the case in Figure 3-30, you will see that your splines are automatically
reconnected when you add the Replace operator to the stream.
3) In the parameters pane, change the attribute filter type to single, then indicate Twitter as
the attribute to be modified. In truth, in this data set there is only one instance of the value
99 across all attributes and observations, so this change to a single attribute is not actually
necessary in this example, but it is good to be thoughtful and intentional with every step in
a data mining process. Most data sets will be far larger and more complex than the Chapter
3 data set we are currently working with. In the ‘replace what’ field, type the value 99, since
this is the value we’re looking to replace. Finally, in the ‘replace by’ field, we must decide
what we want to have in the place of the 99. If we leave this field blank, then the
observation will have a missing value (?) when we run the model and switch to Data View in
results perspective. We could also choose the mode, ‘N’, and given that 80% of the
survey respondents indicated that they did not use Twitter, this would seem a safe course
of action. You may choose the value you would like to use. For the book’s example, we
will enter ‘N’ and then run our model. You can see in Figure 3-31 that we now have nine
values of ‘N’, and two of ‘Y’ for our Twitter attribute.
Figure 3-31. Replacement of inconsistent value with a consistent one.
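The logic of the Replace step can also be expressed as a short Python sketch. It is offered only as an illustration and not as RapidMiner's implementation: the function name replace_value and the ordering of the survey values are hypothetical, while the counts (eight ‘N’, two ‘Y’, and one 99) follow the example in this chapter.

```python
from collections import Counter


def replace_value(rows, attribute, replace_what, replace_by):
    """Return copies of the rows with one value swapped out in a single attribute."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the original data stays untouched
        if new_row.get(attribute) == replace_what:
            new_row[attribute] = replace_by
        cleaned.append(new_row)
    return cleaned


# Hypothetical ordering; the counts (eight N, two Y, one 99) follow the chapter's example.
survey = [{"Twitter": v} for v in ["N", "Y", "N", "N", "99", "N", "N", "Y", "N", "N", "N"]]

cleaned = replace_value(survey, "Twitter", "99", "N")
print(Counter(row["Twitter"] for row in cleaned))
```

Restricting the replacement to one named attribute mirrors the choice of the single attribute filter type: a 99 that is perfectly valid in some other attribute would be left alone.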
Keep in mind that not all inconsistent data is going to be as easy to handle as replacing a single
value. It would be entirely possible that in addition to the inconsistent value of 99, values of 87,
96, 101, or others could be present in a data set. If this were the case, it might take multiple
replacements and/or missing data operators to prepare the data set for mining. In numeric data
we might also come across data which are accurate, but which are also statistical outliers. These
might also be considered to be inconsistent data, so an example in a later chapter will illustrate the
handling of statistical outliers. Sometimes data scrubbing can become tedious, but it will ultimately
affect the usefulness of data mining results, so these types of activities are important, and attention
to detail is critical.
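When several invalid codes appear in one attribute, listing each replacement by hand becomes error prone. One alternative, which is a different technique from the RapidMiner operators described in this chapter, is to state the valid values once and treat everything else as inconsistent. A minimal Python sketch follows; the function names, the fill parameter, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def clean_to_valid(values, valid, fill=None):
    """Keep values found in `valid`; replace anything else with `fill` (None marks it missing)."""
    return [value if value in valid else fill for value in values]


def mode_of_valid(values, valid):
    """Return the most frequent valid value, ignoring inconsistent entries."""
    counts = Counter(value for value in values if value in valid)
    return counts.most_common(1)[0][0]


# Hypothetical attribute values containing several invalid codes.
twitter = ["N", "87", "Y", "N", "99", "N", "101", "Y", "N", "96"]
valid_values = {"Y", "N"}

as_missing = clean_to_valid(twitter, valid_values)
as_mode = clean_to_valid(twitter, valid_values, fill=mode_of_valid(twitter, valid_values))
print(as_missing)
print(as_mode)
```

Whether to mark the inconsistent entries as missing or to fill them with the mode is the same judgment call discussed above for the single value of 99.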
ATTRIBUTE REDUCTION
In many data sets, you will find that some attributes are simply irrelevant to answering a given
question. In Chapter 4 we will discuss methods for evaluating correlation, or the strength of
relationships between given attributes. In some instances, you will not know the extent to which a
certain attribute will be useful without statistically assessing that attribute’s correlation to the other
data you will be evaluating. In our process stream in RapidMiner, we can remove attributes that
are not very interesting in terms of answering a given question without completely deleting them
from the data set. Remember, simply because certain variables in a data set aren’t interesting for
answering a certain question doesn’t mean those variables won’t ever be interesting. This is why
we recommended bringing in all attributes when importing the Chapter 3 data set earlier in this
chapter—uninteresting or irrelevant attributes are easy to exclude within your stream by following
these steps:
1) Return to design perspective. In the operator search field, type Select Attribute. The
Select Attributes operator will appear. Drag it onto the end of your stream so that it fits
between the Replace operator and the result set port. Your window should look like
Figure 3-32.