Data Mining for the Masses

Date: 08.10.2017


Chapter 3: Data Preparation
processing time while testing a model to see if it will work to answer our questions. Follow the
steps below to take a sample of our data set in RapidMiner.

1) Using the search techniques previously demonstrated, use the Operators search feature to
find an operator called ‘Sample’ and add it to your stream. In the parameters pane, set
the sample to be a ‘relative’ sample, and then indicate that you want to retain 50% of your
observations in the resulting data set by typing .5 into the sample ratio field. Your window
should look like Figure 3-28.

Figure 3-28. Taking a 50% random sample of the data set.

2) When you run your model now, you will find that your results contain only four or five
observations, randomly selected from the nine that remained after our filter operator
removed the records that had missing Online_Shopping values.
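For readers who want to see the idea outside of RapidMiner, the following is a minimal Python sketch of a relative sample. It is not how the Sample operator is implemented; the function name `relative_sample`, the `seed` parameter, and the stand-in rows are assumptions made for illustration. Note that this sketch always returns four rows from nine, because of how Python rounds 4.5, whereas the text reports four or five.

```python
import random


def relative_sample(rows, ratio, seed=None):
    """Return a random subset holding roughly `ratio` of the rows."""
    rng = random.Random(seed)          # a seed makes the draw repeatable
    k = round(len(rows) * ratio)       # 9 rows at 0.5 -> 4 (Python rounds 4.5 to even)
    return rng.sample(rows, k)         # sampling without replacement


# Hypothetical stand-in for the nine observations left after filtering.
observations = [{"id": i} for i in range(1, 10)]

sample = relative_sample(observations, 0.5, seed=42)
print(len(sample))  # 4
```

Which rows are kept depends on the seed; only the size of the sample is fixed by the ratio.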

Thus you can see that there are many ways, and various reasons, to reduce data by decreasing the
number of observations in your data set. We’ll now move on to handling inconsistent data, but
before doing so, it is important to reset our data back to its original form. While
filtering, we removed an observation that we will need in order to illustrate what inconsistent data
is, and to demonstrate how to handle it in RapidMiner. This is a good time to learn how to
remove operators from your stream. Switch back to design perspective and click on your
Sampling operator. Next, right click and choose Delete, or simply press the Delete key on your
keyboard. Delete the Filter Examples operator at this time as well. Note that the spline that was
connected to the res port is also deleted. This is not a problem: you can reconnect the exa port
from the Replace Missing Values operator to the res port, or you will find that the spline
reappears when you complete the steps under Handling Inconsistent Data.

HANDLING INCONSISTENT DATA

Inconsistent data is different from missing data. Inconsistent data occurs when a value does
exist, but that value is not valid or meaningful. Refer back to Figure 3-25; a close-up version
of that image is shown here as Figure 3-29.

Figure 3-29. Inconsistent data in the Twitter attribute.

What is that 99 doing there? It seems that the only two valid values for the Twitter attribute
should be ‘Y’ and ‘N’. The 99 is inconsistent with those values and is therefore meaningless. As data
miners, we can decide whether to filter this observation out, as we did with the missing
Online_Shopping records, or to use an operator designed to replace certain
values with others.

1) Return to design perspective if you are not already there. Ensure that you have deleted
your sampling and filter operators from your stream, so that your window looks like Figure
3-30.

Figure 3-30. Returning to a full data set in RapidMiner.
2) Note that we don’t need to remove the Replace Missing Values operator, because it is not
removing any observations from our data set. It only changes the values in the
Online_Gaming attribute, which won’t affect our next operator. Use the search feature in
the Operators tab to find an operator called Replace. Drag this operator into your stream.
If your splines were disconnected when you deleted the sampling and filtering
operators, as is the case in Figure 3-30, you will see that your splines are automatically
reconnected when you add the Replace operator to the stream.

3) In the parameters pane, change the attribute filter type to single, then indicate Twitter as
the attribute to be modified. In truth, in this data set there is only one instance of the value
99 across all attributes and observations, so restricting the change to a single attribute is not actually
necessary in this example, but it is good to be thoughtful and intentional with every step in
a data mining process. Most data sets will be far larger and more complex than the Chapter
3 data set we are currently working with. In the ‘replace what’ field, type the value 99, since
this is the value we’re looking to replace. Finally, in the ‘replace by’ field, we must decide
what we want to have in the place of the 99. If we leave this field blank, then the
observation will have a missing value (?) when we run the model and switch to Data View in
results perspective. We could also choose the mode, ‘N’; given that 80% of the
survey respondents indicated that they did not use Twitter, this would seem a safe course
of action. You may choose the value you would like to use. For the book’s example, we
will enter ‘N’ and then run our model. You can see in Figure 3-31 that we now have nine
values of ‘N’ and two of ‘Y’ for our Twitter attribute.

Figure 3-31. Replacement of inconsistent value with a consistent one.
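The same replacement can be sketched in plain Python. This is only an illustration of the logic, not RapidMiner's Replace operator: the function `replace_value` and its parameter names are hypothetical, the rows are stand-ins that reproduce only the Twitter counts described above (eight ‘N’, two ‘Y’, one 99), and the sketch assumes the 99 is stored as text.

```python
from collections import Counter


def replace_value(rows, attribute, replace_what, replace_by):
    """Return copies of the rows with one value swapped in a single attribute."""
    cleaned = []
    for row in rows:
        row = dict(row)                      # copy so the original rows stay untouched
        if row.get(attribute) == replace_what:
            row[attribute] = replace_by
        cleaned.append(row)
    return cleaned


# Hypothetical rows: only the Twitter counts follow the text.
twitter_values = ["N"] * 8 + ["Y"] * 2 + ["99"]
rows = [{"Twitter": value} for value in twitter_values]

cleaned = replace_value(rows, "Twitter", "99", "N")
counts = Counter(row["Twitter"] for row in cleaned)
print(counts["N"], counts["Y"])  # 9 2
```

Naming the attribute explicitly mirrors the choice of the single attribute filter type: a 99 that appears in some other attribute would be left alone.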


Keep in mind that not all inconsistent data is going to be as easy to handle as replacing a single
value. It is entirely possible that, in addition to the inconsistent value of 99, values of 87,
96, 101, or others could be present in a data set. If this were the case, it might take multiple
replacements and/or missing data operators to prepare the data set for mining. In numeric data
we might also come across values that are accurate but are statistical outliers. These
might also be considered inconsistent data, so an example in a later chapter will illustrate the
handling of statistical outliers. Data scrubbing can become tedious, but it ultimately
affects the usefulness of data mining results, so these activities are important, and attention
to detail is critical.
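When several different invalid values are present, one alternative to chaining many replacements is to state which values are valid and treat everything else as inconsistent. The Python sketch below takes that approach; it is a swapped-in technique rather than the sequence of RapidMiner operators described above, and the names `VALID_TWITTER` and `clean_attribute`, as well as the sample values, are assumptions for illustration.

```python
from collections import Counter

VALID_TWITTER = {"Y", "N"}  # the only values the text treats as meaningful


def clean_attribute(values, valid):
    """Replace any value outside `valid` with the mode of the valid values."""
    valid_values = [v for v in values if v in valid]
    mode = Counter(valid_values).most_common(1)[0][0]
    return [v if v in valid else mode for v in values]


# Hypothetical column containing several inconsistent codes.
raw = ["N", "Y", "99", "N", "87", "N", "101", "Y", "N", "96"]

cleaned = clean_attribute(raw, VALID_TWITTER)
print(cleaned)
```

Filling with the mode is one choice among several; marking the invalid values as missing and deciding later how to handle them would be equally reasonable.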

ATTRIBUTE REDUCTION

In many data sets, you will find that some attributes are simply irrelevant to answering a given
question. In Chapter 4 we will discuss methods for evaluating correlation, or the strength of
relationships between given attributes. In some instances, you will not know the extent to which a
certain attribute will be useful without statistically assessing that attribute’s correlation to the other
data you will be evaluating. In our process stream in RapidMiner, we can remove attributes that
are not very interesting in terms of answering a given question without completely deleting them
from the data set. Remember, simply because certain variables in a data set aren’t interesting for
answering a certain question doesn’t mean those variables won’t ever be interesting. This is why
we recommended bringing in all attributes when importing the Chapter 3 data set earlier in this
chapter: uninteresting or irrelevant attributes are easy to exclude within your stream by following
these steps:

1) Return to design perspective. In the operator search field, type Select Attribute. The
Select Attributes operator will appear. Drag it onto the end of your stream so that it fits
between the Replace operator and the result set port. Your window should look like
Figure 3-32.
