Data Mining for the Masses


Chapter 3: Data Preparation

Figure 3-32. Selecting a subset of a data set’s attributes.

2) In the Parameters pane, set the attribute filter type to ‘subset’, then click the Select
Attributes button; a window similar to Figure 3-33 will appear.

Figure 3-33. The attribute subset selection window.

3) Using the green right and left arrows, you can select which attributes you would like to
keep. Suppose we were going to study the demographics of Internet users. In this
instance, we might select Birth_Year, Gender, Marital_Status, Race, and perhaps
Years_on_Internet, and move them to the right under Selected Attributes using the right
green arrow. You can select more than one attribute at a time by holding down the
Control or Shift key (on a Windows computer) while clicking on the attributes you want to
select or deselect. We could then click OK, and these would be the only attributes we
would see in the results perspective when we run our model. All downstream data
mining operations added to our model will act only upon this subset of our attributes.
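
Outside RapidMiner, the effect of keeping only a subset of attributes can be sketched in a few
lines of Python. This is only an illustration, not the Select Attributes operator itself: the
attribute names come from the demographics scenario above, but the two sample rows, the extra
Other_Attribute column, and the select_attributes helper are all hypothetical.

```python
import csv
import io

# Hypothetical sample data; only the first five attribute names come from the text.
SAMPLE_CSV = """Birth_Year,Gender,Marital_Status,Race,Years_on_Internet,Other_Attribute
1972,M,Married,A,8,x
1981,F,Single,B,14,y
"""


def select_attributes(rows, keep):
    """Return new rows that contain only the attributes named in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
keep = ["Birth_Year", "Gender", "Marital_Status", "Race", "Years_on_Internet"]
subset = select_attributes(rows, keep)

# Any later step that receives `subset` never sees Other_Attribute.
for row in subset:
    print(row)
```

As in the RapidMiner process, the original rows are left untouched; only the steps that work
from the reduced copy see the smaller attribute set.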

CHAPTER SUMMARY

This chapter has introduced you to a number of concepts related to data preparation. Recall that
Data Preparation is the third step in the CRISP-DM process. Once you have established
Organizational Understanding as it relates to your data mining plans, and developed Data
Understanding in terms of what data you need, what data you have, where it is located, and so
forth, you can begin to prepare your data for mining. This has been the focus of this chapter.

The chapter used a small and very simple data set to help you learn to set up the RapidMiner data
mining environment. You have learned about viewing data sets in OpenOffice Base, and learned
some ways that data sets in relational databases can be collated. You have also learned about
comma separated values (CSV) files.

We then stepped through adding CSV files to a RapidMiner data repository so that we could
handle missing data, reduce data through observation filtering, handle inconsistencies in data, and
reduce the number of attributes in a model. All of these methods will be used in future chapters to
prepare data for modeling.
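
For readers who like to see the ideas in code, the first three of these preparation steps can be
sketched in plain Python. This is a rough illustration under stated assumptions, not a description
of how RapidMiner's operators work: the sample data, the choice of "Unknown" as a replacement
value, the treatment of "Female" as an inconsistent spelling of "F", and all three helper
functions are hypothetical.

```python
import csv
import io

# Hypothetical data: an empty cell stands for a missing value, and "Female"
# is an inconsistent coding of the value otherwise written as "F".
SAMPLE_CSV = """Birth_Year,Gender,Years_on_Internet
1972,M,8
1981,,14
,F,3
1990,Female,5
"""


def replace_missing(rows, attribute, value):
    """Handle missing data: fill empty cells of one attribute with `value`."""
    return [{**row, attribute: row[attribute] or value} for row in rows]


def filter_observations(rows, keep_if):
    """Reduce data: keep only the observations for which `keep_if` is true."""
    return [row for row in rows if keep_if(row)]


def replace_value(rows, attribute, old, new):
    """Handle inconsistencies: recode one value of an attribute as another."""
    return [
        {**row, attribute: new if row[attribute] == old else row[attribute]}
        for row in rows
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rows = replace_missing(rows, "Gender", "Unknown")
rows = filter_observations(rows, lambda row: row["Birth_Year"] != "")
rows = replace_value(rows, "Gender", "Female", "F")

for row in rows:
    print(row)
```

Each helper returns a new list rather than changing its input, which loosely mirrors the way each
step in a process stream hands a modified copy of the data to the next one.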

Data mining is most successful when conducted upon a foundation of well-prepared data. Recall
the quotation from Chapter 1 from Alice’s Adventures in Wonderland: which way you go does not
matter very much if you don’t know, or don’t care, where you are going. Likewise, the value of
where you arrive when you complete a data mining exercise will largely depend upon how well you
prepared to get there. Sometimes we hear the phrase “It’s better than nothing”. Well, in data
mining, results gleaned from poorly prepared data might be “worse than nothing”, because they
may be misleading. Decisions based upon them could lead an organization down a detrimental
and costly path. Learn to value the process of data preparation, and you will learn to be a better
data miner.

REVIEW QUESTIONS

1) What are the four main processes of data preparation discussed in this chapter? What do
they accomplish and why are they important?

2) What are some ways to collate data from a relational database?

3) For what kinds of problems might a data set need to be scrubbed?

4) Why is it often better to perform reductions using operators rather than excluding
attributes or observations as data are imported?

5) What is a data repository in RapidMiner and how is one created?

6) How might inconsistent data cause later trouble in data mining activities?

EXERCISE

1) Locate a data set of any number of attributes and observations. You may have access to
data sets through personal data collection or through your employment, although if you
use an employer’s data, make sure to do so only by permission! You can also search the
Internet for data set libraries. A simple search on the term ‘data sets’ in your favorite
search engine will yield a number of web sites that offer libraries of data sets that you can
use for academic and learning purposes. Download a data set that looks interesting to you
and complete the following:

2) Format the data set into a CSV file. It may come in this format, or you may need to open
the data in OpenOffice Calc or some similar software, and then use the File > Save As
feature to save your data as a CSV file.

3) Import your data into your RapidMiner repository. Save it in the repository as
Chapter3_Exercise.

4) Create a new, blank process stream in RapidMiner and drag your data set into the process
window.

5) Run your process and examine your data set in both Meta Data View and Data View. Note
if any attributes have missing or inconsistent data.

6) If you found any missing or inconsistent data, use operators to handle these. Perhaps try
browsing through the folder tree in the Operators tab and experiment with some operators
that were not covered in this chapter.

7) Try filtering out some observations based on some attribute’s value, and filter out some
attributes.

8) Document where you found your data set, how you prepared it for import into
RapidMiner, and what data preparation activities you applied to it.
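
If you would like a second opinion on what you see in RapidMiner during step 5, a short Python
sketch such as the one below can count missing cells and list the distinct values of each
attribute in a CSV file. It is only one possible approach: the sample data, the column names, and
the profile helper are hypothetical, and the sketch assumes every row has the same number of
cells as the header.

```python
import csv
import io
from collections import Counter

# Hypothetical data set with one missing city, one missing age, and an
# inconsistent spelling ("Boston" vs. "boston").
SAMPLE_CSV = """id,city,age
1,Boston,34
2,boston,
3,,29
4,Chicago,41
"""


def profile(rows):
    """Count empty cells and tally the distinct values of each attribute."""
    missing = Counter()
    values = {}
    for row in rows:
        for name, cell in row.items():
            if cell is None or cell.strip() == "":
                missing[name] += 1
            else:
                values.setdefault(name, Counter())[cell] += 1
    return missing, values


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing, values = profile(rows)

for name in rows[0]:
    distinct = sorted(values.get(name, {}))
    print(name, "missing:", missing[name], "distinct:", distinct)
```

To try this on your own data, read the rows from your file with open(path, newline="") in place
of io.StringIO(SAMPLE_CSV), where path is the location of the CSV file you created in step 2.
Attributes with a nonzero missing count, or with near-duplicate values in the distinct list, are
the ones to look at more closely in step 6.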
