Data Mining for the Masses

has a number of tasks before him, each of which falls into one of the first three phases of CRISP-DM.
First, Jerry must ensure that he has developed a clear Organizational Understanding. What is
the purpose of this project for his employer? Why is he surveying Internet users? Which data
points are important to collect, which would be nice to have, and which would be irrelevant or
even distracting to the project? Once the data are collected, who will have access to the data set
and through what mechanisms? How will the business ensure privacy is protected? All of these
questions, and perhaps others, should be answered before Jerry even creates the survey mentioned
in the second paragraph above.

Once these questions are answered, Jerry can begin to craft his survey. This is where Data Understanding
enters the process. What database system will he use? What survey software? Will he use a
publicly available tool like SurveyMonkey™, a commercial product, or something homegrown? If
he uses a publicly available tool, how will he access and extract data for mining? Can he trust this
third party to secure his data and if so, why? How will the underlying database be designed? What
mechanisms will be put in place to ensure consistency and integrity in the data? These are all
questions of data understanding. A simple example of a consistency problem arises if a person’s
home city is collected as part of the data. If the online survey just provides an open text
box for entry, respondents could put just about anything as their home city. They might put New
York, NY, N.Y., Nwe York, or any number of other possible combinations, including typos. This
could be avoided by forcing users to select their home city from a dropdown menu, but
considering the number of cities there are in most countries, that list could be unacceptably long! So
the choice of how to handle this potential data consistency problem isn’t necessarily an obvious or
easy one, and this is just one of many data points to be collected. While ‘home state’ or ‘country’
may be reasonable to constrain to a dropdown, ‘city’ may have to be entered freehand into a
textbox, with some sort of data correction process to be applied later.
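
One possible shape for such a correction process is sketched below in Python. It is only an illustration under stated assumptions, not the method used in this book: the list of canonical cities, the alias table, the function name normalize_city, and the 0.8 similarity cutoff are all hypothetical choices. Known abbreviations are resolved through the alias table, and remaining entries are compared against the canonical list with the standard library's difflib module to catch typos.

```python
import difflib

# Hypothetical canonical list; a real project would load an authoritative list of cities.
CANONICAL_CITIES = ["New York", "Los Angeles", "Chicago"]

# Hypothetical hand-maintained aliases for abbreviations that fuzzy matching cannot resolve.
ALIASES = {"ny": "New York", "nyc": "New York"}


def normalize_city(raw, cutoff=0.8):
    """Return the canonical city for a free-text entry, or None if no confident match."""
    # Drop periods and collapse repeated whitespace, so "N.Y." becomes "NY".
    cleaned = " ".join(raw.replace(".", "").split())
    key = cleaned.lower()
    if key in ALIASES:
        return ALIASES[key]
    # Compare case-insensitively against the canonical names to catch typos.
    lookup = {city.lower(): city for city in CANONICAL_CITIES}
    matches = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


for entry in ["New York", "NY", "N.Y.", "Nwe York", "Springfield"]:
    print(repr(entry), "->", normalize_city(entry))
```

Entries that return None are not guessed at; they would need manual review or a larger canonical list. A lower cutoff catches more typos but also raises the risk of mapping one real city onto another.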

The ‘later’ would come once the survey has been developed and deployed, and data have been
collected. With the data in place, the third CRISP-DM phase, Data Preparation, can begin. If
you haven’t installed OpenOffice and RapidMiner yet, and you want to work along with the
examples given in the rest of the book, now would be a good time to go ahead and install these
applications. Remember that both are freely available for download and installation via the
Internet, and the links to both applications are given in Chapter 1. We’ll begin by doing some data
preparation in OpenOffice Base (the database application) and OpenOffice Calc (the spreadsheet
application), and then move on to other data preparation tools in RapidMiner. You should
understand that the examples of data preparation in this book are only a subset of possible data
preparation approaches.

COLLATION

Suppose that the database underlying Jerry’s Internet survey is designed as depicted in the
screenshot from OpenOffice Base in Figure 3-1.

Figure 3-1: A simple relational (one-to-one) database for Internet survey data.

This design would enable Jerry to collect data about people in one table, and data about their
Internet behaviors in another. RapidMiner would be able to connect to either of these tables in
order to mine the responses, but what if Jerry were interested in mining data from both tables at
once?

One simple way to collate data from multiple tables into a single location for data mining is to create a
database view. A view is a type of pseudo-table, created by writing a SQL statement that is
named and stored in the database. Figure 3-2 shows the creation of a view in OpenOffice Base,
while Figure 3-3 shows the view in datasheet view.

Figure 3-2: Creation of a view in OpenOffice Base.

Figure 3-3: Results of the view from Figure 3-2 in datasheet view.

The creation of views is one way that data from a relational database can be collated and organized
in preparation for data mining activities. In this example, although the personal information in the
‘Respondents’ table is only stored once in the database, it is displayed for each record in the
‘Responses’ table, creating a data set that is more easily mined because it is both richer in
information and consistent in its formatting.
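
The idea can also be tried outside OpenOffice Base. The following sketch uses Python's built-in sqlite3 module with an in-memory database. The table names follow the text, but every column name, the view name SurveyData, and the sample rows are assumptions made for illustration; they are not taken from Figures 3-1 through 3-3. The sample data deliberately gives one respondent two responses, which may not match the design in Figure 3-1, so that the repetition of personal information is visible.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Column names, view name, and sample rows below are hypothetical.
conn.executescript("""
CREATE TABLE Respondents (
    respondent_id INTEGER PRIMARY KEY,
    gender        TEXT,
    birth_year    INTEGER
);
CREATE TABLE Responses (
    response_id     INTEGER PRIMARY KEY,
    respondent_id   INTEGER REFERENCES Respondents(respondent_id),
    online_activity TEXT
);
INSERT INTO Respondents VALUES (1, 'F', 1980), (2, 'M', 1992);
INSERT INTO Responses VALUES (10, 1, 'email'), (11, 1, 'news'), (12, 2, 'gaming');

-- The view is a named, stored SELECT statement; it holds no data of its own.
CREATE VIEW SurveyData AS
SELECT r.respondent_id, r.gender, r.birth_year, s.online_activity
FROM Respondents AS r
JOIN Responses AS s ON s.respondent_id = r.respondent_id;
""")

# Querying the view works like querying a table.
rows = conn.execute(
    "SELECT * FROM SurveyData ORDER BY respondent_id, online_activity"
).fetchall()
for row in rows:
    print(row)
```

Each row returned by the view carries the respondent's details alongside one response, so a mining tool could read a single flat data set while the underlying tables stay normalized.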

DATA SCRUBBING

In spite of our very best efforts to maintain quality and integrity during data collection, it is
inevitable that some anomalies will be introduced into our data at some point. The process of data
scrubbing allows us to handle these anomalies in ways that make sense for us. In the remainder of
this chapter, we will examine data scrubbing in four different ways: handling missing data, reducing
data (observations), handling inconsistent data, and reducing attributes.
