Data Mining for the Masses

was there a standard method of collection?  What do the various columns and rows of data mean?
Are  there  acronyms  or  abbreviations  that  are  unknown  or  unclear?    You  may  need  to  do  some
research in the Data Preparation phase of your data mining activities.  Sometimes you will need to
meet with subject matter experts in various departments to unravel where certain data came from,
how they were collected, and how they have been coded and stored.  It is critically important that
you verify the accuracy and reliability of the data as well.  The old adage “It’s better than nothing”
does not apply in data mining.  Inaccurate or incomplete data could be worse than nothing in a
data mining activity, because decisions based upon partial or wrong data are likely to be partial or
wrong decisions.  Once you have gathered, identified and understood your data assets, then you
may engage in…

CRISP-DM Step 3: Data Preparation

Data come in many shapes and formats.  Some data are numeric, some are in paragraphs of text,
and  others  are  in  picture  form  such  as  charts,  graphs  and  maps.    Some  data  are  anecdotal  or
narrative,  such  as  comments  on  a  customer  satisfaction  survey  or  the  transcript  of  a  witness’s
testimony.    Data  that  aren’t  in  rows  or  columns  of  numbers  shouldn’t  be  dismissed  though—
sometimes non-traditional data formats can be the most information rich.  We’ll talk in this book
about approaches to formatting data, beginning in Chapter 2.  Although rows and columns will be
one of our most common layouts, we’ll also get into text mining where paragraphs can be fed into
RapidMiner and analyzed for patterns as well.

Data Preparation involves a number of activities.  These may include joining two or more data
sets together, reducing data sets to only those variables that are interesting in a given data mining
exercise,  scrubbing  data  clean  of  anomalies  such  as  outlier  observations  or  missing  data,  or  re-
formatting  data  for  consistency  purposes.    For  example,  you  may  have  seen  a  spreadsheet  or
database that held phone numbers in many different formats:
(555) 555-5555
555/555-5555
555-555-5555
555.555.5555
555 555 5555
5555555555
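
As a minimal sketch of this kind of re-formatting, the short Python script below reduces each of the formats above to one canonical form. The function name normalize_phone and the choice of 555-555-5555 as the target format are assumptions made for illustration; they are not part of RapidMiner or of the exercises in this book.

```python
import re

def normalize_phone(raw: str) -> str:
    """Return a 10-digit phone number as 555-555-5555, whatever its input format."""
    digits = re.sub(r"\D", "", raw)  # drop every character that is not a digit
    if len(digits) != 10:
        # Refuse to guess: anything that is not exactly 10 digits needs a human look.
        raise ValueError(f"expected 10 digits, got {len(digits)}: {raw!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

formats = [
    "(555) 555-5555",
    "555/555-5555",
    "555-555-5555",
    "555.555.5555",
    "555 555 5555",
    "5555555555",
]

# Collect the distinct values that remain after normalization.
normalized = {normalize_phone(f) for f in formats}
print(normalized)
```

All six inputs collapse to a single value, so a later join or count on the phone number column would treat them as one. Raising an error on anything that is not exactly ten digits is a deliberate choice here: it flags incomplete entries for review instead of silently passing them along.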

Each of these represents the same phone number, stored in a different format.  A data
mining exercise is most likely to yield good, useful results when the underlying data are as
consistent as possible.  Data preparation improves your chances of a
successful outcome when you begin…

CRISP-DM Step 4: Modeling

A model, in data mining at least, is a computerized representation of real-world observations.
Models apply algorithms to seek out, identify, and display any patterns or messages
in your data.  There are two basic types of models in data mining: those that classify and
those that predict.

Figure 1-2: Types of Data Mining Models.

As you can see in Figure 1-2, there is some overlap between the types of models data mining uses.
For example, this book will teach you about decision trees.  Decision Trees are a predictive
model used to determine which attributes of a given data set are the strongest indicators of a given
outcome.  The outcome is usually expressed as the likelihood that an observation will fall into a
certain category. Thus, Decision Trees are predictive in nature, but they also help us to classify our
data.  This will probably make more sense when we get to the chapter on Decision Trees, but for
now, it’s important just to understand that models help us to classify and predict based on patterns
the models find in our data.
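
To make the idea of a "strongest indicator" a little more concrete, here is a minimal Python sketch that ranks attributes by information gain, one common criterion for choosing where a decision tree splits. This is an illustration only: the data set, the attribute names (member, region, bought) and the choice of information gain are all assumptions, and the decision tree tools used later in this book may rely on a different criterion.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        # Weight each subset's entropy by its share of all observations.
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical observations: did a customer buy, given two attributes?
rows = [
    {"member": "yes", "region": "east", "bought": "yes"},
    {"member": "yes", "region": "west", "bought": "yes"},
    {"member": "yes", "region": "east", "bought": "yes"},
    {"member": "no",  "region": "west", "bought": "no"},
    {"member": "no",  "region": "east", "bought": "no"},
    {"member": "no",  "region": "west", "bought": "yes"},
]

gains = {a: information_gain(rows, a, "bought") for a in ("member", "region")}
best = max(gains, key=gains.get)  # attribute with the largest gain
print(best, gains)
```

In this made-up data, knowing whether a customer is a member tells us much more about the outcome than knowing the region does, so a tree built on it would split on member first. The same split both predicts the likely outcome for a new observation and sorts existing observations into groups, which is the overlap between predicting and classifying described above.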

Models may be simple or complex. They may contain only a single process, or stream, or they may
contain  sub-processes.    Regardless  of  their  layout,  models  are  where  data  mining  moves  from
preparation  and  understanding  to  development  and  interpretation.    We  will  build  a  number  of
example models in this text. Once a model has been built, it is time for…

CRISP-DM Step 5: Evaluation

All analyses of data have the potential for false positives.  Even if a model doesn’t yield false
positives, however, it may not find any interesting patterns in your data.  This may be
because the model isn’t set up well to find the patterns, because you are using the wrong technique, or
because there simply isn’t anything interesting in your data for the model to find.  The Evaluation
phase of CRISP-DM is there specifically to help you determine how valuable your model is, and
what you might want to do with it.
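
One simple way to put a number on false positives is to compare a model's predictions against outcomes that are already known. The Python sketch below does this for a two-class outcome. The observations, the predictions and the function name confusion_counts are hypothetical, chosen only to illustrate the arithmetic; they do not come from a model built in this book.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a two-class outcome."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct only if the known outcome is positive too.
            counts["tp" if a == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the known outcome was positive.
            counts["fn" if a == positive else "tn"] += 1
    return counts

# Hypothetical known outcomes and the model's predictions for the same observations.
actual    = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "yes", "no", "no"]

counts = confusion_counts(actual, predicted)

# Share of truly negative observations that the model wrongly flagged as positive.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
print(counts, false_positive_rate)
```

Here the model flags two of the five truly negative observations as positive. Whether a rate like that is acceptable is not something the arithmetic can decide; it depends on what a false alarm costs in the setting where the model will be used.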

Evaluation can be accomplished using a number of techniques, both mathematical and logical in
nature.  This book will examine techniques for cross-validation and testing for false positives using
RapidMiner. For some models, the power or strength indicated by certain test statistics will also be
discussed.  Beyond these measures, however, model evaluation must also include a human aspect.
As individuals gain experience and expertise in their field, they acquire operational knowledge
that may not be measurable in a mathematical sense but is nonetheless indispensable in
determining  the  value  of  a  data  mining  model.    This  human  element  will  also  be  discussed
throughout the book.  Using both data-driven and instinctive evaluation techniques to determine a
model’s usefulness, we can then decide how to move on to…

CRISP-DM Step 6: Deployment

If you have successfully identified your questions, prepared data that can answer those questions,
and created a model that passes the test of being interesting and useful, then you have arrived at
the point of actually using your results.  This is deployment, and it is a happy and busy time for a data
miner.  Activities in this phase include automating your model, meeting with consumers
of your model’s outputs, integrating with existing management or operational information systems,
feeding  new  learning  from  model  use  back  into  the  model  to  improve  its  accuracy  and
performance, and monitoring and measuring the outcomes of model use.  Be prepared for a bit of
distrust of your model at first—you may even face pushback from groups who may feel their jobs
are threatened by this new tool, or who may not trust the reliability or accuracy of the outputs. But
don’t  let  this  discourage  you!    Remember  that  CBS  did  not  trust  the  initial  predictions  of  the
UNIVAC, one of the first commercial computer systems, when the network used it to predict the
eventual outcome of the 1952 presidential election on election night.  With only 5% of the votes
counted, UNIVAC predicted Dwight D. Eisenhower would defeat Adlai Stevenson in a landslide;
