Data Mining
for the Masses
was there a standard method of collection? What do the various columns and rows of data mean?
Are there acronyms or abbreviations that are unknown or unclear? You may need to do some
research in the Data Preparation phase of your data mining activities. Sometimes you will need to
meet with subject matter experts in various departments to unravel where certain data came from,
how they were collected, and how they have been coded and stored. It is critically important that
you verify the accuracy and reliability of the data as well. The old adage “It’s better than nothing”
does not apply in data mining. Inaccurate or incomplete data could be worse than nothing in a
data mining activity, because decisions based upon partial or wrong data are likely to be partial or
wrong decisions. Once you have gathered, identified and understood your data assets, then you
may engage in…
CRISP-DM Step 3: Data Preparation
Data come in many shapes and formats. Some data are numeric, some are in paragraphs of text,
and others are in picture form such as charts, graphs and maps. Some data are anecdotal or
narrative, such as comments on a customer satisfaction survey or the transcript of a witness’s
testimony. Data that aren’t in rows or columns of numbers shouldn’t be dismissed though—
sometimes non-traditional data formats can be the most information rich. We’ll talk in this book
about approaches to formatting data, beginning in Chapter 2. Although rows and columns will be
one of our most common layouts, we’ll also get into text mining where paragraphs can be fed into
RapidMiner and analyzed for patterns as well.
Data Preparation involves a number of activities. These may include joining two or more data
sets together, reducing data sets to only those variables that are interesting in a given data mining
exercise, scrubbing data clean of anomalies such as outlier observations or missing data, or
reformatting data for consistency purposes. For example, you may have seen a spreadsheet or
database that held phone numbers in many different formats:
(555) 555-5555
555/555-5555
555-555-5555
555.555.5555
555 555 5555
5555555555
Each of these represents the same phone number, but stored in a different format. A data
mining exercise is most likely to yield good, useful results when the underlying data are as
consistent as possible. Data preparation can improve your chances of a
successful outcome when you begin…
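The phone number example above can be sketched in a few lines of code. This is only an illustration in Python, not the RapidMiner approach this book uses; the function name normalize_phone and the choice of 555-555-5555 as the target format are assumptions made for the example.

```python
import re

def normalize_phone(raw):
    """Return a 10-digit phone number as 555-555-5555, or None if it is not 10 digits."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) != 10:
        return None  # leave odd values for manual review rather than guessing
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"

# The six formats listed in the text
samples = [
    "(555) 555-5555",
    "555/555-5555",
    "555-555-5555",
    "555.555.5555",
    "555 555 5555",
    "5555555555",
]

normalized = [normalize_phone(s) for s in samples]
print(set(normalized))  # all six variants collapse to a single value
```

Returning None for anything that is not exactly ten digits is a deliberate choice here: a value that cannot be cleaned with confidence is better flagged than silently altered.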
CRISP-DM Step 4: Modeling
A model, in data mining at least, is a computerized representation of real-world observations.
Models are the application of algorithms to seek out, identify, and display any patterns or messages
in your data. There are two basic kinds of models in data mining: those that classify and
those that predict.
Figure 1-2: Types of Data Mining Models.
As you can see in Figure 1-2, there is some overlap between the types of models data mining uses.
For example, this book will teach you about decision trees. Decision Trees are a predictive
model used to determine which attributes of a given data set are the strongest indicators of a given
outcome. The outcome is usually expressed as the likelihood that an observation will fall into a
certain category. Thus, Decision Trees are predictive in nature, but they also help us to classify our
data. This will probably make more sense when we get to the chapter on Decision Trees, but for
now, it’s important just to understand that models help us to classify and predict based on patterns
the models find in our data.
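To make the idea of a "strongest indicator" a little more concrete, here is a minimal sketch in Python (an assumption for illustration; this book builds its models in RapidMiner). It scores each attribute with information gain, which is one common way a decision tree chooses its first split, and then reports the likelihood of each outcome within each branch. The data set, attribute names, and function names are all made up for the example.

```python
from collections import Counter
from math import log2

# Made-up observations: (age_group, owns_home, bought_product)
rows = [
    ("young", "yes", "yes"),
    ("young", "yes", "yes"),
    ("older", "yes", "yes"),
    ("older", "yes", "no"),
    ("young", "no", "no"),
    ("young", "no", "no"),
    ("older", "no", "no"),
    ("older", "no", "yes"),
]
attributes = {"age_group": 0, "owns_home": 1}  # attribute name -> column index
OUTCOME = 2                                    # column holding the outcome

def entropy(subset):
    """Shannon entropy (in bits) of the outcome column for a list of rows."""
    counts = Counter(r[OUTCOME] for r in subset)
    total = len(subset)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(data, col):
    """Reduction in outcome entropy obtained by splitting the data on one attribute."""
    groups = {}
    for r in data:
        groups.setdefault(r[col], []).append(r)
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy(data) - remainder

def branch_likelihoods(data, col):
    """For each attribute value, the share of observations falling into each outcome."""
    groups = {}
    for r in data:
        groups.setdefault(r[col], []).append(r[OUTCOME])
    return {
        value: {label: n / len(labels) for label, n in Counter(labels).items()}
        for value, labels in groups.items()
    }

gains = {name: information_gain(rows, col) for name, col in attributes.items()}
best = max(gains, key=gains.get)  # the attribute with the highest gain
likelihoods = branch_likelihoods(rows, attributes[best])

print(gains)
print(best, likelihoods)
```

In this toy data, home ownership separates buyers from non-buyers better than age group does, so it is chosen as the split. The per-branch likelihoods show how the same model both predicts (a likelihood) and classifies (the most likely category).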
Models may be simple or complex. They may contain only a single process, or stream, or they may
contain sub-processes. Regardless of their layout, models are where data mining moves from
preparation and understanding to development and interpretation. We will build a number of
example models in this text.
Once a model has been built, it is time for…
CRISP-DM Step 5: Evaluation
All analyses of data have the potential for false positives. Even if a model doesn’t yield false
positives, however, it may not find any interesting patterns in your data. This may be
because the model isn’t set up well to find the patterns, because you are using the wrong technique, or
because there simply isn’t anything interesting in your data for the model to find. The Evaluation
phase of CRISP-DM is there specifically to help you determine how valuable your model is, and
what you might want to do with it.
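As a rough illustration of what counting false positives looks like, the sketch below compares a model's predictions against known outcomes. It is written in Python as an assumption for the example (this book uses RapidMiner), and the two lists of labels are made up.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a two-category outcome."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, really positive
        elif p == positive:
            fp += 1  # predicted positive, really negative: a false positive
        elif a == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1  # predicted negative, really negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Made-up known outcomes and model predictions
actual    = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "yes"]

counts = confusion_counts(actual, predicted)
# Share of truly negative observations that the model wrongly flagged as positive
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])

print(counts)
print(false_positive_rate)
```

Here two of the five truly negative observations were flagged as positive. Whether that rate is acceptable depends on what a false positive costs in your setting.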
Evaluation can be accomplished using a number of techniques, both mathematical and logical in
nature. This book will examine techniques for cross-validation and testing for false positives using
RapidMiner. For some models, the power or strength indicated by certain test statistics will also be
discussed. Beyond these measures, however, model evaluation must also include a human aspect.
As individuals gain experience and expertise in their field, they will have operational knowledge
which may not be measurable in a mathematical sense, but is nonetheless indispensable in
determining the value of a data mining model. This human element will also be discussed
throughout the book. Using both data-driven and instinctive evaluation techniques to determine a
model’s usefulness, we can then decide how to move on to…
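The cross-validation mentioned above can be sketched as follows. This is a minimal illustration in Python, not RapidMiner's implementation; the helper names, the made-up labels, and the deliberately trivial "model" (always guess the most common training label) are assumptions chosen to keep the example short.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def majority_label(labels):
    """A stand-in 'model': always predict the most common label seen in training."""
    return Counter(labels).most_common(1)[0][0]

# Made-up outcomes for ten observations
labels = ["yes", "no", "no", "yes", "no", "no", "no", "yes", "no", "no"]

folds = k_fold_indices(len(labels), k=5)
scores = []
for test_idx in folds:
    held_out = set(test_idx)
    train = [labels[i] for i in range(len(labels)) if i not in held_out]
    guess = majority_label(train)  # "fit" using the training folds only
    correct = sum(labels[i] == guess for i in test_idx)
    scores.append(correct / len(test_idx))  # accuracy on the held-out fold

mean_accuracy = sum(scores) / len(scores)
print(scores, mean_accuracy)
```

Each observation is held out exactly once, so the averaged score reflects how the model does on data it did not see during training. In practice the observations are usually shuffled before being split into folds; that step is omitted here for simplicity.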
CRISP-DM Step 6: Deployment
If you have successfully identified your questions, prepared data that can answer those questions,
and created a model that passes the test of being interesting and useful, then you have arrived at
the point of actually using your results. This is deployment, and it is a happy and busy time for a data
miner. Activities in this phase include automating your model, meeting with consumers
of your model’s outputs, integrating with existing management or operational information systems,
feeding new learning from model use back into the model to improve its accuracy and
performance, and monitoring and measuring the outcomes of model use. Be prepared for a bit of
distrust of your model at first; you may even face pushback from groups who feel their jobs
are threatened by this new tool, or who do not trust the reliability or accuracy of the outputs. But
don’t let this discourage you! Remember that CBS did not trust the initial predictions of the
UNIVAC, one of the first commercial computer systems, when the network used it to predict the
eventual outcome of the 1952 presidential election on election night. With only 5% of the votes
counted, UNIVAC predicted Dwight D. Eisenhower would defeat Adlai Stevenson in a landslide;