Chapter 1: Introduction to Data Mining and CRISP-DM
Both RapidMiner and OpenOffice provide intuitive graphical user interface environments
that make it easier for general computer-using audiences to experience the power
of data mining.
All examples using OpenOffice or RapidMiner in this book will be illustrated in a Microsoft
Windows environment, although these software packages will work on a variety of computing
platforms. It is recommended that you download and install these two software packages on your
computer now, so that you can work along with the examples in the book if you would like.
OpenOffice can be downloaded from:
http://www.openoffice.org/
RapidMiner Community Edition can be downloaded from:
http://rapid-i.com/content/view/26/84/
THE DATA MINING PROCESS
Although data mining’s roots can be traced back to the late 1980s, for most of the 1990s the field
was still in its infancy. Data mining was still being defined, and refined. It was largely a loose
conglomeration of data models, analysis algorithms, and ad hoc outputs. In 1999, several sizeable
companies including auto maker Daimler-Benz, insurance provider OHRA, hardware and software
manufacturer NCR Corp. and statistical software maker SPSS, Inc. began working together to
formalize and standardize an approach to data mining. The result of their work was CRISP-DM, the
CRoss-Industry Standard Process for Data Mining. Although the participants in the creation of
CRISP-DM certainly had vested interests in certain software and hardware tools, the process was
designed independent of any specific tool. It was written in such a way as to be conceptual in
nature: something that could be applied independent of any certain tool or kind of data. The
process consists of six steps or phases, as illustrated in Figure 1-1.
Data Mining for the Masses
Figure 1-1: CRISP-DM Conceptual Model.
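For readers who like to see a process written down as data, the six phases in Figure 1-1 can be sketched as an ordered enumeration. This is only an illustrative sketch, not part of CRISP-DM itself: the names CrispDmPhase and next_phase are hypothetical, the phase names follow the labels in the figure, and the helper models only the simple forward order of the phases.

```python
from enum import IntEnum


class CrispDmPhase(IntEnum):
    """The six CRISP-DM phases, numbered as in Figure 1-1."""
    ORGANIZATIONAL_UNDERSTANDING = 1  # labeled "Business Understanding" in the figure
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(phase):
    """Return the phase that follows `phase`, or None after Deployment.

    Hypothetical helper: it models only the forward order of the phases
    and ignores any returning to an earlier phase.
    """
    if phase is CrispDmPhase.DEPLOYMENT:
        return None
    return CrispDmPhase(phase + 1)


# Print the phases in order, one per line.
for phase in CrispDmPhase:
    print(phase.value, phase.name.replace("_", " ").title())
```

Numbering the phases makes their order explicit, which is the main point of the figure: each phase builds on the work done in the one before it.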
CRISP-DM Step 1: Business (Organizational) Understanding
The first step in CRISP-DM is Business Understanding, or what will be referred to in this text as
Organizational Understanding, since organizations of all kinds, not just businesses, can use data
mining to answer questions and solve problems. This step is crucial to a successful data mining
outcome, yet is often overlooked as folks try to dive right into mining their data. This is natural,
of course: we are often anxious to generate some interesting output; we want to find answers. But
you wouldn’t begin building a car without first defining what you want the vehicle to do, and
without first designing what you are going to build. Consider these oft-quoted lines from Lewis
Carroll’s Alice’s Adventures in Wonderland:
"Would you tell me, please, which way I ought to go from here?"
"That depends a good deal on where you want to get to," said the Cat.
"I don’t much care where--" said Alice.
"Then it doesn’t matter which way you go," said the Cat.
"--so long as I get SOMEWHERE," Alice added as an explanation.
"Oh, you’re sure to do that," said the Cat, "if you only walk long enough."
Indeed. You can mine data all day long and into the night, but if you don’t know what you want to
know, if you haven’t defined any questions to answer, then the efforts of your data mining are less
likely to be fruitful. Start with high-level ideas: What is making my customers complain so much?
[Figure 1-1 labels: 1. Business Understanding; 2. Data Understanding; 3. Data Preparation; 4. Modeling; 5. Evaluation; 6. Deployment; Data]
How can I increase my per-unit profit margin? How can I anticipate and fix manufacturing flaws
and thus avoid shipping a defective product? From there, you can begin to develop the more
specific questions you want to answer, and this will enable you to proceed to …
CRISP-DM Step 2: Data Understanding
As with Organizational Understanding, Data Understanding is a preparatory activity, and sometimes
its value is lost on people. Don’t let its value be lost on you! Years ago, when workers did not
have their own computer (or multiple computers) sitting on their desk (or lap, or in their pocket),
data were centralized. If you needed information from a company’s data store, you could request a
report from someone who could query that information from a central database (or fetch it from a
company filing cabinet) and provide the results to you. The inventions of the personal computer,
workstation, laptop, tablet computer and even smartphone have each triggered moves away from data
centralization. As hard drives became simultaneously larger and cheaper, and as software like
Microsoft Excel and Access became increasingly accessible and easier to use, data began to disperse
across the enterprise. Over time, valuable data stores became strewn across hundreds and even
thousands of devices, sequestered in marketing managers’ spreadsheets, customer support databases,
and human resources file systems.
As you can imagine, this has created a multi-faceted data problem. Marketing may have wonderful
data that could be a valuable asset to senior management, but senior management may not be aware
of the data’s existence, either because of territorialism on the part of the marketing department,
or because the marketing folks simply haven’t thought to tell the executives about the data
they’ve gathered. The same could be said of the information sharing, or lack thereof, between
almost any two business units in an organization. In Corporate America lingo, the term ‘silos’ is
often invoked to describe the separation of units to the point where interdepartmental sharing and
communication are almost non-existent. It is unlikely that effective organizational data mining
can occur when employees do not know what data they have (or could have) at their disposal or
where those data are currently located. In chapter two we will take a closer look at some
mechanisms that organizations are using to try to bring all their data into a common location.
These include databases, data marts and data warehouses.
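To make the idea of a common location a little more concrete, here is a minimal sketch of pulling records from two departmental "silos" into a single table. Everything in it is assumed for illustration: the customer IDs, the events, the table name customer_events and the function centralize are all hypothetical, and an in-memory SQLite database stands in for whatever central database an organization might really use.

```python
import sqlite3

# Hypothetical records as they might sit in two separate departmental files.
marketing_rows = [("C001", "newsletter signup"), ("C002", "coupon redeemed")]
support_rows = [("C002", "late delivery complaint")]


def centralize(sources):
    """Copy rows from each named source into one in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_events (customer_id TEXT, event TEXT, source TEXT)"
    )
    for source_name, rows in sources.items():
        # Record which department each row came from alongside the data itself.
        conn.executemany(
            "INSERT INTO customer_events VALUES (?, ?, ?)",
            [(customer_id, event, source_name) for customer_id, event in rows],
        )
    conn.commit()
    return conn


conn = centralize({"marketing": marketing_rows, "support": support_rows})

# With the data in one place, a single query can see across both departments.
events_for_c002 = conn.execute(
    "SELECT event, source FROM customer_events "
    "WHERE customer_id = ? ORDER BY source",
    ("C002",),
).fetchall()
print(events_for_c002)
```

Keeping a source column is a deliberate choice in this sketch: once data from several departments share one table, knowing where each row came from is the first step toward answering questions about its origin and reliability.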
Simply centralizing data is not enough, however. There are plenty of questions that arise once an
organization’s data have been corralled. Where did the data come from? Who collected them and