Data Mining
for the Masses
ACKNOWLEDGEMENTS
I would not have had the expertise to write this book if not for the assistance of many colleagues at
various institutions. I would like to acknowledge Drs. Thomas Hilton and Jean Pratt, formerly of
Utah State University and now of the University of Wisconsin-Eau Claire, who served as my Master’s
degree advisors. I would also like to acknowledge Drs. Terence Ahern and Sebastian Diaz of West
Virginia University, who served as doctoral advisors to me.
I express my sincere and heartfelt gratitude for the assistance of Dr. Simon Fischer and the rest of
the team at Rapid-I. I thank them for their excellent work on the RapidMiner software product
and for their willingness to share their time and expertise with me on my visit to Dortmund.
Finally, I am grateful to the Kenneth M. Mason, Sr. Faculty Research Fund and Washington &
Jefferson College for providing financial support for my work on this text.
Chapter 1: Introduction
to Data Mining and CRISP-DM
CHAPTER ONE:
INTRODUCTION TO DATA MINING AND CRISP-DM
INTRODUCTION
Data mining as a discipline is largely transparent to the world. Most of the time, we never even
notice that it’s happening. But whenever we sign up for a grocery store shopping card, make a
purchase using a credit card, or surf the Web, we are creating data. These data are stored in large
sets on powerful computers owned by the companies we deal with every day. Lying within those
data sets are patterns—indicators of our interests, our habits, and our behaviors. Data mining
allows people to locate and interpret those patterns, helping them make better informed decisions
and better serve their customers. That being said, there are also concerns about the practice of
data mining. Privacy watchdog groups in particular are vocal about organizations that amass vast
quantities of data, some of which can be very personal in nature.
The intent of this book is to introduce you to concepts and practices common in data mining. It is
intended primarily for undergraduate college students and for business professionals who may be
interested in using information systems and technologies to solve business problems by mining
data, but who likely do not have a formal background or education in computer science. Although
data mining is the fusion of applied statistics, logic, artificial intelligence, machine learning and data
management systems, you are not required to have a strong background in these fields to use this
book. While having taken introductory college-level courses in statistics and databases will be
helpful, care has been taken to explain within this book the concepts and techniques
required to learn how to mine data.
Each chapter in this book will explain a data mining concept or technique. You should understand
that the book is not designed to be an instruction manual or tutorial for the tools we will use
(RapidMiner and OpenOffice Base and Calc). These software packages are capable of many types
of data analysis, and this text is not intended to cover all of their capabilities, but rather, to
illustrate how these software tools can be used to perform certain kinds of data mining. The book
is also not exhaustive; it includes a variety of common data mining techniques, but RapidMiner in
particular is capable of many, many data mining tasks that are not covered in the book.
The chapters will all follow a common format. First, chapters will present a scenario referred to as
Context and Perspective. This section will help you to gain a real-world idea about a certain kind of
problem that data mining can help solve. It is intended to help you think of ways that the data
mining technique in that given chapter can be applied to organizational problems you might face.
Following Context and Perspective, a set of Learning Objectives is offered. The idea behind this section
is that each chapter is designed to teach you something new about data mining. By listing the
objectives at the beginning of the chapter, you will have a better idea of what you should expect to
learn by reading it. The chapter will follow with several sections addressing the chapter’s topic. In
these sections, step-by-step examples will frequently be given to enable you to work alongside an
actual data mining task. Finally, after the main concepts of the chapter have been delivered, each
chapter will conclude with a Chapter Summary, a set of Review Questions to help reinforce the main
points of the chapter, and one or more Exercises to allow you to try your hand at applying what was
taught in the chapter.
A NOTE ABOUT TOOLS
There are many software tools designed to facilitate data mining; however, many of these are
expensive and complicated to install, configure, and use. Simply put, they’re not a good fit for
learning the basics of data mining. This book will use OpenOffice Calc and Base in conjunction
with an open source software product called RapidMiner, developed by Rapid-I, GmbH of
Dortmund, Germany. Because OpenOffice is widely available and very intuitive, it is a logical
place to begin teaching introductory level data mining concepts. However, it lacks some of the
tools data miners like to use. RapidMiner is an ideal complement to OpenOffice, and was selected
for this book for several reasons:
RapidMiner provides specific data mining functions not currently found in OpenOffice,
such as decision trees and association rules, which you will learn to use later in this book.
RapidMiner is easy to install and will run on just about any computer.
RapidMiner’s maker provides a Community Edition of its software, making it free for
readers to obtain and use.