Chapter 2: Organizational Understanding
and Data Understanding
have an ethical obligation to protect these individuals’ rights. This requires the utmost care in
terms of information security. Simply because a government representative or contractor asks for
data does not mean it should be given.
Beyond technological security, however, we must also consider our moral obligation to those
individuals behind the numbers. Recall the grocery store shopping card example given at the
beginning of this chapter. In order to encourage use of frequent shopper cards, grocery stores
frequently list two prices for items, one with use of the card and one without. The answer to the
following question may vary from person to person, but answer it for yourself: At what price mark-up has
the grocery store crossed an ethical line between encouraging consumers to participate in frequent
shopper programs and forcing them to participate in order to afford to buy groceries? Again, your
answer may differ from others’, but it is important to keep such moral obligations in mind when
gathering, storing and mining data.
The objectives hoped for through data mining activities should never justify unethical means of
achievement. Data mining can be a powerful tool for customer relationship management,
marketing, operations management, and production; however, in all cases the human element must
be kept sharply in focus. When working long hours at a data mining task, interacting primarily
with hardware, software, and numbers, it can be easy to forget about the people, which is why the
point is emphasized here.
CHAPTER SUMMARY
This chapter has introduced you to the discipline of data mining. Data mining brings statistical
and logical methods of analysis to large data sets for the purposes of describing them and using
them to create predictive models. Databases, data warehouses and data sets are distinct kinds of
digital record-keeping systems, though they share many similarities. Data mining is generally
most effectively executed on data sets extracted from OLAP, rather than OLTP, systems.
Both operational data and organizational data provide good starting points for data mining
activities; however, both come with their own issues that may inhibit quality data mining activities.
These should be mitigated before beginning to mine the data. Finally, when mining data, it is
critical to remember the human factor behind manipulation of numbers and figures. Data miners
have an ethical responsibility to the individuals whose lives may be affected by the decisions that
are made as a result of data mining activities.
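To make the OLTP/OLAP distinction concrete, the following is a minimal sketch of how normalized, transaction-style tables might be flattened into a single denormalized data set of the kind a data miner would typically work with. The table names, column names and sample rows are hypothetical, and an in-memory SQLite database stands in for whatever system an organization actually uses.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")

# Hypothetical normalized (OLTP-style) tables: each fact is stored once
# and related to other tables through key columns.
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, description TEXT, price REAL);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        product_id INTEGER REFERENCES products(product_id),
        quantity INTEGER
    );
    INSERT INTO customers VALUES (1, 'Ana', 'Boise'), (2, 'Raj', 'Tulsa');
    INSERT INTO products VALUES (10, 'Milk', 2.50), (11, 'Bread', 3.00);
    INSERT INTO orders VALUES (100, 1, 10, 2), (101, 1, 11, 1), (102, 2, 10, 4);
""")

# Denormalize: join the tables so every row carries all of its attributes.
# Customer and product details are repeated across rows, which is wasteful
# for transaction processing but convenient for analysis.
flat_rows = conn.execute("""
    SELECT o.order_id, c.name, c.city, p.description, p.price, o.quantity,
           p.price * o.quantity AS line_total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    JOIN products AS p ON p.product_id = o.product_id
    ORDER BY o.order_id
""").fetchall()

for row in flat_rows:
    print(row)

conn.close()
```

Each resulting row is a self-contained observation, so the flat result can be handed to a data mining tool without further joins.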
Data Mining for the Masses
REVIEW QUESTIONS
1) What is data mining in general terms?
2) What is the difference between a database, a data warehouse and a data set?
3) What are some of the limitations of data mining? How can we address those limitations?
4) What is the difference between operational and organizational data? What are the pros and
   cons of each?
5) What are some of the ethical issues we face in data mining? How can they be addressed?
6) What is meant by out-of-synch data? How can this situation be remedied?
7) What is normalization? What are some reasons why it is a good thing in OLTP systems,
   but not so good in OLAP systems?
EXERCISES
1) Design a relational database with at least three tables. Be sure to create the columns
   necessary within each table to relate the tables to one another.
2) Design a data warehouse table with some columns which would usually be normalized.
   Explain why it makes sense to denormalize in a data warehouse.
3) Perform an Internet search to find information about data security and privacy. List three
   web sites that you found that provided information that could be applied to data mining.
   Explain how it might be applied.
4) Find a newspaper, magazine or Internet news article related to information privacy or
   security. Summarize the article and explain how it might be related to data mining.
CHAPTER THREE:
DATA PREPARATION
CONTEXT AND PERSPECTIVE
Jerry is the marketing manager for a small Internet design and advertising firm. Jerry’s boss asks
him to develop a data set containing information about Internet users. The company will use this
data to determine what kinds of people are using the Internet and how the firm may be able to
market their services to this group of users.
To accomplish his assignment, Jerry creates an online survey and places links to the survey on
several popular Web sites. Within two weeks, Jerry has collected enough data to begin analysis, but
he finds that his data needs to be denormalized. He also notes that some observations in the set
are missing values or appear to contain invalid values. Jerry realizes that some additional work
on the data needs to take place before analysis begins.
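The kind of check Jerry might run before analysis can be sketched as follows. This is a minimal illustration only: the survey fields, the sample responses and the plausibility ranges are all assumptions made up for the example, not part of Jerry's actual data set.

```python
# Hypothetical survey rows; the field names and values are illustrative only.
survey_rows = [
    {"age": "34", "gender": "F", "hours_online": "12"},
    {"age": "", "gender": "M", "hours_online": "5"},      # missing age
    {"age": "29", "gender": "M", "hours_online": "-3"},   # invalid: negative hours
    {"age": "212", "gender": "F", "hours_online": "8"},   # invalid: implausible age
]

# Assumed plausibility ranges; a real project would set these from domain knowledge.
VALID_RANGES = {"age": (0, 120), "hours_online": (0, 168)}


def audit_rows(rows, valid_ranges):
    """Count missing and out-of-range values for each numeric field."""
    report = {field: {"missing": 0, "invalid": 0} for field in valid_ranges}
    for row in rows:
        for field, (low, high) in valid_ranges.items():
            raw = (row.get(field) or "").strip()
            if raw == "":
                report[field]["missing"] += 1
                continue
            try:
                value = float(raw)
            except ValueError:
                # Text that cannot be read as a number counts as invalid.
                report[field]["invalid"] += 1
                continue
            if not low <= value <= high:
                report[field]["invalid"] += 1
    return report


report = audit_rows(survey_rows, VALID_RANGES)
for field, counts in report.items():
    print(field, counts)
```

A report like this only locates the problem observations; deciding whether to remove, replace or keep them is a separate judgment.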
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain the concept and purpose of data scrubbing
List possible solutions for handling missing data
Explain the role and perform basic methods for data reduction
Define and handle inconsistent data
Discuss the importance and process of attribute reduction
APPLYING THE CRISP DATA MINING MODEL
Recall from Chapter 1 that the CRISP Data Mining methodology requires three phases before any
actual data mining models are constructed. In the Context and Perspective paragraphs above, Jerry