status of such a “discovery”? What information is it based on? Under what
conditions was that information collected? In what ways is it ethical to use it?
Clearly, insurance companies are in the business of discriminating among
people based on stereotypes—young males pay heavily for automobile insur-
ance—but such stereotypes are not based solely on statistical correlations; they
also involve common-sense knowledge about the world. Whether the preceding
finding says something about the kind of person who chooses a red car, or
whether it should be discarded as an irrelevancy, is a matter for human
judgment based on knowledge of the world rather than on purely statistical
criteria.
When presented with data, you need to ask who is permitted to have access
to it, for what purpose it was collected, and what kinds of conclusions it is
legitimate to draw from it. The ethical dimension raises tough questions for those
involved in practical data mining. It is necessary to consider the norms of the
community that is used to dealing with the kind of data involved, standards that
may have evolved over decades or centuries but ones that may not be known to
the information specialist. For example, did you know that in the library com-
munity, it is taken for granted that the privacy of readers is a right that is
jealously protected? If you call your university library and ask who has such-
and-such a textbook out on loan, they will not tell you. This prevents a student
from being subjected to pressure from an irate professor to yield access to a book
that she desperately needs for her latest grant application. It also prohibits
enquiry into the dubious recreational reading tastes of the university ethics
committee chairman. Those who build, say, digital libraries may not be aware
of these sensitivities and might incorporate data mining systems that analyze
and compare individuals’ reading habits to recommend new books—perhaps
even selling the results to publishers!
In addition to community standards for the use of data, logical and scientific
standards must be adhered to when drawing conclusions from it. If you do come
up with conclusions (such as red car owners being greater credit risks), you need
to attach caveats to them and back them up with arguments other than purely
statistical ones. The point is that data mining is just a tool in the whole process:
it is people who take the results, along with other knowledge, and decide what
action to apply.
Data mining prompts another question, which is really a political one: to
what use are society’s resources being put? We mentioned previously the appli-
cation of data mining to basket analysis, where supermarket checkout records
are analyzed to detect associations among items that people purchase. What use
should be made of the resulting information? Should the supermarket manager
place the beer and chips together, to make it easier for shoppers, or farther apart,
making it less convenient for them, maximizing their time in the store, and
therefore increasing their likelihood of being drawn into unplanned further
purchases? Should the manager move the most expensive, most profitable
diapers near the beer, increasing sales of a high-margin item to harried
fathers, and add further luxury baby products nearby?
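To make the idea of detecting associations among purchased items concrete, here is a minimal sketch in Python. The baskets, the item names, and the function name pair_statistics are all hypothetical and chosen only for illustration; the measures computed (support, confidence, and lift) are the standard ones used to describe item associations, and this is not presented as the method any particular supermarket uses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout records: each set is one shopper's basket.
baskets = [
    {"beer", "chips", "diapers"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"chips", "milk"},
    {"beer", "chips", "bread"},
]


def pair_statistics(baskets):
    """Return {(x, y): (support, confidence, lift)} for each ordered item pair.

    support    = fraction of baskets containing both x and y
    confidence = fraction of baskets containing x that also contain y
    lift       = confidence divided by the overall fraction of baskets with y
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        # Record the association in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            stats[(x, y)] = (support, confidence, lift)
    return stats


stats = pair_statistics(baskets)
support, confidence, lift = stats[("beer", "chips")]
print(f"beer -> chips: support={support:.3f}, "
      f"confidence={confidence:.3f}, lift={lift:.3f}")
```

A lift above 1 says only that the two items turn up together more often than independence would predict. What the manager should do with that fact, if anything, remains the question of judgment raised above.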
Of course, anyone who uses advanced technologies should consider the
wisdom of what they are doing. If data is characterized as recorded facts, then
information is the set of patterns, or expectations, that underlie the data. You
could go on to define knowledge as the accumulation of your set of expectations
and wisdom as the value attached to knowledge. Although we will not pursue it
further here, this issue is worth pondering.
As we saw at the very beginning of this chapter, the techniques described in
this book may be called upon to help make some of the most profound and
intimate decisions that life presents. Data mining is a technology that we need
to take seriously.
1.7 Further reading
To avoid breaking up the flow of the main text, all references are collected in a
section at the end of each chapter. This first Further reading section describes
papers, books, and other resources relevant to the material covered in Chapter
1. The human in vitro fertilization research mentioned in the opening to this
chapter was undertaken by the Oxford University Computing Laboratory,
and the research on cow culling was performed in the Computer Science
Department at the University of Waikato, New Zealand.
The example of the weather problem is from Quinlan (1986) and has been
widely used to explain machine learning schemes. The corpus of example prob-
lems mentioned in the introduction to Section 1.2 is available from Blake et al.
(1998). The contact lens example is from Cendrowska (1987), who introduced
the PRISM rule-learning algorithm that we will encounter in Chapter 4. The iris
dataset was described in a classic early paper on statistical inference (Fisher
1936). The labor negotiations data is from the Collective bargaining review, a
publication of Labour Canada issued by the Industrial Relations Information
Service (BLI 1988), and the soybean problem was first described by Michalski
and Chilausky (1980).
Some of the applications in Section 1.3 are covered in an excellent paper that
gives plenty of other applications of machine learning and rule induction
(Langley and Simon 1995); another source of fielded applications is a special
issue of the Machine Learning Journal (Kohavi and Provost 1998). The loan
company application is described in more detail by Michie (1989), the oil slick
detector is from Kubat et al. (1998), the electric load forecasting work is by
Jabbour et al. (1988), and the application to preventative maintenance of
electromechanical devices is from Saitta and Neri (1998). Fuller descriptions