split early on using a particular attribute might turn out later to be ill considered in light of how the tree develops below that node. To get around these problems, a beam search could be used in which irrevocable commitments are not made; instead a set of several active alternatives, whose number is the beam width, is pursued in parallel. This complicates the learning algorithm considerably but has the potential to avoid the myopia associated with a greedy search. Of course, if the beam width is not large enough, myopia may still occur. There are more complex search strategies that help to overcome this problem.
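To make the idea concrete, here is a minimal sketch of beam search in Python; the successors and score functions are hypothetical placeholders for whatever refinement operator and evaluation measure a particular learner supplies.

    def beam_search(initial, successors, score, beam_width=3, steps=10):
        # Maintain several active alternatives instead of committing
        # irrevocably to the single best choice at each step.
        beam = [initial]
        for _ in range(steps):
            candidates = [s for state in beam for s in successors(state)]
            if not candidates:
                break
            # Keep only the best beam_width alternatives for the next round.
            beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        return max(beam, key=score)

With beam_width=1 this degenerates into the greedy search just described; a wider beam reduces the chance of myopia at the cost of more work per step.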
A more general and higher-level kind of search bias concerns whether the
search is done by starting with a general description and refining it, or by
starting with a specific example and generalizing it. The former is called a
general-to-specific search bias; the latter a
specific-to-general one. Many learning
algorithms adopt the former policy, starting with an empty decision tree, or a
very general rule, and specializing it to fit the examples. However, it is perfectly
possible to work in the other direction. Instance-based methods start with a
particular example and see how it can be generalized to cover nearby examples
in the same class.
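The two directions can be sketched as refinement operators. In this hypothetical Python fragment, a rule is simply a dictionary of attribute tests; neither function is taken from any particular learner.

    def specialize(rule, example):
        # General-to-specific: add one attribute test, copied from a
        # training example, to a rule that starts out empty (and is
        # therefore maximally general).
        for attribute, value in example.items():
            if attribute not in rule:
                yield {**rule, attribute: value}

    def generalize(rule):
        # Specific-to-general: drop one test from a rule that starts out
        # as a complete example (maximally specific), so that it covers
        # more of the nearby examples.
        for attribute in rule:
            yield {test: value for test, value in rule.items()
                   if test != attribute}

A general-to-specific learner repeatedly applies specialize until the rule excludes the negative examples; an instance-based learner works the other way, applying generalize to a seed example for as long as the rule stays within a single class.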
Overfitting-avoidance bias
Overfitting-avoidance bias is often just another kind of search bias. But because
it addresses a rather special problem, we treat it separately. Recall the disjunction problem described previously. The problem is that if disjunction is allowed,
useless concept descriptions that merely summarize the data become possible,
whereas if it is prohibited, some concepts are unlearnable. To get around this
problem, it is common to search the concept space starting with the simplest
concept descriptions and proceeding to more complex ones: simplest-first
ordering. This biases the search toward simple concept descriptions.
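As a sketch, simplest-first search amounts to enumerating candidates in order of increasing complexity and stopping at the first acceptable fit; the enumeration and the fit test below are placeholders for whatever a particular learner supplies.

    def simplest_first(candidates_by_complexity, fits_well_enough):
        # candidates_by_complexity yields concept descriptions from the
        # simplest to the most complex, so the first acceptable one found
        # is also the simplest acceptable one.
        for description in candidates_by_complexity:
            if fits_well_enough(description):
                return description
        return None    # nothing in the space fits the data acceptably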
Using a simplest-first search and stopping when a sufficiently complex
concept description is found is a good way of avoiding overfitting. It is sometimes called forward pruning or prepruning because complex descriptions are pruned away before they are reached. The alternative, backward pruning or postpruning, is also viable. Here, we first find a description that fits the data well and
then prune it back to a simpler description that also fits the data. This is not as
redundant as it sounds: often the only way to arrive at a simple theory is to find
a complex one and then simplify it. Forward and backward pruning are both a
kind of overfitting-avoidance bias.
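The contrast can be sketched for decision trees. This hypothetical fragment is not drawn from any particular system: the prepruning test rejects a split whose estimated gain is too small, so the complex subtree is never grown at all, whereas postpruning runs over the finished tree and collapses any subtree whose replacement by a leaf does not increase the estimated error (ideally measured on held-out data).

    class Node:
        def __init__(self, prediction, left=None, right=None):
            self.prediction = prediction    # majority class at this node
            self.left, self.right = left, right

    def worth_splitting(gain, threshold=0.01):
        # Prepruning: decided before the split is made; a low-gain split
        # is refused, so the complex description is never reached.
        return gain >= threshold

    def postprune(node, error):
        # Postpruning: applied after the full tree has been grown; replace
        # a subtree by a leaf whenever the leaf's estimated error is no
        # worse than the subtree's.
        if node.left is None:               # a leaf: nothing to prune
            return node
        node.left = postprune(node.left, error)
        node.right = postprune(node.right, error)
        as_leaf = Node(node.prediction)
        return as_leaf if error(as_leaf) <= error(node) else node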
In summary, although generalization as search is a nice way to think about
the learning problem, bias is the only way to make it feasible in practice. Different learning algorithms correspond to different concept description spaces
searched with different biases. This is what makes it interesting: different
description languages and biases serve some problems well and other problems
badly. There is no universal “best” learning method—as every teacher knows!
1.6 Data mining and ethics
The use of data—particularly data about people—for data mining has serious
ethical implications, and practitioners of data mining techniques must act
responsibly by making themselves aware of the ethical issues that surround their
particular application.
When applied to people, data mining is frequently used to discriminate—
who gets the loan, who gets the special offer, and so on. Certain kinds of
discrimination—racial, sexual, religious, and so on—are not only unethical
but also illegal. However, the situation is complex: everything depends on the
application. Using sexual and racial information for medical diagnosis is
certainly ethical, but using the same information when mining loan payment
behavior is not. Even when sensitive information is discarded, there is a risk
that models will be built that rely on variables that can be shown to substitute
for racial or sexual characteristics. For example, people frequently live in
areas that are associated with particular ethnic identities, so using an area
code in a data mining study runs the risk of building models that are based on
race—even though racial information has been explicitly excluded from the
data.
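One pragmatic check for such proxy variables, sketched here with hypothetical train and accuracy functions rather than any specific library, is to see how well the features that remain can predict the sensitive attribute that was discarded.

    def proxy_risk(features, sensitive_attribute, train, accuracy):
        # Fit a model that predicts the excluded sensitive attribute
        # (e.g., race) from the features actually used for mining, such
        # as area code. High accuracy, ideally measured on held-out data,
        # signals that the features encode the sensitive information
        # indirectly and a model built from them can still discriminate.
        model = train(features, sensitive_attribute)
        return accuracy(model, features, sensitive_attribute)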
It is widely accepted that before people make a
decision to provide personal
information they need to know how it will be used and what it will be used for,
what steps will be taken to protect its confidentiality and integrity, what the consequences of supplying or withholding the information are, and any rights of
redress they may have. Whenever such information is collected, individuals
should be told these things—not in legalistic small print but straightforwardly
in plain language they can understand.
The potential use of data mining techniques means that the ways in which a
repository of data can be used may stretch far beyond what was conceived when
the data was originally collected. This creates a serious problem: it is necessary
to determine the conditions under which the data was collected and for what
purposes it may be used. Does the ownership of data bestow the right to use it in ways other than those stated when it was originally recorded? Clearly in
the case of explicitly collected personal data it does not. But in general the
situation is complex.
Surprising things emerge from data mining. For example, it has been
reported that one of the leading consumer groups in France has found that
people with red cars are more likely to default on their car loans. What is the