split early on using a particular attribute might turn out later to be ill considered in light of how the tree develops below that node. To get around these problems, a beam search could be used in which irrevocable commitments are not made; instead a set of several active alternatives, whose number is the beam width, is pursued in parallel. This complicates the learning algorithm considerably but has the potential to avoid the myopia associated with a greedy search. Of course, if the beam width is not large enough, myopia may still occur. There are more complex search strategies that help to overcome this problem.
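To make the idea concrete, here is a minimal sketch of beam search in Python; the successors and score functions are hypothetical placeholders for whatever refinement operator and evaluation measure a particular learner supplies.

    def beam_search(initial, successors, score, beam_width=3, steps=10):
        # Maintain several active alternatives instead of committing
        # irrevocably to the single best choice at each step.
        beam = [initial]
        for _ in range(steps):
            candidates = [s for state in beam for s in successors(state)]
            if not candidates:
                break
            # Keep only the best beam_width alternatives for the next round.
            beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        return max(beam, key=score)

With beam_width=1 this degenerates into the greedy search just described; a wider beam reduces the chance of myopia at the cost of more work per step.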
A more general and higher-level kind of search bias concerns whether the
search is done by starting with a general description and refining it, or by
starting with a specific example and generalizing it. The former is called a
general-to-specific search bias; the latter a
specific-to-general one. Many learning
algorithms adopt the former policy, starting with an empty decision tree, or a
very general rule, and specializing it to fit the examples. However, it is perfectly
possible to work in the other direction. Instance-based methods start with a
particular example and see how it can be generalized to cover nearby examples
in the same class.
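The two directions can be sketched as refinement operators. In this hypothetical Python fragment, a rule is simply a dictionary of attribute tests; neither function is taken from any particular learner.

    def specialize(rule, example):
        # General-to-specific: add one attribute test, copied from a
        # training example, to a rule that starts out empty (and is
        # therefore maximally general).
        for attribute, value in example.items():
            if attribute not in rule:
                yield {**rule, attribute: value}

    def generalize(rule):
        # Specific-to-general: drop one test from a rule that starts out
        # as a complete example (maximally specific), so that it covers
        # more of the nearby examples.
        for attribute in rule:
            yield {test: value for test, value in rule.items()
                   if test != attribute}

A general-to-specific learner repeatedly applies specialize until the rule excludes the negative examples; an instance-based learner works the other way, applying generalize to a seed example for as long as the rule stays within a single class.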
Overfitting-avoidance bias
Overfitting-avoidance bias is often just another kind of search bias. But because
it addresses a rather special problem, we treat it separately. Recall the disjunction problem described previously. The problem is that if disjunction is allowed,
useless concept descriptions that merely summarize the data become possible,
whereas if it is prohibited, some concepts are unlearnable. To get around this
problem, it is common to search the concept space starting with the simplest
concept descriptions and proceeding to more complex ones: simplest-first
ordering. This biases the search toward simple concept descriptions.
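As a sketch, simplest-first search amounts to enumerating candidates in order of increasing complexity and stopping at the first acceptable fit; the enumeration and the fit test below are placeholders for whatever a particular learner supplies.

    def simplest_first(candidates_by_complexity, fits_well_enough):
        # candidates_by_complexity yields concept descriptions from the
        # simplest to the most complex, so the first acceptable one found
        # is also the simplest acceptable one.
        for description in candidates_by_complexity:
            if fits_well_enough(description):
                return description
        return None    # nothing in the space fits the data acceptably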
Using a simplest-first search and stopping when a sufficiently complex
concept description is found is a good way of avoiding overfitting. It is sometimes called forward pruning or prepruning because complex descriptions are pruned away before they are reached. The alternative, backward pruning or postpruning, is also viable. Here, we first find a description that fits the data well and
then prune it back to a simpler description that also fits the data. This is not as
redundant as it sounds: often the only way to arrive at a simple theory is to find
a complex one and then simplify it. Forward and backward pruning are both a
kind of overfitting-avoidance bias.
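The contrast can be sketched for decision trees. This hypothetical fragment is not drawn from any particular system: the prepruning test rejects a split whose estimated gain is too small, so the complex subtree is never grown at all, whereas postpruning runs over the finished tree and collapses any subtree whose replacement by a leaf does not increase the estimated error (ideally measured on held-out data).

    class Node:
        def __init__(self, prediction, left=None, right=None):
            self.prediction = prediction    # majority class at this node
            self.left, self.right = left, right

    def worth_splitting(gain, threshold=0.01):
        # Prepruning: decided before the split is made; a low-gain split
        # is refused, so the complex description is never reached.
        return gain >= threshold

    def postprune(node, error):
        # Postpruning: applied after the full tree has been grown; replace
        # a subtree by a leaf whenever the leaf's estimated error is no
        # worse than the subtree's.
        if node.left is None:               # a leaf: nothing to prune
            return node
        node.left = postprune(node.left, error)
        node.right = postprune(node.right, error)
        as_leaf = Node(node.prediction)
        return as_leaf if error(as_leaf) <= error(node) else node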
In summary, although generalization as search is a nice way to think about
the learning problem, bias is the only way to make it feasible in practice. Different learning algorithms correspond to different concept description spaces
searched with different biases. This is what makes it interesting: different
description languages and biases serve some problems well and other problems
badly. There is no universal “best” learning method—as every teacher knows!
1.6 Data mining and ethics
The use of data—particularly data about people—for data mining has serious
ethical implications, and practitioners of data mining techniques must act
responsibly by making themselves aware of the ethical issues that surround their
particular application.
When applied to people, data mining is frequently used to discriminate—
who gets the loan, who gets the special offer, and so on. Certain kinds of
discrimination—racial, sexual, religious, and so on—are not only unethical
but also illegal. However, the situation is complex: everything depends on the
application. Using sexual and racial information for medical diagnosis is
certainly ethical, but using the same information when mining loan payment
behavior is not. Even when sensitive information is discarded, there is a risk
that models will be built that rely on variables that can be shown to substitute
for racial or sexual characteristics. For example, people frequently live in
areas that are associated with particular ethnic identities, so using an area
code in a data mining study runs the risk of building models that are based on
race—even though racial information has been explicitly excluded from the
data.
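One pragmatic check for such proxy variables, sketched here with hypothetical train and accuracy functions rather than any specific library, is to see how well the features that remain can predict the sensitive attribute that was discarded.

    def proxy_risk(features, sensitive_attribute, train, accuracy):
        # Fit a model that predicts the excluded sensitive attribute
        # (e.g., race) from the features actually used for mining, such
        # as area code. High accuracy, ideally measured on held-out data,
        # signals that the features encode the sensitive information
        # indirectly and a model built from them can still discriminate.
        model = train(features, sensitive_attribute)
        return accuracy(model, features, sensitive_attribute)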
It is widely accepted that before people make a
decision to provide personal
information they need to know how it will be used and what it will be used for,
what steps will be taken to protect its confidentiality and integrity, what the consequences of supplying or withholding the information are, and any rights of
redress they may have. Whenever such information is collected, individuals
should be told these things—not in legalistic small print but straightforwardly
in plain language they can understand.
The potential use of data mining techniques means that the ways in which a
repository of data can be used may stretch far beyond what was conceived when
the data was originally collected. This creates a serious problem: it is necessary
to determine the conditions under which the data was collected and for what
purposes it may be used. Does the ownership of data bestow the right to use it in ways other than those stated when it was originally recorded? Clearly in
the case of explicitly collected personal data it does not. But in general the
situation is complex.
Surprising things emerge from data mining. For example, it has been
reported that one of the leading consumer groups in France has found that
people with red cars are more likely to default on their car loans. What is the