of the concept that is to be learned. Of course, the instances
are not really independent—there are plenty of relationships among different rows of the table!—
but they are independent as far as the concept of sisterhood is concerned. Most
machine learning schemes will still have trouble dealing with this kind of data,
as we will see in Section 3.6, but at least the problem has been recast into the
right form. A simple rule for the sister-of relation is as follows:
If second person’s gender = female
and first person’s parent1 = second person’s parent1
then sister-of = yes
This example shows how you can take a relationship
between different nodes
of a tree and recast it into a set of independent instances. In database terms, you
take two relations and join them together to make one, a process of flattening
that is technically called denormalization. It is always possible to do this with
any (finite) set of (finite) relations.
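The flattening step can be sketched in code. The following is a minimal illustration, not the book's own implementation: it self-joins a small fragment of the family-tree relation (hypothetical helper names, data as in Table 2.3) so that every ordered pair of distinct people becomes one independent instance, labeled by the simple sister-of rule above.

```python
# Self-join (denormalization) sketch: two copies of the family-tree
# relation are joined to make one flat table of independent instances.
people = [
    # (name, gender, parent1, parent2); "?" marks an unknown parent
    ("Peter",  "male",   "?",     "?"),
    ("Peggy",  "female", "?",     "?"),
    ("Steven", "male",   "Peter", "Peggy"),
    ("Graham", "male",   "Peter", "Peggy"),
    ("Pam",    "female", "Peter", "Peggy"),
    ("Ian",    "male",   "Grace", "Ray"),
]

def sister_of_instances(rows):
    """Every ordered pair of distinct people becomes one instance,
    labeled by the rule: second person is female and both share parent1."""
    instances = []
    for n1, g1, p1a, p1b in rows:
        for n2, g2, p2a, p2b in rows:
            if n1 == n2:
                continue
            label = g2 == "female" and p1a != "?" and p1a == p2a
            instances.append((n1, n2, "yes" if label else "no"))
    return instances

flat = sister_of_instances(people)
print([(a, b) for a, b, lab in flat if lab == "yes"])
# In this fragment the positive instances are Steven-Pam and Graham-Pam
```

Note that the join produces one row per pair, so the flat table grows quadratically in the number of people—a first hint of the cost of denormalization discussed below.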
Table 2.3   Family tree represented as a table.

    Name     Gender   Parent1   Parent2
    Peter    male     ?         ?
    Peggy    female   ?         ?
    Steven   male     Peter     Peggy
    Graham   male     Peter     Peggy
    Pam      female   Peter     Peggy
    Ian      male     Grace     Ray
    ...

Table 2.4   The sister-of relation represented in a table.

    First person                          Second person
    Name     Gender   Parent1   Parent2   Name     Gender   Parent1   Parent2   Sister of?
    Steven   male     Peter     Peggy     Pam      female   Peter     Peggy     yes
    Graham   male     Peter     Peggy     Pam      female   Peter     Peggy     yes
    Ian      male     Grace     Ray       Pippa    female   Grace     Ray       yes
    Brian    male     Grace     Ray       Pippa    female   Grace     Ray       yes
    Anna     female   Pam       Ian       Nikki    female   Pam       Ian       yes
    Nikki    female   Pam       Ian       Anna     female   Pam       Ian       yes
    all the rest                                                                no

The structure of Table 2.4 can be used to describe any relationship between
two people—grandparenthood, second cousins twice removed, and so on.
Relationships among more people would require a larger table. Relationships in which
the maximum number of people is not specified in advance pose a more serious
problem. If we want to learn the concept of nuclear family (parents and their
children), the number of people involved depends on the size of the largest
nuclear family, and although we could guess at a reasonable maximum (10? 20?),
the actual number could only be found by scanning the tree itself. Nevertheless,
given a finite set of finite relations we could, at least in principle, form a new
“superrelation” that contained one row for every combination of people, and this
would be enough to express any relationship between people no matter how many
were involved. The computational and storage costs would, however, be prohibitive.
Another problem with denormalization is that it produces apparent regularities
in the data that are completely spurious and are in fact merely reflections of
the original database structure. For example, imagine a supermarket database
with a relation for customers and the products they buy, one for products and
their supplier, and one for suppliers and their address. Denormalizing this
will produce a flat file that contains, for each instance, customer, product,
supplier, and supplier address. A database mining tool that seeks structure in
the database may come up with the fact that customers who buy beer also buy
chips, a discovery that could be significant from the supermarket manager’s
point of view. However, it may also come up with the fact that supplier address
can be predicted exactly from supplier—a “discovery” that will not impress the
supermarket manager at all. This fact masquerades as a significant discovery
from the flat file but is present explicitly in the original database structure.
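The spurious regularity can be made concrete with a tiny sketch. The flat file below is invented for illustration, and the helper function is an assumption, not a real mining tool: it simply checks whether one column functionally determines another, which is exactly the kind of 100%-confident but uninteresting "rule" that denormalization plants in the data.

```python
# Hypothetical denormalized flat file: one row per purchase, with
# supplier address copied in from the supplier relation.
flat_file = [
    # (customer, product, supplier, supplier_address)
    ("Alice", "beer",  "BrewCo",  "1 Hop St"),
    ("Alice", "chips", "SnackCo", "2 Salt Rd"),
    ("Bob",   "beer",  "BrewCo",  "1 Hop St"),
    ("Bob",   "chips", "SnackCo", "2 Salt Rd"),
]

def functionally_determines(rows, i, j):
    """True if column i determines column j in every row (i.e., the
    mapping from column i values to column j values is a function)."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[i], row[j]) != row[j]:
            return False
    return True

print(functionally_determines(flat_file, 2, 3))  # True: supplier -> address
print(functionally_determines(flat_file, 0, 1))  # False: customers buy several products
```

The first "discovery" holds with perfect accuracy yet says nothing about shopping behavior; it merely rediscovers the supplier relation that the join erased.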
Many abstract computational problems involve relations that are not finite,
although clearly any actual set of input instances must be finite. Concepts such
as ancestor-of involve arbitrarily long paths through a tree, and although the
human race, and hence its family tree, may be finite (although prodigiously large),
many artificial problems generate data that truly is infinite. Although it may
sound abstruse, this situation is the norm in areas such as list processing and logic
programming and is addressed in a subdiscipline of machine learning called
inductive logic programming. Computer scientists usually use recursion to deal
with situations in which the number of possible instances is infinite. For example,
If person1 is a parent of person2
then person1 is an ancestor of person2

If person1 is a parent of person2
and person2 is an ancestor of person3
then person1 is an ancestor of person3
is a simple recursive definition of ancestor that works no matter how distantly
two people are related. Techniques of inductive logic programming can learn
recursive rules such as these from a finite set of instances such as those in Table
2.5.
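The recursive definition can be run directly. The sketch below uses a hypothetical parent relation (a dictionary from child to parents, loosely based on the family tree in Table 2.3); it recurses through the parents of the second person, which is logically equivalent to the two rules above.

```python
# Hypothetical parent relation: child -> set of parents.
parents = {
    "Steven": {"Peter", "Peggy"},
    "Pam":    {"Peter", "Peggy"},
    "Anna":   {"Pam", "Ian"},
    "Nikki":  {"Pam", "Ian"},
}

def is_ancestor(person1, person2):
    """Base case: a parent of person2 is an ancestor of person2.
    Recursive case: an ancestor of any parent of person2 is also
    an ancestor of person2."""
    direct = parents.get(person2, set())
    if person1 in direct:
        return True
    return any(is_ancestor(person1, p) for p in direct)

print(is_ancestor("Peter", "Anna"))  # True: Peter -> Pam -> Anna
print(is_ancestor("Anna", "Peter"))  # False
```

Because the definition is recursive, the same three-line function handles parents, grandparents, and arbitrarily distant ancestors without any bound on path length fixed in advance—precisely what a flat table of fixed width cannot express.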