Data Mining: Practical Machine Learning Tools and Techniques, Second Edition



Yüklə 4,3 Mb.
Pdf görüntüsü
səhifə31/219
tarix08.10.2017
ölçüsü4,3 Mb.
#3816
1   ...   27   28   29   30   31   32   33   34   ...   219

of the concept that is to be learned. Of course, the instances are not really inde-

pendent—there are plenty of relationships among different rows of the table!—

but they are independent as far as the concept of sisterhood is concerned. Most

machine learning schemes will still have trouble dealing with this kind of data,

as we will see in Section 3.6, but at least the problem has been recast into the

right form. A simple rule for the sister-of relation is as follows:

If second person’s gender 

= female


and first person’s parent1 

= second person’s parent1

then sister-of 

= yes


This example shows how you can take a relationship between different nodes

of a tree and recast it into a set of independent instances. In database terms, you

take two relations and join them together to make one, a process of flattening

that is technically called denormalization. It is always possible to do this with

any (finite) set of (finite) relations.

The structure of Table 2.4 can be used to describe any relationship between

two people—grandparenthood, second cousins twice removed, and so on. Rela-

2 . 2


W H AT ’ S   I N  A N   E X A M P L E ?

4 7


Table 2.3

Family tree represented as a table.

Name


Gender

Parent1


Parent2

Peter


male

?

?



Peggy

female


?

?

Steven



male

Peter


Peggy

Graham


male

Peter


Peggy

Pam


female

Peter


Peggy

Ian


male

Grace


Ray

. . .


Table 2.4

The sister-of relation represented in a table.

First person

Second person

Name


Gender

Parent1


Parent2

Name


Gender

Parent1


Parent2

Sister of?

Steven

male


Peter

Peggy


Pam

female


Peter

Peggy


yes

Graham


male

Peter


Peggy

Pam


female

Peter


Peggy

yes


Ian

male


Grace

Ray


Pippa

female


Grace

Ray


yes

Brian


male

Grace


Ray

Pippa


female

Grace


Ray

yes


Anna

female


Pam

Ian


Nikki

female


Pam

Ian


yes

Nikki


female

Pam


Ian

Anna


female

Pam


Ian

yes


all the rest

no

P088407-Ch002.qxd  4/30/05  11:10 AM  Page 47




tionships among more people would require a larger table. Relationships in which

the maximum number of people is not specified in advance pose a more serious

problem. If we want to learn the concept of nuclear family (parents and their chil-

dren), the number of people involved depends on the size of the largest nuclear

family, and although we could guess at a reasonable maximum (10? 20?), the

actual number could only be found by scanning the tree itself. Nevertheless, given

a finite set of finite relations we could, at least in principle, form a new “superre-

lation” that contained one row for every combination of people, and this would

be enough to express any relationship between people no matter how many were

involved. The computational and storage costs would, however, be prohibitive.

Another problem with denormalization is that it produces apparent regular-

ities in the data that are completely spurious and are in fact merely reflections

of the original database structure. For example, imagine a supermarket data-

base with a relation for customers and the products they buy, one for products

and their supplier, and one for suppliers and their address. Denormalizing this

will produce a flat file that contains, for each instance, customer, product, sup-

plier, and supplier address. A database mining tool that seeks structure in the

database may come up with the fact that customers who buy beer also buy chips,

a discovery that could be significant from the supermarket manager’s point of

view. However, it may also come up with the fact that supplier address can be

predicted exactly from supplier—a “discovery” that will not impress the super-

market manager at all. This fact masquerades as a significant discovery from the

flat file but is present explicitly in the original database structure.

Many abstract computational problems involve relations that are not finite,

although clearly any actual set of input instances must be finite. Concepts such

as  ancestor-of involve arbitrarily long paths through a tree, and although the

human race, and hence its family tree, may be finite (although prodigiously large),

many artificial problems generate data that truly is infinite. Although it may

sound abstruse, this situation is the norm in areas such as list processing and logic

programming and is addressed in a subdiscipline of machine learning called



inductive logic programming. Computer scientists usually use recursion to deal

with situations in which the number of possible instances is infinite. For example,

If person1 is a parent of person2

then person1 is an ancestor of person2

If person1 is a parent of person2

and person2 is an ancestor of person3

then person1 is an ancestor of person3

is a simple recursive definition of ancestor that works no matter how distantly

two people are related. Techniques of inductive logic programming can learn

recursive rules such as these from a finite set of instances such as those in Table

2.5.

4 8


C H A P T E R   2

|

I N P U T: C O N C E P TS , I N S TA N C E S , A N D   AT T R I BU T E S



P088407-Ch002.qxd  4/30/05  11:10 AM  Page 48


Yüklə 4,3 Mb.

Dostları ilə paylaş:
1   ...   27   28   29   30   31   32   33   34   ...   219




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2024
rəhbərliyinə müraciət

    Ana səhifə