of the concept that is to be learned. Of course, the instances
are not really independent—there are plenty of relationships among different rows of the table!—
but they are independent as far as the concept of sisterhood is concerned. Most
machine learning schemes will still have trouble dealing with this kind of data,
as we will see in Section 3.6, but at least the problem has been recast into the
right form. A simple rule for the sister-of relation is as follows:
If second person’s gender = female
and first person’s parent1 = second person’s parent1
then sister-of = yes
This example shows how you can take a relationship
between different nodes
of a tree and recast it into a set of independent instances. In database terms, you
take two relations and join them together to make one, a process of flattening
that is technically called denormalization. It is always possible to do this with
any (finite) set of (finite) relations.
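The flattening step can be sketched in code. The following is a minimal illustration, not the book's own implementation: it self-joins a small fragment of the family-tree relation (hypothetical helper names, data as in Table 2.3) so that every ordered pair of distinct people becomes one independent instance, labeled by the simple sister-of rule above.

```python
# Self-join (denormalization) sketch: two copies of the family-tree
# relation are joined to make one flat table of independent instances.
people = [
    # (name, gender, parent1, parent2); "?" marks an unknown parent
    ("Peter",  "male",   "?",     "?"),
    ("Peggy",  "female", "?",     "?"),
    ("Steven", "male",   "Peter", "Peggy"),
    ("Graham", "male",   "Peter", "Peggy"),
    ("Pam",    "female", "Peter", "Peggy"),
    ("Ian",    "male",   "Grace", "Ray"),
]

def sister_of_instances(rows):
    """Every ordered pair of distinct people becomes one instance,
    labeled by the rule: second person is female and both share parent1."""
    instances = []
    for n1, g1, p1a, p1b in rows:
        for n2, g2, p2a, p2b in rows:
            if n1 == n2:
                continue
            label = g2 == "female" and p1a != "?" and p1a == p2a
            instances.append((n1, n2, "yes" if label else "no"))
    return instances

flat = sister_of_instances(people)
print([(a, b) for a, b, lab in flat if lab == "yes"])
# In this fragment the positive instances are Steven-Pam and Graham-Pam
```

Note that the join produces one row per pair, so the flat table grows quadratically in the number of people—a first hint of the cost of denormalization discussed below.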
Table 2.3   Family tree represented as a table.

    Name     Gender   Parent1   Parent2
    Peter    male     ?         ?
    Peggy    female   ?         ?
    Steven   male     Peter     Peggy
    Graham   male     Peter     Peggy
    Pam      female   Peter     Peggy
    Ian      male     Grace     Ray
    ...

Table 2.4   The sister-of relation represented in a table.

    First person                          Second person
    Name     Gender   Parent1   Parent2   Name     Gender   Parent1   Parent2   Sister of?
    Steven   male     Peter     Peggy     Pam      female   Peter     Peggy     yes
    Graham   male     Peter     Peggy     Pam      female   Peter     Peggy     yes
    Ian      male     Grace     Ray       Pippa    female   Grace     Ray       yes
    Brian    male     Grace     Ray       Pippa    female   Grace     Ray       yes
    Anna     female   Pam       Ian       Nikki    female   Pam       Ian       yes
    Nikki    female   Pam       Ian       Anna     female   Pam       Ian       yes
    all the rest                                                                no

The structure of Table 2.4 can be used to describe any relationship between
two people—grandparenthood, second cousins twice removed, and so on.
Relationships among more people would require a larger table. Relationships in which
the maximum number of people is not specified in advance pose a more serious
problem. If we want to learn the concept of nuclear family (parents and their
children), the number of people involved depends on the size of the largest
nuclear family, and although we could guess at a reasonable maximum (10? 20?),
the actual number could only be found by scanning the tree itself. Nevertheless,
given a finite set of finite relations we could, at least in principle, form a new
“superrelation” that contained one row for every combination of people, and this
would be enough to express any relationship between people no matter how many
were involved. The computational and storage costs would, however, be prohibitive.
Another problem with denormalization is that it produces apparent regularities
in the data that are completely spurious and are in fact merely reflections of
the original database structure. For example, imagine a supermarket database
with a relation for customers and the products they buy, one for products and
their supplier, and one for suppliers and their address. Denormalizing this
will produce a flat file that contains, for each instance, customer, product,
supplier, and supplier address. A database mining tool that seeks structure in
the database may come up with the fact that customers who buy beer also buy
chips, a discovery that could be significant from the supermarket manager’s
point of view. However, it may also come up with the fact that supplier address
can be predicted exactly from supplier—a “discovery” that will not impress the
supermarket manager at all. This fact masquerades as a significant discovery
from the flat file but is present explicitly in the original database structure.
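The spurious regularity can be made concrete with a tiny sketch. The flat file below is invented for illustration, and the helper function is an assumption, not a real mining tool: it simply checks whether one column functionally determines another, which is exactly the kind of 100%-confident but uninteresting "rule" that denormalization plants in the data.

```python
# Hypothetical denormalized flat file: one row per purchase, with
# supplier address copied in from the supplier relation.
flat_file = [
    # (customer, product, supplier, supplier_address)
    ("Alice", "beer",  "BrewCo",  "1 Hop St"),
    ("Alice", "chips", "SnackCo", "2 Salt Rd"),
    ("Bob",   "beer",  "BrewCo",  "1 Hop St"),
    ("Bob",   "chips", "SnackCo", "2 Salt Rd"),
]

def functionally_determines(rows, i, j):
    """True if column i determines column j in every row (i.e., the
    mapping from column i values to column j values is a function)."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[i], row[j]) != row[j]:
            return False
    return True

print(functionally_determines(flat_file, 2, 3))  # True: supplier -> address
print(functionally_determines(flat_file, 0, 1))  # False: customers buy several products
```

The first "discovery" holds with perfect accuracy yet says nothing about shopping behavior; it merely rediscovers the supplier relation that the join erased.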
Many abstract computational problems involve relations that are not finite,
although clearly any actual set of input instances must be finite. Concepts such
as ancestor-of involve arbitrarily long paths through a tree, and although the
human race, and hence its family tree, may be finite (although prodigiously large),
many artificial problems generate data that truly is infinite. Although it may
sound abstruse, this situation is the norm in areas such as list processing and logic
programming and is addressed in a subdiscipline of machine learning called
inductive logic programming. Computer scientists usually use recursion to deal
with situations in which the number of possible instances is infinite. For example,
If person1 is a parent of person2
then person1 is an ancestor of person2

If person1 is a parent of person2
and person2 is an ancestor of person3
then person1 is an ancestor of person3
is a simple recursive definition of ancestor that works no matter how distantly
two people are related. Techniques of inductive logic programming can learn
recursive rules such as these from a finite set of instances such as those in Table
2.5.
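The recursive definition can be run directly. The sketch below uses a hypothetical parent relation (a dictionary from child to parents, loosely based on the family tree in Table 2.3); it recurses through the parents of the second person, which is logically equivalent to the two rules above.

```python
# Hypothetical parent relation: child -> set of parents.
parents = {
    "Steven": {"Peter", "Peggy"},
    "Pam":    {"Peter", "Peggy"},
    "Anna":   {"Pam", "Ian"},
    "Nikki":  {"Pam", "Ian"},
}

def is_ancestor(person1, person2):
    """Base case: a parent of person2 is an ancestor of person2.
    Recursive case: an ancestor of any parent of person2 is also
    an ancestor of person2."""
    direct = parents.get(person2, set())
    if person1 in direct:
        return True
    return any(is_ancestor(person1, p) for p in direct)

print(is_ancestor("Peter", "Anna"))  # True: Peter -> Pam -> Anna
print(is_ancestor("Anna", "Peter"))  # False
```

Because the definition is recursive, the same three-line function handles parents, grandparents, and arbitrarily distant ancestors without any bound on path length fixed in advance—precisely what a flat table of fixed width cannot express.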