HAN
10-ch03-083-124-9780123814791
2011/6/1
3:16
Page 118
#36
118
Chapter 3 Data Preprocessing
form a hierarchy at the schema level, a user could define some intermediate levels
manually, such as “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and
“{British Columbia, prairies Canada} ⊂ Western Canada.”
3.
Specification of a
set of attributes, but not of their partial ordering: A user may
specify a set of attributes forming a concept hierarchy, but omit to explicitly state
their partial ordering. The system can then try to automatically generate the attribute
ordering so as to construct a meaningful concept hierarchy.
“Without knowledge of data semantics, how can a hierarchical ordering for an
arbitrary set of nominal attributes be found?” Consider the observation that since
higher-level concepts generally cover several subordinate lower-level concepts, an
attribute defining a high concept level (e.g., country) will usually contain a smaller
number of distinct values than an attribute defining a lower concept level (e.g.,
street). Based on this observation, a concept hierarchy can be automatically gener-
ated based on the number of distinct values per attribute in the given attribute set.
The attribute with the most distinct values is placed at the lowest hierarchy level. The
lower the number of distinct values an attribute has, the higher it is in the gener-
ated concept hierarchy. This heuristic rule works well in many cases. Some local-level
swapping or adjustments may be applied by users or experts, when necessary, after
examination of the generated hierarchy.
Let’s examine an example of this third method.
Example 3.7
Concept hierarchy generation based on the number of distinct values per attribute.
Suppose a user selects a set of location-oriented attributes—street, country, province
or state, and
city—from the
AllElectronics database, but does not specify the hierarchical
ordering among the attributes.
A concept hierarchy for location can be generated automatically, as illustrated in
Figure 3.13. First, sort the attributes in ascending order based on the number of dis-
tinct values in each attribute. This results in the following (where the number of distinct
values per attribute is shown in parentheses): country (15), province or state (365), city
(3567), and street (674,339). Second, generate the hierarchy from the top down accord-
ing to the sorted order, with the first attribute at the top level and the last attribute at the
bottom level. Finally, the user can examine the generated hierarchy, and when necessary,
modify it to reflect desired semantic relationships among the attributes. In this example,
it is obvious that there is no need to modify the generated hierarchy.
Note that this heuristic rule is not foolproof. For example, a time dimension in a
database may contain 20 distinct years, 12 distinct months, and 7 distinct days of the
week. However, this does not suggest that the time hierarchy should be “year
<
month <
days of the week,” with
days of the week at the top of the hierarchy.
4.
Specification of only a partial set of attributes: Sometimes a user can be careless
when defining a hierarchy, or have only a vague idea about what should be included
in a hierarchy. Consequently, the user may have included only a small subset of the
HAN
10-ch03-083-124-9780123814791
2011/6/1
3:16
Page 119
#37
3.5 Data Transformation and Data Discretization
119
country
15 distinct values
province_or_state
city
street
365 distinct values
3567 distinct values
674,339 distinct values
Figure 3.13
Automatic generation of a schema concept hierarchy based on the number of distinct
attribute values.
relevant attributes in the hierarchy specification. For example, instead of including
all of the hierarchically relevant attributes for location, the user may have specified
only street and city. To handle such partially specified hierarchies, it is important to
embed data semantics in the database schema so that attributes with tight semantic
connections can be pinned together. In this way, the specification of one attribute
may trigger a whole group of semantically tightly linked attributes to be “dragged in”
to form a complete hierarchy. Users, however, should have the option to override this
feature, as necessary.
Example 3.8
Concept hierarchy generation using prespecified semantic connections. Suppose that
a data mining expert (serving as an administrator) has pinned together the five attri-
butes number, street, city, province or state, and country, because they are closely linked
semantically regarding the notion of location. If a user were to specify only the attribute
city for a hierarchy defining
location, the system can automatically drag in all five seman-
tically related attributes to form a hierarchy. The user may choose to drop any of
these attributes (e.g., number and street) from the hierarchy, keeping city as the lowest
conceptual level.
In summary, information at the schema level and on attribute–value counts can be
used to generate concept hierarchies for nominal data. Transforming nominal data with
the use of concept hierarchies allows higher-level knowledge patterns to be found. It
allows mining at multiple levels of abstraction, which is a common requirement for data
mining applications.