HAN
22-ind-673-708-9780123814791
2011/6/1
3:27
Page 700
#28
700
Index
similarity (Continued)
measuring, 65–78, 79
nominal attributes, 70
similarity measures, 447–448, 525–528
constraints on, 533
geodesic distance, 525–526
SimRank, 526–528
similarity searches, 587
in information networks, 594
in multimedia data mining, 596
simple random sample with replacement (SRSWR), 108
simple random sample without replacement (SRSWOR), 108
SimRank, 526–528, 539
computation, 527–528
random walk, 526–528
structural context, 528
simultaneous aggregation, 195
single-dimensional association rules, 17, 287
single-linkage algorithm, 460, 461
singular value decomposition (SVD), 587
skewed data
balanced, 271
negatively, 47
positively, 47
wavelet transforms on, 102
slice operation, 148
small-world phenomenon, 592
smoothing, 112
by bin boundaries, 89
by bin means, 89
by bin medians, 89
for data discretization, 90
snowflake schema, 140
example, 141
illustrated, 141
star schema versus, 140
social networks, 524–525, 526–528
densification power law, 592
evolution of, 594
mining, 623
small-world phenomenon, 592
See also networks
social science/social studies data mining, 613
soft clustering, 501
soft constraints, 534, 539
example, 534
handling, 536–537
space-filling curve, 58
sparse data, 102
sparse data cubes, 190
sparsest cuts, 539
sparsity coefficient, 579
spatial data, 14
spatial data mining, 595
spatiotemporal data analysis, 319
spatiotemporal data mining, 595, 623–624
specialized SQL servers, 165
specificity measure, 367
spectral clustering, 520–522, 539
effectiveness, 522
framework, 521
steps, 520–522
speech recognition, 430
speed, classification, 369
spiral method, 152
split-point, 333, 340, 342
splitting attributes, 333
splitting criterion, 333, 342
splitting rules. See attribute selection measures
splitting subset, 333
SQL, as relational query language, 10
square-error function, 454
squashing function, 403
standard deviation, 51
example, 51
function of, 50
star schema, 139
example, 139–140
illustrated, 140
snowflake schema versus, 140
Star-Cubing, 204–210, 235
algorithm illustration, 209
bottom-up computation, 205
example, 207
for full cube computation, 210
ordering of dimensions and, 210
performance, 210
shared dimensions, 204–205
starnet query model, 149
example, 149–150
star-nodes, 205
star-trees, 205
compressed base table, 207
construction, 205
statistical data mining, 598–600
analysis of variance, 600
discriminant analysis, 600
factor analysis, 600
generalized linear models, 599–600
mixed-effect models, 600
quality control, 600
regression, 599
survival analysis, 600
statistical databases (SDBs), 148
OLAP systems versus, 148–149
statistical descriptions, 24, 79
graphic displays, 44–45, 51–56
measuring the dispersion, 48–51
statistical hypothesis test, 24
statistical models, 23–24
of networks, 592–594
statistical outlier detection methods, 552, 553–560, 581
computational cost of, 560
for data analysis, 625
effectiveness, 552
example, 552
nonparametric, 553, 558–560
parametric, 553–558
See also outlier detection
statistical theory, in exceptional behavior disclosure, 291
statistics, 23
inferential, 24
predictive, 24
StatSoft, 602, 603
stepwise backward elimination, 105
stepwise forward selection, 105
stick figure visualization, 61–63
STING, 479–481
advantages, 480–481
as density-based clustering method, 480
hierarchical structure, 479, 480
multiresolution approach, 481
See also cluster analysis; grid-based methods
stratified cross-validation, 371
stratified samples, 109–110
stream data, 598, 624
strong association rules, 272
interestingness and, 264–265
misleading, 265
Structural Clustering Algorithm for Networks (SCAN), 531–532
structural context-based similarity, 526
structural data analysis, 319
structural patterns, 282
structure similarity search, 592
structures
as contexts, 575
discovery of, 318
indexing, 319
substructures, 243
Student’s t-test, 372
subcube queries, 216, 217–218
sub-itemset pruning, 263
subjective interestingness measures, 22
subject-oriented data warehouses, 126
subsequence, 589
matching, 587
subset checking, 263–264
subset testing, 250
subspace clustering, 448
frequent patterns for, 318–319
subspace clustering methods, 509, 510–511, 538
biclustering, 511
correlation-based, 511
examples, 538
subspace search methods, 510–511
subspaces
bottom-up search, 510–511
cube space, 228–229
outliers in, 578–579
top-down search, 511
substitution matrices, 590
substructures, 243
sum of the squared error (SSE), 501
summary fact tables, 165
superset checking, 263
supervised learning, 24, 330
supervised outlier detection, 549–550
challenges, 550
support, 21
association rule, 21
group-based, 286
reduced, 285, 286
uniform, 285–286
support, rule, 245, 246
support vector machines (SVMs), 393, 408–415, 437
interest in, 408
maximum marginal hyperplane, 409, 412
nonlinear, 413–415
for numeric prediction, 408
with sigmoid kernel, 415
support vectors, 411
for test tuples, 412–413
training/testing speed improvement, 415
support vectors, 411, 437
illustrated, 411
SVM finding, 412
supremum distance, 73–74
surface web, 597
survival analysis, 600
SVMs. See support vector machines
symbolic sequences, 586, 588
applications, 589
sequential pattern mining in, 588–589
symmetric binary dissimilarity, 70
synchronous generalization, 175
T
tables, 9
attributes, 9
contingency, 95
dimension, 136
fact, 165
tuples, 9
tag clouds, 64, 66
Tanimoto coefficient, 78
target classes, 15, 180
initial working relations, 177
prime relation, 175, 177
targeted marketing, 609
taxonomy formation, 20
technologies, 23–27, 33, 34
telecommunications industry, 611
temporal data, 14
term-frequency vectors, 77
cosine similarity between, 78
sparse, 77
table, 77
terminating conditions, 404
test sets, 330
test tuples, 330
text data, 14
text mining, 596–597, 624
theoretical foundations, 600–601, 625
three-layer neural networks, 399
threshold-moving approach, 385
tilted time windows, 598
timeliness, data, 85
time-series data, 586, 587
cyclic movements, 588
discretization and, 590
illustrated, 588
random movements, 588
regression analysis, 587–588
seasonal variations, 588
shapelets method, 590
subsequence matching, 587
transformation into aggregate approximations, 587
trend analysis, 588
trend or long-term movements, 588
time-series data analysis, 319
time-series forecasting, 588
time-variant data warehouses, 127
top-down design approach, 133, 151
top-down subspace search, 511
top-down view, 151
topic model, 26–27
top-k patterns/rules, 281
top-k queries, 225
example, 225–226
ranking cubes to answer, 226–227
results, 225
user-specified preference components, 225
top-k strategies
comparison illustration, 311
summarized pattern, 311
traditional, 311
TrAdaBoost, 436
training
Bayesian belief networks, 396–397
data, 18
sets, 328
tuples, 332–333
transaction reduction, 255
transactional databases, 13
example, 13–14
transactions, components of, 13
transfer learning, 430, 435, 434–436, 438
applications, 435
approaches to, 436
heterogeneous, 436
negative transfer and, 436
target task, 435
traditional learning versus, 435
treemaps, 63, 65
trend analysis
spatial, 595
in time-series data, 588
for time-series forecasting, 588
trends, data mining, 622–625, 626
triangle inequality, 73
trimmed mean, 46
trimodal, 47
true negatives, 365
true positives, 365
t-test, 372
tuples, 9
duplication, 98–99
negative, 364
partitioning, 334, 337
positive, 364
training, 332–333
two sample t-test, 373
two-layer neural networks, 399
two-level hash index structure, 264
U
ubiquitous data mining, 618–620, 625
uncertainty sampling, 433
undersampling, 384, 386
example, 384–385
uniform support, 285–286
unimodal, 47
unique rules, 92
univariate distribution, 40
univariate Gaussian mixture model, 504
univariate outlier detection, 554–555
unordered attributes, 103
unordered rules, 358
unsupervised learning, 25, 330, 445, 490
clustering as, 25, 445, 490
example, 25
supervised learning versus, 330
unsupervised outlier detection, 550
assumption, 550
clustering methods acting as, 551
upper approximation, 427
user interaction, 30–31
V
values
exception, 234
expected, 97, 234
missing, 88–89
residual, 234
in rules or patterns, 281
variables
grouping, 231
predicate, 295
predictor, 105
response, 105
variance, 51, 98
example, 51
function of, 50
variant graph patterns, 591
version space, 433
vertical data format, 260
example, 260–262
frequent itemset mining with, 259–262, 272
video data analysis, 319
virtual warehouses, 133
visibility graphs, 537
visible points, 537
visual data mining, 602–604, 625
data mining process visualization, 603
data mining result visualization, 603
data visualization, 602–603
as discipline integration, 602
illustrations, 604–607
interactive, 604, 607
as mining trend, 624
Viterbi algorithm, 591
W
warehouse database servers, 131
warehouse refresh software, 151
waterfall method, 152
wavelet coefficients, 100
wavelet transforms, 99, 100–102
discrete (DWT), 100–102
for multidimensional data, 102
on sparse and skewed data, 102
web directories, 28
web mining, 597, 624
content, 597
as mining trend, 624
structure, 597–598
usage, 598
web search engines, 28, 523–524
web-document classification, 435
weighted arithmetic mean, 46
weighted Euclidean distance, 74
Wikipedia, 597
WordNet, 597
working relations, 172
initial, 168, 169
World Wide Web (WWW), 1–2, 4, 14
Worlds-within-Worlds, 63, 64
wrappers, 127
Z
z-score normalization, 114–115
Document Outline
- Front Cover
- Data Mining: Concepts and Techniques
- Copyright
- Dedication
- Table of Contents
- Foreword
- Foreword to Second Edition
- Preface
- Acknowledgments
- About the Authors
- Chapter 1. Introduction
- 1.1 Why Data Mining?
- 1.2 What Is Data Mining?
- 1.3 What Kinds of Data Can Be Mined?
- 1.4 What Kinds of Patterns Can Be Mined?
- 1.5 Which Technologies Are Used?
- 1.6 Which Kinds of Applications Are Targeted?
- 1.7 Major Issues in Data Mining
- 1.8 Summary
- 1.9 Exercises
- 1.10 Bibliographic Notes
- Chapter 2. Getting to Know Your Data
- 2.1 Data Objects and Attribute Types
- 2.2 Basic Statistical Descriptions of Data
- 2.3 Data Visualization
- 2.4 Measuring Data Similarity and Dissimilarity
- 2.5 Summary
- 2.6 Exercises
- 2.7 Bibliographic Notes
- Chapter 3. Data Preprocessing
- 3.1 Data Preprocessing: An Overview
- 3.2 Data Cleaning
- 3.3 Data Integration
- 3.4 Data Reduction
- 3.5 Data Transformation and Data Discretization
- 3.6 Summary
- 3.7 Exercises
- 3.8 Bibliographic Notes
- Chapter 4. Data Warehousing and Online Analytical Processing
- 4.1 Data Warehouse: Basic Concepts
- 4.2 Data Warehouse Modeling: Data Cube and OLAP
- 4.3 Data Warehouse Design and Usage
- 4.4 Data Warehouse Implementation
- 4.5 Data Generalization by Attribute-Oriented Induction
- 4.6 Summary
- 4.7 Exercises
- 4.8 Bibliographic Notes
- Chapter 5. Data Cube Technology
- 5.1 Data Cube Computation: Preliminary Concepts
- 5.2 Data Cube Computation Methods
- 5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology
- 5.4 Multidimensional Data Analysis in Cube Space
- 5.5 Summary
- 5.6 Exercises
- 5.7 Bibliographic Notes
- Chapter 6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
- 6.1 Basic Concepts
- 6.2 Frequent Itemset Mining Methods
- 6.3 Which Patterns Are Interesting?—Pattern Evaluation Methods
- 6.4 Summary
- 6.5 Exercises
- 6.6 Bibliographic Notes
- Chapter 7. Advanced Pattern Mining
- 7.1 Pattern Mining: A Road Map
- 7.2 Pattern Mining in Multilevel, Multidimensional Space
- 7.3 Constraint-Based Frequent Pattern Mining
- 7.4 Mining High-Dimensional Data and Colossal Patterns
- 7.5 Mining Compressed or Approximate Patterns
- 7.6 Pattern Exploration and Application
- 7.7 Summary
- 7.8 Exercises
- 7.9 Bibliographic Notes
- Chapter 8. Classification: Basic Concepts
- 8.1 Basic Concepts
- 8.2 Decision Tree Induction
- 8.3 Bayes Classification Methods
- 8.4 Rule-Based Classification
- 8.5 Model Evaluation and Selection
- 8.6 Techniques to Improve Classification Accuracy
- 8.7 Summary
- 8.8 Exercises
- 8.9 Bibliographic Notes
- Chapter 9. Classification: Advanced Methods
- 9.1 Bayesian Belief Networks
- 9.2 Classification by Backpropagation
- 9.3 Support Vector Machines
- 9.4 Classification Using Frequent Patterns
- 9.5 Lazy Learners (or Learning from Your Neighbors)
- 9.6 Other Classification Methods
- 9.7 Additional Topics Regarding Classification
- 9.8 Summary
- 9.9 Exercises
- 9.10 Bibliographic Notes
- Chapter 10. Cluster Analysis: Basic Concepts and Methods
- 10.1 Cluster Analysis
- 10.2 Partitioning Methods
- 10.3 Hierarchical Methods
- 10.4 Density-Based Methods
- 10.5 Grid-Based Methods
- 10.6 Evaluation of Clustering
- 10.7 Summary
- 10.8 Exercises
- 10.9 Bibliographic Notes
- Chapter 11. Advanced Cluster Analysis
- 11.1 Probabilistic Model-Based Clustering
- 11.2 Clustering High-Dimensional Data
- 11.3 Clustering Graph and Network Data
- 11.4 Clustering with Constraints
- 11.5 Summary
- 11.6 Exercises
- 11.7 Bibliographic Notes
- Chapter 12. Outlier Detection
- 12.1 Outliers and Outlier Analysis
- 12.2 Outlier Detection Methods
- 12.3 Statistical Approaches
- 12.4 Proximity-Based Approaches
- 12.5 Clustering-Based Approaches
- 12.6 Classification-Based Approaches
- 12.7 Mining Contextual and Collective Outliers
- 12.8 Outlier Detection in High-Dimensional Data
- 12.9 Summary
- 12.10 Exercises
- 12.11 Bibliographic Notes
- Chapter 13. Data Mining Trends and Research Frontiers
- 13.1 Mining Complex Data Types
- 13.2 Other Methodologies of Data Mining
- 13.3 Data Mining Applications
- 13.4 Data Mining and Society
- 13.5 Data Mining Trends
- 13.6 Summary
- 13.7 Exercises
- 13.8 Bibliographic Notes
- Bibliography
- Index