Data Mining. Concepts and Techniques, 3rd Edition




HAN 22-ind-673-708-9780123814791 2011/6/1 3:27 Page 700 #28

Index

similarity (Continued)

measuring, 65–78, 79

nominal attributes, 70

similarity measures, 447–448, 525–528

constraints on, 533

geodesic distance, 525–526

SimRank, 526–528

similarity searches, 587

in information networks, 594

in multimedia data mining, 596

simple random sample with replacement (SRSWR), 108

simple random sample without replacement (SRSWOR), 108

SimRank, 526–528, 539

computation, 527–528

random walk, 526–528

structural context, 528

simultaneous aggregation, 195

single-dimensional association rules, 17, 287

single-linkage algorithm, 460, 461

singular value decomposition (SVD), 587

skewed data

balanced, 271

negatively, 47

positively, 47

wavelet transforms on, 102

slice operation, 148

small-world phenomenon, 592

smoothing, 112

by bin boundaries, 89

by bin means, 89

by bin medians, 89

for data discretization, 90

snowflake schema, 140

example, 141

illustrated, 141

star schema versus, 140

social networks, 524–525, 526–528

densification power law, 592

evolution of, 594

mining, 623

small-world phenomenon, 592



See also networks

social science/social studies data mining, 613

soft clustering, 501



soft constraints, 534, 539

example, 534

handling, 536–537

space-filling curve, 58

sparse data, 102

sparse data cubes, 190

sparsest cuts, 539

sparsity coefficient, 579

spatial data, 14

spatial data mining, 595

spatiotemporal data analysis, 319

spatiotemporal data mining, 595, 623–624

specialized SQL servers, 165

specificity measure, 367

spectral clustering, 520–522, 539

effectiveness, 522

framework, 521

steps, 520–522

speech recognition, 430

speed, classification, 369

spiral method, 152

split-point, 333, 340, 342

splitting attributes, 333

splitting criterion, 333, 342

splitting rules. See attribute selection measures

splitting subset, 333

SQL, as relational query language, 10

square-error function, 454

squashing function, 403

standard deviation, 51

example, 51

function of, 50

star schema, 139

example, 139–140

illustrated, 140

snowflake schema versus, 140

Star-Cubing, 204–210, 235

algorithm illustration, 209

bottom-up computation, 205

example, 207

for full cube computation, 210

ordering of dimensions and, 210

performance, 210

shared dimensions, 204–205

starnet query model, 149

example, 149–150

star-nodes, 205

star-trees, 205

compressed base table, 207

construction, 205

statistical data mining, 598–600

analysis of variance, 600

discriminant analysis, 600

factor analysis, 600

generalized linear models, 599–600

mixed-effect models, 600

quality control, 600




regression, 599

survival analysis, 600

statistical databases (SDBs), 148

OLAP systems versus, 148–149

statistical descriptions, 24, 79

graphic displays, 44–45, 51–56

measuring the dispersion, 48–51

statistical hypothesis test, 24

statistical models, 23–24

of networks, 592–594

statistical outlier detection methods, 552, 553–560, 581

computational cost of, 560



for data analysis, 625

effectiveness, 552

example, 552

nonparametric, 553, 558–560

parametric, 553–558

See also outlier detection

statistical theory, in exceptional behavior disclosure, 291

statistics, 23



inferential, 24

predictive, 24

StatSoft, 602, 603

stepwise backward elimination, 105

stepwise forward selection, 105

stick figure visualization, 61–63

STING, 479–481

advantages, 480–481

as density-based clustering method, 480

hierarchical structure, 479, 480

multiresolution approach, 481

See also cluster analysis; grid-based methods

stratified cross-validation, 371

stratified samples, 109–110

stream data, 598, 624

strong association rules, 272

interestingness and, 264–265

misleading, 265

Structural Clustering Algorithm for Networks (SCAN), 531–532

structural context-based similarity, 526

structural data analysis, 319

structural patterns, 282

structure similarity search, 592

structures

as contexts, 575

discovery of, 318

indexing, 319

substructures, 243

Student’s t-test, 372

subcube queries, 216, 217–218

sub-itemset pruning, 263

subjective interestingness measures, 22

subject-oriented data warehouses, 126

subsequence, 589

matching, 587

subset checking, 263–264

subset testing, 250

subspace clustering, 448

frequent patterns for, 318–319

subspace clustering methods, 509, 510–511, 538

biclustering, 511



correlation-based, 511

examples, 538

subspace search methods, 510–511

subspaces

bottom-up search, 510–511

cube space, 228–229

outliers in, 578–579

top-down search, 511

substitution matrices, 590

substructures, 243

sum of the squared error (SSE), 501

summary fact tables, 165

superset checking, 263

supervised learning, 24, 330

supervised outlier detection, 549–550

challenges, 550

support, 21

association rule, 21

group-based, 286

reduced, 285, 286

uniform, 285–286

support, rule, 245, 246

support vector machines (SVMs), 393, 408–415, 437


interest in, 408

maximum marginal hyperplane, 409, 412

nonlinear, 413–415

for numeric prediction, 408

with sigmoid kernel, 415

support vectors, 411

for test tuples, 412–413

training/testing speed improvement, 415

support vectors, 411, 437

illustrated, 411

SVM finding, 412

supremum distance, 73–74

surface web, 597

survival analysis, 600

SVMs. See support vector machines




symbolic sequences, 586, 588

applications, 589

sequential pattern mining in, 588–589

symmetric binary dissimilarity, 70

synchronous generalization, 175



T

tables, 9

attributes, 9

contingency, 95

dimension, 136

fact, 165

tuples, 9

tag clouds, 64, 66

Tanimoto coefficient, 78

target classes, 15, 180

initial working relations, 177

prime relation, 175, 177

targeted marketing, 609

taxonomy formation, 20

technologies, 23–27, 33, 34

telecommunications industry, 611

temporal data, 14

term-frequency vectors, 77

cosine similarity between, 78

sparse, 77

table, 77

terminating conditions, 404

test sets, 330

test tuples, 330

text data, 14

text mining, 596–597, 624

theoretical foundations, 600–601, 625

three-layer neural networks, 399

threshold-moving approach, 385

tilted time windows, 598

timeliness, data, 85

time-series data, 586, 587

cyclic movements, 588

discretization and, 590

illustrated, 588

random movements, 588

regression analysis, 587–588

seasonal variations, 588

shapelets method, 590

subsequence matching, 587

transformation into aggregate approximations, 587


trend analysis, 588

trend or long-term movements, 588

time-series data analysis, 319

time-series forecasting, 588

time-variant data warehouses, 127

top-down design approach, 133, 151

top-down subspace search, 511

top-down view, 151

topic model, 26–27

top-k patterns/rules, 281

top-k queries, 225

example, 225–226

ranking cubes to answer, 226–227

results, 225

user-specified preference components, 225


top-k strategies

comparison illustration, 311

summarized pattern, 311

traditional, 311

TrAdaBoost, 436

training


Bayesian belief networks, 396–397

data, 18


sets, 328

tuples, 332–333

transaction reduction, 255

transactional databases, 13

example, 13–14

transactions, components of, 13

transfer learning, 430, 434–436, 438

applications, 435

approaches to, 436

heterogeneous, 436

negative transfer and, 436

target task, 435

traditional learning versus, 435

treemaps, 63, 65

trend analysis

spatial, 595

in time-series data, 588

for time-series forecasting, 588

trends, data mining, 622–625, 626

triangle inequality, 73

trimmed mean, 46

trimodal, 47

true negatives, 365

true positives, 365



t-test, 372

tuples, 9

duplication, 98–99

negative, 364

partitioning, 334, 337

positive, 364

training, 332–333

two sample t-test, 373





two-layer neural networks, 399

two-level hash index structure, 264

U

ubiquitous data mining, 618–620, 625

uncertainty sampling, 433

undersampling, 384, 386

example, 384–385

uniform support, 285–286

unimodal, 47

unique rules, 92

univariate distribution, 40

univariate Gaussian mixture model, 504

univariate outlier detection, 554–555

unordered attributes, 103

unordered rules, 358

unsupervised learning, 25, 330, 445, 490

clustering as, 25, 445, 490

example, 25

supervised learning versus, 330

unsupervised outlier detection, 550

assumption, 550

clustering methods acting as, 551

upper approximation, 427

user interaction, 30–31



V

values


exception, 234

expected, 97, 234

missing, 88–89

residual, 234

in rules or patterns, 281

variables

grouping, 231

predicate, 295

predictor, 105

response, 105

variance, 51, 98

example, 51

function of, 50

variant graph patterns, 591

version space, 433

vertical data format, 260

example, 260–262

frequent itemset mining with, 259–262, 272

video data analysis, 319



virtual warehouses, 133

visibility graphs, 537

visible points, 537

visual data mining, 602–604, 625

data mining process visualization, 603

data mining result visualization, 603

data visualization, 602–603

as discipline integration, 602

illustrations, 604–607

interactive, 604, 607

as mining trend, 624

Viterbi algorithm, 591



W

warehouse database servers, 131

warehouse refresh software, 151

waterfall method, 152

wavelet coefficients, 100

wavelet transforms, 99, 100–102

discrete (DWT), 100–102

for multidimensional data, 102

on sparse and skewed data, 102

web directories, 28

web mining, 597, 624

content, 597

as mining trend, 624

structure, 597–598

usage, 598

web search engines, 28, 523–524

web-document classification, 435

weighted arithmetic mean, 46

weighted Euclidean distance, 74

Wikipedia, 597

WordNet, 597

working relations, 172

initial, 168, 169

World Wide Web (WWW), 1–2, 4, 14

Worlds-within-Worlds, 63, 64

wrappers, 127



Z

z-score normalization, 114–115

Document Outline

  • Front Cover 
  • Data Mining: Concepts and Techniques
  • Copyright
  • Dedication
  • Table of Contents
  • Foreword
  • Foreword to Second Edition
  • Preface
  • Acknowledgments
  • About the Authors
  • Chapter 1. Introduction
    • 1.1 Why Data Mining?
    • 1.2 What Is Data Mining?
    • 1.3 What Kinds of Data Can Be Mined?
    • 1.4 What Kinds of Patterns Can Be Mined?
    • 1.5 Which Technologies Are Used?
    • 1.6 Which Kinds of Applications Are Targeted?
    • 1.7 Major Issues in Data Mining
    • 1.8 Summary
    • 1.9 Exercises
    • 1.10 Bibliographic Notes
  • Chapter 2. Getting to Know Your Data
    • 2.1 Data Objects and Attribute Types
    • 2.2 Basic Statistical Descriptions of Data
    • 2.3 Data Visualization
    • 2.4 Measuring Data Similarity and Dissimilarity
    • 2.5 Summary
    • 2.6 Exercises
    • 2.7 Bibliographic Notes
  • Chapter 3. Data Preprocessing
    • 3.1 Data Preprocessing: An Overview
    • 3.2 Data Cleaning
    • 3.3 Data Integration
    • 3.4 Data Reduction
    • 3.5 Data Transformation and Data Discretization
    • 3.6 Summary
    • 3.7 Exercises
    • 3.8 Bibliographic Notes
  • Chapter 4. Data Warehousing and Online Analytical Processing
    • 4.1 Data Warehouse: Basic Concepts
    • 4.2 Data Warehouse Modeling: Data Cube and OLAP
    • 4.3 Data Warehouse Design and Usage
    • 4.4 Data Warehouse Implementation
    • 4.5 Data Generalization by Attribute-Oriented Induction
    • 4.6 Summary
    • 4.7 Exercises
    • 4.8 Bibliographic Notes
  • Chapter 5. Data Cube Technology
    • 5.1 Data Cube Computation: Preliminary Concepts
    • 5.2 Data Cube Computation Methods
    • 5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology
    • 5.4 Multidimensional Data Analysis in Cube Space
    • 5.5 Summary
    • 5.6 Exercises
    • 5.7 Bibliographic Notes
  • Chapter 6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
    • 6.1 Basic Concepts
    • 6.2 Frequent Itemset Mining Methods
    • 6.3 Which Patterns Are Interesting?—Pattern Evaluation Methods
    • 6.4 Summary
    • 6.5 Exercises
    • 6.6 Bibliographic Notes
  • Chapter 7. Advanced Pattern Mining
    • 7.1 Pattern Mining: A Road Map
    • 7.2 Pattern Mining in Multilevel, Multidimensional Space
    • 7.3 Constraint-Based Frequent Pattern Mining
    • 7.4 Mining High-Dimensional Data and Colossal Patterns
    • 7.5 Mining Compressed or Approximate Patterns
    • 7.6 Pattern Exploration and Application
    • 7.7 Summary
    • 7.8 Exercises
    • 7.9 Bibliographic Notes
  • Chapter 8. Classification: Basic Concepts
    • 8.1 Basic Concepts
    • 8.2 Decision Tree Induction
    • 8.3 Bayes Classification Methods
    • 8.4 Rule-Based Classification
    • 8.5 Model Evaluation and Selection
    • 8.6 Techniques to Improve Classification Accuracy
    • 8.7 Summary
    • 8.8 Exercises
    • 8.9 Bibliographic Notes
  • Chapter 9. Classification: Advanced Methods
    • 9.1 Bayesian Belief Networks
    • 9.2 Classification by Backpropagation
    • 9.3 Support Vector Machines
    • 9.4 Classification Using Frequent Patterns
    • 9.5 Lazy Learners (or Learning from Your Neighbors)
    • 9.6 Other Classification Methods
    • 9.7 Additional Topics Regarding Classification
    • 9.8 Summary
    • 9.9 Exercises
    • 9.10 Bibliographic Notes
  • Chapter 10. Cluster Analysis: Basic Concepts and Methods
    • 10.1 Cluster Analysis
    • 10.2 Partitioning Methods
    • 10.3 Hierarchical Methods
    • 10.4 Density-Based Methods
    • 10.5 Grid-Based Methods
    • 10.6 Evaluation of Clustering
    • 10.7 Summary
    • 10.8 Exercises
    • 10.9 Bibliographic Notes
  • Chapter 11. Advanced Cluster Analysis
    • 11.1 Probabilistic Model-Based Clustering
    • 11.2 Clustering High-Dimensional Data
    • 11.3 Clustering Graph and Network Data
    • 11.4 Clustering with Constraints
    • 11.5 Summary
    • 11.6 Exercises
    • 11.7 Bibliographic Notes
  • Chapter 12. Outlier Detection
    • 12.1 Outliers and Outlier Analysis
    • 12.2 Outlier Detection Methods
    • 12.3 Statistical Approaches
    • 12.4 Proximity-Based Approaches
    • 12.5 Clustering-Based Approaches
    • 12.6 Classification-Based Approaches
    • 12.7 Mining Contextual and Collective Outliers
    • 12.8 Outlier Detection in High-Dimensional Data
    • 12.9 Summary
    • 12.10 Exercises
    • 12.11 Bibliographic Notes
  • Chapter 13. Data Mining Trends and Research Frontiers
    • 13.1 Mining Complex Data Types
    • 13.2 Other Methodologies of Data Mining
    • 13.3 Data Mining Applications
    • 13.4 Data Mining and Society
    • 13.5 Data Mining Trends
    • 13.6 Summary
    • 13.7 Exercises
    • 13.8 Bibliographic Notes
  • Bibliography
  • Index
