## Data Mining and Knowledge Discovery in Databases (KDD) State of the Art ## Prof. Dr. T. Nouri ## Computer Science Department ## FHNW Switzerland
## Conference overview ## Overview of KDD and data mining ## Demo ## Summary
## Overview of data mining ## What is KDD? ## Why is KDD necessary? ## The KDD process ## KDD operations and methods
## The iterative and interactive process of discovering valid, novel, useful, and understandable knowledge (patterns, models, rules, etc.) in **massive** databases
## What is data mining? ## Valid: generalize to the future ## Novel: what we don't know ## Useful: be able to take some action ## Understandable: leading to insight ## Iterative: takes multiple passes ## Interactive: human in the loop
## Why data mining? - Number of records too large (millions or billions)
- High dimensional (attributes/features/ fields) data (thousands)
## Increased opportunity for access - Web navigation, on-line collections
## Data mining goals
## Data mining operations ## Verification driven - Validating hypothesis
- Querying and reporting (spreadsheets, pivot tables)
- Multidimensional analysis (dimensional summaries); On Line Analytical Processing
- Statistical analysis
## Data mining operations ## Discovery driven - Exploratory data analysis
- Predictive modeling
- Database segmentation
- Link analysis
- Deviation detection
## Data mining process - Prior knowledge, user goals
## Create target dataset - Select data, focus on subsets
## Data cleaning and transformation - Remove noise, outliers, missing values
- Select features, reduce dimensions
## Data mining process ## Apply data mining algorithm - Associations, sequences, classification, clustering, etc.
## Interpret, evaluate and visualize patterns - What's new and interesting?
- Iterate if needed
## Manage discovered knowledge
## Related fields ## AI ## Machine learning ## Statistics ## Databases and data warehousing ## High performance computing ## Visualization
## Need for data mining tools ## Human analysis breaks down with volume and dimensionality - How quickly can one digest 1 million records with 100 attributes each?
- High rate of growth, changing sources
## What is done by non-statisticians? - Select a few fields and fit simple models or attempt to visualize
## Conference overview ## Overview of KDD and data mining ## Data mining techniques ## Demo ## Summary
## Data mining methods ## Predictive modeling (classification, regression) ## Segmentation (clustering) ## Dependency modeling (graphical models, density estimation) ## Summarization (associations) ## Change and deviation detection
## Data mining techniques ## Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both) ## Sequence mining (categorical): discover sequences of events that commonly occur together, e.g., in a set of DNA sequences, ACGTC is followed by GTCA after a gap of 9, with 30% probability
## Data mining techniques ## CBR or Similarity search: given a database of objects, and a “query” object, find the object(s) that are within a user-defined distance of the queried object, or find all pairs within some distance of each other. ## Deviation detection: find the record(s) that is (are) the most different from the other records, i.e., find all outliers. These may be thrown away as noise or may be the “interesting” ones.
## Data mining techniques ## Classification and regression: classification assigns a new data record to one of several predefined categories or classes; regression predicts real-valued fields. Also called supervised learning. ## Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within-group similarity and low inter-group similarity. Also called unsupervised learning.
## Data mining techniques ## Many other methods, such as - Decision trees
- Neural networks
- Genetic algorithms
- Hidden Markov models
- Time series
- Bayesian networks
- Soft computing: rough and fuzzy sets
## Research challenges for KDD ## Scalability - Efficient and sufficient sampling
- In-memory vs. disk-based processing
- High performance computing
## Automation - Ease of use
- Using prior knowledge
## Types of data mining tasks ## General descriptive knowledge - Summarizations
- Symbolic descriptions of subsets
## Discriminative knowledge - Distinguish between K classes
- Accurate classification (also black box)
- Separate spaces
## Components of DM methods ## Representation: language for patterns/models, expressive power ## Evaluation: scoring methods for deciding what is a good fit of model to data ## Search: method for enumerating patterns/models
## Data mining techniques ## Association rules ## Sequence mining ## Classification (decision trees, etc.) ## Clustering ## Deviation detection ## K-nearest neighbors
## What is association mining? ## Given a set of items/attributes, and a set of objects containing a subset of the items ## Find rules: if I1 then I2 (sup, conf) ## I1, I2 are sets of items ## I1, I2 have sufficient support: P(I1+I2) ## Rule has sufficient confidence: P(I2|I1)
## Support & Confidence
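The support and confidence measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's own example: the function names and the toy `baskets` data are assumptions made for the sketch. Support is estimated as the fraction of objects containing all items of a set, and confidence as sup(I1 and I2) / sup(I1).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as sup(A and C) / sup(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base > 0 else 0.0


# Hypothetical toy data: five shopping baskets.
baskets = [
    {"cookies", "milk"},
    {"cookies", "milk", "bread"},
    {"cookies", "bread"},
    {"milk"},
    {"bread"},
]

sup = support(baskets, {"cookies", "milk"})         # 2 of 5 baskets
conf = confidence(baskets, {"cookies"}, {"milk"})   # 2 of the 3 cookie baskets
print(f"if cookies then milk (sup={sup:.2f}, conf={conf:.2f})")
```

A rule would be reported only if both values clear the user's thresholds (minimum support and minimum confidence).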
## Association Mining ex.
## What is sequence mining? ## Given a set of items, list of events per sequence ordered in time ## Find rules: if S1 then S2 (sup, conf) ## S1, S2 are sequences of items ## S1, S2 have sufficient support: P(S1+S2) ## Rule has sufficient confidence: P(S2|S1)
## Sequence mining ## User specifies “interestingness” - Minimum support (minsup)
- Minimum confidence (minconf)
## Find all frequent sequences (> minsup) - Exponential Search Space
- Computation and I/O Intensive
## Generate strong rules (> minconf)
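As a rough sketch of the counting step behind these definitions, the Python below measures the support of an ordered pattern and the confidence of a rule "if S1 then S2". It assumes that a pattern matches when its items appear in order with arbitrary gaps, and that "S1 then S2" means S1 followed by S2; both are simplifying assumptions. The function names and toy sequences are hypothetical. It does not enumerate the exponential search space of candidate sequences, only scores given ones.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's items occur in the sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    # "item in it" advances the iterator, so later items must appear after earlier ones.
    return all(item in it for item in pattern)


def seq_support(sequences, pattern):
    """Fraction of sequences that contain the pattern in order."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)


def seq_confidence(sequences, s1, s2):
    """Estimate P(S2 follows | S1 occurs) as sup(S1 then S2) / sup(S1)."""
    base = seq_support(sequences, s1)
    return seq_support(sequences, list(s1) + list(s2)) / base if base else 0.0


# Hypothetical event sequences, each ordered in time.
sequences = [list("ABCD"), list("ACD"), list("ABD"), list("BCA")]

minsup, minconf = 0.5, 0.8                      # user-chosen thresholds (assumed values)
sup_ad = seq_support(sequences, ["A", "D"])
conf_ab_d = seq_confidence(sequences, ["A", "B"], ["D"])
print(sup_ad >= minsup, conf_ab_d >= minconf)
```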
## Predictive modeling ## A “black box” that makes predictions about the future based on information from the past and present ## Large number of inputs available
## Models ## Some models are better than others - Accuracy
- Understandability
## Models range from easy to understand to incomprehensible - Decision trees
- Rule induction
- Regression models
- Neural networks
## What is Classification? ## Classification is the process of assigning new objects to predefined categories or classes ## Given a set of labeled records ## Build a model (decision tree)
## Classification learning ## Supervised learning (labels known) ## Example described in terms of attributes - Categorical (unordered symbolic values)
- Numeric (integers, reals)
## Class (output/predicted attribute): categorical for classification, numeric for regression
## Decision-tree classification
## From tree to rules
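One way to picture the tree-to-rules step: each path from the root to a leaf becomes one rule whose conditions are the tests along the path. The sketch below assumes a tree stored as nested dictionaries; that representation, the attribute names, and the class labels are all hypothetical and chosen only for illustration.

```python
def tree_to_rules(node, conditions=()):
    """Walk a decision tree and return one (conditions, label) rule per leaf."""
    if not isinstance(node, dict):              # a leaf holds a class label
        return [(list(conditions), node)]
    rules = []
    attribute = node["attribute"]
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, conditions + ((attribute, value),)))
    return rules


# Hypothetical tree: internal nodes test one categorical attribute.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "windy", "branches": {"yes": "stay in", "no": "play"}},
        "rainy": "stay in",
    },
}

rules = tree_to_rules(tree)
for conds, label in rules:
    text = " and ".join(f"{a} = {v}" for a, v in conds)
    print(f"if {text} then class = {label}")
```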
## What is clustering? ## Given N k-dimensional feature vectors, find a “meaningful” partition of the N examples into c subsets or groups ## Discover the “labels” automatically ## c may be given, or “discovered” ## Much more difficult than classification, since in the latter the groups are given, and we seek a compact description
## Clustering ## Have to define some notion of “similarity” between examples ## Goal: maximize intra-cluster similarity and minimize inter-cluster similarity ## Feature vectors may be - All numeric (well defined distances)
- All categorical or mixed (harder to define similarity; geometric notions don’t work)
## Clustering schemes ## Distance-based - Numeric
- Euclidean distance (root of sum of squared differences along each dimension)
- Angle between two vectors
- Categorical
- Number of common features (categorical)
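The three measures listed for distance-based schemes can be written down directly. The sketch below uses only the Python standard library; the function names are illustrative assumptions, and the angle function assumes neither vector is all zeros.

```python
import math


def euclidean(a, b):
    """Root of the sum of squared differences along each dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angle(a, b):
    """Angle in radians between two non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norms))   # clamp against rounding error
    return math.acos(cosine)


def common_features(a, b):
    """Number of positions where two categorical vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


print(euclidean((0, 0), (3, 4)))
print(angle((1, 0), (0, 1)))
print(common_features(("red", "small", "round"), ("red", "large", "round")))
```

Note that the first two are distances (smaller means more similar), while the count of common features is a similarity (larger means more similar).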
## Partition-based - Enumerate partitions and score each
## K-means algorithm
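The slides' own K-means walkthrough is not reproduced in the text, so the following is a minimal sketch of the standard iteration (often called Lloyd's algorithm): assign each point to its nearest center, recompute each center as the mean of its points, and repeat until the centers stop changing. The initial centers are passed in explicitly to keep the example deterministic; the function name, the data, and that choice are assumptions.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Alternate nearest-center assignment and center recomputation until stable."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:              # converged
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```

The result depends on the starting centers; in practice several random starts are commonly tried and the best-scoring partition kept.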
## Deviation detection
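Deviation detection was described earlier as finding the records most different from the rest. One simple way to make that concrete for a single numeric field is a z-score test, sketched below. This is only one of many possible techniques and is not necessarily the one the presenter had in mind; the threshold of 2 standard deviations and the sample values are assumptions.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)           # population standard deviation
    if stdev == 0:
        return []                               # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical measurements with one obvious deviation.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(readings)
print(outliers)
```

Whether a flagged record is noise to discard or the "interesting" case to investigate is a judgment left to the analyst.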
## K-nearest neighbors ## Classification technique to assign a class to a new example ## Find k-nearest neighbors, i.e., most similar points in the dataset (compare against all points!) ## Assign the new case to the same class to which most of its neighbors belong
## K-nearest neighbors
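The procedure described above (compare against all points, take the k most similar, assign the majority class) can be sketched as follows. Euclidean distance is assumed as the similarity measure, and the function name and training data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Brute force: the query is compared against every training point.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples in two dimensions.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]

print(knn_classify(train, (2, 2)))
print(knn_classify(train, (7, 8)))
```

An odd k is a common choice for two-class problems because it avoids tied votes.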
## Conference overview ## Overview of KDD and data mining ## Data mining techniques ## Demo ## Research Trends ## Summary ## KDD resources pointers
## Conference overview ## Overview of KDD and data mining ## Data mining techniques ## Demo ## Summary ## KDD resources pointers
## Conclusions ## Scientific and economic need for KDD ## Made possible by recent advances in data collection, processing power, and sophisticated techniques from AI, databases and visualization ## KDD is a complex process ## Several techniques need to be used
## Conclusions ## Need for rich knowledge representation ## Need to integrate specific domain knowledge ## KDD using fuzzy-categorical and uncertainty techniques ## Web mining and user profiles ## KDD for bioinformatics
## KDD resources pointers ## ACM SIGKDD: www.acm.org/sigkdd ## KDD Nuggets: www.kdnuggets.com ## Book: Advances in KDD, MIT Press, ’96 ## Journal: Data Mining and KDD: research.microsoft.com/datamine