Data Mining and Knowledge Discovery in Databases (KDD): State of the Art. Prof. Dr. T. Nouri



Date: 08.10.2017


Data Mining and Knowledge Discovery in Databases (KDD) State of the Art

  • Prof. Dr. T. Nouri

  • Computer Science Department

  • FHNW Switzerland


Conference overview





Overview of data mining

  • What is KDD?

  • Why is KDD necessary?

  • The KDD process

  • KDD operations and methods



  • KDD: the iterative and interactive process of discovering valid, novel, useful, and understandable knowledge (patterns, models, rules, etc.) in massive databases



What is data mining?

  • Valid: generalize to the future

  • Novel: what we don't know

  • Useful: be able to take some action

  • Understandable: leading to insight

  • Iterative: takes multiple passes

  • Interactive: human in the loop



Why data mining?

  • Data volume too large for classical analysis

    • Number of records too large (millions or billions)
    • High-dimensional data (thousands of attributes/features/fields)
  • Increased opportunity for access

    • Web navigation, on-line collections


Data mining goals

  • Prediction

    • What? Opaque
  • Description

    • Why? Transparent


Data mining operations

  • Verification driven

    • Validating hypothesis
    • Querying and reporting (spreadsheets, pivot tables)
    • Multidimensional analysis (dimensional summaries); Online Analytical Processing (OLAP)
    • Statistical analysis


Data mining operations

  • Discovery driven

    • Exploratory data analysis
    • Predictive modeling
    • Database segmentation
    • Link analysis
    • Deviation detection


Data mining process



Data mining process

  • Understand application domain

    • Prior knowledge, user goals
  • Create target dataset

    • Select data, focus on subsets
  • Data cleaning and transformation

    • Remove noise, outliers, missing values
    • Select features, reduce dimensions


Data mining process

  • Apply data mining algorithm

    • Associations, sequences, classification, clustering, etc.
  • Interpret, evaluate and visualize patterns

    • What's new and interesting?
    • Iterate if needed
  • Manage discovered knowledge



Data mining process



Related fields

  • AI

  • Machine learning

  • Statistics

  • Databases and data warehousing

  • High performance computing

  • Visualization



Need for data mining tools

  • Human analysis breaks down with volume and dimensionality

    • How quickly can one digest 1 million records with 100 attributes?
    • High rate of growth, changing sources
  • What is done by non-statisticians?

    • Select a few fields and fit simple models or attempt to visualize


Conference overview

  • Overview of KDD and data mining

  • Data mining techniques

  • Demo

  • Summary

  • KDD resources pointers



Data mining methods

  • Predictive modeling (classification, regression)

  • Segmentation (clustering)

  • Dependency modeling (graphical models, density estimation)

  • Summarization (associations)

  • Change and deviation detection



Data mining techniques

  • Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both)

  • Sequence mining (categorical): discover sequences of events that commonly occur together, e.g., in a set of DNA sequences, ACGTC is followed by GTCA after a gap of 9, with 30% probability



Data mining techniques

  • Case-based reasoning (CBR) or similarity search: given a database of objects and a “query” object, find the object(s) that are within a user-defined distance of the query object, or find all pairs within some distance of each other.

  • Deviation detection: find the record(s) that is (are) the most different from the other records, i.e., find all outliers. These may be thrown away as noise or may be the “interesting” ones.
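
The similarity-search item above can be illustrated with a minimal sketch. It assumes objects are numeric tuples compared by Euclidean distance; the function names `within` and `close_pairs` and the toy data are hypothetical, and the brute-force scans are for illustration only.

```python
import math
from itertools import combinations

def euclid(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def within(objects, query, radius):
    """Range query: every object within `radius` of `query` (linear scan)."""
    return [o for o in objects if euclid(o, query) <= radius]

def close_pairs(objects, radius):
    """All unordered pairs of objects within `radius` of each other."""
    return [(a, b) for a, b in combinations(objects, 2) if euclid(a, b) <= radius]

# Hypothetical toy database of 2-D objects.
objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
near_query = within(objects, (0.5, 0.0), 1.0)
pairs = close_pairs(objects, 1.0)
```

Both functions compare against every object, which is quadratic for the all-pairs case; real systems usually rely on an index structure instead.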



Data mining techniques

  • Classification and regression: assign a new data record to one of several predefined categories or classes. Regression deals with predicting real-valued fields. Also called supervised learning.

  • Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within-group similarity and low between-group similarity. Also called unsupervised learning.



Data mining techniques

  • Many other methods, such as

    • Decision trees
    • Neural networks
    • Genetic algorithms
    • Hidden Markov models
    • Time series
    • Bayesian networks
    • Soft computing: rough and fuzzy sets


Research challenges for KDD

  • Scalability

    • Efficient and sufficient sampling
    • In-memory vs. disk-based processing
    • High performance computing
  • Automation

    • Ease of use
    • Using prior knowledge


Types of data mining tasks

  • General descriptive knowledge

    • Summarizations
    • Symbolic descriptions of subsets
  • Discriminative knowledge

    • Distinguish between K classes
    • Accurate classification (also black box)
    • Separate spaces


Components of DM methods

  • Representation: language for patterns/models, expressive power

  • Evaluation: scoring methods for deciding what is a good fit of model to data

  • Search: method for enumerating patterns/models



Data mining techniques

  • Association rules

  • Sequence mining

  • Classification (decision trees, etc.)

  • Clustering

  • Deviation detection

  • K-nearest neighbors



What is association mining?

  • Given a set of items/attributes, and a set of objects containing a subset of the items

  • Find rules: if I1 then I2 (sup, conf)

  • I1, I2 are sets of items

  • I1, I2 have sufficient support: P(I1+I2), the fraction of objects that contain every item of both I1 and I2

  • Rule has sufficient confidence: P(I2|I1)
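
The two measures above can be sketched in a few lines. This is a minimal illustration, assuming each object is a set of item names; the function names and the toy baskets are hypothetical and not taken from the slides.

```python
from typing import FrozenSet, Iterable, List

def support(transactions: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)

def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate P(consequent | antecedent) as sup(both) / sup(antecedent)."""
    antecedent = frozenset(antecedent)
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so no confidence can be estimated
    return support(transactions, antecedent | frozenset(consequent)) / base

# Hypothetical toy baskets.
baskets = [
    frozenset({"cookies", "milk"}),
    frozenset({"cookies", "milk", "bread"}),
    frozenset({"cookies", "bread"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
]
rule_support = support(baskets, {"cookies", "milk"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"cookies"}, {"milk"})  # 2 of 3 cookie baskets
```

A rule "if cookies then milk" would be reported only if both values reach the user's thresholds.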





Support & Confidence



Support & Confidence



Association Mining ex.



What is association mining?





What is sequence mining?

  • Given a set of items and, for each sequence, a list of events ordered in time

  • Find rules: if S1 then S2 (sup, conf)

  • S1, S2 are sequences of items

  • S1, S2 have sufficient support: P(S1+S2)

  • Rule has sufficient confidence: P(S2|S1)



Sequence mining

  • User specifies “interestingness”

    • Minimum support (minsup)
    • Minimum confidence (minconf)
  • Find all frequent sequences (> minsup)

    • Exponential Search Space
    • Computation and I/O Intensive
  • Generate strong rules (> minconf)

    • Relatively cheap
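
The support and confidence tests for sequence rules can be sketched as follows. This is a simplified illustration, assuming every event is a single item and that a pattern matches when its items appear in order (gaps allowed); the thresholds, names, and toy click sequences are hypothetical. It only scores a given rule and does not search the exponential space of candidate sequences.

```python
from typing import List, Sequence

def contains(seq: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs in `seq` in order, not necessarily contiguously."""
    it = iter(seq)
    # `item in it` advances the iterator, so later items must appear later.
    return all(item in it for item in pattern)

def seq_support(db: List[Sequence[str]], pattern: Sequence[str]) -> float:
    """Fraction of sequences in the database that contain the pattern."""
    return sum(contains(s, pattern) for s in db) / len(db)

def rule_confidence(db: List[Sequence[str]],
                    s1: Sequence[str], s2: Sequence[str]) -> float:
    """Estimate P(S2 follows | S1 occurs) as sup(S1 then S2) / sup(S1)."""
    base = seq_support(db, s1)
    return seq_support(db, list(s1) + list(s2)) / base if base else 0.0

# Hypothetical toy database of click sequences.
db = [
    ["login", "search", "cart", "pay"],
    ["login", "search", "logout"],
    ["search", "cart", "pay"],
    ["login", "cart"],
]
minsup, minconf = 0.4, 0.6  # illustrative user-chosen thresholds
sup = seq_support(db, ["search", "pay"])
conf = rule_confidence(db, ["search"], ["pay"])
is_strong = sup > minsup and conf > minconf
```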


Predictive modeling

  • A “black box” that makes predictions about the future based on information from the past and present

  • Large number of inputs available



Models

  • Some models are better than others

    • Accuracy
    • Understandability
  • Models range from easy to understand to incomprehensible

    • Decision trees
    • Rule induction
    • Regression models
    • Neural networks


What is Classification?

  • Classification is the process of assigning new objects to predefined categories or classes

  • Given a set of labeled records

  • Build a model (decision tree)

  • Predict labels for future unlabeled records



Classification learning

  • Supervised learning (labels known)

  • Example described in terms of attributes

    • Categorical (unordered symbolic values)
    • Numeric (integers, reals)
  • Class (output/predicted attribute): categorical for classification, numeric for regression
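
As a minimal sketch of learning from labeled records with categorical attributes, the code below fits a one-level decision tree (a decision stump), not a full tree learner. It picks the single attribute whose value-to-majority-class mapping makes the fewest training errors, then reads the tree off as if-then rules. The attribute names, labels, and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, str]
Model = Tuple[str, Dict[str, str], str]  # (split attribute, value -> class, default class)

def majority(labels: List[str]) -> str:
    """Most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(records: List[Record], labels: List[str]) -> Model:
    """Choose the attribute whose per-value majority rule has the fewest training errors."""
    best = None
    for attr in records[0]:
        groups = defaultdict(list)
        for rec, lab in zip(records, labels):
            groups[rec[attr]].append(lab)
        rule = {value: majority(labs) for value, labs in groups.items()}
        errors = sum(rule[rec[attr]] != lab for rec, lab in zip(records, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    _, attr, rule = best
    return attr, rule, majority(labels)

def predict(model: Model, record: Record) -> str:
    """Apply the stump; unseen attribute values fall back to the overall majority class."""
    attr, rule, default = model
    return rule.get(record[attr], default)

# Hypothetical labeled training records.
records = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay", "play", "play"]

model = fit_stump(records, labels)
# Each leaf of the one-level tree becomes one rule.
rules = [f"if {model[0]} = {value} then {label}" for value, label in sorted(model[1].items())]
```

A full decision-tree learner would apply the same kind of split recursively and typically uses an impurity measure rather than raw error counts.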



Decision-tree classification



From tree to rules



What is clustering?

  • Given N k-dimensional feature vectors, find a “meaningful” partition of the N examples into c subsets or groups

  • Discover the “labels” automatically

  • c may be given, or “discovered”

  • Much more difficult than classification, since in the latter the groups are given, and we seek a compact description



Clustering

  • Have to define some notion of “similarity” between examples

  • Goal: maximize intra-cluster similarity and minimize inter-cluster similarity

  • Feature vectors may be

    • All numeric (well defined distances)
    • All categorical or mixed (harder to define similarity; geometric notions don’t work)


Clustering schemes

  • Distance-based

    • Numeric
      • Euclidean distance (root of sum of squared differences along each dimension)
      • Angle between two vectors
    • Categorical
      • Number of common features (categorical)
  • Partition-based

    • Enumerate partitions and score each
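
The distance and similarity measures listed above can be written out directly. This is a small sketch, assuming vectors are equal-length tuples and, for the angle, non-zero; the function names are illustrative only.

```python
import math

def euclidean(a, b):
    """Root of the sum of squared differences along each dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angle(a, b):
    """Angle in radians between two non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.acos(cosine)

def common_features(a, b):
    """Similarity for categorical vectors: number of positions with equal values."""
    return sum(x == y for x, y in zip(a, b))

d = euclidean((0.0, 0.0), (3.0, 4.0))
theta = angle((1.0, 0.0), (0.0, 1.0))
shared = common_features(("red", "small", "round"), ("red", "large", "round"))
```

Note that Euclidean distance and angle are dissimilarities (smaller means closer), whereas the common-feature count is a similarity (larger means closer).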


K-means algorithm
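
The slide content for this algorithm was not preserved, so what follows is a minimal sketch of the standard K-means iteration rather than the presenter's own formulation: pick k initial centers, assign each point to its nearest center, recompute each center as the mean of its points, and repeat until the assignment stops changing. Random initialization from the data, the seed, and the toy points are assumptions.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]

def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (enough for nearest-center comparisons)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points: List[Point], k: int, max_iter: int = 100, seed: int = 0):
    """Return (centers, assignment), where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k distinct data points
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return centers, assignment

# Hypothetical toy data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignment = kmeans(points, k=2)
```

In general the result depends on the initial centers, so the algorithm is often run several times with different seeds and the best-scoring partition is kept.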



Deviation detection
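
The slide content here was not preserved. As one possible illustration of finding the records most different from the rest, the sketch below flags values whose z-score exceeds a threshold. The z-score rule, the threshold of 2.0, and the sample readings are assumptions, not the presenter's method.

```python
from statistics import mean, pstdev
from typing import List

def zscore_outliers(values: List[float], threshold: float = 2.0) -> List[float]:
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0]
outliers = zscore_outliers(readings)
```

Whether a flagged record is noise to discard or the interesting finding is a decision for the analyst.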



K-nearest neighbors

  • Classification technique to assign a class to a new example

  • Find k-nearest neighbors, i.e., most similar points in the dataset (compare against all points!)

  • Assign the new case to the same class to which most of its neighbors belong
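
The three steps above can be sketched as follows, assuming numeric feature vectors, Euclidean distance, and a simple majority vote; the function name, k = 3, and the labeled points are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, ...]

def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance; preserves the nearest-neighbor ordering."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(train: List[Tuple[Point, str]], query: Point, k: int = 3) -> str:
    """Label `query` with the most common class among its k nearest training points."""
    # Brute force: the query is compared against every training point.
    neighbors = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training set.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((1.0, 0.5), "A"),
    ((8.0, 8.0), "B"), ((9.0, 9.5), "B"), ((8.5, 9.0), "B"),
]
label_near_a = knn_classify(train, (2.0, 1.5))
label_near_b = knn_classify(train, (7.5, 8.5))
```

Choosing an odd k avoids ties in two-class problems; the full scan per query is the cost noted in the second bullet.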



K-nearest neighbors



Conference overview

  • Overview of KDD and data mining

  • Data mining techniques

  • Demo

  • Research Trends

  • Summary

  • KDD resources pointers



Conference overview

  • Overview of KDD and data mining

  • Data mining techniques

  • Demo

  • Summary

  • KDD resources pointers



Conclusions

  • Scientific and economic need for KDD

  • Made possible by recent advances in data collection, processing power, and sophisticated techniques from AI, databases and visualization

  • KDD is a complex process

  • Several techniques need to be used



Conclusions

  • Need for rich knowledge representation

  • Need to integrate specific domain knowledge.

  • KDD using Fuzzy-categorical and Uncertainty Techniques

  • Web Mining and User profile

  • KDD for bioinformatics



KDD resources pointers

  • ACM SIGKDD: www.acm.org/sigkdd

  • KDD Nuggets: www.kdnuggets.com

  • Book: Advances in KDD, MIT Press, ’96

  • Journal: Data Mining and KDD,

    • research.microsoft.com/datamine



