

An Introduction to Data Mining


Why Data Mining?

  • Credit ratings/targeted marketing:

    • Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
    • Identify likely responders to sales promotions
  • Fraud detection

    • Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
  • Customer relationship management:

    • Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?


Data mining

  • Process of semi-automatically analyzing large databases to find patterns that are:

    • valid: hold on new data with some certainty
    • novel: non-obvious to the system
    • useful: should be possible to act on the pattern
    • understandable: humans should be able to interpret the pattern
  • Also known as Knowledge Discovery in Databases (KDD)



Applications

  • Banking: loan/credit card approval

    • predict good customers based on old customers
  • Customer relationship management:

    • identify those who are likely to leave for a competitor.
  • Targeted marketing:

    • identify likely responders to promotions
  • Fraud detection: telecommunications, financial transactions

    • from an online stream of events, identify the fraudulent ones
  • Manufacturing and production:

    • automatically adjust control knobs when process parameters change


Applications (continued)

  • Medicine: disease outcome, effectiveness of treatments

    • analyze patient disease history: find relationships between diseases
  • Molecular/Pharmaceutical: identify new drugs

  • Scientific data analysis:

    • identify new galaxies by searching for sub-clusters
  • Web site/store design and promotion:

    • find affinity of visitor to pages and modify layout


The KDD process

  • Problem formulation

  • Data collection

    • subset data: sampling can hurt if the data is highly skewed
    • feature selection: principal component analysis, heuristic search
  • Pre-processing: cleaning

    • name/address cleaning, reconciling terms with the same meaning (annual vs. yearly), duplicate removal, supplying missing values
  • Transformation:

    • map complex objects (e.g. time series data) to simpler features (e.g. frequency)
  • Choosing the mining task and mining method

  • Result evaluation and visualization
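As a concrete illustration, here is a minimal sketch of these stages as a scikit-learn pipeline; the library, the parameter values, and the bundled toy dataset standing in for data collection are all illustrative assumptions, not part of the original slides:

```python
# Hypothetical sketch: the KDD stages expressed as a scikit-learn pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer            # pre-processing: supply missing values
from sklearn.preprocessing import StandardScaler    # transformation
from sklearn.decomposition import PCA               # feature selection via principal components
from sklearn.tree import DecisionTreeClassifier     # chosen mining method
from sklearn.metrics import classification_report   # result evaluation

X, y = load_breast_cancer(return_X_y=True)          # data collection (toy stand-in)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("clean", SimpleImputer(strategy="mean")),      # no-op here; fills missing values if present
    ("transform", StandardScaler()),
    ("select", PCA(n_components=10)),
    ("mine", DecisionTreeClassifier(max_depth=4)),
])
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```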



Relationship with other fields

  • Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on

    • scalability of number of features and instances
    • stress on algorithms and architectures, with the foundations of the methods and formulations provided by statistics and machine learning
    • automation for handling large, heterogeneous data


Some basic operations

  • Predictive:

    • Regression
    • Classification
    • Collaborative Filtering
  • Descriptive:

    • Clustering / similarity matching
    • Association rules and variants
    • Deviation detection


  • Classification (Supervised learning)



Classification

  • Given old data about customers and payments, predict a new applicant’s loan eligibility.



Classification methods

  • Goal: predict the class Ci = f(x1, x2, ..., xn)

  • Regression (linear or any other polynomial):

    • a*x1 + b*x2 + c = Ci.
  • Nearest neighbour

  • Decision tree classifier: divides the decision space into piecewise-constant regions.

  • Probabilistic/generative models

  • Neural networks: partition by non-linear boundaries
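To make the menu concrete, here is a hedged sketch that runs one representative of each family above on a synthetic problem (scikit-learn and all parameter choices are assumptions for illustration):

```python
# Sketch: one representative per classifier family, cross-validated on toy data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression   # linear, regression-style boundary
from sklearn.neighbors import KNeighborsClassifier    # nearest neighbour
from sklearn.tree import DecisionTreeClassifier       # piecewise-constant regions
from sklearn.naive_bayes import GaussianNB            # probabilistic/generative
from sklearn.neural_network import MLPClassifier      # non-linear boundaries

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
models = {
    "regression":        LogisticRegression(max_iter=1000),
    "nearest neighbour": KNeighborsClassifier(n_neighbors=5),
    "decision tree":     DecisionTreeClassifier(max_depth=5),
    "naive Bayes":       GaussianNB(),
    "neural network":    MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000),
}
for name, model in models.items():
    print(f"{name}: mean accuracy {cross_val_score(model, X, y, cv=5).mean():.2f}")
```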



Nearest neighbor

  • Define proximity between instances, find the neighbors of the new instance, and assign the majority class

  • Case-based reasoning: used when attributes are more complex than real-valued vectors.
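A minimal numpy sketch of the rule just described, assuming Euclidean distance as the proximity measure and made-up training data:

```python
# Nearest-neighbour rule: measure proximity, take the k closest, vote on the class.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)  # proximity to each instance
    neighbours = np.argsort(distances)[:k]               # k nearest training points
    return Counter(y_train[neighbours]).most_common(1)[0][0]  # majority class

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array(["good", "good", "risky", "risky"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> good
```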



Decision trees

  • Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.



Decision tree classifiers

  • Widely used learning method

  • Easy to interpret: can be re-represented as if-then-else rules

  • Approximates the function by piecewise-constant regions

  • Does not require prior knowledge of the data distribution; works well on noisy data.

  • Has been applied to a wide range of practical domains.
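The re-representation as if-then-else rules can be seen directly; a small sketch, assuming scikit-learn's `export_text` helper and its bundled iris toy data:

```python
# Fit a shallow tree and print it as nested if-then-else rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```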



Pros and Cons of decision trees



Neural network

  • Set of nodes connected by directed weighted edges



Neural networks

  • Useful for learning from complex data, e.g. handwriting, speech, and image recognition
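A hedged sketch of such a network on the bundled handwritten-digits toy set; scikit-learn's `MLPClassifier` and the layer size are illustrative assumptions:

```python
# A small feed-forward network: layers of nodes joined by weighted directed edges.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print(f"test accuracy: {net.score(X_test, y_test):.2f}")
```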



Pros and Cons of Neural Networks



Bayesian learning

  • Assume a probability model for how the data is generated.

  • Apply Bayes' theorem to find the most likely class: C* = argmax_C P(C) · P(x1, ..., xn | C)

  • Naïve Bayes: assume attributes are conditionally independent given the class value, so P(x1, ..., xn | C) = Π_i P(xi | C)

  • Easy to learn the probabilities by counting

  • Useful in some domains, e.g. text
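A minimal sketch of "probabilities by counting" on text, assuming scikit-learn's multinomial naive Bayes over word counts; the documents and labels are made up:

```python
# Word counts -> class-conditional probabilities by counting -> Bayes' rule.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap loan offer", "meeting at noon", "win a free prize", "project status report"]
labels = ["spam", "ham", "spam", "ham"]
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["free loan prize"]))  # -> ['spam']
```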



  • Clustering or Unsupervised Learning



Clustering

  • Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product.

  • Group/cluster existing customers based on the time series of their payment history, so that similar customers fall in the same cluster.

  • Key requirement: Need a good measure of similarity between instances.

  • Identify micro-markets and develop policies for each



Applications

  • Customer segmentation e.g. for targeted marketing

    • Group/cluster existing customers based on the time series of their payment history, so that similar customers fall in the same cluster.
    • Identify micro-markets and develop policies for each
  • Collaborative filtering:

    • group based on common items purchased
  • Text clustering

  • Compression



Distance functions

  • Numeric data: Euclidean, Manhattan distances

  • Categorical data: 0/1 to indicate presence/absence followed by

    • Hamming distance (number of positions that differ)
    • Jaccard coefficient: number of shared 1s / number of positions with a 1 in either vector
    • data-dependent measures: the similarity of A and B depends on their co-occurrence with C
  • Combined numeric and categorical data:

    • weighted normalized distance, e.g. d(x, y) = Σ_i w_i · d_i(x_i, y_i) with each per-attribute distance d_i normalized to [0, 1]
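The measures above are all available off the shelf; a sketch assuming scipy, with made-up vectors:

```python
# Numeric and categorical distance functions via scipy.spatial.distance.
import numpy as np
from scipy.spatial import distance

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(distance.euclidean(a, b))  # numeric: straight-line distance
print(distance.cityblock(a, b))  # numeric: Manhattan distance

u, v = np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])  # categorical as 0/1 presence
print(distance.hamming(u, v))    # fraction of positions that differ
print(distance.jaccard(u, v))    # 1 - (shared 1s) / (positions with a 1 in either)
```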


Clustering methods

  • Hierarchical clustering

    • agglomerative vs. divisive
    • single link vs. complete link
  • Partitional clustering

    • distance-based: K-means
    • model-based: EM
    • density-based


Agglomerative Hierarchical clustering

  • Given: matrix of similarity between every point pair

  • Start with each point in a separate cluster and merge clusters based on some criteria:

    • Single link: merge the two clusters whose closest pair of points (one from each cluster) is nearest
    • Complete link: merge two clusters such that all points in one cluster are “close” to all points in the other.
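A short sketch of both linkage criteria, assuming scipy's hierarchical-clustering routines and made-up points:

```python
# Agglomerative clustering: single vs. complete link, cut into three clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
for method in ("single", "complete"):
    Z = linkage(points, method=method)                # sequence of merges
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree at 3 clusters
    print(method, labels)
```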


Partitional methods: K-means

  • Criterion: minimize the sum of squared distances

      • between each point and the centroid of its cluster, or
      • between each pair of points in the cluster
  • Algorithm:

    • Select an initial partition with K clusters: random, the first K points, or K well-separated points
    • Repeat until stabilization:
      • Assign each point to closest cluster center
      • Generate new cluster centers
      • Adjust clusters by merging/splitting
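A minimal numpy sketch of this loop; random initialization is assumed, and the merge/split adjustment and empty-cluster handling are omitted for brevity:

```python
# k-means: assign points to the closest center, recompute centers, repeat.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial partition
    for _ in range(n_iter):
        # assign each point to the closest cluster center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # generate new cluster centers (assumes no cluster goes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # stabilization
            break
        centers = new_centers
    return labels, centers

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
print(kmeans(X, k=2)[0])  # cluster label per point
```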


Collaborative Filtering

  • Given a database of user preferences, predict the preferences of a new user

  • Example: predict what new movies you will like based on

    • your past preferences
    • others with similar past preferences
    • their preferences for the new movies
  • Example: predict what books/CDs a person may want to buy

    • (and suggest them, or offer discounts to tempt the customer)
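A hedged sketch of the movie example: predict a missing rating as the similarity-weighted average of other users' ratings. The ratings matrix is made up, and cosine similarity is one assumed choice of proximity:

```python
# User-based collaborative filtering on a tiny made-up ratings matrix.
import numpy as np

# rows = users, columns = movies; 0 means "not rated"
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 4, 0],
                    [1, 0, 2, 5]], dtype=float)

def predict(user, movie):
    target, num, den = ratings[user], 0.0, 0.0
    for other in range(len(ratings)):
        if other == user or ratings[other, movie] == 0:
            continue
        both = (target > 0) & (ratings[other] > 0)   # movies both users rated
        if not both.any():
            continue
        a, b = target[both], ratings[other][both]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity
        num += sim * ratings[other, movie]
        den += sim
    return num / den if den else None

print(predict(user=0, movie=2))  # predicted rating, weighted toward similar users
```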


Collaborative recommendation



Cluster-based approaches

  • Cluster on external attributes of people and movies:

    • age, gender of people
    • actors and directors of movies.
    • (may not be available)
  • Cluster people based on movie preferences

    • misses information about similarity of movies
  • Repeated clustering:

    • cluster movies based on people, then people based on movies, and repeat
    • ad hoc, might smear out groups


Example of clustering



Model-based approach

  • People and movies belong to unknown classes

  • P_k = probability that a random person is in class k

  • P_l = probability that a random movie is in class l

  • P_kl = probability that a class-k person likes a class-l movie

  • Gibbs sampling: iterate

    • Pick a person or movie at random and assign it to a class with probability proportional to P_k or P_l
    • Estimate new parameters
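A loose sketch of the sampling loop on a made-up binary people-by-movies "likes" matrix. For brevity only person classes are resampled (movies are analogous), and the resampling probability multiplies the prior P_k by the likelihood of the person's row, one common reading of the step above:

```python
# Gibbs-style resampling of latent person classes in the two-sided class model.
import numpy as np

rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(30, 12))   # people x movies, 1 = liked (made up)
K = L = 2
zp = rng.integers(0, K, 30)                 # latent person classes
zm = rng.integers(0, L, 12)                 # latent movie classes

def estimate():
    """Estimate P_k and P_kl from current assignments (smoothed to avoid zeros)."""
    Pk = (np.bincount(zp, minlength=K) + 1) / (len(zp) + K)
    Pkl = np.full((K, L), 0.5)
    for k in range(K):
        for l in range(L):
            block = likes[zp == k][:, zm == l]
            if block.size:
                Pkl[k, l] = (block.sum() + 1) / (block.size + 2)
    return Pk, Pkl

for _ in range(200):
    Pk, Pkl = estimate()                    # estimate new parameters
    i = rng.integers(len(zp))               # pick a person at random
    logp = np.log(Pk)                       # class prior P_k ...
    for k in range(K):
        p = Pkl[k, zm]                      # ... times likelihood of the person's row
        logp[k] += np.sum(likes[i] * np.log(p) + (1 - likes[i]) * np.log(1 - p))
    probs = np.exp(logp - logp.max())
    zp[i] = rng.choice(K, p=probs / probs.sum())  # reassign with these probabilities

print(np.bincount(zp, minlength=K))         # person-class sizes after sampling
```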


Association Rules



Association rules

  • Given a set T of groups of items

  • Example: set of item sets purchased

  • Goal: find all rules on itemsets of the form a --> b such that

    • support of {a, b} > a user threshold s
    • conditional probability (confidence) of b given a > a user threshold c
  • Example: Milk --> bread

  • Purchase of product A --> service B
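A brute-force sketch of checking one candidate rule against these two thresholds; the transactions are made up, and real miners use Apriori-style counting rather than a full scan per rule:

```python
# Support and confidence of a single rule a --> b over a list of baskets.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def rule_stats(a, b):
    n = len(transactions)
    n_a  = sum(1 for t in transactions if a <= t)        # baskets containing a
    n_ab = sum(1 for t in transactions if (a | b) <= t)  # baskets containing a and b
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0              # P(b | a)
    return support, confidence

s, c = rule_stats({"milk"}, {"bread"})
print(f"milk --> bread: support {s:.2f}, confidence {c:.2f}")  # 0.50, 0.67
```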



Variants

  • High confidence may not imply high correlation

  • Use correlations: compute the expected support under independence and flag large departures from it as interesting

    • see statistical literature on contingency tables.
  • Still too many rules; need to prune...
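One common correlation-style check is lift, the ratio of observed support to the support expected under independence; a sketch with made-up baskets where confidence and correlation come apart:

```python
# Lift: observed support of {a, b} divided by the support expected if a and b
# were independent. Values near 1 mean "prevalent but uncorrelated".
transactions = [{"milk", "bread"}, {"milk", "bread"}, {"bread"}, {"milk", "bread"}]

def lift(a, b):
    n = len(transactions)
    p_a  = sum(1 for t in transactions if a <= t) / n
    p_b  = sum(1 for t in transactions if b <= t) / n
    p_ab = sum(1 for t in transactions if (a | b) <= t) / n
    return p_ab / (p_a * p_b)

# milk --> bread has confidence 1.0 here, yet lift 1.0: bread is in every
# basket, so the rule reflects prevalence, not correlation.
print(lift({"milk"}, {"bread"}))
```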



Prevalent ≠ Interesting

  • Analysts already know about prevalent rules

  • Interesting rules are those that deviate from prior expectation

  • Mining’s payoff is in finding surprising phenomena



What makes a rule surprising?

  • Does not match prior expectation

    • e.g. the expectation that the correlation between milk and cereal remains roughly constant over time


Applications of fast itemset counting

  • Find correlated events

  • Applications in medicine: find redundant tests

  • Cross selling in retail, banking

  • Improve predictive capability of classifiers that assume attribute independence

  • New similarity measures of categorical attributes [Mannila et al, KDD 98]



Data Mining in Practice



Application Areas



Why Now?

  • Data is being produced

  • Data is being warehoused

  • The computing power is available

  • The computing power is affordable

  • The competitive pressures are strong

  • Commercial products are available



Data Mining works with Warehouse Data

  • Data warehousing provides the enterprise with a memory; data mining provides it with intelligence



Usage scenarios

  • Data warehouse mining:

    • assimilate data from operational sources
    • mine static data
  • Mining log data

  • Continuous mining: e.g. in process control

  • Stages in mining:

    • data selection → pre-processing (cleaning) → transformation → mining → result evaluation → visualization


Mining market

  • Around 20 to 30 mining tool vendors

  • Major tool players:

    • Clementine,
    • IBM’s Intelligent Miner,
    • SGI’s MineSet,
    • SAS’s Enterprise Miner.
  • All offer pretty much the same set of tools

  • Many embedded products:

    • fraud detection,
    • electronic commerce applications,
    • health care,
    • customer relationship management: Epiphany


Vertical integration: Mining on the web

  • Web log analysis for site design:

    • what are popular pages,
    • what links are hard to find.
  • Sales enhancements for electronic stores:

    • recommendations, advertisements
    • Collaborative filtering: Net perception, Wisewire
    • Inventory control: what was a shopper looking for but could not find?


OLAP Mining integration

  • OLAP (Online Analytical Processing)

    • Fast, interactive exploration of multidimensional aggregates.
    • Heavy reliance on manual operations for analysis, which is tedious and error-prone on large multidimensional data
  • Ideal platform for vertical integration of mining but needs to be interactive instead of batch.



State of the art in mining-OLAP integration

  • Decision trees [Information discovery, Cognos]

    • find factors influencing high profits
  • Clustering [Pilot software]

    • segment customers to define hierarchy on that dimension
  • Time series analysis: [Seagate’s Holos]

    • Query for various shapes along time, e.g. spikes, outliers
  • Multi-level Associations [Han et al.]

    • find associations between members of dimensions
  • Sarawagi [VLDB2000]



Data Mining in Use

  • The US Government uses Data Mining to track fraud

  • A Supermarket becomes an information broker

  • Basketball teams use it to track game strategy

  • Cross Selling

  • Target Marketing

  • Holding on to Good Customers

  • Weeding out Bad Customers



Some success stories

  • Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data

    • Won over a (manual) knowledge-engineering approach
    • http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process
  • Major US bank: customer attrition prediction

    • First segment customers based on financial behavior: found 3 segments
    • Build attrition models for each of the 3 segments
    • 40-50% of attritions were predicted, a factor-of-18 increase
  • Targeted credit marketing: major US banks

    • find customer segments based on 13 months of credit balances
    • build a response model based on surveys
    • increased the response rate four-fold, to 2%

