What is data mining?  Methods for finding interesting structure in large databases
 E.g. patterns, prediction rules, unusual cases
 Focus on efficient, scalable algorithms
 Contrasts with emphasis on correct inference in statistics
 Related to data warehousing, machine learning
Why is data mining important?  Well marketed; now a large industry; pays well
 Handles large databases directly
 Can make data analysis more accessible to end users
 Semiautomation of analysis
 Results can be easier to interpret than e.g. regression models
 Strong focus on decisions and their implementation
CRISP-DM Process Model
Data Mining Software Many providers of data mining software  SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner, MS SQL Server, Polyanalyst, KnowledgeSTUDIO, …
 See http://www.kdnuggets.com/software/suites.html for a list
 Good algorithms important, but also need good facilities for handling data and metadata
We’ll use:  WEKA (Waikato Environment for Knowledge Analysis)
 Free (GPLed) Java package with GUI
 Online at www.cs.waikato.ac.nz/ml/weka
 Witten and Frank, 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations.
 R packages
 E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree, mlbench, randomForest, subselect
Data Mining Terms  Observation = case, record, instance
 Variable = field, attribute
 Analysis of dependence vs interdependence = Supervised vs unsupervised learning
 Relationship = association, concept
 Dependent variable = response, output
 Independent variable = predictor, input
Common Data Mining Techniques Predictive modeling  Classification
 Derive classification rules
 Decision trees
 Numeric prediction
 Regression trees, model trees
Association rules Meta-learning methods  Cross-validation, bagging, boosting
Other data mining methods include:  artificial neural networks, genetic algorithms, density estimation, clustering, abstraction, discretisation, visualisation, detecting changes in data or models
Classification Methods for predicting a discrete response  One kind of supervised learning
 Note: in biological and other sciences, classification has long had a different meaning, referring to cluster analysis
Applications include:  Identifying good prospects for specific marketing or sales efforts
 Cross-selling, up-selling – when to offer products
 Customers likely to be especially profitable
 Customers likely to defect
 Identifying poor credit risks
 Diagnosing customer problems
Weather/Game-Playing Data Small dataset  14 instances
 5 attributes
 Outlook  nominal
 Temperature  numeric
 Humidity  numeric
 Windy  nominal
 Play
 Whether or not a certain game would be played
 This is what we want to understand and predict
ARFF file for the weather data.
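For orientation, the start of that file looks roughly as follows (attribute declarations from the standard WEKA weather dataset; the @data section then lists all 14 instances as comma-separated rows, of which only the first two are shown here):
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
...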
German Credit Risk Dataset 1000 instances (people), 21 attributes  “class” attribute describes people as good or bad credit risks
 Other attributes include financial information and demographics
 E.g. checking_status, duration, credit_history, purpose, credit_amount, savings_status, employment, Age, housing, job, num_dependents, own_telephone, foreign_worker
Want to predict credit risk Data available at UCI machine learning data repository  http://www.ics.uci.edu/~mlearn/MLRepository.html
 and on 747 web page
 http://www.stat.auckland.ac.nz/~reilly/creditg.arff
Classification Algorithms Many methods available in WEKA  0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM, Instance-based learner (IB1, IBk), C4.5 (J48), PART, Support vector machine (SMO)
Usually train on part of the data, test on the rest Simple method – zero rule, or 0R  Predict the most common category
 Too simple for practical use, but a useful baseline for evaluating performance of more complex methods
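A minimal sketch of this baseline in R (the class counts below are those of the weather data):
zero_r <- function(y) names(which.max(table(y)))   # predict the majority class
play <- factor(c(rep("yes", 9), rep("no", 5)))     # 9 yes, 5 no in the weather data
zero_r(play)                                       # "yes"; baseline training error = 5/14, about 36%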
1-Rule (1R) Algorithm Based on a single predictor  Predict the mode within each value of that predictor
Look at error rate for each predictor on training dataset, and choose best predictor Called OneR in WEKA Must group numerical predictor values for this method  Common method is to split at each change in the response
 Collapse buckets until each contains at least 6 instances
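A rough R sketch of the nominal-attribute case (the grouping of numeric predictors described above is omitted; the weather data frame in the final comment is illustrative):
one_r <- function(x, y) {
  # predict the modal class within each value of the predictor x
  rule <- tapply(y, x, function(v) names(which.max(table(v))))
  err  <- mean(rule[as.character(x)] != y)          # training error rate for this predictor
  list(rule = rule, error = err)
}
# e.g. one_r(weather$outlook, weather$play); repeat for each predictor and keep the best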
1R Algorithm (continued) Biased towards predictors with more categories  These can result in overfitting to the training data
But found to perform surprisingly well  Study on 16 widely used datasets
 Holte (1993), Machine Learning 11, 63-91
 Often error rate only a few percentage points higher than more sophisticated methods (e.g. decision trees)
 Produced rules that were much simpler and more easily understood
Naïve Bayes Method Calculates probabilities of each response value, assuming independence of attribute effects Response value with highest probability is predicted Numeric attributes are assumed to follow a normal distribution within each response value  Contribution to probability calculated from normal density function
 Instead can use kernel density estimate, or simply discretise the numerical attributes
Naïve Bayes Calculations Observed counts and probabilities above  Temperature and humidity have been discretised
Consider new day  Outlook=sunny, temperature=cool, humidity=high, windy=true
 Probability(play=yes) ∝ 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
 Probability(play=no) ∝ 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
 Probability(play=no) = 0.0206/(0.0053+0.0206) = 79.5%
 “no” four times more likely than “yes”
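These figures can be reproduced directly in R, using the counts from the table above:
p_yes <- 2/9 * 3/9 * 3/9 * 3/9 * 9/14   # = 0.0053
p_no  <- 3/5 * 1/5 * 4/5 * 3/5 * 5/14   # = 0.0206
p_no / (p_yes + p_no)                   # = 0.795, so "no" is predicted with probability 79.5%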
Naïve Bayes Method If any of the component probabilities are zero, the whole probability is zero  Effectively a veto on that response value
 Add one to each cell’s count to get around this problem
 Corresponds to weak positive prior information
Naïve Bayes effectively assumes that attributes are equally important  Several highly correlated attributes could drown out an important variable that would add new information
However this method often works well in practice
Decision Trees Classification rules can be expressed in a tree structure  Move from the top of the tree, down through various nodes, to the leaves
 At each node, a decision is made using a simple test based on attribute values
 The leaf you reach holds the appropriate predicted value
Decision trees are appealing and easily used  However they can be verbose
 Depending on the tests being used, they may obscure rather than reveal the true pattern
More info online at http://recursivepartitioning.com/
Decision tree with a replicated subtree
Problems with Univariate Splits
Constructing Decision Trees Develop tree recursively  Start with all data in one root node
 Need to choose attribute that defines first split
 For now, we assume univariate splits are used
 For accurate predictions, want leaf nodes to be as pure as possible
 Choose the attribute that maximises the average purity of the daughter nodes
 The measure of purity used is the entropy of the node
 This is the amount of information needed to specify the value of an instance in that node, measured in bits: entropy = −Σj pj log2(pj), where pj is the proportion of the node’s instances with the jth class value

Tree stumps for the weather data
Weather Example First node from outlook split is for “sunny”, with entropy −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971 bits Average entropy of nodes from outlook split is  5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693
Entropy of root node is 0.940 bits, so the gain from splitting on outlook is 0.940 − 0.693 = 0.247 bits Other splits yield:  Gain(temperature)=0.029 bits
 Gain(humidity)=0.152 bits
 Gain(windy)=0.048 bits
So “outlook” is the best attribute to split on
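The entropy and gain figures above can be checked with a few lines of R:
entropy <- function(counts) {             # entropy (in bits) of a class distribution
  p <- counts / sum(counts); p <- p[p > 0]
  -sum(p * log2(p))
}
root  <- entropy(c(9, 5))                                   # 0.940 bits at the root
after <- (5 * entropy(c(2, 3)) + 4 * entropy(c(4, 0)) +
          5 * entropy(c(3, 2))) / 14                        # 0.693 bits after splitting on outlook
root - after                                                # gain(outlook) = 0.247 bits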
Expanded tree stumps for weather data
Decision tree for the weather data
Decision Tree Algorithms The algorithm described in the preceding slides is known as ID3 Tends to choose attributes with many values  Using information gain ratio helps solve this problem
Several more improvements have been made to handle numeric attributes (via univariate splits), missing values and noisy data (via pruning)  Resulting algorithm known as C4.5
 Described by Quinlan (1993)
 Widely used (as is the commercial version C5.0)
 WEKA has a version called J4.8
Classification Trees Described (along with regression trees) in:  L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984. Classification and Regression Trees.
More sophisticated method than ID3  However Quinlan’s (1993) C4.5 method caught up with CART in most areas
CART also incorporates methods for pruning, missing values and numeric attributes  Multivariate splits are possible, as well as univariate
 Split on a linear combination Σj cj xj > d
 CART typically uses Gini measure of node purity to determine best splits
 This is of the form Σ p(1 − p), summed over the classes in the node
 But information/entropy measure also available
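Both impurity measures are one-liners in R; for example, for the weather data’s root node (9 yes, 5 no):
gini <- function(counts) { p <- counts / sum(counts); sum(p * (1 - p)) }
gini(c(9, 5))        # 0.459, versus 0.940 bits of entropy for the same node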
Regression Trees Trees can also be used to predict numeric attributes  Predict using average value of the response in the appropriate node
 Implemented in CART and C4.5 frameworks
 Can use a model at each node instead
 Implemented in Weka’s M5’ algorithm
 Harder to interpret than regression trees
Classification and regression trees are implemented in R’s rpart package  See Ch 10 in Venables and Ripley, MASS 3rd Ed.
Problems with Trees Can be unnecessarily verbose  “Greedy” hierarchical algorithm
 Small variations can change chosen splits at high level nodes, which then changes subtree below
 Conclusions about attribute importance can be unreliable
Direct methods tend to overfit training dataset  This problem can be reduced by pruning the tree
Another approach that often works well is to fit the tree, remove all training cases that are not correctly predicted, and refit the tree on the reduced dataset  Typically gives a smaller tree
 This usually works almost as well on the training data
 But generalises better, e.g. works better on test data
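A hedged sketch of this idea in R with rpart, using the small kyphosis dataset that ships with the package:
library(rpart)
data(kyphosis)
fit1 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
keep <- predict(fit1, kyphosis, type = "class") == kyphosis$Kyphosis    # correctly predicted cases
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[keep, ]) # refit on the reduced data
# fit2 is typically smaller than fit1, fits the training data almost as well, and often generalises better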
Bagging the tree algorithm also gives more stable results  Will discuss bagging later
Classification Tree Example Use Weka’s J4.8 algorithm on German credit data (with default options)  1000 instances, 21 attributes
Produces a pruned tree with 140 nodes, 103 leaves
=== Run information ===

Scheme:       weka.classifiers.j48.J48 -C 0.25 -M 2
Relation:     german_credit
Instances:    1000
Attributes:   21

Number of Leaves  : 103
Size of the tree  : 140

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         739               73.9   %
Incorrectly Classified Instances       261               26.1   %
Kappa statistic                          0.3153
Mean absolute error                      0.3241
Root mean squared error                  0.4604
Relative absolute error                 77.134  %
Root relative squared error            100.4589 %
Total Number of Instances             1000

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.883     0.597     0.775      0.883    0.826       good
 0.403     0.117     0.596      0.403    0.481       bad

=== Confusion Matrix ===

   a   b   <-- classified as
 618  82 |  a = good
 179 121 |  b = bad
Cross-Validation Due to overfitting, cannot estimate prediction error directly on the training dataset Cross-validation is a simple and widely used method for estimating prediction error Simple approach  Set aside a test dataset
 Train learner on the remainder (the training dataset)
 Estimate prediction error by using the resulting prediction model on the test dataset
This is only feasible where there is enough data to set aside a test dataset and still have enough to reliably train the learning algorithm
k-fold Cross-Validation For smaller datasets, use k-fold cross-validation  Split dataset into k roughly equal parts
 For each part, train on the other k-1 parts and use this part as the test dataset
 Do this for each of the k parts, and average the resulting prediction errors
This method measures the prediction error when training the learner on a fraction (k-1)/k of the data If k is small, this will overestimate the prediction error
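A minimal sketch of k-fold cross-validation in R, using rpart and its kyphosis data:
library(rpart)
data(kyphosis)
k <- 10
fold <- sample(rep(1:k, length.out = nrow(kyphosis)))     # random fold assignment
err <- sapply(1:k, function(i) {
  fit  <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[fold != i, ])
  pred <- predict(fit, kyphosis[fold == i, ], type = "class")
  mean(pred != kyphosis$Kyphosis[fold == i])              # error on the held-out fold
})
mean(err)                                                 # cross-validated error rate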
Regression Tree Example
library(rpart)
data(car.test.frame)
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
post(z.auto, file = "")
summary(z.auto)
Call:
rpart(formula = Mileage ~ Weight, data = car.test.frame)
  n= 60

          CP nsplit rel error    xerror       xstd
1 0.59534912      0 1.0000000 1.0322233 0.17981796
2 0.13452819      1 0.4046509 0.6081645 0.11371656
3 0.01282843      2 0.2701227 0.4557341 0.09178782
4 0.01000000      3 0.2572943 0.4659556 0.09134201

Node number 1: 60 observations,    complexity param=0.5953491
  mean=24.58333, MSE=22.57639
  left son=2 (45 obs) right son=3 (15 obs)
  Primary splits:
      Weight < 2567.5 to the right, improve=0.5953491, (0 missing)

Node number 2: 45 observations,    complexity param=0.1345282
  mean=22.46667, MSE=8.026667
  left son=4 (22 obs) right son=5 (23 obs)
  Primary splits:
      Weight < 3087.5 to the right, improve=0.5045118, (0 missing)

…(continued on next page)…
Node number 3: 15 observations
  mean=30.93333, MSE=12.46222

Node number 4: 22 observations
  mean=20.40909, MSE=2.78719

Node number 5: 23 observations,    complexity param=0.01282843
  mean=24.43478, MSE=5.115312
  left son=10 (15 obs) right son=11 (8 obs)
  Primary splits:
      Weight < 2747.5 to the right, improve=0.1476996, (0 missing)

Node number 10: 15 observations
  mean=23.8, MSE=4.026667

Node number 11: 8 observations
  mean=25.625, MSE=4.984375
Regression Tree Example (continued)
plotcp(z.auto)
z2.auto <- prune(z.auto, cp = 0.1)
post(z2.auto, file = "", cex = 1)
Complexity Parameter Plot
Pruned Regression Tree
Classification Methods Project the attribute space into decision regions  Decision trees: piecewise constant approximation
 Logistic regression: linear log-odds approximation
 Discriminant analysis and neural nets: linear & nonlinear separators
Density estimation coupled with a decision rule Define a metric space and decide based on proximity  One type of instance-based learning
 K-nearest neighbour methods – see the R sketch below
 Would like to drop noisy and unnecessary points
 Simple algorithm based on success rate confidence intervals available in Weka
 Compares naïve prediction with predictions using that instance
 Must choose suitable acceptance and rejection confidence levels
Many of these approaches can produce probability distributions as well as predictions  Depending on the application, this information may be useful
 Such as when results reported to expert (e.g. loan officer) as input to their decision
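A small illustration of the k-nearest-neighbour approach using R’s class package (listed earlier), here on the iris data:
library(class)
set.seed(1)
train <- sample(nrow(iris), 100)                           # random training/test split
pred  <- knn(train = iris[train, 1:4], test = iris[-train, 1:4],
             cl = iris$Species[train], k = 5, prob = TRUE) # 5 nearest neighbours
mean(pred != iris$Species[-train])                         # test error rate
# attr(pred, "prob") gives the proportion of neighbour votes for the winning class,
# a crude version of the probability output mentioned above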
Numeric Prediction Methods Linear regression Splines, including smoothing splines and multivariate adaptive regression splines (MARS) Generalised additive models (GAM) Locally weighted regression (lowess, loess) Artificial neural networks (ANNs)
Artificial Neural Networks (ANNs) An ANN is a network of many simple processors (or units), that are connected by communication channels that carry numeric data ANNs are very flexible, encompassing nonlinear regression models, discriminant models, and data reduction models  They do require some expertise to set up
 An appropriate architecture needs to be selected and tuned for each application
They can be useful tools for learning from examples to find patterns in data and predict outputs  However on their own, they tend to overfit the training data
 Metalearning tools are needed to choose the best fit
Various network architectures in common use  Multilayer perceptron (MLR)
 Radial basis functions (RBF)
 Selforganising maps (SOM)
ANNs have been applied to data editing and imputation, but not widely
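As a small illustration, a single-hidden-layer perceptron can be fitted with R’s nnet package (listed earlier); a sketch on the iris data, where size and decay are the tuning parameters that control flexibility:
library(nnet)
set.seed(1)
fit <- nnet(Species ~ ., data = iris, size = 3, decay = 0.01,
            maxit = 200, trace = FALSE)                    # 3 hidden units, weight decay 0.01
table(predict(fit, iris, type = "class"), iris$Species)    # confusion matrix on the training data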
Meta-Learning Methods – Bagging General methods for improving the performance of most learning algorithms Bootstrap aggregation, bagging for short  Select B bootstrap samples from the data
 Selected with replacement, same # of instances
 Can use parametric or nonparametric bootstrap
 Fit the model/learner on each bootstrap sample
 The bagged estimate is the average prediction from all these B models
E.g. for a tree learner, the bagged estimate is the average prediction from the resulting B trees Note that this is not a tree  In general, bagging a model or learner does not produce a model or learner of the same form
Bagging reduces the variance of unstable procedures like regression trees, and can greatly improve prediction accuracy  However it does not always work for poor 0-1 classifiers
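A rough sketch of bagging a classification tree in R (non-parametric bootstrap, rpart as the learner, kyphosis data):
library(rpart)
data(kyphosis)
set.seed(1)
B <- 50
trees <- lapply(1:B, function(b) {
  boot <- kyphosis[sample(nrow(kyphosis), replace = TRUE), ]   # bootstrap sample, same size
  rpart(Kyphosis ~ Age + Number + Start, data = boot)
})
# average the B predicted class probabilities, then classify by the larger average
probs  <- sapply(trees, function(t) predict(t, kyphosis)[, "present"])
bagged <- ifelse(rowMeans(probs) > 0.5, "present", "absent")
mean(bagged != kyphosis$Kyphosis)                              # training error of the bagged predictor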
Meta-Learning Methods – Boosting Boosting is a powerful technique for improving accuracy The “AdaBoost.M1” method (for classifiers):  Give each instance an initial weight of 1/n
 For m=1 to M:
 Fit model using the current weights, & store resulting model m
 If prediction error rate “err” is zero or >= 0.5, terminate loop.
 Otherwise calculate αm = log((1 − err)/err)
 This is the log odds of success
 Then adjust weights for incorrectly classified cases by multiplying them by exp(αm), and repeat
 Predict using a weighted majority vote: each model m votes for its prediction Gm(x) with weight αm, and the class with the largest total weight wins
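A hedged R sketch of these steps for the two-class case, using rpart stumps (maxdepth = 1) as the weak learner on the kyphosis data and a fixed maximum of 20 rounds:
library(rpart)
data(kyphosis)
n <- nrow(kyphosis); y <- kyphosis$Kyphosis
w <- rep(1/n, n); models <- list(); alpha <- numeric(0)
for (m in 1:20) {
  fit  <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, weights = w,
                control = rpart.control(maxdepth = 1))     # decision stump
  pred <- predict(fit, kyphosis, type = "class")
  err  <- sum(w[pred != y]) / sum(w)                       # weighted error rate
  if (err == 0 || err >= 0.5) break                        # terminate as described above
  a <- log((1 - err) / err)                                # alpha_m, the log odds of success
  w[pred != y] <- w[pred != y] * exp(a)                    # up-weight the misclassified cases
  models[[length(models) + 1]] <- fit; alpha <- c(alpha, a)
}
# weighted majority vote over the stored models
votes <- matrix(0, n, length(models))
for (m in seq_along(models))
  votes[, m] <- alpha[m] * (predict(models[[m]], kyphosis, type = "class") == "present")
boosted <- ifelse(rowSums(votes) > sum(alpha) / 2, "present", "absent")
mean(boosted != y)                                         # training error of the boosted classifier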
Meta-Learning Methods – Boosting For example, for the German credit dataset:  using 100 iterations of AdaBoost.M1 with the DecisionStump algorithm,
 10-fold cross-validation gives an error rate of 24.9% (compared to 26.1% for J4.8)
Association Rules Data on n purchase baskets in form (id, item1, item2, …, itemk)  For example, purchases from a supermarket
Association rules are statements of the form:  “When people buy tea, they also often buy coffee.”
May be useful for product placement decisions or cross-selling recommendations We say there is an association rule i1 -> i2 if  i1 and i2 occur together in at least s% of the n baskets (the support)
 And at least c% of the baskets containing item i1 also contain i2 (the confidence)
The confidence criterion ensures that “often” is a large enough proportion of the antecedent cases to be interesting The support criterion should be large enough that the resulting rules have practical importance  Also helps to ensure reliability of the conclusions
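A small R sketch of these two measures for a single candidate rule, on a toy 0/1 basket-by-item matrix (the six baskets are made up purely for illustration):
baskets <- matrix(c(1,1, 1,1, 1,0, 0,1, 1,1, 0,0), ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("tea", "coffee")))
support    <- mean(baskets[, "tea"] == 1 & baskets[, "coffee"] == 1)   # both items together
confidence <- support / mean(baskets[, "tea"] == 1)                    # of the tea baskets, share that also have coffee
c(support = support, confidence = confidence)   # rule tea -> coffee: support 0.5, confidence 0.75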
Association rules The support/confidence approach is widely used  Efficiently implemented in the Apriori algorithm
 First identify item sets with sufficient support
 Then turn each item set into sets of rules with sufficient confidence
This method was originally developed in the database community, so there has been a focus on efficient methods for large databases  “Large” means up to around 100 million instances, and about ten thousand binary attributes
However this approach can find a vast number of rules, and it can be difficult to make sense of these One useful extension is to identify only the rules with high enough lift (or odds ratio)
Classification vs Association Rules Classification rules predict the value of a prespecified attribute, e.g.  If outlook=sunny and humidity=high then play=no
Association rules predict the value of an arbitrary attribute (or combination of attributes)  E.g. If temperature=cool then humidity=normal
 If humidity=normal and play=no then windy=true
 If temperature=high and humidity=high then play=no
Clustering – EM Algorithm Assume that the data is from a mixture of normal distributions  I.e. one normal component for each cluster
For simplicity, consider one attribute x and two components or clusters  Model has five parameters: (p, μ1, σ1, μ2, σ2) = θ
Log-likelihood: l(θ) = Σi log[ p φ(xi; μ1, σ1) + (1 − p) φ(xi; μ2, σ2) ], where φ is the normal density This is hard to maximise directly  Use the expectation-maximisation (EM) algorithm instead
Clustering – EM Algorithm Think of data as being augmented by a latent 0/1 variable di indicating membership of cluster 1 If the values of this variable were known, the log-likelihood would be: Σi [ di log(p φ(xi; μ1, σ1)) + (1 − di) log((1 − p) φ(xi; μ2, σ2)) ] Starting with initial values for the parameters, calculate the expected value of di (the E-step) Then substitute this into the above log-likelihood and maximise to obtain new parameter values (the M-step)  This will have increased the log-likelihood
Repeat until the log-likelihood converges
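A compact R sketch of this two-component iteration (the data are simulated for illustration, and a fixed number of iterations is used instead of a convergence check):
set.seed(1)
x <- c(rnorm(70, 0, 1), rnorm(30, 4, 1))          # simulated data from two normal clusters
p <- 0.5; mu <- c(-1, 1); sg <- c(1, 1)           # crude starting values for (p, mu1, sg1, mu2, sg2)
for (it in 1:100) {
  # E-step: expected value of di (membership of cluster 1) for each case
  d1 <- p * dnorm(x, mu[1], sg[1])
  d2 <- (1 - p) * dnorm(x, mu[2], sg[2])
  e  <- d1 / (d1 + d2)
  # M-step: weighted estimates of the five parameters
  p  <- mean(e)
  mu <- c(weighted.mean(x, e), weighted.mean(x, 1 - e))
  sg <- c(sqrt(weighted.mean((x - mu[1])^2, e)),
          sqrt(weighted.mean((x - mu[2])^2, 1 - e)))
}
round(c(p = p, mu1 = mu[1], sg1 = sg[1], mu2 = mu[2], sg2 = sg[2]), 2)
# e now holds each case's estimated probability of belonging to cluster 1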
Clustering – EM Algorithm Resulting estimates may only be a local maximum  Run several times with different starting points to find global maximum (hopefully)
With parameter estimates, can calculate segment membership probabilities for each case
Clustering – EM Algorithm Extending to more latent classes is easy  Information criteria such as AIC and BIC are often used to decide how many are appropriate
Extending to multiple attributes is easy if we assume they are independent, at least conditioning on segment membership  It is possible to introduce associations, but this can rapidly increase the number of parameters required
Nominal attributes can be accommodated by allowing different discrete distributions in each latent class, and assuming conditional independence between attributes Can extend this approach to handle joint clustering and prediction models, as mentioned in the MVA lectures
Clustering – Scalability Issues k-means algorithm is also widely used However this and the EM algorithm are slow on large databases So is hierarchical clustering – requires O(n²) time Iterative clustering methods require full DB scan at each iteration Scalable clustering algorithms are an area of active research A few recent algorithms:  Distance-based/k-means
 Multi-resolution kd-tree for k-means [PM99]
 CLIQUE [AGGR98]
 Scalable k-means [BFR98a]
 CLARANS [NH94]
 Probabilistic/EM
 Multi-resolution kd-tree for EM [Moore99]
 Scalable EM [BFR98b]
 CF Kernel Density Estimation [ZRL99]
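For comparison with these scalable variants, ordinary k-means is a one-liner in R:
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)   # 3 clusters, 10 random restarts
table(km$cluster, iris$Species)                       # compare clusters with the known species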
Ethics of Data Mining Data mining and data warehousing raise ethical and legal issues Combining information via data warehousing could violate Privacy Act  Must tell people how their information will be used when the data is obtained
Data mining raises ethical issues mainly during application of results  E.g. using ethnicity as a factor in loan approval decisions
 E.g. screening job applications based on age or sex (where not directly relevant)
 E.g. declining insurance coverage based on neighbourhood if this is related to race (“redlining” is illegal in much of the US)
Whether something is ethical depends on the application  E.g. probably ethical to use ethnicity to diagnose and choose treatments for a medical problem, but not to decline medical insurance