- Define data mining
- Data mining vs. databases
- Basic data mining tasks
- Data mining development
- Data mining issues
Data is produced at a phenomenal rate.
Users expect more sophisticated information.
How?
Objective: Fit data to a model.
Potential result: higher-level meta information that may not be obvious when looking at raw data.
Similar terms:
- Exploratory data analysis
- Data driven discovery
- Deductive learning
Objective: Fit data to a model.
Preferential questions:
- Which technique to choose?
- Association rule mining (ARM), classification, or clustering
- Answer: It depends on what you want to do with the data.
- Search Strategy – Technique to search the data
- Interface? Query Language?
- Efficiency
Query
Database
Classification maps data into predefined groups or classes.
- Supervised learning
- Pattern recognition
- Prediction
Regression maps a data item to a real-valued prediction variable.
Clustering groups similar data together into clusters.
- Unsupervised learning
- Segmentation
- Partitioning
Summarization maps data into subsets with associated simple descriptions.
- Characterization
- Generalization
Link analysis uncovers relationships among data.
- Affinity analysis
- Association rules
Sequential analysis determines sequential patterns.
Example: Stock market
- Predict future values
- Determine similar patterns over time
- Classify behavior
Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
Data mining: the use of algorithms to extract the information and patterns derived by the KDD process.
Data mining is the core of the knowledge discovery process.
Selection:
- Select log data (dates and locations) to use
Preprocessing:
Transformation:
- Sessionize (sort and group)
Data mining:
- Identify and count patterns
- Construct data structure
Interpretation/Evaluation:
- Identify and display frequently accessed sequences
Potential user applications:
- Cache prediction
- Personalization
Human interaction, overfitting, outliers, interpretation, visualization, large datasets
Multimedia data, missing data, irrelevant data, noisy data, changing data, integration, application
Privacy, profiling, unauthorized use
Usefulness, return on investment (ROI), accuracy, space/time
Scalability, real-world data, updates, ease of use
Statistical basics:
- Point estimation
- Models Based on Summarization
- Bayes Theorem
- Hypothesis Testing
- Regression and Correlation
Similarity Measures
Point estimate: an estimate of a population parameter.
May be made by calculating the parameter for a sample.
May be used to predict a value for missing data.
Ex:
- R contains 100 employees
- 99 have salary information
- Mean salary of these is $50,000
- Use $50,000 as value of remaining employee’s salary.
- Is this a good idea?
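The salary example can be sketched in a few lines of Python. This is only an illustration: the function name `impute_with_mean` and the individual salary values are assumptions (the slides give only the mean of the 99 known salaries), and `None` is used here to mark the missing value.

```python
def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    sample_mean = sum(known) / len(known)
    return [sample_mean if v is None else v for v in values]

# Hypothetical data: 99 known salaries averaging $50,000, one missing.
salaries = [50_000] * 99 + [None]
filled = impute_with_mean(salaries)
print(filled[-1])
```

The imputed value is only as good as the assumption that the missing employee resembles the other 99.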
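Bias: difference between the expected value of the estimate and the actual value:
$$\mathrm{Bias} = E[\hat{\theta}] - \theta$$
Mean squared error (MSE): expected value of the squared difference between the estimate and the actual value:
$$\mathrm{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$$
Why square?
Root mean square error (RMSE):
$$\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})}$$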
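As a rough sketch, the three quantities can be approximated from a set of repeated estimates by replacing the expected value with a sample average. The function names and the numbers below are assumptions made for illustration, not part of the original slides.

```python
import math

def bias(estimates, actual):
    """Average estimate minus the actual value (sample stand-in for E[estimate] - actual)."""
    return sum(estimates) / len(estimates) - actual

def mse(estimates, actual):
    """Average squared difference between each estimate and the actual value."""
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)

def rmse(estimates, actual):
    """Square root of the MSE, expressed in the same units as the data."""
    return math.sqrt(mse(estimates, actual))

# Hypothetical estimates of a parameter whose actual value is 50.
estimates = [48, 52, 50, 54]
b = bias(estimates, 50)
m = mse(estimates, 50)
r = rmse(estimates, 50)
print(b, m, r)
```

Note that the errors -2 and +2 cancel in the bias but both contribute to the MSE.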
Jackknife estimate: an estimate of a parameter obtained by omitting one value from the set of observed values.
Let $\hat{\theta}$ be an estimate on the entire population.
Let $\hat{\theta}_{(j)}$ be an estimator of the same form with observation j deleted.
Allows you to examine the impact of outliers!
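A minimal sketch of the leave-one-out idea, assuming the estimator is the sample mean and that the overall jackknife estimate is taken as the average of the leave-one-out estimates. The helper names and the salary figures are hypothetical.

```python
from statistics import mean

def leave_one_out_estimates(values, estimator=mean):
    """Return the estimate computed with observation j deleted, for every j."""
    return [estimator(values[:j] + values[j + 1:]) for j in range(len(values))]

def jackknife_estimate(values, estimator=mean):
    """Average the leave-one-out estimates."""
    loo = leave_one_out_estimates(values, estimator)
    return sum(loo) / len(loo)

# Hypothetical salaries in $1000s; the last value is an outlier.
salaries = [48, 50, 52, 50, 200]
loo = leave_one_out_estimates(salaries)
jk = jackknife_estimate(salaries)
print(loo, jk)
```

The estimate obtained with the outlier deleted (50) stands well apart from the other four (around 87 to 88), which is how the jackknife exposes an outlier's influence.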
Maximum likelihood estimate (MLE): obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
The joint probability of observing the sample data is obtained by multiplying the individual probabilities.
Likelihood function:
$$L(\Theta \mid x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$$
Maximize L.
Coin toss five times: {H,H,H,H,T}.
Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:
$$L = (0.5)^5 \approx 0.03$$
However, if the probability of an H is 0.8, then:
$$L = (0.8)^4 (0.2) \approx 0.08$$
General likelihood formula, with $x_i = 1$ for H and $x_i = 0$ for T:
$$L(p \mid x_1, \dots, x_5) = \prod_{i=1}^{5} p^{x_i} (1 - p)^{1 - x_i}$$
The estimate for p is then 4/5 = 0.8.
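The coin-toss numbers can be checked with a short Python sketch. The function name `likelihood` and the coarse grid search over candidate values of p are assumptions for illustration; the grid search is not how the estimate is derived analytically.

```python
def likelihood(p, tosses):
    """Multiply the individual probabilities: p for each H, (1 - p) for each T."""
    result = 1.0
    for toss in tosses:
        result *= p if toss == "H" else (1 - p)
    return result

tosses = ["H", "H", "H", "H", "T"]

fair = likelihood(0.5, tosses)    # perfect coin
biased = likelihood(0.8, tosses)  # probability of H is 0.8

# Search a grid of candidate values 0.00, 0.01, ..., 1.00 for the maximizer.
candidates = [i / 100 for i in range(101)]
best_p = max(candidates, key=lambda p: likelihood(p, tosses))
print(fair, biased, best_p)
```

On this grid the maximizer coincides with the fraction of heads, 4/5.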
Expectation-Maximization (EM): solves estimation with incomplete data.
Obtain initial estimates for the parameters.
Iteratively use the estimates for the missing data and continue until convergence.
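A toy sketch of this iteration, assuming the parameter being estimated is a mean and that each missing value is filled in with the current estimate. The function name `em_mean`, the data, the initial guess, and the tolerance are all hypothetical.

```python
def em_mean(observed, n_missing, initial_guess, tol=1e-6, max_iter=1000):
    """Iteratively estimate a mean when n_missing values are unobserved."""
    n = len(observed) + n_missing
    mu = initial_guess
    for _ in range(max_iter):
        # Fill each missing value with the current estimate, then recompute the mean.
        new_mu = (sum(observed) + n_missing * mu) / n
        if abs(new_mu - mu) < tol:  # stop once the estimate no longer changes
            return new_mu
        mu = new_mu
    return mu

# Hypothetical data: four observed values, two missing, initial guess of 3.
estimate = em_mean([1, 5, 10, 4], n_missing=2, initial_guess=3.0)
print(round(estimate, 4))
```

In this simple setting the iteration converges to the mean of the observed values; the sketch is meant to show the estimate/re-estimate loop, not a general EM implementation.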
Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police.
Assign twelve data values for all combinations of credit and income:
From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.
Training Data:
Calculate P(xi|hj) and P(xi).
Ex: P(x7|h1) = 2/6; P(x4|h1) = 1/6; P(x2|h1) = 2/6; P(x8|h1) = 1/6; P(xi|h1) = 0 for all other xi.
Predict the class for x4:
- Calculate P(hj|x4) for all hj.
- Place x4 in the class with the largest value.
- Ex: P(h1|x4) = P(x4|h1) P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1, so x4 is placed in class h1.
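The classification of x4 can be reproduced with a small Python sketch. The priors and P(x4|h1) come from the example above; the likelihoods of x4 under h2, h3, and h4 are not shown in the surviving text and are set to 0 here as an assumption, which is what a posterior of 1 for h1 implies. The function name `posteriors` is hypothetical.

```python
from fractions import Fraction

def posteriors(likelihoods, priors):
    """Apply Bayes' theorem: P(h|x) = P(x|h) P(h) / P(x) for every hypothesis h."""
    evidence = sum(likelihoods[h] * priors[h] for h in priors)  # P(x)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

priors = {
    "h1": Fraction(6, 10),
    "h2": Fraction(2, 10),
    "h3": Fraction(1, 10),
    "h4": Fraction(1, 10),
}
# P(x4|hj): only the h1 value is given in the text; the zeros are assumed.
likelihood_x4 = {
    "h1": Fraction(1, 6),
    "h2": Fraction(0),
    "h3": Fraction(0),
    "h4": Fraction(0),
}

post = posteriors(likelihood_x4, priors)
best = max(post, key=post.get)  # class with the largest posterior
print(post["h1"], best)
```

Exact fractions are used so that (1/6)(0.6)/0.1 evaluates to exactly 1 rather than a floating-point approximation.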
Chi-squared:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
- O: observed value
- E: expected value based on the hypothesis
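A minimal sketch of the chi-squared statistic, assuming the standard form in which each squared difference between observed and expected value is divided by the expected value and the terms are summed. The function name and the sample values are hypothetical.

```python
def chi_squared(observed, expected):
    """Sum (O - E)^2 / E over paired observed and expected values."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: five observed scores against an expected value of 75 each.
observed = [50, 93, 67, 78, 87]
expected = [75] * 5
stat = chi_squared(observed, expected)
print(round(stat, 2))
```

The resulting statistic would then be compared with a chi-squared table value for the chosen significance level and degrees of freedom, a step not covered by this sketch.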
Regression:
- Predict future values based on past values
- Linear regression assumes a linear relationship exists:
$$y = c_0 + c_1 x_1 + \dots + c_n x_n$$
- Find values of $c_0, \dots, c_n$ that best fit the data
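For the single-predictor case y = c0 + c1 x, the best-fitting coefficients can be sketched with ordinary least squares. Least squares is assumed here as the fitting criterion (the slides say only "best fit"), and the function name and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x; returns (c0, c1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx              # slope
    c0 = mean_y - c1 * mean_x   # intercept
    return c0, c1

# Hypothetical past values lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
c0, c1 = fit_line(xs, ys)
prediction = c0 + c1 * 5  # predict a future value at x = 5
print(c0, c1, prediction)
```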
Correlation
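The slide content for correlation did not survive extraction. As an assumption, the sketch below uses the Pearson correlation coefficient r, a common measure of how similarly two variables behave: values near 1 indicate strong positive correlation, values near -1 strong negative correlation, and values near 0 little linear relationship. The function name and data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one pair rises together, the other moves in opposite directions.
r_pos = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
r_neg = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])
print(r_pos, r_neg)
```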
Determine similarity between two objects.
Similarity characteristics:
Alternatively, distance measures capture how unlike or dissimilar objects are.
Measure dissimilarity between objects
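The specific measures are not listed in the surviving text. As an illustration only, the sketch below implements four widely used ones, chosen here as assumptions: Jaccard and cosine similarity, and Euclidean and Manhattan distance. All function names and inputs are hypothetical.

```python
import math

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)

def cosine_similarity(u, v):
    """Dot product of two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))

jac = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})
cos = cosine_similarity([1, 0, 1], [1, 1, 0])
euc = euclidean_distance([0, 0], [3, 4])
man = manhattan_distance([0, 0], [3, 4])
print(jac, cos, euc, man)
```

Similarity measures grow as objects become more alike, while distance measures grow as they become less alike, so the two families are used in opposite directions.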
Information Retrieval (IR): retrieving desired information from textual data.
- Library science
- Digital libraries
- Web search engines
- Traditionally keyword based
Sample query:
- Find all documents about “data mining”.
DM: Similarity measures; Mine text/Web data.
Similarity: a measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
Metrics:
- Precision = |Relevant and Retrieved| / |Retrieved|
- Recall = |Relevant and Retrieved| / |Relevant|
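Both metrics reduce to set operations, as the following sketch shows. The function names and document identifiers are hypothetical.

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant)

# Hypothetical query result: 4 relevant documents exist, 3 were retrieved, 2 overlap.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p = precision(relevant, retrieved)
r = recall(relevant, retrieved)
print(p, r)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).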