Define data mining



Date: 08.10.2017




  • Define data mining

  • Data mining vs. databases

  • Basic data mining tasks

  • Data mining development

  • Data mining issues



Data is produced at a phenomenal rate




  • Objective: Fit data to a model

  • Potential Result: Higher-level meta information that may not be obvious when looking at raw data

  • Similar terms

    • Exploratory data analysis
    • Data driven discovery
    • Inductive learning



  • Objective: Fit Data to a Model

    • Descriptive
    • Predictive
  • Preferential Questions

    • Which technique to choose?
      • Association rule mining (ARM), classification, or clustering
      • Answer: depends on what you want to do with the data.
    • Search Strategy – Technique to search the data
      • Interface? Query Language?
      • Efficiency



  • Query

    • Well defined
    • SQL



  • Database

  • Data Mining






  • Classification maps data into predefined groups or classes

    • Supervised learning
    • Pattern recognition
    • Prediction
  • Regression maps a data item to a real-valued prediction variable.

  • Clustering groups similar data together into clusters.

    • Unsupervised learning
    • Segmentation
    • Partitioning



  • Summarization maps data into subsets with associated simple descriptions.

    • Characterization
    • Generalization
  • Link Analysis uncovers relationships among data.

    • Affinity Analysis
    • Association Rules
    • Sequential Analysis determines sequential patterns.



  • Example: Stock Market

  • Predict future values

  • Determine similar patterns over time

  • Classify behavior




  • Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.

  • Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.




  • Data mining: the core of the knowledge discovery process.



  • Selection:

    • Select log data (dates and locations) to use
  • Preprocessing:

  • Transformation:

    • Sessionize (sort and group)
  • Data Mining:

    • Identify and count patterns
    • Construct data structure
  • Interpretation/Evaluation:

    • Identify and display frequently accessed sequences.
  • Potential User Applications:

    • Cache prediction
    • Personalization
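
The transformation and data mining steps of this web-log example can be sketched as follows. This is a minimal illustration, not the method used in the course: the record layout `(user, timestamp, page)`, the 1800-second session timeout, and the toy `LOG` data are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp in seconds, page).
LOG = [
    ("u1", 0, "A"), ("u1", 10, "B"), ("u1", 20, "C"),
    ("u2", 5, "A"), ("u2", 15, "B"),
    ("u1", 5000, "A"), ("u1", 5010, "B"),
]

def sessionize(log, timeout=1800):
    """Sort and group clicks by user, then start a new session whenever
    the gap between consecutive clicks exceeds `timeout` seconds."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for clicks in by_user.values():
        current = [clicks[0][1]]
        for (prev_ts, _), (ts, page) in zip(clicks, clicks[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions

def count_sequences(sessions, length=2):
    """Count every contiguous page sequence of the given length."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts

sessions = sessionize(LOG)
frequent = count_sequences(sessions).most_common(1)
```

With the toy log, `frequent` holds the most frequently accessed two-page sequence, which is the kind of result the interpretation step would display.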





  • Human Interaction

  • Overfitting

  • Outliers

  • Interpretation

  • Visualization

  • Large Datasets

  • High Dimensionality




  • Multimedia Data

  • Missing Data

  • Irrelevant Data

  • Noisy Data

  • Changing Data

  • Integration

  • Application




  • Privacy

  • Profiling

  • Unauthorized use




  • Usefulness

  • Return on Investment (ROI)

  • Accuracy

  • Space/Time




  • Scalability

  • Real World Data

  • Updates

  • Ease of Use




  • Statistical Basics

    • Point Estimation
    • Models Based on Summarization
    • Bayes Theorem
    • Hypothesis Testing
    • Regression and Correlation
  • Similarity Measures




  • Point Estimate: estimate a population parameter.

  • May be made by calculating the parameter for a sample.

  • May be used to predict value for missing data.

  • Ex:

    • R contains 100 employees
    • 99 have salary information
    • Mean salary of these is $50,000
    • Use $50,000 as value of remaining employee’s salary.
    • Is this a good idea?
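
Using a sample mean as a stand-in for a missing value might look like the sketch below. The helper name `impute_with_mean` and the four-element salary list are hypothetical; they are not the 100-employee relation R from the example.

```python
from statistics import mean

def impute_with_mean(values):
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    estimate = mean(known)  # point estimate computed from the sample
    return [estimate if v is None else v for v in values]

# Hypothetical toy data: three known salaries, one missing.
salaries = [40_000, 50_000, 60_000, None]
filled = impute_with_mean(salaries)
```

The filled-in value ignores everything else known about the remaining employee, which is one reason to question whether this is a good idea.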


  • Bias: difference between the expected value of the estimate and the actual value: $\mathrm{Bias} = E[\hat{\theta}] - \theta$

  • Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$

  • Why square?

  • Root Mean Square Error (RMSE): $\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})}$
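
A small sketch of these error measures, computed empirically over a list of estimates against a known actual value. The data are hypothetical, and averaging over the list stands in for the expected value.

```python
import math

def bias(estimates, actual):
    """Average estimate minus the actual value."""
    return sum(estimates) / len(estimates) - actual

def mse(estimates, actual):
    """Average squared difference between estimate and actual value."""
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)

def rmse(estimates, actual):
    """Square root of the MSE, expressed in the units of the data."""
    return math.sqrt(mse(estimates, actual))

estimates = [9.0, 11.0, 10.0, 12.0]  # hypothetical repeated estimates
actual = 10.0

b = bias(estimates, actual)
m = mse(estimates, actual)
r = rmse(estimates, actual)
```

In this toy data the errors of -1 and +1 cancel in the bias but both contribute to the MSE, which is one common motivation for squaring.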



  • Jackknife Estimate: an estimate of a parameter obtained by omitting one value from the set of observed values.

  • Let $\hat{\theta}$ be the estimate computed from all observed values.

  • Let $\hat{\theta}_{(j)}$ be an estimator of the same form with observation j deleted.

  • Allows you to examine the impact of outliers.
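
The leave-one-out idea can be sketched as below, assuming the parameter of interest is the mean. The sample values are hypothetical and include one deliberate outlier.

```python
from statistics import mean

def jackknife_estimates(values, estimator=mean):
    """Return the estimate with observation j deleted, for every j."""
    return [estimator(values[:j] + values[j + 1:]) for j in range(len(values))]

values = [1, 2, 3, 4, 100]  # hypothetical sample with one outlier
full = mean(values)          # estimate from all observations
leave_one_out = jackknife_estimates(values)
```

Only the estimate that omits 100 moves far from the others, which is how the outlier's impact shows up.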



Maximum Likelihood Estimate (MLE)

  • Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.

  • The joint probability of observing the sample data is obtained by multiplying the individual probabilities. Likelihood function: $L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$

  • Maximize L.



  • Coin toss five times: {H,H,H,H,T}

  • Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is: $L = 0.5^5 \approx 0.03$

  • However, if the probability of an H is 0.8, then: $L = 0.8^4 \times 0.2 \approx 0.08$



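  • General likelihood formula, with $x_i = 1$ for H and $x_i = 0$ for T: $L(p \mid x_1, \ldots, x_5) = \prod_{i=1}^{5} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{5 - \sum x_i}$

  • Estimate for p is then 4/5 = 0.8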
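
The coin-toss example can be checked numerically with a short sketch. Encoding H as 1 and T as 0 and the grid search over p are illustrative choices, not part of the slides.

```python
def likelihood(p, tosses):
    """Probability of the observed sequence when P(H) = p.
    `tosses` uses 1 for H and 0 for T."""
    result = 1.0
    for x in tosses:
        result *= p if x == 1 else (1 - p)
    return result

tosses = [1, 1, 1, 1, 0]  # {H,H,H,H,T}

# Closed-form maximizer for this model: the fraction of heads.
p_hat = sum(tosses) / len(tosses)

# Coarse grid search over p as a sanity check on the closed form.
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=lambda p: likelihood(p, tosses))
```

Both `p_hat` and `p_grid` come out at 0.8, and `likelihood(0.8, tosses)` is larger than `likelihood(0.5, tosses)`.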



Expectation-Maximization (EM)

  • Solves estimation with incomplete data.

  • Obtain initial estimates for parameters.

  • Iteratively use estimates for missing data and continue until convergence.
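
The iteration described here can be sketched for the simplest case, estimating a mean when some values are missing. The function name `em_mean`, the data, the initial guess, and the tolerance are all assumptions made for illustration.

```python
def em_mean(observed, n_missing, initial_mean, tol=1e-6, max_iter=1000):
    """Estimate a mean when `n_missing` values are unobserved."""
    n = len(observed) + n_missing
    mu = initial_mean
    for _ in range(max_iter):
        # E-step: stand in for each missing value with the current estimate.
        expected_total = sum(observed) + n_missing * mu
        # M-step: re-estimate the mean from the completed data.
        new_mu = expected_total / n
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical data: four observed values, two missing, initial guess 3.
mu = em_mean([1, 5, 10, 4], n_missing=2, initial_mean=3)
```

In this toy setting the iteration converges to the mean of the observed values (5); the point of the sketch is the alternate-and-repeat shape of the procedure, not the result.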








  • Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police

  • Assign twelve data values for all combinations of credit and income:

  • From training data: P(h1) = 60%; P(h2)=20%; P(h3)=10%; P(h4)=10%.




  • Training Data:




  • Calculate P(xi|hj) and P(xi)

  • Ex: P(x7|h1)=2/6; P(x4|h1)=1/6; P(x2|h1)=2/6; P(x8|h1)=1/6; P(xi|h1)=0 for all other xi.

  • Predict the class for x4:

    • Calculate P(hj|x4) for all hj.
    • Place x4 in class with largest value.
    • Ex:
      • P(h1|x4) = P(x4|h1) P(h1) / P(x4)
      • = (1/6)(0.6)/0.1 = 1.
      • x4 is placed in class h1.
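
The calculation for x4 can be reproduced with a short sketch. The function names are illustrative. The zero likelihoods for h2 to h4 are not read from the training table; they follow from P(h1|x4) = 1 and are marked as such in the code.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(h|x) = P(x|h) * P(h) / P(x)."""
    return likelihood * prior / evidence

def classify(likelihoods, priors):
    """Pick the hypothesis with the largest P(x|h) * P(h).
    P(x) is the same for every hypothesis, so it can be skipped."""
    return max(priors, key=lambda h: likelihoods.get(h, 0.0) * priors[h])

# Priors from the training data.
priors = {"h1": 0.6, "h2": 0.2, "h3": 0.1, "h4": 0.1}

# P(x4|h1) = 1/6 from the example; the other hypotheses default to 0,
# an inference from the posterior of 1, not a value read from the table.
x4_likelihoods = {"h1": 1 / 6}

p_h1_given_x4 = posterior(x4_likelihoods["h1"], priors["h1"], 0.1)
predicted = classify(x4_likelihoods, priors)
```

`p_h1_given_x4` evaluates to 1 up to floating-point rounding, and `predicted` is `"h1"`.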



  • Chi-Squared: $\chi^2 = \sum (O - E)^2 / E$

    • O – observed value
    • E – expected value based on the hypothesis
  • Jackknife Estimate

    • estimate of parameter is obtained by omitting one value from the set of observed values.
  • Regression

    • Predict future values based on past values
    • Linear regression assumes a linear relationship exists:
      • y = c0 + c1 x1 + … + cn xn
      • Find values of c0, …, cn that best fit the data
  • Correlation
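
Two of the items on this slide can be sketched in a few lines: the chi-squared statistic, taken here as the sum of (O - E)² / E, and a least-squares fit for the single-predictor case y = c0 + c1 x. The function names and sample numbers are hypothetical.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def fit_line(xs, ys):
    """Least-squares estimates (c0, c1) for y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1

# Hypothetical observed counts against a uniform expectation.
stat = chi_squared([18, 22, 20], [20, 20, 20])

# Hypothetical points lying exactly on y = 1 + 2x.
c0, c1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

A larger `stat` means the observed counts sit further from what the hypothesis predicts; `fit_line` recovers the intercept and slope of the toy line.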




  • Determine similarity between two objects.

  • Similarity characteristics:

  • Alternatively, distance measures quantify how unlike or dissimilar objects are.






  • Measure dissimilarity between objects
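
The slides do not name specific measures at this point, so the sketch below uses three common choices as assumptions: Euclidean and Manhattan distance for dissimilarity, and cosine similarity for comparison.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
s_parallel = cosine_similarity((1, 2), (2, 4))
s_orthogonal = cosine_similarity((1, 0), (0, 1))
```

Distances grow as objects become less alike, while the similarity is highest for vectors pointing the same way and zero for orthogonal ones.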




  • Information Retrieval (IR): retrieving desired information from textual data.

  • Library Science

  • Digital Libraries

  • Web Search Engines

  • Traditionally keyword based

  • Sample query:

    • Find all documents about “data mining”.
  • Data mining (DM) contributions: similarity measures; mining text/Web data.




  • Similarity: measure of how close a query is to a document.

  • Documents which are “close enough” are retrieved.

  • Metrics:

    • Precision = |Relevant and Retrieved| / |Retrieved|
    • Recall = |Relevant and Retrieved| / |Relevant|
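
Both metrics are ratios over the same intersection, which a small sketch makes concrete. The document identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # |Relevant and Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: four documents retrieved, three relevant overall.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).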



