Data mining and its application and usage in medicine





Data mining and its application and usage in medicine

  • By Radhika


Data Mining and Medicine

  • History

    • Past 20 years with relational databases
      • More dimensions to database queries
    • Earliest and most successful area of data mining
    • Mid-1800s: London was hit by an infectious disease outbreak
      • Two competing theories
        • Miasma theory: disease propagated by bad air
        • Germ theory: disease was water-borne
      • Advantages and caveats of the data-driven approach
        • Discovers trends even when we don’t understand the reasons
        • Caveat: may also discover irrelevant patterns that confuse rather than enlighten
        • Protects against unaided human inference of patterns: provides quantifiable measures and aids human judgment
    • Data Mining
      • Patterns must be persistent and meaningful
      • Also called Knowledge Discovery in Databases (KDD)


The future of data mining

  • 10 biggest killers in the US

  • Data mining = the process of discovering interesting, meaningful, and actionable patterns hidden in large amounts of data



Major Issues in Medical Data Mining

  • Heterogeneity of medical data

    • Volume and complexity
    • Physician’s interpretation
    • Poor mathematical categorization
    • Canonical Form
    • Solution: standard vocabularies, interfaces between different data sources for integration, and design of electronic patient records
  • Ethical, Legal and Social Issues

    • Data Ownership
    • Lawsuits
    • Privacy and Security of Human Data
    • Expected benefits
    • Administrative Issues


Why Data Preprocessing?

  • Patient records consist of clinical and lab parameters and results of particular investigations, specific to the task at hand

    • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • Noisy: containing errors or outliers
    • Inconsistent: containing discrepancies in codes or names
    • Temporal parameters for chronic diseases
  • No quality data, no quality mining results!

    • Data warehouse needs consistent integration of quality data
    • In the medical domain, handling incomplete, inconsistent, or noisy data requires people with domain knowledge (a cleaning sketch follows below)
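As a concrete illustration of the cleaning step, here is a minimal pandas sketch, assuming the Pima diabetes data used later in this deck is available as a local CSV; the file name and column names are placeholders, not from the original slides.

```python
import numpy as np
import pandas as pd

# Hypothetical file name; the Pima Indians diabetes data is assumed to be a
# local CSV with one row per patient and the usual attribute columns.
df = pd.read_csv("pima_diabetes.csv")

# In this dataset, zeros in several clinical columns actually denote missing
# measurements (incomplete data), so mark them as missing values.
clinical_cols = ["glucose", "blood_pressure", "skin_thickness", "insulin", "bmi"]
df[clinical_cols] = df[clinical_cols].replace(0, np.nan)

# Drop records that still lack values; starting from the original 768 records,
# this kind of cleaning leaves only the complete ones.
clean = df.dropna()
print(len(df), "records before cleaning,", len(clean), "after")
```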


What is Data Mining? The KDD Process



From Tables and Spreadsheets to Data Cubes

  • A data warehouse is based on a multidimensional data model that views data in the form of a data cube

  • A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

    • Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
    • Fact table contains measures (such as dollars_sold) and keys to each of related dimension tables
  • W. H. Inmon:“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”
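To make the cube concrete, here is a small pandas sketch of the same idea: a fact table with a dollars_sold measure and item/time dimension attributes, pivoted so the measure can be viewed along both dimensions. The column names and values are illustrative only, not taken from a real warehouse.

```python
import pandas as pd

# Toy fact table: one row per sale, with dimension attributes and a measure.
fact = pd.DataFrame({
    "item_type":    ["drug", "drug", "device", "device", "drug"],
    "quarter":      ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "dollars_sold": [120, 150, 300, 280, 90],
})

# View the measure along the item and time dimensions (a 2-D view of a cube).
cube = fact.pivot_table(values="dollars_sold", index="item_type",
                        columns="quarter", aggfunc="sum")
print(cube)
```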



Data Warehouse vs. Heterogeneous DBMS

  • Data warehouse: update-driven, high performance

    • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
    • Do not contain most current information
    • Query processing does not interfere with processing at local sources
    • Store and integrate historical information
    • Support complex multidimensional queries


Data Warehouse vs. Operational DBMS

  • OLTP (on-line transaction processing)

    • Major task of traditional relational DBMS
    • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
  • OLAP (on-line analytical processing)

  • Distinct features (OLTP vs. OLAP):

    • User and system orientation: customer vs. market
    • Data contents: current, detailed vs. historical, consolidated
    • Database design: ER + application vs. star + subject
    • View: current, local vs. evolutionary, integrated
    • Access patterns: update vs. read-only but complex queries




Why Separate Data Warehouse?

  • High performance for both systems

    • DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
    • Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
  • Different functions and different data:

    • Missing data: Decision support requires historical data which operational DBs do not typically maintain
    • Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
    • Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled






Typical OLAP Operations

  • Roll up (drill-up): summarize data

    • by climbing up hierarchy or by dimension reduction
  • Drill down (roll down): reverse of roll-up

    • from higher level summary to lower level summary or detailed data, or introducing new dimensions
  • Slice and dice:

    • project and select
  • Pivot (rotate):

    • reorient the cube, visualization, 3D to series of 2D planes.
  • Other operations

    • drill across: involving (across) more than one fact table
    • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
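The roll-up and slice operations can be sketched with plain pandas group-bys; this toy example assumes an illustrative fact table with city, item, and quarter dimensions and a dollars_sold measure (all placeholder names and values).

```python
import pandas as pd

sales = pd.DataFrame({
    "city":         ["Boston", "Boston", "Chicago", "Chicago"],
    "item_type":    ["drug", "device", "drug", "device"],
    "quarter":      ["Q1", "Q1", "Q1", "Q2"],
    "dollars_sold": [120, 300, 90, 280],
})

# Roll-up: summarize at a coarser level by dropping the city dimension.
rollup = sales.groupby("item_type")["dollars_sold"].sum()

# Slice: select a single value on one dimension (quarter = Q1),
# leaving a sub-cube over the remaining dimensions.
slice_q1 = sales[sales["quarter"] == "Q1"]

print(rollup)
print(slice_q1)
```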








Steps of a KDD Process

  • Learning the application domain:

    • relevant prior knowledge and goals of application
  • Creating a target data set: data selection

  • Data cleaning and preprocessing: (may take 60% of effort!)

  • Data reduction and transformation:

    • Find useful features, dimensionality/variable reduction, invariant representation.
  • Choosing functions of data mining

    • summarization, classification, regression, association, clustering.
  • Choosing the mining algorithm(s)

  • Data mining: search for patterns of interest

  • Pattern evaluation and knowledge presentation

    • visualization, transformation, removing redundant patterns, etc.
  • Use of discovered knowledge



Common Techniques in Data Mining

  • Predictive Data Mining

    • Most important
    • Classification: Relate one set of variables in data to response variables
    • Regression: estimate some continuous value
  • Descriptive Data Mining

    • Clustering: Discovering groups of similar instances
    • Association rule extraction
      • Associations between variables/observations
    • Summarization of group descriptions


Leukemia

  • Different types of cells look very similar

  • Given a number of samples (patients)

    • can we diagnose the disease accurately?
    • Predict the outcome of treatment?
    • Recommend the best treatment based on previous treatments?
  • Solution: Data mining on micro-array data

  • 38 training patients, 34 testing patients, ~7,000 attributes per patient

  • 2 classes: Acute Lymphoblastic Leukemia(ALL) vs Acute Myeloid Leukemia (AML)



Clustering/Instance Based Learning

  • Uses specific instances to perform classification rather than general IF-THEN rules

  • Nearest Neighbor classifier

  • Among the most studied algorithms for medical purposes

  • Clustering– Partitioning a data set into several groups (clusters) such that

    • Homogeneity: Objects belonging to the same cluster are similar to each other
    • Separation: Objects belonging to different clusters are dissimilar to each other. 
  • Three elements

    • The set of objects
    • The set of attributes
    • Distance measure
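A minimal nearest-neighbor classifier along these lines, using scikit-learn; the two predictor columns (glucose, BMI) echo the Pima attributes used later, but the array values and the query point are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training instances: [glucose, BMI] and a diabetes label (1 = diabetic).
X_train = np.array([[85, 26.6], [183, 23.3], [89, 28.1], [166, 25.8], [137, 43.1]])
y_train = np.array([0, 1, 0, 1, 1])

# Classify a new patient from its nearest stored instances (k = 3) using
# Euclidean distance, rather than from general IF-THEN rules.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict([[120, 30.0]]))
```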


Measure the Dissimilarity of Objects

  • Find best matching instance

  • Distance function

    • Measure the dissimilarity between a pair of data objects
  • Things to consider

    • Usually very different for interval-scaled, boolean, nominal, ordinal and ratio-scaled variables
    • Weights should be associated with different variables based on the application and data semantics
  • Quality of a clustering result depends on both the distance measure adopted and its implementation



Minkowski Distance

  • Minkowski distance: a generalization

  • If q = 2, d is Euclidean distance

  • If q = 1, d is Manhattan distance
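For reference, the Minkowski distance between two objects x = (x1, …, xp) and y = (y1, …, yp) is:

```latex
d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^{q} \right)^{1/q}
% q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
```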



Binary Variables

  • A contingency table for binary data

  • Simple matching coefficient
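In the usual contingency-table notation, with q the number of variables equal to 1 in both objects, t the number equal to 0 in both, and r, s the mismatches, the simple matching dissimilarity between objects i and j is:

```latex
d(i, j) = \frac{r + s}{q + r + s + t}
% Counts both 1-1 and 0-0 agreements; the Jaccard variant, used for
% asymmetric binary variables, drops t: d(i, j) = (r + s) / (q + r + s).
```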



Dissimilarity between Binary Variables

  • Example



k-Means Clustering Algorithm

  • Initialization

    • Arbitrarily choose k objects as the initial cluster centers (centroids)
  • Iteration until no change

    • For each object Oi
      • Calculate the distances between Oi and the k centroids
      • (Re)assign Oi to the cluster whose centroid is the closest to Oi
    • Update the cluster centroids based on current assignment
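A short numpy sketch of the centroid-update loop described above; the data array and the choice of k are placeholders.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: arbitrarily choose k objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # For each object, compute distances to the k centroids and
        # (re)assign it to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid from the objects currently assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5], [0.5, 1.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```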


k-Means Clustering Method



Dataset

  • Data set from UCI repository

  • http://kdd.ics.uci.edu/

  • 768 female Pima Indians evaluated for diabetes

  • After data cleaning, 392 data entries remain



Hierarchical Clustering

  • Groups observations based on dissimilarity

  • Compacts database into “labels” that represent the observations

  • Measure of similarity/Dissimilarity

    • Euclidean Distance
    • Manhattan Distance
  • Types of Linkage

    • Single Link
    • Average Link
    • Complete Link
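A hedged sketch of the three linkage types using SciPy; the observations are placeholder rows rather than the actual Pima records, and Euclidean distance is used as the dissimilarity measure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy observations (one row per record).
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.5], [8.3, 8.1], [4.0, 5.0]])

# Single, average, and complete link agglomerative clusterings.
for method in ("single", "average", "complete"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)

# dendrogram(Z) would draw the tree for the last linkage computed.
```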


Hierarchical Clustering: Comparison



Compare Dendrograms



Which Distance Measure is Better?

  • Each method has both advantages and disadvantages; application-dependent

  • Single-link

    • Can find irregular-shaped clusters
    • Sensitive to outliers
  • Complete-link, Average-link, and Centroid distance

    • Robust to outliers
    • Tend to break large clusters
    • Prefer spherical clusters


Dendrogram from dataset: single link

  • Minimum spanning tree through the observations

  • The single observation that is last to join a cluster is a patient whose blood pressure and skin thickness are in the bottom quartile and whose BMI is in the bottom half

  • Her insulin, however, was the largest, and she is a 59-year-old diabetic



Dendrogram from dataset: complete link

  • Maximum dissimilarity between observations in one cluster when compared to another



Dendrogram from dataset: average link

  • Average dissimilarity between observations in one cluster when compared to another



Supervised versus Unsupervised Learning

  • Supervised learning (classification)

    • Supervision: Training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    • New data is classified based on training set
  • Unsupervised learning (clustering)

    • Class labels of training data are unknown
    • Given a set of measurements, observations, etc., need to establish existence of classes or clusters in data


Classification and Prediction

  • Derive models that use patient-specific information to aid clinical decision making

  • A priori decision on the predictors and the variables to predict

  • No method to find predictors that are not present in the data

  • Numeric Response

    • Least Squares Regression
  • Categorical Response

    • Classification trees
    • Neural Networks
    • Support Vector Machine
  • Decision models

    • Prognosis, Diagnosis and treatment planning
    • Embed in clinical information systems


Least Squares Regression

  • Find a linear function of the predictor variables that minimizes the sum of squared differences from the response

  • Supervised learning technique

  • Predict insulin in our dataset from glucose and BMI (see the sketch below)
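A minimal least-squares sketch for the prediction mentioned above, with made-up glucose/BMI/insulin values standing in for the Pima columns.

```python
import numpy as np

# Toy predictor matrix [glucose, BMI] and response (two-hour serum insulin).
X = np.array([[95, 25.0], [140, 33.0], [180, 40.0], [110, 28.0], [160, 36.0]])
y = np.array([80.0, 180.0, 300.0, 110.0, 240.0])

# Add an intercept column and find the coefficients that minimize the
# sum of squared differences between X1 @ beta and the response y.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("intercept, glucose and BMI coefficients:", beta)
print("predicted insulin for glucose=130, BMI=30:", np.array([1.0, 130.0, 30.0]) @ beta)
```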



Decision Trees

  • Decision tree

    • Each internal node tests an attribute
    • Each branch corresponds to attribute value
    • Each leaf node assigns a classification
  • ID3 algorithm

    • Based on training objects with known class labels to classify testing objects
    • Rank attributes with information gain measure
    • Minimal height
      • least number of tests to classify an object
    • Used in commercial tools, e.g., Clementine
    • ASSISTANT
      • Deal with medical datasets
      • Incomplete data
      • Discretize continuous variables
      • Prune unreliable parts of tree
      • Classify data
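An ID3-style tree can be sketched with scikit-learn's entropy criterion (scikit-learn's trees are CART-based rather than ID3 proper, so this is only an approximation of the algorithm above); the features and labels are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training objects with known class labels (1 = disease present).
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

# Split on the attribute that maximizes information gain (entropy criterion).
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)

print(export_text(tree, feature_names=["symptom_a", "symptom_b"]))
print(tree.predict([[1, 0]]))  # classify a new, unseen object
```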


Decision Trees



Algorithm for Decision Tree Induction

  • Basic algorithm (a greedy algorithm)

    • Attributes are categorical (if continuous-valued, they are discretized in advance)
    • Tree is constructed in a top-down recursive divide-and-conquer manner
    • At start, all training examples are at the root
    • Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain)
    • Examples are partitioned recursively based on selected attributes


Training Dataset



Construction of A Decision Tree for “Condition X”



Entropy and Information Gain

  • S contains si tuples of class Ci for i = {1, ..., m}

  • Information measures info required to classify any arbitrary tuple

  • Entropy of attribute A with values {a1,a2,…,av}

  • Information gained by branching on attribute A
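The standard ID3 definitions, in the notation of this slide (s tuples in total, si of class Ci, and sij class-Ci tuples in the subset with A = aj), are:

```latex
% Expected information needed to classify a tuple in S:
I(s_1, \dots, s_m) = - \sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}

% Entropy of attribute A with values \{a_1, \dots, a_v\}:
E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s}\, I(s_{1j}, \dots, s_{mj})

% Information gained by branching on A:
\mathrm{Gain}(A) = I(s_1, \dots, s_m) - E(A)
```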



Entropy and Information Gain

  • Select attribute with the highest information gain (or greatest entropy reduction)

    • Such attribute minimizes information needed to classify samples


Rule Induction

  • IF conditions THEN Conclusion

  • E.g., CN2

    • Concept description:
      • Characterization: provides a concise summarization of a given collection of data
      • Comparison: provides descriptions comparing two or more collections of data
  • Training set, testing set

  • Rules may be imprecise

  • Predictive Accuracy

    • P / (P + N), i.e., positive examples covered out of all examples covered by the rule


Example used in a Clinic

  • Hip arthroplasty: the trauma surgeon predicts the patient’s long-term clinical status after surgery

  • Outcome evaluated during follow-ups for 2 years

  • 2 modeling techniques

  • Bayesian classifier

    • P(outcome=good) = 11/20 = 0.55
    • Probability gets updated as more attributes are considered
    • P(timing=good | outcome=good) = 9/11 ≈ 0.82
    • P(outcome=bad) = 9/20; P(timing=good | outcome=bad) = 5/9 (worked through in the sketch below)
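Working the posterior out from the numbers on this slide (a minimal sketch; the remaining attributes would be folded in the same way under the naive independence assumption):

```python
# Priors and likelihoods taken from the slide's counts.
p_good = 11 / 20              # P(outcome = good)
p_bad = 9 / 20                # P(outcome = bad)
p_timing_given_good = 9 / 11  # P(timing = good | outcome = good)
p_timing_given_bad = 5 / 9    # P(timing = good | outcome = bad)

# Bayes theorem: posterior probability of a good outcome given good timing.
evidence = p_timing_given_good * p_good + p_timing_given_bad * p_bad
posterior_good = p_timing_given_good * p_good / evidence
print(round(posterior_good, 3))  # (9/20) / (9/20 + 5/20) ≈ 0.643
```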




Bayesian Classification

  • Bayesian classifier vs. decision tree

    • Decision tree: predict the class label
    • Bayesian classifier: statistical classifier; predict class membership probabilities
  • Based on Bayes theorem; estimate posterior probability

  • Naïve Bayesian classifier:

    • Simple classifier that assumes attribute independence
    • High speed when applied to large databases
    • Comparable in performance to decision trees


Bayes Theorem

  • Let X be a data sample whose class label is unknown

  • Let Hi be the hypothesis that X belongs to a particular class Ci

  • P(Hi) is class prior probability that X belongs to a particular class Ci

    • Can be estimated by ni/n from training data samples
    • n is the total number of training data samples
    • ni is the number of training data samples of class Ci
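In symbols, the posterior probability from Bayes theorem, with the prior estimated as ni/n from the training samples:

```latex
P(H_i \mid X) = \frac{P(X \mid H_i)\, P(H_i)}{P(X)},
\qquad P(H_i) \approx \frac{n_i}{n}
```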


More classification Techniques

  • Neural Networks

    • Similar to pattern recognition properties of biological systems
    • Most frequently used
      • Multi-layer perceptrons
        • Input with bias, connected by weights to hidden, output
      • Backpropagation neural networks
  • Support Vector Machines

    • Separate the data into mutually exclusive regions
      • Transform to another problem space
      • Kernel functions (dot products)
      • Output for new points predicted by their position relative to the boundary
  • Comparison with classification trees

    • Not possible to know which features or combination of features most influence a prediction


Multilayer Perceptrons

  • Apply non-linear transfer functions to weighted sums of inputs

  • Werbos algorithm

    • Random weights
    • Training set, Testing set
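A tiny numpy sketch of the forward pass: non-linear transfer functions applied to weighted sums of the inputs (with bias), from input to hidden to output. The weights here are the random initial values the slide mentions, not yet adjusted by backpropagation, and the input values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # non-linear transfer function

x = np.array([120.0, 30.0])  # toy input: e.g. glucose, BMI

# Random initial weights and biases: input -> hidden (3 units) -> output (1 unit).
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

hidden = sigmoid(W1 @ x + b1)       # weighted sum plus bias, then transfer
output = sigmoid(W2 @ hidden + b2)  # output unit

print(output)  # backpropagation would adjust W1, b1, W2, b2 to fit training data
```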


Support Vector Machines

  • 3 steps

    • Support Vector creation
    • Maximal distance between points found
    • Perpendicular decision boundary
  • Allows some points to be misclassified

  • Pima Indian data with X1 (glucose) and X2 (BMI); see the sketch below
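A minimal scikit-learn sketch matching this setup: an SVM on two Pima features (glucose, BMI) with a soft margin, so some points may be misclassified. The numeric values are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Toy [glucose, BMI] rows with a diabetes label (1 = diabetic).
X = np.array([[85, 26.6], [183, 23.3], [89, 28.1],
              [166, 25.8], [137, 43.1], [100, 22.0]])
y = np.array([0, 1, 0, 1, 1, 0])

# Linear (dot-product) kernel; a finite C gives a soft margin, allowing
# some points to fall on the wrong side of the decision boundary.
svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)

print(svm.support_vectors_)        # the support vectors found
print(svm.predict([[120, 30.0]]))  # new point classified by its position
```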



What is Association Rule Mining?

  • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories



Association Rule Mining

  • Market Basket Analysis

    • Groups of items frequently bought together are placed together
    • Healthcare
      • Understanding associations among patients with demands for similar treatments and services
    • Goal: find items for which the joint probability of occurrence is high
    • Basket of binary-valued variables
    • Results take the form of association rules, augmented with support and confidence


Association Rule Mining



The Apriori Algorithm

  • Starts with the frequent 1-itemsets

  • Includes only those “items” that pass the support threshold

  • Uses the frequent 1-itemsets to generate candidate 2-itemsets, and so on

  • Stops when the threshold is not satisfied by any itemset

  • L1 = {frequent items};

  • for (k = 1; Lk != ∅; k++) do

    • Candidate Generation: Ck+1 = candidates generated from Lk;
    • Candidate Counting: for each transaction t in the database, increment the count of all candidates in Ck+1 that are contained in t
    • Lk+1 = candidates in Ck+1 with count ≥ min_sup
  • return ∪k Lk;



Apriori-based Mining



Principal Component Analysis

  • Principal Components

    • With a large number of variables, it is highly likely that some subsets of the variables are strongly correlated with each other; reduce the number of variables while retaining the variability in the dataset
    • Linear combinations of the variables in the database
      • Variance of each PC is maximized
        • Displays as much of the spread of the original data as possible
      • PCs are orthogonal to each other
        • Minimizes the overlap in the variables
      • Each component is normalized so that its sum of squares is unity
        • Easier for mathematical analysis
    • Number of PC < Number of variables
      • Associations found
      • Small number of PC explain large amount of variance
    • Example 768 female Pima Indians evaluated for diabetes
      • Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, Diastolic blood pressure, Triceps skin fold thickness, Two-hour serum insulin, BMI, Diabetes pedigree function, Age, Diabetes onset within last 5 years
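A hedged scikit-learn sketch of the reduction described above; the array stands in for standardized Pima attributes, so the rows and the subset of columns used are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy rows over a few of the Pima attributes: [pregnancies, glucose, bp, bmi, age]
X = np.array([[1, 85, 66, 26.6, 31],
              [8, 183, 64, 23.3, 32],
              [1, 89, 66, 28.1, 21],
              [0, 137, 40, 43.1, 33],
              [5, 116, 74, 25.6, 30]])

# Standardize, then take orthogonal linear combinations ordered by variance.
Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(Xs)

print(pca.explained_variance_ratio_)  # fraction of variance captured by each PC
print(pca.components_)                # unit-norm loadings (sum of squares = 1)
```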


PCA Example



National Cancer Institute

  • CancerNet http://www.nci.nih.gov

  • CancerNet for Patients and the Public

  • CancerNet for Health Professionals

  • CancerNet for Basic Researchers

  • CancerLit



Conclusion

  • About three-quarters of a billion people’s medical records are electronically available

  • Data mining in medicine is distinct from other fields due to the nature of the data: heterogeneous, with ethical, legal, and social constraints

  • The most commonly used technique is classification and prediction, with different methods applied for different cases

  • Association rules describe the data in the database

  • Despite the difficulty, medical data mining can be among the most rewarding



Thank you !!!



