Influence Diagrams
The probability of an event occurring in a certain time period can depend not only on occurrences of previous and contemporaneous events, but also on prior or contemporaneous actions by one or more decisionmakers. More generally, the conditional probability distribution of a random variable in a period can be changed by actions as well as by values of random variables in that period or earlier. As mentioned in Chapter 1, influence diagrams (IDs) are generalizations of Bayesian networks that include actions (also called decision nodes) and value or utility nodes that evaluate the consequences of the actions. These additions seem to constitute a clear generalization of BNs. However, with a little ingenuity, the same algorithms used to find the most probable explanations (MPEs) in a BN can also be used to solve for optimal decisions in IDs, meaning decisions that maximize expected utility (Mauá, 2016). Every ID problem can be transformed to an equivalent BN inference problem by replacing the decision nodes with chance nodes having equal probabilities for all decisions and replacing utility nodes with binary (01) chance nodes with CPTs that assign a conditional probability for 1, given the values of the parents, proportional to the utility of that combination of values (Crowley, 2004). The decisions that best explain a utility value of 1 (in the sense of MPE) are then the optimal decisions. Although several algorithms have been developed specifically for inference and decision optimization in IDs (Shachter and Bhattacharjya, 2010), the fact that an ID decision optimization problem can be automatically mapped to an equivalent BN inference problem means that BN algorithms can be used to solve both.
Example: DecisionMaking with an Influence Diagram – Taking an Umbrella
To illustrate the use of ID software in solving a textbooktype decision analysis problem, consider a decisionmaker (d.m.) who must decide whether to take an umbrella to work on a day when the prior probability of rain is 0.40. The d.m.’s von NeumannMorgenstern utility function, scaled to run from 0 for the leastpreferred outcome to 100 for the mostpreferred outcome, is as follows: utility = 100 if the weather is sunshine and the d.m. does not take the umbrella; utility = 0 if there is rain and the d.m. does not take the umbrella; utility = 80 otherwise, i.e., if the d.m. takes the umbrella. (Thus, the utility function can be represented as utility = 100  20*TAKE  100*(1TAKE)*RAIN where TAKE is a binary (01) decision variable with value 1 if the d.m. takes the umbrella and value 0 otherwise; and RAIN is a binary random variable with value 1 if it rains and 0 otherwise.) The d.m. can also obtain a forecast that has an 80% probability of being correct, i.e., of predicting rain if it will rain and of predicting that it will not rain (“Sunshine”) if it will not rain; hence there is probability 20% that the forecast will falsely predict rain when it will not rain and a 20% probability that it will falsely predict sunshine when it will rain. Entering these numbers into the tables for the Forecast and Consequence_utility nodes in Figure 2.13a creates an ID model. Readers interested in following the details of building and using ID models should create this network in Netica. (The free “Limited” use mode suffices for this purpose.)
Fig. 2.13a An ID for the umbrella decision. With only prior information (P(Rain) = 0.40), it is optimal to take the umbrella
Note that Netica allows a distinct utility node (indicated by a hexagon) and decision nodes, which do not have probability bars. With these node types, creating an ID is as easy as creating a BN. The utility node, like the chance nodes we have already studied, has a table, but it shows the utility for each combination of values of its parents, rather than the conditional probability for each combination of values of its parents, as in the CPT of a chance node.
As soon as the ID in Figure 2.13a is completed and the Netica network is compiled, the decision node starts showing the expected utility for each alternative decision. Without forecast information, the expected utility of taking an umbrella is 80 (since we specified the utility to be 80 whether or not it rains if the d.m. takes an umbrella) and the expected utility of not taking it is 60 (the 0.60 probability of Sunshine times the utility of 100 plus the 0.40 probability of Raintimes the utility of 0). (The probability that the forecast will predict rain is 0.8*0.4 + 0.2*0.6 = 0.32 + 0.12 = 0.44, as shown in the Forecast node.) Thus, the optimal decision in the absence of forecast information is to take the umbrella. If forecast information is used, and the forecast predicts sunshine, then the optimal decision is to not take the umbrella, with an expected utility of about 85.71, as shown in Figure 2.13b. The expected utility increases from 80 in the absence of forecast information to 0.56*85.7142 + 0.44*80 = 83.2 with the forecast information. This is because there is a probability of 0.56 that the forecast will be for sunshine, in which case the decision changes to not taking the umbrella and expected utility increases from 80 to 85.7142. This increase in expected utility determines the value of information (VOI) from the forecast: the d.m. should be willing to pay up to the amount that would reduce expected utility from 83.2 to 80. If the d.m. is riskneutral and utilities are scaled to correspond to dollars, then the d.m. should be willing to pay up to $3.20 for the forecast information. If the forecast information costs more than that to acquire, the d.m. should forego it and simply take the umbrella. Such calculations of optimal decisions, expected utilities, and VOI are the staples of decision analysis. BN technology, extended to apply to IDs, makes them easy to carry out.
Fig. 2.13b With a forecast of sunshine, the optimal decision changes to not taking the umbrella
Decision Trees
Adding decisions and utilities to a BN makes it an ID. Adding decisions and utilities to an event tree makes it a decision tree. Utilities are placed at the tips of the tree, i.e., evaluating the outcomes. Decision nodes can be inserted into event sequence paths, with decisions being conditioned on the previous events and decisions in the paths leading to them from the root of the tree. Every decision tree can be mapped automatically to an equivalent ID, and ever ID can be automatically reexpressed as a decision tree; thus, IDs and decision trees are logically equivalent classes of models for representing decision problems, although either may have practical advantages over the other in representation, storage, and computational efficiency for specific problems (Shachter, 1986). Since every ID can be mapped to an equivalent BN inference problem (Crowley, 2004), decision trees can also be represented and solved using BNs.
Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs)
Markov decision processes (MDPs) and partially observable MDPs (POMDPs) can also be automatically mapped to influence diagrams or dynamic Bayesian networks in which the CPTs depend on the decisions made (Boutilier et al., 1995; Jonsson and Barto, 2007). Indeed, DBNs provide a basis for some of the most computationally efficient algorithms known for solving largescale MDPs and POMDPs, including hierarchical POMDPs with observed signals and transitions for highlevel states consisting of sequences of observations and transitions generated by lowerlevel POMDPs (Theocharous et al., 2004). A straightforward representation of a POMPD by an ID represents states, observations, and rewards in each period by chance nodes and represents the choice among acts in each period by a decision node. The DAG structure of the ID makes the reward in each period depend on the state and act in that period and, optionally, on the state in the next period; the conditional probability distribution for the next state depends on the current state and act; and the conditional probability distribution for observations in a period depend on the state in that period. The act in each period is a decision node that is informed by observations in that period and earlier and by nothing else (Poole and Mackworth, 2017). This representation of a POMDP allows ID algorithms, and hence BN algorithms, to be applied to solve them, although the efficiency of computations can be improved by using specially devised rather than generalpurpose algorithms to exploit the structure of the POMDP graph (Theocharous et al., 2004).
Predictive Causality and Predictive Analytics Models
The intuitive idea that causes help to predict their effects can be applied to screen for causation outside the context of time series analysis, helping to identify potential causal relationships between variables in a data set on the basis of whether they help to predict each other’s values. The discipline of predictive analytics is now well advanced and can be applied to crosssectional data as well as to time series data. Machine learning software packages such as the Caret package in R are freely available to automate the predictive analytics tasks of splitting data into disjoint training and test sets, optimizing choices of input parameters or settings for each of many parametric and nonparametric predictive models and algorithms, and comparing their predictive performances quantitatively using standard metrics such as sensitivity, specificity, and balanced accuracy. Such software can be used to identify which other variables help to predict the values of a userspecified dependent variable. The logic of predictive causality testing identifies these predictors as its potential direct causes in a data set, and screens out variables that do not help to predict it as not being potential direct causes.
Classification and Regression Tree (CART) Models
Classification and regression tree (CART) algorithms estimate the conditional mean – or, more generally, the conditional mean, standard deviation, and distribution – of a dependent variable based on the values of other variables that help to predict the distribution of its values. They are also called recursive partitioning algorithms because they work by partitioning the cases in a data set (its records or rows) into subsets with significantly different distributions of the dependent variable, and then recursively partitioning each subset until no further partitions with significantly different distributions of the dependent variable can be found. Each partition is called a “split” and is represented by a node in a tree, with branches below it showing how the data set is partitioned at that node. A split is formed by conditioning the distribution of the dependent variable on contiguous intervals of values for a continuous predictor variable, or on disjoint subsets of values for a discrete predictor variable. The predictor variable being conditioned on (or split on) is represented by a node in the tree, and the branches emanating from it represent the different sets or ranges of values being conditioned on to get significantly different distributions of the dependent variable. For example, if the probability of an illness depends on age, then consecutive ranges of age (a node variable) could be used as splits to estimate different conditional probabilities of illness given age ranges. Successive splits condition on information about values of additional predictors until no more information can be found that generates significantly different conditional distributions for the dependent variable. The next node to split on is typically selected via a heuristic such as choosing the variable that is estimated to be most informative about the dependent variable (i.e., maximizing reduction in expected conditional entropy of the dependent variable when split on), given the information (splits) conditioned on so far. Although many refinements have been made in recursive partitioning algorithms since the 1980s, including allowing parametric regression modeling for split selection if desired and avoiding overfitting by pruning trees to minimize crossvalidation error, all variations of CART algorithms still use the key idea of identifying variables that are informative about a dependent variable and then conditioning on values of these informative variables until no further valuable information for improving predictions can be found.
Example: Seeking Possible Direct Causes Using Classification and Regression Tree (CART) Algorithms
In the CAT software, a nonparametric classification tree is generated simply by clicking on the “Tree” command in the command menu at the left side of Figure 2.11. By default, the dependent variable is taken to be the first column in the data table. The user can select a different dependent variable if desired. Figure 2.14 shows the result of a applying the “Tree” command to the LA data set in Figure 2.11, with daily elderly mortality counts, AllCause75, as the dependent variable. (The tree is generated by the party package in R, https://cran.rproject.org/web/packages/party/party.pdf, which contains full documentation. Month and year are specified as categorical variables.) The interpretation is as follows. Each “leaf” node at the bottom of the tree shows the conditional expected value for cases (in this case, days, corresponding to the rows in the data table in Figure 2.11) that the leaf node describes. These are the “y” values for the leaf nodes. The “n” values show how many cases are described by each node. For example, the leftmost leaf node (numbered node 5) in Figure 2.14 describes n = 29 cases (days) out of 1461 records (rows) in the data set. The average value of the dependent variable, AllCause75, is y = 171.31 deaths per day for the days described by this node. The description for any leaf node is found by tracing the path from the root node at the top of the tree (node 1) to the leaf node. The conjunction of values (shown along the branches) for the variables (in the nodes) along the path constitute the description of the leaf node to which they lead. For example, the description for node 5, the leftmost leaf node in Figure 2.14, is as follows: February of year 2008. This is found as the conjunction or intersection of months 1, 2, 3, and 12, i.e., December through March, for the left branch below node 1 with months 1 and 2 along the left branch below node 2 and month 2 along the left branch below node 4; together with the year 2008, shown on the left branch below node 3. Similarly, the description for node 27, the rightmost leaf node, is JulySeptember of years 20072009. The average daily elderly mortality count is only 120.7 deaths per day during these 276 days, compared to 131.6 deaths per day during the hot days (with high temperatures over 92 degrees) in 2010 (node 26) and 171.3 deaths per day in February of 2008 (node 5). Thus, the treegrowing algorithm has discovered, or revealed, that the highest daily death counts occur in the winter months and the lowest occur in the summer months.
Fig. 2.14 A CART tree for the LA air, weather, and elderly mortality data set in Figure 2.11
In CART trees, as well as in BNs, what is not shown in a diagram is often even more interesting and useful than what is shown. In Figure 2.14, the fact that month appears in the tree as a key predictor of daily elderly mortality counts, but that sameday temperature per se does not (with the sole exception of the 2010 summer heat wave, where daily maximum temperatures over 92 were associated with a slight increase in the relatively low sameday daily mortality counts, as can be seen by comparing leaf nodes 25 and 26), suggests that it was not sameday low daily temperatures that directly caused high elderly mortality counts in February of 2008, but something else, for which month is a proxy.
Although Figure 2.14 illustrates a classification tree treating the data as crosssectional, lagged values of the different variables can also be used as predictors and included in treegrowing, thus combining time series information and sameperiod crosssectional information. Figure 2.15 shows the result of such a dynamic analysis. The dependent variable is AllCause75.lags.plus7, which is the default name for daily elderly mortality count 7 days in the future. Values for all variables from 7 days in the future back to the present day were included as potential predictors. The resulting tree in Figure 2.15 shows that the best predictor (first split at the top of the tree) for these deaths a week in the future is the low temperature today (tmin). In fact, lagged values of temperatures and AllCause75 itself are the only significant predictors of elderly mortality counts in this data set, showing that month and year were acting as surrogates for these values in Figure 2.14. Neither maximum relative humidity nor fine particulate matter (PM2.5) appears in the tree, indicating that in this data set, elderly mortality counts are conditionally independent of present and lagged values of these variables, given lagged values of elderly mortality counts and temperatures. Thus, daily minimum and maximum temperatures are identified as the only predictive causes of elderly mortality in this data set detected by classification tree analysis. Such analysis provides a nonparametric alternative to multivariate Granger causality testing.
Fig. 2.15 A tree with lagged values reveals that temperatures are predictive causes of elderly mortality
CART trees are closely related to BNs. Indeed, treegrowing algorithms can be used to help identify the DAG structure of a BN from data, as follows. Ideally, in a classification tree, the dependent variable should be conditionally independent of the variables not in the tree, given the variables that are in it. (If this were not so, then the information provided by at least some of the variables omitted from the tree could be used to improve predictions of the dependent variable, since it would depend in part on them.) Thus, the tree can be used as an approximate test for conditional independence. The variables that appear in it should ideally be those in its Markov blanket, i.e., its parents, children, or spouses. The qualifiers “ideally,” “approximately,” and “should” are included because real CART algorithms are not perfect oracles for establishing conditional independence: they are limited by finite sample sizes, sampling variability, and myopic selection of next variables to split on, and hence they may fall short of identifying all and only the variables on which the dependent variable depends. But in practice, they often provide a useful way, albeit only an approximate one, to identify conditional independence among variables in large data sets. Since the DAG structure of a causal Bayesian network represents conditional independence relationships among its variables, algorithms for identifying conditional independence from data can also help to discover these DAG structures.
Example: Conditional Independence Tests Constrain DAG Structures
Problem setting: Suppose that we had an ideal CART algorithm that provided a perfect test for conditional independence by generating trees that contain all and only those variables that the dependent variable depends on, i.e., that share positive mutual information with it, even after conditioning on other variables. Suppose also that we had a large data set with three columns, or variables, X, Y, and Z, which we will interpret as exposure, health, and lifestyle variables, respectively.
Problem: How could the CART algorithm be applied to the data set to determine which of the following DAGs models 15, if any, is consistent with the data?

X ¬ Z ® Y (e.g., exposure ¬ lifestyle ® health)

Z ® X ® Y (e.g., lifestyle ® exposure ® health)

X ® Y ¬ Z (e.g., exposure ® health ¬ lifestyle)

X ® Y ® Z (e.g., exposure ® health ® lifestyle)

X ® Z ® Y (e.g., exposure ® lifestyle ® health)
Solution: The key is that each model implies certain conditional independence relationships. These implications can be tested by using the CART algorithm to determine which conditional independence relationships hold among the variables in the data set. The conditional independence implications of the different models are as follows:

Models 1 and 5 (but none of the others) imply that health response Y is conditionally independent of exposure X given covariate Z. Thus, if either of these models is correct, then a CART tree for Y as the dependent variable should have Z in it but not X, and this pattern suffices to narrow down the choice of models to model 1 or model 5. Two models that have the same conditional independence implications, such as models 1 and 5, are said to be Markovequivalent, or to belong to the same Markov equivalence class. Even a perfect algorithm for determining conditional independence relationships among variables from data cannot distinguish among models in the same Markov equivalence class, and other principles must be used to uniquely identify DAG structures in such cases. For example, in model 1, changes in Z should precede resulting changes in X, but in model 5 this temporal precedence order is reversed.

Only in model 2, but not in the other models, are Y and Z are conditionally independent given X. Thus, a CART tree for Y that only has X in it but not Z, uniquely identifies model 2 as the correct one from among models 15.

Only in model 3 are X and Z unconditionally independent, but conditionally dependent given Y. A CART tree for Y would include both X and Z, which is also true for model 4, but a CART tree for Z that is constrained to split only on X would do so if model 4 generated the data (assuming that information is transmitted from X to Z via Y) and would not do so for model 3, since Z and X are unconditionally independent in model 3.

Only in model 4 are X and Z unconditionally dependent, but conditionally independent given Y. Thus, if this model generated the data, then if a tree for Z is split first on X only, then X will enter the tree. But if a tree for Z is grown with both X and Y as allowed variables to split on, then Y will enter the tree and X will not. This combination does not hold for any of the other models.
Thus, an ideal CART algorithm can be used to uniquely identify any of models 2, 3, or 4 or to discern that either model 1 or model 5 generated the data, if it is known that one of models 15 generated the data. Similar techniques can be used to discover or constrain which DAG models might have generated observed data even in the absence of a priori knowledge about the datagenerating process. This possibility is examined later in the section on causal discovery algorithms.
The second major connection between classification trees and BNs is that CART algorithms can be used to estimate the conditional probability tables (CPTs) for a BN from data if its DAG structure is known. The recipe is simple: grow a tree for each variable in the BN using only its parents as predictors. Each leaf of the resulting tree represents a conjunction of values (or value ranges) for the parents with a corresponding conditional frequency distribution for the dependent variable – precisely what is needed for its CPT. Moreover, the tree algorithm automatically discretizes continuous predictors, partitioning (“splitting”) them into consecutive intervals; and it only creates splits that lead to significantly different conditional probability distributions for the dependent variable. This provides an efficient way to store CPT information – typically far more efficient than listing each possible combination of parent values and child values with a conditional probability for each. For example, the tree in Figure 2.15 has only 16 leaf nodes, yet was grown for a data set with 64 variables (year, month, day, AllCause75, PM2.5, tmin, tmax, MAXRH, each lagged by 07 days) each with many possible values. Even if each variable had been quantized to only 10 deciles, there still could have been up to 100 million combinations to assign probabilities to (allowing for the possibility of a different probability value for each possible combination). A tree with 16 leaf nodes is clearly a much more tractable way to store the information that matters. To use such a tree to find the conditional probability that a dependent variable Y has specific value y, given the values of its parents, one simply “drops” the values of the parents through the tree – that is, applies the successive splits in the tree to the parents’ values to determine which leaf of the tree describes them. Following the path through the tree (the sequence of branches emanating from successive nodes, from the root at the top of the tree to a leaf node at its bottom) determined by the values of the parents leads to a unique leaf node. For example, the tree in Figure 2.15 would classify a summer day that occurred after a week of days with temperatures always above 60 degrees as belonging to leaf node 31 at the far right, since this case meets its defining conditions, tmin > 49 degrees Fahrenheit and tmin.lags.plus3 > 54 degrees. Node 31 is the most common leaf (i.e., classification) in this tree, with 496 cases. Whenever the low temperature is over 49 degrees today and is over 54 degrees 3 days from now, the day 7 days from now is classified as belonging to leaf node 31, with a conditional mean value of E(Y  node 31) = 123.17 for elderly mortalities. The estimated conditional probability of any specific value for a case at that node, such as P(y = 130 deaths  tmin > 49, tmin.lags.plus3 > 54) is then just the fraction of all 496 cases at that node that have that particular y value. The values of other parents are irrelevant once a leaf node has been reached: all combinations for their values give the same estimated conditional distribution for the dependent variable (since otherwise the treegrowing would have continued). In practice, BN learning programs typically discretize, or bin, the values of continuous variables or variables with many possible values into 10 deciles, and then quantify their empirical relative frequencies, which are point estimates of their probabilities. (Some Bayesian approaches adopt a Dirichlet prior and present posterior distributions rather than point estimates for the probabilities of specific values.) Programs such as Netica display the bar charts for the probabilities of the discretized values of random variables directly on the node. No matter how such final processing and display details are handled, however, the CART tree structure provides a highly parsimonious way to store empirical CPT information.
A CART tree requires specifying a single dependent variable. A BN can be thought of as summarizing the results from multiple interrelated CART trees, one for each variable. As an example, consider the Bayesian network (BN) in Figure 2.16. This BN was learned from the LA data set in Figure 2.11 using the “Bayesian” command in the CAT menu, which draws on the bnlearn package in R, as described in more detail in the section on causal discovery algorithms.
Fig. 2.16 A BN for the LA air, weather, and elderly mortality data set in Figure 2.11
Each node with inwardpointing arrows can be viewed as having a corresponding classification tree, with the variables that point into it appearing in the tree and with the CPT for the node consisting of the conditional probability distributions for its value specified by the leaf nodes of the tree. To view a node’s tree explicitly, one can use the “Tree” command at the bottom of the menu on the left side of the screen. For example, Figure 2.17 shows a tree for AllCause75 with month as its parent (and with only the conditional means values, rather than the entire conditional distributions, for the dependent variable AllCause75 displayed at its leaf nodes). The BN DAG display in Figure 2.16 suppresses the details of the tree for each node, showing only its parents and not the CPT information on how their values determine the conditional probability distribution for the node. But it has the advantage of showing the parents for all variables simultaneously. The bnlearn package uses different detailed algorithms than the “Tree” procedure to generate its CPTs and DAG structures, including binning continuous variables into ten deciles rather than using splits to quantize them. Nonetheless, viewing a BN conceptually as consisting of a collection of classification trees, one for each node, with the parents of a node (if any) appearing in its tree and with the CPT for the node determined by the leaf nodes of its tree (or simply specified as a marginal probability distribution, for an input node having no parents), shows that BNs can be interpreted as multivariate generalizations of CART trees that remove the restriction of considering only one dependent variable at a time.
Fig. 2.17 A classification tree for the AllCause75 node in Figure 2.16
The Random Forest Algorithm: Importance Plots and Partial Dependence Plots
Suggestive and useful as CART trees are, individual trees are often not robust, meaning that growing trees on several random subsets of the data may lead to quite different trees. This problem is addressed in modern machine learning by using model ensembles, i.e., collections of hundreds of individual models (e.g., CART trees). These are generated from samples from the original data set. Ensemblebased predictions are formed by averaging the predicted values for the dependent variable, such as its conditional mean values at the leaf nodes of the trees, given the values of its predictors, over all the models in the ensemble. A popular nonparametric ensemble technique for predictive analytics is the Random Forest algorithm, for which documentation and an R implementation are available via the randomForest package in R (https://cran.rproject.org/web/packages/randomForest/randomForest.pdf). Python programmers can use sklearn.ensemble.RandomForestClassifier. We will use the Causal Analytics Toolkit (CAT) software (http://coxassociates.com/CloudCAT), which runs the randomForest R package and presents its results via an internet browser (or, optionally, an Excel addin). By default, the package averages predictions for the dependent variable over an ensemble of 500 trees grown on different samples from the full data set.
The CAT software provides two commands that generate and display outputs from the randomForest package: “Importance” and “Sensitivity.” Each generates multiple outputs. Figures 2.18 and 2.19 show key outputs from the “Importance” command. The plot in Figure 2.18 shows how the mean squared prediction error (MSE) for the dependent variable AllCause75,lags.plus7 decreases as the randomForest algorithm averages predictions over increasing numbers of trees, falling from more than 300 for a single tree to about 160 when 300 or more trees are included in the ensemble. Conditioning on all other variables allows about 46% of the variance in the dependent variable, daily elderly mortality counts, to be explained, as indicated by the “% Var explained: 46.07” message in the “randomForest summary” section. (In the CAT software, clicking on any such section heading calls up corresponding documentation, explaining the outputs and how they are calculated in more detail. Since randomForest averages over many random samples of data, its outputs vary slightly between repeated runs.)
Fig. 2.18 Outputs from the randomForest analysis: Percent of variance explained and accurancy vs. number of trees in the ensemble. “rf” is an abbreviation for “random forest.”
Figure 2.19 shows a second output from the “Importance” command: estimates of the percentage increases in mean squared prediction error made by the rf ensemble (denoted by “%IncMSE” on the x axis) if each predictor is excluded. This is at best only a rough heuristic for gauging the importance of predictors, as it considers them one at a time and ignores interactions among them. Nonetheless, it can reveal such insights as that all of the top few predictors are lagged values of minimum daily temperature (tmin), suggesting that daily elderly mortality rate depends primarily on cold temperatures in recent days. (Only the top 30 predictors are shown; a separate table provides a complete listing of all the variables and their estimated importances.)
Fig. 2.19 Importance plot from the randomForest analysis
The “Sensitivity” command in CAT uses the randomForest package to generate a partial dependence plot (PDP). This is illustrated in Figure 2.20 for the partial dependence of elderly mortality in a week (AllCause75.lags.plus.7) on low temperature today (tmin). Roughly speaking, a PDP shows how the predicted value of the dependent variable (on the y axis) depends on one specific explanatory variable (on the x axis), given the current values of the other variables. It is generated by using the random forest ensemble to predict the average value of the dependent variable as one predictor (here, tmin) is varied over its entire range of values in the data set while holding all other variables fixed at their actual values in the data set. In more detail, the process is as follows. First, a random forest ensemble of many (e.g., 500) classification trees is created using R’s randomForest package. Next, the data set is modified by replacing all of the values in one column – the one for the selected explanatory variable, which is tmin in Figure 2.20 – with a constant value, the same for all records (rows) in the data table. For the first iteration, this constant value is the smallest value of the explanatory value in the data set. It is increased in subsequent iterations to the next smallest value, and so on incrementally until it reaches the maximum value of the selected explanatory variable. At each iteration, each modified record in the data set (meaning the original row modified to replace the entry in the column for the selected explanatory variable with the constant value for that iteration) is dropped through each tree in the ensemble, i.e., each tree is used to predict a value of the dependent variable for each modified record. The predicted values of the dependent variable are then averaged over all records for all trees. The result is a single (averaged) predicted value for the dependent variable for the specified constant value of the selected explanatory variable. This constitutes one point on the PDP. As the constant value is incrementally increased from the smallest to the largest value of the explanatory variable, the entire PDP curve is generated.
Fig. 2.20 A partial dependence plot (PDP) for elderly mortality count in 7 days from now (AllCause75.lags.plus.7) vs. low temperature today (tmin)
In CAT, clicking on the “Sensitivity” command automatically generates a PDP. The default is to treat the first column in the data table as the dependent variable and the second column as the selected explanatory variable that will go on the x axis, but the user can select any two variables to fill these roles. (Python versions of random forest also allow two explanatory variables to be selected and generate PDP surfaces for the dependent variable as a function of the two selected explanatory variables, but this is not yet supported in CAT.) As shown in Figure 2.20, CAT generates a curve for the dependent variable as a function of the selected explanatory variable and plots it twice using two different vertical axes: one on the left that emphasizes the changes in the dependent variable over its range of values, and one on the right that shows the same PDP with a vertical axis that starts at zero. The CAT software (but not the randomForest package) also runs a Spearman’s rank correlation test to determine whether there is a significant ordinal association between the explanatory variable and the dependent variable in the PDP.
The mechanics of generating a PDP have now been explained. But what is its intended interpretation – what does a PDP mean, and why is it useful? The PDP for a dependent variable Y vs. an explanatory variable X is not simply a depiction of the conditional mean value of Y for different assumed values of X predicted by the nonparametric model ensemble of classification trees in the random forest. Rather, it is this with the additional constraint that other variables are held fixed at their current values as Y is predicted for different assumed values of X. In the special case where X is a direct cause (i.e., a parent) of Y in a DAG model, the PDP for Y vs. X is a nonparametric estimate of the natural direct effect of X on Y holding the values of other variables fixed. Whether this coincides with the manipulative effect on Y of setting X to different values while holding other variables fixed depends on how well the estimated PDP estimated from the observed joint values of variables among the cases in the data set describes the conditional mean value of Y for different X values that would be calculated via graph surgery (representing direct manipulation of X values) from the CPTs in a valid causal BN model of the same data.
Causal ConcentrationResponse Curves, Adjustment Sets, and Partial Dependence Plots for Total and Direct Effects in Causal Graphs
One of the most commonly used displays in air pollution health effects research and epidemiology is the exposure concentrationresponse curve. This plots the risk (or incidence rate in a population, in units such as expected cases per personyear) on the y axis against a summary of exposure concentration on the x axis. The exact intended interpretation of such curves is seldom specified. Authors interpret them variously as offering descriptive summaries of observed or expected exposure concentration and response data points (the regression interpretation) and as indicating how responses would change if exposure concentrations were changed (a manipulative causal interpretation), usually without specifying why the concentrations might change or distinguishing among natural direct, controlled direct, total, or other effects. Yet, the same curve cannot represent all of these distinct concepts.
To more clearly define what a causal concentrationresponse curve might mean, consider the example DAG model in Figure 2.21.
Fig. 2.21 DAGitty diagram for the effect of exposure on health outcome
Source: Drawn in browser using DAGitty software (www.dagitty.net/dags.html)
This DAG model was drawn using the online version of DAGagitty (www.dagitty.net/dags.html), a software package for creating and analyzing the statistical implications of DAG models. The arrows between its nodes are intended to signify that the probability distribution for a health outcome in members of a population (e.g., age at death, lung cancer by age 70, etc.) depends directly on income (which may serve as a proxy for health care and other variables), residential location, and exposure. Income, in turn, depends directly on employment (or occupation), and exposure depends both directly on employment and also indirectly on employment via income, as well as depending directly on residential location, which in turn depends on income. A full causal BN model would quantify these dependencies with CPTs for each node (and a marginal distribution for the employment input node). The DAG model implies that the joint distribution of the variables can be factored as follows:
P(employment)*P(income  employment)*P(residential_location  income)*P(exposure  employment, income, residential_location)*P(health outcome  income, residential_location, exposure).
Given any set of values for the five variables, the joint probability that they will have those values can be calculated by plugging appropriate numbers from the nodespecific probability tables into this formula. But even without specifying the quantitative probability tables for the model, the DAG structure alone allows useful inferences to be drawn about the data analyses needed to quantify total effects and direct effects of one variable on another – in this case, the effect of exposure on health outcome.
DAGitty lets the user select one main cause variable and one effect variable. It then applies welldeveloped graphtheoretic criteria and algorithms to automatically calculate adjustment sets showing what other variables to condition on in order to calculate the total and direct causal effects of the selected cause variable on the selected effect variable (Textor et al., 2016). These are listed in the “Causal effect identification” area at the upper right of the DAGitty screen. In Figure 2.21, DAGitty lists a single “minimal sufficient adjustment set for estimating the total effect of exposure on health outcome.” It consists of the two confounders income and residential_location. Both of these must be conditioned in to obtain estimates of the unconfounded causal effect of exposure on health outcome. If all relationships were known to be linear, then a multiple linear regression model for the causal effect of exposure on health outcome would have to include income and residential_location on its righthand side, as well as exposure, in order for the resulting regression coefficient for exposure to be interpretable as indicating the predicted increase in health outcome per unit increase in exposure.
Below the adjustment sets, DAGitty lists testable implications of the DAG structure, using the common notation “Y X  Z” to denote “Y is conditionally independent of X given Z.” Two testable implication of the DAG structure are shown: that health outcome is conditionally independent of employment given the values of exposure, income, and residential_location; and that residential_location is conditionally independent of employment given income. The full theory and resulting algorithms for automatically generating all such testable implications and the mimimal sufficient adjustment sets for estimating total and direct causal effects are discussed in Textor et al., 2016 and its references. A key principle is to condition on confounders, or common causes, of the exposure and response variables to get unconfounded estimates of total or direct causal effects of exposure on response probabilities; but not to condition on common children (“colliders”) or their descendents, in order to avoid creating Berkson selection biases. A graphtheoretic criterion called “dseparation” that accounts for blocking of information transfer along paths by appropriate conditioning provides the generalization needed to compute minimal sufficient adjustment sets and unbiased estimates of direct and total causal effects, as well as testable conditional independence implications of a DAG model.
To generate partial dependence plots (PDPs) estimating natural direct and total causal effects of one variable on another, the CAT software uses an R package implementation of DAGitty to automatically determine which other variables to condition on, i.e., to compute minimal sufficient adjustment sets. It then conditions on these variables in the random forest ensembles used to compute the PDPs. We illustrate this process for one of the data sets (“asthma”) bundled with CAT. Readers who wish to follow along in detail can browse to the http://coxassociates.com/CloudCAT site and load the data set from the “Sample: Data” dropdown menu near the top of the CAT Data screen. This is a random sample of a large data set for adults 50 years old or older from 20082012 for whom survey data are available from the Centers for Disease Control and Prevention (CDC) Behavioral Risk Factor Surveillance System (BRFSS), along with countylevel average annual ambient concentrations of ozone (O3) and fine particulate matter (PM2.5) levels recorded by the U.S. Environmental Protection Agency. Figure 2.22 shows the first few records of this data set, with several columns selected for further analysis. More detailed description of the data set and the variables is provided by Cox (2017b). For purposes of illustrating causal analysis using DAGitty and PDPs, all that is needed is the data set and the names of the variables (columns) included in the analysis.
The first step in the analysis is to obtain a BN DAG model of the data. CAT uses the bnlearn package in R for this purpose; it is activated by selecting the “Bayesian” command from the command menu. Details of this step, including optional incorporation of knowledgebased constraints such as that heart disease is a possible effect but not a possible cause of exposures, are discussed in the following section on causal discovery algorithms. By default, the “Bayesian” command applies a hillclimbing algorithm (“hc”) that searches for a BN model
Fig. 2.22 The “asthma” example data set used to illustrate BN learning
Fig. 2.23 A BN DAG generated for the data set in Figure 2.22 by the BN learning program
Fig. 2.24a The BN from Figure 2.23 redrawn by CAT to allow interactive editing and selection of variables for causal analysis
Fig. 2.24b Closeup of the network in Figure 2.24a. Age has been selected as the cause or exposure variable interest and HeartAttackEver as the effect or target variable of interest for further analysis.
maximizing a score assessing how well the model explains the data. This algorithm, together with some knowledgebased constraints such as that month and year can be causes but not effects of other variables, produces the BN DAG in Figure 2.23.
Figure 2.25 Partial dependence plot for the direct causal effect of Age on HeartAttackEver, adjusting for IncomeCode and Sex.
Once a DAG model has been obtained from the data, possibly with the guidance of knowledgebased constraints on allowed (or required) arrows, the next step is for the user to select an Exposure variable (also called a source or cause) and a target variable (also called a response or effect) and to specify whether the direct or the total effect of the former on the latter is to be quantified. In CAT, this is done with the help of an interactive BN DAG editor (Figure 2.24) that redraws the output from the bnlearn package (Figure 2.23) and allows it to be edited. This interactive graph editor allows the user to reposition the nodes, select a source (green node) and a target (pink node) by clicking on them, and specify or revise any knowledgebased constraints about allowed and forbidden arrows to be assumed in creating and using the DAG model. Figure 2.24b shows a closeup of the network diagram in Figure 2.24a, with Age selected as the exposure (cause) and HeartAttackEver selected as the target (effect).
Next, the user must specify whether to quantify the direct effect or the total effect of the exposue on the target. This selection is made using the Direct or Total radio buttons near the top of the screen in Figure 2.25. The CAT software then runs the DAGitty package and lists adjustment sets of variables to condition on to control for potential biases in estimating the selected causal effect (bottom of Figure 2.24a). In Figure 2.25, there is only one adjustment set, consisting of IncomeCode and Sex. CAT completes the analysis by generating a partial dependence plot (PDP) of the target variable vs. the source variable. The PDP conditions on the variables in the selected adjustment set (via the randomForest algorithm in R) to control for confounding and selection biases without introducing new ones. Figure 2.25 shows the resulting PDP for HeartAttackEver vs. Age, controlling for IncomeCode and Sex by conditioning on them in the trees on the random forest ensemble. The left side has a vertical scale set by the variations in the target variable and the right side shows the same PDP on a vertical scale that includes 0. CAT also tests whether the there is a significant ordinal association between the souce and target variables using Spearman’s rank correlation, which reflects how likely it is that higher values of the effect variable occur for higher values of the exposure variable. As shown at the bottom of Figure 2.25, there is a highly statistically significant ordinal association between age and selfreported heart disease, as one would expect.
Including KnowledgeBased Constraints and Multiple Adjustment Sets
We have now illustrated the mechanics of estimating the direct or total effect of one variable on another in a DAG model. However, one might (and should) question whether the algorithms used have produced sensible results. Some of the arrow directions in Figure 2.24b seem counterintuitive for causal interpretation. The arrow from PM2.5Average to IncomeCode might possibly be rationalized by considering that locations with dirty air are less likely to attract highincome residents, but the arrow from IncomeCode to Education seems clearly wrong, insofar as educational attainment usually precedes current income. One option is to leave the DAG model in Figure 2.24b asis, bearing in mind that the DAG simply represents a possible way of factoring the joint distribution of the variables in terms of marginal distributions for its inputs (Sex, Month, Year, and Age) and conditional distributions (CPTs) for its endogenous variables. In this case, the only causal implications of the DAG are that direct causes should be linked by arrows (in either direction) to their effects. But if we want to constrain the arrow directions to better correspond to possible manipulative causation, then the CAT software can be used to enter such constraints. Figure 2.26 shows how to forbid an arrow from IncomeCode to Education by entering its start and end points on a list of forbidden arrow. Constraints requiring an arrow or specifying a variable as a source (input node with only outwarddirected arrow allowed) or as a sink (output node with only inwarddirected arrows allowed) can be entered into CAT if desired. With this constraint, clicking on the “Run” button at the bottom of the screen in Figure 2.26 produces the revised DAG in Figure 2.27. In the DAG model of Figure 2.27, the arrow directions seem generally consistent with a possible manipulative causal interpretation.
Figure 2.26 Entering a constraint forbidding an arrow from IncomeCode to Education
Fig. 2.27 Revised DAG model with knowledgebased constraints from Figure 2.26
The use of knowledgebased constraints allows the user to inform the DAGlearning algorithms about such important facts as that month might cause exposure but exposure does not cause month. It is not necessary to impose such constraints to discover which variables are informative about each other, but including them allows arrows to be oriented to be more consistent with manipulativecausal interpretations. In some cases, however, the directions of arrows are ambiguous even in principle. For example, if people with higher incomes tend to reside in neighborhoods with less pollution, should an arrow be directed from income to pollution (suggesting that higher incomes allow one to purchase homes in less polluted neighborhoods) or from pollution to income (suggesting that high pollution drives away highincome residents)? Either resulting DAG provides a valid representation of the same joint distribution of the variables. Some causal graph modeling programs do not require that all arcs between pairs of variables be directed, but DAG models are the most common causal graph modeling framework. We will use the DAG model in Figure 2.27 with the understanding that some of its arrows could be redirected without greatly changing the results or their interpretation.
In many applications, more than one adjustment set is identified from a DAG via the dseparation and DAGcomputation algorithms in DAGitty. Each of them can be used to estimate the effect of the exposure variable on the target variable. Figures 2.28 and 2.29 present an example. The DAG model in Figure 2.27 has two different adjustment sets that can be conditioned on to obtain unbiased estimates of the direct causal effect of age on selfreported cumulative heart attack risk: {IncomeCode, Sex, Smoking} and {Education, IncomeCode, Sex}. Figures 2.28 and 2.29 show the respective partial dependence plots for heart attack risk vs. age estimated for these two adjustment sets. The two PDPs are similar but not identical, as might be expected due to the random variability in random forest model ensembles.
Once a DAG model has been created for a data set, it can be reused to quantify direct or total causal relationships between any pair of variables. With CAT’s interactive DAG network editor, this is done simply by clicking on a new exposuretarget pair and then clicking on the “Calculate” button (top left of Figure 30). Figure 2.30 shows the estimated direct causal effect (PDP) of IncomeCode on selfreported cumulative heart attack risk, HeartAttackEver, obtained by selecting (clicking on) these variables in order and then clicking on the “Calculate” button. Other causal effects can be estimated equally easily. In this data set, as in many other realworld data sets, health risks decrease significantly at higher income levels.
Fig. 2.28 Partial dependence plot for the direct causal effect of Age on HeartAttackEver, adjusting for IncomeCode, Sex, and Smoking.
Fig. 2.29 Partial dependence plot for the direct causal effect of Age on HeartAttackEver, adjusting for Education, IncomeCode, and Sex
Fig. 2.30 Partial dependence plot (PDP) for the direct causal effect of IncomeCode on HeartAttackEver, adjusting for Age, Sex, and Smoking.
The RandomForest package in R and the PDPs that it generates do not assume a parametric statistical model and therefore do not provide confidence intervals for model parameters and estimates (as some Bayesian extensions of CART methods do that assume parametric prior distributions for tree depth and branching factors). However, CAT provides other ways to test for model validity and to characterize uncertainties in BN results and causal PDP estimates. One option is to run the DAGitty package from CAT. Scrolling down the “Bayesian” screen below the PDP output shown in Figure 2.30 brings up the portion of this screen shown in Figure 2.31, and clicking the “Run package dagitty” button then generates testable conditional independence implications (the beginning of the listing of which is shown in Figure 2.31), followed by a listing of path coefficients and total effects identifiable by regression (with their corresponding adjustment sets) and path coefficients identifiable by instrumental variables, along with corresponding variables to condition on. Testing the testable implications of the DAG model listed by DAGitty, e.g., by using CART trees to check whether the implied conditional independence relations hold in a data set, provide one way to assess its consistency with data.
Fig. 2.31 Part of the output from the DAGitty package listing testable implications of the DAG model in Figure 2.27
Figure 2.32 shows another way to evaluate a DAG model: use several different BN learning algorithms to estimate the causal BN DAG from the data and see how well they agree. In CAT, clicking on the “Compare” button (upper left of Figure 2.32) runs four different BN algorithms (hc = hill climbing, tabu = tabu search, gs = growshrink Markov blanket, iamb = incremental association Markov blanket); these algorithms are documented in the bnlearn package documentation (https://cran.rproject.org/web/packages/bnlearn/bnlearn.pdf). The number of these algorithms that agree on each arrow is then tabulated (bottom of Figure 2.32, table truncated after the first five rows for legibility) and use to generate a composite diagram (top of Figure 2.32) in which thicker arrows indicate support from more of these DAGlearning algorithms. Checking the “Use short labels” option in the shaded control box to the left of the network causes abbreviated variable names to be used, and this has been done in Figure 2.32 to increase legibility. A slider (“Min.weight to display”) can be used to display only those arrows that have support from at least a certain number of the DAGlearning algorithms. If this slider is set to show arrows on which all four algorithms agree, then the variables separate into two clusters: Year, Month, PM2.5, and O3 on the left, and all other variables on the right. Age, Sex, Smoking, and Income are identified by all four algorithms as the parents of heart attack risk, shown as the node HAE at the right of the network in Figure 2.32. (Algorithms gs and iamb can produce undirected arcs when the data do not allow a unique direction to be inferred. In Figure 2.32, an undirected arc is counted as one arrow in each direction.)
Fig. 2.32 Visualizing consonance of findings from different causal discovery algorithms
Power Calculations for Causal Graphs
Absence of an arrow between two nodes in a DAG model learned by causal discovery algorithms from available data does not necessarily imply that they are not informative about each other or that one does not cause the other. There might be a dependency between them that is simply too small to be detected. To assess this possibility, it is useful to create simulated data sets from the original one, in which it is assumed that one variable does depend on another, with the dependency described by a simple formula such as that each 1% deviation from the mean of one variable causes a corresponding b% deviation from the mean of the other variable. Here, b reflects the elasticity of the second variable with respect to changes in the first. In many realworld data sets, increasing b gradually reveals a fairly sharp, thresholdlike transition from values of b that are too low to yield arrows between the two variables in the DAG, given the random variations of the variables, and values of b that are high enough so that most or all of the DAGlearning algorithms will detect the dependency and create an arrow between them. Such sensitivity analyses reveal the sizes of effects that are too small to detect and those that are large enough to be detected with near certainty.
Figure 2.33 illustrates an alternative way to visually check the validity of missing arrows in a DAG model. If no arrow connects two variables, such as PM2.5 and AllCause75 in the DAG in Figure 2.23, then the partial dependence plot showing how one depends on the other (after conditioning on the values of other variables, yielding a natural direct effect estimate) should be approximately horizontal, as shown in Figure 2.33. In this PDP, AllCause75 remains flat at about 134.6 0.4 as PM2.5 ranges from about 5 to over 80 micrograms per cubic meter. The expanded vertical scale on the left shows some noise, or random variation around this level, as should be expected due to the random sampling components of the random forest algorithm used in generating the PDP, but overall the relation is close to flat.
Figure 2.33 Partial dependence plots for missing arrows (e.g., from PM2.5 to AllCause75) should be approximately horizontal, indicating absence of dependency
Predictive Analytics for Binary Outcomes: Classification and Pattern Recognition
The classification tree, random forest ensemble, partial dependence plot, and BNlearning and modeling techniques just surveyed are all used as parts of contemporary predictive analytics. That direct causes are informative about, and help to predict, their effects (and more specifically, in DBN models, the future of their effects) forges a strong link between predictive and causal analytics, but predictive analytics methods also have many other useful applications. A standard predictive analytics challenge is to learn how to predict the values in one column of a spreadsheet or data table – in other words, the values of some target variable – from the values in other columns. This can be accomplished by partitioning the data into disjoint training and test subsets of records (rows); eliminating redundant columns (duplicates or highly correlated columns) and useless columns (those with the same in all or nearly all rows) training and tuning each of a suite of machine learning algorithms (e.g., regression models, CART, random forest, neural networks, support vector machines, clustering, and other algorithms) on the training set; and evaluating and comparing their performance on the test set. The entire process can be automated using software such as the caret (Classification and Regression Training) package in R (Kuhn, 2017, https://topepo.github.io/caret/; Kuhn, 2008). In CAT, selecting a target column to predict (the first column is the default) and a set of columns to use as predictors (the default is all of the other columns) and then clicking on the “Predict” command invokes the automated predictive analytics capabilities of the caret package.
We will illustrate the process for the applied problem of learning to predict whether a chemical is a mutagen. The data set used for this example is the “mutagens” data set that comes bundled with CAT; it is excerpted from a quantitative structureactivity relation (QSAR) data set for chemical mutagenicity provided as part of an R package of QSAR data (https://cran.rproject.org/web/packages/QSARdata/QSARdata.pdf). Figure 2.34 shows the first 10 records in the data set. Each row corresponds to a chemical. Each column is an attribute of the chemical such as its molecular weight, number of halogen atoms, and so forth. The dependent variable, mutagen, is in the first column: it has possible values of 1 = mutagen and 0 = not a mutagen, as measured in an Ames Salmonella test for mutagenicity. The predictors consist of the large number of chemical properties summarized in subsequent columns. The prediction task is to use a subset of the records to learn prediction rules for classifying chemicals as mutagens on not mutagens based on their chemical properties (attributes). This is typical of a wide class of practical problems in which the goal is to learn to classify cases or patterns accurately based on a training set in such a way that the learned classification rules (i.e., class prediction rules) can be applied successfully to new cases not in the training set.
Fig. 2.34 Mutagen data set
CAT provides two options for selecting a subset of records to train on: random sampling from the set of all records, or selection of the top part of the data table as the training set, with the rest of the data as the test set. In either case, the user must specify how much of the data set (e.g., 50%, 75%, or some other fraction) to use for training. If the rows correspond to consecutive observations in a time series, then training on the earlier rows and testing them on later rows may correspond to the natural order in which data become available. Figure 2.35 shows the options for creating disjoint test sets and, if desired, for filtering columns, i.e., reducing the number of candidate predictors by using CART trees, linear regression coefficients, or importances in a random forest analysis to select the most promisinglooking predictors. Options to automatically detect and drop redundant (highly correlated) columns and to preprocess predictors, e.g., by standardizing them, are included on the right side of the Prediction Options dialogue. The bottom of the screen summarizes the user’s choices under “Prediction output.” These options are detailed further in the caret documentation, which is accessible from CAT via hyperlinks. Once they have been selected, clicking on the Run button causes the rest of the predictive analytics process to run and generate output reports.
Fig. 2.35 Prediction options to be supplied by the user
Figure 2.36 shows one of the detailed output reports: the observed values (1 = mutagen, 0 = not mutagen) for chemicals in the test set (indicated by InTrain = 0) and the values predicted by each of several machinelearning algorithms (earth, which is the R implementation of multiple adaptive regression splines; rpart and ctree, which are two recursive partitioning CARTtype algorithms; random forest (rf); gradient boosted machine (gbm); and boosted generalized linear model (glmboost). The caret documentation and its references provide details of these machinelearning algorithms and many others; for our purposes, however, it suffices to treat them as blackbox algorithms for learning predictive rules from training data. The output of each algorithm is the conditional probability that a chemical is a mutagen in the Ames Salmonella test, given the values of its other attributes; these probabilities are rounded to the nearer of 0 or 1 in Figure 2.36. Figure 2.36 shows that there are some chemicals that are easy to classify, in the sense that all of the algorithms correctly predict the value of mutagen for them. The chemical in the first row in the table in Figure 2.36, with SampleID number 525, is an example: all of the predictive algorithms correctly predict the observed value, mutagen = 1. By contrast, all of the algorithms misclassify chemical 534 in the bottom row. The top half of Figure 2.37 presents a visualization of the predictive performance of the different algorithms on all chemicals in the test set. In this visualization, the top row, “Observed,” shows the correct classification of each chemical (column), with orange used for mutagens and grey used for nonmutagens. A dendogram clustering tree is used to automatically order the chemicals to allow simple visual interpretation: the solid orange columns at the right indicate chemicals that were correctly classified as mutagens by all algorithms, while solid grey columns, mostly toward the left, represent chemicals that were correctly classified as nonmutagens. The remaining columns, with both grey and orange cells, are for chemicals that are misclassified by at least some of the algorithms – most frequently, by gbm for this data set.
Fig. 2.36 Detailed output from “Predict” command showing observed values and predicted values from each of six machine learning algorithms for cases in the test set (InTrain = 0).
The bottom half of Figure 2.37 shows the calibration curve for each algorithm. This answers the following question: given a prediction of the probability that a chemical is a mutagen, what fraction of chemicals with that predicted probability actually are mutagens? The diagonal line in each plot is the line of perfect calibration for predictions, in which the fractions of chemicals that are mutagens match the predicted probabilities that they are mutagens. The Sshaped curves show the actual empirical relations between predicted probabilities of being a mutagen (horizontal axis) and the fractions of chemicals that are indeed mutagens (vertical axis). These curves show that earth, rpart, and glmboost are fairly well calibrated for these data, although they tend to understate low and high probabilities (i.e., predicted low probabilities should be even lower and predicted high probabilities should be even higher). Both ctree and rf tend to systematically underestimate probability of being a mutagen over most of their ranges of predicted probabilities. The gbm algorithm has by far the worst calibration.
Fig. 2.37 Visualizations of predictive performance of machine learning algorithms (top) and their calibration curves (bottom)
Several standard metrics are commonly used to summarize and compare the performance of different predictive algorithms. Figure 2.28 shows these standard outputs. Clicking on the hyperlink “Simple guide to confusion matrix terminologies” near the top of the screen while using CAT on line pulls up definitions of many of these terms, but the two most important concepts are as follows. The fourquadrant diagrams at the top show the number
Fig. 2.38 Summary of predictive model performance
of false positives (nonmutagens mistakenly classified as mutagens) in the lower left (e.g., 22 for the earth algorithm, 23 for rpart, 21 for ctree, and so on) and the number of false negatives (mutagens mistakenly classified as nonmutagens) in the upper right (14 for earth, 19 for rpart, 29 for ctree, and so on). Better predictive algorithms have smaller lobes and lower frequency counts along this main diagonal. The upper left and lower right give the numbers of true negatives and true positives, respectively: these are the correctly classified chemicals.
Of the many summary performance metrics tabulated below these “confusion matrix” results, one of the most useful is balanced accuracy, shown in the bottom line. This summarizes the probability that each algorithm will predict the classification of a chemical correctly when the prediction task is made as hard as possible by balancing the samples so that exactly half are mutagens and half are not. This avoids enabling a classifier with poor predictive power to appear to perform well on a highly unbalanced sample (e.g., one with 95% mutagens) simply by guessing the most likely value every time. A balanced accuracy of 0.50 indicates that a predictive algorithm is no more accurate than random guessing, while a balanced accuracy of 1 would indicate a completely accurate classifier. With the exception of gbm, all of the predictive algorithms achieved balanced accuracies of between 0.7 and 0.8 on the test set, indicating substantial predictive power. The earth algorithm performs best by this metric, consistent with its performance in the confusion matrix.
To apply the predictive models to new cases once they have been learned and evaluated, e.g., to predict the mutagenicity of chemicals for which the correct classification is not yet known, the data for these chemicals can be appended to the data table in Figure 2.34 with blanks left in the mutagen column, and the probabilities that mutagen = 1 will then be computed for each of these chemicals by each predictive algorithm. The calibration information shown in the Sshaped curves at the bottom Figure 2.37 can be applied to these probabilities to improve their accuracy by correcting for any systematic distortions (departures from the line of perfect calibration) revealed when the algorithms are evaluated on the test set. Such probability calibration adjustments to improve predicted class probabilities are performed automatically by modern machine learning software (e.g., http://scikitlearn.org/stable/modules/calibration.html).
The methods for automated predictive analytics surveyed in this section provide a firm computational foundation for investigating questions such as whether, and to what degree, one variable helps to predict another, e.g., as revealed in PDPs and balanced accuracy scores. For time series variables, these methods can also be used to test whether the past values of some variables help to predict future values of other variables, and can do so without committing to the parametric modeling restrictions of classical Granger causality tests. Insofar as direct causes must be informative about, and help to predict, the values of their effects, predictive analytics can serve as a valuable screen for identifying potential direct causes of an effect by determining which variables help to predict its values, even after conditioning on the values of other variables. This is one of the key ideas used to learn causal graph models from data, and hence causal discovery algorithms make heavy use of machine learning methods for predictive analytics.
Dostları ilə paylaş: 