Example: Collider Stratification and Selection Bias – Conditioning on a Common Effect
The converging DAG model (2.16) and its abstract version (2.17) show situations in which two variables point into a third. If arrows are interpreted as reflecting causality (by any definition), then the third variable can be viewed as a common effect of the two causes. Conditioning on a specific value of the effect variable induces a probabilistic dependency between the values of the variables that point into it.
low_income → hazardous_occupation ← good_health (2.16)
X → Z ← Y (2.17)
Z is called a collider between X and Y in this DAG, for obvious reasons. A possible interpretation is that only people with good health (indicated by Y = 1) or with low incomes (indicated by X = 1) are likely to work in a particular hazardous occupation (indicated by Z = 1). Treating all variables as binary for simplicity, suppose that the marginal distributions are that each of X and Y is equally likely to be 0 or 1, and the expected value of Z is given by the structural equation E(Z) = 0.5(X + Y) (equivalent to the CPT P(Z = 1 | X = 0, Y = 0) = 0; P(Z = 1 | X = 1, Y = 1) = 1; P(Z = 1 | X = 1, Y = 0) = P(Z = 1 | X = 0, Y = 1) = 0.5). Building this network in Netica, entering the finding Z = 1, and varying the value of X shows that, with the constraint Z = 1, cases with X = 1 have a 2/3 probability of having Y = 1, but cases with X = 0 have a 100% probability of having Y = 1. Interpreting this pattern in terms of model (2.16), among workers in a hazardous occupation (corresponding to Z = 1), those with low incomes (corresponding to X = 1) are less likely to have good health (corresponding to Y = 1) than those with high income (corresponding to X = 0). Quantitatively, the probability of good health (Y = 1) is 2/3 for those with low income, compared to 100% for those with high income. The reason is not that high income is a cause of better health. Rather, the explanation is that having high income implies that a worker is not employed in this occupation because of low income, and therefore the rival explanation for such employment – that the worker has good health – becomes more probable. This is another way in which observing one condition (high income) can make another more likely (good health) even if neither is a cause of the other.
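The conditional probabilities quoted above can be checked without BN software by applying Bayes' rule directly. The following is a minimal sketch that uses only the marginals and CPT stated in the text; the function and variable names are illustrative and are not part of Netica.

```python
# Marginals from the text: X and Y are each equally likely to be 0 or 1.
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}


def p_z1(x, y):
    """CPT from the text: P(Z = 1 | X = x, Y = y) = 0.5 * (x + y)."""
    return 0.5 * (x + y)


def p_y1_given_x_z1(x):
    """P(Y = 1 | X = x, Z = 1), by Bayes' rule over the two values of Y."""
    joint = {y: p_x[x] * p_y[y] * p_z1(x, y) for y in (0, 1)}
    return joint[1] / (joint[0] + joint[1])


for x in (0, 1):
    print(f"P(Y=1 | X={x}, Z=1) = {p_y1_given_x_z1(x):.4f}")
# P(Y=1 | X=0, Z=1) = 1.0000
# P(Y=1 | X=1, Z=1) = 0.6667
```

Without the finding Z = 1, the same enumeration gives P(Y = 1 | X = x) = 0.5 for both values of x, so the dependency appears only after conditioning on the collider.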
The general pattern illustrated by the foregoing example, in which conditioning on a common effect (or, in less suggestive language, a common child or descendant in a DAG) induces statistical dependencies among its parents or ancestors, has been discussed under different names in epidemiology, including selection bias, collider stratification bias, and Berkson’s bias (Cole et al., 2010; Westreich, 2012). A practical implication is that statistical modeling can inadvertently create significant associations and dependencies between variables that are not causally related, or even associated with each other in the absence of conditioning on other variables. Either of the following two common research situations can create such non-causal statistical associations:
A study design that selects a certain population for investigation, such as an occupational cohort, a population of patients in a specific hospital or health insurance plan, or residents in a certain geographic area. If the dependent variable of interest and an explanatory variable of interest both affect membership in the selected study population, then selection bias might create spurious (meaning non-causal) associations between them.
A statistical analysis, such as regression modeling, which stratifies or conditions on the observed values of some explanatory variables.
Stratification and conditioning are usually done to “control for” the statistical effects of the selected explanatory variables on a dependent variable. However, the attempted control can create spurious effects if a variable that is stratified or conditioned on – that is, a variable on the right side of a regression equation – is affected by both the dependent variable and one or more other explanatory variables. Conditioning on it can then induce spurious associations between other explanatory variables that affect it and the dependent variable. For example, if the data-generating process is described by the collider DAG X → Z ← Y, then regressing Y against X and Z might produce highly statistically significant regression coefficients for both X and Z even though neither one actually affects Y. If Z = X + Y, where X and Y are two independent random numbers and Z is their sum, then fitting the regression model E(Y | X, Z) = β0 + βX X + βZ Z to a large sample of values for the three variables (x, y, z = x + y) produces estimated regression coefficients of 1 for Z and -1 for X, corresponding to the equation Y = Z - X, even though Y is unconditionally independent of X. The values of X and Y might have been generated by independent random number generators, but conditioning on their common effect Z by including it on the right side of the regression equation induces perfect negative correlation between X and Y at any given value of Z, despite their unconditional independence. Thus, regression modeling can estimate a significant statistical “effect” of X on Y even if they are independent, if some other covariate Z that depends on both of them is included in the model.
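The regression example can be reproduced with a short simulation. This is a sketch under stated assumptions: the text says only that X and Y are independent random numbers, so the standard normal distribution, the sample size, and the seed used here are arbitrary illustrative choices, and ordinary least squares is computed with NumPy rather than with any particular statistics package.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n = 10_000                       # arbitrary "large sample" size

x = rng.normal(size=n)           # assumed distribution: standard normal
y = rng.normal(size=n)           # generated independently of x
z = x + y                        # collider: common effect of x and y

# Unconditional association between X and Y (should be close to zero).
corr_xy = np.corrcoef(x, y)[0, 1]

# Least-squares fit of E(Y | X, Z) = b0 + bx * X + bz * Z.
design = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, bx, bz = coef

print(f"corr(X, Y) = {corr_xy:.3f}")
print(f"b0 = {b0:.3f}, bx = {bx:.3f}, bz = {bz:.3f}")
```

Because y equals z - x exactly in this construction, the fit recovers an X coefficient of -1 and a Z coefficient of 1 up to floating-point error, while the sample correlation between X and Y stays near zero.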
Causal Probabilities and Causal Bayesian Networks
It is clear from the foregoing examples that observing a high value of one random variable can increase the probability of observing a high value for another random variable even if there is no direct causal relation between them. Reverse causation, confounding, and selection bias provide three different ways for the observed occurrence of one event to make occurrence of another more probable without causing it. However, BNs also provide a simple way to represent direct causal effects and mechanisms. Suppose that the CPT for each node in a BN is interpreted as specifying how its conditional probability distribution changes if the values of its parents are changed. Such a BN is called a causal Bayesian network. In a causal BN, the parents of a node are its direct causes with respect to the other variables in the BN, meaning that none of them mediates the parents’ effects on the node. Conversely, the children of a node, meaning the nodes into which it points via arrows of the DAG, are interpreted as its direct effects. The CPT for a variable in a causal BN represents the causal mechanism(s) by which the values of its parents affect the probability distribution of its values. If the causal BN represents manipulative causation, then changing the value of a parent changes the probability distribution of the child, as specified by its CPT. In this sense, the causal BN implements the idea that changing a cause changes the probabilities of the values of its direct effects.
Example: Modeling Discrimination in College Admissions – Direct, Indirect, and Total Effects of Gender on Admissions Probabilities
Setting: Suppose that a college admissions committee reviewing recent admissions statistics finds that men have a 60% acceptance rate while women with the same academic qualifications have only a 50% acceptance rate. Half of applicants are men and half are women. Their standardized test scores, educational backgrounds, and other academic qualifications are indistinguishable.
Causal prediction question: What would be the effects on acceptance rates for women if each department were required to change its admissions rate for women to be equal to its admissions rate for men with the same academic qualifications? What data are needed to answer this question?
Solution: Accurately predicting what would happen if the policy were adopted requires a causal model for acceptance probabilities. Suppose that the causal DAG model in Figure 2.5 describes the data-generating process. The data for the node probability tables are listed below the DAG.
Fig. 2.4 A Causal BN for the college admissions decision
P(man) = P(woman) = 0.50
P(history | man) = 1, so P(math | man) = 0
P(history | woman) = 0, so P(math | woman) = 1
P(accept | woman, history) = 0.70
P(accept | woman, math) = 0.50
P(accept | man, history) = 0.60
P(accept | man, math) = 0.40
In words, half of all applicants are women. Women apply only to the math department and men apply only to the history department; thus, the direct effect of sex on admissions probability, if any, is completely confounded with the effect of department. In the given description of the problem setting, department is an omitted confounder: the description does not mention it. But it must be considered to build a causal model adequate for answering questions about the effects of interventions. Each department currently has an admissions rate that is 0.10 higher for women than for men (given the same qualifications), with the math department’s admissions rate being 0.50 for women and 0.40 for men, and the history department’s admission rate being 0.70 for women and 0.60 for men. (These numbers would have to come from old data or from policies, since currently only women apply to the math department and only men apply to the history department.) Thus, if each department changed its admissions rates for women to equal its admissions rate for men, then admissions rates for women would fall from 0.50 to 0.40. The direct effect of sex on admissions probability is outweighed by its indirect effect, mediated by the department applied to, to yield a total effect that is opposite in sign to the direct effect. Being a woman increases acceptance probability, other things being equal (namely, department applied to), but decreases acceptance probability overall. This is an example of Simpson’s Paradox.
If inquiry revealed a different CPT for the admission decision node, then the predicted outcome would be quite different. If the CPT for the admission decision node were as follows:
P(accept | woman, history) = 0.50
P(accept | woman, math) = 0.50
P(accept | man, history) = 0.60
P(accept | man, math) = 0.60
then the policy change would increase the admissions rate for women from 0.50 to 0.60. Thus, the effects of the policy change depend on the details of the admission decision CPT. They cannot be inferred solely from available data on current admission rates broken down by sex, or even from current data broken down by sex and department, since these rates only show what is happening now and this information does not reveal what would happen under altered conditions. That requires knowledge of P(accept | man, math), which is not currently observed, as men do not apply to the math department. This illustrates the fact that the consequences of a policy intervention can be underdetermined by data collected before the policy is enacted. The Lucas critique mentioned in Chapter 1 expands on this point.
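The arithmetic behind both predictions can be laid out explicitly. The sketch below encodes the two admission-decision CPTs and the department-choice probabilities given in the text; the dictionary layout and the `treat_as` argument are illustrative devices for representing the policy "apply the men's admission rates to women," not notation from the source.

```python
# Department choice by sex, from the text.
p_dept = {
    "woman": {"history": 0.0, "math": 1.0},
    "man": {"history": 1.0, "math": 0.0},
}

# First admission-decision CPT from the text: P(accept | sex, department).
cpt_first = {
    ("woman", "history"): 0.70, ("woman", "math"): 0.50,
    ("man", "history"): 0.60, ("man", "math"): 0.40,
}

# Alternative CPT from the text.
cpt_second = {
    ("woman", "history"): 0.50, ("woman", "math"): 0.50,
    ("man", "history"): 0.60, ("man", "math"): 0.60,
}


def acceptance_rate(sex, cpt, treat_as=None):
    """Acceptance rate for applicants of `sex`, averaged over the departments
    they apply to. `treat_as` selects whose CPT rows are applied (default:
    the applicant's own sex); treat_as="man" models the proposed policy."""
    row = treat_as or sex
    return sum(p * cpt[(row, dept)] for dept, p in p_dept[sex].items())


for name, cpt in (("first CPT", cpt_first), ("second CPT", cpt_second)):
    now = acceptance_rate("woman", cpt)
    after = acceptance_rate("woman", cpt, treat_as="man")
    print(f"{name}: women's rate {now:.2f} now, {after:.2f} under the policy")
# first CPT: women's rate 0.50 now, 0.40 under the policy
# second CPT: women's rate 0.50 now, 0.60 under the policy
```

Both CPTs reproduce the same currently observed rates (0.60 for men, 0.50 for women), which is why current data alone cannot distinguish between them.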
The discrimination example illustrates the differences among direct, indirect, and total effects of sex on admissions probabilities. More generally, epidemiologists distinguish among the following types of effects in causal DAG models (Robins and Greenland, 1992; Pearl, 2001; Petersen et al., 2006; VanderWeele and Vansteelandt, 2009):
The direct effect of a change in a parent on a child is the change in the child’s expected value (or, more generally, in its conditional probability distribution) produced by changing the parent’s value, holding the values of all other variables fixed. In a structural equation model (SEM) with Y = aX + bZ and Z = cX, for example, the direct effect on Y of a unit increase in X would be a, since Z would be held fixed.
The controlled direct effect is the effect if other variables are held fixed at the same values for all cases described by the causal model. For example, consider an SEM with Y depending (perhaps nonlinearly) on X and Z, and with Z depending on X, where X is exposure, Y is an indicator of health effects found in insurance records of covered treatments, and Z is a covariate such as participation in an employer-provided insurance program. The controlled direct effect of an increase in X on Y would be calculated by holding Z fixed at the same level (all insured or all not) for all individuals in the study. Knowledge of the CPT P(y | x, z) makes this easy to calculate.
By contrast, the natural direct effect of X on Y allows the mediator Z to have different values for different individuals, holding them fixed at the levels they had before X was increased. For data analysis purposes, partial dependence plots (PDPs), introduced later in this chapter, provide a useful non-parametric method for estimating natural direct effects.
The total effect of a change in X on Y is the change in the expected value of Y (or its probability distribution) from the change in X when all other variables are free to adjust. In the SEM with Y = aX + bZ and Z = cX, an increase in X by one unit increases Y by a total of a + bc units, representing the sum of the direct effect a and the indirect effect mediated by Z, bc.
The total indirect effect is the difference between the total effect and the natural direct effect.
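For the linear SEM used in the definitions above (Y = aX + bZ and Z = cX), the direct, total, and indirect effects of a unit increase in X can be computed mechanically. In this sketch the coefficient values and the baseline x0 are hypothetical numbers chosen only for illustration.

```python
# Hypothetical coefficients for the SEM  Y = a*X + b*Z,  Z = c*X.
a, b, c = 2.0, 3.0, 0.5


def z_of(x):
    """Structural equation for the mediator: Z = c * X."""
    return c * x


def y_of(x, z):
    """Structural equation for the outcome: Y = a * X + b * Z."""
    return a * x + b * z


x0 = 1.0          # hypothetical baseline value of X
z0 = z_of(x0)     # value of Z before X is increased

# Direct effect: raise X by one unit while holding Z at its earlier value.
direct = y_of(x0 + 1, z0) - y_of(x0, z0)

# Total effect: raise X by one unit and let Z adjust through Z = c * X.
total = y_of(x0 + 1, z_of(x0 + 1)) - y_of(x0, z0)

# Indirect effect: the part of the total effect transmitted through Z.
indirect = total - direct

print(direct, total, indirect)   # 2.0 3.5 1.5
```

With these values the results equal a, a + bc, and bc respectively. Because the model is linear with no interaction between X and Z, the controlled and natural direct effects coincide here; they can differ in nonlinear models.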
A further assumption often made in interpreting a causal BN in light of cross-sectional data on its variables is that the probability distribution for each variable is conditionally independent of the values of its more remote ancestors, and, indeed, of the values of all of its non-descendants, given the values of its parents. (More generally, each variable is conditionally independent of all others, given the values of its parents, children, and spouses, i.e., other parents of its children; these are collectively referred to as its “Markov blanket.”) For the DAG model X → Z → Y, this condition implies that Y is conditionally independent of X given Z. In symbols, P(y | x, z) = P(y | z) for all possible values x, y, and z of random variables X, Y, and Z, respectively. Thus, conditioning on X and Z provides no more information about Y than conditioning on the value of Z alone. Informally, X affects Y only through Z; more accurately, all of the information that X provides about Y is transmitted via Z. The assumption that each node is conditionally independent of its non-descendants given its parents is termed the Markov assumption for BNs, or the Causal Markov Condition (CMC) for causal BNs, in analogy to the property that the next state of a Markov process is conditionally independent of its past states, given the present state. Philosophical objections raised about the suitability of CMC for causal analysis have led to refinements of the concept (Hausman and Woodward, 1999, 2004), and alternative formulations use more explicitly causal language, such as that a variable is conditionally independent of its non-effects given its direct causes. When causal BNs are interpreted as modeling manipulative causality, CMC implies that changing a variable’s indirect causes, i.e., its ancestors, affects its probability distribution only via effects on its direct parents.
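The Markov condition for the chain X → Z → Y can be verified numerically for any particular set of CPTs. The following sketch uses hypothetical binary CPTs (the numbers are invented for illustration), builds the joint distribution from the chain factorization, and compares P(Y = 1 | x, z) with P(Y = 1 | z).

```python
from itertools import product

# Hypothetical CPTs for the chain X -> Z -> Y (all variables binary).
p_x = {0: 0.6, 1: 0.4}
p_z_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # [x][z]
p_y_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # [z][y]

# Joint distribution implied by the factorization P(x) P(z | x) P(y | z).
joint = {
    (x, z, y): p_x[x] * p_z_given_x[x][z] * p_y_given_z[z][y]
    for x, z, y in product((0, 1), repeat=3)
}


def p_y1_given_xz(x, z):
    """P(Y = 1 | X = x, Z = z), computed from the joint distribution."""
    return joint[(x, z, 1)] / (joint[(x, z, 0)] + joint[(x, z, 1)])


def p_y1_given_z(z):
    """P(Y = 1 | Z = z), marginalizing over X."""
    num = sum(joint[(x, z, 1)] for x in (0, 1))
    den = sum(joint[(x, z, y)] for x in (0, 1) for y in (0, 1))
    return num / den


def p_y1_given_x(x):
    """P(Y = 1 | X = x), marginalizing over Z."""
    num = sum(joint[(x, z, 1)] for z in (0, 1))
    den = sum(joint[(x, z, y)] for z in (0, 1) for y in (0, 1))
    return num / den


for x, z in product((0, 1), repeat=2):
    print(x, z, round(p_y1_given_xz(x, z), 4), round(p_y1_given_z(z), 4))
print(round(p_y1_given_x(0), 4), round(p_y1_given_x(1), 4))   # 0.2 0.45
```

In this example X is informative about Y marginally (0.20 versus 0.45), but once Z is known, the value of X no longer changes the probability of Y.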
In the context of data analysis, CMC is often paired with a second condition, termed faithfulness, stating that the set of all conditional independence relations among variables consists of exactly those that are implied by the DAG model. This rules out the logically possible but empirically unlikely situation in which one variable appears to be conditionally independent of another only because multiple effects coincidentally cancel each other out or duplicate each other. For example, in the DAG model X → Z → Y, faithfulness implies that X and Z values do not happen to coincide. If they did, then the conditional independence relations in this DAG would be indistinguishable from those implied by Z → X → Y, making it impossible to tell whether X or Z is the parent of Y (i.e., whether Y is conditionally independent of X given Z or whether instead Y is conditionally independent of Z given X). Likewise, if exercise is a direct cause of food consumption and if both are parents of cardiovascular risk in a DAG model, then faithfulness would require that the direct effect of exercise on cardiovascular risk must not be exactly counterbalanced by the indirect effect via food consumption, creating an appearance in the data that cardiovascular risk is independent of exercise.
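The exercise example can be made concrete with a simulated violation of faithfulness. This sketch assumes a linear model with standard normal noise, and the coefficients are hypothetical values chosen so that the direct effect of exercise on risk (a) exactly cancels the indirect effect through food consumption (b times c); none of these numbers come from the source.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
n = 50_000                       # arbitrary sample size

# Hypothetical linear coefficients with exact cancellation: a + b * c = 0.
c = 0.5     # exercise -> food consumption
b = 1.0     # food consumption -> cardiovascular risk
a = -0.5    # exercise -> cardiovascular risk (direct effect)

exercise = rng.normal(size=n)
food = c * exercise + rng.normal(size=n)
risk = a * exercise + b * food + rng.normal(size=n)

# Marginal association: the total effect a + b*c is zero, so this is near 0.
corr_exercise_risk = np.corrcoef(exercise, risk)[0, 1]

# Adjusting for food consumption recovers the nonzero direct effect.
design = np.column_stack([np.ones(n), exercise, food])
coef, *_ = np.linalg.lstsq(design, risk, rcond=None)
direct_estimate = coef[1]

print(f"corr(exercise, risk) = {corr_exercise_risk:.3f}")
print(f"estimated direct effect of exercise = {direct_estimate:.3f}")
```

A method that reads the near-zero marginal correlation as evidence of no arrow from exercise to risk would be misled here, which is the situation the faithfulness condition rules out by assumption.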
An obvious way for CMC to fail is if a common cause not included in the BN affects both a node and one of its ancestors without affecting its parents in the BN. Such a hidden confounder (also called an unobserved confounder or latent confounder) could create a statistical dependency between X and Y in the DAG model X → Z → Y even after conditioning on the value of Z (Tashiro et al., 2013). The requirement that each variable in a causal BN depends only on its parents shown in the DAG, implying that all common causes of any two variables in the DAG are also included in the DAG, is called causal completeness. The closely related condition that all such common causes are not only modeled (represented in the DAG), but also measured or observed, is called causal sufficiency. When it holds, the observed values of the parents of a node are sufficient to determine which conditional distribution in its CPT the child’s values are drawn from. None of the conditions of causal sufficiency, causal completeness, faithfulness, or CMC necessarily holds in real-world data sets, but they are useful for discussing conditions under which causal discovery algorithms can be guaranteed to produce correct results.
Causal BNs model manipulative causation in a straightforward way. Changing a direct cause in a causal BN changes the conditional probability distribution of its children, as specified by their CPTs. In particular, the probability distribution of a variable can be changed, or manipulated, by exogenous acts or interventions that set specific values for some or all of its controllable parents. These exogenously specified values override endogenous drivers, effectively disconnecting any arrows pointing into the manipulated variables, and simply fix their values at specified levels. (Technical discussions often refer to this disconnecting as “graph surgery” on the BN’s DAG and denote by “do(X = x)” or “do(x)” the operation of setting the value of a variable X to value x, to emphasize that the value of X is exogenously set to value x rather than being passively observed to have that value). Similarly, the CPT for a node in a causal BN depends only on the values of its parents (direct causes), and not on how those values were determined, whether via exogenous manipulation or endogenously.
Example: Calculating Effects of Interventions via the Ideal Gas Law
Many BNs contain “function nodes” that calculate the value of a variable as a deterministic function of the values of its parents. Netica and other BN software packages allow deterministic functions to be specified in place of CPTs. (A function y = f(x) can be construed as a special CPT, with P(y = f(x) | x) = 1.) Figure 2.4 shows an example involving the continuous variables P = pressure of a gas inside a container, T = temperature of the container, V = volume of the container, n = amount of gas in the container, and s = strength of the container. There is also a binary indicator variable X = container failure, with X = 1 if the container fails during a certain time interval of observation and X = 0 otherwise. The CPT for pressure is given by a formula such as
P = nRT/V
where R is a constant with value 0.08206 when V is in liters, P in atmospheres, n in moles, and T in kelvins. This idealization suffices for purposes of illustration, although realistic formulas are more complex. The probability of container failure over the interval of interest depends on the pressure and on the strength of the container, which in turn may depend on container geometry and details of the spatial distribution of material characteristics not shown in the model.
Fig. 2.4 A causal BN for container failure due to overpressure
Nodes: n = moles of gas; T = temperature; V = volume; P = pressure; s = strength of container; X = container failure
Arrows: n → P, T → P, V → P, P → X, s → X
Suppose that the probability of failure over the interval of interest if the container has pressure P and strength s is given by some empirically derived function, E(X | P, s) = F(P, s). If a decision-maker controls the temperature at which the container is stored, and if all other parents and ancestors of X have known values, then an intervention that sets the temperature to some specific value T* (sometimes indicated via the notation “do(T*)”) will thereby cause a risk of failure given by F(P*, s) where we define P* = nRT*/V. On the other hand, if an overpressure release valve and a compressed gas supply are rigged up to keep P fixed at some target value P0, then the risk of failure will remain at F(P0, s) no matter how T is varied. This illustrates the fact that X depends on T only through P, and therefore interventions that set the value of P to some specific value override the effects of changes in T, effectively cutting it out of the causal network that affects X.
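The two interventions described above can be sketched in code. The pressure function follows the ideal gas law and the value of R given in the text. The failure function F(P, s) is not specified in the source, so the logistic form below is a purely hypothetical stand-in, as are the numerical inputs and the `do_P` argument used to represent setting pressure exogenously.

```python
import math

R = 0.08206  # L·atm/(mol·K), value given in the text


def pressure(n, T, V):
    """Function node for pressure: ideal gas law P = nRT/V."""
    return n * R * T / V


def failure_prob(P, s):
    """HYPOTHETICAL F(P, s): logistic in (P - s). The text leaves F
    unspecified; this form is chosen only so the example runs."""
    return 1.0 / (1.0 + math.exp(-(P - s)))


def risk(n, T, V, s, do_P=None):
    """Failure risk E(X | P, s). If do_P is given, pressure is fixed
    exogenously at that value (graph surgery: the arrows from n, T, and V
    into P are cut); otherwise P is computed from its parents."""
    P = pressure(n, T, V) if do_P is None else do_P
    return failure_prob(P, s)


n, V, s = 1.0, 10.0, 3.0   # hypothetical known values of the other variables

# do(T*): changing temperature changes the risk through P* = nRT*/V.
for T_star in (300.0, 400.0):
    print(f"do(T={T_star}): risk = {risk(n, T_star, V, s):.4f}")

# Pressure held at P0 by a valve and gas supply: T no longer matters.
P0 = 2.0
for T_star in (300.0, 400.0):
    print(f"do(P={P0}), T={T_star}: risk = {risk(n, T_star, V, s, do_P=P0):.4f}")
```

The first loop prints different risks at the two temperatures, whereas the second prints the same risk twice, mirroring the statement that X depends on T only through P.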
Statistical methods for estimating causal BNs from data, known as causal discovery algorithms, are discussed more fully later in this chapter. They often simplify the estimation task by assuming that the CMC, faithfulness, and causal sufficiency conditions hold. More general causal discovery algorithms allow for the possibilities of latent confounders and selection bias (Zhang, 2008; Ogarrio et al., 2016). When causal sufficiency cannot be assumed, meaning that variables not included in a causal graph model might explain some of the observed statistical dependencies between its variables, the directed acyclic graph (DAG) modeling assumption is relaxed to allow undirected (or bidirected) arcs as well as directed ones. This creates mixed graphs with both directed and undirected arcs between nodes. Mixed graphs include several specialized data structures (such as maximal ancestral graphs (MAGs) and partial ancestral graphs (PAGs), the latter representing classes of MAGs that are not distinguishable from each other by conditional independence tests) that acknowledge that some of the observed variables may have unobserved common causes, or ancestors, outside the model (Triantafillou and Tsamardinos, 2015). An undirected arc joining two variables indicates that their association can be explained by an unobserved confounder. Since BN inference algorithms work equally well with networks having the same arcs oriented in opposite directions as long as the joint probability distribution of variables has been factored correctly (e.g., P(x, y) can be factored equally well as P(y)P(x | y) or as P(x)P(y | x), corresponding to Y → X or X → Y, respectively), they can readily be extended to apply to mixed graphs. However, it is then important to keep track of which inferences have causal interpretations and which only reflect statistical dependencies.
Current algorithms for causal discovery and inference and for synthesizing causal inferences across multiple studies automatically keep track of causal and statistical inferences and clearly distinguish between them (Triantafillou and Tsamardinos, 2015).