Better Causal Inferences and Benefits Estimates via More Active Judicial Review
If regulations are sometimes advocated based on overly optimistic and simplistic causal assumptions and models of their effects and the benefits that they cause, what can and should be done about it – and by whom? How might causal inferences and benefits estimates used in regulatory proceedings be made more accurate and trustworthy? This section develops the following points.
Once the most relevant concept of causation, namely identifying actions that change the probabilities of preferred outcomes (manipulative causation), has been clearly defined, such improvements in predicting or assessing the benefits caused by regulations are indeed technically possible, based on experience in a variety of other areas.
It is (or should be) well within the competence and jurisdiction of courts to help bring about these improvements by exercising more stringent judicial review of the causal reasoning used to project benefits and advocate for regulations. This is especially so if current forms of deference to regulatory agencies are replaced by a more active role, as urged by the recently proposed Separation of Powers Restoration Act amendment to the Administrative Procedure Act (Walker, 2016).
The organizational culture of many regulatory agencies makes it difficult for them to improve their own causal assumptions and benefits assessments without the compelling presence of an active judiciary engaged in questioning and challenging their reasoning. In part, this is because of a tendency to dismiss as non-scientific or non-expert the concerns of laypeople and other stakeholders that regulations will not produce their intended benefits (Wynn, 1993). In part it arises because regulators use frameworks that treat causality as a matter for expert judgment rather than as a matter of empirically discoverable and verifiable fact.
To overcome these obstacles, it is both necessary and practical to inject more data-driven rather than judgment- and assumption-driven concepts and techniques for assessing causation into deliberations over regulations.
Advances in data science and analytics make it technically possible, and even easy, to test whether necessary conditions for causality, such as that a cause should help to predict its effects, hold in available data sets. They enable the shift from judgment-driven to data-driven analyses of causal impacts and benefits from regulations in many important cases where relevant data are available or readily obtainable, as in the examples of air pollution regulation and food safety regulation. But this shift is unlikely to take place within established regulatory cultures that emphasize the judgments of salaried experts and assumption-based modeling as tools for deciding how the world works (Wynn, 1993).
By contrast, an adversarial setting in which both those who support a proposed regulation and those who oppose it subject their competing causal analyses and resulting benefits estimates to critical review based on rigorous objective standards provides incentives for production and use of relevant data and analyses. These incentives are lacking when only the regulator is charged with making a credible case for proposed regulations, and when opposing views are addressed only through responses to public comments (which, in practice, can usually be readily dismissed, e.g., by citing the contrary views and expertise of those who side with the regulator). Excessive deference to regulatory science by administrative law courts damps the incentives for others to challenge, and perhaps improve, it. Conversely, more active judicial review can stimulate challenges to improve the factual basis for estimated regulatory benefits. Courts are already positioned as the cheapest providers of review and enforcers of rigorous reasoning about the benefits claimed to be caused by proposed regulations. Finding regulations to be arbitrary and capricious when the evidence provides no good reason to expect that they will actually cause the benefits claimed for them might create incentives to improve the quality of causal inference in regulatory science and to reduce passage of regulations whose benefits end up being less than was projected, and perhaps less than their costs.
Distinguishing among Different Types of Causation
In discussing the health and economic benefits caused by regulations, policy makers, regulators, courts, scientists, media, and the public refer to several distinct types of causation, often without clearly distinguishing among them (Dawid, 2008). These were articulated and examined at some length in Chapter 2. Here we recapitulate some key points to help make this chapter relatively self-contained; Chapter 2 offers additional explanations and refinements. Each of the following concepts of causality is widely used in discussing the causal implications of associations found in observational data. Each has its own large, specialized technical literature (see Chapter 2), but they are often conflated.
Associational and attributive causation. This is the concept of causation most commonly used in epidemiology and in regulatory risk assessments and benefits estimates for health and safety regulations. It addresses how much of an observed statistical association between an exposure and an adverse outcome will be attributed to the exposure, and how much will be attributed to other factors. This is often interpreted as showing how much of the causation (or blame or liability in legal applications) for an adverse outcome is attributable to exposure, and how much to each of the other causes or factors that produced it. In epidemiology, etiological fractions, population attributable fractions, population attributable risks, burdens of disease, and probabilities of causation are all examples of attributive causal concepts (Tian and Pearl, 2000). As commonly used and taught, all are derived from relative risk, i.e., the ratio of risks in exposed and unexposed populations, or among more- and less-exposed individuals. Hence, they are all based solely on statistical associations.
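To make concrete how such measures follow from relative risk alone, the following minimal sketch computes two of them using standard textbook formulas: the attributable fraction among the exposed, (RR - 1)/RR, and Levin's population attributable fraction. The function names and numerical inputs are hypothetical and are not taken from any particular regulatory analysis.

```python
def attributable_fraction_exposed(relative_risk):
    """(RR - 1) / RR: share of risk among the exposed attributed to exposure."""
    return (relative_risk - 1.0) / relative_risk


def population_attributable_fraction(prevalence, relative_risk):
    """Levin's formula: p(RR - 1) / (1 + p(RR - 1))."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical inputs: 30% of the population exposed, relative risk of 2.
af_exposed = attributable_fraction_exposed(2.0)
paf = population_attributable_fraction(0.30, 2.0)
print(f"attributable fraction among exposed: {af_exposed:.3f}")
print(f"population attributable fraction:    {paf:.3f}")
```

Both quantities are functions of exposure prevalence and relative risk only. Nothing in the calculation uses information about what would happen to risk if exposure were changed.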
Predictive causation. In statistics, economics, physics, and neuroscience, among other fields, it is common to define one variable as being a cause of another if and only if the first helps to predict the second (e.g., Friston et al., 2013; Furqan and Siyal, 2015). For example, if exposure helps to predict an adverse health response, then exposure is considered a (predictive) cause of the response. As an important special case, Granger causality between an exposure time series and a response time series (Kleinberg and Hripcsak, 2011) is based on the principle that causes help to predict their effects. (Technically, X is a Granger-cause of Y if the future of Y is dependent on – or, more formally, is not conditionally independent of – the history of X, given the history of Y.) Thus, nicotine-stained fingers can be a Granger cause of lung cancer, helping to predict it, even if cleaning one's fingers would have no effect on future lung cancer risk (manipulative causality) (Woodward, 2013).
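The nicotine-stained fingers example can be sketched in code. The simulation below is an illustrative assumption, not an analysis of real data: a hidden variable (smoking) drives both an observed "exposure" (stained fingers) and a later outcome (a cancer marker). A lag-1 Granger-style F statistic, computed with a small hand-written least-squares routine, asks whether the history of the exposure improves prediction of the outcome beyond the outcome's own history. All variable names, coefficients, and noise levels are hypothetical.

```python
import random


def ols_rss(predictors, y):
    """Residual sum of squares from regressing y on an intercept plus predictors."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    k = len(cols)
    # Augmented normal equations (X'X | X'y).
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in cols]
         + [sum(ci[t] * y[t] for t in range(n))] for ci in cols]
    # Gaussian elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(i + 1, k):
            f = a[r][i] / a[i][i]
            for c in range(i, k + 1):
                a[r][c] -= f * a[i][c]
    # Back-substitution for the coefficients.
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (a[i][k] - sum(a[i][j] * b[j] for j in range(i + 1, k))) / a[i][i]
    return sum((y[t] - sum(b[j] * cols[j][t] for j in range(k))) ** 2
               for t in range(n))


def granger_f(x, y):
    """F statistic: does lag-1 x improve prediction of y beyond lag-1 y?"""
    target, y_lag, x_lag = y[1:], y[:-1], x[:-1]
    rss_restricted = ols_rss([y_lag], target)
    rss_full = ols_rss([y_lag, x_lag], target)
    dof = len(target) - 3  # intercept + two slopes in the full model
    return (rss_restricted - rss_full) / (rss_full / dof)


def simulate(n=600, clean_fingers=False, seed=0):
    """Hidden smoking drives both stained fingers and next-period cancer marker."""
    rng = random.Random(seed)
    smoking = [rng.gauss(0, 1) for _ in range(n)]
    stain_noise = [rng.gauss(0, 0.3) for _ in range(n)]
    # Intervention: cleaning fingers sets the observed exposure to zero.
    fingers = [0.0 if clean_fingers else s + e
               for s, e in zip(smoking, stain_noise)]
    # The outcome depends on smoking only, never on fingers.
    cancer = [0.0] + [0.8 * smoking[t - 1] + rng.gauss(0, 0.5)
                      for t in range(1, n)]
    return fingers, cancer


fingers, cancer = simulate()
f_forward = granger_f(fingers, cancer)   # stained fingers -> cancer marker
f_reverse = granger_f(cancer, fingers)   # cancer marker -> stained fingers
_, cancer_clean = simulate(clean_fingers=True)
print("F (fingers help predict cancer):", round(f_forward, 1))
print("F (cancer helps predict fingers):", round(f_reverse, 1))
print("outcome unchanged by cleaning fingers:", cancer == cancer_clean)
```

Under these assumptions the exposure is a strong predictive (Granger) cause of the outcome, yet the outcome series is identical whether or not fingers are cleaned, because the outcome equation never takes the exposure as an input.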
Counterfactual causation (Höfler, 2005; Wang et al., 2016) attributes the difference between observed outcomes and predicted outcomes that would have occurred under different conditions, such as if exposure had not been present, to the differences between the real and alternative (“counterfactual”) conditions. This difference in conditions is said to cause the difference in outcomes, in counterfactual causal models.
Structural causation and exogeneity. In constructing a simulation model of a dynamic system, the values of some variables are calculated from the values of others. As the simulation advances, input values may change, and then the values of variables that depend on them may change, and so forth, until exogenous changes in the inputs have propagated through the system, perhaps leading to new steady-state values for the output variables until a further exogenous change in inputs leads to further changes in the values of other variables. The order in which variable values are calculated and updated reflects a concept of causality in which the values of some variables are determined by the values of others that cause them. This computational view of causality considers that the values of effects (or their conditional probabilities, in stochastic models) can be determined from the values of their causes via equations or formulas, representing causal mechanisms, with exogenous changes entering from outside the modeled system propagating through the modeled mechanisms in such a way that values of causes are determined prior to the values of effects that depend on them. It has been formalized in seminal work by Simon (1953) and subsequent work by many others, mainly in economics and econometrics, artificial intelligence, and time series analysis (e.g., Iwasaki, 1988; Hendry, 2004; Voortman et al., 2010; Hoover, 2012).
Manipulative causation is the concept of causality in which changing (“manipulating”) the values of controllable inputs to a system changes the values of outputs of interest (Woodward 2013; Voortman et al., 2010; Hoover, 2012). In detailed dynamic models, the changes in inputs might propagate through a system of algebraic and differential equations describing a system to determine the time courses of changes in other variables, including outputs. If such a detailed dynamic model is unavailable, the relation between changes in values of controllable inputs and changes on the values of variables that depend on them may instead be described by more abstract models such as functions (“structural equations”) relating the equilibrium values, or by Bayesian Networks specifying conditional probability distributions of outputs conditioned on values of inputs. Manipulative causation is the type of causality of greatest interest to decision-makers and policy-makers seeking to make preferred outcomes more likely by changing the values of variables that they can control.
Mechanistic/explanatory causation describes how changes in the inputs to a system or situation propagate through networks of causal laws and mechanisms to produce resulting changes in other variables, including outputs.
These different concepts of causality are interrelated, but not equivalent. For example, attributive causality does not imply counterfactual, predictive, manipulative, or mechanistic causality and is not implied by them. There is no guarantee that removing a specific exposure source would have any effect on the risks that are attributed to it, nor is there any requirement that no more than 100% of a risk be attributed to the various factors that are said to cause it. For example, in the diagram X1 ← X0 → X2 → X3 → X4 → … → Xn, if exogenously changing the value of a variable at the tail of an arrow from 0 to 1 causes the value of any variable into which it points to change from 0 to 1, and if the value of X0 is changed from 0 (interpreted as “unexposed”) to 1 (interpreted as “exposed”), then not only would these measures attribute 100% of the blame for X1 becoming 1 to this change in X0, but also they would attribute the same 100% of the blame to changes in each of X2, X3,… and Xn, even though those are simply other effects of the change in X0. Relative risks are the same for all of these variables, and so attributive risk measures derived from relative risk assign the same blame to all (in this case, a “probability of causation” of 1).
Tort law, by contrast, uses an attributive concept of “but-for” causation that attributes harm to a cause if and only if the harm would not have occurred in its absence, i.e., “but for” the occurrence of the cause. This concept would single out the change in X0 as the only but-for cause of the change in X1. On the other hand, X0, X2, X3, X4 … would all be causes of Xn by this criterion. Thus, but-for causation can lead to promiscuous attribution of harm to remote causes in a chain or network, for example, by attributing responsibility for a smoker’s lung cancer not only to the practice of smoking, but also to the retailer who sold the cigarettes, the manufacturer of the cigarettes, the grower of the tobacco, the media that advertised the brand smoked, the friends or family or culture that encouraged smoking, the schools that failed to intervene, genes that predisposed the smoker to addiction, and so on. All can be considered but-for causes of smoking. Tort law also provides standards such as more-likely-than-not and joint and several liability for cases where causation is uncertain or is distributed among multiple causes.
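The contrast between relative-risk-based attribution and but-for causation in the diagram example can be checked mechanically. The sketch below assumes the diagram is read as X0 pointing to X1 and, separately, to a chain X2, X3, …, Xn in which each variable copies its predecessor; the function names, the population of 100 hypothetical individuals, and the choice n = 5 are illustrative assumptions.

```python
def simulate_chain(x0, n=5, force=None):
    """Values of X0..Xn when X0 -> X1 and X0 -> X2 -> X3 -> ... -> Xn.

    `force` optionally maps an index to a value set by exogenous intervention.
    """
    force = force or {}
    x = [0] * (n + 1)
    x[0] = force.get(0, x0)
    x[1] = force.get(1, x[0])          # X1 copies X0
    x[2] = force.get(2, x[0])          # X2 copies X0
    for i in range(3, n + 1):
        x[i] = force.get(i, x[i - 1])  # later links copy their predecessor
    return x


# Observational population: half unexposed (X0 = 0), half exposed (X0 = 1).
population = [simulate_chain(x0) for x0 in (0, 1) for _ in range(50)]


def attributable_fraction(i, outcome=1):
    """(risk among Xi = 1 minus risk among Xi = 0) / risk among Xi = 1."""
    exposed = [row[outcome] for row in population if row[i] == 1]
    unexposed = [row[outcome] for row in population if row[i] == 0]
    risk_exposed = sum(exposed) / len(exposed)
    risk_unexposed = sum(unexposed) / len(unexposed)
    return (risk_exposed - risk_unexposed) / risk_exposed


def changes_x1(i):
    """True if setting Xi to 0, in a world where X0 = 1, changes X1."""
    return simulate_chain(1, force={i: 0})[1] != simulate_chain(1)[1]


candidates = (0, 2, 3, 4, 5)
fractions = {i: attributable_fraction(i) for i in candidates}
but_for = {i: changes_x1(i) for i in candidates}
print("attributable fraction of X1 = 1 assigned to each Xi:", fractions)
print("is Xi a but-for cause of X1?", but_for)
```

In this toy system every candidate variable receives an attributable fraction of 1 for the outcome X1, while only an intervention on X0 actually changes X1.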
Predictive causality does not necessarily imply manipulative causality unless other conditions hold, such as that no omitted confounders are present. This is illustrated by the standard counter-example, mentioned above and in Chapter 2, of nicotine-stained fingers being a predictive but not a manipulative cause of lung cancer, where smoking is the omitted confounder. Often in public health and safety regulations, it is not known whether these other conditions hold, and hence it is not clear whether predictive causality implies manipulative causality. On the other hand, predictive causation can very often be established or refuted (at a stated level of statistical confidence) based on data by applying statistical tests to determine whether predictions of outcomes are significantly improved by conditioning on information about their hypothesized causes. Such tests examine what did happen to the effects when the hypothesized causes had different values, rather than requiring speculations about what would happen to effects under different conditions, as in counterfactual causal modeling. Therefore, even though predictive causality does not necessarily imply manipulative causality, it provides a useful data-driven screen for potential manipulative causation, insofar as manipulative causation usually implies predictive causation (since changes in inputs help to predict the changes in outputs that they cause).
For counterfactual causation, what the outcomes would have been under different, counterfactual conditions is never observed. Therefore, the estimated difference in outcomes caused by differences between real and counterfactual conditions must be calculated using predictive models or assumptions about what would have happened. These predictions may not be accurate. In practice, they are usually simply assumed, but are difficult or impossible to validate. Counterfactual models of causation are also inherently ambiguous, in that the outcome that would have occurred had exposure been absent usually depends on why exposure would have been absent, which is seldom specified. For example, nicotine-stained fingers would be a counterfactual cause of lung cancer if clean fingers imply no smoking, but not if they arise only because smokers wear gloves when smoking. In the case of air pollution, especially if exposure and income interact in affecting mortality rates, assuming that the counterfactual condition without exposure occurs because everyone becomes wealthy enough to move to unpolluted areas might yield quite different estimates of counterfactual mortality rates than assuming that lack of exposure was caused by the onset of such abject poverty and economic depression that pollution sources no longer operate. Counterfactual models usually finesse any careful exposition of specific assumptions about why counterfactual exposures would occur by using statistical models to predict what would have happened if exposures had been different. But these models are silent about why exposures would have been different, and hence the validity of their predictions is unknown. (Economists have noted a similar limitation of macroeconomic models derived from historical data to predict the effects caused by future interventions that change the underlying data-generating process. This is known as the Lucas critique of causal predictions in macroeconomic policy models, mentioned in Chapter 1.)
Although manipulative causality usually implies predictive causality, neither one necessarily implies attributive causality. For example, if consuming aspirin every day reduces risk of heart attack in an elderly population, but only people with high risks take daily aspirin, then there might be both a positive association (and hence a positive etiologic fraction, probability of causation, and population attributable risk) between aspirin consumption and heart attack risk in the population, but a negative manipulative causal relationship between them, with aspirin consumption reducing risk. Even if aspirin had no effect on risk, it could still be positively associated with risk if people at high risk were more likely to consume it. Thus, manipulative and associational-attributive causation do not necessarily have any implications for each other.
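The aspirin example can be worked through with exact arithmetic. The sketch below uses hypothetical numbers (two equally sized risk groups, a 20% risk reduction from aspirin, and high-risk people being much more likely to take it) chosen only to show that a positive association and a protective manipulative effect can coexist.

```python
# Hypothetical numbers, chosen only for illustration.
groups = {
    # name: (population share, baseline risk, probability of taking aspirin)
    "high_risk": (0.5, 0.30, 0.8),
    "low_risk": (0.5, 0.05, 0.1),
}
ASPIRIN_RISK_MULTIPLIER = 0.8  # assumed: aspirin cuts each person's risk by 20%


def observed_risk(takes_aspirin):
    """Heart-attack risk among people who did (or did not) choose aspirin."""
    cases = people = 0.0
    for share, base, p_take in groups.values():
        weight = share * (p_take if takes_aspirin else 1.0 - p_take)
        risk = base * (ASPIRIN_RISK_MULTIPLIER if takes_aspirin else 1.0)
        cases += weight * risk
        people += weight
    return cases / people


def risk_if_everyone(takes_aspirin):
    """Population risk if aspirin use were set by intervention for everyone."""
    mult = ASPIRIN_RISK_MULTIPLIER if takes_aspirin else 1.0
    return sum(share * base * mult for share, base, _ in groups.values())


associational_rr = observed_risk(True) / observed_risk(False)
manipulative_rr = risk_if_everyone(True) / risk_if_everyone(False)
print(f"observed relative risk (takers vs. non-takers): {associational_rr:.2f}")
print(f"risk ratio under intervention:                  {manipulative_rr:.2f}")
```

With these inputs the observed relative risk exceeds 1, so attributive measures derived from it would assign positive blame to aspirin, while the ratio under intervention is below 1.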
The following example illustrates some of these important distinctions among causal concepts more quantitatively.
Example: Associations do not Necessarily Provide Valid Manipulative Causal Predictions
Suppose that in a certain city, daily mortality rate, R, and average daily exposure concentration of an air pollutant, C, over an observation period of several years are perfectly described by the following Model 1:
R = C + 50 (Model 1)
That is, each day, the number of deaths is equal to 50 deaths plus the average daily concentration of the air pollutant. What valid inferences, if any, do these observations enable about how changing C would change R? The answer, as stressed in Chapters 1 and 2, is none: historical associations do not logically imply anything about predictive, counterfactual, structural, or manipulative causation. One reason is that Model 1 implies that the same data are also described perfectly by the following Model 2, where T is an unmeasured third variable (such as temperature) with values between 0 and 100:
C = 50 – 0.5T
R = 150 - C - T (Model 2)
(The first equation implies that T = 100 – 2C, and substituting this into the second equation to eliminate T yields Model 1.) If the equations in Model 2 are structural equations with the explicit causal interpretation that exogenously changing the value of a variable on the right side of an equation will cause the value of the dependent variable on its left side to change to restore equality, then the second equation reveals that each unit of reduction in C would increase R by one unit. In this case, if Model 1 is only a reduced-form model describing historical associations, then misinterpreting it as a causal model would mistakenly imply that increasing C would increase R. The associational Model 1 is not incorrect as a description of past data. It would be valid for predicting how many deaths would occur on days with different exposure concentrations in the absence of interventions. But only the causal Model 2 can predict how changing C would change R, and there is no way to deduce Model 2 by scrutiny of Model 1.
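The example can be verified numerically. The sketch below encodes Model 1 and the two structural equations of Model 2, confirms that historical data generated by Model 2 satisfy Model 1 exactly, and then compares what each model says about a 10-unit cut in C on a day when T is held fixed. The grid of T values, the choice T = 40, and the size of the cut are illustrative assumptions.

```python
def model_1(c):
    """Reduced-form association: R = C + 50."""
    return c + 50


def concentration(t):
    """Model 2, first structural equation: C = 50 - 0.5 T."""
    return 50 - 0.5 * t


def model_2_mortality(c, t):
    """Model 2, second structural equation: R = 150 - C - T."""
    return 150 - c - t


# Historical (non-intervention) days: T varies; C and R follow Model 2.
history = [(concentration(t), model_2_mortality(concentration(t), t))
           for t in range(0, 101, 10)]
fits_model_1 = all(abs(r - model_1(c)) < 1e-9 for c, r in history)

# Intervention: on a day with T = 40, cut C by 10 units while T stays fixed.
t = 40
c_before = concentration(t)
c_after = c_before - 10
predicted_by_model_1 = model_1(c_after) - model_1(c_before)
caused_under_model_2 = (model_2_mortality(c_after, t)
                        - model_2_mortality(c_before, t))

print("history fits Model 1 exactly:", fits_model_1)
print("change in R predicted by Model 1:", predicted_by_model_1)
print("change in R caused under Model 2:", caused_under_model_2)
```

Model 1 predicts 10 fewer deaths from the cut, whereas under the structural reading of Model 2 the same cut causes 10 more, even though both models describe the historical data perfectly.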
This review of different concepts of causation has highlighted the following two key conclusions: (a) Policy-makers, regulators, courts, and the general public are primarily interested in manipulative causation, i.e., in how regulations or other actions that they take would affect probabilities of outcomes, and hence the benefits caused by their actions; but (b) Regulatory science and claims about the causal impacts of regulations usually address only associational-attributive causation, and occasionally other non-manipulative (especially, counterfactual) causal concepts. Judicial review of the causal reasoning and evidence used to support estimates of regulatory benefits can and should close this gap between causal concepts by insisting that arguments and evidence presented must address manipulative causation, and that other forms of causation must not be conflated with it. There is an urgent need to enforce such clarity, as current practices in epidemiology, public health, and regulatory science routinely confuse associational-attributive causation with manipulative causation.
As documented in Table 2.5, many published articles in peer-reviewed scientific journals move freely between associational and manipulative causal interpretations of exposure-response associations without showing that the presented associations do in fact describe (manipulative) causation. As a consequence, regulatory benefits assessments and calls for further regulation based on these and many similar analyses do not reveal what consequences, if any, further regulations should actually be expected to cause. In this sense, they might be regarded as arbitrary and capricious, as they provide no rational basis for identifying the likely consequences of the recommended regulations.
Can Regulatory Benefits Estimation be Improved, and, if so, How?
Can more active judicial review truly improve the accuracy of causal inferences and benefits predictions used in deciding which proposed regulatory changes to make and in evaluating their performance? To what extent are improvements constrained by hard limits on what can be reliably predicted and learned from realistically incomplete and imperfect data? The following distinct lines of evidence from very different areas suggest that substantial improvements are indeed possible in practice, but that they are best accomplished with the help of strong external critical review of the evidence and reasoning relied on by regulatory agencies and advocates.
The first line of evidence comes from sociological and organizational design studies (see Chapter 13). These suggest that the organizational culture and incentives of regulatory agencies usually put weight on authoritative, prospective estimates of benefits, with supporting causal assumptions that reflect the entrenched views of the organization that regulation produces desirable results and that the beliefs of the regulators are scientific and trustworthy (Wynn, 1993). However, organizational cultures that foster demonstrably high performance in managing risks and uncertainties function quite differently. They typically acknowledge ignorance and uncertainty about how well current policies and actions are working. They focus on learning quickly and effectively from experience, frequently revisiting past decisions and assumptions and actively questioning and correcting current policies, entrenched assumptions, and beliefs as new data are collected (Dekker and Woods, 2009; Weick et al., 2001; see Chapter 13). For example, difficult and complex operations under uncertainty, such as managing air traffic coming and going from nuclear aircraft carriers, operating nuclear power plants or offshore oil platforms safely for long periods under constantly changing conditions, fighting wildfires, landing airplanes successfully under unexpected conditions, or performing complex surgery, are carried out successfully in hundreds of locations worldwide every day. As discussed in Chapter 13, the disciplines and habits of mind practiced and taught in such high reliability organizations (HROs) have proved useful in helping individuals and organizations plan, act, and adjust more effectively under uncertainty. Regulatory agencies dealing with uncertain health and safety risks can profit from these lessons (Dekker and Woods, 2009).
Five commonly listed characteristics of HROs are as follows (see Chapter 13): sensitivity to operations – to what is working and what is not, with a steady focus on empirical data and without making assumptions (Gamble, 2013); reluctance to oversimplify explanations for problems, specifically including resisting simplistic interpretations of data and assumptions about causality; preoccupation with failure, meaning constantly focusing on how current plans, assumptions, and practices might fail, rather than on building a case for why they might succeed; deference to expertise rather than to seniority or authority; and commitment to resilience, including willingness to quickly identify and acknowledge when current efforts are not working as expected and to improvise as needed to improve results (Weick et al., 2001). Of course, regulatory processes that unfold over years, largely in the public sphere, are a very different setting from operations performed by specially trained teams. But it is plausible that many of the same lessons apply to regulatory organizations seeking to improve outcomes in a changing and uncertain environment (Dekker and Woods, 2009).
A second line of evidence that improvements in predicting the effects of regulations can be achieved in practice comes from research on improving judgment and prediction, recently summarized in the popular book Superforecasting (Tetlock and Gardner, 2015). Although most predictions are overconfident and inaccurate, a small minority of individuals display consistent, exceptional performance in forecasting the probabilities of a wide variety of events, from wars to election outcomes to financial upheavals to scientific discoveries. These “superforecasters” apply teachable and learnable skills and habits that explain their high performance. They remain open-minded, always regarding their current beliefs as hypotheses to be tested and improved by new information. They are eager to update their current judgments frequently and precisely, actively seeking and conditioning on new data and widely disparate sources of data and evidence that might disprove or correct their current estimates. They make fine-grained distinctions in their probability judgments, often adjusting by only one or a few percentage points in light of new evidence, which is a level of precision that most people cannot bring to their probability judgments. The authors offer the following rough recipe for improving probability forecasts: (1) “Unpack” the question to which the forecast provides an answer, e.g., about the health benefits that a regulation will end up causing, into its components, such as who will receive what kinds of health benefits and under what conditions. (2) Distinguish between what is known and unknown and scrutinize all assumptions. For example, do not assume that reducing exposure will cause proportional reductions in adverse health effects unless manipulative causation has actually been shown.
(3) Consider other, similar cases and the statistics of their outcomes (taking what the authors call “the outside view”) and then (4) Consider what is special or unique about this specific case in contradistinction to others (the “inside view”). (5) Exploit what can be learned from the views of others, especially those with contrasting informed predictions, as well as from prediction markets and the wisdom of crowds. (6) Synthesize all of these different views into one (the multifaceted “dragonfly view,” in the authors’ term) and (7) Express a final judgment, conditioned on all this information, as precisely as possible using a fine-grained scale of probabilities. Skill in making better predictions using this guidance can be built through informed practice and clear, prompt feedback, provided that there is a deliberate focus on tracking results and learning from mistakes.
A third line of evidence that it is possible to learn to intervene effectively even in uncertain and changing environments to make preferred outcomes more likely and frequent comes from machine learning, specifically the design and performance of reinforcement-learning algorithms that automatically learn decision rules from experience and improve them over time. A very successful class of algorithms called “actor-critic” methods (Konda and Tsitsiklis, 2003; Lei, 2016; Ghavamzadeh et al., 2016) consists of a policy or “actor” for deciding what actions to take next, given currently available information; and one or more reviewers or “critics” that evaluate the empirical performance of the current policy and suggest changes based on the difference between predicted and observed outcomes. These algorithms have proved successful in learning optimal (net-benefit maximizing) or near-optimal policies quickly in a variety of settings with probabilistic relations between actions and their consequences and with systems that behave in uncertain ways, so that it is necessary to adaptively learn how best to achieve desired results.
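The actor-critic idea can be illustrated with a deliberately simplified sketch. The code below is a one-state ("bandit") version with a softmax actor and a running-average critic; it is not the algorithm of any of the cited papers. The two actions, their success probabilities, the learning rates, and the number of steps are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into choice probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(success_probs, steps=5000, actor_lr=0.1, critic_lr=0.1, seed=0):
    """Learn which action succeeds more often, from experience alone."""
    rng = random.Random(seed)
    prefs = [0.0] * len(success_probs)  # actor: preference for each action
    value = 0.0                         # critic: predicted reward of the policy
    for _ in range(steps):
        policy = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=policy)[0]
        reward = 1.0 if rng.random() < success_probs[action] else 0.0
        surprise = reward - value       # observed minus predicted outcome
        value += critic_lr * surprise   # critic revises its prediction
        for a in range(len(prefs)):     # actor shifts toward better-than-expected actions
            indicator = 1.0 if a == action else 0.0
            prefs[a] += actor_lr * surprise * (indicator - policy[a])
    return softmax(prefs), value


# Hypothetical setting: action 0 succeeds 20% of the time, action 1 succeeds 80%.
policy, value = train([0.2, 0.8])
print("learned probability of choosing each action:",
      [round(p, 3) for p in policy])
print("critic's estimate of expected reward:", round(value, 3))
```

The actor never sees the true success probabilities. It improves only because the critic keeps comparing predicted with observed results and feeds the discrepancy back into the policy.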
High-reliability organizations, superforecasters, and successful machine learning algorithms for controlling uncertain systems all apply the following common principles.
Recognize that even the best current beliefs and models for predicting outcomes and for deciding what to do to maximize net benefits will often be mistaken or obsolete. They should be constantly checked, improved, and updated based on empirical data and on gaps between predicted and observed results.
Relying on any single model or set of assumptions for forecasting and decision-making is less effective than considering the implications of many plausible alternatives.
Seek and use potential disconfirming data and evidence from many diverse sources to improve current beliefs, predictions, and control policies.
Use informed external critics to improve performance by vigilant review, frequent challenging of current assumptions, predictions and policies, and informed suggestions for changes based on data.
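One simple way to act on the first two of these principles is to keep several candidate models in play and reweight them as data arrive. The sketch below is an assumption-laden illustration rather than a recommended procedure: it applies Bayes' rule with a Gaussian error model to three hypothetical models of the health effect per unit of exposure reduction, using invented post-implementation observations.

```python
import math


def update_weights(weights, predictions, observed, noise_sd=1.0):
    """Bayes' rule with a Gaussian likelihood for each model's prediction error."""
    likelihoods = [math.exp(-0.5 * ((observed - p) / noise_sd) ** 2)
                   for p in predictions]
    posterior = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(posterior)
    return [p / total for p in posterior]


# Three hypothetical models: health improvement per unit of exposure reduction.
effect_per_unit = {"no effect": 0.0, "half of assumed": 0.5, "as assumed": 1.0}
weights = [1 / 3] * 3  # start with no model favored

# Invented data: (exposure reduction achieved, observed health improvement).
observations = [(2.0, 1.2), (4.0, 1.8), (3.0, 1.4), (5.0, 2.7)]
for reduction, observed in observations:
    predictions = [slope * reduction for slope in effect_per_unit.values()]
    weights = update_weights(weights, predictions, observed)

best = max(zip(weights, effect_per_unit))[1]
for name, w in zip(effect_per_unit, weights):
    print(f"{name:16s} weight = {w:.4f}")
print("currently best-supported model:", best)
```

The gap between each model's predictions and the observed results drives the reweighting, so no single model has to be defended in advance as the "best" one.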
Applying these principles to regulatory agencies suggests that a mindset that seeks to identify and defend a single “best” model, set of assumptions, or consensus judgment about the effects caused by proposed regulations will be less likely to maximize uncertain net social benefits than treating effects as uncertain quantities to be learned and improved via experience and active learning from data. A judgment-driven culture in which selected experts form and defend judgments about causation and estimated regulatory benefits is less beneficial than a data-driven culture in which the actual effects of regulations are regarded as uncertain, possibly changing quantities to be learned about and improved by intelligent trial and error and learning from data. A data-driven regulatory culture expects to benefit from independent external challenges and reviews of reasoning and assumptions before regulatory changes are approved and from frequent updates of effects estimates based on data collected after they are implemented. Strong judicial review can provide the first part, external reviews of reasoning, by maintaining a high standard for causal reasoning based on data and manipulative causation.
Working against the establishment of a data-driven culture is a long tradition in medicine, public health, and regulatory science of treating causation as a matter of informed judgment that can only be rendered by properly prepared experts, rather than as a matter of empirically discoverable and independently verifiable fact that can be determined from data. The difficulties and skepticism that have faced proponents of evidence-based medicine and evidence-based policies, emanating from traditions that emphasize the special authority of trained experts (Tetlock and Gardner, 2015), suggest the barriers that must be overcome to shift more toward data-driven regulatory cultures.
The following sections discuss the contrasting technical methods used by proponents of the causation-as-judgment and causation-as-fact views, and then suggest that a modern synthesis of these methods provides practical principles for defining and using informative evidence of manipulative causation in administrative law to achieve better results from regulations.
Causation as Judgment: The Hill Considerations for Causality and some Alternatives
As discussed in more detail in Chapter 2, one of the most influential frameworks for guiding consideration and judgments about causality is that of Sir Austin Bradford Hill, who in 1965 proposed nine aspects of an exposure-response association that he recommended “we especially consider before deciding that the most likely interpretation of it is causation” (Hill, 1965, quoted in Lucas and McMichael, 2005). This original formulation reflects a view in which causation is dichotomous: an association is either causal or not. Modern statistics and machine learning approaches to causal inference take a more nuanced view in which the total association between two quantities can be explained by a mix of factors and pathways, including some causal impacts and some confounding, sample selection and model selection biases, coincident historical trends, omitted variables, omitted errors in explanatory variables, model specification errors, overfitting bias, p-hacking, and so forth.
The expressed goal of the Hill considerations is to help someone make a qualitative judgment, “deciding that the most likely interpretation of [an exposure-response association] is causation,” rather than to quantify how likely this interpretation is, and what the probability is that the association is not causal after all, even if causation is decided to be the most likely interpretation. Thus, Hill’s considerations were never intended to provide the quantitative information that is essential for BCA evaluations of uncertain regulatory benefits. Consistent with the culture of many medical and public health organizations over a long history (Tetlock and Gardner, 2015), they instead portray causality as a matter for informed subjective qualitative judgment by expert beholders, not as a fact to be inferred (or challenged) by rigorous, objective, and independently reproducible analysis of data.
The Hill considerations themselves – briefly referred to as strength, consistency, specificity, temporality, biological gradient, plausible mechanism, coherence, experimental support (if possible), and analogy for exposure-response associations – are discussed in more detail later in the context of showing how they can be updated and improved using ideas from current data science. Chapter 2 provides a much more thorough discussion. Hill himself acknowledged that his considerations are neither necessary nor sufficient for establishing causation, but suggested that admittedly fallible subjective judgments based on these considerations may be the best that we can hope for. This line of thinking continues to dominate many regulatory approaches to causal inference. For example, the US EPA has formulated and adopted modified versions of the Hill considerations as principles for making weight-of-evidence determinations about causation for ecological, carcinogen, and other risks. Neither the original Hill considerations nor more recent weight-of-evidence frameworks based on them distinguish between associational-attributive, predictive, manipulative, and other types of causation. Thus, the enormous influence of these considerations has tended to promote judgment-based cultures for making and defending causal assertions while conflating different concepts of causation, without providing a sharp focus on objective evidence and quantification of the manipulative causal relationships needed for rational choice among alternatives based on BCA calculations.
Of course, methodologists have not been blind to the difficulties with associational and attributive methods. The fact that the sizes and signs of associations are often model-dependent and that different investigators can often reach opposite conclusions starting from the same data by making different modeling choices has long been noted by critics of regulatory risk assessments, finally leading some commentators to conclude that associational methods are unreliable in general (Dominici et al., 2014). Recognizing such criticisms, there has been intense effort over the past decade to develop and apply more formal methods of causal analysis within the judgment-oriented tradition. This has produced a growing literature that replaces the relatively crude assumption that appropriately qualified and selected experts can directly judge associations to be causal with more sophisticated technical assumptions that imply that associations are causal without directly assuming it. Human judgment still plays a crucial role, insofar as the key assumptions are usually unverifiable based on data, and are left to expert judgments to accept. The most important of these assumption-driven causal inference frameworks, and their underlying assumptions, are as follows.
Intervention studies assume that if health risks change following an intervention, then the change is (probably) caused by the intervention. This assumption is often mistaken, as in the Irish coal burning ban studies (Dockery et al., 2013): exposures and responses may both be lower after an intervention than before it simply because both are declining over time, even if neither causes the other. Construing such coincidental historical trends as evidence of causation is a form of the post hoc ergo propter hoc logical fallacy.
Instrumental variable (IV) studies assume that a variable (called an "instrument") is unaffected by unmeasured confounders and that it directly affects exposure but not response (Schwartz et al., 2015). The validity of these assumptions is usually impossible to prove, and the results of the IV modeling can be greatly altered by how the modeler chooses to treat lagged values of variables (O’Malley, 2012).
Counterfactual “difference-in-differences” and potential outcome models assume that differences between observed responses to observed exposure concentrations and unobserved model-predicted responses to different hypothetical "counterfactual" exposure concentrations are caused by the differences between the observed and counterfactual exposures (e.g., Wang et al., 2016). However, they might instead be caused by errors in the model or by systematic differences in other factors such as distributions of income, location, and age between the more- and less-exposed individuals. The assumption that these are not the explanations is usually untested, but is left as a matter for expert judgment to decide.
Regression discontinuity (RD) studies assume that individuals receiving different exposures or treatments based on whether they are above or below a threshold in some variable (e.g., age, income, location, etc.) triggering a government intervention are otherwise exchangeable, so that differences in outcomes for populations of individuals above and below the threshold can be assumed to be caused by differences in the intervention or treatment received. The validity of this assumption is often unknown. In addition, as noted by Gelman and Zelizer (2015), RD models “can overfit, leading to causal inferences that are substantively implausible… .” For an application to air pollution health effects estimation based on differences in coal burning in China, they conclude that a “claim [of a health impact], and its statistical significance, is highly dependent on a model choice that may have a data-analytic purpose, but which has no particular scientific basis.”
As discussed in Chapter 2, associational, attributable-risk, and burden-of-disease studies assume that if responses are greater among people with higher exposures, then this difference is caused by the difference in exposures, and could be removed by removing it (manipulative causation). Typically, this assumption is made without careful justification. It simply assumes that association reflects causation. Conditions such as the Hill considerations of strong and consistent association are commonly misconstrued as evidence for manipulative causation in such studies (e.g., Fedak et al., 2015; Höfler, 2005), without testing potential disconfirming alternative hypotheses such as that strong and consistent modeling assumptions, biases, confounders, effects of omitted variables, effects of omitted error terms for estimated values of predictors, model specification errors, model uncertainties, coincident historical trends, and regression to the mean, might account for them (Greenland, 2005).
These methods all make assumptions that, if true, could justify treating associations as if they indicated manipulative causation. Whether they are true, however, is usually not tested based on data, but is left to expert judgment to decide. As succinctly noted by Gelman and Zelizer (2015) in presenting their own critique of regression discontinuity [RD] studies, “One way to see the appeal of RD is to consider the threats to validity that arise with five other methods used for causal inference in observational studies: simple regression, matching, selection modeling, difference in differences, and instrumental variables. These competitors to RD all have serious limitations: regression with many predictors becomes model dependent…; matching, like linear or nonlinear regression adjustment, leans on the assumption that treatment assignment is ignorable conditional on the variables used to match; selection modeling is sensitive to untestable distributional assumptions; difference in differences requires an additive model that is not generally plausible; and instrumental variables, of course, only work when there happens to be a good instrument related to the causal question of interest.” Something better than unverified assumption-driven methods is needed.
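The coincident-trend problem noted above for intervention studies can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the data are synthetic, the function names (simulate_declining_trends, naive_before_after, trend_adjusted_change) are hypothetical, and the "ban" in year 10 has no effect at all in the simulated world. A naive before/after comparison nevertheless shows a large drop in mortality, while a comparison against the projected pre-ban trend shows roughly none.

```python
import random

def simulate_declining_trends(n_years=20, seed=0):
    """Return (year, exposure, mortality) rows in which both series decline
    over time for unrelated reasons; mortality never depends on exposure."""
    rng = random.Random(seed)
    rows = []
    for year in range(n_years):
        exposure = 100.0 - 2.0 * year + rng.gauss(0, 0.5)
        mortality = 50.0 - 1.0 * year + rng.gauss(0, 0.5)
        rows.append((year, exposure, mortality))
    return rows

def mean(values):
    return sum(values) / len(values)

def naive_before_after(rows, ban_year):
    """Mean mortality after the hypothetical ban minus mean mortality before it."""
    before = [m for year, _, m in rows if year < ban_year]
    after = [m for year, _, m in rows if year >= ban_year]
    return mean(after) - mean(before)

def trend_adjusted_change(rows, ban_year):
    """Mean post-ban gap between observed mortality and the pre-ban
    linear trend projected forward."""
    pre = [(year, m) for year, _, m in rows if year < ban_year]
    t_bar = mean([t for t, _ in pre])
    m_bar = mean([m for _, m in pre])
    slope = (sum((t - t_bar) * (m - m_bar) for t, m in pre)
             / sum((t - t_bar) ** 2 for t, _ in pre))
    gaps = [m - (m_bar + slope * (year - t_bar))
            for year, _, m in rows if year >= ban_year]
    return mean(gaps)

if __name__ == "__main__":
    rows = simulate_declining_trends()
    print("naive before/after change:", round(naive_before_after(rows, 10), 2))
    print("trend-adjusted change:", round(trend_adjusted_change(rows, 10), 2))
```

The linear trend projection is itself a modeling assumption. It is used here only to show that the naive comparison cannot distinguish an effect of the intervention from a pre-existing decline.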
Causation as Discoverable Empirical Fact: Causal Inference Algorithms and Competitions
At the opposite pole from Hill’s perspective that determination of causation cannot be reduced to a recipe or algorithm is a rich body of literature and computational approaches to causal inference, introduced in Chapter 2, that seek to do exactly that by providing algorithms for automatically drawing reliable causal inferences from observational data (e.g., Aliferis et al., 2010; Kleinberg and Hripcsak, 2011; Hoover, 2012; Rottman and Hastie, 2014; Bontempi and Flauder, 2015). The best-known modern exponent of causal inference algorithms may be the computer scientist Judea Pearl (e.g., Pearl, 2009 and 2010), although, as discussed in Chapter 2, this analytic tradition extends back to work by economists and social statisticians since the 1950s (e.g., Simon, 1953) and by biologists, geneticists, and psychologists since the invention of path analysis by Sewall Wright a century ago (Joffe et al., 2012). Most causal inference algorithms use statistical tests to determine which variables help to predict effects of interest, even after conditioning on the values of other variables (Pearl, 2010). Thus, they mainly detect predictive causation, although some also explicitly address implications for causal mechanisms, structural causation, and manipulative causation (Iwasaki, 1988; Voortman et al., 2010). Their emphasis on predictive causation allows causal inference algorithms to benefit from well-developed principles and methods for predictive analytics and machine learning (ML).
Key technical ideas of causal inference algorithms can be used more generally to guide human reasoning about causal inference. Here, we very briefly summarize some of the key ideas explained much more thoroughly in Chapter 2. An idea used in many causal inference algorithms is that in a chain such as X → Y → Z, where arrows denote manipulative or predictive causation (so that changes in the variable at the tail of an arrow change or help to predict changes in the variable that it points into, respectively), each variable should have a statistical dependency on any variable that points into it, but Z should be conditionally independent of X given the value of Y, since Z depends on X only through the effect of X on Y. Algorithms that test for conditional independence and that quantify conditional probability dependencies among variables are now mature (Frey et al., 2003; Aliferis et al., 2010) and are readily available to interested practitioners via free Python and R packages for ML, such as the bnlearn package in R, which learns probabilistic dependencies and independence relations (represented via Bayesian network (BN) structures and conditional probability tables) from data. A second, related idea is that in the chain X → Y → Z, Y should provide at least as much information as X for predicting Z. A third idea, introduced in path analysis for linear relationships among variables and generalized in BNs to arbitrary probabilistic dependencies, is that the effect of changes in X on changes in Z should be a composition of the effect of changes in X on Y and the effect of changes in Y on Z. Such ideas provide constraints and scoring criteria for identifying causal models that are consistent with data.
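The conditional independence idea for a chain in which X affects Y and Y affects Z can be checked on simulated data. The sketch below is illustrative only. It assumes linear relationships with Gaussian noise, so that a partial correlation near zero can stand in for conditional independence; packages such as bnlearn use more general tests. The function names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(xs, zs, ys):
    """Correlation of X and Z after linearly adjusting both for Y."""
    r_xz, r_xy, r_zy = pearson(xs, zs), pearson(xs, ys), pearson(zs, ys)
    return (r_xz - r_xy * r_zy) / math.sqrt((1 - r_xy ** 2) * (1 - r_zy ** 2))

def simulate_chain(n=10000, seed=1):
    """Simulate a linear chain: Y depends on X, and Z depends only on Y."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]
    zs = [y + rng.gauss(0, 1) for y in ys]
    return xs, ys, zs

if __name__ == "__main__":
    xs, ys, zs = simulate_chain()
    print("corr(X, Z):", round(pearson(xs, zs), 3))                # clearly nonzero
    print("corr(X, Z | Y):", round(partial_corr(xs, zs, ys), 3))   # close to zero
```

X and Z are clearly associated, but the association vanishes once Y is conditioned on, which is the signature a chain structure should leave in data.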
Modern causal inference algorithms offer dozens of constructive alternatives for assessing predictive causal relations in observational data without relying on human judgment or unverified modeling assumptions. The field is mature enough so that, for over a decade, different causal inference algorithms have been applied to suites of challenge problems for which the underlying data-generating processes are known to see how accurately the algorithms can recover correct descriptions of the underlying causal models from observed data. Competitions are now held fairly regularly that quantify and compare the empirical performance of submitted causal inference algorithms on suites of test problems (e.g., NIPS 2013 Workshop on Causality; Hill, 2016). Results of recent causal inference competitions suggest the following principles for causal inference from observational data as common components of many of the top-performing algorithms.
Information principle: Causes provide information that helps to predict their effects and that cannot be obtained from other variables. This principle creates a bridge between well-developed computational statistical and ML methods for identifying informative variables to improve prediction of dependent variables, such as health effects, and the needs of causal inference (Pearl, 2009). To the extent that effects cannot be conditionally independent of their direct manipulative causes, such information-based algorithms provide a useful screen for potential manipulative causation, as well as for predictive causation.
Propagation of changes principle: Changes in causes help to explain and predict changes in effects (Friston, Moran, and Seth 2013; Wu, Frye, and Zouridakis, 2011). This applies the information principle to changes in variables over time. It can often be visualized in terms of changes propagating along links (representing statistical dependencies) in a BN or other network model.
Nonparametric analyses principle. Multivariate non-parametric methods, most commonly, classification and regression trees (CART) algorithms, can be used to identify and quantify information dependencies among variables without having to make any parametric modeling assumptions (e.g., Halliday et al., 2016). CART trees can also be used to test for conditional independence, with the dependent variable being conditionally independent of variables not in the tree, given the variables that are in it, at least as far as the tree-growing algorithm can discover (Frey et al., 2003; Aliferis et al., 2010).
Multiple models principle. Rather than relying on any single statistical model, the top-performing causal analytics algorithms typically fit hundreds of nonparametric models (e.g., CART trees), called model ensembles, to randomly generated subsets of the data (Furqan and Siyal, 2016). Averaging the resulting predictions of how the dependent variable depends on other variables over an ensemble of models usually yields better estimates with lower bias and error variance than any single predictive model. This is reminiscent of the principle in high-reliability organizations and among superforecasters of considering many theories, models, and points of view, rather than committing to a single best one. Computational statistics packages such as the randomForest package in R automate construction, validation, and predictive analytics for such model ensembles and present results in simple graphical forms, especially partial dependence plots (Chapter 2) that show how a dependent variable is predicted to change as a single predictor is systematically varied while leaving all other variables with their empirical joint distribution of values. If this dependency represents manipulative causality, then the partial dependency plot indicates how the conditional expected value of an output such as mortality in a population is expected to change when a variable such as exposure is manipulated, given the empirical joint distribution of other measured predictors on which the output also depends. Otherwise, it quantifies a predictive relation.
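The nonparametric and multiple models principles can be sketched together in a few lines. The example below is a deliberately minimal stand-in for packages such as randomForest, not a reproduction of them: it fits one-split regression trees ("stumps") to bootstrap resamples of synthetic data with a threshold exposure-response relation and averages their predictions (bagging). All names and the simulated data are hypothetical.

```python
import random

def simulate_threshold_data(n=300, seed=0):
    """Synthetic data: response jumps from about 0 to about 1 when exposure exceeds 0.5."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing squared error.
    Returns (threshold, mean_below, mean_above)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best_score, best = float("-inf"), None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical exposure values
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing the squared error.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if score > best_score:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
            best = (threshold, left_sum / i, right_sum / (n - i))
    return best

def predict_stump(stump, x):
    threshold, mean_below, mean_above = stump
    return mean_below if x <= threshold else mean_above

def fit_ensemble(xs, ys, n_models=200, seed=0):
    """Fit one stump per bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def ensemble_predict(models, x):
    """Average the predictions of all ensemble members at exposure x."""
    return sum(predict_stump(m, x) for m in models) / len(models)

if __name__ == "__main__":
    xs, ys = simulate_threshold_data()
    models = fit_ensemble(xs, ys)
    for x in (0.2, 0.4, 0.6, 0.8):
        print(x, round(ensemble_predict(models, x), 2))
```

With a single predictor, the averaged prediction curve plays the role of a partial dependence plot. Whether that curve describes what would happen if exposure were manipulated is a separate question that the ensemble alone does not answer.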
High-performance causal inference algorithms for observational data usually combine several of these principles. Interestingly, none of them uses the Hill considerations or associational-attributional methods such as probability of causation or attributable risk formulas from epidemiology. A counterfactual-potential outcomes causal modeling approach was entered in a recent competition (Hill, 2016), but performed relatively poorly, with roughly 20 times larger bias, 20 times larger mean square prediction error for estimated causal effects, and wider uncertainty intervals than tree-based algorithms incorporating the above principles. This presumably reflects the fact that the counterfactual approach depends on models of unknown validity. In short, causal inference and discovery algorithms that assume that causal relationships are empirical facts that can be discovered from data have made great progress and yield encouraging performance in competitive evaluations (Bontempi and Flauder, 2015), but none of them uses the methods usually relied on by regulatory agencies in making judgments about causation. Such methods, including weight-of-evidence schemes for evaluating and combining causal evidence, were tried and evaluated as approaches to automated assessment of causality in expert systems research in the 1980s (e.g., Spiegelhalter, 1986; Todd, 1992), but they have been out-competed by modern causal inference algorithms incorporating the above principles, and are no longer used in practical applications. That they continue to play a dominant role in causal inference in many regulatory agencies invites the question of whether these agencies could also dramatically improve their performance in predicting and assessing causal effects of regulations by applying modern causal inference algorithms and principles instead.
Synthesis: Modernizing the Hill Considerations
The enduring influence and perceived value of the Hill considerations and of judgment-centric methods for causal inference in regulatory agencies show that they fill an important need. Despite Hill’s disclaimers, this is largely the need to have a simple, intuitively plausible checklist to use in assessing evidence that reducing exposures will reduce risks of harm. At the same time, the successes of data-centric, algorithmic methods of causal inference and causal discovery in competitive evaluations suggest the desirability of a synthesis that combines the best elements of each. This section describes each of the Hill considerations, its strengths and limitations, and possibilities for improving on Hill’s original 1965 formulation using contemporary ideas. The same list of considerations is also discussed in Chapter 2, using the technical concepts and terminology of modern causal analytics introduced there. Here, we reconsider them without assuming that technical background.
Strength of association: Hill proposed as the first consideration that larger associations are more likely to be causal than smaller ones. One possible underlying intuition to support this is that causal laws always hold, so they should produce large associations, but conditions that generate spurious associations only hold sometimes, e.g., when confounders are present, and thus they tend to generate smaller associations. Whatever the rationale, objections to this consideration are that
The existence, direction, and size of an association are often model-dependent (Dominici et al., 2014; Gelman and Zelizer, 2015). Recall the example of Models 1 and 2 with R = C + 50 and R = 150 - C - T, respectively. In Model 1, C is positively associated with R while in Model 2, C is negatively associated with R. More generally, whether an association is large or small may reflect modeling choices rather than some invariant fact about the real world that does not depend on the modeler’s choices.
Associations are not measures of manipulative causation.
There is no reason in general to expect that a larger association is more likely to be causal than to expect that it indicates stronger confounding, larger modeling errors or biases, stronger coincident historical trends, or other non-causal explanations.
On the other hand, there is a useful insight here that can be formulated more precisely and correctly in more modern terminology. In a causal network such as the chain W ← X → Y → Z, where arrows signify predictive or manipulative causation or both, it must be the case that Y provides at least as much information about Z as X or W does, and typically more (Cover and Thomas, 2006). (Technically, the information that one random variable provides about another, measured in bits, is quantified as the expected reduction in the entropy of the probability distribution of one variable achieved by conditioning on the value of the other.) Thus, if Y is a direct manipulative or predictive cause of a dependent variable Z, it will provide as much or more information about Z than indirect causes such as X or non-cause variables such as W that are further removed from it in the causal network. The same is not necessarily true for correlations: if Y = X^2 and Z = Y^(1/2), then X and Z will be more strongly correlated than Y and Z, even though Z depends directly on Y and not on X. Thus, replacing association in Hill’s formulation with information yields a useful updated principle: the direct cause(s) of an effect provide more information about it than indirect causes and variables to which it is not causally related. Thus, variables that provide more information about an effect are more likely to be direct causes or consequences of it than are variables that provide less information. Modern causal discovery algorithms incorporate this insight via the information principle that effects are not conditionally independent of their direct causes and via CART tree-growing algorithms that identify combinations of predictor values that are highly informative about the value of an effect dependent variable.
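Both halves of this argument can be checked numerically. The sketch below is illustrative, with hypothetical function names and synthetic data. It first estimates mutual information in bits for a binary chain in which Y is a noisy copy of X and Z is a noisy copy of Y, and then reproduces the correlation counterexample, assuming that X takes positive values so that the square root of X squared recovers X.

```python
import math
import random
from collections import Counter

def mutual_information(a, b):
    """Plug-in estimate of I(A;B) in bits from paired discrete samples."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (count_a[u] * count_b[v]))
               for (u, v), c in count_ab.items())

def simulate_binary_chain(n=20000, flip=0.2, seed=3):
    """Binary chain: Y is X with random bit flips, Z is Y with random bit flips."""
    rng = random.Random(seed)

    def noisy_copy(bit):
        return 1 - bit if rng.random() < flip else bit

    xs = [rng.randrange(2) for _ in range(n)]
    ys = [noisy_copy(x) for x in xs]
    zs = [noisy_copy(y) for y in ys]
    return xs, ys, zs

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def squared_then_root_example(n=2000, seed=4):
    """Y = X^2 and Z = Y^(1/2), with X assumed positive (uniform on (0, 1))."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [x ** 2 for x in xs]
    zs = [math.sqrt(y) for y in ys]
    return xs, ys, zs

if __name__ == "__main__":
    x, y, z = simulate_binary_chain()
    print("I(Y;Z) bits:", round(mutual_information(y, z), 3))
    print("I(X;Z) bits:", round(mutual_information(x, z), 3))
    u, v, w = squared_then_root_example()
    print("corr(X,Z):", round(pearson(u, w), 3), "corr(Y,Z):", round(pearson(v, w), 3))
```

In the binary chain the direct cause Y carries more information about Z than the indirect cause X does. In the second example the indirect cause X is more strongly correlated with Z than the direct cause Y is, which is why information rather than correlation is the safer guide.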
Consistency: Hill proposed that if different investigators arrive at consistent estimates of an exposure-response association in different populations, then this reproducibility provides evidence that the consistently found association is causal. Against this, as noted by Gelman and Zelizer (2015), is the recognition that, “once researchers know what to expect, they can continue finding it, given all the degrees of freedom available in data processing and analysis.” Modern ensemble modeling methods for predictive analytics pursue a somewhat similar criterion (but avoid the potential bias of knowing what to expect and using p-hacking to find it) by partitioning the data into multiple randomly selected subsets (“folds”), fitting multiple predictive models (e.g., CART trees, see Chapter 2) to each subset, and then evaluating their out-of-sample performance on the other subsets. Averaging the predictions from the best-performing models then yields a final prediction, and the distribution of the top predictions characterizes uncertainty around the final prediction. Such computationally intensive methods of predictive analytics provide quantitative estimates of predictive causal relations and uncertainty about them. The Hill consideration that consistent associations are more likely to be causal is replaced by a principle that consistency of estimates across multiple models and subsets of available data implies less uncertainty about predictive relationships. In addition, conditions and algorithms have been developed for “transporting” causal relations among variables inferred from interventions and observations in one population and setting to a different population and setting for which observational data are available (Bareinboim and Pearl, 2013; Lee and Honavar, 2013). These transportability algorithms have been implemented in free R packages such as causaleffect (Tikka, 2016). They capture the idea that causal relationships can be applied in different situations, but that differences between situations may modify the effects created by a specified cause in predictable ways. This is a powerful generalization of the consistency consideration envisioned by Hill.
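The fold-based notion of consistency can be sketched as follows. This is a simplified illustration, not the ensemble procedure described above: it assumes synthetic data with a known slope of 2, fits a single straight line (rather than many CART trees) to each training split, and reports the slope and out-of-sample error for each held-out fold. The function names are hypothetical.

```python
import random

def simulate_linear_data(n=500, seed=5):
    """Synthetic data with a true slope of 2 and unit-variance noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def k_fold_estimates(xs, ys, k=5, seed=0):
    """Return one (slope, out_of_sample_mse) pair per held-out fold."""
    rng = random.Random(seed)
    order = list(range(len(xs)))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    results = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (intercept + slope * xs[i])) ** 2
                  for i in held_out) / len(held_out)
        results.append((slope, mse))
    return results

if __name__ == "__main__":
    xs, ys = simulate_linear_data()
    for slope, mse in k_fold_estimates(xs, ys):
        print("slope:", round(slope, 3), "held-out MSE:", round(mse, 3))
```

A narrow spread of slopes across folds, together with stable held-out error, indicates low uncertainty about the predictive relationship. It does not by itself show that the relationship is manipulative.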
Specificity: Hill considered that the more specific an association between an exposure and an effect is, the more likely it is to be causal. This consideration is seldom used now because it is recognized that most exposures of interest, such as fine particulate matter, might have more than one effect and each effect, such as lung cancer, might have multiple causes. Instead, modern causal inference algorithms such as those in the R package bnlearn discover causal networks that allow multiple causes and effects to be modeled simultaneously.
Temporality: Hill considered that causes must precede their effects. This was the only one of his nine considerations that he held to be a necessary condition. Modern causal inference algorithms agree, but refine the criterion by adding that causes must not only precede their effects, but must also help to predict them. Methods such as Granger causality testing specify that the history (past and present values) of a cause variable must help to predict the future of the effect variable better than the history of the effect variable alone can do.
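The Granger idea can be sketched with a one-lag comparison of nested regressions. The example below is a simplified illustration under stated assumptions: the series are synthetic, only a single lag is used (practical tests typically consider several lags and check stationarity), and the function names are hypothetical. The F statistic measures how much the lagged candidate cause reduces the residual error of a model that already uses the effect's own lag.

```python
import random

def simulate_lagged_pair(n=1000, seed=2):
    """Synthetic series in which past x influences z, but past z does not influence x."""
    rng = random.Random(seed)
    x, z = [0.0], [0.0]
    for _ in range(n - 1):
        x_new = 0.5 * x[-1] + rng.gauss(0, 1)
        z_new = 0.5 * z[-1] + 0.8 * x[-1] + rng.gauss(0, 1)
        x.append(x_new)
        z.append(z_new)
    return x, z

def residualize(v, a):
    """Residuals from a least squares regression of v on a (with intercept)."""
    n = len(v)
    mv, ma = sum(v) / n, sum(a) / n
    slope = (sum((ai - ma) * (vi - mv) for ai, vi in zip(a, v))
             / sum((ai - ma) ** 2 for ai in a))
    return [(vi - mv) - slope * (ai - ma) for ai, vi in zip(a, v)]

def granger_f(cause, effect):
    """F statistic for whether the lagged cause improves a one-lag
    autoregression of the effect."""
    y = effect[1:]
    own_lag = effect[:-1]
    cause_lag = cause[:-1]
    # Restricted model: effect regressed on its own lag only.
    r_y = residualize(y, own_lag)
    rss_restricted = sum(r * r for r in r_y)
    # Full model via partialling out the own lag (Frisch-Waugh-Lovell).
    r_c = residualize(cause_lag, own_lag)
    r_full = residualize(r_y, r_c)
    rss_full = sum(r * r for r in r_full)
    dof = len(y) - 3  # intercept plus two slopes in the full model
    return (rss_restricted - rss_full) / (rss_full / dof)

if __name__ == "__main__":
    x, z = simulate_lagged_pair()
    print("F for x -> z:", round(granger_f(x, z), 1))
    print("F for z -> x:", round(granger_f(z, x), 1))
```

The asymmetry between the two directions is what the test looks for: the history of x helps to predict z beyond z's own history, while the reverse is not true in this simulated system.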
Biological gradient: This consideration states that if larger exposures are associated with larger effects, then their association is more likely to be causal than if such monotonicity does not hold. This is closely related to the strength-of-association criterion, since many measures of association (such as correlation) assume a monotonic relationship. Just as a strong confounder can explain a strong exposure-response association in the absence of manipulative causation, so it can explain a monotonic relation between exposure and response even in the absence of manipulative causation. Since 1965, research on nonlinear and threshold exposure-response relations has made clear that many important biological processes and mechanisms do not satisfy the biological gradient criterion. Modern methods of causal discovery, including CART trees and Bayesian Networks (Chapter 2), can discover and quantify non-monotonic relationships between causes and their effects, so the biological gradient criterion is unnecessary for applying these methods.
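A small simulation shows why monotonicity is not needed to detect dependence. The sketch below assumes a synthetic U-shaped exposure-response relation and uses a crude piecewise-constant fit over equal-width bins as a stand-in for a regression tree. The function names are hypothetical. The correlation is near zero, yet the binned model explains most of the variance in the response.

```python
import math
import random

def simulate_u_shape(n=4000, seed=6):
    """Synthetic U-shaped relation: response is exposure squared plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def binned_r_squared(xs, ys, n_bins=10):
    """Share of response variance explained by bin means (in-sample)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    groups = {}
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), n_bins - 1)
        groups.setdefault(b, []).append(y)
    y_bar = sum(ys) / len(ys)
    total = sum((y - y_bar) ** 2 for y in ys)
    within = 0.0
    for group in groups.values():
        g_bar = sum(group) / len(group)
        within += sum((y - g_bar) ** 2 for y in group)
    return 1.0 - within / total

if __name__ == "__main__":
    xs, ys = simulate_u_shape()
    print("correlation:", round(pearson(xs, ys), 3))
    print("binned R^2:", round(binned_r_squared(xs, ys), 3))
```

A criterion based on monotone association would miss this relationship entirely, while a nonparametric fit recovers it without assuming any particular functional form.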
Plausibility: Hill considered that providing a plausible mechanism by which changes in exposure might change health effects would make a causal relationship between them more likely, while acknowledging that ignorance of mechanisms did not undermine epidemiological findings of associations. The converse is that ignorance of mechanisms can make many proposed mechanisms seem superficially plausible. Fortunately, modern bioinformatics methods allow principles of causal network modeling to be applied to elucidate causal mechanisms and paths, as well as to describe multivariate dependencies among population level variables. Thus, proposed causal mechanisms and paths linking exposure to harm can now be tested using the principles already discussed and data on the relevant variables in bioinformatics databases. For example, a mechanistic path such as “Exposure X increases biological activity Y, which then increases risk of adverse effect Z” might sound plausible when proposed, but might then be shown to be not plausible after all if changes in Z turn out to be independent of changes in Y, or if changes in Z are still dependent on changes in X even when the value of Y has been conditioned on. The same causal discovery and inference algorithms can be applied to both epidemiological and biological data. No new principles or algorithms are required to develop causal network models and dynamic causal simulation models from data collected at the levels of populations, individuals, organ systems, tissues and cell populations, or intracellular processes, as witnessed by the explosive growth of causal discovery and inference algorithms and network modeling in systems biology.
Coherence: Similar to plausibility, coherence of a manipulative causal exposure-response relationship with current scientific understanding, which Hill considered to increase the likelihood that a causal relationship exists, can be addressed by modern causal diagram methods (Joffe et al., 2012) without introducing any new principles or algorithms. Causal network inference and modeling algorithms can be applied to variables at different levels in the biological hierarchy, allowing coherence among causal networks at different levels to be determined from data. Coherence of knowledge at different levels is then an output from these algorithms, rather than an input to them. Alternatively, if knowledge is sufficient to allow some arrows in a causal diagram to be specified or forbidden, then these knowledge-based constraints can be imposed on the network-learning algorithms in programs such as bnlearn, assuring the coherence of discovered networks with these constraints.
Experiment: If interventions are possible for a subset of controllable variables, then setting them to different values and studying how other variables respond can quickly elucidate manipulative causality (Voortman et al., 2010). Causal network discovery algorithms add to this consideration specific techniques for designing experimental manipulations to reveal manipulative causal relationships and algorithms for “transporting” the resulting causal knowledge to new settings with different values of some of the variables (Tikka, 2016).
Analogy. The last of Hill’s considerations is that it is more likely that an association is causal if its exposure and response variables are similar to those in a known causal relationship. A difficulty with this is that what constitutes relevant “similarity” may not be known. For example, are two mineral oils “similar” for purposes of predicting causation of dermal carcinogenicity if they have similar viscosities, or similar densities, or similar polycyclic aromatic hydrocarbon (PAH) content, or some other similarities? The theory of transportability of causal relationships across different settings (Bareinboim and Pearl, 2013; Lee and Honavar 2013; Tikka, 2016) provides a more precise and rigorous understanding of what conditions must be satisfied for a causal relationship identified and quantified in one system to hold in another. The variables (e.g., viscosity, density, PAH content, etc.) that are relevant for letting a causal relationship be transported define the relevant similarities between systems, and thus allow the analogy consideration to be made precise.
This comparison of Hill considerations with principles used in current causal network learning algorithms suggests that real progress has been made since 1965. The considerations of strength, consistency, and temporality can be refined and made more precise using modern concepts and terminology. The considerations of specificity, plausibility, and biological gradients incorporate restrictions that are no longer needed to draw sound and useful causal inferences, since current causal inference algorithms can simultaneously handle multiple causes and effects, multiple causal pathways, and nonlinear and non-monotonic relationships. The somewhat vague considerations of coherence and analogy can be made more precise, and experimental and observational data can be combined for purposes of causal inference, using the recent theory of transportability of causal relationships (Bareinboim and Pearl, 2013; Tikka, 2016). These technical advances suggest that it is now practical to use data-driven causal inference methods and concepts to clarify, refine, and replace earlier judgment-based approaches to causal inference. They provide concrete criteria that can be implemented in software algorithms or applied by courts to make more objective and accurate determinations of manipulative causality than has previously been possible. This provides a technical basis for expanding the role of judicial review to include encouraging and enforcing improved causal inference.