Which of these four cards (showing A, L, 2, and 5 on their visible faces) must necessarily be turned over to reveal what is on the other side in order to determine whether the following hypothesis is true?
Hypothesis: Each card with a vowel (A, E, I, O, or U) on one side has an even number (2, 4, 6, or 8) on the other. In other words, what is the smallest subset of the cards that must be turned over to test the validity of this hypothesis? Most people, including scientists, find it difficult to think clearly and correctly about even such simple but abstract relations between hypotheses and the evidence needed to test them. By far the most common answer is that cards A and 2 must be turned over to confirm whether the A has an even number on its other side and whether the 2 has a vowel on its other side. The correct answer is that cards A and 5 must be turned over. It is indeed necessary to confirm whether the A has an even number on its other side, but it is also necessary to confirm that the 5 does not have a vowel on its other side, since finding a vowel there would disconfirm the hypothesis. (Cards L and 2 are irrelevant, since neither one can disconfirm the hypothesis no matter what is on the other side.) This often-repeated experiment, called the Wason selection task, and numerous variations on it, illustrate that people naturally think about confirming evidence more easily than about disconfirming evidence in many settings.
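The card-selection logic can be checked mechanically. The sketch below (a hypothetical Python illustration, using the card faces A, L, 2, and 5 from the discussion) marks a card as needing to be turned over exactly when its hidden side could complete a counterexample to the hypothesis, i.e., a vowel paired with an odd number.

```python
# Brute-force check of the Wason selection task (a sketch; the card faces
# A, L, 2, 5 are taken from the discussion above).
VOWELS = set("AEIOU")
EVENS = set("2468")

def can_falsify(visible):
    """A card must be turned over iff its hidden side could complete a
    counterexample to 'vowel -> even number': a vowel with an odd number."""
    if visible in VOWELS:                           # hidden side could be odd
        return True
    if visible.isdigit() and visible not in EVENS:  # hidden side could be a vowel
        return True
    return False    # consonants and even numbers cannot disconfirm anything

cards = ["A", "L", "2", "5"]
must_turn = [c for c in cards if can_falsify(c)]
print(must_turn)  # -> ['A', '5']
```

Only the vowel card and the odd-number card can disconfirm the hypothesis; the consonant and the even number are irrelevant, matching the analysis above.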
In light of such powerful heuristics and biases in judgments under uncertainty, many of which were elucidated starting in the 1970s (Kahneman, 2011), findings of consistency in effects estimates and associations across multiple studies should raise a suspicion of p-hacking and confirmation bias: the possibility that different investigators varied their modeling assumptions until they produced the results they expected or hoped to find, perhaps based on published results from earlier studies. Consistency per se is not evidence of causation unless other plausible explanations, such as p-hacking, can be ruled out. Indeed, logically, the proposition that a true causal effect is expected to generate consistent associations across studies is questionable, insofar as different studies involve different conditions and distributions of covariates in the population that should affect estimated associations and statistical effects. (Moreover, even if the premise were true that “Causality implies consistency,” it would not necessarily justify the conclusion “Consistency implies causality.” That is the formal logical fallacy of affirming the consequent, analogous to turning over card 2 in the Wason selection task to find confirming evidence, which is logically irrelevant for testing the stated hypothesis.)
Causal graph models can improve on the traditional consistency consideration by providing much clearer tests of agreement between theory and data. More convincing than agreement with previous findings (which is too often easily accomplished via p-hacking) is to find associations and effects estimates that differ across studies, and to show that these differences are successfully predicted and explained by invariant causal CPTs applied to the different joint distributions in the populations of causally relevant covariates (e.g., sex, age, income, and health care). Modern transport formulas for DAG models allow such detailed prediction and explanation of empirically observed differences in effects in different populations (Heinze-Deml et al., 2017; Bareinboim and Pearl, 2013; Lee and Honavar, 2013; https://cran.r-project.org/web/packages/causaleffect/causaleffect.pdf). Similar techniques allow the results of multiple disparate studies to be combined, generalized, and applied to new settings using the invariance of causal CPTs, despite the variations of marginal and joint distributions of their inputs in different settings (Triantafillou and Tsamardinos, 2015; Schwartz et al., 2011). Even within a single study, DAGitty algorithms will often produce multiple alternative adjustment sets, allowing the same causal effects to be estimated by conditioning on different subsets of other variables, as illustrated in Figures 2.28 and 2.29.
Internal consistency, in the sense that estimates of specified total or direct causal effects agree when different adjustment sets are applied to the same data, and external consistency, in the sense that the same invariant causal CPTs are found to hold in settings with very different joint distributions of the values of their direct parents, provide useful refinements of the general consideration of “consistency.” They help to distinguish consistency that represents genuine predictive power (based on discovery of causal invariants that can be applied across multiple settings) from consistency arrived at by retrospective p-hacking to make results agree with expectations.
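Internal consistency across adjustment sets can be illustrated with a small simulation. The sketch below uses hypothetical linear data (not from any study cited here) generated from the DAG C1 → X → Y with C1 → C2 → Y; both {C1} and {C2} are valid backdoor adjustment sets for the total effect of X on Y, so the two adjusted regressions should recover approximately the same causal coefficient.

```python
import numpy as np

# Hypothetical linear structural model (a sketch, not from the text):
#   C1 -> X, C1 -> C2, C2 -> Y, X -> Y with true total effect of X on Y = 2.0.
# The backdoor path X <- C1 -> C2 -> Y can be blocked by conditioning on
# either C1 or C2, so {C1} and {C2} are both valid adjustment sets.
rng = np.random.default_rng(0)
n = 200_000
c1 = rng.normal(size=n)
c2 = c1 + rng.normal(size=n)
x = c1 + rng.normal(size=n)
y = 2.0 * x + c2 + rng.normal(size=n)

def coef_of_x(*covariates):
    """Least-squares coefficient of x in a regression of y on x + covariates."""
    A = np.column_stack([np.ones(n), x, *covariates])
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

b1 = coef_of_x(c1)   # adjust for {C1}
b2 = coef_of_x(c2)   # adjust for {C2}
print(round(b1, 2), round(b2, 2))  # both close to the true effect, 2.0
```

Agreement between the two estimates is the "internal consistency" described above; a large discrepancy would instead signal that the assumed DAG, or one of the adjustment sets, is wrong.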
Plausibility, Coherence, and Analogy of Association The considerations of biological plausibility, coherence, and analogy also appeal to, and perhaps encourage, confirmation bias if judgments about the plausibility and coherence of observations, study results, and analogies reflect how well they conform to prior expectations. This can inhibit discovery of unexpected truths. It can also encourage use of logically irrelevant or inadequate information to support preconceptions. For example, IARC (1986) suggests, as a principle for evaluating causal evidence, that if a chemical causes cancer in rats or mice, then it is biologically plausible that it will do so in people. But rodents have organs (e.g., Harderian and Zymbal glands) that people do not. They develop cancers via mechanisms that do not occur in people (e.g., alpha 2u-globulin (hyaline droplet) nephropathy in male rats, but not in female rats or in other species). Thus, what is considered plausible may depend on what is known about relevant differences between species. This is consistent with Hill’s own caveat that “What is biologically plausible depends upon the biological knowledge of the day” (Hill, 1965).
Similarly, many scientific papers that argue that exposures might plausibly pose human health risks use logic similar to the following:
(A) Substance X induces production of reactive oxygen species (ROS), oxidative damage, and proliferation of damaged target cells via an NF-kB signaling pathway;
(B) It has been found that several carcinogens increase cancer risk because they induce production of reactive oxygen species (ROS), oxidative damage, and proliferation of damaged target cells via the NF-kB signaling pathway;
Therefore it is plausible that X is also a carcinogen, by analogy to these known chemical carcinogens and supported by mechanistic or mode-of-action evidence about the signaling pathways and responses involved.
Such reasoning may seem quite convincing until we reflect that it is the same basic syllogism as the following obvious fallacy:
X causes responses in living animals;
Known carcinogens cause responses in living animals;
Therefore X is probably a carcinogen.
The fallacy in both cases is that the causal relevance of the described similarities between (A) and (B) is unknown: it is usually unknown whether the described changes caused by X (e.g., ROS production, oxidative damage, etc.) are the specific ones that cause cancer, or whether they are simply similar but normal homeostasis-preserving responses of healthy organisms to stimuli. Responses such as signaling to the nucleus via specific (e.g., NF-kB) pathways and production of ROS occur in healthy cells and organisms in response to a wide variety of stimuli, as well as in disease processes, and whether they should be interpreted as evidence for a health risk requires far more specific molecular biological detail than is usually provided in arguments with this syllogistic structure. For example, many biological processes have bistable or multistable structures in which stimuli with slightly different intensities or durations can elicit qualitatively very different responses, e.g., healthy vs. pathological. Qualitative descriptions and analogies often do not provide the essential quantitative details required to determine which qualitative response will occur following an exposure.
DAG methods and related causal graph models can improve upon the considerations of plausibility, coherence, and analogy. They replace qualitative judgments about whether belief in causality is coherent, plausible, and analogous to other relevant situations with more definite and independently reproducible criteria. The criterion of d-connectivity in DAG models, as calculated via the algorithms in DAGitty and similar software, establishes whether it is plausible and coherent to believe that exposure is a possible cause of the responses that are attributed to it, in the sense that information can flow from exposure to the response variables. If not, then the data do not indicate that it is plausible that exposure could be a cause of the responses. Knowledge-based constraints for plausibility, such as that daily temperatures might be a cause of death counts, but death is not a possible cause of daily temperatures, can be incorporated into BN learning programs (e.g., using white lists and black lists for permitted and forbidden arrows in bnlearn, or using CAT’s source and sink constraints). Quantitatively, estimates of possible effect sizes (or, for some algorithms, bounds on possible effect sizes) calculated from the constraints imposed by observations in a causal BN model can help to determine whether estimated effects of exposures on responses are plausible and consistent with what is known. For example, a PDP for the total effect of exposure on the conditional expected value of response can be used to check whether epidemiological estimates that attribute a certain fraction of responses to exposures are consistent with the causal BN model learned from available data. 
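The d-connectivity criterion just described can be computed directly. The following sketch implements a standard d-separation test via the moralization criterion (the smoking/tar/cancer graph is a hypothetical illustration, not taken from the text): X and Y are d-separated given Z in a DAG if and only if they are disconnected in the moralized ancestral graph after the nodes in Z are deleted.

```python
# Minimal d-separation test via the moralization criterion (a sketch).
from itertools import combinations

def d_separated(parents, x, y, z):
    """parents: dict mapping each node to a list of its parents in a DAG.
    Returns True iff x and y are d-separated given the set z."""
    # 1. Restrict to the ancestral set of x, y, and z.
    relevant, stack = set(), [x, y, *z]
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, []))
    # 2. Moralize: link each node to its parents, and marry all parents
    #    of a common child; drop edge directions.
    adj = {n: set() for n in relevant}
    for n in relevant:
        ps = parents.get(n, [])
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for p, q in combinations(ps, 2):
            adj[p].add(q); adj[q].add(p)
    # 3. Delete the conditioning set and test reachability from x to y.
    blocked = set(z)
    seen, stack = {x}, [x]
    while stack:
        n = stack.pop()
        for m in adj[n]:
            if m == y:
                return False        # still connected: d-connected
            if m not in seen and m not in blocked:
                seen.add(m); stack.append(m)
    return True

# Hypothetical DAG: genotype -> smoking -> tar -> cancer, genotype -> cancer.
dag = {"smoking": ["genotype"], "tar": ["smoking"], "cancer": ["tar", "genotype"]}
print(d_separated(dag, "smoking", "cancer", {"tar", "genotype"}))  # True
print(d_separated(dag, "smoking", "cancer", set()))                # False
```

If exposure and response are d-separated given the observed covariates, information cannot flow between them, and the model does not support treating exposure as a plausible cause of the response.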
Instead of drawing subjective analogies across chemicals or studies, information can be combined across studies using algorithms that pool dependence and conditional independence constraints from multiple source data sets having overlapping variables, obtained under a possibly wide range of different experimental and observational conditions, and that automatically identify causal graph models (with unobserved confounders allowed) that simultaneously describe the multiple source data sets (Triantafillou and Tsamardinos, 2015).
By applying well-supported, publicly available, pre-programmed causal discovery algorithms to data, researchers can minimize effects of confirmation bias, motivated reasoning, p-hacking, and other types of investigator bias. Such algorithms can replace judgments that may be difficult or impossible to verify or independently reproduce with data-driven conclusions that can easily be reproduced by others simply by running the same algorithms on the same data. Applying several different causal discovery algorithms based on different principles, such as those on the right side of Table 2.4, can reveal conflicting evidence and ambiguities in possible causal interpretations of the data. Some recent causal discovery algorithms pay close attention to resolving conflicting evidence, e.g., by giving priority to better-supported constraints (ibid.). Such advances in causal discovery algorithms allow sophisticated automated interpretation of causal evidence from multiple sources. Arguably, they are supporting a beneficial shift in scientific method from formulation and testing of specific hypotheses (to which investigators may become attached) to direct discovery of causal networks from data without first formulating hypotheses, as in bnlearn and CAT. At a minimum, causal discovery algorithms can provide computer-aided design (CAD) and discovery tools such as COmbINE (Triantafillou and Tsamardinos, 2015), DAGitty (Textor, 2015), and CAT to help investigators synthesize causal network models from multiple data sets, understand their testable implications, and compute adjustment sets and effects estimates for those causal effects that can be estimated.
Specificity, Temporality, Biological Gradient The consideration of specificity – that specific causes (such as long, thin amphibole asbestos fibers) cause specific effects (such as chronic inflammation-mediated malignant mesothelioma) – is no longer widely used, since there are now many examples of agents, such as radiation, bacteria, asbestos, and cigarette smoke, that can cause a variety of adverse health effects through common underlying biological mechanisms. Many agents, including these, induce chronic inflammation (via activation of the NLRP3 inflammasome, as discussed in Chapter 9) together with repetitive injury and scarring in target tissues. This conjunction of conditions can lead to multiple diseases such as fibrosis, asbestosis, COPD, and even lung cancer. However, in cases where only a single exposure and a single effect are of interest, the techniques already studied can be applied to determine whether the exposure variable is the sole parent of the effect in a DAG model. Thus, specificity can be included as a special case of the more general techniques now available for learning causal graph models from data.
Temporality, in the form proposed by Hill – that causes should precede their effects – is too weak a criterion to be very useful in cases such as the study of effects on mortality of a Dublin ban on coal burning (Clancy et al., 2002). The proposed cause, “Reduction in particulate air pollution,” did indeed precede the proposed effect, “Reduction in all-cause mortality,” but it also followed it, because all-cause mortality was on a long-term downward trend that both preceded and followed the ban. The ban had no detectable effect on the decline in total mortality rates over time (Dockery et al., 2013). Thus, it would be a logical fallacy (the post hoc ergo propter hoc fallacy) to interpret the decline in mortality following the ban as evidence for the hypothesis that the ban, or the large (about 70%) reduction that it caused in particulate air pollution, caused or contributed to the subsequent decline in mortality rates. The much stronger criterion of predictive causality – that the past and present values of causes should help to predict future values of their effects better than they can be predicted without that information – subsumes and strengthens the traditional temporality criterion.
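The predictive-causality criterion can be illustrated with a simple autoregression experiment (a hypothetical simulation, not the Dublin data): when x genuinely drives y with a lag, adding past values of x to a forecast of y reduces the residual variance relative to forecasting y from its own past alone.

```python
import numpy as np

# Hypothetical lagged causal process (a sketch): y[t] depends on its own
# past and on the past of x, so past x should improve prediction of y.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

Y = y[1:]
A0 = np.column_stack([np.ones(n - 1), y[:-1]])           # past y only
A1 = np.column_stack([np.ones(n - 1), y[:-1], x[:-1]])   # past y and past x
r0 = Y - A0 @ np.linalg.lstsq(A0, Y, rcond=None)[0]
r1 = Y - A1 @ np.linalg.lstsq(A1, Y, rcond=None)[0]
print(r1.var() < r0.var())  # True: past x improves prediction of future y
```

In contrast, a variable that merely precedes y without helping to predict it (as with the coal ban and the pre-existing mortality trend) would leave the residual variance essentially unchanged.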
Finally, biological gradient – that larger effect values should be associated with larger values of their causes – is an unnecessarily restrictive requirement that also overlaps substantially with strength of association. Many biological exposure-response relationships are non-monotonic (e.g., U-shaped, J-shaped, or n-shaped) or have thresholds (e.g., because of positive feedback loops that create bistability, or because of discontinuous changes such as rupture of a lysosome as its membrane loses integrity). Thus, a biological gradient should not necessarily be expected even when there are clear causal mechanisms at work. On the other hand, ignored or mis-modeled errors in exposure estimates can make even sharp threshold-like exposure-response relations appear to follow smooth dose-response gradients if the probability that true exposure is above the threshold that elicits a response increases smoothly with the estimated exposure (Rhomberg et al., 2011). In this case, the apparent gradient between estimated exposures and response probabilities is actually just evidence of exposure estimation error, not evidence that risk increases with exposures below the threshold. However, the more important point is that a positive biological gradient is a very special case of more general hypotheses, such as the LiNGAM hypothesis that response (or response probability) is a possibly non-monotonic function of exposure plus an error term; or the general non-parametric hypothesis that the conditional probability of response (or its conditional probability distribution, if the response variable has more than two values) depends on exposure. Since modern causal discovery methods work with these more general hypotheses, tests for a biological gradient are unnecessary for causal inference. 
PDPs and other non-parametric methods will reveal exposure-response gradients if they exist, but can equally easily describe non-monotonic exposure-response relations (including U-shaped or n-shaped ones with zero average gradient).
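The exposure-measurement-error effect described above is easy to reproduce in simulation. In the hypothetical sketch below, the true exposure-response relation is a sharp threshold at 5, yet grouping responses by deciles of the error-contaminated exposure estimate yields an apparently smooth gradient, including nonzero apparent risk well below the threshold.

```python
import numpy as np

# Hypothetical threshold model with exposure estimation error (a sketch):
# true risk is 0 below the threshold and 1 above it, but exposure is
# observed only with additive noise.
rng = np.random.default_rng(1)
n = 100_000
true_exposure = rng.uniform(0, 10, n)
response = (true_exposure > 5).astype(float)        # sharp threshold at 5
estimated = true_exposure + rng.normal(0, 2, n)     # noisy exposure estimate

# Mean response rate by decile of the *estimated* exposure.
order = np.argsort(estimated)
deciles = np.array_split(response[order], 10)
rates = [d.mean() for d in deciles]
print([round(r, 2) for r in rates])  # rises smoothly, even below the threshold
```

The smooth apparent gradient here reflects only the increasing probability that true exposure exceeds the threshold as estimated exposure grows, not any risk below the threshold, as in the Rhomberg et al. (2011) argument above.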
In summary, specificity and biological gradient are needlessly restrictive. Causality is often present even when neither property is satisfied, and current methods can identify causal graph structures without making use of either consideration. Temporality can be replaced by the criterion of predictive causality, which is more stringent but more useful.
Methods and Examples of Associational Causation: Regression and Relative Risks In addition to the Hill considerations and related weight-of-evidence criteria for judging evidence of causality, determination of associational causality is supported by a variety of statistical and epidemiological methods for identifying exposure-response associations and for testing the null hypothesis that they can be explained by sampling variability if the statistical modeling assumptions used are correct. In the simplest case, where each individual in a study population is either exposed or not exposed to some condition, the difference or ratio of response rates in the exposed and unexposed groups can be used to quantify exposure-response association. For example, the relative risk (RR), the ratio of response rates in the exposed and unexposed populations, provides a frequently used measure of exposure-response association. Variations allow this ratio (or the closely related odds ratio) to be quantified after matching on other variables, such as age and sex. Techniques for matching, stratification, and estimation of relative risks are covered in all standard epidemiology textbooks. If exposure is represented by an ordered-categorical or continuous variable instead of by a dichotomous classification, then regression models are used to quantify exposure-response associations. It is common practice to treat evidence that the regression coefficient for exposure is significantly greater than zero as if it were evidence that exposure increases the risk of response. As we have emphasized, this is a mistake: it conflates the distinct concepts of associational and manipulative causation. Table 2.5 gives examples from the literature on health effects of fine particulate matter (PM2.5) air pollution in which findings about associations are misinterpreted as implying that reducing exposures would reduce health risks.
The left column shows claims from various articles suggesting that association implies manipulative causal conclusions. The right column comments on the confusion, in each case, between association and manipulative causation. In general, the policy-relevant conclusions in the left column do not follow from the associational findings presented.
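The relative risk and odds ratio defined above are straightforward to compute from a 2×2 table. The counts below are hypothetical, chosen only to illustrate the arithmetic; as the surrounding discussion stresses, neither quantity by itself says anything about what an intervention on exposure would accomplish.

```python
# Relative risk and odds ratio from a hypothetical 2x2 table (a sketch).
a, b = 30, 970    # exposed group: responders, non-responders
c, d = 10, 990    # unexposed group: responders, non-responders

rr = (a / (a + b)) / (c / (c + d))   # ratio of response rates
odds_ratio = (a / b) / (c / d)       # ratio of odds of response
print(round(rr, 2), round(odds_ratio, 2))  # -> 3.0 3.06
```

For rare responses, as here, the odds ratio closely approximates the relative risk; both are measures of association only.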
Example of Associative vs. Manipulative Causation in Practice: The CARET Trial
The practical importance of the distinction between associative and manipulative concepts of causation is well illustrated by the results of the CARET trial, a randomized, double-blind 12-year trial initiated in 1983 that sought to reduce risks of lung cancer by administering a combination of beta carotene and retinol to over 18,000 current and former smokers and asbestos-exposed workers (Omenn et al., 1996). This intervention was firmly based on epidemiological studies showing that relative risks of lung cancer were smaller among people with larger levels of beta carotene and retinol. The effect of the intervention was to increase risk of lung cancer. In the words of the investigators, “The results of the trial are troubling. There was no support for a beneficial effect of beta carotene or vitamin A, in spite of the large advantages inferred from observational epidemiologic comparisons of extreme quintiles or quartiles of dietary intake or serum levels of beta carotene or vitamin A. With 73,135 person-years of follow-up, the active-treatment group had a 28 percent higher incidence of lung cancer than the placebo group, and the overall mortality rate and the rate of death from cardiovascular causes were higher by 17 percent and 26 percent, respectively.” That the intervention produced the opposite of its intended and expected effect is a valuable reminder of the key point overlooked in thousands of published epidemiological studies similar to those in Table 2.5: relative risks and other measures of association do not necessarily or usually predict how response probabilities will change if interventions are used to change exposures. Predicting effects of interventions requires methods such as those illustrated in Figures 2.23-2.32 for quantifying total causal effects of one variable on another, as well as an assumption that the causal graph models learned from data represent manipulative rather than only predictive causality.
Table 2.5. Association and Causation Conflated in the PM2.5 Health Effects Literature. (All emphases added.)
“We observed statistically significant and robust associations between air pollution and mortality… these results suggest that fine-particulate air pollution, or a more complex pollution mixture associated with fine particulate matter, contributes to excess mortality in certain U.S. cities.” Dockery, Pope, Xu et al. (1993)
Associations do not suggest a contribution to excess mortality unless they are causal.
“The magnitude of the association suggests that controlling fine particle pollution would result in thousands of fewer early deaths per year.” Schwartz, Laden, and Zanobetti (2002)
Associations do not suggest results from changes in exposure concentrations unless the associations represent manipulative causal relations.
“We examined the association between PM(2.5) and both all-cause and specific-cause mortality… Our findings describe the magnitude of the effect on all-cause and specific-cause mortality, the modifiers of this association, and suggest that PM(2.5) may pose a public health risk even at or below current ambient levels.” Franklin et al. (2006)
An association with mortality is not an effect on mortality. A C-R association does not suggest that exposure poses a public health risk, unless it is causal.
“Residential ambient air pollution exposures were associated with mortality… our study is the first to assess the effects of multiple air pollutants on mortality with fine control for occupation within workers from a single industry.” Hart, Garshick, Dockery et al. (2011)
Associations with mortality are not effects on mortality (Petitti, 1991).
“Each increase in PM2.5 (10 µg/m3) was associated with an adjusted increased risk of all-cause mortality (PM2.5 average on previous year) of 14%... These results suggest that further public policy efforts that reduce fine particulate matter air pollution are likely to have continuing public health benefits.” Lepeule, Laden, Dockery and Schwartz (2012)
Associations do not suggest that public policy efforts that reduce exposure are likely to create public health benefits unless the associations reflect manipulative causation.
“Ground-level ozone (O3) and fine particulate matter (PM2.5) are associated with increased risk of mortality. We quantify the burden of modeled 2005 concentrations of O3 and PM2.5 on health in the United States. …Among populations aged 65–99, we estimate nearly 1.1 million life years lost from PM2.5 exposure… The percentage of deaths attributable to PM2.5 and ozone ranges from 3.5% in San Jose to 10% in Los Angeles. These results show that despite significant improvements in air quality in recent decades, recent levels of PM2.5 and ozone still pose a nontrivial risk to public health.” Fann et al. (2012)
In the absence of manipulative causation, statistical associations between pollutant levels and mortality risks do not quantify effects caused by exposure on burden of disease or on life-years lost or on deaths, nor do they indicate a risk to public health.
“Ambient fine particulate matter (PM2.5) has a large and well-documented global burden of disease. Our analysis uses high-resolution (10 km, global-coverage) concentration data and cause-specific integrated exposure-response (IER) functions developed for the Global Burden of Disease 2010 to assess how regional and global improvements in ambient air quality could reduce attributable mortality from PM2.5. Overall, an aggressive global program of PM2.5 mitigation in line with WHO interim guidelines could avoid 750 000 (23%) of the 3.2 million deaths per year currently (ca. 2010) attributable to ambient PM2.5.” Apte et al. (2015)
The Global Burden of Disease IER functions are based on relative risk measures of association. They do not allow prediction or assessment of “how… improvements in ambient air quality could reduce attributable mortality” or avoid deaths unless the underlying relative risks represent manipulative causal relations.
“We use a high-resolution global atmospheric chemistry model combined with epidemiological concentration-response functions to investigate premature mortality attributable to PM2.5 in adults ≥ 30 years and children < 5 years. …[A]pplying worldwide the EU annual mean standard of 25 μg/m(3) for PM2.5 could reduce global premature mortality due to PM2.5 exposure by 17%…Our results reflect the need to adopt stricter limits for annual mean PM2.5 levels globally… to substantially reduce premature mortality in most of the world.” Giannadaki, Lelieveld, and Pozzer (2016)
Epidemiological exposure concentration-response associations and estimates of PM2.5-attributable mortalities based on them do not imply that reducing PM2.5 would reduce mortality, or allow such reductions to be predicted, unless the associations represent manipulative causal relations.
“Relative risks were derived from a previously developed exposure-response model. …Nationally, the population attributable mortality fraction of PM2.5 for the four disease causes was 18.6% (95% CI, 16.9-20.3%). …Aggressive and multisectorial intervention strategies are urgently needed to bring down the impact of air pollution on environment and health.” Lo et al., 2016
Relative risks and population attributable mortality fractions measure associations. They do not imply that reducing exposures would reduce risks of adverse responses unless there is a manipulative causal relation between them.
“In the US Medicare population from 2000 to 2012, short-term exposures to PM2.5 and warm-season ozone were significantly associated with increased risk of mortality. This risk occurred at levels below current national air quality standards, suggesting that these standards may need to be reevaluated.” Di et al. (2017)
Significant exposure-mortality associations do not suggest that standards may need to be reevaluated unless the associations reflect manipulative causality rather than confounding, biases, etc.