The regression methods used in associational analyses are supported by free, high-quality software for fitting standard regression models to data. Popular choices include Poisson regression models for dependent variables that are counts (e.g., number of deaths per day in a population); logistic regression models for binary dependent variables (e.g., response or no response for each individual); multiple linear regression or generalized linear models for continuous dependent variables; and Cox proportional hazards models for survival data. Non-parametric alternatives such as Random Forest and partial dependence plots (PDPs) can be used instead of parametric regression models, but here we illustrate traditional regression analysis in CAT.
Figure 2.40 Regression analysis in CAT
Figure 2.40 shows the top-most results generated by loading the LA data set in Figure 2.11 and then selecting CAT’s Regression command. CAT automatically detects that the dependent variable (which by default is the first column, AllCause75) is a count variable; fits a Poisson regression model; notes that the Poisson regression modeling assumption of equal means and variances for the dependent variable given the values of the predictors is violated; and therefore fits a more general Quasi-Poisson model. Other appropriate regression models, including linear regression and a Random Forest analysis, are then fit to the same data, and the resulting regression coefficients (for linear regression), confidence intervals for regression coefficients, and diagnostic plots are displayed by the CAT software below the top-most results that are shown in Figure 2.40. Figure 2.41 shows the results of applying causal discovery algorithms to the same data. In this consensus BN DAG model, all four of the BN learning algorithms used agree that the only two parents of AllCause75 (at the far right) are month and tmin: PM2.5 is not shown as a direct cause, and indeed the DAG structure implies that AllCause75 is conditionally independent of PM2.5 given month. Why, then, does PM2.5 appear as a significant predictor of AllCause75 in both quasi-Poisson and linear regression models, even after conditioning on month and other variables?
Figure 2.41 Consensus DAG in CAT
The answer is that model specification error makes PM2.5 informative about AllCause75 in regression models that treat month as a continuous predictor instead of as a discrete categorical predictor. The default in many regression packages is to treat predictors with many values as continuous variables, so that main-effects regression coefficients have interpretations as slope coefficients. But month is an unusual variable, in that it cycles through the same 12 values repeatedly. Specifying it as a categorical variable with 12 distinct values (using as.factor() in R or by checking the appropriate boxes on the CAT data page) causes PM2.5 to drop out as a significant predictor of AllCause75. But leaving it as a continuous predictor (the default) results in the regression models trying to estimate a single slope coefficient for a variable that cycles. Since PM2.5 also varies by month, including values of PM2.5 as predictors for other variables that vary by month, namely AllCause75, can help to correct some of the specification error introduced when month is misspecified as a continuous predictor. In short, model specification error causes a logically irrelevant variable, PM2.5, to become a significant predictor for the dependent variable because it can be used to correct for some of the specification error, thus reducing error variance, i.e., the mean squared error between model-predicted and observed values.
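The effect of coding month as continuous rather than categorical can be illustrated with a small simulation. The following is only a sketch on synthetic data, not the LA data set or CAT's own procedure: the variable names, the cosine seasonal pattern, and the noise levels are all assumptions chosen for illustration. By construction, the simulated response is independent of the simulated pollutant given month.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 3600 observations cycling through 12 months.
month = np.tile(np.arange(1, 13), 300)
season = np.cos(2 * np.pi * (month - 1) / 12)   # assumed cyclic seasonal signal

# Both variables depend on month only; y is independent of pm given month.
pm = 10 + 5 * season + rng.normal(0, 2, month.size)
y = 100 + 10 * season + rng.normal(0, 3, month.size)

def ols(design, response):
    """Return least-squares coefficients for response ~ design."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef

# Misspecified model: month entered as a single continuous slope term.
X_cont = np.column_stack([np.ones(month.size), month, pm])
b_pm_continuous = ols(X_cont, y)[2]

# Correctly specified model: month entered as 12 indicator columns
# (the indicators jointly play the role of the intercept).
month_dummies = np.eye(12)[month - 1]
X_cat = np.column_stack([month_dummies, pm])
b_pm_categorical = ols(X_cat, y)[12]

print(f"PM coefficient, month continuous:  {b_pm_continuous:.3f}")
print(f"PM coefficient, month categorical: {b_pm_categorical:.3f}")
```

In this sketch the pollutant coefficient is clearly positive when month is treated as continuous and close to zero when month is coded with indicator columns, which plays the same role as as.factor() in R.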
The same phenomenon can be seen more simply in the small DAG model X ← Z → Y with corresponding structural equations Y = Z^3 and X = Z^2. In this DAG, Y is clearly conditionally independent of X given Z, but if a linear regression model of the form E(Y | X, Z) = a + b*X + c*Z is fit to a large data set (e.g., with Z values uniformly distributed between 0 and 1 and with Y = Z^3 and X = Z^2), then the regression coefficient for X will be positive. This is not because X contributes any information for predicting Y that was not already available from Z, but simply because the assumed linear model form is misspecified, so that including X = Z^2 helps to reduce the mean squared error in predicting Y = Z^3 using a linear model. If a non-parametric method such as Random Forest were used, then parametric model specification error would no longer play this decisive role, and X would no longer appear as a predictor for Y after conditioning on Z. Interpreting X as exposure, Y as response, and Z as lifestyle, this example shows how model specification error can create a significant statistical exposure-response regression coefficient – and hence an apparent association between X and Y, even after Z has been “controlled for” (or conditioned on by including it on the right side of the misspecified regression model) – even if exposure is not a cause of response. The appearance of PM2.5 as a highly significant predictor in the regression model for AllCause75 in Figure 2.40 but not as a parent of AllCause75 in Figure 2.41 reflects a similar instance of association without causation.
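The three-variable example can be checked numerically. The sketch below assumes a sample of 100,000 values of Z drawn uniformly from [0, 1] and an ordinary least-squares fit; the sample size and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Z is the common cause; X and Y are deterministic functions of Z alone.
z = rng.uniform(0.0, 1.0, 100_000)
x = z ** 2
y = z ** 3

# Fit the misspecified linear model E(Y | X, Z) = a + b*X + c*Z.
design = np.column_stack([np.ones_like(z), x, z])
(a, b, c), *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"a = {a:.3f}, b (coefficient on X) = {b:.3f}, c (coefficient on Z) = {c:.3f}")
```

For Z uniform on [0, 1], the best quadratic approximation to Z^3 in the mean-squared-error sense is 1.5*Z^2 - 0.6*Z + 0.05, so the fitted coefficient on X should be close to 1.5 even though X carries no information about Y beyond what Z already provides.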
A very similar point holds when errors in measured or estimated values of predictors are ignored or are not well modeled. For example, suppose that all three of the variables in the DAG model X ← Z → Y are measured with error, but that the measurement error is much larger for Z than for X. Then in a typical linear regression model that omits error terms for the values of predictors X and Z, the measured values of X may be more strongly associated with the measured values of Y than are the measured values of Z, even though Z and not X is the cause of Y. Again, strength of association does not necessarily indicate likelihood of causation.
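A simulation can make the measurement-error point concrete. In the sketch below, the linear structural equations, the normal distributions, and every error standard deviation are assumptions chosen only so that Z is measured much less accurately than X.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Assumed true structure X <- Z -> Y: Z causes both; X has no effect on Y.
z = rng.normal(0, 1, n)
x = z + rng.normal(0, 0.3, n)
y = 2 * z + rng.normal(0, 0.3, n)

# Measured values: large error in Z, small errors in X and Y (assumed sizes).
z_obs = z + rng.normal(0, 2.0, n)
x_obs = x + rng.normal(0, 0.1, n)
y_obs = y + rng.normal(0, 0.1, n)

# Naive regression of measured Y on measured X and Z, ignoring measurement error.
design = np.column_stack([np.ones(n), x_obs, z_obs])
(_, b_x, b_z), *_ = np.linalg.lstsq(design, y_obs, rcond=None)

corr_xy = np.corrcoef(x_obs, y_obs)[0, 1]
corr_zy = np.corrcoef(z_obs, y_obs)[0, 1]

print(f"coefficient on measured X: {b_x:.3f}, on measured Z: {b_z:.3f}")
print(f"corr(X_obs, Y_obs) = {corr_xy:.3f}, corr(Z_obs, Y_obs) = {corr_zy:.3f}")
```

Under these assumptions the non-causal variable X receives a much larger regression coefficient, and a stronger correlation with Y, than the true cause Z.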
Example: Associative Causation in Air Pollution Health Effects Research
Di et al. (2017) note that “In the US Medicare population from 2000 to 2012, short-term exposures to PM2.5 and warm-season ozone were significantly associated with increased risk of mortality. This risk occurred at levels below current national air quality standards, suggesting that these standards may need to be reevaluated.” But, as in other associational studies (Table 2.5), the reported associations lack clear implications for prudent risk management or regulatory actions. Studies of association are (or should be) of limited interest to policy analysts and decision-makers insofar as they fail to address the following key manipulative causal question needed to inform effective decision-making:
Q1: How would public health effects change if exposure concentrations were reduced?
Instead, studies of association address the following easier, but less relevant, question:
Q2: What are the estimated ratios (or slope factors or regression coefficients) of health effects to past pollution levels in various researcher-selected models and data sets?
As indicated by the sample of literature in Table 2.5, hundreds of scientific articles and accompanying press releases and editorials on air pollution health effects research have presented answers to Q2 as if they were answers to Q1. Ambiguous language, such as that scientific studies “link” mortalities or morbidities to air pollution levels (often meaning little more than that someone divided one by the other, or regressed one against the other), has obscured the fact that Q2 has been substituted for Q1. Policy-makers need, but lack, trustworthy, data-driven answers to Q1. To recapitulate, answers to Q2 are inadequate substitutes for answers to Q1 for all of the following reasons.
“Associations are not effects” (Petitti, 1991). A strong, positive, no-threshold exposure-response association in a population warrants no conclusions about how changing exposure would change response (Pearl, 2009). Observing that, historically, Y = 10X + 50, where X measures exposure and Y measures response, does not preclude the possibility that increasing X would reduce Y, or leave it unchanged. For example, suppose that the structural (causal) relation is Y = Z - X, meaning that exogenous changes in X or Z cause Y to adjust until Y = Z - X, where Z is some covariate such as age or poverty. Suppose that, historically, the associative equation Z = 11X + 50 has held, perhaps because poor people or older people live disproportionately in high-exposure areas. The reduced-form regression equation describing historical observational data, Y = 10X + 50, reveals nothing about how an exogenous reduction in X alone, holding other factors such as Z fixed, would change Y (in this case, increasing Y by one unit per unit reduction in X).
Most published associations are assumption-dependent and model-dependent. That is, they depend on specific modeling assumptions used in producing them. In the example just given, regressing Y against X alone would yield a positive association (regression coefficient of 10) for X: E(Y | X) = 10X + 50. Regressing Y against both X and Z would yield a negative association (regression coefficient of -1) for X: E(Y | X, Z) = Z - X. Which association, positive or negative, is reported depends on the model selected. Headlines of the form “Study links exposure X to increased risk of Y” would often be more accurately expressed as “Researchers select a model with a positive association between X and Y.”
Historical associations do not predict future effects of interventions. The Dublin coal-burning ban experience illustrates this point. Associations in the observational data (Clancy et al., 2002) did not correctly predict the lack of effect caused by the large reduction in air pollution (Dockery et al., 2013).
Omitted confounders such as lagged daily temperatures create spurious (non-causal) exposure-response associations. Di et al. omit lagged values of daily temperature for days 2, 3 and more. Yet, Figure 2.19 shows that lagged daily minimum temperatures out to 7 days are among the most important predictors of daily elderly mortality counts in that data set. Omitting them creates significant positive regression coefficients for PM2.5 as a predictor of daily mortality because the PM2.5 levels are affected by the lagged temperatures and act as a partial surrogate for them if they are omitted.
Model specification errors create spurious associations. Instead of presenting an ensemble of multiple alternative plausible models, Di et al. relied on a conditional logistic regression model. Other models might well have produced different results. In the LA data set, fitting a quasi-Poisson or linear regression model (by clicking on “Regression” in the CAT software) to same-day values of the variables produces a significant positive regression coefficient for PM2.5 as a predictor of daily mortality, but non-parametric analyses (CART trees, random forest ensembles, and Bayesian networks) show no relation between them. The explanation is that generalized linear models are misspecified for this data set. PM2.5 is informative for predicting mortality in the context of the misspecified parametric regression model because it can be used to partly correct the specification error. It has no predictive value in a correctly specified model.
Association is not manipulative causation. Showing that exposure and mortality rates are associated does not imply that changing exposure would change mortality rates.
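The sign reversal in the Y = Z - X example from the first two reasons above can be reproduced in a few lines. This is a sketch only: the uniform distribution for X and the small noise term added to Z (needed so that X and Z are not perfectly collinear) are assumptions not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

x = rng.uniform(0, 10, n)                  # exposure (assumed range)
z = 11 * x + 50 + rng.normal(0, 1, n)      # Z = 11X + 50 plus assumed noise
y = z - x                                  # structural equation Y = Z - X

# Regress Y on X alone: recovers the historical association Y = 10X + 50.
A1 = np.column_stack([np.ones(n), x])
(intercept_1, slope_x_alone), *_ = np.linalg.lstsq(A1, y, rcond=None)

# Regress Y on X and Z: recovers the structural coefficients -1 and +1.
A2 = np.column_stack([np.ones(n), x, z])
(intercept_2, slope_x_adjusted, slope_z), *_ = np.linalg.lstsq(A2, y, rcond=None)

print(f"Y ~ X:     slope on X = {slope_x_alone:.3f}, intercept = {intercept_1:.2f}")
print(f"Y ~ X + Z: slope on X = {slope_x_adjusted:.3f}, slope on Z = {slope_z:.3f}")
```

The same data thus support a coefficient near +10 or exactly -1 for X, depending only on which model is selected.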
Relative Risk and Probability of Causation in the Competing Risks Framework
Despite its inadequacies as a general guide for identifying manipulative causal relationships in data, there are very specific models in which association can successfully play this useful role. One of the best known of these is the competing risks framework, in which each of several sources (potential causes of a disease or adverse outcome) is thought of as independently generating “hits” on a target at a random rate with an average intensity expressed in units of expected hits per unit time. The first hit on the target causes the adverse outcome, such as cancer or birth defect. In this specific setting, with the multiple sources “competing” to land the first hit on the target, the probability that each source wins, thereby becoming the cause of the adverse outcome, is the ratio of its intensity to the sum of the intensities from all sources, λ_i/(λ_1 + λ_2 + … + λ_N), where λ_i denotes the expected hits per unit time (intensity) from source i. This ratio is the probability of causation that source i is the cause of the adverse outcome, given that it occurs. It can also be written as PC_i = λ_i/(B + λ_i), where B denotes the background intensity for occurrence of the adverse effect in the absence of exposure to source i, i.e., the sum of the intensities from all other sources. Doubling the intensity of hits from a source approximately doubles its probability of causation if its intensity is small compared to the background intensity. Conversely, for it to be more likely than not that source i is the cause of the adverse outcome, given that the outcome has occurred, it must be the case that λ_i/(B + λ_i) > 1/2, implying that λ_i > B, and hence that the relative risk ratio RR = (B + λ_i)/B = 1 + λ_i/B for expected occurrences per person-year among people exposed to source i compared to otherwise similar people not exposed to it, must exceed 2. This criterion is sometimes discussed in the context of legal evidence.
For this special case of competing risks, the relative risk ratio RR = 1 + λ_i/B is a linear function of the hit intensity λ_i and hence it directly reflects the incremental risk caused by exposure to source i. Indeed, since RR = (B + λ_i)/B and PC_i = λ_i/(B + λ_i), their product is RR*PC_i = λ_i/B = RR - 1, from which it follows that PC_i = (RR - 1)/RR, or PC_i = 1 - 1/RR when this is positive, i.e., when RR > 1. Probability of causation is an increasing function of relative risk, ranging from 0 when RR = 1 to 1 as RR approaches infinity. Thus, in this case, association as measured by relative risk is an excellent guide to causation, as measured by probability of causation: the greater the association, the higher the probability of causation. In the competing risks framework, Hill’s original intuition that stronger associations make causation more likely is well justified. However, the competing risks framework makes very restrictive assumptions, especially that causes act independently of each other rather than interacting and that each cause by itself fully suffices to cause the adverse outcome, so that the first-hit metaphor applies. When these assumptions do not hold, there is no longer any necessary relation between association and causation, as illustrated in previous examples, and probability of causation for a single source is no longer well defined.
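The competing-risks formulas can be checked by simulation. The sketch below assumes that each source generates hits as an independent Poisson process, so that its first hit time is exponentially distributed; the three intensity values and the helper name pc_from_rr are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical hit intensities (expected hits per unit time) for three sources.
rates = np.array([0.5, 1.0, 2.5])
n = 200_000

# Assumption: each source's first hit time is exponential with mean 1/rate.
# The source with the earliest hit "wins" and is the cause of the outcome.
first_hits = rng.exponential(1.0 / rates, size=(n, rates.size))
winner = first_hits.argmin(axis=1)

empirical_pc = np.bincount(winner, minlength=rates.size) / n
theoretical_pc = rates / rates.sum()      # intensity of source i / sum of all intensities

def pc_from_rr(rr):
    """Probability of causation 1 - 1/RR under competing risks; 0 when RR <= 1."""
    return max(0.0, 1.0 - 1.0 / rr)

# Relative risk for source 0 against the background of all other sources.
background = rates.sum() - rates[0]
rr_0 = 1.0 + rates[0] / background

print("empirical PC:  ", np.round(empirical_pc, 3))
print("theoretical PC:", np.round(theoretical_pc, 3))
print(f"RR for source 0 = {rr_0:.4f}, PC from RR = {pc_from_rr(rr_0):.4f}")
```

The simulated winning fractions should agree closely with the intensity ratios, and the probability of causation computed from the relative risk should match the ratio computed directly from the intensities.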
Conclusions on Associational Causation
Hill’s essay (1965) formulated a question of great practical importance: “But with the aims of occupational, and almost synonymous preventive, medicine in mind the decisive question is whether the frequency of the undesirable event B will be influenced by a change in the environmental feature A.” This is a question about manipulative causation: how would changing exposure (or “environmental feature”) A affect the frequency distribution or probability distribution of undesirable event B in the exposed population? It overlaps with the questions addressed by modern causal discovery algorithms that quantify total and direct causal effects on a response variable of changes in an exposure variable by using tools such as DAGitty to determine what effects can be estimated (and what adjustment sets of other variables must be conditioned on to do so), and algorithms such as Random Forest to estimate them without making parametric modeling assumptions. However, rather than focusing on how to obtain valid scientific answers to this key question, Hill reformulated it as follows: “Disregarding then any such problem in semantics we have this situation. Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?” This is a very different question. It is no longer about how to discover how changing one variable will change another. Rather, it is about what to consider before deciding that the most likely explanation or interpretation for an observed association between two variables is “causation” (without definition or explanation of what that label means, i.e., its semantics). This is a much less interesting question. Even the most likely explanation is quite likely to be wrong when there are many competing plausible ones.
Making a decision about how to label “an association between two variables” is less useful for effective decision-making than figuring out how changing one variable would change the other. The crucial idea of manipulative causation has disappeared. Moreover, the new question of what to consider before “deciding that the most likely interpretation of it [the observed association] is causation” imposes a false dichotomy: that an association is either causal or not, rather than that some fraction of it is causal and the rest is not.
The answers that Hill proposes to the revised question – the nine considerations in the left column of Table 2.4 – are not thorough or convincing insofar as they fail to consider a variety of important possible non-causal explanations and interpretations for some or all of an observed association. Table 2.6 lists those we have discussed, with brief descriptions and notes on some of the main techniques for overcoming them. Hill himself did not consider that his considerations solved the scientific challenge of causal discovery of manipulative causal relationships from data, but offered them more as a psychological aid for helping people to make up their minds about what judgments to form: “None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to make up our minds on the fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect?”
Table 2.6 Non-causal explanations for observed associations, and methods to overcome them
| Source of non-causal association | Methods for overcoming non-causal associations |
| --- | --- |
| Unobserved (latent) confounders | These can be tested for and their effects modeled using the Tetrad, Invariant Causal Prediction, and BACKSHIFT algorithms, among others. |
| Spurious regression in time series or spatial observations with trends | Spurious regression arising from coincident trends can be detected and avoided by using conditional independence tests and predictive causation (e.g., Granger causality) instead of regression models. |
| Collider bias; stratification or selection bias | A study that stratifies or matches individuals on certain variables, such as membership in an occupation, or an analysis that conditions on certain variables by including them on the right-hand side of a regression model, can induce exposure-response associations if the variables conditioned, matched, or stratified on are common descendants of the exposure and response variables. The association does not indicate causality between exposure and response, but that they provide alternative explanations of an observed value. Such biases can be avoided by using DAGitty to compute adjustment sets and conditioning only on variables in an adjustment set. |
| Other threats to internal validity | Threats to internal validity (e.g., regression to the mean) were enumerated by Campbell and Stanley (1963), who also discuss ways to refute them as plausible explanations, when possible, using observational data. |
| Model specification errors | Model specification errors arise when an analysis assumes a particular parametric modeling form that does not accurately describe the data-generating process. Assuming a linear regression model when there are nonlinear effects present is one example; omitting high-order interaction terms is another. Model specification errors can be avoided by using non-parametric model ensemble methods such as PDPs. |
| P-hacking, i.e., adjusting modeling assumptions to produce an association (e.g., a statistically significantly positive regression coefficient) | Automated modeling using CAT or packages such as randomForest and bnlearn to automate modeling choices such as which predictors to select, how to code them (i.e., aggregate their values into ranges), and which high-order interactions to include can help to avoid p-hacking biases. |
| Omitted errors in explanatory variables | Using job exposure matrices, remote-sensing and satellite imagery for pollutant concentration estimation, or other error-prone techniques for estimating exposures, creates exposure estimates for individuals that can differ substantially from their true exposures. In simple regression models, omitting errors from the estimated values of explanatory variables tends to bias regression coefficients toward the null (i.e., 0), but the bias can be in either direction in multivariate models, and failing to carefully model errors in explanatory variables can create false-positive associations. |
| Omitted interdependencies among explanatory variables | Direct and total effects of exposure on response can have opposite signs. More generally, the DAG model in which variables are embedded can create associations without causation in a regression model that includes on its right-hand side variables not in an adjustment set. This can be avoided by using DAGitty to compute adjustment sets for the total causal effect of exposure on response and then to condition on variables in an adjustment set to estimate that effect. |
Since Hill deliberately avoided specifying what he means by “cause and effect” in this context (instead “disregarding then any such problem in semantics”), his considerations must serve as an implicit definition: “cause and effect” in this context is a label that some people feel comfortable attaching to observed associations after reflecting on the considerations in Table 2.4. Other possible non-causal explanations for observed associations that might disconfirm the causal interpretation, such as those in Table 2.6, are not included among the considerations. Deciding to label an association as “causal” based on the Hill considerations does not imply that a causal interpretation is likely to be correct or that other non-causal explanations are unlikely to be correct. Indeed, by formulating the problem as “is there any other answer equally, or more, likely than cause and effect?” Hill allows labeling an association as causal even if it almost certainly isn’t. Suppose that each of the eight alternative explanations in Table 2.6 is judged to be the correct explanation for an observed association with probability 11% and that “cause and effect” is judged to be the correct explanation with probability 12%. (For simplicity, let these be treated as mutually exclusive and collectively exhaustive possible explanations, although of course they are neither.) Then “cause and effect” would be the most likely explanation, even though it has only a 12% probability of being correct, and non-causal explanations have an 88% probability of being correct. That would satisfy Hill’s criterion that there is not “any other answer equally, or more, likely than cause and effect,” even though cause and effect is unlikely to be the correct explanation. 
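The arithmetic of this hypothetical can be laid out explicitly. The sketch below simply encodes the illustrative probabilities from the text, in percent; the labels for the eight non-causal alternatives are placeholders.

```python
# Illustrative probabilities (in percent) from the text: eight non-causal
# explanations at 11% each and "cause and effect" at 12%.
probs = {f"non-causal explanation {k}": 11 for k in range(1, 9)}
probs["cause and effect"] = 12

# Treated as mutually exclusive and collectively exhaustive, for simplicity.
total = sum(probs.values())

most_likely = max(probs, key=probs.get)
prob_not_causal = total - probs["cause and effect"]

print(f"most likely explanation: {most_likely} ({probs[most_likely]}%)")
print(f"probability that the explanation is non-causal: {prob_not_causal}%")
```

The single most likely explanation is “cause and effect,” yet the combined probability of the non-causal alternatives is more than seven times larger.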
Deciding to label an association as “causal” in the associational sense used by Hill, IARC (2006), and many others does not require or imply that the associations so labeled have any specific real-world properties, such as that changing one variable would change the probability distribution of another. It carries no implications for consequences of manipulations or for decisions needed to achieve a desired change in outcome probabilities.
Given these limitations, associational methods are usually not suitable for discovering or quantifying manipulative causal relationships. Hence, they are usually not suitable for supporting policy recommendations and decisions that require understanding how alternative actions change outcome probabilities. (An exception, as previously discussed, is that if a competing-risk model applies, then associational methods are justified: effects of interventions that change the intensities of hits from one or more sources change relative risks, cause-specific probabilities of causation, and absolute risk of a hit per unit time in straightforward ways.) Associations can be useful for identifying non-random patterns that further investigation may explain, with model specification errors, omitted variables, confounding, biases, coincident trends, other threats to internal validity, and manipulative causation being among the a priori explanations that might be considered. That associational studies are nonetheless widely interpreted as if they had direct manipulative causal implications for policy (Table 2.5) indicates a need and opportunity to improve current practice.
Studies that explicitly acknowledge that statistical analyses of associations are useless for revealing how policy changes affect outcome probabilities are relatively rare. One exception is a National Research Council report on Deterrence and the Death Penalty that “assesses whether the available evidence provides a scientific basis for answering questions of if and how the death penalty affects homicide rates.” This report “concludes that research to date on the effect of capital punishment on homicide rates is not useful in determining whether the death penalty increases, decreases, or has no effect on these rates” (National Research Council, 2012). Such candid acknowledgements of the limitations of large bodies of existing research and discussion clear the way for more useful future research. In public health risk research, fully accepting and internalizing the familiar warnings that correlation is not causation, that associations are not effects (Petitti, 1991), and that observations are not actions (Pearl, 2009) may help to shift practice away from relying on associational considerations such as the Hill considerations on the left side of Table 2.4 toward fuller use of causal discovery principles and algorithms such as those on the right side of Table 2.4. Doing so can potentially transform the theory and practice of public health research by giving more trustworthy answers to causal questions such as how changing exposures would change health effects.