Learning Causal BN Models from Data: Causal Discovery Algorithms

A key aspect of causal BN modeling is how to learn causal BN models from data. There are two main parts to this task: learning the structure of the causal BN model, i.e., the causal DAG; and estimating the conditional probability tables (CPTs) that quantify how each variable depends upon its parents. In sufficiently large and varied data sets, the CPT estimation task is straightforward. A cross-tabulation table giving the empirical frequency distribution of the value of each variable, given the values of its parents, suffices if variables are few and discrete; these empirical conditional distributions are the maximum-likelihood estimates of the underlying CPT values. Bayesian methods (e.g., conditioning Dirichlet priors on any available data to obtain posterior distributions for the CPT entries) have been developed for smaller data sets, and have recently been extended to allow for mixtures of such priors for drawing inferences about individual cases from heterogeneous populations (e.g., Azzimonti et al., 2017). For larger data sets and for variables with many values, or for continuous variables, CART trees or regression models developed for the value of each node as a function of the values of its parents provide parsimonious representations of the CPT information.
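As an illustration of the two estimation approaches just described, the following minimal sketch (function and variable names are ours, not from any package) estimates the CPT of a binary child with one binary parent, either by maximum likelihood (the cross-tabulation) or with a symmetric Dirichlet prior:

```python
import numpy as np

def estimate_cpt(parent_vals, child_vals, n_parent_states, n_child_states, alpha=0.0):
    """Estimate P(child | parent) from paired observations.

    alpha = 0 gives the maximum-likelihood cross-tabulation described in
    the text; alpha > 0 adds a symmetric Dirichlet prior, and the result
    is then the posterior mean of each CPT entry.
    """
    counts = np.full((n_parent_states, n_child_states), float(alpha))
    for p, c in zip(parent_vals, child_vals):
        counts[p, c] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy data: the child tends to copy its parent's value.
parents  = [0, 0, 0, 1, 1, 1, 1, 0]
children = [0, 0, 1, 1, 1, 1, 0, 0]

cpt = estimate_cpt(parents, children, 2, 2, alpha=0.0)        # ML estimate
cpt_bayes = estimate_cpt(parents, children, 2, 2, alpha=1.0)  # uniform Dirichlet prior
```

The Bayesian version shrinks each row toward the uniform distribution, which is what makes it usable on small data sets where some parent configurations are rarely observed.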
The more difficult task is estimating DAG structures from data. This is often called the structure learning problem. State-of-the-art BN learning software typically provides a mix of algorithms for solving it. Some of these were used to obtain the DAG models in Figures 2.23, 2.24, 2.27, and 2.32. These algorithms incorporate a variety of ideas and principles for detecting information-based causal relationships (e.g., predictive causation) between variables. Among the most useful are the following.
Conditional independence constraints: As previously discussed in some detail, effects are not conditionally independent of their direct causes in a DAG model, but they are conditionally independent of their more remote ancestors given their parents if the Causal Markov Condition (CMC) holds. Software such as DAGitty details the testable conditional independence constraints implied by a DAG model (see Figure 2.31); conversely, applying CART trees or other tests to identify conditional independence relations among variables in a data set (to within the limits of accuracy of the test) constrains the set of possible DAGs that are consistent with these relations. Conditional independence constraints restrict the set of possible DAG structures to the Markov equivalence class that is compatible with the constraints.
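The following sketch (illustrative only; a real constraint-based learner would add a significance threshold and multiple-testing control) checks the key implication of the CMC for a chain X → Y → Z using plug-in mutual information estimates on simulated data: X and Z are dependent, but conditionally independent given Y.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats for discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def conditional_mi(a, b, given):
    """I(A;B | C): MI within each stratum of the conditioning variable, averaged."""
    return sum(np.mean(given == gv) * mutual_information(a[given == gv], b[given == gv])
               for gv in np.unique(given))

# Simulate the chain X -> Y -> Z with noisy copying.
n = 20000
x = rng.integers(0, 2, n)
y = np.where(rng.random(n) < 0.9, x, 1 - x)   # Y copies X 90% of the time
z = np.where(rng.random(n) < 0.9, y, 1 - y)   # Z copies Y 90% of the time

mi_xz = mutual_information(x, z)          # clearly positive: X predicts Z
cmi_xz_given_y = conditional_mi(x, z, y)  # near zero: X independent of Z given Y
```

Detecting that the conditional mutual information is (statistically) zero while the unconditional one is not rules out DAGs in which X is a direct parent of Z, exactly the kind of constraint described above.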
Composition constraints: If X determines Y and Y determines Z via suitable smooth, deterministic functions, then the composition of these functions should describe how X determines Z and consistency conditions such as the chain rule (that is, dZ/dX = (dZ/dY)(dY/dX)) will hold. If these functions describe manipulative causal relations, rather than only statistical associations, then a small change of size dx in X should cause Y to change by approximately dy = (dY/dX)dx, and this in turn should cause Z to change by approximately dz = (dZ/dY)dy = (dZ/dY)(dY/dX)dx. Such constraints can be generalized to DAG networks, as in the rules for path analysis for the special case of linear functions and normally distributed error terms. They imply consistency conditions for relations among estimated coefficients, and hence can be used to test whether a proposed DAG structure and set of dependency functions is consistent with data, in the sense that these implied consistency conditions hold. In the probabilistic case, composition relations still hold: thus, in the DAG model X → Y → Z, if Z depends probabilistically on Y via a CPT P(z | y) and Y depends probabilistically on X via a CPT P(y | x), then the composition constraint P(z | x) = Σ_y P(z | y)P(y | x) should hold.
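In the discrete case this composition constraint is simply matrix multiplication of CPTs. A minimal check, using small hypothetical CPTs for the chain X → Y → Z:

```python
import numpy as np

# Hypothetical CPTs (rows index the conditioning value, columns the outcome).
P_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
P_z_given_y = np.array([[0.7, 0.3],
                        [0.1, 0.9]])

# Composition constraint: P(z | x) = sum over y of P(z | y) P(y | x),
# i.e., the matrix product of the two CPTs.
P_z_given_x = P_y_given_x @ P_z_given_y
```

In a data set actually generated by this chain, the empirically estimated P(z | x) should match this product to within sampling error; a systematic mismatch is evidence against the proposed structure.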
Scoring and optimization methods: Once a DAG structure has been specified, its CPTs can be fit to available data as already described, e.g., by estimating a CART tree for each node as a function of its parents. The completed BN model, in turn, can be used to assess the likelihood of the data given the model, or other related score functions reflecting measures of goodness-of-fit between the observed data and the predictions implied by the model. Variations in the DAG such as adding, deleting, or reversing arrows can then be made to try to find a higher-scoring model. This incremental optimization (or “hill-climbing”) in the space of models can be continued until no further improvements can be found. Heuristics for combinatorial optimization, such as tabu search (which restricts the allowed moves at each stage to prevent cycling or excessive concentration of search in the neighborhood of the current best networks) are often used to improve search efficiency. Scoring methods, including the hill-climbing (hc) algorithm used as the default in the bnlearn R package and in CAT, combine computational efficiency with competitive performance compared to constraint-based methods (Nowzohour and Bühlmann, 2016). They are among the most successful current techniques for BN learning. Hybrid methods that combine constraints and scoring are also popular for the same reason, although no single BN learning algorithm has proved best for all data sets.
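A toy version of score-based search can make the idea concrete. The sketch below (our own simplification, not bnlearn's hc algorithm) scores candidate DAGs over binary variables with BIC and greedily adds or removes edges, restricted to a fixed variable ordering so that acyclicity is automatic; real implementations use richer move sets, including arc reversal, and heuristics such as tabu lists.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

def family_bic(data, child, parents):
    """BIC contribution of one binary node given a candidate parent set."""
    n = data.shape[0]
    ll, n_params = 0.0, 2 ** len(parents)  # one free probability per parent config
    for config in product([0, 1], repeat=len(parents)):
        mask = np.ones(n, dtype=bool)
        for p, v in zip(parents, config):
            mask &= data[:, p] == v
        m = int(mask.sum())
        if m == 0:
            continue
        k = int((data[mask, child] == 1).sum())
        for c in (k, m - k):
            if c > 0:
                ll += c * np.log(c / m)
    return ll - 0.5 * n_params * np.log(n)

def hill_climb(data, nodes):
    """Greedy edge additions/removals; edges only go from lower- to
    higher-numbered nodes, guaranteeing acyclicity. Stops when no
    single-edge change improves the total BIC score."""
    dag = {v: () for v in nodes}
    while True:
        best_delta, best_move = 0.0, None
        for j in nodes:
            for i in nodes:
                if i >= j:
                    continue
                new_ps = tuple(sorted(set(dag[j]) ^ {i}))  # toggle edge i -> j
                delta = family_bic(data, j, new_ps) - family_bic(data, j, dag[j])
                if delta > best_delta:
                    best_delta, best_move = delta, (j, new_ps)
        if best_move is None:
            return dag
        dag[best_move[0]] = best_move[1]

# Simulate the chain 0 -> 1 -> 2 and recover it by hill-climbing.
n = 5000
x0 = rng.integers(0, 2, n)
x1 = np.where(rng.random(n) < 0.85, x0, 1 - x0)
x2 = np.where(rng.random(n) < 0.85, x1, 1 - x1)
data = np.column_stack([x0, x1, x2])
learned = hill_climb(data, [0, 1, 2])
```

The BIC penalty is what prevents the search from keeping the redundant edge 0 → 2: given parent 1, node 2 gains almost no likelihood from node 0, so removing that edge raises the score.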
Simplicity in error terms: Nonlinear and non-Gaussian models: Suppose that a causal discovery algorithm seeks a causal model described by a structural equation of the form observed effect = f(cause) + error, where f is a possibly unknown and nonlinear function and error is a measurement error term, not necessarily normally distributed, i.e., Gaussian. If the observed values for the effect variable Y are plotted against corresponding values of the cause variable X and a non-parametric smoothing regression curve (e.g., loess, kernel regression, or iteratively reweighted least squares) is fit to the resulting scatterplot, then the scatter of data points around this curve due to the error term should look roughly the same for all values of X, as shown in Figure 2.39a. In this figure, the true data-generating process is Y = X² + error, where error is uniformly distributed between 0 and 1 (and hence is biased upward by 0.5) for all values of X. On the other hand, plotting X against corresponding observed Y values will typically give vertical error scatters that depend on Y if f is nonlinear or if the error term is non-Gaussian (Shimizu et al., 2006). This is shown in Figure 2.39b, using the same data as in 2.39a. Figure 2.39b plots X values against corresponding observed Y values. Clearly, the error variance is smaller at the ends than in the middle, in contrast to Figure 2.39a. Such heteroscedasticity reveals that the correct causal ordering of X and Y is that X causes Y, rather than Y causing X. More generally, when a causal model implies that error terms (residuals) for a dependent variable, given the values of its causes, have a simple form, the empirical distribution of these error terms can be used to determine which is the dependent variable and which are the explanatory variables that cause it.
Linear models with Gaussian errors – the traditional regression setting – are an exception, but nonlinear models or linear models with non-Gaussian errors (“LiNGAM” models) suffice to identify correct causal orderings of the variables in a DAG model under many conditions (Shimizu et al., 2006; Tashiro et al., 2014; Nowzohour and Bühlmann, 2016).
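This direction-finding idea can be sketched numerically. The code below simulates data following the text's description of Figure 2.39 (Y = X² + error, with error uniform on (0, 1)) and compares the vertical scatter in the two plotting directions by computing the conditional variance of one variable within equal-count bins of the other; the binning scheme and thresholds are our own illustrative choices, not a published test.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000
x = rng.uniform(-1, 1, n)
y = x**2 + rng.uniform(0, 1, n)   # data-generating process described in the text

def conditional_variances(a, b, n_bins=10):
    """Variance of b within equal-count bins of a (the vertical scatter)."""
    order = np.argsort(a)
    return np.array([b[idx].var() for idx in np.array_split(order, n_bins)])

forward = conditional_variances(x, y)   # scatter of effect given cause
reverse = conditional_variances(y, x)   # scatter of cause given effect
```

In the forward (causal) direction the scatter is nearly homogeneous (its variance is close to 1/12 in every bin), while in the reverse direction it varies severalfold across bins; this asymmetry is what identifies X as the cause.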
Linear Gaussian models using Tetrad: At the opposite end of the spectrum from nonlinear and non-Gaussian causal models are linear causal models with normally distributed (Gaussian) errors. Programs such as Tetrad (www.phil.cmu.edu/tetrad/) exploit the assumptions of linearity and normal errors to address the important practical problems of estimating and quantifying causal DAG models in the presence of hidden confounders or other latent (unmeasured) variables and estimating models with feedback loops. The presence of latent variables and their connections to observed variables are inferred from otherwise unexplained patterns of correlation among measured variables. Like Netica and other BN programs, Tetrad also works with discrete random variables, in which case the assumptions of linear effects and normal errors are not required.
Fig. 2.39a If effect = f(cause) + error, then plotting effect vs. cause gives a scatterplot with additive errors. The scatterplot shows Y = X² + e where e ~ U[-1, 1]
Fig. 2.39b Conversely, plotting cause vs. effect gives non-additive errors
Invariance and homogeneity principles: The fact that a causal CPT gives the same conditional probability distribution for a response variable whenever its parents have the same values (used as the information conditioned on) provides a basis for causal decision tree programs that seek to estimate average causal effects as the differences in conditional means between leaf nodes in CART-like trees constructed to have a potential-outcomes causal interpretation (Li et al., 2017). It is also the basis for the Invariant Causal Prediction (ICP) package in R (Peters et al., 2016). This package supports causal inference in linear structural equation models (SEMs), including models with latent variables, assuming that data from multiple experimental and/or observational settings reflect additive effects of interventions. Heinze-Deml et al. (2017) discuss non-parametric generalizations.
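The invariance idea behind ICP can be illustrated with a two-variable linear SEM (this is a hand-rolled sketch of the principle, not the ICP package itself; the settings and coefficients are invented). The conditional of Y given its cause X is the same in every setting, while the reverse conditional shifts when X is intervened on:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate(n, x_sd):
    """X -> Y; interventions change the distribution of X across settings,
    but the causal mechanism for Y given X stays the same."""
    x = rng.normal(0, x_sd, n)
    y = 2.0 * x + rng.normal(0, 1, n)
    return x, y

x1, y1 = simulate(20000, 1.0)   # observational setting
x2, y2 = simulate(20000, 3.0)   # setting with an intervention on X

def fit(a, b):
    """OLS slope and residual variance of b regressed on a."""
    slope, intercept = np.polyfit(a, b, 1)
    resid = b - (slope * a + intercept)
    return slope, resid.var()

s1, v1 = fit(x1, y1)   # Y given its cause X: invariant across settings
s2, v2 = fit(x2, y2)
t1, w1 = fit(y1, x1)   # X given Y: changes when X is intervened on
t2, w2 = fit(y2, x2)
```

ICP turns this into an inference procedure by searching for the set of predictors whose conditional distribution for the response is invariant across all available settings.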
Structural causal ordering algorithms: In a system of structural equations, the equations that must be solved first, so that the values of their variables can be used to determine the values of other variables, imply a partial causal ordering of the variables: variables that must be solved for first, or that are exogenous to other variables, are possible causes of them. This concept of causal ordering was introduced for systems of linear structural equations by Simon (1953) and was subsequently generalized to include nonlinear structural equations and dynamical systems modeled by ordinary differential equations (ODEs) and algebraic constraints among equilibrium values of variables (Simon and Iwasaki, 1988).
Timing considerations: Time series causal inference algorithms use the fact that information flows from causes to effects over time to constrain the possible causal orderings of variables to be consistent with observed directions of information flow among time series variables. Granger causality and its non-parametric generalization, transfer entropy, provide one set of ordering constraints. Recently, constraint satisfaction algorithms have been applied to infer causal structure from time series of observations subsampled at a rate slower than that of the underlying causal dynamics of the system being observed (Hyttinen et al., 2016). In many real-world systems, variables in causal networks are constantly jostled by exogenous shocks, disturbances, and noise, and the effects of these perturbations spread through the network of variables over time. Under certain conditions, such as uncorrelated random shocks and linear effects, the structure of the causal network can be inferred by studying the effects of the shocks on observable variables, even if the underlying causal graph has latent variables and cycles and the shocks are unknown. This possibility has been explored via recent packages and algorithms such as BACKSHIFT (Rothenhausler et al., 2015), although more work needs to be done to extend these developments to non-parametric models, analogous to transfer entropy-based reconstruction of DAG models.
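The essence of Granger causality is a prediction-error comparison: past values of X improve forecasts of Y beyond Y's own history if and only if X Granger-causes Y. The sketch below (a bare least-squares comparison, not the formal F-test, with invented coefficients) simulates a system where X drives Y and checks both directions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()   # X drives Y

def one_step_mse(target, predictors):
    """In-sample MSE of predicting target[t] from the predictors at t-1."""
    A = np.column_stack([p[:-1] for p in predictors] + [np.ones(len(target) - 1)])
    b = target[1:]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(np.mean((b - A @ coef) ** 2))

# Past X substantially improves prediction of Y (X Granger-causes Y)...
mse_y_own = one_step_mse(y, [y])
mse_y_with_x = one_step_mse(y, [y, x])
# ...but past Y adds essentially nothing to prediction of X.
mse_x_own = one_step_mse(x, [x])
mse_x_with_y = one_step_mse(x, [x, y])
```

Transfer entropy generalizes the same comparison non-parametrically, replacing reduction in squared error with reduction in conditional entropy.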
Each of these causal inference principles has led to a substantial technical literature and to software implementations that make them relatively easy to apply to real data sets. For end users, perhaps the most important points are as follows:
Multiple causal discovery algorithms are now readily available, as illustrated in Figure 2.32 for four algorithms. Most are available via free R packages such as bnlearn and CompareCausalNetworks. Special packages are available for commonly encountered special situations, e.g., the sparsebn package (Aragam et al., 2017) is available for learning Bayesian networks from bioinformatics data with many more variables than data records, where some or all of the data records may reflect experimental interventions.
Available algorithms incorporate different conceptual principles for inferring causation from data, allowing the possibility of cross-checking each other and enabling robust causal inferences that are supported by multiple principles and contradicted by none.
Several algorithms and principles (including conditional independence tests, scoring algorithms for discrete BNs, invariance and homogeneity tests, and transfer entropy among time series) have non-parametric versions. This allows them to be applied to data in the absence of known or specified parametric models.
In addition to these principles and algorithms for discovering potential causal relations and causal graphs and quantifying CPTs from structured data (i.e., data tables or data frames), there have been several research efforts to develop software that can acquire causal Bayesian networks automatically from text (Sanchez-Graillet and Poesio, 2004; Trovati, 2015) or that can help humans construct them from causal maps elicited from experts (Nadkarni and Shenoy, 2004). Such software recognizes that text strings or expert claims such as “Long working hours create stress that can increase heart attack risks” correspond roughly to causal graph models such as the following: Long_working_hours → Stress → Heart_attack_risk. Various heuristics have been proposed for quantifying the corresponding CPTs by counting and taking ratios of different mentions of each condition and pairs of conditions. However, learning about causality directly from texts or by being told by experts and then representing the results by BNs is not yet, to our knowledge, a commercially viable technology. Other knowledge representation and inference techniques for identifying causal information from text, especially machine-learning algorithms for natural language processing, appear to be very promising (Asghar, 2016). It seems plausible that text mining will become an increasingly important contributor to enhanced causal discovery systems in the years ahead.
Taken together, current causal discovery algorithms provide a flexible set of principles and algorithms for learning about possible causal relationships among variables in structured data. These methods emphasize information-based concepts of causation – that is, concepts such as predictive causation, structural causation, manipulative causation, and mechanistic causation that reflect the information principle that causes help to predict their direct effects, even after conditioning on the values of other variables. In this sense causes are informative about their effects; conversely, conditional probability distributions of effects are derived from the values of their direct causes, which therefore help to predict and explain their values.
The empirical performance of different causal discovery algorithms has been assessed in many challenges and competitions (e.g., Hill, 2016; Hill et al., 2016). Developing, applying, and evaluating algorithms for discovering causal BN models and other causal graph models from data is now a thriving specialty within machine learning, systems biology, artificial intelligence, and related fields. Several professional societies host regular competitions to assess progress in causal discovery algorithms. Advances are reported in sources such as The Journal of Causal Inference, Artificial Intelligence, Neural Information Processing Workshops on Causality, Uncertainty in Artificial Intelligence (UAI) conference proceedings, and documentation of algorithms implemented in R and Python. Applications specifically to inference of gene regulatory networks have been developed through an important series of DREAM challenges (http://dreamchallenges.org/), leading to useful benchmark results and to empirical verification that current causal discovery algorithms are indeed useful in a range of systems biology and bioinformatics applications (Schaffter et al., 2011; Hill et al., 2016).
Comparison of Causal Discovery to Associational Causal Concepts and Methods: Updating the Bradford Hill Considerations

Most causal claims that garner headlines today, such as frequent reports about adverse health effects of various substances, are not based on applying the foregoing principles of causal discovery. Instead, they usually reflect subjective judgments about causality. An approach to forming such judgments has been developed and widely applied within epidemiology over the past half century to draw important-seeming, policy-relevant conclusions from epidemiological data. The conclusions are justified as consensus judgments of selected experts and authoritative bodies based on explicit considerations such as whether observed exposure-response associations are judged to be strong, consistent, and biologically plausible. The left column of Table 2.4 provides a fuller list of the considerations about evidence that are most commonly used to structure and support such judgments.
This approach sprang largely from an influential essay by Sir Austin Bradford Hill (1965). The considerations on the left side of Table 2.4 are often referred to as the “Hill criteria,” although Hill wrote that “What I do not believe – and this has been suggested – [is] that we can usefully lay down some hard-and-fast rules of evidence that must be obeyed before we can accept cause and effect. None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non.” Hill’s approach was later incorporated into various “weight-of-evidence” approaches for systematically documenting considerations and judgments about whether associations are likely to be causal. Table 2.4 matches the original Hill considerations in the left column with roughly corresponding principles of modern causal discovery algorithms in the right column. These considerations and correspondences are discussed in more detail next and contrasted with causal discovery techniques and BN learning algorithms. Readers who do not care about a detailed comparison can find a briefer discussion in Chapter 14 of the Hill considerations and how to update them using modern ideas and methods.
Strength of Association

The foremost consideration for Hill was strength of association. He wrote, “First upon my list I would put the strength of the association,” and this consideration has subsequently been interpreted by authorities such as the International Agency for Research on Cancer (IARC) to mean that “A strong association (e.g., a large relative risk) is more likely to indicate causality than a weak association.” However, this principle is mistaken. Association is not causation. Evidence of association is not evidence of causation, and strong association does not necessarily (or even, perhaps, usually) indicate likely causation.
Table 2.4 Comparison of Bradford-Hill and Causal Discovery Principles

Bradford-Hill considerations | Causal discovery algorithms and principles
Strength of association: stronger associations are more likely to be causal | Information principle: causes are informative about their direct effects and help to predict them. DAG learning: effects are not conditionally independent of their direct causes (bnlearn package)
Consistency of findings across populations, study designs, times, locations, investigators, etc. | External consistency: invariance, homogeneity, and transportability of CPTs. Internal consistency: similar effects estimated via different adjustment sets, principles, and algorithms
Specificity of effects: a specific cause produces a specific effect | LiNGAM for one cause, one effect: y = f(x) + error
Temporality: causes precede their effects | Information flows from causes to their effects over time (Granger causality tests, transfer entropy)
Biological gradient: data show a dose-response pattern (larger responses at higher exposures) | Structural causation; d-connectivity links exposures to responses (e.g., in the DAGitty package)
Coherence: agrees with knowledge of disease biology | d-connectivity of dose and response; knowledge-based constraints in the bnlearn package
Analogy: similar causes are believed to cause similar effects | Deep learning and automatic abstraction methods for generalizing patterns from data on specific instances
Experiment: reducing exposure reduces effect | BackShift algorithm for unknown interventions and latent variables (CompareCausalNetworks package in R)
A stronger association is not, simply by virtue of being stronger, any more likely to indicate causality than a weaker one. Indeed, a stronger exposure-response association may simply indicate stronger sampling and selection biases, stronger model specification errors, stronger coincidental historical trends, stronger confounding, or other strong threats to internal validity (Campbell and Stanley, 1963). The following conceptual examples make this point; practical examples are discussed later.
Example 1: Causation without association. Suppose that the kinetic energy (KE) of a particle is causally related to its velocity (V) and mass (M) via the structural equation KE = ½MV². If the velocities of a collection of particles are uniformly distributed between -1 and +1 (or are normally distributed with mean 0), then the association between V and KE, as measured by standard metrics such as Pearson’s correlation or Spearman’s correlation, will be approximately zero in a large sample of particles (and is exactly zero on average), even though these variables are causally related as strongly as possible, i.e., deterministically.
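This can be verified directly by simulation: the sample correlation between V and KE is near zero, yet a simple binned mutual information estimate (the estimator and bin count are our own illustrative choices) shows the two variables are strongly dependent.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100000
v = rng.uniform(-1, 1, n)
m = 1.0
ke = 0.5 * m * v**2   # deterministic causal relation KE = ½MV²

corr = np.corrcoef(v, ke)[0, 1]   # near zero: no linear association

def discrete_mi(x, y, bins=10):
    """Mutual information (nats) estimated from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

mi = discrete_mi(v, ke)   # clearly positive: V is highly informative about KE
```

This is the information principle at work: correlation misses the symmetric, nonlinear dependence that mutual information captures.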
Example 2: Correlation without causation: Conversely, suppose that X(t) is an exposure variable expressed as a function of time, t, and that Y(t) is a response variable, also expressed as a function of time. For simplicity, suppose that each of X(t) and Y(t) independently is assigned a random linear trend with mean zero; thus, each has a random average slope and tends to increase or decrease linearly with time unless the slope happens to be exactly 0. Then, with probability 1, their values will be correlated even though each is assigned its slope independently of the other and neither depends on the other. Moreover, the correlation between them will be as strong as possible (R2 = 1) if measurement error and random variation are negligible and sample sizes are large. Yet, this strong association indicates nothing about causality.
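A few lines of simulation illustrate the trend effect described above: two series, each given an independent random slope, are nearly perfectly correlated when noise is negligible.

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(200.0)

# Independent random slopes; neither series depends on the other.
slope_x, slope_y = rng.normal(size=2)
x = slope_x * t + 0.001 * rng.normal(size=t.size)
y = slope_y * t + 0.001 * rng.normal(size=t.size)

r_squared = np.corrcoef(x, y)[0, 1] ** 2   # near 1 despite independence
```

Detrending or differencing both series before measuring association is the standard remedy, and it is implicit in Granger-style tests, which condition on each series' own history.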
In practice, many non-stationary random processes, both time series and spatial, exhibit temporal or spatial trends. Any two variables with trends over the same interval of time or the same region of space will have correlated values, even if their values are determined independently of each other and there is no causal relation between them. Thus, the strength of the associations between them indicates nothing about causality. For example, independent random walks are very likely to have significantly correlated values, illustrating the phenomenon known as spurious regression in time series analysis. In real-world applications, associations commonly occur between time series variables (e.g., air pollution levels and mortality rates in Dublin) and between spatial variables (e.g., distance from oil and gas wells, point sources of pollution, or randomly selected locations, and rates of various ailments in people) whether or not there is any causal relationship between them. Many epidemiological journal articles report on such associations and conclude, without evidence, that they raise concerns about health effects of exposures. Applying appropriate methods of analysis (e.g., Granger causality testing, conditional independence tests) can help to determine whether predictive causation, rather than mere association, holds between variables.
Example 3: Association created by model specification error. Suppose that X is an exposure variable and that Y is a response variable and that each is independently uniformly distributed between 0 and its maximum possible value (or, more generally, has a continuous distribution with non-negative support). Anyone wishing to make a case for banning or reducing exposures within the framework of the Hill considerations can create a statistical association between X and Y, even if they are independent (or, indeed, even if they are significantly negatively associated, as might occur if exposure has a protective effect), by specifying a statistical model of the form E(Y | X) = KX, i.e., risk is proportional to exposure, and then estimating the slope parameter K from data and interpreting it as a measure of the potency of X in causing Y. Since the values of both variables are positive, the estimated value of K will also be positive, guaranteeing a positive estimated “link” or association between X and Y no matter what the true relation (if any) between them may be. Of course, a regression diagnostic plot, such as a plot of residuals, would immediately reveal that the assumed model E(Y | X) = KX does not provide an accurate fit to the data if the true data-generating process is quite different from the assumed model, e.g., if it is E(Y | X) = 0.5 (i.e., Y is independent of X) or if it is E(Y | X) = 10 - KX (i.e., Y is negatively associated with X). But practitioners who create statistical associations by this method typically do not show diagnostic plots or validate the assumed model, allowing model specification error to be interpreted as evidence for causality because it creates positive associations. This method has been used successfully in the United States to justify banning animal antibiotics and has been recommended by regulators for widespread use (Bartholomew et al., 2005).
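The mechanics of this example are easy to reproduce: fitting a through-the-origin model E(Y | X) = KX to two independent, positive-valued variables is guaranteed to return a positive K, even though their correlation is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10000
x = rng.uniform(0, 10, n)   # "exposure": independent of the response
y = rng.uniform(0, 5, n)    # "response": independent of the exposure

corr = np.corrcoef(x, y)[0, 1]   # near zero: no association

# Forcing the model E(Y | X) = K*X (OLS regression through the origin)
# manufactures a positive "potency" estimate anyway:
k = (x @ y) / (x @ x)
```

A residual plot would immediately expose the misspecification (residuals trend downward in X), which is exactly why the text stresses the importance of model diagnostics before interpreting K causally.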
Example 4: Ambiguous associations. Suppose that Y = 10 - X + Z and that Z = 2X. Then the direct effect on Y of an exogenous increase in X by 1, holding other variables (namely, Z) fixed, is to reduce Y by 1; but the total effect, allowing Z to adjust, is to increase Y by 1. “The association” between Y and X depends on what else (Z in this example) is conditioned on in modeling the relation between them. No single measure of association can simultaneously describe all of the possible associations conditioning on different subsets of other variables. Clarifying which subsets of variables should be conditioned on to obtain a causally interpretable association (namely, those in a properly constructed minimal adjustment set, e.g., as produced by DAGitty) is not trivial, but without such clarification, the causal significance of an association, if any, is unknown.
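The direct-versus-total distinction in this example is just arithmetic on the two structural equations, which the following few lines make explicit:

```python
# Structural equations from Example 4: Y = 10 - X + Z, with Z = 2X.
def z_of(x):
    return 2 * x

def y_of(x, z):
    return 10 - x + z

x0 = 3.0
# Direct effect of a unit increase in X, holding Z fixed: Y falls by 1.
direct = y_of(x0 + 1, z_of(x0)) - y_of(x0, z_of(x0))
# Total effect, letting Z respond to X: Y rises by 1.
total = y_of(x0 + 1, z_of(x0 + 1)) - y_of(x0, z_of(x0))
```

The sign of "the effect of X on Y" therefore depends entirely on whether Z is held fixed or allowed to adjust, which is the sense in which the association is ambiguous without a specified adjustment set.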
Hill (1965) declared that “I have no wish, nor the skill, to embark upon philosophical discussion of the meaning of ‘causation’” (Hill, 1965). As a result, he did not distinguish among distinct varieties of causation, such as associational, attributive, counterfactual, predictive, structural, manipulative, and explanatory causation. Nor did he distinguish among different types of causal effects, such as direct effects, indirect effects, total effects, and mediated effects. However, examples like these make clear that such distinctions are essential for understanding what, if anything, causal assertions imply about how changing some variables (e.g., exposures) affects others (e.g., health effects). Strength of association does not necessarily – or, in practice, usually – shed light on this crucial practical question. Rather, a reported strong association may simply result from particular study design and modeling choices, which we will refer to generically as assumptions. Even a strong association may disappear or be reversed in sign if different assumptions are made. To address the obvious objection that such assumption-dependent conclusions do not necessarily reveal truths about the world, it is common practice to present statistical goodness-of-fit measures and sensitivity analyses supporting the thesis that the selected set of modeling assumptions describes or fits the data better than some alternative assumptions. However, goodness-of-fit comparisons are usually quite insensitive to incorrect model specifications and do not establish that the best-fitting model considered provides a usefully accurate description of the data-generating process, let alone that causal interpretations of associations are justified.
Modern causal discovery algorithms overcome these substantial challenges to the usefulness of association as an indicator of causality by replacing association with information. While stronger associations between variables are not necessarily more likely to indicate causality, it is true that direct causal parents always provide at least as much information about a dependent variable as do its more remote ancestors (and usually strictly more, unless the parents are deterministic, invertible functions of the ancestors). This is a corollary of information theory (the data processing inequality) when “information” is interpreted specifically as mutual information between random variables, measured in units such as bits (Cover and Thomas, 2006, p. 34). Qualitatively, a variable in a causal DAG is typically not conditionally independent of its direct causes, even after conditioning on other variables (again with certain rare exceptions, such as if a parent and a more remote ancestor have identical values), but it is conditionally independent of its more remote ancestors and non-descendants, given its parents. Conditional independence tests and estimates of CPTs in a causal BN can be based on non-parametric methods (e.g., using CART trees), thus avoiding the pitfalls of mistaking parametric model specification errors for evidence of causality. In short, the frequently incorrect association principle, stating that causes are more likely to be strongly associated with their effects than are non-causes, can be updated to a more useful information principle stating that its direct causes provide unique information about an effect variable that helps to predict it (so that conditioning on the value of a parent reduces uncertainty about the effect, including about its future values if it is a time series variable, where uncertainty is measured by the expected conditional entropy of its distribution).
A brief, approximate summary of this principle is that direct causes are informative about (i.e., help to predict) their effects, even after conditioning on other information; moreover, direct causes are typically more informative about their effects than are more remote indirect causes. Exceptions can occur, e.g., if some variables coincide with or are deterministic functions of each other, but these information-theoretic principles are useful in practical settings and play substantially the role that Hill envisioned for strength of association as a guide to possible causation, but more reliably.
Replacing the association principle with the information principle avoids the difficulties in the foregoing examples, as follows. If KE = ½MV² and V is uniformly distributed between -1 and +1, then KE and V have positive mutual information even though the average correlation between them is 0. If X(t) and Y(t) are time series variables with random linear trends, then Y(t) cannot be predicted any better from the history of X(t) and Y(t) than from the history of Y(t) alone, even if they are perfectly correlated. If X and Y are two independent random variables, each with positive values, then the mutual information between them is 0 even though fitting the misspecified model E(Y | X) = KX to pairs of values for X and Y (e.g., to exposure levels and corresponding response levels using ordinary least squares) for a sufficiently large sample size would yield a positive estimate of K. Finally, if Y = 10 - X + Z and Z = 2X are structural equations, then a corresponding DAG model will show that Y has X and Z as its parents and that Z has X as its parent; X is the exogenous input to this system, and the reduced-form model Y = 10 + X for the total effect of X on Y is easily distinguished from the partial dependency relation Y = 10 - X + Z via the DAG showing which variables depend on which others.
Consistency of Association

Several of Hill’s other proposed principles appeal strongly to intuition. They include consistency (different studies and investigators find the same or similar exposure-response associations or effect estimates in different populations and settings at different times); biological plausibility (the association makes biological sense); coherence (the association is consistent with what is known about biology and causal mechanisms); and analogy (similar causes produce similar effects). It is now understood that such properties, based on qualitative and holistic judgments, appeal powerfully to psychological heuristics and biases such as motivated reasoning (finding what it pays us to find), groupthink (conforming to what others appear to think), and confirmation bias, i.e., the tendency of people (including scientists) to find what they expect to find and to focus on evidence that agrees with and confirms pre-existing hypotheses or beliefs, rather than seeking and using disconfirming or discordant evidence to correct misconceptions and to discover how the world works (Kahneman, 2011).
Example: Confirmation Bias and the Wason Selection Task To experience confirmation bias first-hand, consider the following question about evidence and hypothesis-testing. Suppose that you are presented with four cards. It is specified that each of the cards has a letter on one side and a number on the other. You can see the faces that are turned up of the four cards, and they are as follows: