Studies of Simpson’s Paradox have been motivated by real-world examples in which it complicates interpretation of data. For example, a study of gender bias in university admissions might find that men applying to a graduate school are significantly more likely to be admitted than women, even if each department in the school is more likely to admit women than men. This can happen if women are more likely than men to apply to the departments with the lowest admissions rates. Similarly, a new drug for dissolving kidney stones might have a lower overall success rate than an older drug, even if it is more effective for both large and small kidney stones, if it is more likely to be used on large stones and these have a lower success rate than small ones. In these and countless other examples, interpreting the implications of an overall statistical association requires understanding what other factors affect the outcome and then controlling for them appropriately to isolate the causal effect of the factor of interest, such as the effect of advertising on customer purchases, the effect of gender on admission decisions, or the effect of a drug or treatment on disease. Causal analytics provides methods for determining which factors need to be controlled for (e.g., sex of customers, academic department applied to, size of kidney stone), and how, to isolate specific causal effects of interest. Chapter 2 introduces software packages (such as DAGitty, see Figure 2.21) that implement these methods.
Example: Visualizing Air Pollution-Mortality Associations in a California Data Set Table 1.2 shows the first few records of a data set that is used repeatedly later to illustrate analytics methods and principles. The full data set can be downloaded as an Excel file from http://cox-associates.com/CausalAnalytics/. It is the file LA_data_example.xlsx. It is also bundled with the CAT software described in Chapter 2, appearing as the data set named “LA”. The rows contain daily measurements of fine particulate matter (PM2.5) concentrations, weather-related variables, and elderly mortality counts (number of deaths among people aged 75 years or older) for California’s South Coast Air Quality Management District (SCAQMD), which contains Los Angeles. The full data set has 1,461 rows of data, one for each day from January 1, 2007 through December 31, 2010.
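As a quick sanity check on the stated row count, the number of days in the study period (2008 being a leap year) can be confirmed with pandas; only the date range stated above is used here.

```python
# Confirming the row count stated above: one row per day from
# 2007-01-01 through 2010-12-31 inclusive (2008 is a leap year).
import pandas as pd

days = pd.date_range("2007-01-01", "2010-12-31", freq="D")
print(len(days))  # 1461
```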
Table 1.2 Layout of data for PM2.5 concentration (“PM2.5”), weather, and elderly mortality (“mortality75”) variables for California’s South Coast Air Quality Management District (SCAQMD)
The variables (columns) in Table 1.2, and their data sources, are as follows:
The calendar variables year, month, and day in the first three columns identify when the data were collected. Each row of data represents one day of observations. Rows are called “cases” in statistics and “records” in database terminology; we shall use these terms interchangeably. (They are also sometimes called “instances” in machine learning and pattern recognition, but we will usually use “cases” or “records.”)
AllCause75 is a count variable giving the number of deaths on each day among people aged 75 years or older, as recorded by the California Department of Health at www.cdph.ca.gov/Pages/DEFAULT.aspx. Columns in a data table typically represent “variables” in statistics or “fields” in database terminology; we shall usually refer to them as variables.
PM2.5 is the daily average ambient concentration of fine particulate matter (PM2.5) in micrograms per cubic meter of air, as recorded by the California Air Resources Board (CARB) at www.arb.ca.gov/aqmis2/aqdselect.php.
The three meteorological variables tmin = minimum daily temperature, tmax = maximum daily temperature, and MAXRH = maximum relative humidity are from publicly available data from Oak Ridge National Laboratory (ORNL) (http://cdiac.ornl.gov/ftp/ushcn_daily/) and the US Environmental Protection Agency (EPA) (www3.epa.gov/ttn/airs/airsaqs/detaildata/downloadaqsdata.htm).
Lopiano et al. (2015) and the above data sources provide further details on these variables. For example, for the AllCause75 variable, Lopiano et al. explain that elderly mortality counts consist of “The total number of deaths of individuals… 75+ years of age with group cause of death categorized as AllCauses… . Note accidental deaths were excluded from our analyses.” The definitions of the populations covered and the death categories used are taken from the cited sources. Clearly, average PM2.5 concentrations at monitor sites do not apply in detail to each individual, any more than the weather conditions describe each individual’s exposure to temperature and humidity. Rather, these aggregate variables provide data from which we can study whether days with lower recorded PM2.5 levels, or lower recorded minimum temperatures, relative humidity, and so forth, also have lower mortality, and, if so, whether various types of causal relationships, discussed in Chapter 2, hold between them.
Given such data, a key task for descriptive analytics is to summarize and visualize relationships among the variables to provide useful insights for improving decisions. A starting point is to examine associations among variables. Figure 1.2 presents two visualizations of the linear (Pearson’s product-moment) correlations between pairs of variables. (These, together with other tables and visualizations, were generated by clicking on the “Correlations” command in the CAT software described in Chapter 2.) The network visualization on the right displays correlations between variables using proximity of variables and thickness of links between them to indicate strength of correlation (green for positive, red for negative). In the corrgram on the left, positive correlations are indicated by boxes shaded with positively sloped hatch lines (blue in color displays). Negative correlations are indicated by boxes shaded with negatively sloped hatch lines, shaded red in color displays. Darker shading represents stronger correlations. Glancing at this visualization shows that the two strongest correlations are a strong positive correlation (dark blue) between tmin and tmax, the daily minimum and maximum temperatures; and a moderate negative correlation between tmin and AllCause75, suggesting that fewer elderly people die on warmer days. Whether and to what extent this negative correlation between daily temperature and elderly mortality might be causal remains to be explored. Chapter 2 discusses this example further.
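The correlation computations behind Figure 1.2 can be sketched in a few lines. The data below are synthetic placeholders with the same variable names as Table 1.2 (the real LA data set is not loaded here); the built-in relationships mimic the two strongest correlations described in the text.

```python
# Minimal sketch of the "Correlations" step: compute a Pearson
# correlation matrix for a daily data table like Table 1.2.
# The data are synthetic stand-ins, not the real LA data set.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 365
tmin = rng.normal(15, 5, n)                   # daily minimum temperature
tmax = tmin + rng.normal(10, 2, n)            # max tracks min closely
pm25 = rng.normal(20, 6, n)                   # fine particulate level
mort = 40 - 0.5 * tmin + rng.normal(0, 3, n)  # fewer deaths on warmer days

df = pd.DataFrame({"tmin": tmin, "tmax": tmax,
                   "PM2.5": pm25, "AllCause75": mort})
corr = df.corr(method="pearson")  # Pearson product-moment correlations
print(corr.round(2))
```

Tools such as corrgram plots and correlation networks are visualizations of exactly this matrix: darker shading or thicker links correspond to entries of larger absolute value.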
Fig. 1.2 A corrgram (left side) and network visualization (right side) displaying correlations between pairs of variables in Table 1.2
Example: What Just Happened? Deep Learning and Causal Descriptions
A useful insight from machine learning and artificial intelligence is that raw data, e.g., from sensors or event logs, can usually be described more abstractly and usefully in terms of descriptive categories, called features, derived from the data. Doing so typically improves the simplicity, brevity, and noise-robustness of the resulting descriptions: abstraction entails little loss, and often substantial gain, in descriptive power. Descriptively useful features can be derived automatically from lower-level data by data compression and information theory algorithms (of which autoencoders and deep learning algorithms are among the most popular) that map the detailed data values into a smaller number of categories with little loss of information (Kale et al., 2015). (Readers who want to apply these algorithms can do so using the free H2O package in R or Python; see https://github.com/h2oai/h2o-tutorials/tree/master/tutorials/deeplearning.) Describing the data in terms of these more abstract features typically improves performance on a variety of descriptive, predictive, and prescriptive tasks.
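The compression idea can be illustrated with a linear special case: a linear autoencoder is equivalent to principal component analysis, which can be computed directly with NumPy’s SVD. The sketch below (synthetic data, not the H2O deep autoencoders mentioned above) compresses ten noisy “sensor” channels into two abstract features with almost no loss of information.

```python
# Sketch of feature extraction by compression: ten noisy sensor
# channels that are really driven by two underlying factors are
# compressed to two features via PCA (a linear autoencoder).
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))            # 2 underlying factors
mixing = rng.normal(size=(2, 10))             # how factors drive sensors
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))  # 10 noisy sensors

Xc = X - X.mean(axis=0)                       # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T                             # "encoder": 2-D features
X_hat = Z @ Vt[:k]                            # "decoder": reconstruction
var_retained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(round(var_retained, 3))                 # near 1: little information lost
```

Deep autoencoders generalize this idea by replacing the linear encoder and decoder with nonlinear neural networks.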
For example, describing what a nuclear power plant or an airplane is doing in terms of the stream of detailed events and measurements recorded in sensor log data is apt to be far less useful for both descriptive and predictive purposes than describing it using higher-level abstract terms such as “The core is overheating due to loss of coolant; add coolant to the reactor vessel now to prevent a core melt” or “The plane’s angle of attack exceeds the critical angle of attack at this air speed; if this continues, a stall is imminent.” The natural language of descriptions, predictions, warnings, and recommendations implicitly reflects causal relationships, as in “Your car’s engine is overheating; add coolant to the radiator now” or “You are descending too fast; pull up!” The brevity and high information value of such communications indicate that they are using terms at the right level of abstraction to communicate effectively: they express, using only a few variables and rough values or comparative terms such as “too steep,” “too hot,” or “too fast,” the essential information needed to understand what is going on and what to do about it to avoid undesirable outcomes. Individual terms such as “overheating” convey a wealth of causal information. Such examples suggest a close relationship between causality, information, and effective (brief and informative) communication. Chapter 2 develops these connections further by using information-based algorithms to develop parsimonious descriptions and predictive models.
Example: Analytics Dashboards Display Cause-Specific Information It is common for contemporary analytics dashboards to combine descriptive, predictive, and causal information by comparing observed values for cause-specific outcomes to their desired, predicted, and past values. Such visualizations make unexpected deviations and recent changes visually obvious. For example, Figure 1.3 shows a clinical dashboard that, on its left side, displays actual rates (thick red pointers) and expected rates (thin grey pointers) of patients entering a disease register for various groups of cause-specific reasons – coronary heart disease, cancer, chronic obstructive pulmonary disease (COPD), and palliative care. The COPD rate is about double its expected value, at close to 7 patients per 1000. The right side shows stacked bar charts of the numbers of patients per 1000 entering the practice’s register for each of these groups of reasons in each of three successive time intervals. The cause-specific rates have remained fairly stable over time.
Fig. 1.3 A clinical dashboard displaying descriptive analytics results for different clusters of causes
Many dashboards allow the user to drill down on high-level aggregate descriptive categories of causes or effects, such as “cancer,” to view results by more specific sub-types.
Such analytics dashboards are now widely used in business intelligence, sales and marketing, financial services, energy companies, engineering systems and operations management, telecommunications, healthcare, and many other industries and applications. They provide constructive visual answers to the key questions of descriptive analytics: “What’s happening?”, “What’s changed?”, “What’s new?” and “What should we worry about or focus on?” for different groups of causes and effects. Readers who want to build dashboards for their own data and applications can do so with commercial products such as Tableau (www.tableau.com/learn/training) or with free dashboard development environments such as the flexdashboard R package (http://rmarkdown.rstudio.com/flexdashboard/).
Causality in Predictive Analytics Two main forms of predictive analytics are forecasting future values from past ones and inferring values of unobserved variables from values of observed ones. When time series data are arrayed as in Table 1.2, with time increasing down the rows, then forecasting consists of using data values in earlier rows to predict values in later ones. Predictive inference consists of using the data in some columns to predict the values in other columns. The two can be combined: we might use the past several days (rows) of data on air pollution and weather variables to predict the next several days of data on these same variables and also elderly mortality, for example.
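The row-versus-column distinction can be made concrete with a small sketch. The data below are synthetic (variable names mirror Table 1.2 but the LA data set is not used): forecasting regresses a variable on its own earlier rows (a lagged copy), while predictive inference regresses one column on another column from the same rows.

```python
# Two predictive tasks on one daily table:
#   forecasting        -- earlier rows predict later rows (lagged tmin)
#   predictive inference -- one column predicts another (tmin -> mortality)
# Synthetic data; names mirror Table 1.2 for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 400
t = np.arange(n)
tmin = 15 + 5 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1, n)
mort = 40 - 0.5 * tmin + rng.normal(0, 1, n)
df = pd.DataFrame({"tmin": tmin, "AllCause75": mort})

# Forecasting: fit today's tmin from yesterday's tmin (shift down one row).
df["tmin_lag1"] = df["tmin"].shift(1)
fc = df.dropna()
slope_fc = np.polyfit(fc["tmin_lag1"], fc["tmin"], 1)[0]

# Predictive inference: fit mortality from the same day's tmin.
slope_pi = np.polyfit(df["tmin"], df["AllCause75"], 1)[0]
print(round(slope_fc, 2), round(slope_pi, 2))
```

Because the synthetic series is smooth and seasonal, yesterday’s temperature is a strong predictor of today’s (slope near 1), and the cross-column fit recovers a negative temperature-mortality slope, as in the correlation example earlier.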
Methods of causal analytics overlap with methods for predictive analytics, statistical inference and forecasting, artificial intelligence, and machine learning. Chapter 2 discusses these overlapping methods. Software for predictive analytics is widely available; several dozen software products for predictive analytics, and a handful for prescriptive analytics, are briefly described at the web page “Top 52 predictive analytics and prescriptive analytics software,” www.predictiveanalyticstoday.com/top-predictive-analytics-software/. However, causal analytics is distinct from predictive analytics in that it addresses questions such as “How will the probabilities of different outcomes change if I take different actions?” rather than only questions such as “How do the probabilities of different outcomes change if I observe different pieces of evidence?” Statistical inference is largely concerned with observations and valid probabilistic inferences that can be drawn from them. By contrast, causal analytics is largely concerned with actions and their probable consequences, meaning changes in the probabilities of other events or outcomes brought about by actions. This distinction between seeing and doing is fundamental (Pearl, 2009). Techniques for predicting how actions change outcome probabilities differ from techniques for inferring how observations change outcome probabilities.
Example: Predictive vs. Causal Inference – Seeing vs. Doing Table 1.3 shows a small hypothetical example data set consisting of values of variables for each of three communities, A, B, and C. For each community (row), the table shows values for each of the following variables (columns): average exposure concentration, C, for a noxious agent such as an air pollutant, in units such as parts per million; income level, I, in units such as dollars per capita per year; and rate of some adverse health effect, R, such as mortality or morbidity, in units of cases per person per year.
Table 1.3 A hypothetical example data set. A statistical inference question for these data is: What response rate should be predicted for a community with exposure concentration C = 10? A causal question is: How would changing exposure concentrations to 10 affect response rates R in communities A, B, and C?
[Table 1.3 columns: Exposure concentration, C; Response rate, R]
For purposes of descriptive analytics, the dependencies among these variables over the range of observed values are described equally well by any of the following three models (among others):
Model 1: R = 2C (and I = 140 – 10C)
Model 2: R = 35 – 0.5C – 0.25I
Model 3: R = 28 – 0.2I
For purposes of predictive analytics, these three models all predict the same value of R for any pair of C and I values that satisfy the same descriptive relationship between I and C as the data in Table 1.3, I = 140 – 10C. For example, for a community with C = 6 and I = 80, all three models predict a value of 12 for R. But for other pairs of C and I values, the models make very different predictions. For example, if C = 0 and I = 120, then model 1 would predict a value of 0 for R; model 2 would predict a value of 5; and model 3 would predict a value of 4.
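The arithmetic can be checked directly; the three functions below simply transcribe Models 1–3 from the text.

```python
# Models 1-3 from the text, transcribed as functions of C and I.
def model1(C, I):
    return 2 * C

def model2(C, I):
    return 35 - 0.5 * C - 0.25 * I

def model3(C, I):
    return 28 - 0.2 * I

# On the observed line I = 140 - 10C, all three models agree:
print([m(6, 80) for m in (model1, model2, model3)])    # [12, 12.0, 12.0]

# Off that line, they diverge:
print([m(0, 120) for m in (model1, model2, model3)])   # [0, 5.0, 4.0]
```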
For purposes of causal analytics, it is necessary to specify how changing some variables such as C or I would change others such as R. If it is assumed that changing C or I on the right-hand side of one of the above equations would cause R to adjust to restore equality, then model 1 would predict that each unit of decrease in C would cause 2 units of decrease in R. By contrast, model 2 implies that each unit of decrease in C would increase R by 0.5 units. Model 3 implies that changing C would have no effect on R. Thus, the causal impact of changing C on changing R is under-determined by the data. It depends on which causal model is correct.
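Under the assumption stated above (an intervention sets C while holding I fixed, and R then adjusts to restore equality), the three models’ implied intervention effects can be computed directly; the functions again transcribe Models 1–3 from the text.

```python
# Implied effect on R of a one-unit intervention on C (I held fixed),
# under each of the three models from the text.
def model1(C, I): return 2 * C
def model2(C, I): return 35 - 0.5 * C - 0.25 * I
def model3(C, I): return 28 - 0.2 * I

C, I = 6, 80  # a point on the observed line I = 140 - 10C
effects = {m.__name__: m(C + 1, I) - m(C, I)
           for m in (model1, model2, model3)}
print(effects)  # model1: +2, model2: -0.5, model3: 0
```

Identical data, three different answers to the "doing" question: this is the sense in which the causal impact of C on R is under-determined by the observations alone.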
In practice, of course, it is unlikely that the numbers in a real data set would be described exactly by linear relationships, as in Table 1.3, or that the different columns would be exactly linearly related. But it is common in real applications for many different descriptive models to fit the data approximately equally well and yet to make importantly different predictions. A key challenge for causal analytics is discovering from data which of many equally good descriptive models, if any, best predicts the effects of making changes in some variables – those that a decision-maker can control, for example – on other variables of interest or concern, such as health or financial or behavioral outcomes. Questions of whether data suffice to identify a unique causal model, to make unique predictions, or to uniquely estimate the causal impact of changing one variable on the average values of other variables are referred to as questions of identifiability in the technical literatures of machine learning, statistics, causal analysis, and econometrics. We shall see later how modern causal analytics algorithms determine which effects of interventions can be identified from observed data and how to quantify the effects that can be identified.
Example: Non-Identifiability in Predictive Analytics Table 1.4 provides another small hypothetical data set to illustrate the identifiability challenge in a different way. For simplicity, all variables in this example are binary (0-1) variables.