Descriptive Analytics for Public Health: Socioeconomic and Air Pollution Correlates of Adult Asthma, Heart Attack, and Stroke Risks
Descriptive Analytics for Occupational Health: Is Benzene Metabolism in Exposed Workers More Efficient at Very Low Concentrations?
How Large are Human Health Risks Caused by Antibiotics Used in Food Animals?
Quantitative Risk Assessment of Human Risks of Methicillin-Resistant Staphylococcus aureus (MRSA) Caused by Swine Operations
Part 3. Predictive and Causal Analytics
Attributive Causal Modeling: Quantifying Human Health Risks Caused by Toxoplasmosis from Open System Production of Swine
How Well Can High-Throughput Screening Test Results Predict Whether Chemicals Cause Cancer in Mice and Rats?
Mechanistic Causality: Biological Mechanisms of Dose-Response Thresholds for Inflammation-Mediated Diseases Caused by Asbestos Fibers and Mineral Particles
Part 4. Evaluation Analytics
Evaluation Analytics for Public Health: Has Reducing Air Pollution Reduced Mortality in the United States?
Evaluation Analytics for Occupational Health: How Accurately and Consistently Do Laboratories Measure Workplace Concentrations of Respirable Crystalline Silica?
Part 5. Risk Management: Insights from Prescriptive, Learning, and Collaborative Analytics
Improving individual, group and organizational decisions: Overcoming learning-aversion in evaluating and managing uncertain risks
Improving organizational risk management: From Lame Excuses to Principled Practice
Improving institutions of risk management: Uncertain causality and judicial review of regulations
Intergenerational justice in protective and resilience investments with uncertain future preferences and resources
Part 1. Concepts and Methods of Causal Analytics
Causal Analytics and Risk Analytics
Countless books and articles on data science and analytics discuss descriptive analytics, predictive analytics, and prescriptive analytics. An additional analytics area that is much less discussed links this world of analytics, with its statistical model-based descriptions and predictions, to the world of practical decisions in which actions have consequences that decision-makers, and perhaps other stakeholders, care about, and about which they are often uncertain. This is the area of causal analytics. How causal analytics relates to other analytics areas and how its methods can be used to predict what to expect next, explain past outcomes and observations, prescribe what to do next to improve future outcomes, and evaluate how well past or current policies accomplish their intended goals – for whom, and under what conditions – are the main topics of this book.
Causal analytics uses data, models, and algorithms to estimate how different actions change the probabilities of different possible future outcomes. By doing so, it provides the crucial information needed to solve both the prescriptive decision analysis problem of choosing actions to make desired outcomes more likely and undesired ones less likely, and also the evaluation challenge of determining what effects past actions and events have had. Useful recommendations from prescriptive analytics flow from understanding how actions affect outcome probabilities. Learning from experience the causal relationships between actions or policies and their consequences also makes it possible to evaluate and improve policies over time. Collectively, these activities of using data to quantify the causal relationship between actions and their outcome probabilities and then using this understanding to evaluate and improve decisions and policies contribute to what we shall call risk analytics: the application of algorithms to data to produce results that inform and improve risk management.
This book is largely about how to apply principles and methods of causal analytics to data to solve practical risk management decision problems and to inform and improve other steps in the risk analytics process. Table 1.1 outlines these steps, and they are discussed more fully in the rest of this chapter. Chapter 2 introduces different concepts of causality and describes how they can be used to achieve such practical goals as quickly noticing important changes in a controlled industrial system or in an organization’s performance or environment; explaining and predicting such changes and how they affect the performance of the system or organization; making more accurate predictions from limited data; and devising more effective interventions and policies to promote desired outcomes. Causal analytics provides algorithms for learning from data what works and what does not and for estimating how well different policies, treatments, or interventions perform in changing behaviors or outcomes for different people. These methods play a central role in the rest of risk analytics by providing information needed to address the questions in the right column of Table 1.1. Chapter 2 also discusses theoretical principles and existing algorithms and software for building causal models from data and knowledge and for using them to support the rest of the analytics steps of description, prediction, prescription, evaluation, learning, and collaboration in understanding and managing risks in uncertain systems.
The rest of the book illustrates practical applications of these methods, grouped roughly around the analytics steps in Table 1.1. The different chapters apply and extend principles, ideas and methods explained in this chapter and Chapter 2 to a variety of practical risk analysis problems and challenges. They emphasize public health risks, occupational health and safety, and possibilities for improving individual, organizational, and public policy decisions.
Table 1.1 Components of Risk Analytics
Risk analytics step
Typical questions addressed
What is the current situation? What’s happening?
What has changed? What’s new?
What should we focus on? What should we worry about?
If we do not change what we are doing, what will (probably) happen next? When? How likely are the different possibilities?
Given observed (or assumed) values for some variables, what are the probabilities for values of other variables? How well can some variables be predicted from others?
How well can future outcomes be predicted now?
Diagnosis, explanation, and attribution: What explains the current situation?
What can we do about it? How would different actions change the probabilities of different future outcomes?
What should we do next? What decisions and policies implemented now will most improve probabilities of future outcomes?
How well are our current decisions and policies working?
What effects have our decisions and policies actually caused?
How do different policies affect behaviors and outcomes for different people?
What decisions or policies might work better than our current ones?
How can we use data and experimentation to find out?
By how much do different items of information improve decisions? What is the value of information for different measurements?
How can we best work together to improve probabilities of future outcomes?
Who should share what information with whom, how and when?
What actions should each division of an organization or each member of a team take?
Why Bother? Benefits of Causal Analytics and Risk Analytics
Several large potential practical benefits from applying causal analytics provide ample motivation for mastering the technical methods needed to distinguish between association and causation and to estimate causal relationships among variables. The most important one is the ability to quantify how changes in the inputs to a system or situation change the probabilities of different outputs or results, which we refer to generically as outcome probabilities. This ability, in turn, makes it possible to address the decision optimization question of what to do to make preferred outcomes more likely. Crucially, causal analytics also provides relatively objective, data-driven methods for evaluating quantitatively how large the effects of policies, interventions, or actions on improving outcomes actually are. These methods enable quantitative assessment of what works, how well, for whom, and under what conditions.
In marketing science, medical research, and social science program evaluation studies, randomized controlled trials (RCTs), which randomly assign individuals to different “treatments” (actions, policies, or interventions), are often considered the gold standard for evaluating causal impacts of treatments on outcomes. Random assignment of individuals to treatments guarantees that any systematic differences in the responses to different treatments are not due to systematic differences between the individuals receiving different treatments. Causal analytics methods allow many of the benefits of RCTs to be achieved even when data are not obtained from RCTs. For example, they can be applied to data from observational studies in which there are no interventions or treatments. They can be applied to data from natural experiments or “quasi-experiments” in which random assignment has not been used. They also address several limitations of RCTs. While RCT results often do not generalize well beyond the specific populations studied, causal analytics methods provide constructive “transport formulas” for generalizing results inferred from one population to others with different population frequency distributions of risk factors; or, more generally, for applying causal relationships and laws discovered in one or more data sets to new data sets and situations (Bareinboim and Pearl, 2013; Lee and Honavar, 2013). Causal analytics methods can also help to explain why treatments work or fail, and for whom, rather than simply quantifying average treatment effects in a population. They provide powerful techniques for predicting how changes in causal drivers will change future outcomes (or their probabilities) and for deducing the values of unobserved variables that best explain observed data.
In all these ways, the methods we will be studying in this book improve our ability to understand, explain, predict, and control the outputs of uncertain systems and situations by clarifying how different decisions about inputs affect probabilities of outcomes. However, to achieve these potential benefits, appropriate methods of causal analytics must be used correctly. This chapter and the next explain these methods; they also caution against unsound, incorrect, and assumption-laden methods of causal analysis. Two particularly important confusions to beware of are (a) confusion between the effects attributable to a cause, such as numbers of illnesses or deaths attributed to a risk factor, and the effects preventable by removing or reducing that cause; and (b) confusion between past effect levels that would have been observed had a risk factor been absent or smaller, as estimated by some model, and future effect levels that will be observed if the risk factor is removed or reduced. Chapter 2 discusses these and other important distinctions. They are often not drawn with precision in current policy analyses, risk assessments, and benefit-cost analyses. As a result, policy makers are too often presented with information that is claimed to show what should be expected to happen if different policies are implemented, when what is actually being shown is something quite different, such as historical associations between variables. A greater understanding and use of causal analytics can help to overcome such confusions. In turn, greater clarity about the correct causal interpretation of presented information can help to better inform policy makers about the outcome probabilities for different courses of action.
Who Should Read this Book? What Will You Learn? What is Required?
This book is meant primarily for practitioners who want to apply methods of causal analytics correctly to achieve the benefits just discussed. Practitioners need to understand the inputs and outputs for different analytics algorithms, be able to interpret their outputs correctly, know how to apply software packages to data to produce results, and be aware of the strengths, limitations, and assumptions of the software packages and results. The details of how the algorithms compute outputs from inputs are less important for practitioners. We will therefore make extensive use of state-of-the-art analytics algorithms and highlight their key principles while referring to the primary research literature and to software package documentation for technical and implementation details. This chapter and Chapter 2 seek to present the main ideas of causal analytics and risk analytics, making them accessible to a broad audience of technically inclined policy-makers, analysts, and researchers who are not expected to be specialists in data science and analytics. The key ideas are independent of the software used to implement them, so we will explain the ideas and illustrate inputs and outputs without assuming or requiring familiarity with any particular pieces of software. Thus, one audience for this book is fairly broad: we hope to make the main technical ideas of modern risk analytics and causal analytics (Chapters 1 and 2) and their practical applications to a variety of practical risk assessment and risk management problems (remaining chapters) clear, interesting, and useful to those who must decide what to do, evaluate what has been done, or offer advice about what should be done next in order to cause desired results.
We anticipate that a subset of readers may want to personally master the algorithms discussed and start applying them to their own data. For those readers, Chapter 2 introduces several free software packages that can be used to carry out the calculations and analyses shown throughout the book. Many state-of-the-art algorithms for causal analytics and closely related machine-learning tasks are implemented as freely available packages in the R statistical programming environment. For Python developers, the scikit-learn machine learning package and other analytics packages provide valuable alternatives to some R packages. We mention such available software packages throughout the book where appropriate, but assume that many readers have limited interest in learning about their details here. Instead, we have created a cloud-based (in-browser) Causal Analytics Toolkit (CAT), introduced in Chapter 2, to let interested readers run algorithms on example data sets bundled with CAT, or on their own data sets in Excel, without having to know R or Python. The CAT software makes it possible to perform the analyses we describe simply by selecting columns in a data table and then clicking on the name of the analysis to be performed; outputs are then produced.
Thus, we expect that most readers with a technical bent and interest in decision, risk, and policy analysis – or in causal analysis, either for its own sake or for other applications – will be able to use this book to understand, relatively quickly and easily, the main technical ideas at the forefront of current causal and risk analytics and how they can be applied in practice. Readers who want to do so can also use this book, especially Chapter 2, to master technical skills including applying current state-of-the-art R packages (via the simplified Causal Analytics Toolkit (CAT) software) to quantify and visualize associations in data, analyze associations using parametric and non-parametric regression methods, detect and quantify potential causal relations in data, visualize them using causal networks, and quantify various types of important causal relations in real-world data sets. Finally, readers who care mainly about risk analysis applications can skip the rest of this chapter and Chapter 2 and proceed directly to the applications that begin in Chapter 3. We have sought to make the applied chapters relatively self-contained, recapitulating key ideas from Chapters 1 and 2 where needed.
What Topics does this Book Cover?
This chapter and the next describe the roles of causal analytics in the rest of risk analytics and provide technical background on current concepts and methods of causal analytics, respectively. The rest of this chapter walks through the risk analytics steps in Table 1.1: descriptive, predictive, prescriptive, evaluation, learning, and collaborative analytics. It discusses how causal analytics is woven into each of them. A major goal is to show how causal analysis and modeling can clarify and inform the rest of the analytics process. Conversely, implicit and informal causal assumptions and causal interpretations of data can mislead analysts and users of analytics results. We provide caveats and examples of how this can occur and suggestions for avoiding pitfalls in data aggregation, statistical analysis, and causal interpretation of results. This chapter also surveys some of the most useful and exciting advances in each area of risk analytics.
Chapter 2 delves into different technical concepts of causation and methods of causal analysis. These range from popular but problematic measures of associative, attributive, and counterfactual causation, which are based primarily on statistical associations and modeling assumptions, to more useful definitions and methods for assessing predictive, manipulative, structural, and mechanistic or explanatory causation. Chapter 2 emphasizes non-parametric methods and models, especially Bayesian networks (BNs) and other causal directed acyclic graph (DAG) models, and discusses the conditions under which these have valid causal, structural, and manipulative causal interpretations. It points out that many models used for inference and decision optimization in control engineering, decision analysis, risk analysis, and operations research can be brought into the framework of causal BN modeling and briefly discusses other models, such as continuous and discrete-event simulation models, that provide more detailed descriptions of causal processes than BNs.
The remainder of the book is devoted to applications that illustrate how causal analytics and risk analytics principles and methods can be applied in risk analyses, with the main applied focus being on human health risks. Chapters 3 through 6 illustrate how descriptive analytics can be used to address questions in public and occupational health risk analysis, such as: What factors are associated with risks of adult asthma, heart attacks, and strokes, and to what extent might such associations be causal? Do workers exposed to very low occupational concentrations of benzene have disproportionately high levels of risk due to relatively efficient production of metabolites at low exposure concentrations? How large are the risks to humans of antibiotic-resistant “super-bug” infections caused by use of antibiotics in farm animals? Chapters 7 and 8 illustrate attributive and predictive causal analytics by estimating human health risks caused by toxoplasmosis from open system production of swine and by evaluating how well rodent carcinogenicity in multimillion-dollar in vivo two-year assays can be predicted from much less expensive high-throughput screening (HTS) data (the answer is somewhat disappointing), respectively. Chapter 9 is the sole chapter devoted entirely to mechanistic and explanatory causation. It examines health risks at the micro level of disease causation in individuals, focusing on one of the hottest topics in disease biology today: the role of the NLRP3 inflammasome in inflammation-mediated diseases. Mechanistic models of disease causation typically require more detailed applied mathematical and computational modeling than the directed acyclic graph (DAG) models emphasized in earlier chapters. Chapter 9 describes the types of mathematical models that might be useful in quantitative causal modeling of exposure-response relationships in which exposures activate the NLRP3 inflammasome and thereby cause increased risks of inflammation-mediated diseases such as lung cancer, mesothelioma, or heart attack.
Chapters 10 and 11 undertake retrospective evaluations of the results actually caused or achieved by two programs: effects on human mortality rates caused by historical reductions in air pollution in many different areas of the United States; and accuracy of efforts to use sampling to estimate workplace air concentrations of respirable crystalline silica (quartz sand and dust). The main purpose of these chapters is to demonstrate the value of retrospective evaluation of what works, how well, and under what conditions. They also show how to carry out such evaluations using a range of technical methods, from simple descriptive statistics to Granger causality tests. Chapters 12 through 15 shift focus from specific health risk analysis applications to broader questions of prescriptive analytics and individual, group, organizational, institutional, and societal learning and risk management. These chapters are more general and more theoretical than Chapters 3 through 11, as they deal with topics such as improving individual and group decisions; better incorporating risk management disciplines into organizations; using judicial review, especially of causal reasoning, to improve the quality of regulatory science; and pursuing efficiency and justice in risk management decisions that span multiple generations.
To enable the most useful discussion of these topics, it is essential to have a solid grasp of the main aspects of causal analytics and risk analytics. To these, we now turn.
Causality in Descriptive Analytics
Descriptive analytics seeks to summarize, organize, and display data to answer questions about what has happened or is happening and to highlight features of the data that are relevant, interesting, and informative for deciding where to focus attention and what to do next. Questions addressed by descriptive analytics, such as “What has happened to the local unemployment rate recently?” or “How many people per year in this country are currently dying from food poisoning?” or “How has customer satisfaction changed since a year ago for this company?” often involve underlying causal assumptions. For example, an unemployed person may be defined as one who is actively seeking a job and able and willing to work but who does not have a job at the moment because the ongoing search has not yet succeeded. The word “because” indicates a causal attribution. Deciding whether someone should be counted as unemployed depends on the reason or cause for currently not having a job. An otherwise identical person lacking a job for a different stated reason – such as because he or she has given up looking after exhausting all promising leads in a market where employers are not hiring – would not be counted as “unemployed” by this definition, even if he or she is equally eager to find work and capable of working. Thus, describing how many people are currently unemployed requires assessing causes.
Similarly, suppose that food poisoning deaths occur disproportionately among people with severely compromised immune systems who would have died on the same days that they did, or perhaps very shortly thereafter, even in the absence of food poisoning. Then describing these deaths as being due to food poisoning might create an impressive-looking death toll for food poisoning, even if preventing food poisoning would not significantly reduce the death toll. Here, the phrase “due to” indicates a causal attribution, but the descriptive statistics on mortalities attributed to food poisoning do not necessarily reveal how changing food poisoning would change mortality rates. In the terminology introduced in Chapter 2, food poisoning in this example would be an attributive cause but not a manipulative cause of deaths. Descriptive analytics results that are not explicit about the kinds of causation being described can mislead their recipients about the probable effects of interventions.
How data are aggregated and summarized to answer even basic descriptive questions can reflect underlying causes more or less well, as illustrated by the following examples.
Example: Did Customer Satisfaction Improve?
Consider how to use data to describe how customer satisfaction has changed since a new customer relationship management (CRM) program was inaugurated a year ago. This question might arise for a retail chain trying to decide whether to roll out the same CRM program in other locations. For simplicity, suppose that each surveyed customer is classified as either “satisfied” (e.g., giving a satisfaction rating of 7 or more on a scale from 0 to 10) or “not satisfied” (giving a lower rating). Suppose also that the data are as follows:
A year ago, 7,000 of 10,000 randomly sampled customers were satisfied.
Today, only 6,600 of 10,000 randomly sampled customers were satisfied.
Based on this description of the data, it might seem plain that there is no evidence that the new CRM program has proved effective in increasing customer satisfaction. In fact, there has been a statistically significant decline in the proportion of satisfied customers (with a maximum likelihood estimate of 0.70 - 0.66 = 0.04, or 4 percentage points, and a 95% confidence interval of 2.7 to 5.3 percentage points for the decline, calculated via the on-line calculator for a difference of two proportions at http://vassarstats.net/prop2_ind.html).
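The interval quoted above can be checked by hand. The following is a minimal sketch, assuming the standard large-sample (Wald) normal approximation for a difference of two independent proportions; the on-line calculator cited in the text may use a somewhat different method, and the function name `diff_proportions_ci` is purely illustrative.

```python
import math


def diff_proportions_ci(x1, n1, x2, n2, z=1.96):
    """Return (difference, lower, upper) for p1 - p2.

    Uses the large-sample Wald interval; z = 1.96 gives roughly 95% coverage.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Standard error of the difference of two independent sample proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


# Survey counts from the example: 7,000 of 10,000 a year ago, 6,600 of 10,000 today
decline, lower, upper = diff_proportions_ci(7000, 10000, 6600, 10000)
print(f"decline = {decline:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Under this approximation the interval rounds to 0.027 to 0.053, in agreement with the figures quoted in the text. The calculation describes only the change in the aggregate proportion; it says nothing about what caused that change.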
The following different, more detailed, presentation of the same data tells a different story:
A year ago, 55% of women customers and 85% of men customers were satisfied. There were equal numbers of both, so the aggregate fraction of satisfied customers was 0.5*55% + 0.5*85% = 70%.
Today, 60% of women customers and 90% of men customers are satisfied. Moreover, after years of no growth, the number of women customers has quadrupled in the past year and the number of men customers has remained the same. Thus, 4/5 of the customers are now women. Hence, current aggregate satisfaction is (4/5)*60% + (1/5)*90% = 66%.
This second presentation of the data reveals that the new CRM program was followed by an increase of 5 percentage points in the fraction of satisfied customers for both men and women and that the business expanded dramatically by attracting more female customers. If these changes were caused by the CRM program, then it might be well worth considering for broader adoption. The first presentation of the data made the program seem unsuccessful because it revealed only the decrease in the overall percentage of satisfied customers, without showing that this decrease was caused by an influx of new female customers and that satisfaction increased within each sex-specific stratum of customers. The second presentation of the data provides more useful information to inform planners about the probable consequences of introducing the program. This example illustrates that how data should be aggregated and presented for use by decision-makers depends on what cause-and-effect relations are to be illuminated by the analysis. We will return later to how to use understanding of cause and effect to design descriptions of data that reveal causal impacts (e.g., that the CRM program increases the fraction of satisfied customers by 5 percentage points in each stratum) instead of obscuring them. One simple and important insight is that if an outcome such as customer satisfaction depends on multiple factors, such as gender and exposure to a CRM program, then estimating the effects of one factor may require adjusting for the effects of others.
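The arithmetic behind the two presentations can be reproduced in a few lines. The sketch below uses only the percentages and customer mix stated in the example; the helper `aggregate_rate` and the variable names are illustrative. The final step, which reweights today's stratum rates by last year's customer mix, is a simple form of direct standardization. It is added here as one possible way of adjusting for the change in mix, not as the method used in the text.

```python
def aggregate_rate(strata):
    """Overall satisfaction rate as a weighted average of stratum rates.

    `strata` maps a stratum name to (relative number of customers, rate).
    """
    total = sum(count for count, _ in strata.values())
    return sum(count * rate for count, rate in strata.values()) / total


# Relative customer counts: equal a year ago; women quadrupled since then
last_year = {"women": (1, 0.55), "men": (1, 0.85)}
today = {"women": (4, 0.60), "men": (1, 0.90)}

overall_last_year = aggregate_rate(last_year)   # 0.5*0.55 + 0.5*0.85
overall_today = aggregate_rate(today)           # 0.8*0.60 + 0.2*0.90

# Change within each stratum, which the aggregate figures hide
stratum_change = {g: today[g][1] - last_year[g][1] for g in today}

# Today's stratum rates applied to last year's customer mix
standardized_today = aggregate_rate(
    {g: (last_year[g][0], today[g][1]) for g in today}
)

print(f"aggregate: {overall_last_year:.2f} -> {overall_today:.2f}")
print({g: round(change, 2) for g, change in stratum_change.items()})
print(f"today's rates at last year's mix: {standardized_today:.2f}")
```

The aggregate rate falls from 0.70 to 0.66 even though each stratum rises by 0.05. Holding the mix fixed at last year's proportions gives 0.75, so the apparent decline comes entirely from the shift in customer mix.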
Example: Simpson’s Paradox
Figure 1.1 shows how adept use of data visualization can clarify otherwise puzzling or misleading patterns in a data set. It shows that the overall statistical association between two variables, x and y, such as annual advertising spending per customer and annual number of purchases per customer, might be negative (downward-sloping dashed line) even if spending more on advertising per customer increases the average number of purchases per year for each individual (upward-sloping solid lines through the data points). In this display, it is clear that there are two clusters of individuals (e.g., men and women), and that the association between x and y is positive within each cluster but negative overall because the cluster with the higher x values has lower y values. In other words, increasing advertising per customer on the x axis increases the expected number of purchases per individual on the y axis for individuals in both clusters, even though the overall association between x and y is negative. This is an example of Simpson’s Paradox in statistics, and the visualization makes clear how it arises.
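The pattern described above is easy to reproduce numerically. The following sketch uses small, made-up data (not the data behind Figure 1.1) in which y rises with x inside each of two clusters, while the cluster with the higher x values has lower y values. The helper `ols_slope` computes an ordinary least-squares slope, and all names are illustrative.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical cluster A: low x values, high y values
a_x = [1, 2, 3, 4, 5]
a_y = [x + 10 for x in a_x]

# Hypothetical cluster B: high x values, low y values
b_x = [6, 7, 8, 9, 10]
b_y = [x - 5 for x in b_x]

slope_a = ols_slope(a_x, a_y)                   # within cluster A
slope_b = ols_slope(b_x, b_y)                   # within cluster B
slope_pooled = ols_slope(a_x + b_x, a_y + b_y)  # ignoring cluster membership

print(f"within A: {slope_a:+.2f}, within B: {slope_b:+.2f}, "
      f"pooled: {slope_pooled:+.2f}")
```

With these invented numbers, both within-cluster slopes are positive and the pooled slope is negative, mirroring the solid and dashed lines described for Figure 1.1. Whether the within-cluster or the pooled association is the causally relevant one cannot be read off the numbers alone; it depends on the causal structure that generated the data.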