The bottom half of this table is redundant since the conditional probabilities for the outcomescure and no cure must sum to 1 for each set of input conditions, but the values for no cure are included in the CPT for completeness.
Problem: Calculate the populationlevel effect of an intervention that changes from treating all patients with the old treatment to treating all patients with the new treatment at a hospital in which half the patients are men and half our women. Also calculate the individuallevel effects for men and for women.
Solution: The populationlevel effect can be calculated as follows. Define the outcome variable to have value o = 1 for a patient who is cured and o = 0 otherwise. Then the change in the fraction of patients cured caused by the intervention is:
E(o  a)  E(o  a) = E(o  new treatment)  E(o  old treatment)
= 
=
=
= +
= (1  0.5)*0.5 + (0.1  0.5)*0.5 = 0.25  0.20 = 0.05,
where in the last line, appropriate conditional probability values from the CPT and the assumed marginal probabilities of 50% women and 50% men are substituted for the symbolic expressions. The net result is that changing from the old to the new treatment increases the average cure probability by 5% if the composition of the treated population remains half men and half women.
Rational and informed individuals will care more about individuallevel outcome probabilities than about population averages. In this case, the individuallevel effects for men and women were averaged in the last line of the above calculation using the fractions of the patient population that were men and women, respectively. The individuallevel effects were that the intervention improved the cure rate for men by 0.50, from 0.50 with the old treatment to 1.0 with the new; but it reduced the cure rate for women by 0.40, from 0.50 with the old treatment to 0.10 with the new treatment. Presumably, such an intervention would only be made for the whole population of patients if it were not yet realized that it harmed women. A better individuallevel intervention would be to apply the new treatment only to men.
Using Causal Models to Evaluate Total Effects vs. Direct Effects
From the perspective of causal modeling, the best way to use data to estimate the causal impact of an intervention on an outcome is often to estimate and validate a causal modelfrom the data and then use it to quantify how much difference an intervention makes for affected populations and individuals. In general, the effects of policy changes on outcomes of interest can be predicted and evaluated correctly only by modeling the network of causal relationships by which effects exogenous changes propagate among variables. Figure 1.11 illustrates this point via a directed acylic graph (DAG) model in which income has not only a direct causal effect on health_outcome, signified by the arrow between them, but also indirect effects transmitted via pathways that include residential location and exposure.
Fig. 1.11 Direct and indirect paths in a causal DAG model network
income residential_location health_outcome exposure
Predicting how health_outcome probabilities would change if income were exogenously changed requires quantifying the effects transmitted via these multiple pathways. This cannot be accomplished by statistical methods that ignore the network of relationships among variables, e.g., by regressing health_outcome against income and other variables. Such a regression model does not predict how an exogenous change in income would change the joint frequency distribution of values for residential_location and exposure in the affected population. But these changes must be known to predict how the distribution of health_outcomein the population will change, since it depends on them. Therefore, methods that go beyond regression modeling are needed to quantify how exogenous changes in one or more variables will affect the joint probability distribution of the remaining variables and, specifically, the distribution of outcomes of interest, such as health_outcome.
Causal network analysis methods address this challenge. Moreover, they provide useful distinctions among different kinds of causal effects. For example, the total causal effect of a change in income on health_outcome is the change in the probability distribution of health_outcomethat occurs when changes in income propagate via all pathways, causing changes in residential_location and exposure that, in turn, contribute to changes in the conditional probability distribution of health_outcome. By contrast, the direct causal effect of a change in income on health_outcome is the change in the probability distribution of health_outcome that occurs when changes in income propagate only via the direct link from income to health_outcome. Thus, to compute the direct causal effecton health of an intervention that exogenously changes income, the CPT for health_outcomewould be used to calculate the difference in the conditional probability distribution for health_outcomemade when only the value of income changes from its old to its new value. By contrast, the total causal effect of the intervention is found by using the CPT for health_outcome to compute how its conditional probability distribution changes when the distributions of all variables affected by the change in income also change as it changes, in accord with their own CPTs. Chapter 2 introduces algorithms for evaluating total and direct causal impacts of interventions on outcomes in DAG models using software to facilitate the calculations. In general, no single number, such as a regression coefficient, can represent both the direct and the total effect of a change in an explanatory variableon a dependent variable. Causal network analysis allows such distinctions to be drawn with clarity and provides straightforward ways to define and evaluate both kinds of effects, as well as related concepts such as indirect effects and effects mediated by a specific variable or pathway. Chapter 2 explains these additional concepts and shows how to evaluate quantitatively each type of effect in DAG models. The same ideas and principles carry over to more sophisticated causal models, such as dynamic simulation models.
Using MDPs and DES Causal Models to Evaluate Policies and Interventions
Evaluating the effect of a past action, policy, or intervention on observed outcomes is in general no easier than developing a full causal model for how actions affect outcome probabilities: they are essentially the same problem. Understanding how past actions have affected current outcomes (or their probabilities, if the causal relation is not deterministic) requires essentially the same causal analysis techniques as predicting how current actions will affect future outcomes or their probabilities. When enough knowledge and data are available to develop and validate them, discreteevent simulation (DES) models and Markov decision process (MDP) models can be very useful for evaluating effects of past programs and comparing the effects and the costeffectiveness of alternative polices, treatments, and interventions. They have been widely used for this purpose in healthcare program evaluations. For example Goehler et al. (2011) survey the results of 27 Markov models, 3 discreteevent simulation models, and 4 mathematical models taken from the literature on evaluation of different interventions to reduce risks of chronic heart failure (CHF). Interventions evaluated by these models included efforts to improve screening, diagnosis, treatment with drugs or devices, disease management programs, and a study of heart transplants. The dynamic simulation models are able to simulate how probably distributions of outcomes over time in a treated population shift in response to different interventions. Together with implementation cost and timing information, this provides the crucial information about the probable consequences caused by different interventions needed for wellinformed decisionmaking based on methods such as costeffectiveness and riskcostbenefit analysis.
Simulationbased evaluations of effects of interventions are vulnerable to misuse if the simulations are not based on valid causal models. Comparing probability distributions for outcomes under real and hypothetical conditions can lead to quantified “effects” estimates that cannot be verified and that rest on the validity of the hypothesized conditions, which may be unknown. Different hypothetical modeling assumptions could produce different answers. Chapter 2 discusses further counterfactual and potentialoutcomes causal modeling. Causality in Learning Analytics
The preceding sections describe how causal analytics and causal models can be used to describe and interpret patterns of associations among observed variables, predict outcome probabilities for different decisions or for the status quo, optimize decisions, and evaluate how interventions have changed or will change outcome probabilities. Table 1.6 summarizes models and methods for carrying out these tasks, as discussed in this chapter and Chapter 2.
Table 1.6 Models and methods for analytics tasks
Analytics Task

Models for delivering results

Principles, methods, algorithms

Describe, interpret, and explain observed data
 
Dashboard displays for status, histories, events

Correlations, clusters, interaction plots, exploratory data analysis

Bayesian networks (BNs) and other DAG models showing probabilistic dependencies

Simulation model representing the datagenerating process
 
Deep learning for extracting meaningful features, patterns, and clusters from data

Most probable explanations in BNs (see Chapter 2)

Aggregate, display, and interpret associations and patterns in data

Use transport formulas to generalize particular findings

Predict outcome probabilities if no intervention or change in policy
 
DAG models, Dynamic Bayesian Networks (DBNs, see Chapter 2)

Time series simulation models

State space forecasting models

Forecasting model ensembles
 
Calculate conditional probabilities, P(output  input)

Predictive analytics

Time series forecasting

Particle filtering

Understand how choices affect outcomes or their probabilities; predict how they change for new interventions or changes in policy
 
Causal DAG models

MDP, POMDP

System dynamics continuous simulation models

Discreteevent simulation (DES)

Optimal control models
 
Causal DAG learning algorithms

Markov decision process (MDP), partially observable MDP (POMDP), and system dynamics model estimation and simulation

DES simulation algorithms

Prescribe what to do next
 
Decision tables

Influence diagram (ID) models

Statistical decision theory

Simulationoptimization models

MDP and POMDP models

Optimal control and reinforcement learning models
 
Influence diagram (ID) algorithms

Simulationoptimization algorithms

Monte Carlo Tree Search (MCTS)

Evaluate how actions or policies affect outcome probabilities
 
Quasiexperiments, including time series intervention studies

Counterfactual models

Causal DAG; simulation models
 
Simulationbased whatif analyses

Counterfactual comparisons: Show how outcome probabilities change as decision inputs are varied

The causal DAG models (including influence diagrams) and simulation models shown in the middle column can support the analytics tasks shown in the leftmost column if the required models are known.
But suppose that no trustworthy causal model of how outcome probabilities depend on decisions is available when decisions must be made. Then, how should a decision maker proceed? This commonly happens when the system or situation of interest is not yet understood well enough to credibly simulate or model how it will (probably) respond to changes in inputs. Learning from experience how to act effectively in such an uncertain world is the central challenge for learning analytics. The usual context for machine learning algorithms that address this challenge is a sequence of repeated opportunities to choose an act, perhaps informed by accompanying observations, resulting in outcomes to which rewards are assigned (Sutton et al., 1992). The decision maker (d.m.) makes a sequence of decisions, such as which treatments to prescribe to patients with certain medical histories; which loan applications to approve; how much inventory to order each day; or how to adjust the position of an autonomous vehicle on the road. Outcomes are gradually revealed: patients improve or worsen or die, borrowers repay their loans or don’t, vehicles stay on the road or stray off it, and so forth. Some of these outcomes earn rewards or penalties; generically, we use “rewards” to refer to the values assigned to the outcomes, with larger rewards being preferred. The d.m. seeks to make decisions to maximize an objective function such as average reward per unit time, total reward by a termination time, or discounted reward.
This setup is, of course, extremely reminiscent of those for statistical decision theory, MDPs, POMDPs, and simulationoptimization. The big difference is that now, since no causal model is initially known, the d.m. must act on, and learn from, the realworld environment rather than from a model of it. This implies that certain outcomes, such as patient deaths or autonomous vehicle crashes, which might be informative in a simulation should be avoided as well as possible while learning. This is an aspect of the famous explorationexploitation tradeoff between choosing to reap the rewards from what are currently expected to be the best or safest decisions based on experience so far; or instead to take some risks to discover whether other decisions might yield even better rewards. The following approaches can help to decide what to do next in such settings, even in the absence of a causal model.

Random guessing. One possibility is to choose decisions at random and hope for the best. Unsatisfactory as this is, it can actually outperform some widely used techniques such as applying “risk matrices” to guide risk management intervention priorities based on estimated frequencies and severities of outcomes, rather than on how alternative actions would change outcome probabilities and expected rewards (Cox, 2008).

Reinforcement learning. A much more promising approach for many applications is to allow for some randomization in selecting actions, but also to keep track of what was tried and what rewards resulted and to systematically modify action selection probabilities based on this experience to favor those yielding larger average rewards. This basic idea is exploited in modern reinforcement learning and adaptive control algorithms used in machine learning, artificial intelligence, and control engineering (Sutton et al., 1992). An example is SARSAlearning, which learns from experiences, encoded as “stateactionrewardstateaction” (SARSA) sequences, by updating the estimated value of taking a specific act in a specific state (the first stateaction (SA) pair in the SARSA sequence) to reflect the difference between predicted and actual outcomes (the immediate reward, R, and the next state encountered and act taken, say s’ and a’, comprising the rest of the sequence). For a Markov decision process (MDP), the SARSA update equation is the following simple modification of equation (1.4):
Q(s, a) Q(s, a) + (R + Q(s’, a’)Q(s, a)) (1.12)
In words, the estimated value of taking act a in state s, denoted by Q(s, a), is updated by taking its current value on the right side of the update arrow and adding an increment proportional, with proportionality constant , to the difference between the estimated experienced reward, R + Q(s’, a’), and the reward actually received. Under certain comditions, such as that each state of the MDP is visited infinitely often as the time horizon for the MDP becomes infinitely long (no absorbing states), the SARSA updating rule leads to the same optimal policy as the Bellman equation for stochastic dynamic programming, but with the considerable practical advantage that the rewards and transition probabilities need not be known initially: SARSA learns the optimal policy via wellchosen trialanderror learning. Reinforcement learning software is available at these sites:

https://cran.rproject.org/package=ReinforcementLearning for R

http://www.wildml.com/2016/10/learningreinforcementlearning/ for Python.

Modelbased Bayesian reinforcement learning (Ross et al., 2011). A different approach to adaptively optimizing decisions in the absence of a known model is to make one up and then use it to make decisions. For example, if the transition rates among states in an MDP are modeled as having a Dirichlet prior distribution, then this prior can be updated by condititioning it on observed transitions, thereby obtaining an updated Dirichlet posterior, and decision can be made to maximize expected reward with respect to the updated distributions. Ross et al. (2011) develop this approach in detail for MDPs and POMDPs.

Optimism under uncertainty (Auer et al., 2002). A useful paradigm for many adaptive decision optimization problems under uncertainty, including clinical trials, is the multiarmed bandit (MAB) class of decision problems. A “onearm bandit” is a slot machine: its single arm takes one’s money. Putting in a coin may result in a random payoff. The probability distribution for this reward is initially unknown, but it can be learned about gradually, for a cost, by continuing to play. A MAB problem generalizes this setup by having multiple machines. In applications, these mght represent the different arms or treatments of a clinical trial, variants of ads or of web pages in A/B testing, performances of different subcontractors or vendors in a supply chain, and so forth. The decision problem is to choose which machine to play next, given the history of tries and rewards so far. A relatively simple approach provides a solution with provably low regret, measured as the cumulative difference between the expected reward if the true best (highest average reward) option were always selected and the expected reward from the decisions actually made. The simple rule, called the upper confidence bound 1 (UCB1) algorithm, prescribes playing each machine or action once and then using available data to compute an upper confidence bound on the average reward from each machine or action j via the formula: UCB_{j} = m_{j }+ sqrt(2 logt/n_{j}), where sqrt is the square root function, t is the total number of trials for all machines so far, and n_{j }is the number of times action or machine j has been tried. It can be proved that the regret from using UCB1 grows no faster than roughly logarithmically – about as much in the first 10 decisions as in the next 90, and then in the next 900, and so on – and that no decision rule (including the optimal one) can guarantee regret that grows more slowly than this (Auer et al., 2002). The UCB1 algorithm illustrates the useful principle of optimism under uncertainty, i.e., always choose next the action that might be best, meanng that it has the highest upper confidence bound. Doing so is a lowregret strategy.

Probability matching and Thompson sampling (Ortega and Braun, 2013). Another simple but powerful decision rule for MAB problems and more general classes of adaptive decision and control problems, including Markov decision processes (MDPs), is probability matching. This selects each action according to the current estimated probability that it is the best one, meaning the one giving the highest averge reward. Thus, if an action is known to be best, it will always be selected; if it is know not be best, it will never be selected; and if it might be the best, it is selected with probability equal to the probability that it is best. A straightforward software implementation initially assigns each action a uniform prior probability distribution for the unknown probability that it is the best (i.e., has highest average reward). Its distribution for the probability of being best is then upated by conditioning on experience. Finally, on each move, the action to be chosen is decided by sampling from the current distributions for all actions and then taking the action with the highest sample value. In slightly more detail, for each action, the prior probability of being best is initialized to the unit normal distribution U[0, 1]. Its posterior distribution after it has generated S successes and F failures is the beta distribution with parameters S + 1 and F + 1. If the actions generate binary rewards (1 = win, 0 = lose), then the number of successes S for an action is just the number of wins it has generated so far. For an arbitrary reward distribution normalized to run from 0 = lowest reward to 1 = highest reward, the observed reward on any trial (a number between 0 and 1 on this scale) is scored as a success with probability equal to the reward, and as a failure otherwise (Agrawal and Goyal, 2012). This idea, known as Thompson sampling, was proposed in the 1930s to miminimize the number of patients who receive the less effective drug during a clinical trial. It has been rediscovered many times since, recently gaining prominence in artificial intelligence; it is also related to animal behaviors such as optimal foraging (Averbeck, 2015). Despite its simplicity, the performance of Thompson sampling, as measured by cumulative regret and speed of convergence to the optimal policy, compares favorably to UCB1 on many test problems.
MAB software for computing the probability of each action being best after any history is available at the following link:

https://cran.rproject.org/web/packages/bandit/bandit.pdf

Online decisionmaking using a model ensemble (Kalai and Vempala, 2005). An “online” decision problem refers to a sequence of decisions that must be made one at a time, without knowledge of the future or ability to backtrack and revise an earlier decision. Applications might include approving or rejecting loan applications, choosing which treatment to prescribe to patients at a walkin treatment center, or deciding which advertisement or version of a web page to display to visitors arriving at a web site. When no single best model describing the relation between acts and consequence (i.e., reward or loss) probabilities is known initially, an ensemble of possible models can be used instead. These might correspond to different statistical models, machinelearning algorithms, experts, or other sources of advice and recommendations. The followtheperturbed leader (FPL) strategy slightly modifies the idea of always choosing the action recommended by the model with the best empirical performance so far, by perturbing the cumulative reward for each model by a random amount before selecting the best one (i.e., the one with the greatest cumulative reward or the least cumulative loss). In many settings, the decisions recommended by FPL converge rapidly to those recommended by the best of the considred models, implying that little is lost in the long run by learning which model(s) to use based on their observed performance (Kalai and Vempala, 2005). Since forecasting can be viewed as a special type of decisionmaking in which the forecaster decides what prediction to make and the objective is to optimize some performance metric (such as minimizing mean squared prediction error or a weighted sum of type 1 and type 2 errors for classification), FPL and other algorithms with performance guarantees (e.g., provably low cumulative regret) can also benefit predictive analytics.

Responsesurface methods and Evolutionary Operations (EVOP). If experimentation is possible and relatively inexpensive, then a complimentary tactic is to systematically vary the decision variables and study resulting responses using industrial designofexperiments (DOE) techniques. The resulting data can then be used to fit an approximate statistical model, called a response surface model, showing the modelpredicted expected value of the response or outcome variable that one wants to maximize for each combination of values of the decision variables. Quadratic regression models are a popular choice for this purpose. The response surface model can then be used to estimate the combination of decision variables that will maximize the expected value of the response variable. Repeating this process sequentially by (a) collecting further data around the estimated optimum using a designed experiment that varies the input factors slighty around their current levels; (b) fitting a new response surface model; and (c) reoptimizing the combination of inputs can improve the response surface model and the optimization of decision variables. This approach, called Evolutionary Operation (EVOP), has been used since the 1950s in chemical and manufacturing industries to improve ongoing production operations without interrupting them (Box, 1957). When experiments are relatively costly or timeconsuming, an alternative is to use stochastic approximation algorithms (Dupac, 1965) or more recent stochastic gradient adaptive optimization algorithms (Lange et al., 2014) to adjust the inputs in order to move toward the optimum of the initially unknown (and perhaps gradually changing) response surface. The guiding metaphor for these approaches is to climb a hill of initially unknown shape by collecting data and then moving in the direction of the estimated greatest improvement. Response surface methodology (RSM) is supported by freely available software such as the rsm package in R, available at the following link:

https://cran.rproject.org/web/packages/rsm/rsm.pdf
These methods for deciding what to do when no causal model is available can be combined and extended in various ways. Doing so has led to a recent burgeoning of productive new methods for adaptive decisionmaking and learning in artificial intelligence and machine learning. For example, the UCB1 algorithm has been combined with Monte Carlo tree search (MCTS), yielding the “UCT” (UCB1 for trees) algorithm for learning to play games. UCT has been streamlined and further improved by incorporating valueofinformation (VOI) heuristics, and has been applied successfully to multiarmed bandits, Markov decision processes, and adversarial games (e.g., Tolpin and Shimony, 2012; Pepels et al., 2014). The original multiarm bandit problem has been generalized to settings where the reward distributions from different actions are not independent, are correlated with observable information (“contextual bandits”), change over time (“restless bandits”), or are manipulated by an intelligent opponent or attacker (“adversarial bandits”). Deep learning methods for succinctly describing and estimating stateaction values have been combined with variations on SARSA by several investigators (e.g., Ganger et al., 2016; Van Seijen et al., 2009). There are many other examples of the confluence of technical ideas leading to evermore powerful and robust optimization of decisions without models. These procedures are so effective that they are often used for simulationoptimization when models are known\. Both adaptive optimization of actionselection probabilities and EVOPlike designofexperiments and iterative improvement of estimated statistical models and optimization of decisions implied by the estimated models are widely used in current simulationoptimization algorithms and software (Amaran et al., 2016).
The success of these modelfree algorithms for adaptively learning to optimize decisions raises the questions of why and whether causal models are really needed in analytics. After all, if a learning algorithm can learn fairly quickly to make effective decisions without them, then why use them? Answers include the following.

For descriptive analytics, causal models help to describe, explain, and interpret observed data. The learning methods just described are blackbox methods, in the sense that they map data to decisions without trying to explain or interpret why the decisions make sense. Their sole goal is to discover how to optimize some objective, such as minimizing cumulative regret. By contrast, causal models and causal analytics methods, such as the Bayesian networks and algorithms introduced in Chapter 2, help to understand and visualize the probabilistic interdependencies (and independencies) among variables. They can also provide a most probable explanation (MPE) for observations in terms of the likely value of unobserved variables. This allows descriptive analytics to go well beyond simply providing statistical summaries or compact descriptions and visualizations of what is observed, but also to provide possible explanations, e.g., diagnoses for anomalies observed in patients or in systems. Such explanations carry implications for action that mere description without causal interpretation does not.

For predictive analytics, causal models can predict what will happen if interventions are undertaken in new settings. Neither the interventions nor the conditions under which they are applied need to have been seen before to make accurate probabilistic predictions using transport formulas (Bareinboim and Pearl, 2013; Lee and Honavar, 2013). Transport formula software is available at
https://cran.rproject.org/web/packages/causaleffect/causaleffect.pdf.

Similarly, for prescriptive analytics, causal models such as influence diagrams can optimize decisions for new situations and new interventions. This is something that blackbox datadriven methods that do not generalize beyond the datagenerating process for the observations at hand cannot do.

From the standpoint of causal analytics, algorithms that learn the conditional probability distributions or conditional expected values of rewards for different actions taken (or, more generally, different actions taken in the context of available observations) are in fact doing a form of causal discovery. This point of view is articulated further specifically for generalizations of Thompson sampling by Ortega and Vaughn (2014).

For evaluation analytics, causal models provide a principled way to compare what has actually happened to the full probability distribution for what could have happened, both under the conditions of the actions, policies, treatments, or interventions that were actually taken and under alternative, counterfactual conditions.
Thus, despite the exciting advances in machine learning applied to efficient discovery of lowregret decision rules, causal modeling and inference methods have valuable roles to play throughout these other parts of the risk analytics sequence.
Causality in Collaborative Analytics
Commercially available collaborative analytics information technologies (IT) platforms enable multiple data collectors, analysts, and decisionmakers to share data and insights, both throughout an organization and in physical or virtual “war rooms” and operations centers where information is displayed, analyzed, and acted upon by teams collaborating in real time. Causal analytics models can provide useful content and organizing structures to coordinate the collection, analysis, and display of information on such plaforms. Causal graph models such as causal Bayesian networks and influence diagrams provide useful conceptual and visual aids to help teams of collaborating analysts combine, coordinate, and synthesize the research and expertise of different subject matter experts (SMEs). They allow SMEs with complementary expertise about different parts of a problem – such as emission reductions technologies, costs, and health effects of pollution in Figure 1.5 – to develop the parts of the diagram and the corresponding conditional probability tables or models that each understands best, while understanding how the pieces fit together.
Causal Models in Game Theory: Normal and Extensive Forms, Characteristic Functions
More generally, effective cooperation across and collaboration within analytics centers requites principles and insights into how multiple agents can best work together to accomplish shared goals such as managing a complex system – whether a logistics network, a nuclear power plant, a commercial forest or fishery, or a multidivisional firm – to reduce risks and increase the production of desired outcomes. Many socioeconomic, technological, and business systems are governed by the interacting choices of multiple decisionmakers. The decisionmakers may having different information, different opportunities to take actions, and different individual preferences and values, beliefs, and incentives. For example, a business’s pollutant emissions may be regulated by a combination of Federal, state, county, and city authorities. It may also reflect past capital budgeting and investment decisions about which production and pollutionabatement technologies to purchase; ongoing operating and production decisions from multiple divisions; and remediation and cleanup decisions by various vendors and subcontractors hired by the business. The confluence of these many decision threads over time affects how much pollution is actually produced.
To understand how to improve decisionmaking by multiple decisionmakers in groups, teams, or organizations, it is necessary to extend normative principles for prescriptive analytics beyond the expected utility formalism (Equation 1.1) to allow for multiple agents. Normative principles that seem unexceptionable for individual decision makers need to be rethought for groups of decision makers. For example, an individual decision maker should never choose a strictly dominated strategy, meaning one that always yields a lowerutility outcome than some other feasible choice no matter how uncertainties are resolved. But the notorious Prisoner’s Dilemma shows that in situations with multiple decisionmakers and negative externalities, if each decisionmaker chooses an undominated strategy, then these choices can cause each to receive lowerutility outcomes than if each had instead chosen a dominated strategy (Thomas, 2003).
Cooperation and coordination of individual choices are required to avoid such Paretoinferior outcomes. With the need for coordination and cooperation come a host of new technical, practical, and conceptual issues, including how to model information and incentives; trust and reputation; credible commitments and signals of intent; asymmetric and private information; altruism and free riding; costly private or public monitoring and enforcement of agreements; and sharing of information, control, and rewards or losses. Game theory addresses these and related topics in detail (Thomas, 2003; Myerson, 1991; Osborne, 2004; Shoham and LeytonBrown, 2009). It provides both theory and algorithms for fair division of jointly produced gains; clarifies the existence, uniqueness, computational complexity, and computability of various proposed solutions to the joint decision problem for multiple agents (e.g., finding sets of choices such that no agent or coalition can increase its own expected reward by unilaterally changing its own decisions); and establishes impossibility results for design of collective choice mechanisms that simultaneously satisfy several desired properties, such as fairness (e.g., symmetric treatment of participants), efficiency, budget balance, and incentivecompatibility (e.g., voluntary participation and cooperation). In each of these areas, causal models relate the joint decisions of multiple agents to the joint probability distributions of their rewards. These models are essential for analysis and applications. Important special cases that focus on multiple agents cooperating to accomplish shared goals include the mathematical theories of teams and organizations (Marschak and Radner, 1972; Yüksel and Saldi, 2017) and of risk and reward sharing in investment syndicates (Wilson, 1968).
Game theory uses causal models that generalize normal form (decision table) and extensive form (decision tree) models from decision analysis to allow for multiple decision makers. These extensions yield payoff matrix models mapping ntuples of player choices to ntuples of resulting rewards or utilities,where n is the number of players; and game tree models in which different players are able to choose moves (branches) at different nodes, based on the information available to them at the time. The decision makers are usually called “agents” in the artificial intelligence and machine learning literature, or “players” in classical game theory terminology. They may have different initial information, including information about each others’s utilities, beliefs, intentions, capabilities, constraints, and resources (or, more generally, each other’s “types”), which they update over time by conditioning on their own observations. Private information arises when observations differ for different agents. Their actions jointly determine the probabilities of their individual rewards (often called “payoffs”in the older game theory literature) via the causal model. In addition to normal form payoff matrices and extensive form game trees used in analysis of noncooperative games, cooperative game theory uses causal models called “characteristic functions” that specify how much reward each subset or coalition of agents can obtain if it acts by itself (Myerson, 1991; Shoham and LeytonBrown, 2009). In most gametheoretic modeling, the relevant causal models (or “mechanisms”) mapping player choices to their outcome or reward probabilities are exogenously specified, sometimes by a social or institutional mechanism designer attempting to elicit deired behaviors from agents. In evolutionary game theory, there is no centralized mechanism designer, but mechanisms emerge ane evolve from the interactions of agents game theory. Most of the gametheoretic analysis then focuses on what the agents should choose to do, or on what behaviors and decision rules governing agent interactions (including shared norms and conventions) will emerge and persist, given the specified causal relations between choices (or behaviors) and their probable consequences. In practice, causal analytics methods and models applied to available data and knowledge may be needed to create usefully realistic gametheoretic models for supporting predictions and decisions.
Causal Models for MultiAgent Systems
The causal models discussed earlier in this chapter for prescriptive analytics also have generalizations for multiple cooperating decisionmaking agents. For example,

Influence diagrams can be generalized to multiagent influence diagrams (MAIDs), with gametheoretic reasoning then being executed by agents within the MAID graph (Koller and Milch, 2003). At present, the literature on MAIDs is relatively small.

Markov decision processes (MDPs) and partially observable MDPs (POMDPs) have been extended to multiagent systems (MAS). Multiagent planning and coordination problems, as well as multiagent team reinforcement learning problems, are often formulated as largescale MDPs or POMDPs, augmented with models describing observations and communication among agents, that can be solved using a combination of specialpurpose techniques (e.g., DAGbased factoring and function approximation methods) (Amato and Oliehoeck, 2015; Koller and Parr, 1999).

Optimal control problems, both deterministic and stochastic, have corresponding distributed control formulations when multiple sensors and controllers (agents consisting of software and actuators that they control) are distributed throughout a system and cooperate to manage its performance. The agents communicate with each other via a communication network to share information generated by their local sensing and control activities and to coordinate their actions to achieve common goals. Control is often arranged in a hierarchy. In a twolevel control hierarchy, a centralized planner collects information from each local controller and allocates tasks and goals (and in some cases, resources). The local controllers then use their own local information, resources, and control opportunities to pursue their assigned goals. Such hierarchical control architectures have been applied to control of electric power networks with randomly varying local loads and conditions (Ilic and Liu, 1996); to control of traffic signals which, in turn, provide distributed control of traffic flows in cities (Abdoos et al., 2014); and, more recently, to control of swarms of autonomous vehicles or drones engaged in searchandrescue or other missions (Gómez et al., 2016). In the latter application, when the swarms operate in uncertain and variable environments (e.g., with unpredictable wind gusts that can suddenly push a drone from its planned trajectory), the local controllers must respond quickly and competently to their changing local conditions. Bearing in mind the uncertainties and performance constraints the local controllers must confront, e.g., in navigating narrow spaces in the presence of disruptions from wind or from obstacles and from other drones, the centralized planner may assign longer but safer flight paths to some members of the swarm to reduce risks (Gómez et al., 2016). Conditions under which hierarchical control is more effective or less effective than purely decentralized control, in whch local control agents communicate with each other and form their own plans directly instead of going through a centralized planning and coordinating controller, are still being investigated in a variety of applications (Feng et al., 2017). The annual RoboCup competitions for robot soccer provide a useful experimental forum for evaluating different communication, coordination, and control protocols for teams of cooperating agents in uncertain, dynamic, and adversarial environments.

Reinforcement learning algorithms have been extended to multiagent reinforcement learning (MARL) algorithms for multiple cooperating agents (software programs) communicating with each other as they sense, act, and learn. Applications with substantial technical literatures include distributed sensing and control of urban traffic via communicating traffic signals; intelligent control of electric power production, distribution, storage, and routing in smartgrids; realtime control and efficiency improvement of industrial processes based on distributed sensors and actuators; and management of multiple autonomous unmanned aerial or ground vehicles, including drone swarms. MARL algorithms are included as components of many other hierarchical and distributed control systems for such applications.

Multiarmed bandit problems have been studied for teams of cooperating players who can try different arms and share information about what they have learned. For example, Chakraborty et al. (2017) compare centralized and decentralized control for teams of such agents when communication is costly and each agent in each period must choose whether to collect more information before broadcasting what it has discovered. They show that both centralized and decentralized control architectures lead to decisions with noregret properties for the communication cost model considered.

Simulation and simulationoptimization programs today routinely include simulation of the behaviors of multiple interacting intelligent agents. Applications are as diverse as managing aquaculture and protecting ecosystems more efficiently; countering piracy in commercial shipping by coordinating convoys and mutual help among ships; optimizing detection and responses to cyberattacks and fraud; designing and managing supply chains and logistics networks to achieve and maintain desired feasible levels of risk, profitability, performance, resilience to disruption; and other objectives; and changing layouts and operations to improve patient experiences in hospitals, customer experiences in shopping malls, and driver experiences on roads. Theory, practical software, and deployed applications of multiagent simulations and system design improvements based on simulationoptimization of multiagent systems (MAS) can be found in many sources. The annual Proceedings of the Winter Simulation Conference (WSC) (https://informssim.org/) is one of the best.
The artificial intelligence, machine learning, and control engineering literatures on cooperative multiagent control of uncertain systems and on coordination of team actions and communications in uncertain environments are vast. They include many elegant and profound results and frameworks, including exact forms of optimal control laws for each agent in some models and proposed formulations that apply to human organizations (Marschak and Radner, 1972) as well as to engineered systems (Ho and Chu, 1972). Causal modeling contributes several important components to this rich body of knowledge. First, causal models, when they are available, permit simulation (or, in special cases, analytic solutions) for predicting the outputs or output probabilities of a system for different inputs. Simulation, in turn, permits simulationoptimization of system designs, decision rules, and operating decisions to improve performance metrics and make preferred outcomes more likely, even in uncertain environments and even when multiple intelligent agents are involved in the operation or use of the system. Thus, causal models provide a foundation for improving the design and operation of many realworld systems. Second, the study of collaboration in multiagent systems (MAS) suggests new ways to manage a variey of risks using distributed sensors and controllers. The opportunities to use distributed communication and control systems to monitor and reduce risks and to improve performance in areas from production and logistics to transportation and energy infrastructures to improved healthcare and patient and customer experiences are already enormous. They are increasing as sensor networks, the internet of things, and ubiquitous access to data become increasingly widespread realities. Finally, the study of collaboration among software agents, each collecting and processing local data, updating its own beliefs about the causal relation between its actions (and perhaps also the actions of others) and resulting probabilistic changes in the world it acts on, taking actions based on its own information and beliefs, and communicating results within a team, may yield insights that help to improve distributed analytics and cooperative goalseeking, control of enterprises and systems, and risk management in human organizations. This was a goal of mathematical team theory more than half a century ago (e.g., Marschak and Radner,1972; Ho and Chu, 1972). Its fruition is starting to be seen today in applications
Conclusions: Causal Modeling in Analytics
This chapter has surveyed numerous causal models and methods for representing and calculating conditional probabilities of outcomes in response to changes in the inputs to a system or situation. They are summarized in Table 1.7. Their inputs can usefully be partitioned into those controlled by a decisionmaker – the decision variables or control variables, usually with values to be set or chosen by the decisionmaker to optimize expected utility or some other criterion, subject to feasibility constraints – and those not controlled by the decisionmaker. The uncontrolled inputs represent random variables, such as exogenous noise, shocks, or uncertain quantities, that affect the outcomes but that are not selected or set by the decisionmaker. Each causal model can be thought if as mapping its inputs into conditional probabilities for outputs, which include outcomes and rewards in all models, and also observations and changes of state in dynamic models.
A causal model links actions to their probable consequences and shows how other variables affect each other and the outcomes. Probabilistic dependencies among variables are expressed through conditional probability table (CPTs) or other conditional probability models. Causal models differ from statistical models by clearly distinguishes between effects of doing something (setting a controlled variable to a certain value) and of seeing something (observing that a variable has a certain value): acting on the world can cause the probability distributions of vatiables to change, but observing the world only allows us to update our beliefs about it (Pearl 2009). This fundamental distinction is discussed further in Chapter 2.
Table 1.7 Causal models relating decisions to probabilities of consequences
Causal Model

Controlled inputs (decisions)

Random outcomes

Decision table (normal form)

Row

Column (state of nature) and outcome

Decision tree (extensive form)

Branch at choice node

Branches taken at chance nodes; terminal node and reward received

Influence diagram

Decisions at choice nodes

Value of random variable at chance node

Markov decision process (MDP)

Act selected from current state

Next state, reward

SemiMarkov decision process

Act selected from current state

Dwell time in current state, next state, reward

Partially observable MDP (POMDP)

Act selected given current conditional probabilities of states

Next state, observation, reward

Deterministic optimal control

Input at each moment

State, output, and reward trajectories (deterministic)

Stochastic optimal control

Input at each moment

Output, state, and reward trajectories

Statistical decision theory

Decision rule or policy mapping observations to actions

State, observations, rewards

Simulation model (continuous or discreteevent)

Initial conditions, controllable inputs at each moment

Outcome trajectories

Multiarm bandit

Which act (arm) to choose next

Rewards

Response surface model

Next settings of inputs

Next output (e.g., yield)

Game theory models

Each player’s choice of strategy

Rewards to players

Multiagent systems (MAS)

Agent decision rules

Behaviors, rewards

Once a causal model for how actions affect outcome probabilities has been formulated, it can be used for a variety of purposes, including describing how variables depend on each other; identifying the most probable explanations for observations; comparing outcome probabilities under different counterfactual scenarios; predicting probabilities of outcomes for current or assumed initial conditions and decision rules; and recommending courses of action that make preferred outcomes more likely to occur. To extract useful advice from causal models on what to do next, it is necessary to have solvers that use them to determine the feasible settings of decision variables that cause the most desirable probability distribution of outcomes. A variety of algorithms are available for using the causal models in Table 1.7 to identify acts or decision rules that optimize various criteria, such as maximizing expected utility or average reward per period or expected net present value of future rewards; or minimizing expected loss or cumulative regret. Approaches introduced in this chapter include the following:

Static optimization: Suppose that u(a, s), represents the expected utility of the consequence of choosing act a if the state of the world is s, where a is an act or decision rule in a feasible choice set A and s is in a set S of possible states with a known probability distribution or probability measure (which may in general depend on the choice of a). Given this model, optimization techniques can be used to search for a choice in A that maximizes the expected value of u(a, s), i.e., expected utility. In the simplest case of a small decision tablewith only a few rows and columns, where A is the set of rows and S is the set of columns, bruteforce evaluation of the expected utility for each row reveals the optimal choice. In many operations research and risk management problems, however, the choice set A includes one or more continuous decision variables, and perhaps some discrete ones as well, and the set A is specified via a set of constraints determining jointly feasible choices of their values. Mathematical programming algorithms for constrained optimization, such as linear, nonlinear, or mixed integer programming solvers, are used to solve such problems.

Stochastic dynamic programming (SDP): In a decision tree or game tree representing a sequential decision problem, classical (backward) dynamic programming, beginning with the rewards at the terminal nodes (“leaves”) of the tree and working back by calculating expected utilities at chance nodes and maximizing them at choice nodes, provides an effective procedure for identifying the optimal initial choice, i.e., the one that maximizes expected utility when all subsequent decisions are made optimally. Dynamic programming is also used to formulate the Bellman equations for optimizing decisions in Markov decision processes (MDPs) and partially observable MDPs (POMDPs) and the HamilitonJacobiBellman equations for optimal control of deterministic and stochastic dynamic systems.

Optimal control, dynamic and adaptive optimization, and reinforcement learning: Mathematical analysis using dynamic optimization methods leads to closedform optimal decision rules for many optimal control problems, for both deterministic and stochastic dynamic systems. It also yields insights into the qualitative characteristics of optimal policies, such as the existence of thresholds or control limits for observed attribute values that should trigger specific actions such as sparing or harvesting. Numerical optimal control algorithms and software packages are increasingly available for optimizing multiple controlled inputs together over time when the equations representing a causal model of a dynamic system are known. When they are not, reinforcement learning approaches such as the SARSA algorithm for MDPs or various POMDP solvers can be used to discover effective decision rules for systems with uncertain initial conditions and state transition dynamics. Optimal control methods and MDP and POMDP models and solvers have a wide range of practical applications, such as the following:

Harvesting and replenishing a forest or other renewable resource over time

Managing a perishable inventory with timevarying demands and values

Timing the storage and release of water behind a hydroelectric dam to take advantage of changing hourly electricity prices

Scheduling inspections of components in a complex industrial plant or infrastructure network to increase reliability and reduce failure risks

Scheduling the screening, treatments, and release of patients from a hospital.
Adaptive optimization principles such as optimism under uncertainty (e.g., the UCB1 algorithm), probability matching (via Thompson sampling), and “followtheperturbedleader” (FPL) algorithms for online prediction and decision problems have proved surprisingly effective in many settings for yielding highquality, lowregret approximations to the optimal control laws that would be used if valid causal models were known. These adaptive optimization techniques can be viewed as learning causal models on the fly for the effects of acts on outcome probabilities in different situations.

Markov Chain Tree Search (MCTS): For decision or game trees that are too large to be fully generated and evaluated by dynamic programming, MCTS provides a practical heuristic alternative. MCTS emphasizes forward search into a partly known tree, rather than backward optimization starting from the tips of a fully known one. MCTS heuristics have been incorporated into algorithms for solving largescale POMDPs.

Influence diagram solvers use graphtheoretic algorithms (e.g., bucket elimination, which is closely related to nonserial dynamic programming) to solve for the optimal decisions at choice nodes.

Simulationoptimization algorithms combine numerical optimization techniques such as gradient descent, statistical techniques such as response surface methodology, smoothing and aggregation techniques such as deep learning, and computational Bayesian sampling and updating methods such as particle filtering, to seek combinations of controllable inputs over time that optimize the performance metrics for a simulated system.

Multiagent Systems (MAS) algorithms extend simulation and simulationoptimization concepts to teams or systems of interacting agents, each of which uses its own information to decide what to do and what to communicate. Studying the behavior of such systems in response to different communication and control protocols and different simulated environmental conditions can suggest system design and operating principles to improve the performance, reliability, and safety of real systems.
Mastering any of these techniques provides valuable skills for formulating and solving certain types of analytics problems using appropriate causal models. Their learning curves are substantial, however, and thorough understanding usually requires experience applying them. The references at the end of this chapter provide useful points of entry to the relevant technical literatures. Fortunately, modern objectoriented software makes it possible to encapsule much of the detailed knowledge and expertise needed to apply these algorithms correctly to appropriate causal models or data.
Causal modeling is used throughout the rest of risk analytics to help describe and interpret data and to predict, optimize, and evaluate the impacts of risk management decisions (Table 1.6). Chapter 2 examines more closely the multiple meanings of “cause” and “causality” and discusses the senses that are most useful for analytics tasks such as prescription and evaluation. It also introduces techniques and algorithms for learning causal models from data and for using them to carry out other analytics tasks. The emphasis is largely, but not exclusively, on causal Bayesian networks (BNs) and other models that can be represented as BNs. Such models turn out to be both relatively tractable to learn from data in many practical applications, and also highly useful for insightful description, prediction, prescription, and evaluation of the effects of decisions and interventions. Chapter 2 introduces the main ideas of causal BNs and related causal modeling methods and describes free software implementations to enable practitioners to apply them.
REFERENCES FOR CHAPTER 1
Abdoos M, Mozayani N, Bazzan ALC. (2014) Hierarchical control of traffic signals using Qlearning with tile coding. Applied Intelligence March 40(2): 201213
Agrawal S, Goyal N. (2012) Analysis of Thompson sampling for the multiarmed bandit problem. Journal of Machine Learning Research: Workshop and Conference Proceedings 23(39): 39.1–39.26. http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf
AkhavanTabatabaei R, Sánchez DM, Yeung TG. A Markov Decision Process Model for Cervical Cancer Screening Policies in Colombia.Med Decis Making. 2017 Feb;37(2):196211. doi: 10.1177/0272989X16670622.
Amacher GS, Ollikainen M, Koskela E. (2009) Economics of Forest Resources. The MIT Press. Cambridge, MA
Amato C., Oliehoeck (2015). Scalable planning and learning for multiagent POMDPs. Proceedings of the TwentyNinth AAAI Conference on Artificial Intelligence: 19952002. www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/9889/9495
Amaran S, Sahinidis NV, Bikram S, Bury SJ. (2016) Simulation optimization: A review of algorithms and applications. Annals of Operations Research. May 240(1): 351–380
Dostları ilə paylaş: 