Conditional probability tables (CPTs) for the outcomes and observations, P(c | a, s) and P(y | a, s), respectively. If the data observed does not depend on the act, then the CPT for y simplifies to P(y | s).
A utility function for consequences, u(c)
A marginal probability distribution for the state, P(s).
These elements complete the specification of the Bayesian statistical decision theory problem shown in the DAG. The marginal distribution for the state, P(s), is also called the prior distribution for the state, as it summarizes knowledge or beliefs about the state, expressed via probabilities for the possible values of the state variable, prior to conditioning on the data, y. The standard solution approach to such a statistical decision theory problem, as detailed at the beginning of Chapter 2, is to use Bayes’ Rule to infer updated probabilities for the state after conditioning on the observed data. This updated distribution of state probabilities, denoted by P(s | y), is called the posterior distribution for the state, because it is formed after observing the data. Together with the utility function u(c) and the conditional probabilities of consequences specified by the causal model CPT, P(c | a, s), the posterior distribution allows the expected utility of each act to be evaluated after taking the data into account, as follows:
EU(a | y) = Σ_c u(c)P(c | a, y)   (1.10a).
The conditional probability of each consequence, c, given act a and observed data y, in turn, is found from the causal model P(c | a, s) via an application of the law of total probability:
P(c | a, y) = Σ_s P(c | a, s)P(s | y)   (1.10b)
Finally, the posterior probability distribution for s is given by Bayes’ Rule applied to the observation model P(y | s) and the prior probabilities for states, P(s):
P(s | y) = P(y | s)P(s) / Σ_s′ P(y | s′)P(s′)   (1.11).
where the sum in the denominator is taken over all possible states. (As usual, if the state is a continuous variable, then sums are replaced by integrals and discrete state probabilities are replaced by probability density functions.)
If the d.m.’s choice of act affects the data collected, so that an arrow from a to y is added to DAG model (1.9), then appropriate act-specific conditional probabilities must be used:
P(s | y, a) = P(y | a, s)P(s) / Σ_s′ P(y | a, s′)P(s′)   (1.11a).
Likewise, if the state probabilities also depend on the act, then the calculation would be:
P(s | y, a) = P(y | a, s)P(s | a) / Σ_s′ P(y | a, s′)P(s′ | a)   (1.11b).
None of these variations changes the basic idea, which is to choose an act that will maximize expected utility after state probabilities are updated based on observations using Bayes’ Rule and appropriate conditional probabilities.
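The Bayes-update-then-maximize recipe of equations (1.10a)-(1.11) can be sketched for a small discrete problem. All of the numbers below (states, acts, prior, likelihood, and utilities) are hypothetical illustrations, not taken from the text; for brevity, the consequence c is identified with the pair (a, s), so the utility table plays the role of u(c) combined with P(c | a, s).

```python
import numpy as np

# Hypothetical two-state, two-act, two-observation problem.
states = ["good", "bad"]
acts = ["continue", "repair"]
observations = ["pass", "fail"]

prior = np.array([0.8, 0.2])                  # P(s)
likelihood = np.array([[0.9, 0.1],            # P(y | s): rows = states, cols = y
                       [0.3, 0.7]])
utility = np.array([[100.0, -50.0],           # u(a, s): rows = acts, cols = states
                    [ 60.0,  55.0]])

def posterior(y_idx):
    """Bayes' Rule (1.11): P(s | y) proportional to P(y | s) P(s)."""
    unnorm = likelihood[:, y_idx] * prior
    return unnorm / unnorm.sum()

def expected_utility(a_idx, y_idx):
    """Equations (1.10a)-(1.10b), with consequences determined by (a, s)."""
    return float(utility[a_idx] @ posterior(y_idx))

for y_idx, y in enumerate(observations):
    eus = [expected_utility(a, y_idx) for a in range(len(acts))]
    best = acts[int(np.argmax(eus))]
    print(f"y={y}: P(s|y)={posterior(y_idx).round(3)}, EU={np.round(eus, 1)}, best act: {best}")
```

On this toy model, a passing observation leaves the "good" state highly probable and favors continuing, while a failing observation shifts enough posterior mass to "bad" to make the costly repair the expected-utility-maximizing act.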
When the sets of possible acts, states, and observations are all small and discrete, so that explicit probability tables (CPTs) can be developed, the entire process of evaluating the expected utility for each act and recommending an optimal act given any observed (or assumed) observations can be automated. Bayesian network (BN) and influence diagram (ID) software, introduced in Chapter 2, automates the required calculations, producing optimal statistical decision rules mapping observations to decisions (y to a) and calculating their expected utilities. Value of information (VOI) calculations of how expected utility changes if more data are collected to inform decisions can also be used to optimize sample size decisions and experimental designs. Chapter 2 provides some simple examples of optimal statistical decision-making using influence diagram software for causal modeling and Bayesian inference from observations to optimize decisions.
Among the very many practical applications of Bayesian statistical decision theory are the following:
Statistical estimation: The observations y consist of sample data. The unobserved state, s, is a population parameter of interest, such as the mean value of an attribute or the proportion of individuals with a certain characteristic in the sampled population. The act a is an estimate of s. An optimal decision rule selects a to minimize the expected value of a loss function L(a, s) or, equivalently, to maximize the expected value of a utility function that depends on the act and state. Formulating statistical estimation problems in these Bayesian decision-theoretic terms provides an explicit rationale for many traditional statistical procedures. For example, if the loss function from the error in a point estimate of an uncertain quantity is quadratic, L(a, s) = (a − s)², where s is the true but unknown quantity and a is the point estimate decided on, then the Bayesian optimal statistical decision is to use the mean of the posterior distribution of s, conditioned on all available data, as the point estimate, a. This is the best estimate, in the sense that it minimizes expected loss, i.e., mean squared prediction error. If the loss function is the absolute value of the difference between the true and estimated values, then the optimal statistical decision is to use the median of the posterior distribution as the best (expected loss-minimizing) estimate. Such results show conditions under which different traditional estimators, such as the mean and median, are optimal.
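The optimality of the posterior mean under quadratic loss and the posterior median under absolute loss can be checked numerically. The "posterior" sample below is an arbitrary skewed distribution chosen only for illustration; for a skewed posterior the two optimal estimates differ visibly.

```python
import numpy as np

# Illustrative check: on a skewed "posterior" sample, the expected squared loss
# is minimized near the posterior mean, and the expected absolute loss near the
# posterior median.
rng = np.random.default_rng(0)
post_sample = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # skewed posterior

candidates = np.linspace(0.1, 3.0, 300)
sq_loss = [np.mean((a - post_sample) ** 2) for a in candidates]
abs_loss = [np.mean(np.abs(a - post_sample)) for a in candidates]

best_sq = candidates[int(np.argmin(sq_loss))]
best_abs = candidates[int(np.argmin(abs_loss))]
print(f"argmin squared loss ~ {best_sq:.2f}, posterior mean = {post_sample.mean():.2f}")
print(f"argmin absolute loss ~ {best_abs:.2f}, posterior median = {np.median(post_sample):.2f}")
```

For a lognormal posterior the mean (about 1.65 here) exceeds the median (about 1.0), so the two loss functions recommend noticeably different point estimates.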
Change-point analysis (CPA): The hidden state, s, describes a data-generating process, and the inference task is to infer from data whether, when, and how the data-generating process has changed. The data y consist of time series observations. The decision variable could specify a classification of the process (e.g., changed vs. not changed) or an intervention to be taken when it is concluded from the data that the process being observed has (probably) changed.
Statistical quality control: The hidden state is the true quality (e.g., probability of producing a conforming item), the data consists of observed results of inspections, and the decision rule specifies whether to perform a costly intervention (e.g., replacing or recalibrating a machine used in the production process) based on the sample data.
Lot acceptance sampling: The hidden state is the true proportion of defectives in a lot or consignment of items received for approval. The act a specifies a sampling plan and a decision rule specifying whether to accept or reject the lot (or, in multi-stage plans, whether and how to sample further before making a final accept/reject decision) based on the results of sampling so far, y.
Diagnosis: The unobserved state, s, is the true state of a patient or system being examined. The act a specifies a label (e.g., a disease name or a classification or description of the system’s state). An optimal decision rule minimizes the expected cost from misclassification errors. In adaptive decision support systems, the optimal decision rule also specifies what tests to perform next, if any, given the results seen so far and the costs and statistical characteristics of available tests.
Pattern recognition: This is a generalization of diagnosis. Pattern recognition, or optimal statistical classification, seeks to classify or label the underlying state of a system or situation based on observations whose probabilities depend on the state. The ability to correctly classify or recognize the state from the observed data with high probability has applications in speech, face, and handwriting recognition, fraud detection, oil exploration, and countless other areas of science, technology, and engineering, in addition to medical diagnosis.
Statistical forecasting: Observations up to the present constitute the data, y. The act or decision consists of a prediction for one or more future observations. An optimal prediction rule minimizes the expected loss from prediction errors. Some forecasting systems, called state space models, use the observations y to estimate the current state. If they are not already known, the state dynamics equations describing the evolution of the state over time, such as (1.6a), (1.7) or the state transition probabilities of a Markov chain, depending on the system being studied, are also estimated from data, along with the observation equation (1.8). The state dynamics equations are then applied to the estimated state to predict future states. This approach only coincides with the Bayesian statistical decision theory approach under special conditions, such as for “linear quadratic Gaussian” (LQG) systems with state dynamics described by linear ODEs (or difference equations, in discrete time), normally distributed (“Gaussian”) additive observation errors, and quadratic loss functions. Moreover, identification of a unique estimated state (and, if they are not already known, dynamic evolution equations for the state) from available observations is not always possible (the “identification problem” for systems). Even when the state estimation problem is well posed and uniquely solvable from the data, the computational challenge of calculating the posterior probability distribution for states – or even its mean – can be formidable. Efficient recursive updating formulas are available in special cases (most notably, the celebrated Kalman filter for LQG systems widely used in aeronautics, electrical engineering, automatic control, and many other areas (Zarchan and Musoff, 2015)). But in general, tractable computation of posterior probability distributions for the hidden state requires using approximation techniques such as Sequential Monte Carlo (SMC) sampling methods.
These include popular “particle filtering” algorithms that draw weighted random samples (“particles”) from the current distribution, propagate them through the system dynamics equations, update the samples and weights by conditioning on observations, and then resample using sequential importance sampling (SIS) and sequential importance resampling (SIR) techniques to yield computationally manageable samples representing the posterior distribution (Doucet and Johansen, 2008). For practitioners, particle filtering can be easily implemented using software such as the packages pomp and SMC in R.
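A bootstrap (SIR) particle filter of the kind described above can be sketched in a few lines. The random-walk state model, noise levels, and particle count below are all hypothetical choices for illustration, not from the text.

```python
import numpy as np

# Minimal bootstrap (SIR) particle filter sketch, assuming a random-walk hidden
# state x_t = x_{t-1} + process noise, observed as y_t = x_t + observation noise.
rng = np.random.default_rng(42)
T, n_particles = 50, 2000
process_sd, obs_sd = 0.3, 0.5

# Simulate a "true" hidden trajectory and noisy observations of it.
true_x = np.cumsum(rng.normal(0, process_sd, size=T))
y = true_x + rng.normal(0, obs_sd, size=T)

particles = rng.normal(0, 1.0, size=n_particles)  # prior draws for x_0
estimates = []
for t in range(T):
    # Propagate particles through the state dynamics.
    particles = particles + rng.normal(0, process_sd, size=n_particles)
    # Weight each particle by the observation likelihood P(y_t | x_t).
    weights = np.exp(-0.5 * ((y[t] - particles) / obs_sd) ** 2)
    weights /= weights.sum()
    estimates.append(float(weights @ particles))  # posterior-mean estimate
    # Resample (SIR) to avoid weight degeneracy.
    particles = rng.choice(particles, size=n_particles, p=weights)

rmse_filter = np.sqrt(np.mean((np.array(estimates) - true_x) ** 2))
rmse_raw = np.sqrt(np.mean((y - true_x) ** 2))
print(f"RMSE of filtered estimates: {rmse_filter:.3f} vs raw observations: {rmse_raw:.3f}")
```

For this linear-Gaussian toy model a Kalman filter would give the exact posterior; the particle filter's advantage is that the same propagate-weight-resample loop still works when the dynamics or noise are nonlinear or non-Gaussian.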
Adaptive optimal control (Sutton et al., 1992): Observations that are affected by the underlying state of a system are fed into a feedback control decision rule that, in turn, determines what sampling and control actions to undertake next. The decision rule is adjusted over time in light of observations to try to optimize an objective function, e.g., maximizing expected utility or minimizing expected loss.
POMDPs: A POMDP has a hidden state that evolves according to a stochastic causal law such as (1.7c) and that is observed via probabilistic signals as in equation (1.8). An optimal decision rule specifies what act to choose in each period, given the observations, to optimize an objective function.
Sequential comparison and selection: The hidden state is the true average reward or success probability for each of multiple alternatives. These could be treatment plans, advertising campaigns, product designs, or other alternatives being compared in order to select the best one, i.e., the one that generates the highest average reward or success probability. The observations consist of the history of trials so far, i.e., choices made and resulting rewards; these constitute samples from the true but unknown reward distributions for the different alternatives. A decision rule specifies what to try next, given the results so far. An optimal decision rule minimizes the expected sum of costs or losses from sampling and from the opportunity costs (or “regret”) of trying less-than-optimal alternatives. A/B testing in marketing science, clinical trials in medicine, and multi-arm bandit (MAB) problems in statistics and machine learning are examples of such sequential comparison and selection problems.
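One standard Bayesian decision rule for such sequential selection problems is Thompson sampling, which chooses each arm with probability equal to its posterior probability of being best. The sketch below uses hypothetical arm success probabilities and conjugate Beta-Bernoulli posterior updating.

```python
import numpy as np

# Thompson sampling for a Bernoulli multi-arm bandit (illustrative parameters).
rng = np.random.default_rng(1)
true_p = np.array([0.3, 0.5, 0.6])       # unknown state: success probability per arm
alpha = np.ones(3)                        # Beta(1, 1) prior on each arm's p
beta = np.ones(3)

pulls = np.zeros(3, dtype=int)
for _ in range(5000):
    # Sample a plausible p for each arm from its posterior; pull the apparent best.
    arm = int(np.argmax(rng.beta(alpha, beta)))
    reward = rng.random() < true_p[arm]
    # Conjugate Beta-Bernoulli posterior update for the pulled arm.
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

print("pulls per arm:", pulls)            # pulls concentrate on the best arm
print("posterior means:", (alpha / (alpha + beta)).round(3))
```

Early on the rule explores all three arms; as posteriors sharpen, pulls concentrate on the truly best arm, trading off sampling cost against the regret of trying inferior alternatives.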
The huge variety of important and useful applications of Bayesian statistical decision theory all have the following common simple causal structure. A hidden state that is not directly observed causes observations, in the sense that the probabilities of the observations are affected by the state. Also, the current state together with a choice of current act cause (i.e., determine the values of, or the probabilities of) outcomes of interest, typically the next state and rewards or losses. Given observed histories of acts and observations, the d.m. seeks to choose next acts to optimize the expected value of an objective function. The expected value is calculated using posterior probabilities that are conditioned on observations via Bayes’ rule. This combination of concepts and methods, which models causality via conditional probability relationships among acts, states, observations, and outcomes and then uses them to draw inferences and optimize actions, powers much of modern signal processing, artificial intelligence, automated control, telecommunications, and industrial engineering technology.
Simulation-Optimization

In many applications, conditional probabilities of outcomes caused by different acts are not known or specified as part of a decision problem. They must instead be estimated from experience or from a model. For a system that is understood in enough detail so that its behavior can be modeled accurately via a discrete-event simulation (DES) or continuous stochastic simulation model, the probabilities of outcomes caused by different combinations of values for the controllable inputs can be estimated via simulation. Optimization algorithms can then be applied to the simulation model to seek the best policies or courses of action. For example, a discrete-event simulation model might be composed by linking stochastic models for the transition rates between “working” and “failed” states for the components and subsystems of a reliability system, with failure and repair rates for each component and subsystem being determined by the states (working or failed) of its predecessors in a dependency graph (DAG) model. The input data for such a model include:
Spontaneous failure rates and repair rates – or, more generally, state transition intensities – for the “basic events,” i.e., those without predecessors;
Conditional failure and repair rates, or state transition intensities, for each component or subsystem, given the states of its predecessors.
As outputs, such a model can be used to simulate the probability distribution of the random time until occurrence of an undesirable “top event” such as catastrophic system failure, given different maintenance and repair policies as controllable decision inputs. More generally, simulation can be used to estimate the conditional probabilities of failure times for the system, or for the remaining random time until its first passage into a specified target set of states, given the occurrence or non-occurrence of various events such as failures of its components or subsystems. Such conditional probabilities, in turn, can be used to help figure out which subset of costly improvements in component-level reliabilities would most improve the reliability of the system as a whole for the dollars spent, although true optimization usually requires special techniques of combinatorial optimization for reliability systems. Special techniques, such as importance sampling and subset simulation, have been developed for efficient simulation of rare events such as failures in highly reliable systems (Beck and Zuev, 2015).
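A stripped-down version of such a reliability simulation can be sketched directly. The example below (with made-up failure rates, no repairs, and a two-component parallel system rather than a full dependency graph) estimates the probability of the "top event" of system failure before a mission time, and compares two candidate component-improvement options.

```python
import numpy as np

# Monte Carlo sketch: a two-component parallel system fails only when both
# components have failed. Estimate P(system failure before mission time T)
# under different component failure rates (all rates hypothetical).
rng = np.random.default_rng(7)
T_mission, n_sims = 10.0, 200_000

def p_system_failure(rate_a, rate_b):
    # Exponential times to failure for each component (no repair, for brevity).
    t_a = rng.exponential(1.0 / rate_a, size=n_sims)
    t_b = rng.exponential(1.0 / rate_b, size=n_sims)
    system_failure_time = np.maximum(t_a, t_b)   # parallel: last failure ends system
    return float(np.mean(system_failure_time < T_mission))

baseline = p_system_failure(0.2, 0.2)
improve_a = p_system_failure(0.1, 0.2)           # upgrade component A only
print(f"P(fail < {T_mission}) baseline: {baseline:.3f}, with upgraded A: {improve_a:.3f}")
```

Here the simulated baseline probability can be checked against the closed-form answer (1 − e^(−0.2·10))² ≈ 0.748; for systems with repairs and dependency structure no such closed form exists, which is where simulation earns its keep.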
Simulation and optimization are often combined into powerful simulation-optimization algorithms that automatically vary controllable inputs (the values of “decision variables” or “control variables”), estimate expected utilities or other performance measures via simulation, and then adjust the controllable inputs to increase expected utility – or, equivalently, to reduce expected loss – until no further improvement can be found (Amaran et al., 2016; Fu, 2015). A relatively recent class of simulation-optimization algorithms, Monte Carlo Tree Search (MCTS), has led to breakthrough advances in difficult decision problems such as those that arise in large POMDPs and in world champion-level games of Go. MCTS first evaluates potential next acts by running multiple probabilistic (Monte Carlo) simulations to see what might happen after each is selected, and then chooses the act that has the best evaluation averaged over multiple simulations (Silver and Veness, 2010; Fu, 2016). Causal knowledge about what can happen, as represented by the rules of a game or by the state transition probabilities of a POMDP, for example, constrains the simulated possible futures. MCTS typically proceeds via the following systematic four-step process, carried out in a search tree in which the nodes represent possible states and the branches (arcs) leaving a node represent possible acts that can be taken from it (Browne et al., 2011):
Select a node to “expand,” i.e., to evaluate further by generating and evaluating new child nodes for it;
Expand the search tree by generating additional acts and new child nodes, i.e., states, to which they lead, following the ones that have already been considered;
Simulate: Evaluate a newly added node by simulating multiple random feasible trajectories (sequences of states and acts) forward from it, choosing subsequent acts via a default policy (possibly just via random selection) where needed, and using the causal model to simulate outcomes. Stop each simulation when it reaches a terminal state (e.g., a winning or losing position in a game or an absorbing state of an MDP or POMDP) or when an evaluation horizon – that is, a maximum desired depth for the search tree – is reached. Then average the resulting rewards from the simulated trajectories following the new act being evaluated. This phase of the MCTS is also sometimes called “policy rollout” since it evaluates a node by simulation (“rolling forward”) using a default policy to select acts and using a causal model to generate resulting states, observations, and rewards.
Backpropagation: Update the estimated values of the predecessor acts that lead to the newly evaluated act to reflect the simulated average reward from that point forward.
At the end of this process, the next act chosen is the one with the highest estimated value. Although this basic structure is simple, combining Monte Carlo simulation with decision tree and game tree search and evaluation principles that have been known for decades, it has proved extremely powerful for solving problems that were previously computationally intractable (Fu, 2016). In practical implementations, quite sophisticated algorithms and heuristics may be used to carry out each step. For example, in applications to large-scale POMDPs, it is common to use particle filtering to represent and update beliefs about the hidden state, and ideas from reinforcement learning and multi-arm bandit problems to optimize search priorities (Katt et al., 2017). Some of these ideas are discussed later in the section on causality in learning analytics, which addresses the problem of decision-making with unknown causal models.
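The four MCTS steps can be made concrete in a compact UCT-style sketch. The game below is invented for illustration: starting from a number, the player adds 1 or 2 each turn; landing exactly on 10 scores 1, overshooting to 11 scores 0. The selection rule, expansion, random rollout, and backpropagation correspond directly to the steps listed above.

```python
import math, random

# Minimal MCTS/UCT sketch on a toy deterministic game (hypothetical example).
random.seed(0)
ACTS = (1, 2)
TARGET = 10

def step(state, act):
    s = state + act
    return s, (1.0 if s == TARGET else 0.0), s >= TARGET

class Node:
    def __init__(self, state):
        self.state, self.children, self.visits, self.value = state, {}, 0, 0.0

def rollout(state):
    """Simulate: random default policy forward to a terminal state."""
    while state < TARGET:
        state, reward, done = step(state, random.choice(ACTS))
        if done:
            return reward
    return 1.0 if state == TARGET else 0.0

def mcts_choose(root_state, n_iters=2000, c=1.4):
    root = Node(root_state)
    for _ in range(n_iters):
        node, path = root, [root]
        # 1. Select: descend via UCB until a node with an untried act or a terminal node.
        while len(node.children) == len(ACTS) and node.state < TARGET:
            node = max(node.children.values(),
                       key=lambda ch: ch.value / (ch.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
            path.append(node)
        if node.state < TARGET:
            # 2. Expand: add one untried child node.
            act = next(a for a in ACTS if a not in node.children)
            child_state, reward, done = step(node.state, act)
            child = Node(child_state)
            node.children[act] = child
            path.append(child)
            # 3. Simulate from the new node (terminal children keep their reward).
            value = reward if done else rollout(child_state)
        else:
            value = 1.0 if node.state == TARGET else 0.0
        # 4. Backpropagate the simulated value along the path to the root.
        for n in path:
            n.visits += 1
            n.value += value
    # Choose the most-visited act at the root.
    return max(root.children, key=lambda a: root.children[a].visits)

print("best act from state 9:", mcts_choose(9))
```

From state 9 only adding 1 wins, and from state 8 adding 2 wins immediately, so the search concentrates its visits accordingly; in real applications the same loop is driven by a learned or specified causal model of state transitions rather than toy game rules.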
At the other end of the spectrum of complexity for simulation-optimization are problems in which both simulation and optimization are straightforward and no sophisticated algorithms are needed, other than the sampling methods already incorporated into basic Monte Carlo uncertainty analysis software. Such relatively simple, straightforward modeling can still be extremely useful for evaluating and improving proposed policies when enough is known to develop credible simulation models to assess the (approximate) probabilities of different outcomes for each policy alternative via Monte Carlo simulation.
Example: An Influence Diagram (ID) DAG Model for Optimizing Emissions Reductions

Influence diagrams (IDs) provide a well-developed prescriptive analytics framework that is appropriate for many decision analysis problems. Figure 1.5 shows an example of an ID drawn using the Analytica software package (Morgan and Henrion, 1990). This ID model illustrates considerations that might be used to support a policy decision about how much to reduce air pollution emissions. It consists of nodes, representing variables, and arrows between nodes representing dependencies between the corresponding variables. Several different types of nodes are allowed in an ID, as follows.
Fig. 1.5 An Analytica® influence diagram (ID) for optimizing emissions reduction
Source: Analytica® web site (http://wiki.analytica.com/index.php?title=Example_Models_and_Libraries)
Decision nodes. The green rectangular node at the upper left of Figure 1.5 labeled “Emissions Reduction” is a decision node representing a decision variable. The decision-maker chooses the value for such a variable from a set of possible values. In general, an ID may have multiple decision nodes, although Figure 1.5 has only one. Decision nodes are also called choice nodes.
Value node. The pink hexagon at the lower right of Figure 1.5 labeled “Total Cost” is a value node. This is the ultimate output of the decision problem: a measure of the overall impact of the decisions made. Because it is an output, it has only inward-pointing arrows in the diagram. The decision-maker seeks to set the values of decision variables to optimize the expected value of the value node, meaning its average value over many simulation runs of the decision model. Using expected values allows for the possibility of a random relationship between choices and outcomes. This is realistic if other uncertain quantities (modeled as random variables) also affect the outcome. If the value node represents total cost or loss, as in this case, then the challenge for decision analysis is to find values for the decision variables to minimize the expected value of the value node. If the value node represents utility, then the goal is to make decisions to maximize its expected value. In standard operations research and optimization terminology, the value node represents the objective function to be optimized in the decision problem.
Chance nodes (yellow ellipses) represent uncertain quantities, modeled as random variables. For example, several of the input nodes (defined as nodes with only outward-pointing arrows, also called exogenous nodes) around the periphery of Figure 1.5, such as Control Cost Factor and Base Concentration, may be uncertain quantities. In this case, a probability distribution for the value of each uncertain input quantity is specified as part of the ID model. A special case of a chance node is a deterministic node that assigns probability 1 to a specific value: constants are included in this way.
Derived nodes. Nodes (i.e., variables) with inward-pointing arrows have values that depend on the values of the variables that point into them. For example, in Figure 1.5, the value of Health Damage depends on the values of Concentration, Threshold, and Health Damage Factor, each of which may be a random variable. Rectangular boxes with rounded edges are used to represent derived variables in Figure 1.5. The value of a derived node may be a deterministic function of the values that point into it, but a useful generalization is to allow a derived node to represent a random variable with a conditional probability distribution that depends on the values of the variables pointing into it. The variables that point into a given variable are called its parents, and the given variable is called a child of any of its parents; thus, an arrow runs from each parent into each of its children. In Figure 1.5, Excess Deaths has two parents, Population Exposed and Health Damage, and it is a child of each of them. If there are only a few discrete possible values for each of the variables involved, then a conditional probability table (CPT) giving the conditional probability of each possible value of a node for each possible combination of the values of its parents provides a way to describe the probabilistic relationship between them. The table for an input node with no parents is called its marginal probability table. Chapter 2 discusses other methods of specifying conditional probability relationships that also apply to continuous variables such as age or temperature and to variables with many discrete values, such as daily mortality count in a large population.
The nodes and arcs in an ID are arranged so that there are no directed cycles: it is not possible to start at a node and follow a sequence of arrows that eventually would lead back into the same node. This reflects a constraint that the value of each variable depends only on the values of other variables, and not on its own value. Such networks composed of nodes joined by arrows that do not form any directed cycles are called directed acyclic graphs (DAGs). DAG models are used and discussed extensively in Chapter 2.
A fully quantified ID model specifies a DAG structure together with a marginal probability distribution for the value of each input node and a conditional probability table (CPT) or some other probability model for each derived node giving the probability (or probability density, for continuous variables) of each of its possible values as a function of the values of its parents. Thus, once the values of its parents are known, the conditional probability distribution for the value of the node is known. The probability distribution for an output such as Total Cost can be quantified for any selected values of the decision inputs by a simple process called forward Monte Carlo uncertainty analysis. The process begins by sampling a value from the marginal distribution for each input chance node. Then it samples a value (i.e., draws a value from its conditional probability distribution, given the values of its parents) for each as-yet unevaluated node all of whose parents have been evaluated, thereby evaluating it, i.e., providing a value for it. This process of using sampling to evaluate the as-yet unevaluated nodes with evaluated parents continues until a value has been sampled for the output. Repeating many times for given values of the decision nodes builds up an entire distribution for the output variable, given the selected values of the decision variables. Metaphorically, we might then say that the uncertainty probability distributions for the inputs have been “pushed through” the DAG to yield a probability distribution for the output. Varying the values of the decision variables then allows the decision, or setting of decision variable values, with the lowest expected loss or the greatest expected utility to be identified; if necessary, in a problem with many decision variables, optimization algorithms can be used to automate the search for the best values.
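Forward Monte Carlo uncertainty analysis can be sketched on a tiny DAG whose node names loosely echo Figure 1.5. All distributions, formulas, and cost scales below are hypothetical stand-ins, not the actual Analytica model; the point is only the mechanics of sampling inputs, evaluating derived nodes, and comparing expected values of the value node across decision settings.

```python
import numpy as np

# Forward Monte Carlo through a small hypothetical DAG for one decision setting.
rng = np.random.default_rng(3)
N = 100_000

def total_cost(emissions_reduction):
    """Push input distributions through the DAG to the value node."""
    # Input chance nodes: sample from their marginal distributions.
    base_concentration = rng.normal(100.0, 10.0, size=N)
    health_damage_factor = rng.lognormal(0.0, 0.3, size=N)
    control_cost_factor = rng.normal(1.5, 0.3, size=N)
    # Derived nodes: evaluated once all of their parents have values.
    concentration = base_concentration * (1.0 - emissions_reduction)
    health_damage = health_damage_factor * concentration
    control_cost = control_cost_factor * (emissions_reduction ** 2) * 100.0
    return health_damage + control_cost            # value node: Total Cost

# Vary the decision variable and compare expected values of the value node.
for reduction in (0.0, 0.25, 0.5, 0.75):
    print(f"reduction={reduction:.2f}: expected total cost = {total_cost(reduction).mean():.1f}")
```

With these made-up numbers, health damage falls and control cost rises as the reduction level increases, so expected total cost is minimized at an intermediate reduction level; a grid search or optimization algorithm over the decision variable automates finding it.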