Fig. 1.4. Dashboard for Sepsis Risk Management Control Problem
Source: Tsoukalas et al. (2015) www.ncbi.nlm.nih.gov/pmc/articles/PMC4376114/
Copyright © Athanasios Tsoukalas, Timothy Albertson, Ilias Tagkopoulos. Originally published in JMIR Medical Informatics (http://medinform.jmir.org/), 24.02.2015. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/).
To the right of the rewards in Figure 1.4 are “Vital history” summaries of the observed data, consisting of time series histories and current values for observed variables such as temperature and white blood cell counts (WBC); these are the components of the y vector of observed data values produced by the observation function of the POMDP. In the upper right are patient state trajectories and anticipated target “destination trajectories” that can probably be achieved via optimal policies for administering antibiotics. At the lower right are recommended optimal next actions and alternative nearly-optimal next actions based on inferences about the probabilities of different health states for the patient (the “Belief” profiles shown for current and next states) and records of recently taken acts (“Drug History”) and other events. The lower left shows estimated transition probabilities among states. The authors conclude that this POMDP-based tool greatly increased the proportion of patients whose next transitions were to better health states and that mortality and length-of-stay were also predictable with accuracies of about 70%-80% using the POMDP model.
Bayesian Statistical Decision Theory
The basic causal structure in which outcomes of decisions depend on (i.e., have probabilities that are affected by) an underlying state variable that cannot be observed directly, but whose value affects the probability distribution of observable signals or data, arises very frequently in applied statistical decision theory. This is the normative theory of how one should make decisions under uncertainty when informed by statistical evidence from observed data and by a causal model, theory, or beliefs about how the probabilities of observations and outcomes depend on the unobserved state (Luce and Raiffa, 1957). The abstract structure of many such problems can be diagrammed in DAG form as follows:
a → c ← s → y    (1.9)
indicating that the consequence or outcome c depends on the d.m.’s choice of an act, a, and on the state variable, s, and that the observable data or signal y also depends on state s. If the data y result from a costly sampling effort or experiment designed by the d.m., then there could also be an arrow from a to y to indicate the statistical dependence of the data on these design decisions, such as what sample sizes to use and what variables to collect data on. The quantitative elements of the statistical decision problem are as follows:

Conditional probability tables (CPTs) for the outcomes and observations, P(c | a, s) and P(y | a, s), respectively. If the data observed does not depend on the act, then the CPT for y simplifies to P(y | s).

A utility function for consequences, u(c).

A marginal probability distribution for the state, P(s).
These elements complete the specification of the Bayesian statistical decision theory problem shown in the DAG. The marginal distribution for the state, P(s), is also called the prior distribution for the state, as it summarizes knowledge or beliefs about the state, expressed via probabilities for the possible values of the state variable, prior to conditioning on the data, y. The standard solution approach to such a statistical decision theory problem, as detailed at the beginning of Chapter 2, is to use Bayes’ Rule to infer updated probabilities for the state after conditioning on the observed data. This updated distribution of state probabilities, denoted by P(s | y), is called the posterior distribution for the state, because it is formed after observing the data. Together with the utility function u(c) and the conditional probabilities of consequences specified by the causal model CPT, P(c | a, s), the posterior distribution allows the expected utility of each act to be evaluated after taking the data into account, as follows:
EU(a | y) = Σ_c u(c) P(c | a, y)    (1.10a)
The conditional probability of each consequence, c, given act a and observed data y, in turn, is found from the causal model P(c | a, s) via an application of the law of total probability:
P(c | a, y) = Σ_s P(c | a, s) P(s | y)    (1.10b)
Finally, the posterior probability distribution for s is given by Bayes’ Rule applied to the observation model P(y | s) and the prior probabilities for states, P(s):
P(s | y) = P(y | s) P(s) / Σ_{s′} P(y | s′) P(s′)    (1.11)
where the sum in the denominator is taken over all possible states. (As usual, if the state is a continuous variable, then sums are replaced by integrals and discrete state probabilities are replaced by probability density functions.)
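For a small discrete problem, the entire chain of calculations (1.10a)-(1.11) fits in a few lines of code. The following sketch uses invented numbers for a hypothetical two-state, two-act, two-signal problem; the utility table U collapses u(c) and P(c | a, s) into an expected utility for each (act, state) pair.

```python
import numpy as np

# Hypothetical problem: state s in {0, 1}, act a in {0, 1}, signal y in {0, 1}.
# All numbers below are illustrative, not taken from the text.
prior = np.array([0.7, 0.3])          # P(s)
P_y_given_s = np.array([[0.9, 0.1],   # P(y | s): rows = s, cols = y
                        [0.2, 0.8]])
U = np.array([[10.0, -5.0],           # expected utility of (a, s): rows = a
              [ 2.0,  3.0]])

def posterior(y):
    """Bayes' Rule (1.11): P(s | y) proportional to P(y | s) P(s)."""
    unnorm = P_y_given_s[:, y] * prior
    return unnorm / unnorm.sum()

def expected_utility(a, y):
    """Equations (1.10a)-(1.10b): EU(a | y) = sum over s of U(a, s) P(s | y)."""
    return U[a] @ posterior(y)

for y in (0, 1):
    best = max((0, 1), key=lambda a: expected_utility(a, y))
    print(y, posterior(y).round(3), best)
```

With these numbers, signal y = 0 leaves the prior-favored state highly probable and act 0 optimal, while y = 1 shifts enough posterior weight to state 1 to flip the optimal act.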
If the d.m.’s choice of act affects the data collected, so that an arrow from a to y is added to DAG model (1.9), then appropriate act-specific conditional probabilities must be used:
P(s | y) = P(y | a, s) P(s) / Σ_{s′} P(y | a, s′) P(s′)    (1.11a)
Likewise, if the state probabilities also depend on the act, then the calculation would be:
P(s | y, a) = P(y | a, s) P(s | a) / Σ_{s′} P(y | a, s′) P(s′ | a)    (1.11b)
None of these variations changes the basic idea, which is to choose an act that will maximize expected utility after state probabilities are updated based on observations using Bayes’ Rule and appropriate conditional probabilities.
When the sets of possible acts, states, and observations are all small and discrete, so that explicit conditional probability tables (CPTs) can be developed, the entire process of evaluating the expected utility of each act and recommending an optimal act given any observed (or assumed) data can be automated. Bayesian network (BN) and influence diagram (ID) software, introduced in Chapter 2, automates the required calculations, producing optimal statistical decision rules mapping observations to decisions (y to a) and calculating their expected utilities. Value of information (VOI) calculations of how expected utility changes if more data are collected to inform decisions can also be used to optimize sample size decisions and experimental designs. Chapter 2 provides some simple examples of optimal statistical decision-making using influence diagram software for causal modeling and Bayesian inference from observations to optimize decisions.
Among the very many practical applications of Bayesian statistical decision theory are the following:

Statistical estimation: The observations y consist of sample data. The unobserved state, s, is a population parameter of interest, such as the mean value of an attribute or the proportion of individuals with a certain characteristic in the sampled population. The act a is an estimate of s. An optimal decision rule selects a to minimize the expected value of a loss function L(a, s) or, equivalently, to maximize the expected value of a utility function that depends on the act and state. Formulating statistical estimation problems in these Bayesian decision-theoretic terms provides an explicit rationale for many traditional statistical procedures. For example, if the loss function for the error in a point estimate of an uncertain quantity is quadratic, L(a, s) = (a − s)², where s is the true but unknown quantity and a is the point estimate decided on, then the Bayesian optimal statistical decision is to use the mean of the posterior distribution of s, conditioned on all available data, as the point estimate, a. This is the best estimate in the sense that it minimizes expected loss, i.e., mean squared prediction error. If the loss function is the absolute value of the difference between the true and estimated values, then the optimal statistical decision is to use the median of the posterior distribution as the best (expected loss-minimizing) estimate. Such results show conditions under which different traditional estimators, such as the mean and median, are optimal.
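The claims about quadratic and absolute-value loss can be checked numerically. The sketch below represents a hypothetical, skewed posterior for s by Monte Carlo samples and searches a grid of candidate point estimates; the expected-loss minimizers land near the posterior mean and posterior median, respectively. The exponential posterior is an invented example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative skewed posterior for s, represented by Monte Carlo samples.
samples = rng.exponential(scale=2.0, size=100_000)

def expected_loss(a, loss):
    # Monte Carlo estimate of E[L(a, s)] under the posterior samples.
    return loss(a, samples).mean()

sq = lambda a, s: (a - s) ** 2      # quadratic loss
ab = lambda a, s: np.abs(a - s)     # absolute-value loss

grid = np.linspace(0.0, 6.0, 601)
best_sq = grid[np.argmin([expected_loss(a, sq) for a in grid])]
best_ab = grid[np.argmin([expected_loss(a, ab) for a in grid])]

print(best_sq, samples.mean())       # both ≈ 2.0 (posterior mean)
print(best_ab, np.median(samples))   # both ≈ 2 ln 2 ≈ 1.39 (posterior median)
```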

Change-point analysis (CPA): The hidden state, s, describes a data-generating process, and the inference task is to infer from data whether, when, and how the data-generating process has changed. The data y consist of time series observations. The decision variable could specify a classification of the process (e.g., changed vs. not changed) or an intervention to be taken when it is concluded from the data that the process being observed has (probably) changed.

Statistical quality control: The hidden state is the true quality (e.g., probability of producing a conforming item), the data consists of observed results of inspections, and the decision rule specifies whether to perform a costly intervention (e.g., replacing or recalibrating a machine used in the production process) based on the sample data.

Lot acceptance sampling: The hidden state is the true proportion of defectives in a lot or consignment of items received for approval. The act a specifies a sampling plan and a decision rule specifying whether to accept or reject the lot (or, in multi-stage plans, whether and how to sample further before making a final accept/reject decision) based on the results of sampling so far, y.

Diagnosis: The unobserved state, s, is the true state of a patient or system being examined. The act a specifies a label (e.g., a disease name or a classification or description of the system’s state). An optimal decision rule minimizes the expected cost from misclassification errors. In adaptive decision support systems, the optimal decision rule also specifies what tests to perform next, if any, given the results seen so far and the costs and statistical characteristics of available tests.
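A minimal numerical illustration of such a diagnostic decision rule: the prevalence, test sensitivity and specificity, and misclassification costs below are all invented for the example, with a false negative costed far more heavily than a false positive.

```python
import numpy as np

# Hypothetical diagnosis: disease absent (s = 0) or present (s = 1);
# test result y is negative or positive; label a is healthy (0) or sick (1).
prior = np.array([0.95, 0.05])        # P(s): assumed 5% prevalence
sens, spec = 0.9, 0.8                 # assumed test sensitivity / specificity
P_pos = np.array([1 - spec, sens])    # P(y = positive | s)

# cost[a, s]: cost of assigning label a when the truth is s.
cost = np.array([[0.0, 50.0],         # missing a sick patient is costly
                 [5.0,  0.0]])        # treating a healthy one is cheaper

def diagnose(y_positive):
    like = P_pos if y_positive else 1.0 - P_pos
    post = like * prior
    post /= post.sum()                # Bayes' Rule posterior over s
    # Choose the label with the smallest expected misclassification cost.
    return int(np.argmin(cost @ post))

print(diagnose(False), diagnose(True))  # 0 (healthy) then 1 (sick)
```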

Pattern recognition: This is a generalization of diagnosis. Pattern recognition, or optimal statistical classification, seeks to classify or label the underlying state of a system or situation based on observations whose probabilities depend on the state. The ability to correctly classify or recognize the state from the observed data with high probability has applications in speech, face, and handwriting recognition, fraud detection, oil exploration, and countless other areas of science, technology, and engineering, in addition to medical diagnosis.

Statistical forecasting: Observations up to the present constitute the data, y. The act or decision consists of a prediction for one or more future observations. An optimal prediction rule minimizes the expected loss from prediction errors. Some forecasting systems, called state space models, use the observations y to estimate the current state. If they are not already known, the state dynamics equations describing the evolution of the state over time, such as (1.6a), (1.7), or the state transition probabilities of a Markov chain, depending on the system being studied, are also estimated from data, along with the observation equation (1.8). The state dynamics equations are then applied to the estimated state to predict future states. This approach only coincides with the Bayesian statistical decision-theory approach under special conditions, such as for “linear quadratic Gaussian” (LQG) systems with state dynamics described by linear ODEs (or difference equations, in discrete time), normally distributed (“Gaussian”) additive observation errors, and quadratic loss functions. Moreover, identification of a unique estimated state (and, if they are not already known, dynamic evolution equations for the state) from available observations is not always possible (the “identification problem” for systems). Even when the state estimation problem is well posed and uniquely solvable from the data, the computational challenge of calculating the posterior probability distribution for states – or even its mean – can be formidable. Efficient recursive updating formulas are available in special cases (most notably, the celebrated Kalman filter for LQG systems widely used in aeronautics, electrical engineering, automatic control, and many other areas (Zarchan and Musoff, 2015)). But in general, tractable computation of posterior probability distributions for the hidden state requires using approximation techniques such as Sequential Monte Carlo (SMC) sampling methods.
These include popular “particle filtering” algorithms that draw weighted random samples (“particles”) from the current distribution, propagate them through the system dynamics equations, update the samples and weights by conditioning on observations, and then resample using sequential importance sampling (SIS) and sequential importance resampling (SIR) techniques to yield computationally manageable samples representing the posterior distribution (Doucet and Johansen, 2008). For practitioners, particle filtering can be easily implemented using software such as the packages pomp and SMC in R.
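To make the idea concrete, here is a minimal bootstrap (SIR) particle filter for an assumed toy linear-Gaussian state space model; in this special case the Kalman filter would give the exact answer, which makes the model a convenient sanity check for the particle approximation. All parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: s_t = 0.8 s_{t-1} + N(0, 1),  y_t = s_t + N(0, 0.25).
T, N = 50, 2000
true_s = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    true_s[t] = 0.8 * true_s[t - 1] + rng.normal(0.0, 1.0)
    y[t] = true_s[t] + rng.normal(0.0, 0.5)

particles = rng.normal(0.0, 1.0, size=N)   # initial particle cloud
estimates = np.zeros(T)
for t in range(1, T):
    # Propagate each particle through the state dynamics.
    particles = 0.8 * particles + rng.normal(0.0, 1.0, size=N)
    # Weight particles by the observation likelihood, then normalize.
    w = np.exp(-0.5 * ((y[t] - particles) / 0.5) ** 2)
    w /= w.sum()
    # SIR step: resample particles in proportion to their weights.
    particles = rng.choice(particles, size=N, p=w)
    estimates[t] = particles.mean()        # posterior-mean state estimate

rmse = np.sqrt(np.mean((estimates - true_s) ** 2))
print(rmse)   # the filtered estimate tracks the hidden state
```

Packages such as pomp wrap the same propagate-weight-resample loop behind a higher-level interface, with more careful resampling schemes than the plain multinomial resampling used here.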

Adaptive optimal control (Sutton et al., 1992): Observations that are affected by the underlying state of a system are fed into a feedback control decision rule that, in turn, determines what sampling and control actions to undertake next. The decision rule is adjusted over time in light of observations to try to optimize an objective function, e.g., maximizing expected utility or minimizing expected loss.

POMDPs: A POMDP has a hidden state that evolves according to a stochastic causal law such as (1.7c) and that is observed via probabilistic signals as in equation (1.8). An optimal decision rule specifies what act to choose in each period, given the observations, to optimize an objective function.

Sequential comparison and selection: The hidden state is the true average reward or success probability for each of multiple alternatives. These could be treatment plans, advertising campaigns, product designs, or other alternatives being compared in order to select the best one, i.e., the one that generates the highest average reward or success probability. The observations consist of the history of trials so far, i.e., choices made and resulting rewards; these constitute samples from the true but unknown reward distributions for the different alternatives. A decision rule specifies what to try next, given the results so far. An optimal decision rule minimizes the expected sum of costs or losses from sampling and from the opportunity costs (or “regret”) of trying less-than-optimal alternatives. A/B testing in marketing science, clinical trials in medicine, and multi-arm bandit (MAB) problems in statistics and machine learning are examples of such sequential comparison and selection problems.
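One widely used Bayesian decision rule for such problems is Thompson sampling, sketched below for a hypothetical three-arm Bernoulli bandit with invented success probabilities; each arm's hidden success probability gets a Beta posterior that is updated after every trial.

```python
import numpy as np

rng = np.random.default_rng(2)
true_p = [0.3, 0.5, 0.6]   # hidden success probabilities (illustrative)
K = len(true_p)
wins = np.ones(K)          # Beta(1, 1) prior on each arm's success probability
losses = np.ones(K)

picks = np.zeros(K, dtype=int)
for _ in range(5000):
    # Thompson sampling: draw one plausible success probability per arm
    # from its Beta posterior, then play the arm with the best draw.
    theta = rng.beta(wins, losses)
    arm = int(np.argmax(theta))
    reward = rng.random() < true_p[arm]
    wins[arm] += reward
    losses[arm] += 1 - reward
    picks[arm] += 1

print(picks)   # pulls should concentrate on the best arm (index 2)
```

Because exploration is driven by posterior uncertainty, the rule automatically shifts from trying all arms early on to exploiting the apparent best arm as evidence accumulates.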
The huge variety of important and useful applications of Bayesian statistical decision theory all share the following simple causal structure. A hidden state that is not directly observed causes observations, in the sense that the probabilities of the observations are affected by the state. Also, the current state together with a choice of current act cause (i.e., determine the values of, or the probabilities of) outcomes of interest, typically the next state and rewards or losses. Given observed histories of acts and observations, the d.m. seeks to choose next acts to optimize the expected value of an objective function. The expected value is calculated using posterior probabilities that are conditioned on observations via Bayes’ Rule. This combination of concepts and methods, which models causality via conditional probability relationships among acts, states, observations, and outcomes and then uses them to draw inferences and optimize actions, powers much of modern signal processing, artificial intelligence, automated control, telecommunications, and industrial engineering technology.
Simulation-Optimization
In many applications, conditional probabilities of outcomes caused by different acts are not known or specified as part of a decision problem. They must instead be estimated from experience or from a model. For a system that is understood in enough detail so that its behavior can be modeled accurately via a discrete-event simulation (DES) or continuous stochastic simulation model, the probabilities of outcomes caused by different combinations of values for the controllable inputs can be estimated via simulation. Optimization algorithms can then be applied to the simulation model to seek the best policies or courses of action. For example, a discrete-event simulation model might be composed by linking stochastic models for the transition rates between “working” and “failed” states for the components and subsystems of a reliability system, with failure and repair rates for each component and subsystem being determined by the states (working or failed) of its predecessors in a dependency graph (DAG) model. The input data for such a model include:

Spontaneous failure rates and repair rates – or, more generally, state transition intensities – for the “basic events,” i.e., those without predecessors;

Conditional failure and repair rates, or state transition intensities, for each component or subsystem, given the states of its predecessors.
As outputs, such a model can be used to simulate the probability distribution of the random time until occurrence of an undesirable “top event” such as catastrophic system failure, given different maintenance and repair policies as controllable decision inputs. More generally, simulation can be used to estimate the conditional probabilities of failure times for the system, or for the remaining random time until its first passage into a specified target set of states, given the occurrence or nonoccurrence of various events such as failures of its components or subsystems. Such conditional probabilities, in turn, can be used to identify which subset of costly improvements in component-level reliabilities would most improve the reliability of the system as a whole for the dollars spent, although true optimization usually requires special techniques of combinatorial optimization for reliability systems. Special techniques, such as importance sampling and subset simulation, have been developed for efficient simulation of rare events such as failures in highly reliable systems (Beck and Zuev, 2015).
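For a simple structure, the simulation itself can be only a few lines. The sketch below estimates the failure-time distribution for a hypothetical system consisting of component A in series with a parallel pair (B, C), with invented exponential failure rates and no repair.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical system: A in series with the parallel pair (B, C).
# Exponential lifetimes with illustrative failure rates (per year).
rates = {"A": 0.1, "B": 0.5, "C": 0.5}
n = 100_000   # number of Monte Carlo replications

tA = rng.exponential(1 / rates["A"], n)
tB = rng.exponential(1 / rates["B"], n)
tC = rng.exponential(1 / rates["C"], n)

# The top event occurs when A fails, or when both B and C have failed.
t_system = np.minimum(tA, np.maximum(tB, tC))

print(t_system.mean())           # mean time to the top event
print((t_system < 1.0).mean())   # estimated P(failure within 1 year)
```

For this model the one-year failure probability can also be computed in closed form (about 0.235), which is a useful check on the simulation; for realistic dependency graphs with repair, no such closed form exists and simulation is the practical route.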
Simulation and optimization are often combined into powerful simulation-optimization algorithms that automatically vary controllable inputs (the values of “decision variables” or “control variables”), estimate expected utilities or other performance measures via simulation, and then adjust the controllable inputs to increase expected utility – or, equivalently, to reduce expected loss – until no further improvement can be found (Amaran et al., 2016; Fu, 2015). A relatively recent class of simulation-optimization algorithms, Monte Carlo Tree Search (MCTS), has led to breakthrough advances in difficult decision problems such as those that arise in large POMDPs and in world champion-level games of Go. MCTS first evaluates potential next acts by running multiple probabilistic (Monte Carlo) simulations to see what might happen after each is selected, and then chooses the act that has the best evaluation averaged over multiple simulations (Silver and Veness, 2010; Fu, 2016). Causal knowledge about what can happen, as represented by the rules of a game or by the state transition probabilities of a POMDP, for example, constrains the simulated possible futures. MCTS typically proceeds via the following systematic four-step process, carried out in a search tree in which the nodes represent possible states and the branches (arcs) leaving a node represent possible acts that can be taken from it (Browne et al., 2011):

Select a node to “expand,” i.e., to evaluate further by generating and evaluating new child nodes for it;

Expand the search tree by generating additional acts and new child nodes, i.e., states, to which they lead, following the ones that have already been considered;

Simulate: Evaluate a newly added node by simulating multiple random feasible trajectories (sequences of states and acts) forward from it, choosing subsequent acts via a default policy (possibly just via random selection) where needed, and using the causal model to simulate outcomes. Stop each simulation when it reaches a terminal state (e.g., a winning or losing position in a game or an absorbing state of an MDP or POMDP) or when an evaluation horizon – that is, a maximum desired depth for the search tree – is reached. Then average the resulting rewards from the simulated trajectories following the new act being evaluated. This phase of MCTS is also sometimes called “policy rollout,” since it evaluates a node by simulation (“rolling forward”) using a default policy to select acts and using a causal model to generate resulting states, observations, and rewards.

Backpropagation: Update the estimated values of the predecessor acts that lead to the newly evaluated act to reflect the simulated average reward from that point forward.
At the end of this process, the next act chosen is the one with the highest estimated value. Although this basic structure is simple, combining Monte Carlo simulation with decision tree and game tree search and evaluation principles that have been known for decades, it has proved extremely powerful for solving problems that were previously computationally intractable (Fu, 2016). In practical implementations, quite sophisticated algorithms and heuristics may be used to carry out each step. For example, in applications to large-scale POMDPs, it is common to use particle filtering to represent and update beliefs about the hidden state, and ideas from reinforcement learning and multi-arm bandit problems to optimize search priorities (Katt et al., 2017). Some of these ideas are discussed later in the section on causality in learning analytics, which addresses the problem of decision-making with unknown causal models.
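The four MCTS steps can be illustrated with a bare-bones UCT implementation for a toy take-away game (players alternately remove 1 or 2 stones; whoever takes the last stone wins), in which the optimal move is known to leave the opponent a pile that is a multiple of 3. This is a minimal sketch, not a production algorithm; node values are stored from the perspective of the player to move at each node, so signs flip at each level during backpropagation.

```python
import math
import random

random.seed(4)

class Node:
    def __init__(self, pile, move=None, parent=None):
        self.pile, self.move, self.parent = pile, move, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def untried_moves(self):
        tried = {c.move for c in self.children}
        return [m for m in (1, 2) if m <= self.pile and m not in tried]

def rollout(pile):
    """Random playout; +1 if the player to move from `pile` wins."""
    sign = 1
    while True:
        pile -= random.choice([m for m in (1, 2) if m <= pile])
        if pile == 0:
            return sign          # the player who just moved took the last stone
        sign = -sign

def mcts_move(pile, iters=4000):
    root = Node(pile)
    for _ in range(iters):
        node = root
        # 1. Select: descend via the UCT rule while fully expanded.
        while not node.untried_moves() and node.children:
            node = max(node.children, key=lambda c:
                       -c.value / c.visits
                       + math.sqrt(2 * math.log(node.visits) / c.visits))
        # 2. Expand: add one untried child, if any remain.
        moves = node.untried_moves()
        if moves:
            m = random.choice(moves)
            child = Node(node.pile - m, move=m, parent=node)
            node.children.append(child)
            node = child
        # 3. Simulate from the new node (pile 0 is a loss for the player
        #    to move there, since the opponent just took the last stone).
        result = -1.0 if node.pile == 0 else rollout(node.pile)
        # 4. Backpropagate, flipping sign at each level (alternating players).
        while node is not None:
            node.visits += 1
            node.value += result
            result = -result
            node = node.parent
    return max(root.children, key=lambda c: c.visits).move

print(mcts_move(5), mcts_move(4))   # optimal moves leave a multiple of 3
```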
At the other end of the spectrum of complexity for simulation-optimization are problems in which both simulation and optimization are straightforward and no sophisticated algorithms are needed, other than the sampling methods already incorporated into basic Monte Carlo uncertainty analysis software. Such relatively simple, straightforward modeling can still be extremely useful for evaluating and improving proposed policies when enough is known to develop credible simulation models to assess the (approximate) probabilities of different outcomes for each policy alternative via Monte Carlo simulation.
Example: An Influence Diagram (ID) DAG Model for Optimizing Emissions Reductions
Influence diagrams (IDs) provide a welldeveloped prescriptive analytics framework that is appropriate for many decision analysis problems. Figure 1.5 shows an example of an ID drawn using the Analytica software package (Morgan and Henrion, 1990). This ID model illustrates considerations that might be used to support a policy decision about how much to reduce air pollution emissions. It consists of nodes, representing variables, and arrows between nodes representing dependencies between the corresponding variables. Several different types of nodes are allowed in an ID, as follows.
Fig. 1.5 An Analytica® influence diagram (ID) for optimizing emissions reduction
Source: Analytica® web site (http://wiki.analytica.com/index.php?title=Example_Models_and_Libraries)

Decision nodes. The green rectangular node at the upper left of Figure 1.5 labeled “Emissions Reduction” is a decision node representing a decision variable. The decision-maker chooses the value for such a variable from a set of possible values. In general, an ID may have multiple decision nodes, although Figure 1.5 has only one. Decision nodes are also called choice nodes.

Value node. The pink hexagon at the lower right of Figure 1.5 labeled “Total Cost” is a value node. This is the ultimate output of the decision problem: a measure of the overall impact of the decisions made. Because it is an output, it has only inward-pointing arrows in the diagram. The decision-maker seeks to set the values of decision variables to optimize the expected value of the value node, meaning its average value over many simulation runs of the decision model. Using expected values allows for the possibility of a random relationship between choices and outcomes. This is realistic if other uncertain quantities (modeled as random variables) also affect the outcome. If the value node represents total cost or loss, as in this case, then the challenge for decision analysis is to find values for the decision variables to minimize the expected value of the value node. If the value node represents utility, then the goal is to make decisions to maximize its expected value. In standard operations research and optimization terminology, the value node represents the objective function to be optimized in the decision problem.

Chance nodes (yellow ellipses) represent uncertain quantities, modeled as random variables. For example, several of the input nodes (defined as nodes with only outward-pointing arrows, also called exogenous nodes) around the periphery of Figure 1.5, such as Control Cost Factor and Base Concentration, may be uncertain quantities. In this case, a probability distribution for the value of each uncertain input quantity is specified as part of the ID model. A special case of a chance node is a deterministic node that assigns probability 1 to a specific value: constants are included in this way.

Derived nodes. Nodes (i.e., variables) with inward-pointing arrows have values that depend on the values of the variables that point into them. For example, in Figure 1.5, the value of Health Damage depends on the values of Concentration, Threshold, and Health Damage Factor, each of which may be a random variable. Rectangular boxes with rounded edges are used to represent derived variables in Figure 1.5. The value of a derived node may be a deterministic function of the values that point into it, but a useful generalization is to allow a derived node to represent a random variable with a conditional probability distribution that depends on the values of the variables pointing into it. The variables that point into a given variable are called its parents, and the given variable is called a child of any of its parents; thus, an arrow runs from each parent into each of its children. In Figure 1.5, Excess Deaths has two parents, Population Exposed and Health Damage, and it is a child of each of them. If there are only a few discrete possible values for each of the variables involved, then a conditional probability table (CPT) giving the conditional probability of each possible value of a node for each possible combination of the values of its parents provides a way to describe the probabilistic relationship between them. The table for an input node with no parents is called its marginal probability table. Chapter 2 discusses other methods of specifying conditional probability relationships that also apply to continuous variables such as age or temperature and to variables with many discrete values, such as daily mortality count in a large population.
The nodes and arcs in an ID are arranged so that there are no directed cycles: it is not possible to start at a node and follow a sequence of arrows that eventually would lead back into the same node. This reflects a constraint that the value of each variable depends only on the values of other variables, and not on its own value. Such networks composed ofnodes joined by arrows that do not form any directed cycles are called directed acyclic graphs (DAGs). DAG models are used and discussed extensively in Chapter 2.
A fully quantified ID model specifies a DAG structure together with a marginal probability distribution for the value of each input node and a conditional probability table (CPT) or some other probability model for each derived node, giving the probability (or probability density, for continuous variables) of each of its possible values as a function of the values of its parents. Thus, once the values of its parents are known, the conditional probability distribution for the value of the node is known. The probability distribution for an output such as Total Cost can be quantified for any selected values of the decision inputs by a simple process called forward Monte Carlo uncertainty analysis. The process begins by sampling a value from the marginal distribution for each input chance node. Then it samples a value (i.e., draws a value from its conditional probability distribution, given the values of its parents) for each as-yet-unevaluated node all of whose parents have been evaluated, thereby evaluating it, i.e., providing a value for it. This process of using sampling to evaluate the as-yet-unevaluated nodes with evaluated parents continues until a value has been sampled for the output. Repeating many times for given values of the decision nodes builds up an entire distribution for the output variable, given the selected values of the decision variables. Metaphorically, we might then say that the uncertainty probability distributions for the inputs have been “pushed through” the DAG to yield a probability distribution for the output. Varying the values of the decision variables then allows the decision, or setting of decision variable values, with the lowest expected loss or the greatest expected utility to be identified; if necessary, in a problem with many decision variables, optimization algorithms can be used to automate the search for the best values.
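The forward Monte Carlo process can be sketched for a miniature, hypothetical version of the Figure 1.5 model; all distributions, functional forms, and constants below are invented for illustration and are not taken from the Analytica example.

```python
import numpy as np

rng = np.random.default_rng(5)

def total_cost(reduction, n=50_000):
    """Forward Monte Carlo estimate of expected Total Cost for one decision."""
    # Input chance nodes: sample from their marginal distributions.
    control_cost_factor = rng.lognormal(mean=0.0, sigma=0.3, size=n)
    base_concentration = rng.normal(100.0, 10.0, size=n)
    damage_factor = rng.lognormal(mean=-1.0, sigma=0.5, size=n)
    # Derived nodes: each is a function of its parents in the DAG.
    concentration = base_concentration * (1.0 - reduction)
    health_damage = damage_factor * np.maximum(concentration - 20.0, 0.0)
    control_cost = control_cost_factor * 50.0 * reduction ** 2
    # Value node: average Total Cost over the Monte Carlo runs.
    return (health_damage + control_cost).mean()

# Vary the decision node and pick the reduction with the lowest expected cost.
grid = np.linspace(0.0, 1.0, 21)
costs = [total_cost(r) for r in grid]
best = grid[int(np.argmin(costs))]
print(best)   # an interior optimum: some, but not total, emissions reduction
```

The grid search stands in for the optimization algorithms mentioned above; with many decision variables, a gradient-free optimizer would drive the same simulation instead of an exhaustive grid.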