Markov Decision Processes
An important generalization of one-shot decision-making allows the d.m. to make a sequence of decisions. Each decision produces an outcome. Outcome probabilities in each period may depend on the state at the start of that period; on the decision (or act) selected; and possibly on the next state, i.e., the state to which the system transitions by the start of the next period. Acts may also affect the probabilities for the next state. A Markov decision process (MDP) models such a sequence of decisions and states over time, where the decisions taken in one period affect the probabilities of the states for the next period as well as the rewards received in the current period. More specifically, the d.m. chooses an act in each period. This act, together with the state of the system at the start of the period, determines, or causes, conditional probabilities of different outcomes. The outcome has two components: an immediate reward to the d.m., and a next state of the system. In other words, the d.m.’s choice causally affects not only the probability distribution of immediate rewards, which in general depends on both the chosen act and the current state, but also the probabilities of making transitions from the current state to each possible next state. For example, if the state is the amount of inventory at the start of the period and the decision variable is the number of items purchased and added to inventory in this period, then the next state will be the current state plus the number of new items added to the inventory (the decision variable) minus the number of items sold, and hence withdrawn from the inventory, in this period. The number of items sold reflects current demand for the item, usually modeled as a random variable. The reward for this period is then the net profit from these purchases and sales.
The probability distribution for the next state is determined by the decision of how many items to purchase and by the probability distribution for the number of items demanded in this period.
MDPs with known rewards and state transition probabilities can be solved via well-developed optimization algorithms to maximize objective functions such as the expected net present value of rewards or the average reward per period. To “solve” an MDP means to specify what act to take in each state to maximize the objective function. More generally, a decision rule or policy is a function that maps available information to decisions: it specifies what to do whenever a decision must be made, given the information available then. An optimal decision rule or optimal policy is one that optimizes the expected value of an objective function, e.g., maximizing expected utility or minimizing expected loss. In an MDP, a decision rule maps the current state, such as the amount of inventory on hand at the start of a day, to a feasible act to take from that state, such as how much additional inventory to purchase that day.
A quick recap of standard theory for MDPs follows. Readers who do not thirst for such details can safely skip them. If an optimal policy is followed, then the expected discounted value of the MDP starting from current state s, denoted by V(s), can be expressed as follows:
V(s) = max_{a}[R(a, s) + δ·Σ_{s'}P(s' | s, a)·V(s')] (1.3)
Here,

s denotes the current state

s' denotes the next state

a = current act

V(s) = expected discounted value of the stream of rewards generated by the process starting from state s, assuming that an optimal policy is followed henceforth.

R(a, s) = immediate reward (or, if it is uncertain, the expected value of the immediate reward) from taking act a in state s. In some models, this reward function is generalized to depend not only on the current state and act, but also on the next state. In this case, if the reward function is specified by a table, then instead of showing the value of the reward for each pair of values (a, s), it is necessary to show its value for each triple (a, s, s') where s' denotes the next state.

P(s' | s, a) = conditional probability that the next state is s', given that the current state is s and that the act taken now is a. This is also called the transition probability for a transition to next state s', given that act a is chosen from current state s. It provides a simple causal model for the relationship between the current choice of act and the probability distribution for the next state. It could be diagrammed as (a, s) → s', signifying that this probability distribution depends on, or is determined by, the current state and the current act. When the sets of states and acts are finite, the array of numbers P(s' | s, a), as a, s, and s' range over all possible acts available from the current state, all states, and all possible next states, respectively, constitutes a conditional probability table (CPT) for the value of the next state given the current state and act. In continuous time, instead of transition probabilities, the causal dynamics of Markov processes are described by transition intensities giving the expected rate of transitions (in units of transitions per unit time) from the current state to each possible next state. The probability of each next state is then the ratio of the transition intensity into it from the current state to the sum of these transition intensities over all possible next states.

Σ_{s'}P(s' | s, a)·V(s') = expected value of the future reward from the process, starting one period from now. This is its expected value starting from each possible next state, weighted by the transition probability that it will indeed be the next state, given the current state and act.

δ = one-period discount factor, i.e., the amount that a reward postponed by one period is worth now.

R(a, s) + δ·Σ_{s'}P(s' | s, a)·V(s') = expected value starting from the current state, assuming that optimal decisions are taken from each state henceforth. It is the sum of the immediate reward and the expected discounted value of future rewards.

By definition, the best act to take now is the one that maximizes expected value starting from the current state. The “max_{a}” on the right side of the equation signifies that this is the act selected.
Readers familiar with dynamic programming optimization methods will recognize this equation as a version of the Bellman equation. In words, it states that the optimized value of the process (meaning the expected discounted value of the stream of rewards that it generates over time if optimal decisions are made) is the sum of the immediate reward that it generates in this period plus its optimized value thereafter. The optimal value function defined by this equation, once it is known, yields a straightforward way to decide what to do in any state: choose the act that maximizes the optimized value, R(a, s) + δ·Σ_{s'}P(s' | s, a)·V(s'). If the set of acts is small and finite (or, more generally, easily searched), so that this expression can easily be evaluated and the optimal act found, then this provides an easily computed optimal decision rule.
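As a concrete illustration, suppose the optimal value function V is already known for a toy two-state inventory MDP. The following Python sketch (the chapter's software examples use R; here the states, acts, probabilities, rewards, and the values of V are all invented for illustration) extracts the optimal decision rule by maximizing R(a, s) + δ·Σ_{s'}P(s' | s, a)·V(s') over the available acts:

```python
# Greedy act selection from a known optimal value function V.
# All numbers below are illustrative assumptions, not from the text.
DELTA = 0.9  # one-period discount factor

# Transition probabilities P[s][a] = {next_state: probability}
P = {
    "low":  {"order": {"high": 0.8, "low": 0.2},
             "wait":  {"high": 0.1, "low": 0.9}},
    "high": {"order": {"high": 0.9, "low": 0.1},
             "wait":  {"high": 0.6, "low": 0.4}},
}
# Immediate rewards R[s][a]
R = {"low":  {"order": -2.0, "wait": 0.0},
     "high": {"order": 3.0, "wait": 7.0}}
# Suppose the optimal value function V(s) is already known:
V = {"low": 10.0, "high": 20.0}

def greedy_act(s):
    """Return the act maximizing R(a, s) + delta * sum_s' P(s'|s,a) V(s')."""
    def q(a):
        return R[s][a] + DELTA * sum(p * V[s2] for s2, p in P[s][a].items())
    return max(P[s], key=q)
```

With these invented numbers, the rule orders more inventory in the "low" state and waits in the "high" state, which is the qualitative behavior one would expect.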
In practice, of course, the optimal value function is usually initially unknown. However, if the reward function R(a, s) (or, in some formulations, R(a, s, s')), the transition probabilities P(s' | s, a), and the discount factor δ are all known, then the optimal value function can be calculated via the following value iteration algorithm:

Initialize the algorithm by assigning a numerical value to each state. This assignment may be thought of as an initial guess at the optimal value function. Let’s call it Q(s). If there are only a few states, then Q(s) can be shown as a table with a row for each state and with two columns, the first listing each state and the second showing the value for each state. If there is no other information, then the values in this initial guess at the optimized value function may be assigned arbitrarily, or at random. If there is some information about which states have the highest values (as in many chess-playing heuristics, for example), then using it to guess at the optimized value function may speed convergence of the value iteration algorithm.

Iteratively improve the estimated value function. This is done as follows. Denote the current iteration (or guess, or estimate) of the optimal value function by Q(s). This function gives a numerical value, the estimated expected discounted value of current and future rewards, for each possible state, s. Similarly, denote by Q(s, a) the estimated expected value of taking act a in state s and then acting optimally in the future, where this estimate is formed using the current iteration Q(s). No confusion should arise from the use of the same function name, Q, for both of these closely related functions: Q(s) denotes the estimated optimal value of the MDP starting from state s if optimal decisions are made, and Q(s, a) denotes the estimated value of the MDP starting from state s if act a (which may not be optimal) is taken now and then all future decisions are made optimally. In symbols,
Q(s, a) = Σ_{s'}P(s' | s, a)·[R(a, s, s') + δ·Q(s')]. (1.4)
Here, Q(s') on the right side refers to the current guess at the optimal value starting from next state s'. Q(s) is derived from Q(s, a) by “optimizing out” the current act, a, i.e., by choosing it to maximize Q(s, a); in symbols, Q(s) = max_{a}Q(s, a).
Equation (1.4) shows that the estimated expected value of taking act a in state s is the sum over all possible next states of the conditional probabilities of the next state (which is probabilistically caused by the choice of act and by the current state), multiplied by the sum of the immediate reward R(a, s, s') and the discounted value starting from the next state reached, δ·Q(s'). (For generality, we address the case where the immediate reward depends not only on the current state and act, but also on the next state. If it depends only on the current state and act, then the equation would simplify to Q(s, a) = R(a, s) + δ·Σ_{s'}P(s' | s, a)·Q(s').) With this estimate of the expected value of the process starting from state s if act a is selected, it is clear that the estimated expected value is maximized simply by choosing an act that maximizes the function Q(s, a). Since the current state s is known and R(a, s, s'), P(s' | s, a), δ, and the current estimate of the optimal value starting from the next state, Q(s'), are all known quantities, Q(s, a) can be evaluated for each a and its maximum achievable value determined. This estimated maximum achievable value starting from state s, denoted by max_{a}Q(s, a) in mathematical notation, now becomes the new estimate of Q(s), the optimal value starting from state s. In other words, we can improve the old estimate of Q(s) by replacing it with an updated estimate, as follows:
Q(s) ← max_{a}Q(s, a). (1.5)
The update arrow “←” means that the old value on the left is to be replaced by the new value on the right. Once the function Q(s) has been updated for all states via equation (1.5), this new estimated function can be used to derive updated values of Q(s, a) via equation (1.4). These then lead to further updated values of Q(s) via equation (1.5), and the iteration between equations (1.4) and (1.5), updating Q(s, a) and Q(s) respectively, continues until the values of Q(s) and Q(s, a) stop changing appreciably, i.e., change by less than the desired level of accuracy between successive iterations. That such convergence will occur, and that the resulting Q(s) derived at the end of this process is indeed the true optimal value function V(s) (to within the desired level of accuracy used in deciding when to terminate the iteration), follow from mathematical analysis of the value iteration algorithm, which establishes that the iteration of equations (1.4) and (1.5) converges to a unique fixed point at V(s) (Kaelbling et al., 1996).
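The iteration between equations (1.4) and (1.5) takes only a few lines to implement. The following Python sketch runs value iteration on an invented two-state, two-act MDP (the states, acts, probabilities, rewards, and discount factor are illustrative assumptions; the reward here depends only on the current state and act, the simplified form):

```python
# Value iteration: iterate equations (1.4) and (1.5) until the
# value function stops changing. The toy MDP below is an
# illustrative assumption, not from the text.
DELTA = 0.9  # one-period discount factor

STATES = ["low", "high"]
ACTS = ["order", "wait"]
P = {"low":  {"order": {"high": 0.8, "low": 0.2}, "wait": {"low": 1.0}},
     "high": {"order": {"high": 1.0},             "wait": {"high": 0.5, "low": 0.5}}}
R = {"low":  {"order": -2.0, "wait": 0.0},
     "high": {"order": 1.0, "wait": 4.0}}

def value_iteration(states, acts, P, R, delta, tol=1e-9):
    Q_s = {s: 0.0 for s in states}  # initial guess at the value function
    while True:
        # Equation (1.4), simplified: Q(s,a) = R(a,s) + delta*sum_s' P(s'|s,a)Q(s')
        Q_sa = {(s, a): R[s][a] + delta * sum(p * Q_s[s2] for s2, p in P[s][a].items())
                for s in states for a in acts}
        # Equation (1.5): Q(s) <- max_a Q(s, a)
        new = {s: max(Q_sa[(s, a)] for a in acts) for s in states}
        if max(abs(new[s] - Q_s[s]) for s in states) < tol:
            return new
        Q_s = new

V = value_iteration(STATES, ACTS, P, R, DELTA)
```

At convergence, V satisfies the Bellman equation (1.3) to within the stopping tolerance, and the greedy act with respect to V is the optimal policy.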
The value iteration algorithm just described is one of three main classical approaches, each with many variations, for solving MDPs. Linear programming and policy iteration are the other two. Policy iteration works by iteratively improving an initial policy, which may be random. Recall that a policy for an MDP is an assignment of acts to states, specifying which act to take in each state. An optimal policy maximizes an objective function such as expected discounted reward or average reward per period. Given any non-optimal policy (including one formed by randomly selecting which act to take in each state), an improved policy is created by assigning to each state the act that maximizes the objective function when all future decisions are made according to the current policy. Iterating such policy improvement steps leads to an optimal policy; see Kaelbling et al. (1996) and references therein for details. Linear programming provides an alternative way to optimize the objective function by assigning acts to states subject to the linear constraints implied by equation (1.4).
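Policy iteration can be sketched just as compactly. The Python sketch below (again on an invented toy MDP; all numbers are illustrative assumptions) alternates a policy evaluation step, which solves for the value of the current fixed policy by iteration, with a policy improvement step:

```python
# Policy iteration: alternate policy evaluation and policy improvement.
# The toy MDP (states, acts, probabilities, rewards) is an illustrative
# assumption, not from the text.
DELTA = 0.9
STATES = ["low", "high"]
ACTS = ["order", "wait"]
P = {"low":  {"order": {"high": 0.8, "low": 0.2}, "wait": {"low": 1.0}},
     "high": {"order": {"high": 1.0},             "wait": {"high": 0.5, "low": 0.5}}}
R = {"low":  {"order": -2.0, "wait": 0.0},
     "high": {"order": 1.0, "wait": 4.0}}

def evaluate(policy, delta=DELTA, tol=1e-10):
    """Value of following a fixed policy forever (iterative evaluation)."""
    V = {s: 0.0 for s in STATES}
    while True:
        new = {s: R[s][policy[s]]
                  + delta * sum(p * V[s2] for s2, p in P[s][policy[s]].items())
               for s in STATES}
        if max(abs(new[s] - V[s]) for s in STATES) < tol:
            return new
        V = new

def policy_iteration(delta=DELTA):
    policy = {s: ACTS[0] for s in STATES}  # arbitrary initial policy
    while True:
        V = evaluate(policy, delta)
        # Improvement: in each state, pick the act that is best if all
        # future decisions follow the current policy (via its value V).
        improved = {s: max(ACTS, key=lambda a, s=s: R[s][a]
                           + delta * sum(p * V[s2] for s2, p in P[s][a].items()))
                    for s in STATES}
        if improved == policy:
            return policy, V
        policy = improved
```

Because there are finitely many policies and each improvement step strictly improves the policy's value until a fixed point is reached, the loop terminates at an optimal policy.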
For practical work, free MDP solvers such as the MDPtoolbox package in R provide efficient numerical optimization algorithms that return optimal policies as outputs, given as inputs a finite set of states, a finite set of acts, act-dependent transition probability matrices P(s' | s, a), a reward matrix specifying the immediate reward from taking each act in each state (with the option of having the reward also depend on the next state), and a discount factor, δ. The transition probability matrix for a given policy can be estimated from appropriate data on observed acts and transitions, if such data are available, using statistical software such as the markovchain package in R. The act-specific state transition probabilities and rewards together constitute a causal model linking decisions to the probabilities of consequences, namely, rewards and state transitions. MDP solvers such as MDPtoolbox use this causal model to prescribe what is best to do in each state. They apply value iteration, policy iteration, or other algorithms to solve the MDP and determine the optimal policy.
Once the best act for each state has been specified, the MDP becomes just an ordinary Markov process, with the probabilities for each possible next state fully determined by the current state (and by the optimal act prescribed for that state; since this act in turn is determined by the state, the current state fully determines the probabilities for the next states). The acts are then said to have been optimized out, leaving a purely random (Markov) process that is determined by the initial state (or probability distribution for the initial state, if it is uncertain) and by the optimal policy. The probabilities of sequences of future states, i.e., of the future state trajectories, of this process can now be solved for or simulated using standard algorithms for Markov processes. For example, the markovchain package in R provides straightforward commands for carrying out calculations of future state probabilities and for probabilistic simulation of future state trajectories using specified state transition probabilities. For systems that can be modeled well by MDPs, such software enables predictions of the probabilities of future state trajectories and related quantities, such as the probability distribution of the time at which the optimized MDP will first enter or leave a specified target subset of states, or the probability distributions of the reward or of the times spent in different states over any specified time interval. Thus, excellent predictive analytics are possible for such systems, provided that enough data are available to determine the optimized transition probabilities.
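Simulating the optimized chain requires only its transition matrix. A minimal pure-Python sketch (standing in for the markovchain R package; the transition probabilities below are invented for illustration) draws a trajectory by repeated sampling from the current state's row:

```python
# Simulating a Markov chain after the acts have been optimized out.
# The transition matrix below is an illustrative assumption.
import random

P = {"low":  {"low": 0.2, "high": 0.8},
     "high": {"low": 0.5, "high": 0.5}}

def step(s, rng):
    """Sample the next state from the row of P for current state s."""
    r, cum = rng.random(), 0.0
    for s2, p in P[s].items():
        cum += p
        if r < cum:
            return s2
    return s2  # guard against floating-point rounding of the cumulative sum

def simulate(s0, n_steps, rng):
    """Return a trajectory of n_steps transitions starting from s0."""
    traj = [s0]
    for _ in range(n_steps):
        traj.append(step(traj[-1], rng))
    return traj
```

Long-run occupancy fractions estimated from such trajectories approximate the chain's stationary distribution, and first-passage times into a target subset of states can be read off the simulated trajectories directly.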
MDPs have been used in a variety of practical applications, ranging from optimizing screening for cervical cancer (Akhavan-Tabatabaei et al., 2017) to designing treatment pathways to manage progression of diseases in individual patients by trying to optimize assignment of treatments to disease states (Bala and Mauskopf, 2006). However, the main practical value of MDP methodology is that it serves as a fruitful point of departure for more flexible and realistic decision optimization models.
Improving MDPs: Semi-Markov Decision Processes and Discrete-Event Simulation (DES) Models
In practice, MDPs are often too simple to describe complex systems or diseases because the fundamental Markov assumption does not hold. This is the assumption that the probabilities for the next states depend only on the current state and act, and hence are conditionally independent of past states given the current state (the “memoryless” property of Markov processes). This property also implies that the transition probability (or, in continuous time, the transition intensity, i.e., expected transitions per unit time) into each possible next state does not depend on how long the process has been in the current state. This in turn implies that the dwell time in each state has a geometric distribution for discrete-time systems, or an exponential distribution for continuous-time systems. These are restrictive assumptions in trying to model the real world. They can be relaxed by allowing transition probabilities or intensities from the current state to each possible next state to depend on the time since the system entered the current state; the result is called a semi-Markov process. Still more generally, transition probabilities or intensities for occurrences of events can depend on the history of events (including acts by the d.m. or controller) that have occurred so far and on when they occurred. This very flexible modeling framework is called discrete-event simulation (DES). DES includes simulation of MDPs and semi-Markov decision processes as relatively simple special cases. It can be carried out using free packages such as SimPy in Python (https://simpy.readthedocs.io/en/latest/) or Simmer in R. Considerable ingenuity has been devoted to computational efficiency in such packages, making it practicable to simulate the probabilistic time courses (also called sample paths, trajectories, or histories) of large and complex systems if they are understood well enough so that event occurrence intensities at any moment can be specified as functions of the history to date.
Different events and processes and their interactions can be modeled without having to explicitly enumerate all possible states. For example, if a system has 100 components, each of which can be working or failed at any moment, then there is no need to consider transitions among all 2^{100} (which exceeds 10^{30}) logically possible states. Instead, the failures and repairs of each component can be simulated separately, if they are all independent, or the transition intensities between the working and failed states of each component can be specified to depend only on the states of those other components that affect them. The sparsity of causal connections, or probabilistic dependencies, among the components in many systems allows a great reduction in the computational complexity of DES models.
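The event-scheduling logic underlying packages such as SimPy and Simmer can be conveyed in a few lines of pure Python: keep a priority queue of pending events and always process the earliest one. The component failure/repair model below, with exponential failure and repair times and the particular rates chosen, is an illustrative assumption for the sketch:

```python
# Discrete-event simulation of n independent components that alternate
# between "up" and "down" states. Exponential event times and the rate
# values are illustrative assumptions for this sketch.
import heapq
import random

def simulate_components(n, fail_rate, repair_rate, horizon, seed=0):
    """Return total downtime per component over [0, horizon)."""
    rng = random.Random(seed)
    up = [True] * n
    # Event queue of (event_time, component_index); the first scheduled
    # event for each component is its first failure.
    events = [(rng.expovariate(fail_rate), i) for i in range(n)]
    heapq.heapify(events)
    downtime = [0.0] * n
    went_down_at = [0.0] * n
    while events and events[0][0] < horizon:
        t, i = heapq.heappop(events)
        if up[i]:                       # component i fails at time t
            up[i] = False
            went_down_at[i] = t
            heapq.heappush(events, (t + rng.expovariate(repair_rate), i))
        else:                           # component i is repaired at time t
            downtime[i] += t - went_down_at[i]
            up[i] = True
            heapq.heappush(events, (t + rng.expovariate(fail_rate), i))
    for i in range(n):                  # components still down at the horizon
        if not up[i]:
            downtime[i] += horizon - went_down_at[i]
    return downtime
```

Because each component carries only its own next event, the queue never needs to enumerate the joint state space of all components, which is the point made above about exploiting sparse dependencies.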
Performance of Prescriptive Models
Given a large data set recording the times of observed transitions (or other events) in a large number of human patients or other observational units (e.g., computers and other components in communications networks whose reliability and performance are being assessed), together with the histories of covariates thought to affect the event occurrence rates, a risk analyst typically has a choice of fitting a Markov model, a semi-Markov model, or a more flexible DES model to the data to describe causal relationships among the measured variables and event occurrence rates. The technical challenges of estimating such models from observed event history and covariate data are met by software such as the markovchain, SemiMarkov, and depmixS4 packages in R (Król and Saint-Pierre, 2015). However, using any model to predict the probabilities of future states or state trajectories and to prescribe current decisions based on the model’s assumptions and on available data invites the following crucial questions of model validation, evaluation, and selection:

Model validation and evaluation: How well do the predictions and recommendations from a model actually work, as judged in hindsight and as compared to the predictions and recommendations from other models?

Model selection and combination: Which of the multiple models that can be fit to the data should be used in making predictions and recommendations? How should the predictions and recommendations from multiple models be combined?
For example, how much difference does the extra effort of developing a semi-Markov model, with state transition intensities that depend on elapsed time in a state, make to the quality of predictions and recommendations, compared to fitting a simpler Markov model to the same data?
Chapter 2 introduces modern algorithms (such as RandomForest) that include both model validation and evaluation steps and automatic selection and combination of results from multiple models – an approach often referred to as ensemble modeling that has proved extremely valuable in improving model-based predictions. Meanwhile, practical experience in different domains also offers some guidance on the value of using more sophisticated vs. less sophisticated models. For example, Cao et al. (2016) find that a semi-Markov model fit to data on disease management and outcomes for heart failure patients substantially outperformed a Markov model with the same states, yielding not only much better fits to the data (i.e., better descriptive validity), but also a nearly 50% reduction in mean total cost and about a two-month increase in mean survival times for patients. On the other hand, Simpson et al. (2009) compared Markov and DES modeling for predicting long-term (5-year) clinical and economic outcomes for HIV patients. They found that the predictions from both models matched clinical results well for a one-year forecast horizon, but that the DES model had slightly better 5-year predictive validity than the Markov model and provided more detailed predictions for outcomes. Both models agreed in their recommendations about which of several alternative treatment regimens being evaluated generated the lowest costs and best results for disease management. Bennett and Hauser (2013) estimate that applying MDPs and related models, including methods for “partially observable” health states discussed later in this section, to electronic health records can have very substantial benefits in healthcare, increasing measures of health improvement for patients by about 50% while cutting costs approximately in half.
As advances in analytics software and access to plentiful data make it easier to fit more flexible and detailed models to available data, it will become more desirable to use semi-Markov and DES causal models instead of simpler Markov models to achieve the gains in descriptive and predictive validity and the finer detail in predictions that the more sophisticated models offer.
Dynamic Optimization and Deterministic Optimal Control
Markov decision processes (MDPs) have the simplifying feature that the d.m. is assumed to know the current state. The state represents all the conditions other than the d.m.’s act that affect probabilities of outcomes, i.e., rewards and next states. In many practical applications, however, the d.m.’s knowledge is far less complete than this: all the d.m. knows are the values of some observed variables that depend probabilistically on the state but do not necessarily fully reveal it. In this case, the concept of a policy or decision rule must be generalized. It is now a rule specifying what act to take as a function of the information available to the d.m. at the time – that is, the values of observed variables – rather than as a function of the true underlying state. An optimal policy or decision rule maximizes the expected value of some objective function conditioned on available information, assuming that future decisions will also be made optimally with respect to the future information that will be available when they are made. Formalizing these ideas has led to useful techniques for solving both stochastic (random) and deterministic dynamic programming and optimal control problems. These techniques have been extensively developed by mathematicians, control engineers, and operations researchers since the 1960s. Practical solution techniques today are largely driven by advances in machine learning, especially variations on reinforcement learning, discussed later.
Some important special cases of the general problem of devising high-performing decision rules are as follows. A discrete-time deterministic optimal control problem has the form shown in equations (1.6a) through (1.6c).
x(t + 1) = f(x(t), u(t)) (1.6a)
y(t) = g(x(t), u(t)) (1.6b)
u(t) = h(y(t - 1)) (1.6c)
Following conventional notation, the state of the dynamic system in time period t is denoted by x(t) and the controllable input to the system at time t is denoted by u(t); both of these can be vectors if the state and controls have multiple components. These are the same as the state s and the act a of normal form decision analysis, but are renamed x and u in deference to control system tradition and equipped with the time index (t), often shown as a subscript, to explicitly index time periods. The dynamic evolution of the system is described by the difference equation (1.6a), where f is the deterministic law mapping the current state and input to the next state. This is sometimes called the “plant dynamics equation” or “plant model” in control systems engineering for power plants and other industrial facilities. If it changes over time, then equation (1.6a) can be generalized to x(t + 1) = f(x(t), u(t), t), or, in subscript notation, x_{t+1} = f_{t}(x_{t}, u_{t}). Equation (1.6b) shows that the observable output for period t, denoted by y(t), is fully determined by the state and the input decisions (which may include decisions to sample, observe, or experiment) that are taken then, x(t) and u(t), respectively. The output function g() describes this dependency. The observed output (or outputs, if y(t) is a vector) can be used to inform decision-making. Equation (1.6c) shows the form of a feedback control policy in which the input in each period is determined by the observed output from the previous period; more generally, it may depend on observed outputs from many previous periods, i.e., on all the information or history observed so far.
Designing the feedback function h() to optimize an objective function or cost function, which may depend on both the control decisions u(t) and the state and output trajectories (e.g., on whether the system ends up in a desired target set of states, and how quickly it gets there) is the central challenge for discretetime optimal control.
Equation (1.6a) shows that the state at the start of the next period depends only on the state at the start of this period and the control u(t) applied in this period. It constitutes a deterministic dynamic model of the system or situation being controlled, and summarizes the causal relationship between each state and input and the next state. For example, if x(t) represents the number of fish in a commercial fishery, the number of items in an inventory, or the number of customers who have not yet received a new offer at the start of the current period (e.g., the current year, day, or minute, respectively); and if u(t) represents the number of fish caught, the net number of items removed from the inventory, or the number of customers who receive an offer during the current period, respectively, then the system dynamics might be modeled by the first-order linear difference equation x(t + 1) = (1 + r)x(t) - u(t), where the growth rate r = 0 for the inventory and offer examples but positive for the fishery example. More interesting dynamics result from nonlinearities. Redefining x(t) as the current fraction of the fishery’s maximum carrying capacity and setting u(t) = 0 for the quantity lost to fishing in each period, the logistic difference equation x(t + 1) = 4x(t)*(1 - x(t)) displays chaotic behavior starting from most initial states, although of course 0 is a trapping state that is reached from initial values for x of 0, 1, or 0.5. Such difference equations can easily be solved, starting from any initial state, by substituting consecutive states into equation (1.6a), generating each from its predecessor. For example, the following R script plots the first 60 periods of the chaotic trajectory of x(t), starting from an initial value in period 1 of x[1] = 0.2, for the logistic difference equation x(t + 1) = 4x(t)*(1 - x(t)):
N <- 60; x <- 1:N; x[1] <- 0.2; time <- 1:N
for (t in 2:N){x[t] <- 4*x[t-1]*(1-x[t-1])}
plot(time, x); lines(x)
In continuous time, the causal model (1.6a) is replaced by the following ordinary differential equation (ODE) model:
dx/dt = f(x(t), u(t)) (1.7)
(If the dynamics are time-dependent, then this ODE is generalized to dx/dt = f(x(t), u(t), t).) The trajectory of x values starting from any initial state and given any input history (or a closed-loop feedback control law for determining u(t) from observations so far) is derived by integrating this ODE, and freely available, high-quality numerical integration software provides the necessary tools for practitioners to do so. Continuous-time deterministic simulation modeling and solver software packages, languages, and environments, such as the deSolve package in R, return numerical solutions to systems of ODEs such as equation (1.7) (and also numerical solutions to partial differential equations, delay differential equations, and differential algebraic equations, in case the relevant causal model has one of these forms).
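As an alternative to calling a solver package, a fixed-step fourth-order Runge-Kutta integrator for equation (1.7) takes only a few lines. The Python sketch below integrates an illustrative instance: logistic growth with a constant harvesting rate, dx/dt = r·x·(1 - x) - u. The dynamics and the parameter values are assumptions chosen for the example:

```python
# Fixed-step fourth-order Runge-Kutta (RK4) integration of
# dx/dt = f(x, u(t)), as a minimal stand-in for an ODE solver package.
# The logistic-growth-with-harvesting example and its parameter values
# are illustrative assumptions.
def rk4(f, x0, u, t0, t1, n_steps):
    """Integrate dx/dt = f(x, u(t)) from t0 to t1; return the x trajectory."""
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    xs = [x0]
    for _ in range(n_steps):
        k1 = f(x, u(t))
        k2 = f(x + 0.5 * h * k1, u(t + 0.5 * h))
        k3 = f(x + 0.5 * h * k2, u(t + 0.5 * h))
        k4 = f(x + h * k3, u(t + h))
        x += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
        xs.append(x)
    return xs

r = 1.0                                 # intrinsic growth rate (assumed)
f = lambda x, u: r * x * (1 - x) - u    # logistic growth minus harvest rate
xs = rk4(f, x0=0.2, u=lambda t: 0.09, t0=0.0, t1=30.0, n_steps=3000)
```

With this constant harvest rate, the stock has a stable equilibrium at x = 0.9 (and an unstable one at x = 0.1), so a trajectory started at 0.2 climbs toward 0.9, which the integration reproduces.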
Solving the optimal control problem to determine an optimal policy for selecting u(t) given the information available at time t, which typically consists of the history of inputs and observed outputs until then, requires specialized techniques. Classical solution algorithms draw on methods of dynamic programming, the calculus of variations, Pontryagin’s maximum principle for optimal control, the Hamilton-Jacobi-Bellman equation, and related methods of dynamic optimization; these are well synthesized and explained by Kamien and Schwartz (2012). Sophisticated algorithms based on these methods are increasingly being made available to non-specialist practitioners via relatively easy-to-use solvers for optimal control problems, such as the open-source BOCOP (http://bocop.org/) package and the DIDO commercial solver (www.elissarglobal.com/). However, the vast and sophisticated older literature on deterministic and stochastic optimal control is largely being overtaken by developments and software emerging from machine learning, especially reinforcement learning algorithms discussed later in this chapter.
Example: Optimal Harvesting
The study of dynamic optimization and optimal control of dynamic systems often yields intuitively appealing, economically interpretable, qualitative insights into the nature of optimal control policies in many interesting real-world applications. Indeed, in a famous essay, Dorfman (1969) showed that the equations of the Pontryagin maximum principle used in optimal control can be derived and interpreted by purely economic reasoning, e.g., by making moment-by-moment decisions to equate the instantaneous marginal gains from consumption or use of a stock of capital or resource to the marginal value of its contribution to the growth of the stock. Along optimal control trajectories, the stock of capital depreciates at the same rate that it contributes to the net present value of present and future consumption. Dynamic optimization models have been used to characterize optimal harvesting and extraction policies over time for both renewable resources such as forests (Amacher et al., 2009) and fisheries and for nonrenewable resources such as mineral deposits. Optimal policies balance the value of present vs. future consumption while taking into account time preferences (e.g., discount factors) that typically give extra weight to earlier relative to later consumption.
Starting from an initial age-structured population of trees or fish in a managed forest or fishery, respectively, an optimal control law specifies how much of the current population should be harvested, and which individuals should and should not be harvested if they differ in maturity or other attributes, in each period to maximize the value of the stream of consumption of the harvested resource over time. Typical questions analyzed in deterministic optimal control deal with the existence and uniqueness of optimal controls (i.e., harvesting trajectories); whether they are periodic; the amplitudes and lengths of periodic cycles in harvesting if they exist; how to allocate harvesting effort across areas if the resource is spatially distributed; whether the optimal control laws have simple forms (e.g., harvest all trees over a certain age, and no others); and how the answers to these questions depend on problem parameters such as the discount rate, age-specific growth rates, the prices or values of the harvested resource, and the initial size and composition of the population (Amacher et al., 2009). By contrast, stochastic control models, discussed next, must also consider how probabilities of catastrophic loss (e.g., forest fires), extinction of the population, and random variations in growth rates and prices affect optimal harvesting.
Stochastic Optimal Control, Hidden Markov Models, and Partially Observable Markov Decision Processes (POMDPs)
Key ideas of deterministic optimal control can be generalized to probabilistic systems. One way to do so is to replace the causal model described by the deterministic ODE equation (1.7) with a probabilistic generalization, corresponding to the discrete-time random difference equation
x(t + 1) = f(x(t), u(t), w(t), t) (1.7a)
where w(t) is a random variable (called a disturbance term) representing uncontrolled inputs from the environment. Constructing a continuous-time analog requires some mathematical care and yields a stochastic differential equation (SDE) that can be solved via stochastic (Itô-Stratonovich) integration (Bertsekas and Shreve, 1996). If the state dynamics are described by time-invariant causal laws, as in many physics and engineering systems, then (1.7a) simplifies because the causal laws do not depend on time:
x(t + 1) = f(x(t), u(t), w(t)) (1.7b)
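To make equation (1.7b) concrete, the sketch below simulates the inventory example from the start of this chapter as a controlled stochastic difference equation. The fixed ordering policy, the demand distribution, and the horizon are all illustrative assumptions, not specifics from the text:

```python
import random

def step(x, u, w):
    """Time-invariant state update f(x, u, w): next inventory equals
    current stock plus units ordered minus units demanded, floored at 0."""
    return max(x + u - w, 0)

random.seed(0)
x = 5                         # initial inventory level (assumed)
trajectory = [x]
for t in range(10):
    u = 3                     # hypothetical fixed policy: order 3 units each period
    w = random.randint(0, 5)  # random demand, playing the role of the disturbance w(t)
    x = step(x, u, w)
    trajectory.append(x)
print(trajectory)
```

Because demand w(t) is random, repeated runs with different seeds trace out different state trajectories from the same policy, which is exactly what distinguishes (1.7b) from its deterministic counterpart.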
Modeling stochastically evolving systems via solution of SDEs has yielded important results and applications in mathematical finance, including the Black-Scholes formula for pricing options and derivatives, as well as in the study of diffusion processes and more general stochastic processes in physics and engineering. A well-developed field of stochastic optimal control deals with both discrete-time and continuous-time optimal control of systems in which the state dynamics include randomness, evolving according to a controlled stochastic process. In many such problems, the observed outputs have conditional probability distributions that depend on the underlying state, but the underlying state is not observed directly. Stochastic dynamic programming provides a fruitful framework for the study and solution of such problems (Bertsekas and Shreve, 1996). We defer further discussion of solution methods for stochastic dynamic programming problems to the discussion of causality in learning.
It is also possible to generalize the comparatively simple MDP model with probabilistic causal dynamics given by the transition probabilities P(s′ | a, s), which could also be written in discrete-time control system notation as
P(x_{t+1} | x_t, u_t) (1.7c)
where x_{t+1} = s′ is the next state; x_t = s is the current state; u_t = a is the current act; and the transition probability P(x_{t+1} | x_t, u_t) is the conditional probability that the next state will be x_{t+1}, given that the current state is x_t and that the current act is u_t. The key idea for the generalization is that the decision-maker or controller now sees only a signal that depends probabilistically on the underlying state, rather than observing the true state directly. The observation equation (1.6b) can be generalized to describe such probabilistic observations by specifying an observation function, experiment, or information channel (all three terms are used) as follows:
P(y | a, s) = probability of seeing signal y if state is s and act a is taken (1.8)
Equivalently, this idea can be expressed using systems theory notation by generalizing the observation equation (1.6b) to include a measurement error or observation noise term, v(t):
y(t) = g(x(t), u(t), v(t), t) (1.8a)
where v(t) is a random variable. (The time argument t can be dropped unless the observation structure itself changes over time.) In discrete time control system notation, equation (1.8) can also be shown as
P(y_t | x_t, u_t) (1.8b)
Here, x_t corresponds to the state s and u_t corresponds to the act a for period t. This formulation assumes that the state holds at the start of a period; an act is taken at that time; and then an observation results that depends on the act and the state for that period. Slightly different formulations use different timing conventions, such as having the current state and act determine the probabilities of observations at the start of the next period. No matter which timing convention is used, the essential point is that the true state is now an unobserved variable (also called a hidden or latent variable) that affects the probabilities of observations. Therefore, decisions must be made based only on observations rather than on perfect knowledge of the state. In many engineering applications, the observation equation is assumed to have the special form
y(t) = g(x(t), u(t), t) + v(t) (1.8c)
where v(t) is assumed to be a Gaussian (i.e., normally distributed) additive noise term.
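To ground the discrete observation channel P(y | a, s) of equation (1.8), the sketch below tabulates signal probabilities for a hypothetical two-state inspection problem and samples a signal from them. The states, act, signals, and probabilities are all invented for illustration:

```python
import random

# Hypothetical information channel: P(y | a, s) for act a = "inspect"
# and hidden machine conditions s in {"ok", "worn"}.
channel = {
    ("inspect", "ok"):   {"pass": 0.9, "fail": 0.1},
    ("inspect", "worn"): {"pass": 0.3, "fail": 0.7},
}

def observe(a, s):
    """Draw a signal y with probability P(y | a, s)."""
    dist = channel[(a, s)]
    signals, probs = zip(*dist.items())
    return random.choices(signals, weights=probs, k=1)[0]

random.seed(1)
print(observe("inspect", "worn"))  # a single sampled signal, "pass" or "fail"
```

Note that the signal is informative but not conclusive: a "pass" is more likely when the machine is ok, yet a worn machine still passes 30% of the time, which is what forces decisions to be based on beliefs rather than known states.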
A causal model consisting of a Markov chain with information about the state obtained only via an observation function such as equation (1.8) is called a Hidden Markov Model (HMM). HMMs are widely used to model unobserved causal processes in applications as diverse as handwriting recognition (observed y = handwritten words, unobserved x = the motions executed to produce them) and cancer progression modeling, where clinical measurements and symptoms provide probabilistic signals about the progression of the underlying disease (Taghipour et al., 2017). Software for estimating HMMs and hidden semi-Markov models (i.e., models in which transition intensities from the current state to possible next states change with elapsed time in the present state) from sequences of data on observed transition times includes the R packages msm and mhsmm, respectively.
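A minimal HMM can be simulated directly from its transition and emission probabilities. The two-state sketch below is loosely patterned on the disease-progression application; every probability and label is invented for illustration:

```python
import random

# Illustrative two-state HMM: hidden states x in {"healthy", "sick"},
# observed signals y in {"normal", "abnormal"}. Numbers are assumptions.
trans = {"healthy": {"healthy": 0.95, "sick": 0.05},
         "sick":    {"healthy": 0.10, "sick": 0.90}}
emit  = {"healthy": {"normal": 0.9, "abnormal": 0.1},
         "sick":    {"normal": 0.2, "abnormal": 0.8}}

def draw(dist):
    """Sample one value from a {value: probability} distribution."""
    vals, probs = zip(*dist.items())
    return random.choices(vals, weights=probs, k=1)[0]

random.seed(0)
x = "healthy"
hidden, observed = [], []
for t in range(8):
    observed.append(draw(emit[x]))   # signal depends probabilistically on the hidden state
    hidden.append(x)
    x = draw(trans[x])               # Markov transition to the next hidden state
print(observed)                      # what the analyst sees; `hidden` stays unobserved
```

Estimation packages such as msm work in the opposite direction: given only sequences like `observed`, they infer the transition and emission probabilities that generated them.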
A partially observable Markov decision process (POMDP) is an MDP with hidden states. Inputs must be selected based only on observations – that is, on the data obtained via an observation function such as (1.8b) – to optimize an objective function such as the expected discounted value of rewards or the time to first reach or first exit from a target subset of states. In formal discussions, it is usual to represent a POMDP by a tuple such as (S, A, T, R, Y, g, b), where S denotes the set of states, A the set of possible acts, T the transition probability matrix with elements P(s′ | a, s), R the reward matrix with elements R(a, s) or R(a, s, s′), Y the set of possible signals, g the observation function specifying P(y | a, s), and b a probability distribution for the initial state of the POMDP. The probability distribution b may also be thought of as a Bayesian prior probability distribution or belief about the initial state of the system. After each act and each observation, this belief is updated (via conditioning, e.g., using Bayes' Rule, which is discussed in Chapter 2) to yield a new belief, or probability distribution for the states, given the history of acts and observations so far. Mathematically, the transitions among these successive beliefs, each of which can be represented by a vector of state probabilities, can be viewed as taking place within a larger MDP whose states are the probability distributions for the states of the underlying POMDP. From this perspective, it is natural to consider applying MDP solution techniques such as value iteration to determine the best sequence of actions, given the history of observations. However, in many practical applications, the number of states in the larger MDP is so large that approximation techniques must be used.
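The belief update just described follows directly from Bayes' Rule: the new belief assigns each next state a probability proportional to the observation likelihood times the predicted probability of reaching that state. The tiny two-state example below uses invented transition and observation probabilities:

```python
# Sketch of the POMDP belief update b -> b' after taking act a and seeing signal y.
# T[s][a][s2] = P(s2 | a, s); O[a][s2][y] = P(y | a, s2). All numbers are illustrative.
T = {"good": {"treat": {"good": 0.9, "bad": 0.1}},
     "bad":  {"treat": {"good": 0.4, "bad": 0.6}}}
O = {"treat": {"good": {"pos": 0.8, "neg": 0.2},
               "bad":  {"pos": 0.3, "neg": 0.7}}}

def belief_update(b, a, y):
    """Bayes' Rule: b'(s2) is proportional to P(y | a, s2) * sum_s P(s2 | a, s) b(s)."""
    unnorm = {s2: O[a][s2][y] * sum(T[s][a][s2] * b[s] for s in b)
              for s2 in b}
    z = sum(unnorm.values())          # P(y | a, b), the normalizing constant
    return {s2: p / z for s2, p in unnorm.items()}

b0 = {"good": 0.5, "bad": 0.5}        # prior belief over the hidden states
b1 = belief_update(b0, "treat", "pos")
print(b1)                             # posterior belief after observing "pos"
```

Each such posterior belief vector is one state of the larger "belief MDP" mentioned above; point-based solvers economize by computing updates like this only at a sampled set of reachable belief vectors.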
Excellent results with performance guarantees for the difference between achieved and theoretically optimal results can be obtained by "point-based" sampling approaches that update beliefs only on a sample of points (i.e., probability vectors) reachable from the initial belief vector via a sequence of actions (Shani et al., 2013). POMDP solvers are now widely available, including pomdp-solve (written in C), MDPSOLVE (written in MATLAB), ZMDP (written in C++), and others (https://github.com/JuliaPOMDP). These implement both exact solution methods for small POMDP problems and sophisticated numerical solvers for larger-scale POMDPs.
Example: Administering Antibiotics to Avoid Septic Shock
Tsoukalas et al. (2015) consider the dynamic decision problem of determining which antibiotics to administer to a patient, when, and at what doses, to control sepsis and avoid potentially lethal responses to bacterial infections. Figure 1.4 shows a clinical decision support dashboard developed for this purpose. In the upper left-hand corner are the subjectively assessed rewards for each unit of time that a patient spends in each health state; these may be thought of as expected utilities. The health states considered range from Healthy at the top of the list down through increasingly deteriorated states ending in Septic Shock and then Death. The goal of decision-making in this case is to postpone or reverse transitions from better to worse states by administering antibiotics at appropriate times, doses, and formulations to maximize the expected reward from the time spent in various health states.