Table 1.4 A machine learning challenge: What outcome should be predicted for case 7 based on the data in cases 1–6?

Case   Predictor 1   Predictor 2   Predictor 3   Predictor 4   Outcome
1      1             1             1             1             1
2      0             0             0             0             0
3      0             1             1             0             1
4      1             1             0             0             0
5      0             0             0             0             0
6      1             0             1             1             1
7      1             1             0             1             ?

Suppose that cases 1–6 constitute a "training set", with four predictor columns and one outcome column (the rightmost) to be predicted from them. The challenge for predictive analytics or modeling in this example is to predict the outcome for case 7 (the value, either 0 or 1, in the "?" cell in the lower right of the table). For example, predictors 1–4 might represent various features (1 = present, 0 = absent) of a chemical, or perhaps results of various quick and inexpensive assays for the chemical (1 = positive, 0 = negative). The outcome might indicate whether the chemical would be classified as a rodent carcinogen in relatively expensive two-year live-animal experiments. Chapter 2 reviews a variety of machine-learning algorithms for inducing predictive rules or models from such training data. But identifiability places hard limits on what can be learned and on the accuracy of predictive models learned from data. No algorithm can provide trustworthy predictions for the outcome in case 7 based on the training data in cases 1–6, since many different models fit the training data equally well but make opposite predictions. For example, the following two models each describe the training data in rows 1–6 perfectly, yet they make opposite predictions for case 7:

Model 1: Outcome = 1 if the sum of predictors 2, 3, and 4 exceeds 1, else 0

Model 2: Outcome = value of Predictor 3.
Likewise, these two models would make opposite predictions for a chemical with predictor values of (0, 0, 1, 0). If these models are interpreted causally, with changes in the predictors on the right side causing the dependent variable (Outcome) on the left side to change to make the equality hold, then Model 1 would imply that setting any two of predictors 2, 3, and 4 equal to 1 would suffice to achieve an Outcome value of 1, but Model 2 would imply that this can be achieved only by setting predictor 3 equal to 1. Additional models or prediction rules such as

Model 3: Outcome is the greater of the values of predictors 1 and 2 except when both equal 1, in which case the outcome is the greater of the values of predictors 3 and 4

Model 4: Outcome is the greater of the values of predictors 1 and 2 except when both equal 1, in which case the outcome is the lesser of the values of predictors 3 and 4
also describe the training data, but make opposite predictions for case 7. Thus, it is impossible to confidently identify a single correct model structure from the training data in this case (the data-generating process is non-identifiable from the training data), and no predictive analytics or machine learning algorithm can determine from these data a unique model (or set of prediction rules) for correctly predicting the outcome for new cases or situations, or for determining how manipulating the values of the predictors, if some or all of them can be controlled by a decision-maker, will affect the value of the outcome variable. Chapter 2 discusses conditions under which unique causal or predictive models can be identified from available data and what to do when this is impossible.
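The non-identifiability can be checked directly. The following Python sketch (the tuple encodings simply transcribe Table 1.4; this is one possible way to code the models, not part of the original text) implements Models 1–4, confirms that each reproduces all six training outcomes, and shows that they split evenly on case 7:

```python
# Models 1-4 from the text, coded as functions of the predictor tuple
# p = (p1, p2, p3, p4); the training pairs below transcribe Table 1.4.
train = [
    ((1, 1, 1, 1), 1),  # case 1
    ((0, 0, 0, 0), 0),  # case 2
    ((0, 1, 1, 0), 1),  # case 3
    ((1, 1, 0, 0), 0),  # case 4
    ((0, 0, 0, 0), 0),  # case 5
    ((1, 0, 1, 1), 1),  # case 6
]

def model1(p):  # 1 if predictors 2, 3, and 4 sum to more than 1, else 0
    return 1 if p[1] + p[2] + p[3] > 1 else 0

def model2(p):  # outcome = value of predictor 3
    return p[2]

def model3(p):  # max of predictors 1, 2, unless both are 1: then max of 3, 4
    return max(p[2], p[3]) if p[0] == p[1] == 1 else max(p[0], p[1])

def model4(p):  # same, but min of predictors 3, 4 in the exceptional case
    return min(p[2], p[3]) if p[0] == p[1] == 1 else max(p[0], p[1])

models = [model1, model2, model3, model4]

# Each model reproduces every training outcome exactly...
assert all(m(p) == y for m in models for (p, y) in train)

# ...yet they disagree about case 7, whose predictors are (1, 1, 0, 1).
print([m((1, 1, 0, 1)) for m in models])  # [1, 0, 1, 0]
```

Since all four models fit the training data perfectly, no amount of further computation on these six cases can distinguish among them.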
Example: Anomaly Detection, Predictive Maintenance, and Cause-Specific Failure Probabilities
When a complex system fails, it is often possible in retrospect to identify precursors and warning signs that might have helped the system's operators to realize that failure was imminent. This has inspired the development of anomaly detection and predictive maintenance algorithms that seek patterns in data that can help to predict costly failures before they happen, so that maintenance can be performed to prevent them. One very useful principle is to train an autoencoder to predict a system's outputs from its inputs during normal operation. This trained autoencoder then serves as a model for normal, problem-free operation. An anomaly is detected when the observed outputs stop matching the autoencoder's predicted outputs. (For implementation details, see the free TensorFlow or H2O package documentation on autoencoders, e.g., at http://amunategui.github.io/anomalydetectionh2o/.) Using discrepancies between observed and expected normal behaviors provides a powerful way to detect fraud in financial systems, cyberattacks, and other perturbations of normal operations by intelligent adversaries, as well as changes in the performance of system components. Anomaly detection algorithms provide a way to automatically notice the early warning signs of altered input-output behaviors that can show that a complex engineering system – or, for that matter, a human patient – is losing normal homeostatic control and may be headed for eventual system failure.
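Full autoencoder implementations are available in TensorFlow and H2O; as a minimal stand-in for the reconstruction-error idea, the following Python sketch uses a linear one-unit "autoencoder" (equivalently, a principal-component projection, which is the optimal linear encoder/decoder) trained on hypothetical normal-operation sensor data. All data and thresholds here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "normal operation" data: 3 sensor channels that move together
# (they lie near a 1-dimensional subspace), plus small measurement noise.
t = rng.normal(size=(500, 1))
normal = t @ np.array([[1.0, 0.5, -0.3]]) + 0.01 * rng.normal(size=(500, 3))

# "Train" a linear one-unit autoencoder: the top principal component gives
# the optimal linear encode/decode pair for this data.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
project = vt[:1].T @ vt[:1]        # encode then decode: project onto subspace

def reconstruction_error(x):
    reconstructed = (x - mean) @ project + mean
    return float(np.linalg.norm(x - reconstructed))

# Set the alarm threshold from errors observed during normal operation.
errors = [reconstruction_error(x) for x in normal]
threshold = np.percentile(errors, 99)

ok = np.array([1.0, 0.5, -0.3])    # channels still move together: no alarm
odd = np.array([1.0, -2.0, 1.5])   # relationship is broken: anomaly
print(reconstruction_error(ok) <= threshold, reconstruction_error(odd) > threshold)
```

A nonlinear autoencoder generalizes this sketch by learning a curved rather than flat "normal" manifold, but the detection logic, flagging observations whose reconstruction error exceeds a threshold calibrated on normal data, is the same.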
Several companies, including IBM and Microsoft, offer commercial predictive maintenance software products that go beyond merely detecting anomalies. They apply deep learning algorithms – especially Long Short-Term Memory (LSTM) algorithms, for learning from time series with long time lags between initiating events and the consequences that they ultimately cause – to predict specific causes or modes of system failure and to identify components or subsystems that are at risk. They also quantify the probability distributions of remaining times until failure is predicted to occur from various causes (Liao and Ahn, 2016). This information can then be displayed via a predictive analytics dashboard that highlights which failures are predicted to occur next, quantifies the probability distributions for remaining time until they occur, and recommends maintenance actions to reduce failure risks and increase the remaining time until failure occurs. In industrial applications, predictive maintenance has significantly reduced both maintenance costs (by avoiding unnecessary maintenance) and the costs of failures (by targeting maintenance to prevent failures or reduce failure rates). Similar algorithms have recently started to be applied in health care, for example, to predict heart failure diagnoses from patients' electronic health records (EHRs) (Choi et al., 2017).
Identifying from data the symptoms of fault conditions or potential causes that can eventually lead to system failure is one key challenge for predictive analytics. Conversely, identifying the potential long-term consequences of currently observed aberrations in subsystem or component performance is another. Practical algorithms and software for meeting these challenges are now advancing rapidly under the impetus of new ideas and methods for machine learning. In doing so, they are clarifying methods for learning about delayed relationships between causes and their effects from time series data. Chapter 2 will discuss further the important principle that causes help to predict their effects, and how this principle can be used to draw inferences about causation between variables from data recording the values of different variables over time.
Causality Models Used in Prescriptive Analytics
Prescriptive analytics addresses the question of how to use data to decide what to do next. It uses a combination of data, causal modeling or assumptions, and optimization to decide on a best course of action. To do so, it is common practice to model how probability distributions of outcomes would change if different actions were taken. In very simple decision analysis models, outcome probabilities can simply be tabulated for different acts. The "best" act, as defined by certain axioms of rational choice, is then one that maximizes expected utility. This rule is explained and justified in detail in decision analysis.
Normal-Form Decision Analysis
In mathematical notation, the expected utility (EU) of an act a is defined by the following sum:
EU(a) = Σ_c u(c)*P(c | a)     (1.1)
Here, EU(a) is the expected utility of act a, c is a consequence or outcome, u(c) is the utility of consequence c, and P(c | a) is the conditional probability of consequence c if the decision-maker (d.m.) chooses act a. (If needed, the beginning of Chapter 2 provides a quick review of probability and conditional probability concepts and notation; see equations (2.1) and (2.2). The remainder of this section assumes familiarity with both.) P(c | a) is a causal model of the probabilistic relationship between exogenous acts and their consequences. In words, equation (1.1) says that the expected utility of an act is the mean or average value of the utilities of the consequences that it might cause, weighted by their respective probabilities.
Normative decision analysis describes choosing among alternatives as choosing among different sets of outcome probabilities. Prescriptively, the choice is to be made so as to maximize expected utility or, equivalently, to minimize expected loss. In each case, a decision is represented by the outcome probabilities that it causes.
Example: Identifying the Best Act in a Decision Table
Suppose that a decision-maker (d.m.) must choose between two acts, acts 1 and 2, represented by rows in Table 1.5. The consequence of each act depends on which of three possible states, 1, 2, or 3, occurs. The probabilities of states 1, 2, and 3 are 0.2, 0.3, and 0.5, respectively, as shown in the bottom row of Table 1.5. The cells of the table show the rewards, payoffs, or expected utilities, expressed on a scale from 0 to 10, for choosing each act if each state occurs. (The two endpoints of a von Neumann–Morgenstern utility function can be chosen arbitrarily, much as the numbers corresponding to the boiling and freezing points of water can be chosen arbitrarily on a temperature scale. Both temperature and utility are measured on interval scales, and such a scale is uniquely determined by specifying the values of two points on it, such as by making 0 the value of the least-preferred outcome and 10 the value of the most-preferred outcome. Luce and Raiffa (1957) provide an excellent full exposition of utility theory and normal-form decision analysis.)
Table 1.5 A simple example of a normal form decision table with two acts and three states

           state 1            state 2            state 3
act 1      3                  1                  4
act 2      1                  5                  9
           P(state 1) = 0.2   P(state 2) = 0.3   P(state 3) = 0.5
In Table 1.5, the utility of the consequence if the d.m. chooses act 2 and state 2 occurs is 5; the utility from act 2 if state 3 occurs is 9; and the utility from act 2 if state 1 occurs is 1. If a particular choice of act combined with a particular state does not determine a unique outcome as the consequence, but only a conditional probability distribution of outcomes, then the numbers in the cells of Table 1.5 should be interpreted as expected utilities (i.e., expected values of the utilities) of the random outcomes for the different act-state pairs. This formulation of a decision problem, in which the d.m. chooses an act (i.e., a row of the table) from a set of feasible alternatives; "Nature" or "Chance" chooses a state (i.e., a column of the table) at random, according to known probabilities, from a set of possible states; and the intersection of the selected act and the selected state determines an outcome (or, more generally, the conditional probabilities of different outcomes or consequences, each having a known utility), is called the normal form of decision analysis (Luce and Raiffa, 1957).
Problem: Given the decision problem data in Table 1.5, which act, act 1 or act 2, should the d.m. choose? Assume that the goal of decision-making is to maximize expected utility.
Solution: The expected utility if act 1 is chosen is 3*0.2 + 1*0.3 + 4*0.5 = 2.9. The expected utility if act 2 is chosen is 1*0.2 + 5*0.3 + 9*0.5 = 6.2. Since 6.2 > 2.9, the d.m. should choose act 2.
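The same calculation can be sketched in a few lines of Python (the act labels and payoffs are simply those of Table 1.5):

```python
# Expected utilities for Table 1.5 via equation (1.1): EU(a) = sum of u(c)*P(c | a).
state_probs = [0.2, 0.3, 0.5]                  # P(state 1), P(state 2), P(state 3)
utilities = {"act 1": [3, 1, 4], "act 2": [1, 5, 9]}

eu = {act: sum(u * p for u, p in zip(us, state_probs))
      for act, us in utilities.items()}

for act, value in eu.items():
    print(act, round(value, 6))                # act 1: 2.9, act 2: 6.2
print("choose:", max(eu, key=eu.get))          # choose: act 2
```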
Table 1.5 represents the probabilistic causal relationship between acts and the outcomes that they cause in the form of a decision table. In such a table, acts are represented by rows; states (meaning everything other than the act that helps determine the outcome) are represented by columns; and state probabilities (shown in the bottommost row) are assumed to be known and to be independent of the act. The numbers in the cells of the table show the (von Neumann–Morgenstern) utilities to the decision-maker (d.m.) of different outcomes. Each outcome is determined by an (act, state) pair. These ingredients – acts, states, consequences, state probabilities, and utilities – are all of the elements of a normal-form decision analysis. A decision table is a simple causal model: in Table 1.5, choosing act 1 causes outcome probabilities of 0.2 for an outcome with a utility of 3, probability 0.3 for an outcome with utility 1, and probability 0.5 for an outcome with utility 4. This idea of representing acts by the probabilities of the outcomes that they cause (or, as in Table 1.5, the probabilities of the utilities of those outcomes) is often formalized in technical work by representing decision problems as choices among cumulative distribution functions (CDFs), probability density functions (PDFs), or probability measures, which are then used to calculate expected utilities or expected losses. In each case, however, all relevant differences among decision alternatives are assumed to be captured by differences in the outcome probabilities that they cause.
Causal models can be more elaborate, of course. One generalization is to let the conditional probabilities of the states depend on which act the d.m. selects. The choice of act can then affect outcome probabilities both directly, via their dependence on the act, and also indirectly via their dependence on the state, which in turn depends probabilistically on the act. In notation, the probability of consequence c if act a is selected is given by the following sum, expressing the law of total probability:
P(c | a) = Σ_s P(c | a, s)*P(s | a)     (1.2)
where the sum is over all possible states, s. In words, the probability of a particular consequence of an act is the sum of the probabilities that it occurs in conjunction with each possible state, weighted by the probability of that state given the selected act. This type of causal structure arises, for example, if the decision variable a is the price per unit that a retailer charges for a good; the state s is the number of units of the good purchased at that price; and the consequence of selling s units at price a is the revenue, a*s. (In this case, P(c | a, s) = 1 for c = a*s and 0 otherwise.)
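This retailer example can be made concrete with a short Python sketch of equation (1.2); the candidate prices and demand distributions below are invented purely for illustration:

```python
# Equation (1.2) for the retailer: the act a is the unit price, the state s is
# units sold, and the consequence is revenue c = a*s, so P(c | a, s) = 1 when
# c = a*s and 0 otherwise. The demand distributions are hypothetical numbers.
demand_given_price = {         # P(s | a) for two hypothetical candidate prices
    10: {5: 0.2, 10: 0.5, 20: 0.3},
    15: {5: 0.5, 10: 0.4, 20: 0.1},
}

def revenue_distribution(a):
    """P(c | a) = sum over states s of P(c | a, s) * P(s | a)."""
    dist = {}
    for s, p in demand_given_price[a].items():
        c = a * s                  # the consequence is deterministic given (a, s)
        dist[c] = dist.get(c, 0.0) + p
    return dist

for price in (10, 15):
    dist = revenue_distribution(price)
    print(price, dist, "expected revenue:", sum(c * p for c, p in dist.items()))
```

With these assumed numbers, the higher price yields the higher expected revenue; the point of the sketch is only the structure of the total-probability calculation, not the particular values.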
The following examples illustrate how the main ideas of normal-form decision analysis can be applied even in cases where there are too many acts or states to be compactly summarized in a small decision table.
Example: Optimizing Research Intensity
Setting: Suppose that a startup company has a one-year window in which to solve an applied research and development (R&D) problem. Successfully solving the problem within a year is worth $1M; otherwise, the company earns $0 from its effort on this problem. The company can assign employees to work on the problem. The causal relationship between the number of employees assigned and the probability of success is that each employee assigned has a 10% probability of solving the problem, independently of everyone else. (Allowing for teamwork and interaction among employees might be more realistic, but would not affect the points to be illustrated.) Each employee assigned to this effort costs $0.05M.
Problem: How many employees should the company assign to this R&D effort to maximize expected profit?
Solution: The R&D problem will be solved successfully unless everyone assigned to it fails to solve it. If N employees are assigned to work on it, then the probability that they succeed in solving the problem is one minus the probability that they all fail: P(success | N employees) = 1 - 0.9^N, since each independently has a 0.9 probability of failure, and therefore all N together have a 0.9^N probability of failure. The cost of assigning N employees, expressed in units of millions of dollars, is 0.05N. The expected benefit is (1 - 0.9^N)*1 million dollars. The expected profit from assigning N employees to this effort is therefore (1 - 0.9^N) - 0.05N. Clearly, if N is too large (e.g., greater than 20) then the expected profit will be negative, and if N = 0 then the expected profit is 0. Searching over this range, e.g., using the online "Wolfram Alpha extrema calculator," or simply evaluating the profit function for each integer in the range from 0 to 20, reveals that the most profitable number of employees to assign to the R&D effort is 7. For readers familiar with R, the following R script returns the answer:
profit <- c(1:20)
for (N in 1:20) {
  profit[N] <- 1 - 0.9^N - 0.05*N
}
print(which.max(profit))
[1] 7
In this example, the outcome, profit, depends on the decision variable, N = number of employees assigned, both directly through the effect of N on cost, and indirectly through the effect of N on the probability that the effort will succeed. The general formula in equation (1.2) is instantiated in this case by the specific assignments P(c | a, s) = 1 for c = s - 0.05a and P(c | a, s) = 0 otherwise, and P(s = 1 | a) = 1 - 0.9^a, where we define the state variable s to have value 1 if the R&D problem is solved successfully within a year and s = 0 otherwise, and we define the decision variable as a = N = number of employees assigned to the effort.
Example: Optimal Stopping in a Risky Production Process
Setting: Suppose that a hazardous production process or facility, such as a chemical plant, an oil rig, or an old mine, produces a profit of $10M per year while it is operating. If it fails due to a catastrophic accident during operation, this costs $50M and destroys the process. The random lifetime until such a catastrophic accident occurs is uniformly distributed between 0 and 60 years with a mean of 30 years. The process can be voluntarily closed down at any time before failure occurs.
Problem: When should the production process be voluntarily closed (if it has not yet failed) to maximize expected profit? For simplicity, ignore interest rates, acquisition and replacement costs, and discounting: assume that the goal is to maximize expected profit, where the profit is given by 10A if the process is voluntarily closed at age A and is given by 10*T - 50 if the process fails at age T before it reaches age A. More detailed and realistic objective functions can be devised, but this simple one suffices for purposes of illustration.
Solution: One way to solve this problem is to consider different decision rules, evaluate the expected reward from each one, and use optimization to find the decision rule that maximizes the specified objective function. Suppose that the decision rule adopted is to voluntarily shut down the process when and if it reaches age A years. The probability that it survives until age A without an accident is 1 - A/60, and the reward if it does so is specified to be 10A. On the other hand, if an accident terminates the process at some time before age A, which occurs with probability A/60, then the expected net reward is 10*(A/2) - 50. This is because the expected age at failure, given that failure occurs before age A, is just A/2. (Conditioning on the event of failure by age A replaces the original uniform distribution between 0 and 60 years for the failure time with a new uniform distribution between 0 and A years, having conditional mean E(T | T < A) = A/2.) The expected value of the process with the decision rule determined by A is therefore (A/60)*(10*(A/2) - 50) + (1 - A/60)*10*A. The value of A that maximizes this expected reward can be found using free online solvers such as the Wolfram Alpha Max/Min Finder widget, or can be searched for (here, to the nearest year) in R as follows: A <- 1:60; J <- (A/60)*(10*(A/2) - 50) + (1 - A/60)*10*A; print(which.max(J)). The solution is that the process should be closed when it reaches age 55 years.
A different way to solve this problem is to apply the following concepts from reliability theory and economics. Intuitively, the process should be operated as long as the expected marginal benefit from continuing for a little longer exceeds the expected marginal cost. The expected marginal benefit from additional product produced by continuing for an additional time increment of length dt is 10dt. The expected marginal cost is 50h(t)dt, where h(t) is the age-specific hazard function for failure at time t. From reliability theory, a formula for h(t) is h(t) = f(t)/(1 - F(t)), where F(t) = P(T ≤ t) is the cumulative distribution function (CDF) for the random lifetime T, i.e., the probability of failure by time t; 1 - F(t) = P(T > t) is the survivor function, i.e., the probability that the process survives until time or age t without failing; and f(t) = F'(t) is the probability density function (PDF) for the failure time. For a uniformly distributed lifetime, the hazard function h(t) increases with time. The process should be operated until the expected marginal benefit from continuing equals the expected marginal cost, 10dt = 50h(t)dt, i.e., until h(t) = 0.2. Free online calculators for various hazard functions are available, such as the one at http://reliabilityanalyticstoolkit.appspot.com/normal_distribution for normally distributed lifetimes, but for a uniform distribution between 0 and 60 years, h(t) can readily be calculated by hand: h(t) = f(t)/(1 - F(t)) = (1/60)/(1 - t/60) = 1/(60 - t). Equating this to 0.2 and solving for t yields 60 - t = 5, or t = 55 years. Hence, the process should be voluntarily closed at age 55 years if it has not failed by then. The solution in this case has the attractive and intuitive form of an instantaneous look-ahead rule (or, if decisions are made at discrete intervals, a one-step look-ahead rule) that calls for continuing an activity until the marginal costs of doing so are no longer less than the marginal benefits, and then stopping.
This works because the hazard function is increasing: if it is not worth continuing at time t, it will be even less worthwhile to continue thereafter.
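Both solution routes can be checked numerically. The following Python sketch grid-searches the expected-value formula over integer ages, and also solves the marginal condition h(t) = 0.2 in closed form; both give 55 years:

```python
def expected_value(A):
    """Expected profit ($M) if the plan is to close the process at age A."""
    # Failure before age A has probability A/60 and expected reward 10*(A/2) - 50;
    # otherwise the process survives to age A and earns 10*A.
    return (A / 60) * (10 * (A / 2) - 50) + (1 - A / 60) * 10 * A

best_age = max(range(1, 61), key=expected_value)
print(best_age)            # 55

# Marginal analysis: continue until 10 = 50*h(t), where h(t) = 1/(60 - t) for
# a uniform(0, 60) lifetime; h(t) = 0.2 gives 60 - t = 5, i.e., t = 55.
t_star = 60 - 50 / 10
print(t_star)              # 55.0
```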
In this example, the act and state are both continuous variables: when to stop the process and when the process will spontaneously fail, respectively. The conceptual framework of normal-form decision analysis still applies, but finding the best act in the way just illustrated requires using specialized concepts such as the age-specific hazard function, h(t). The problem could also be solved by brute force by trying out different decision rules, i.e., ages at which to voluntarily close the process, and using simulation to estimate the average profit from different choices and to identify the best choice. A more efficient version of this approach uses optimization algorithms to home in more quickly on the best decision: this is a key idea of simulation-optimization, discussed further later in this section.
Example: Harvesting Timber
Suppose that a commercial timber stand has a market value that grows linearly with time: if it is harvested at age t years, its value will be 0.1t million dollars. There is a 5% per year probability of a fire that destroys the current stand and resets its value to zero. At what age should the timber be harvested? If we again focus just on the trade-off between the risk and benefit of waiting to harvest, ignoring interest rates, discounting, costs, the age structure of the tree population, price changes and uncertainties, and other important realistic details, then the simplified problem consists of finding the age at which the expected marginal benefit of waiting to harvest equals the expected marginal cost. Waiting an additional time increment of length dt brings an expected marginal benefit of 0.1dt million dollars and an expected marginal cost of (0.05dt)*0.1t, which is the probability of loss due to fire (approximately 0.05dt) times the size of the loss if a fire occurs, namely the accumulated value of the timber so far, 0.1t. Equating these and solving for t yields 0.1 = 0.05*0.1*t, or t = 20 years. Thus, the optimal decision rule for the simplified problem is to grow the stand for 20 years and then harvest it if it has not yet burned down. Real-world harvesting decision rules and calculations are more complex because they must consider the various important factors that we have omitted, especially the opportunity cost of foregoing interest on sales while harvesting is postponed. However, many such renewal-reward processes, in which the process resets to its initial condition (i.e., "renews" itself) when certain random events occur (such as a forest fire, or harvesting when a prespecified stopping time is reached), can be solved by formulating an objective function expressing the average profit per unit time per cycle, i.e., per interval between renewals of the process, and then choosing the decision variables to maximize this objective function.
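As a consistency check on the marginal analysis, under the assumption of a constant 5%-per-year fire hazard the expected harvest value is the stand's value times its survival probability, and maximizing this numerically recovers the same 20-year rule. This is only a sketch, not a full renewal-reward analysis:

```python
import math

def expected_harvest_value(t):
    # Value 0.1*t ($M) is collected only if no fire has occurred by age t; with
    # a constant 5%-per-year fire hazard, the survival probability is exp(-0.05*t).
    return 0.1 * t * math.exp(-0.05 * t)

best_age = max(range(1, 101), key=expected_harvest_value)
print(best_age)    # 20
```

Differentiating 0.1*t*exp(-0.05t) and setting the derivative to zero gives 1 - 0.05t = 0, the same first-order condition as equating marginal benefit 0.1 to marginal cost 0.05*0.1t.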