Dynamic Bayesian Networks (DBNs) The DAG models we have looked at so far do not show the passage of time. Yet time is essential to causality, insofar as effects must not precede their causes. Fortunately, it is easy to incorporate time into BNs by using successive time periods, called time slices, to distinguish between the values of the same variables at different times or in different time periods. This creates a dynamic Bayesian network (DBN) that shows dependencies among the values of different random variables over time. A common notation is to time-stamp variables with subscripts to indicate periods, as in Figure 2.5. This diagram depicts three time series variables, A, B, and C, which interact with each other over time. Their values at time (i.e., within time slice) t are denoted by At, Bt, and Ct, respectively using subscript notation. The value of random variable at time t, denoted by Bt at the right of the network, depends on the concurrent value of variable A, namely At, and on the previous value of B, namely Bt-1. Dependencies are indicated by the solid dark arrows between nodes. In the special case of a deterministic dynamic model, this dependency among the values of the variables can be described by a difference equation of the form for some appropriate function f:
Bt = f(At, Bt-1).
In the more general case of probabilistic dependencies, the corresponding information is given by a CPT specifying the conditional probabilities
P(Bt = x | At = y, Bt-1 = z)
for all possible of x, y, and z. We might abbreviate this CPT as
P(Bt | At, Bt-1),
and use similarly abbreviated notation for the CPTs at other nodes:
P(At | Bt-1,Ct-1) and P(Ct | Ct-1)
Fig. 2.5 A portion of a dynamic Bayesian network (DBN)
If these CPTs are invariant over time, as might be expected for a description based on stable causal laws or mechanisms, then they can be used to generate an entire multi-period network. The probability distributions for the values of the three variables in each successive period are derived from their values in the preceding period via the CPTs P(At | Bt-1, Ct-1), P(Ct | Ct-1), and P(Bt | At, Bt-1). The grey dashed arrows in Figure 2.5 suggest how the values for the two time slices shown can be linked to past and future values via these CPTs. Figure 2.6 makes these links explicit by showing a “rolled-out” version of the DBN for three periods. Netica automatically generates such rolled-out networks (which it refers to as time expansions or time-expanded versions) from more concise networks, such as Figure 2.5, that determine the probabilities of the variable values in each period from their values in previous periods.
Fig. 2.6 A rolled out version of the dynamic Bayesian network (DBN) in Figure 2.5
A further compression of notation can be accomplished by referring collectively to all of the variables in time slice t (namely, At, Bt, and Ct in Figures 2.5 and 2.6) as simply Xt. Then the DBN can be seen as providing a way to calculate the probabilities of current values of values from their past values, P(Xt | Xt-1), or, equivalently (since t is just a dummy variable indicating the time stamp), P(Xt+1, | Xt). This can be viewed as the state transition probability matrix for a Markov chain. One use of DBNs is to store efficiently prior probabilities for the initial state and CPTs representing state transition probabilities for Markov chains. These can then be used to compute probabilities of future states. (The calculation is given by the famous Chapman-Kolmogorov equation: if P is the one-step state transition probability matrix with the number in its row i and column j giving the probability that the next state will be state j if the current state is state i, then the element in the same position, row i and column j, of Pn is the probability that the system will be in state j after n transitions, starting from state i. Equivalently, the law of exponents P(m + n)= PmPn for the transition matrix can be interpreted as showing that the probability of passing from one state to another in m + n steps is found by summing the probabilities of getting to each possible intermediate state in m steps and then from there to the target state in the remaining n steps. If the starting state in uncertain and uncertainty is described by a prior probability distribution over states, then pre-multiplying this probability vector by the n-th power of P will give the probability distribution for the state after n transitions.) Often, current values depend on lagged values that go back several periods, violating the Markov assumption that probabilities of next states are conditionally independent of the past given the current state. Then a useful trick (called “state augmentation”) is to stack up all of the variable values needed to compute values of variables in the next time slice (time t + 1) into one long list, again denoted by Xt even if it includes some values of variables from periods before t. By construction, Xt contains all of the information needed to determine P(Xt+1, | Xt), so the Markov model holds for the augmented state and the large body of well-developed techniques for Markov chains can be applied, including predicting future state probabilities over different forecast horizons via the Chapman-Kolmogorov equation. DBNs can also be used to represent and draw inferences for a wide variety of other uncertain systems, including hidden Markov models (HMMs), multiple time series, and partially observable Markov decision processes (POMDPs).
Many commercial and free software packages are now available for BN and DBN modeling. The Belief and Decision Networks tool at www.aispace.org/mainTools.shtml provides a useful tutorial and software for learning to use BNs. The commercial Bayes Server™ software (www.bayesserver.com) provides a powerful modeling environment for both BNs and DBNs, with support for modeling of multiple time series, latent variables, and both discrete and continuous variables. Free interactive demos (www.bayesserver.com/Live.aspx) allow users to interactively enter findings into BNs and see how probabilities at other nodes adjust. Other software tools for learning BNs and DBNs from data and using them to draw inferences are listed by the Australasian Bayesian Network Modeling Society at http://abnms.org/forum/viewtopic.php?f=7&t=11515. They include BayesiaLab, CaMML for causal Bayesian networks, Genie, Hugin, JavaBayes, Tetrad, and WinBUGs (implementing Gibbs sampling-based inference) as well as the Analytica and Netica software introduced in Chapter 1 and this chapter.
Causal Risk Models Equivalent to BNs Understanding Bayesian network notation and capabilities yields multiple dividends. Key analytic techniques of system reliability engineering, dynamic systems modeling and prediction, probabilistic risk analysis, and quantitative decision analysis can be viewed as special cases of BNs, thereby gaining in clarity, simplicity, and generality. This section examines how other modeling methods used to clarify causality, uncertainty, and risk can be mapped to the framework of causal BNs.
Fault Tree Analysis One way to envision, predict, and mitigate risks in complex reliability systems is to imagine that some undesirable event, such as catastrophic failure of a system (e.g., loss of power in an airplane, collision of ships or of trains, core breach in a nuclear power plant, or theft of credit card data) occurs, and then to inquire systematically howit might have occurred, how probable or frequent are the conjunctions of conditions that suffice to cause it, and what could have been done to prevent it. The necessary thinking can be organized via a fault tree. In a fault tree, nodes of the tree represent events. The undesirable event of interest is called the top event. It is conventionally drawn at the top of the tree, which is then expanded downward, as in Figure 2.7. Below the top event are displayed logical combinations of other events that suffice to cause it. This can often be done conveniently using AND gates to show conjunctions of events that suffice to cause the top event, and OR gates to show alternative ways (i.e., alternative conjunctions of events) that suffice to cause it. For example, if a computer monitor fails to light up when the power switch is turned to on, one possible explanation is that no power is reaching the computer; a different possibility is that the monitor is broken. A fault tree would include both possibilities, depending from an OR gate below the event that the monitor does not light up when switched on. The rest of the tree is developed recursively, by successively developing possible explanations for each sub-event in terms of logical combinations of lower-level events that could cause them. For example, a failure of power to reach the computer could occur if it is not plugged in, or if there has been a power failure, or if there is a fault in the power cord, or if a circuit breaker has been tripped; these would be sub-events below the event of power not reaching the computer when it is switched on.
Fig. 2.7 Schematic sketch of a small fault tree. The top event is failure of Subsystem A.” Basic events are numbered 1-8. AND gates have flat bottoms, OR gates have curved bottoms.
The process of expanding events into possible explanations (via alternative conjunctions of predecessor events that jointly suffice to cause them) stops when all of the remaining unexplained events in the model (the “leaf nodes” at the bottom of a downward-growing fault tree) are “basic events.” These are events that are not explained or described further. Their occurrence probabilities or rates are estimated directly from data. For example, in Figure 2.7, there are 8 basic events at the bottom of the tree, numbered 1-8. The logic of the tree shows that the top event occurs if basic events 3-5 all occur, or if both 7 and 8 occur, or if any of the other basic events 1, 2, or 6, occurs. The probability or frequency (occurrences per year) of the top event can now be calculated from the probabilities or frequencies of the basic events using rules of probability: the probability of a conjunction of independent events is the product of their probabilities, and the probability of a disjunction is given by a series (the inclusion-exclusion series) that can be approximated by the sum of their probabilities if all probabilities are sufficiently small. Such rules allow the probabilities of the basic events to be pushed up the tree by resolving (i.e., calculating the probability of) each AND and OR gate as soon as the probabilities of all of its children have been calculated, beginning with the parents of the leaf nodes. The probability or frequency of the top event is obtained by resolving that node once all of its children have been resolved.
Fault tree analysis (FTA) can be extended much further by allowing other types of logic gates; optimizing combinatorial algorithms for calculating approximate probabilities; identifying minimal cut sets and path sets that block or imply occurrence of the top event, respectively; speeding computations via Monte Carlo simulation; allowing for (possibly unreliable) repair of failed components; allowing for dynamic logic gates (e.g., C occurs if A occurs and B occurs after A); showing unavailability of systems over time; calculating various importance metrics for components and subsystems, and so forth. Resources for learning more about FTA include the following:
For basic concepts and methods of fault tree analysis, see the Fault Tree Handbook (NUREG-0492), available for download from the U.S. Nuclear Regulatory Commission (www.nrc.gov/reading-rm/doc-collections/nuregs/staff/sr0492/)
A free on-line FTA program at www.fault-tree-analysis-software.com/fault-tree-analysis can be used to experiment interactively with a fault tree solver, e.g., by varying the failure rates for basic events and seeing how the failure rate of the top event changes.
A dynamic fault tree solver is available at http://fmt.ewi.utwente.nl/puptol/dftcalc/. Fault trees are entered via a simple standard syntax (known as the Galileo format after an earlier dynamic fault tree program (Dugan, 2000)). Markov chains model component failures over time as the dynamic basic events of the fauilt tree. A graphic user interface provides easy access to calculations and reports.
Example: A Dynamic Fault Tree Calculator Figure 2.8 illustrates one output from the DFTCalc Web-Tool dynamic fault tree calculator (http://fmt.ewi.utwente.nl/puptol/dftcalc/). The fault tree model is specified via the following Galileo commands:
"A" 2of3 "B" "C" "D";
"B" lambda=0.1 dorm=0;
"C" lambda=0.2 dorm=0;
"D" lambda=0.3 dorm=0;
The first command specifies the top event as A. The second line gives the reliability model, which is that A fails if at least 2 of B, C, and D fail (concisely indicated vs the “2of3 syntax). The remaining lines specify the failure rates for B, C, and D as 0.1, 0.2. and 0.3 expected failures per unit time, respectively. (Lambda is the conventional designator for a failure rate. There is no dormant period before the components come on line and are susceptible to failure.) The output in Figure 2.8 shows the growth of unreliability (i.e., probability occurrence of the top event) over time and the mean time to failure, which text output identifies as 4.5 time units. The program can be used to study how these outputs change as the failure rates of the basic events B, C, and D are changed, or if their states (working or failed) are specified as assumptions.
Fig. 2.8 Example output from DFTCalc Web-Tool dynamic fault tree calculator showing increase in probability of top event over time and mean time to failure (MTTF = 4.5 time units, indicated by *)
Source: DFTCalc Web-Tool dynamic fault tree calculator, http://fmt.ewi.utwente.nl/puptol/dftcalc/
Despite its many important successes in a variety of practical applications, FTA has important limitations that can be overcome by re-expressing the fault tree model as a BN. Among them are the following (Bobbio et al., 2001):
Flexibility of event descriptions: Events in classical FTA are binary: components can switch from fully on (working) to fully off (failed), but intermediate degrees of degraded performance are not allowed. This assumption is relaxed in some generalizations of fault tree analysis that allow for continuous state performance indicators, but most FTA codes are developed for binary logic gates. By contrast, BNs allows binary, unordered or ordered categorical, and continuous random variables (although many BN software implementations discretize arbitrary continuous random variables, e.g., into 10 deciles). This flexibility lets BNs handle degrees of performance very easily.
Flexibility and generality of failure propagation logic: FTA is built around logic gates, especially AND and OR gates, although other types of logic gates (including negation and exclusive-or gates) are allowed in some FTA software. Logic gates can easily be modeled as special deterministic conditional probability tables (CPTs). The conjunctive logic gate “C IF A AND B”is equivalent to a CPT with values P(C | A, B) = A*B where A, B, and C are Boolean variables (with possible values 1 = True and 0 = False, indicating occurrence or non-occurrence of the events, respectively). Likewise, “C IF A OR B”is equivalent to P(C | A, B) = A + B - A*B. But CPTs can equally easily express many other dependencies, stochastic as well as deterministic. For example, “C occurs in a specified time interval with probability 0.2 if A and not B; C occurs with probability 0.5 if B and not A; and C occurs with probability 0.8 if both A and B” can be represented by a CPT with values P(C | A, B) = 0.2A + 0.5B + 0.1AB for each possible combination of values of the 0-1 input variables A and B. Thus, a CPT is a far more general and flexible representation of dependencies among an event and its parents than is a logic gate.
Independence requirements. Basic events in FTA are often assumed to be statistically independent: the failure probability or rate of one component (represented by a basic event) does not depend on which others have failed. BNs can model dependencies among basic events via arrows linking them and CPTs describing the dependencies.
Inference: Although fault trees are ideally suited for propagating information about failure rates up from the basic events at the leaves of the tree to the top event at its root, they are not designed to support more general inferences. If it is observed that an intermediate event has occurred part way up the tree – e.g., that power has not reached a monitor after a computer is switched on – then FTA does not automatically update the probabilities of basic events or the probabilities of other intermediate events to condition on this observation. A BN solver such as Netica automatically does such updating of probabilities in light of observations.
Fortunately, any fault tree can be converted automatically to a logically equivalent BN by mapping its logic gates to corresponding CPTs and its basic events to input nodes (Bobbio et al., 2010). The BN formulation also supports useful generalizations such as those just mentioned and others, including simultaneous analysis for multiple top events and natural treatments of common cause failures. BN inference algorithms can identify the most likely explanation for occurrence of the top event (or any other subset of events, entered as assumed findings) and can compute the conditional probability that a low-level event has occurred if it is assumed that a higher-level one occurs. These capabilities enable BNs to identify the components that are most likely to be involved in various failure scenarios, helping system designers and operators understand how failures are most likely to occur and what can be done to prevent them or to reduce their risks.
Event Tree Analysis FTA’s approach of recursively expanding events in terms of combinations of lower-level events explain how they might occur reflects a style of reasoning known in artificial intelligence as backward chaining. It reasons backward from a specified target state (occurrence of the top event) to the conditions that could cause it. In the reverse direction, forward chaining begins by assuming that some specific initiating event occurs, and then reasons forward about what could happen next and how probable are different sequences of consequences. For example, if off-site power is lost at a chemical manufacturing plant, one possibility might be that on-site generators would activate and supply the power needed for continued operations or for safe shut-down. A different possibility might be that one or more of the on-site generators fails to start when needed. An event tree shows the various possibilities branching out from the initiating event. Event tree analysis (ETA) quantifies the probabilities of different outcomes that might be caused by the initiating event.
Figure 2.9a shows an example with an initiating event consisting of an explosion and three subsequent binary events in which the explosion either does or does not ignite a fire, the sprinkler system either is or is not working, and the fire alarm either is or is not activated. At the tips or leaves of the tree on the right-hand side are outcomes caused by the different possible event sequences. This example is taken from free demo software for the Event Tree Analysis Module of the RAM Commander suite of reliability and risk analysis software (www.reliability-safety-software.com/eta/). Probabilities or average annual frequencies of each outcome node in any given interval of time can be calculated by multiplying the probability of the initiating event by the conditional probabilities of each event along the path leading to the outcome, where the occurrence probability of each event is conditioned on the events that precede it. (Technically, probabilities and average annual frequencies are slightly different, with probabilities necessarily lying between 0 and 1 and average annual frequencies having units of expected occurrences per year, which in principle has no necessary upper bound. But they are numerically very close when the occurrence probability is small, e.g., 1% or less. Otherwise, the relation between them is that the probability p for occurrence of an event within T years is: p = 1 - exp-(T), where denotes the average annual frequency. In reliability theory, is also known as the failure rate or the intensity for the event.)
Fig. 2.9a An event tree with 4 events and four possible distinguished outcomes.
Figure 2.9b shows an example of an event tree for the dose of radiation received by a human operator if a drum of radioactive material is dropped and releases radioactive material. The dose received depends on the nature of the material (high activity or not), whether ventilation is working at the time, and on how quickly the operator evacuates, which in turn depends on whether an alarm is working at the time. It is clear that dichotomizing all descriptions is a drastic simplification. In reality, the drum might spill different amounts and might do so at different locations within the cell being considered. The operator’s time to evacuate is a continuous random variable rather than a binary one (more or less than 20 seconds), as in the even tree. But even such coarse-grained descriptions are useful in envisioning what might happen in the event that a drum is dropped, and for understanding how changing the conditional probabilities of some events (e.g., by installing a more reliable alarm) changes the frequency distribution of outcomes.
Fig. 2.9b An event tree for dose of radiation received by an operator if a dropped drum releases radioactive materials
Source: Logan Fault and Event Tree Analysis software, http://loganfta.com/index_files/enlarged_image3.html
Like fault trees, event trees can be mapped automatically to equivalent BNs (Bearfield and Marsh, 2005). Events and outcomes in the tree become nodes in the BN, with each node having arrows directed into it from the event nodes on which it directly depends. The CPTs for these nodes are obtained from the conditional probabilities in the event tree. Nodes can readily be generalized to allow more than two values, so that ordered-categorical or continuous descriptions of spill sizes, for example, become practical and easy to include in modeling. BN algorithms can then be used to quantify the conditional probabilities of outcomes given observations or assumptions about predecessor events; conversely, given an assumed outcome, standard BN calculations can solve for the most likely explanation in terms of conjunctions of preceding events that could cause it. Thus, the BN formulation combines the advantages of forward chaining reasoning to predict the probabilities of undesired outcomes and backward chaining reasoning to explain them, and hence perhaps help understand how to prevent them.