What is the marginal probability distribution of X?
What is the marginal distribution of Y?
What is the conditional probability P(Y > 5 | X + Y < 10)?
Answers: All answers can be found by summing the joint probabilities in Table 2.1 that satisfy the specified conditions. The marginal probability distribution for a variable is found by summing all the joint probabilities corresponding to each of its values to get the total marginal probability for that value. (These sums are typically shown in the margins of the table, hence the term “marginal” distribution.)
The marginal probability distribution of X is: P(X = 1) = 0 + 0.05 + 0.1 = 0.15; P(X = 2) = 0.25 + 0.15 + 0.0 = 0.40; P(X = 3) = 0.15 + 0.0 + 0.3 = 0.45.
The marginal distribution of Y is: P(Y = 4) = 0 + 0.25 + 0.15 = 0.40; P(Y = 8) = 0.05 + 0.15 + 0.0 = 0.20; P(Y = 16) = 0.1 + 0.0 + 0.3 = 0.40.
The conditional probability P(Y > 5 | X + Y < 10) = 0.05/0.45 = 0.11.
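These sums can be reproduced in a few lines of code. The following is a Python sketch (the chapter's own scripts use R; Python is used here only for illustration) of the marginal sums above, using the joint probabilities of Table 2.1 as recovered from the calculations just shown:

```python
# Joint probabilities P(x, y) from Table 2.1, keyed by (x, y)
joint = {(1, 4): 0.00, (1, 8): 0.05, (1, 16): 0.10,
         (2, 4): 0.25, (2, 8): 0.15, (2, 16): 0.00,
         (3, 4): 0.15, (3, 8): 0.00, (3, 16): 0.30}

# Marginal of X: for each x, sum P(x, y) over all y (and symmetrically for Y)
px = {x: sum(p for (xi, yi), p in joint.items() if xi == x) for x in (1, 2, 3)}
py = {y: sum(p for (xi, yi), p in joint.items() if yi == y) for y in (4, 8, 16)}

print({x: round(v, 2) for x, v in px.items()})  # {1: 0.15, 2: 0.4, 3: 0.45}
print({y: round(v, 2) for y, v in py.items()})  # {4: 0.4, 8: 0.2, 16: 0.4}
```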
Discussion: This example illustrates a simple principle for answering any query about the probability that a condition or assertion holds, given (a) constraints reflecting observations, knowledge, or assumptions about the values of the variables on which its truth depends, and (b) a prior joint probability distribution for those variables. Let’s call a query a condition whose truth or falsity can be determined from the values of variables for which a joint distribution is known, and call data a set of observations or constraints reflecting whatever is known about the values of these variables. Examples of query conditions in the previous calculations are X + Y < 10 in part (a) and Y > 5 in part (f). Corresponding examples of data about X and Y to be used in calculating the probabilities of the query conditions are: nothing in part (a) (i.e., only the prior joint probability distribution of the variables is used), and X + Y < 10 in part (f). To calculate the probability that a query condition holds, given some data, one can simply sum the probabilities of all cells (i.e., all combinations of values for the variables) in a joint probability distribution such as Table 2.1 that satisfy both the query condition and the constraints given by the data; sum the probabilities of all cells that satisfy the constraints given by the data; and then divide the former by the latter. This corresponds to the following form of equation (2.1):
P(query | data) = P(query and data)/P(data). (2.8)
The required sums can be calculated in one pass from top to bottom of the joint probability table if it is arranged as in Table 2.2. (This is sometimes called the “long” or “molten” form of the data, as opposed to the “wide” data in Table 2.1 (Wickham, 2014).) The joint probability for each row, shown in the right-most column of Table 2.2, is included in the cumulative sum for P(query and data) if and only if that row’s values of the variables match both the query condition and the data constraint. The joint probability for the row is included in the cumulative sum for P(data) if and only if that row’s values of the variables match the data constraint, data. When all rows have been visited and both cumulative sums have been computed, their ratio P(query and data)/P(data) provides the value of P(query | data), i.e., the conditional probability that query is true given that data is true. For example, P(X > 1 | Y < 16) could be computed by this algorithm from the joint distribution in Table 2.2 as follows: P(X > 1 | Y < 16) = (0.25 + 0.15 + 0.15 + 0.0)/( 0.0 + 0.25 + 0.15 + 0.05 + 0.15 + 0.0) = 0.917.
Table 2.2 “Long data” form of data in Table 2.1

X value, x    Y value, y    Joint probability P(x, y)
1             4             0.00
1             8             0.05
1             16            0.10
2             4             0.25
2             8             0.15
2             16            0.00
3             4             0.15
3             8             0.00
3             16            0.30
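The one-pass procedure just described can be sketched in code. The following Python illustration (the chapter's own scripts use R) runs over the nine (x, y, P(x, y)) rows of the long-form table and accumulates the two sums:

```python
# Rows of the long-form table (Table 2.2): (x, y, P(x, y))
table = [
    (1, 4, 0.00), (1, 8, 0.05), (1, 16, 0.10),
    (2, 4, 0.25), (2, 8, 0.15), (2, 16, 0.00),
    (3, 4, 0.15), (3, 8, 0.00), (3, 16, 0.30),
]

def conditional(table, query, data):
    """One top-to-bottom pass: accumulate P(query and data) and P(data), then divide."""
    p_query_and_data = 0.0
    p_data = 0.0
    for x, y, p in table:
        if data(x, y):
            p_data += p
            if query(x, y):
                p_query_and_data += p
    return p_query_and_data / p_data

# P(X > 1 | Y < 16) = 0.55/0.60, as in the worked example above
p1 = conditional(table, lambda x, y: x > 1, lambda x, y: y < 16)

# P(Y > 5 | X + Y < 10) = 0.05/0.45, from part (f) earlier
p2 = conditional(table, lambda x, y: y > 5, lambda x, y: x + y < 10)

print(round(p1, 3), round(p2, 3))  # 0.917 0.111
```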
In principle, this method provides an oracle for calculating the conditional probability of any query condition given any data, including user-specified constraints or assumptions as well as observed values of variables. All that is required is the joint probability table and that the query condition and the data constraints are specified clearly enough in terms of the values of the variables to determine which rows in the joint probability table match each one. This might appear to provide a useful way to construct a probabilistic expert system for answering queries based on data. Such a query-answering system, returning answers of the form P(query | data), could potentially be useful for many applications, including the following, among very many others:
Forecasts quantifying P(future_value | history). Special cases include weather forecasts such as P(rain_tomorrow | observations) or hurricane predictions, agricultural forecasts, economic forecasts, and many others.
Reliability models quantifying P(system will survive for > T more years | current age and condition);
Predictive toxicology models quantifying P(adverse_response | exposure) or P(chemical is a mutagen | chemical properties);
Mineral prospecting expert systems quantifying P(ore grade | observations).
But in practice, the sheer number of combinations of different values for the different variables makes this direct approach impracticable for all but the smallest problems. For example, in a data set with only 7 continuous variables such as age, weight, income, daily maximum and minimum temperature, relative humidity, and so forth, each discretized to have only 5 levels, the joint probability table analogous to Table 2.2 would have 5^7 = 78,125 rows. If each variable had 10 levels, then there would be 10 million rows, and if there were 10 variables each with 10 levels, there would be 10 billion rows. Filling out all of the joint probabilities for such a table, or summing probabilities over all rows that match specified conditions to answer queries, would be impractical. Thus, while the basic idea is promising, a much more efficient way is needed to specify and store the joint probability distribution and to use it to compute conditional probabilities in order to create a practical query-answering technology. Bayesian networks (BNs) provide such a more efficient way.
Example: Sampling-Based Estimation of Probabilities from a Database Using R Instead of seeking to calculate conditional and marginal probabilities from a full joint probability distribution table such as Table 2.2, a more frequent task in practice is to estimate them from available data. Table 2.3 presents an example of a very small database with six records (rows) and four variables (columns other than the Record ID column). Such a database can be used to answer queries about probabilities for variable values (or for predicates and conditions based on them) for a randomly sampled record, as in the following examples.
Table 2.3 A small database
The probability that a COPD patient over 50 years old is a male smoker is 1/2;
The probability that a randomly selected person in this database is a smoker with COPD is 2/6 = 1/3.
Although such calculations are simple for a database of only 6 records, suppose that the database were much larger, e.g., with 60 million records instead of 6. Then answering queries by brute force application of equation (2.8), scanning each record and counting the number of records that satisfy various user-specified conditions, would be time consuming. A faster way to get numerically accurate approximate answers would be to randomly sample the records and then use the random samples to compute the requested probabilities.
For readers familiar with R, we illustrate exact and approximate sampling-based calculations for the data in Table 2.3; the same calculations can be applied to much larger data sets. The following R script 1 creates a data table (a data frame named “table”) from the data in Table 2.3, with M and F recoded as 1 and 0, respectively, and with Yes and No recoded as 1 and 0, respectively:
# R script 1: Create data table, store as a data frame named “table”
The template is the same for each of these: to calculate P(query | data), just count the number of records (using the “nrow” command) satisfying the query and data conditions (specified using “==” to show that a variable has a certain specific value, or using comparison operators such as “>” to select ranges of values, and using “&” between conjoined conditions). Then divide by the number of records satisfying the data conditions. The ratio is the desired conditional probability, P(query | data).
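As an illustration of this template, here is a sketch in Python (the text's own scripts use R; Python is used here only for illustration). Note that records 3 and 5 of Table 2.3 are not reproduced in this excerpt, so the values used for them below are hypothetical placeholders, chosen only to be consistent with the answers 1/2 and 1/3 computed above:

```python
# Stand-in for Table 2.3. Records 1, 2, 4, and 6 match the sample printout
# shown later in this section; records 3 and 5 are hypothetical placeholders.
records = [
    {"Gender": 1, "Age": 31, "Smoker": 1, "COPD": 0},  # record 1
    {"Gender": 0, "Age": 41, "Smoker": 0, "COPD": 0},  # record 2
    {"Gender": 0, "Age": 55, "Smoker": 1, "COPD": 1},  # record 3 (assumed)
    {"Gender": 1, "Age": 26, "Smoker": 0, "COPD": 0},  # record 4
    {"Gender": 1, "Age": 62, "Smoker": 0, "COPD": 0},  # record 5 (assumed)
    {"Gender": 1, "Age": 58, "Smoker": 1, "COPD": 1},  # record 6
]

def count(rows, cond):
    """Analogue of R's nrow() applied to a logically subsetted data frame."""
    return sum(1 for r in rows if cond(r))

# P(male smoker | COPD and age > 50): divide the count matching query-and-data
# by the count matching the data condition alone
in_data = lambda r: r["COPD"] == 1 and r["Age"] > 50
in_query_and_data = lambda r: in_data(r) and r["Gender"] == 1 and r["Smoker"] == 1
p_male_smoker = count(records, in_query_and_data) / count(records, in_data)

# P(smoker and COPD) for a randomly selected record: no data condition here,
# so the denominator is the whole database
p_smoker_copd = count(records, lambda r: r["Smoker"] == 1 and r["COPD"] == 1) / len(records)

print(p_male_smoker, round(p_smoker_copd, 3))  # 0.5 0.333
```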
Running R script 2 in R Studio produces the following output, in agreement with the previous manual calculations:
The following objects are masked from table (pos = 3):
Such calculations run quite quickly on data sets of several thousand rows and several hundred columns.
Almost the same script can be used to answer questions by sampling rows instead of exhaustively counting how many satisfy the specified conditions. The only change is that instead of creating a table that holds all the data, we create one that holds a random sample of the data. Randomly sampling 100,000 rows using simple random sampling with replacement suffices to estimate probabilities to two significant digits. This can be done by running R script 3.
# R script 3: Answer questions using random sample of size 100,000
The first few records in this sample are as follows:
Gender Age Smoker COPD
6 1 58 1 1
6.1 1 58 1 1
4 1 26 0 0
2 0 41 0 0
1 1 31 1 0
6.2 1 58 1 1
(Note that record numbers such as 6.1 and 6.2 indicate repeated samples of the same record from the original data set – record 6, in this case.) Overwriting the original data table with the newly created sample using the assignment command “table <- sample” (an admittedly sloppy practice, as one should in general avoid overwriting original data sources, but adequate for purposes of illustration) and then rerunning R script 2 produces the following output:
The following objects are masked from table (pos = 3):
These sampling-based answers are obtained quickly (less than a second to execute R script 2 with 100,000 records in the sample) and they are sufficiently accurate approximations for many purposes. Even if the original data table had contained many millions of records, a simple random sample of size 100,000 records would suffice to answer such questions with this level of speed and accuracy.
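The sampling idea can be sketched compactly in Python (the text's own scripts use R, whose sample() function with replace = TRUE plays the same role). As before, records 3 and 5 of Table 2.3 are not shown in this excerpt, so hypothetical placeholder values consistent with the exact answers are used:

```python
import random

# Hypothetical stand-in for the Table 2.3 data (records 3 and 5 are assumed,
# chosen only to be consistent with the exact answers computed earlier)
records = [
    {"Gender": 1, "Age": 31, "Smoker": 1, "COPD": 0},  # record 1
    {"Gender": 0, "Age": 41, "Smoker": 0, "COPD": 0},  # record 2
    {"Gender": 0, "Age": 55, "Smoker": 1, "COPD": 1},  # record 3 (assumed)
    {"Gender": 1, "Age": 26, "Smoker": 0, "COPD": 0},  # record 4
    {"Gender": 1, "Age": 62, "Smoker": 0, "COPD": 0},  # record 5 (assumed)
    {"Gender": 1, "Age": 58, "Smoker": 1, "COPD": 1},  # record 6
]

random.seed(12345)  # fixed seed so the run is reproducible
# Simple random sampling with replacement, as in R's sample(..., replace = TRUE)
sample_rows = [random.choice(records) for _ in range(100_000)]

# Estimate P(smoker and COPD); the exact value is 2/6 = 1/3
estimate = sum(r["Smoker"] == 1 and r["COPD"] == 1
               for r in sample_rows) / len(sample_rows)
print(round(estimate, 2))  # should be near 1/3
```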
This example illustrates how to use random sampling of records to answer questions about the conditional probability that a target condition holds for a randomly selected record in a data set, given some other observed or assumed conditions (with all conditions expressed as constraints on values of the variables). The same idea can be used if the data-generating process is well enough understood so that it can be simulated. Each simulation run produces, in effect, a synthetic record for a potential database consisting of sample values drawn from the joint distribution of the variables. Creating a large number of runs provides a sample that can be used to make probability calculations and answer questions using the techniques just illustrated.
Example: Bayes’ Rule for Disease Diagnosis Setting: Suppose that 1% of the population using a medical clinic have a disease such as HIV, hepatitis, or colon cancer. A medical test is available to help detect the disease. Its statistical performance characteristics are as follows:
P(test is positive | disease is present) = 1 - probability of false negative = 0.99.
P(test is positive | disease is not present) = probability of false positive = 0.02.
Thus, the probability of a false positive is 2% and the probability of a false negative is 1%.
Problem: Suppose that a randomly selected member of the population is given the medical test and the result is positive. Then what is the probability that the individual has the disease?
Solution via Bayes’ Rule: Before continuing, the reader may wish to make an intuitive judgment about the probability that the subject has the disease, given the above information (i.e., 1% base rate of the disease in the population, sensitivity of test = 99%, specificity of test = 98%, where sensitivity = true positive rate and specificity = true negative rate). Most people judge that the posterior probability of the disease, given a positive test result, is quite high, since the test has high sensitivity and specificity. The quantitative answer can be calculated via Bayes’ Rule (equation (2.7)), which is repeated here for convenience:
P(x | y) = P(y | x)P(x)/Σx’ P(y | x’)P(x’) (2.7)
Applying this rule to the disease data yields the following:
P(disease | test positive)
= 0.99*0.01/(0.99*0.01 + 0.02*0.99) = 1/3.
Discussion: The posterior probability of disease, given a positive test result, is only 1/3, despite the high sensitivity and specificity of the diagnostic test, because of the low base rate of the disease (only 1%) in this population. Intuitive judgments often neglect the importance of the base rate, and hence over-estimate the posterior probability of disease given the test results.
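The arithmetic can be checked with a few lines of code; here is a Python sketch (the chapter's own scripts use R):

```python
# Inputs from the example: 1% base rate, 99% sensitivity, 98% specificity
p_disease = 0.01
p_pos_given_disease = 0.99      # sensitivity = 1 - P(false negative)
p_pos_given_no_disease = 0.02   # 1 - specificity = P(false positive)

# Denominator of Bayes' Rule (2.7): total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))

# Posterior probability of disease given a positive test
posterior = p_pos_given_disease * p_disease / p_pos

print(round(p_pos, 4), round(posterior, 3))  # 0.0297 0.333
```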
Bayesian Network (BN) Formalism and Terminology The assertion “X causes Y” or “X is a cause of Y” can be made for different kinds of entities X and Y, including events that occur at specific moments; conditions or attribute values or states that persist over an interval of time; transition rates between states, or rates of change of variables; values of variables at specific moments; and time series, histories or sequences of values of time series variables. In the Bayesian network (BN) formalism introduced in this section, the entities between which causal relationships might hold are random variables. These are flexible enough to represent events, conditions, states, and deterministic quantities as special cases. The values of the variables can be binary indicators with values 1 if an event occurs or a condition holds and 0 otherwise. They can be names or numbers for the possible values of state variables, if states are discrete; or they can indicate values of continuous attributes, states, or uncertain quantities, or of a known, deterministic quantity if all probability mass is concentrated on a single value. The variables corresponding to the columns in a data frame or table such as Table 2.3 are usually modeled as random variables, with the specific value in a particular row (representing a record or case) and column (representing a variable) being thought of as a sample value drawn from a conditional probability distribution that may depend on the values of some or all of the other variables for that case.
A BN consists of nodes, representing the random variables, and arrows between some of the nodes, representing statistical dependencies between random variables. An input node, meaning one with no arrows pointing into it, has an unconditional probability distribution for its possible values; this is also called its marginal probability distribution, as previously noted. A node with one or more arrows pointing into it from other nodes has a conditional probability distribution for its possible values that depends on the values of the variables (nodes) pointing into it. The nodes that point into a given node are called its parents; the parents of node or variable X in a BN are often denoted by Pa(X).
When each random variable has only a few possible values, the conditional probability distributions for each combination of a node’s parents’ values can be tabulated explicitly. The resulting data structure is called a conditional probability table (CPT) for that node. If a node represents a continuous variable, then its conditional probability distribution, given the values of its parents, may be described by a regression model instead of by a table. Alternatively, modern BN software often discretizes continuous random variables, e.g., rounding age to the nearest year. A coarser discretization (e.g., five-year age buckets) can be used if none of the other variables is very sensitive to age, in the sense that finer-grained discretization does not change their conditional probability distributions significantly compared to using the coarser-grained description. Algorithms for automatically discretizing continuous variables (e.g., vector quantization or classification and regression tree (CART) algorithms based on information theory) are well developed and are incorporated into some BN software packages, but a simpler approach that is also widely applied is to use the 10 deciles of the empirical frequency distribution of each continuous variable to represent it as a discrete random variable. We shall use the term “conditional probability table” and its abbreviation “CPT” generically to refer to any representation – whether an actual table, a regression model, a CART tree, a random forest ensemble (discussed later), or some other model – that specifies the conditional probability distribution for the value of a variable given the values of its parents.
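For a concrete picture of a CPT as a data structure, the following Python sketch stores a CPT for a binary node X3 with binary parents X2 and X4; all probability values here are made up purely for illustration:

```python
# Hypothetical CPT for P(x3 | x2, x4): keys are (x2, x4) parent-value pairs,
# values are distributions over x3's levels. All numbers are illustrative.
cpt_x3 = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.7, 1: 0.3},
    (1, 0): {0: 0.4, 1: 0.6},
    (1, 1): {0: 0.2, 1: 0.8},
}

def p_x3(x3, x2, x4):
    """Look up P(x3 | x2, x4) in the table."""
    return cpt_x3[(x2, x4)][x3]

# Each row of a CPT must be a probability distribution (it sums to 1)
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in cpt_x3.values())

print(p_x3(1, x2=1, x4=0))  # 0.6
```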
The arrows in a BN are oriented to form a directed acyclic graph (DAG), i.e., a graph or network with directed arcs that do not form directed cycles, so that no variable is its own parent or its own ancestor (where an ancestor is defined recursively as a parent of a parent or a parent of an ancestor). The BN’s DAG structure represents a way to decompose the joint probability distribution of the variables into marginal probability distributions for the input nodes and conditional probabilities for other nodes. For example, the two-node BN
X → Y (2.9)
has only one input variable, X, and one output variable, Y. Input X has a marginal probability distribution P(x) and output Y has a CPT, P(y | x). Thus, the BN (2.9) decomposes the joint probability distribution for the pair of variables X and Y as shown in equation (2.3):
P(x, y) = P(x)P(y | x) (2.3)
Conversely, BN (2.10) represents the decomposition in equation (2.2):
X ← Y (2.10)
P(x, y) = P(y)P(x | y) (2.2)
Both decompositions represent the same joint probability distribution, but which is more useful depends on what can be measured and what questions are to be answered. To predict conditional probabilities of Y values from measured values of X, we would use BN (2.9). The prediction comes straight from the CPT for Y, P(y | x). If x values are not measured, but are known to be described by (or “drawn from”) a marginal probability distribution P(x), then the prediction formula for y values is given by equation (2.4):
P(y) = Σx P(y | x)P(x) (2.4)
The probability of y is just the probability-weighted average value of P(y | x), weighted by the probabilities for each x value. For example, in the previous example of disease diagnosis, the predictive probability that a randomly selected patient will have a positive test result is
P(positive test) = P(positive test | disease)P(disease) + P(positive test | no disease)P(no disease) = 0.99*0.01 + 0.02*0.99 = 0.0297.
Conversely, to assess the probability of disease given an observed test result, one would use BN (2.10) and Bayes’ rule (2.7), as already illustrated.
These ideas extend to BNs with more than two nodes. A joint distribution of n random variables can be factored by generalizing the idea that “joint = marginal x conditional” for two variables, as follows:
P(x1, x2, …, xn) = P(x1)P(x2 | x1)P(x3 | x1, x2)…P(xn | x1, x2, …, xn-1) (2.11)
To apply this to a DAG model, the variables are numbered so that each variable is listed only after all of its parents (if any) have been listed. The input nodes in the DAG appear first in the list of variables, (x1, x2, …, xn). (Such an ordering of the nodes is always possible for a DAG, and it can be produced by running a topological sort algorithm on the DAG.) The key insight that makes BNs so useful is that most of the terms in the decomposition (2.11) simplify for sparsely connected DAGs, since usually the set of parents of a node, Pa(X), on which its CPT depends is only a (possibly small) subset of all of the variables that have been listed so far. For example, the joint probability distribution P(x1, x2, x3, x4) of the variables in the DAG model X1 → X2 → X3 ← X4 is decomposed as
P(x1, x2, x3, x4) = P(x1)P(x4)P(x2 | x1)P(x3 | x2, x4) (2.12)
If each of the four variables has 2 levels, then it requires two numbers (summing to 1, so only one independent number, or degree of freedom) to specify the marginal distribution for X1; another independent number to specify the marginal distribution for X4; 2 independent numbers for the CPT P(x2 | x1), one for each level of x1; and 4 numbers for the CPT P(x3 | x2, x4), one for each combination of values for x2 and x4. Thus, a total of 8 independent numbers must be provided as input values to fully determine the joint probability distribution of the four variables using equation (2.12). By contrast, the general identity (2.11), which for n = 4 is
P(x1, x2, x3, x4) = P(x1)P(x2 | x1)P(x3 | x1, x2)P(x4 | x1, x2, x3),
requires 1 independent number to specify P(x1), 2 more for P(x2 | x1), 4 more for P(x3 | x1, x2), and 8 for P(x4 | x1, x2, x3), for a total of 15, in keeping with the fact that the joint probability distribution of n binary variables requires 2^n − 1 independent numbers to specify in the most general case – one probability number for each possible combination of values, with the constraint that the numbers must sum to 1. Thus, the DAG model represented by equation (2.12) requires about half (8/15) as many numbers to fully specify as would be needed in the worst case. For DAG models with more variables, the potential savings in specifying the joint distribution using a factored representation in which each variable’s CPT depends only on its parents can be dramatic. A fairly sparse DAG with n binary nodes and only a few parents for each node – say, no more than k parents, for some small integer k – requires fewer than n·2^k independent input numbers to specify the joint distribution by means of probability tables representing marginal distributions and CPTs, instead of 2^n − 1 numbers to list the full joint distribution table. If n = 12 and k = 3, for example, then the DAG representation requires fewer than 100 input numbers for the probability tables, compared to over 4000 for the joint distribution table.
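The parameter counts above can be reproduced mechanically; here is a Python sketch of the arithmetic:

```python
# Four binary variables with DAG X1 -> X2 -> X3 <- X4 (equation 2.12):
# one number each for the input marginals P(x1) and P(x4), two for the
# CPT P(x2 | x1), and four for the CPT P(x3 | x2, x4)
dag_params = 1 + 1 + 2 + 4          # 8 independent numbers

# Unrestricted joint distribution of n binary variables needs 2^n - 1 numbers
full_params = 2**4 - 1              # 15

# General bound for a sparse DAG: n binary nodes, each with at most k parents
def dag_bound(n, k):
    return n * 2**k

print(dag_params, full_params)          # 8 15
print(dag_bound(12, 3), 2**12 - 1)      # 96 4095
```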