Table 3.1 Means and frequency counts (N) for individual responses to different questions on the BRFSS survey, for 2008-2012. Responses are broken down by states (rows).
Similarly, not every respondent answered all questions, and there is no guarantee that responses can be extrapolated from those who did to those who did not. Hence, we only consider questions that were answered by almost all of the 228,369 respondents. For the variables in Table 3.1, for example, over 95% of surveyed individuals answered each question.
The BRFSS data consist primarily of either dichotomous (yes-no) variables such as those in Table 3.1, all of which are coded as binary (0-1) variables with 0 = no, 1 = yes; or categorical variables, including age (50-99 years), income, education, and marital status. To these we added the two continuous pollution variables obtained from EPA: average daily O3 concentration in ppm and average daily PM2.5 concentration in micrograms of fine particulate matter per cubic meter of air. Table 3.2 lists the complete set of variables analyzed (other than survey year, month, and location) and their means and minimum and maximum values, as well as the number of individuals responding to each question. Table 3.3 shows the layout of the data (the first 21 of 228,369 records) for individual respondents. Ozone measurements were not available for the county (Apache County, AZ), year, and month of the survey (January, 2010) for these 21 individuals. The entire data set is available from the author upon request.
Table 3.2 Variables, number of records with complete data for each question, and mean, minimum, and maximum values
Table 3.3 Layout of the data, showing values of variables for the first 21 individual respondents. Each column represents a variable, and each row contains the data for one respondent. The variables have various scale types. Sex (0 = female, 1 = male), smoking (0 = no, 1 = yes), and the health outcome indicators for asthma, heart attack, and stroke (0 = no, 1 = yes) are binary variables. Income and Education are ordered categorical variables, Marital Status is a nominal variable (see text for details), and PM2.5 and O3 are continuous.
In Table 3.3, the three categorical variables Income, Education, and Marital Status have integer values for responses of 1-8, 1-6, and 1-6, respectively, with higher numerical values representing higher levels for Income and Education. Smoking is a binary variable that indicates whether a respondent reports having smoked at least 100 cigarettes (5 packs) during his or her life to date. The dependent variables Asthma Ever, Heart Attack Ever, and Stroke Ever are answers to the question of whether a doctor, nurse, or other health professional had ever told the respondent that s/he had the corresponding condition, with answers coded as 1 for yes, 0 for no, and blank (missing) for all other values.
Methods and Analytic Strategy
Since most of the variables in this data set other than age, PM2.5, and O3 are dichotomous or categorical, it is useful to examine associations and interactions among them using interaction plots that show how the mean value of one variable varies with the levels of one or more others. The following sections plot the main dependent variables of interest (prevalence of self-reported asthma, stroke, or heart attack) against explanatory variables such as age, income, sex, and average concentrations of O3 and PM2.5 in the counties where respondents lived at the time of the survey. Traditional 95% confidence intervals (mean plus or minus 1.96 standard errors of the mean) are indicated visually as vertical bars around the mean values shown in the interaction plots. Such exploratory data analysis can reveal nonlinear patterns of association and does not require any parametric modeling assumptions. However, interaction plots are most useful for examining the relations among only a few explanatory variables and the dependent variables. We also used multiple logistic regression models to quantify associations between multiple explanatory variables and health effects, and used a non-parametric Bayesian network (BN) learning program (the bnlearn package in R) to discover and visualize statistical dependence relations (represented by arrows between variables) and conditional independence relations (represented by a lack of arrows between variables) among all variables simultaneously.
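As an illustration of the quantities plotted in an interaction plot, the following minimal Python sketch computes the mean of a 0-1 outcome and an approximate 95% confidence interval for each combination of factor levels. It is not the software used for the figures in this chapter. The function name, the field names, and the toy records are hypothetical, and the interval half-width is assumed to be 1.96 standard errors of the mean (normal approximation).

```python
import math
from collections import defaultdict

def group_means_with_ci(records, outcome, factors, z=1.96):
    """Return {factor-level tuple: (mean, lower, upper, n)} for one outcome."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(outcome)
        if value is None:              # blank (missing) answers are skipped
            continue
        groups[tuple(rec[f] for f in factors)].append(value)

    summary = {}
    for key, values in groups.items():
        n = len(values)
        mean = sum(values) / n
        # Sample variance (n - 1 denominator); set to 0 when n == 1.
        var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
        half_width = z * math.sqrt(var / n)    # z * standard error of the mean
        summary[key] = (mean, mean - half_width, mean + half_width, n)
    return summary

# Hypothetical toy records (NOT BRFSS data).
toy = [
    {"sex": 0, "age_decade": 60, "asthma_ever": 1},
    {"sex": 0, "age_decade": 60, "asthma_ever": 0},
    {"sex": 0, "age_decade": 60, "asthma_ever": 0},
    {"sex": 0, "age_decade": 60, "asthma_ever": 0},
    {"sex": 1, "age_decade": 60, "asthma_ever": 0},
    {"sex": 1, "age_decade": 60, "asthma_ever": None},
    {"sex": 1, "age_decade": 60, "asthma_ever": 1},
]
summary = group_means_with_ci(toy, "asthma_ever", ["sex", "age_decade"])
for key, (mean, lo, hi, n) in sorted(summary.items()):
    print(key, round(mean, 3), round(lo, 3), round(hi, 3), n)
```

With very small groups, as in this toy example, the normal approximation is poor and the interval can extend below 0 or above 1; the groups in the survey data are far larger.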
Potential causal relations in observational data can be clarified using modern nonparametric methods. Many top-performing methods in recent competitions that evaluate the empirical performance of causal discovery and inference algorithms on suites of test problems (e.g., Hill, 2016; NIPS, 2013) use the following ideas, as discussed in more detail in Chapter 2:
Information principle: Causes provide information about their effects that helps to predict them and that cannot be obtained from other variables. This principle creates a bridge between well-developed statistical and machine learning methods for identifying informative variables that improve prediction of dependent variables (such as health effects), on the one hand, and the needs of causal inference, on the other (Frey et al., 2003; Aliferis et al., 2010). Only variables that help to predict an effect by providing information that is not redundant with that from other variables (e.g., measured confounders) are candidates to be its causes. This constraint allows techniques of predictive analytics to be applied as screens for potential causation (Pearl, 2010).
Nonparametric analyses. Multivariate non-parametric methods, most commonly classification and regression tree (CART) algorithms, can be used to identify and quantify information dependencies among variables without having to make any parametric modeling assumptions (e.g., Frey et al., 2003; Halliday et al., 2016). Conversely, if no significant change occurs in the conditional empirical cumulative distribution function of a dependent variable as the value of an explanatory variable varies, for any combination of values of the remaining variables (so that the explanatory variable does not appear in CART trees for the dependent variable), then this lack of dependence does not support a conclusion that the explanatory variable is a cause of the dependent variable. The dependent variable is then said to be conditionally independent of the explanatory variable, given the values of other variables. Effects are not conditionally independent of their direct causes. CART trees can also be used to test for conditional independence, with the dependent variable being conditionally independent of variables not in the tree, given the variables that are in it, at least as far as the tree-growing algorithm can discover (Frey et al., 2003; Aliferis et al., 2010).
Model ensembles. Rather than relying on any single statistical model, the top-performing causal analytics algorithms typically fit hundreds of nonparametric models (e.g., CART trees), called model ensembles, to randomly generated subsets of the data (Furqan and Siyal, 2016). Averaging the resulting predictions of how the dependent variable depends on other variables over an ensemble of models usually yields better estimates with lower bias and error variance than any single predictive model.
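The model-ensemble idea can be sketched in a few lines of Python. The example below is a deliberately simplified illustration, not the algorithm used in this chapter: each "tree" is a one-split regression stump on a single explanatory variable, each stump is fit to a bootstrap sample (sampling with replacement, one assumed way of generating random subsets), and the ensemble prediction is the average over stumps. The function names and the toy data are hypothetical. A full Random Forest additionally grows deeper trees and samples candidate variables at each split.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:             # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                           # all x identical: predict the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_bagged_stumps(xs, ys, n_models=200, seed=0):
    """Average the predictions of stumps fit to bootstrap samples of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

# Hypothetical toy data: the outcome rises smoothly with x.
xs = list(range(10))
ys = [0.1 * x for x in xs]
ensemble = fit_bagged_stumps(xs, ys)
print([round(ensemble(x), 3) for x in (0, 3, 6, 9)])
```

Each individual stump predicts only two distinct values, whereas the averaged ensemble produces a graded response, which is the sense in which averaging over many simple models can track a smooth dependence more closely than any one of them.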
Our analytic plan for clarifying potential causal relations is as follows:
Identify statistical dependencies and conditional independence relations among variables in Table 3.1 via nonparametric methods (described below). This step screens for possible causal relations using the information principle that variable X is a potential cause of variable Y only if X provides information that helps to predict Y and that cannot be obtained from other sources (Pearl, 2010). An arrow between two variables in a DAG model shows that one is informative about the other (see Figure 3.10). To facilitate simple interpretations, we also provide partial correlation coefficients and their significance levels for every pair of variables.
Quantify the association between PM2.5 and adverse health outcomes using the Random Forest machine learning algorithm (i.e., an ensemble of nonparametric regression trees fit to different random subsets of the data) to correct for the observed values of all other variables, and compare the results to those from parametric regression modeling. This step simply quantifies the dependence (if any) between two variables, without assessing whether it is causal, taking into account model uncertainty by refusing to commit to any single model or parametric class of models. We carry it out using a partial dependence plot generated by the randomForest package in R, which averages the results of hundreds of regression trees fit to random subsets of the data to obtain a non-parametric estimate of how asthma varies as PM2.5 is swept over its full range of values. (The randomForest package documentation contains details.)
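The logic of a partial dependence plot can be illustrated independently of any particular model. In the minimal Python sketch below, the variable of interest is set to each value on a grid for every record, all other variables are left at their observed values, and the model's predictions are averaged. This is a generic illustration, not the randomForest implementation. The function names, the records, and in particular the stand-in prediction function are hypothetical and are not estimates from the BRFSS data; in practice the predictor would be a fitted model such as a random forest.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over `rows` with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy, so the observed record is untouched
            modified[feature] = value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

def toy_predict(row):
    """Hypothetical stand-in for a fitted model (NOT fitted to any real data)."""
    risk = 0.10 + 0.05 * row["smoking"] - 0.01 * (row["income"] - 1)
    return risk + 0.002 * row["pm25"]

# Hypothetical records: smoking (0/1), income code (1-8), PM2.5 concentration.
rows = [
    {"smoking": 0, "income": 2, "pm25": 6.0},
    {"smoking": 1, "income": 2, "pm25": 9.0},
    {"smoking": 0, "income": 8, "pm25": 12.0},
    {"smoking": 1, "income": 8, "pm25": 7.5},
]
curve = partial_dependence(toy_predict, rows, "pm25", [5.0, 10.0, 14.0])
for value, avg in curve:
    print(value, round(avg, 4))
```

Because the other variables keep their observed joint distribution, the resulting curve describes how the average prediction changes with PM2.5 after adjusting for them; as noted in the text, it quantifies dependence without establishing that the dependence is causal.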
To facilitate easy replication and interpretation by other investigators without requiring skill in R, we accessed these packages and displayed the results using the Causal Analysis Toolkit (CAT) introduced in Chapter 2. The methods are described in more detail in the online documentation for the corresponding R packages. Figures 3.1-3.9 were generated by the Statistica commercial software package, and all other analyses and figures were generated using R via the CAT software.
A conspicuous challenge for this data set is that respondents answered questions about whether they had ever been told that they had asthma, heart attack, or stroke, but exposure concentrations are recorded only for the specific counties that they lived in at the time of the survey and only for the years 2008-2012. This raises the possibility that people who moved residences or who were diagnosed with these health conditions when pollution conditions were quite different from those in 2008-2012 might contribute irrelevant responses that dilute any observable relation between current pollutant concentrations and health effects. We meet this challenge by showing that the relative ranking of counties by pollution levels is fairly stable over the five years of the study and is significantly associated with health effects; thus, although such dilution of associations probably occurs, there appears to be sufficient signal in this large data set to overcome the noise.
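One common way to quantify how stable a ranking is across years is the Spearman rank correlation between county-level average concentrations in two years. The minimal Python sketch below illustrates the calculation. The choice of Spearman correlation is an assumption made for illustration rather than a statement of the method used in this chapter, and the function names and county values are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values given the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical annual mean PM2.5 values for five counties (NOT EPA data).
pm25_2008 = [6.1, 8.4, 10.2, 12.9, 7.3]
pm25_2012 = [5.8, 9.6, 9.5, 12.0, 7.0]
rho = spearman(pm25_2008, pm25_2012)
print(round(rho, 3))
```

A value of rho near 1 indicates that counties that were relatively polluted in the first year remained relatively polluted in the later year, even if absolute concentrations changed.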
Results
Dependence of Health Effects on Age and Sex
Figure 3.1 presents three interaction plots describing how risk of ever having had a heart attack (upper left), stroke (lower left), and asthma (upper right) vary with age and sex. The horizontal axis shows age categories; these represent ages rounded to the nearest decade (e.g., ages 65-75 are rounded to 70; ages 75-85 are rounded to 80, and so forth). The fractions of men (red curves with square data points) and women (blue curves with round data points) that report ever being told by a medical professional that they have each health condition are plotted on the vertical axes. Separate curves are shown for men and women. A 95% confidence interval (vertical bar) is shown around each data point.
Fig. 3.1. Fraction of respondents reporting ever having had a heart attack (upper left panel), asthma (right panel) or stroke (lower panel) vs. age (horizontal axes, age in years rounded to the nearest decade). Vertical bars are 95% confidence intervals. Men and women aged 65-95 are more likely to have had a heart attack or stroke and less likely to have had asthma than people aged 50-65. Men (blue circles) are more likely than women (red squares) to have had heart attacks and strokes, but women have higher asthma risks than men.
With the exception of the oldest age group (those over 95, rounded to 100, who may be exceptionally healthy), risks of ever having had a heart attack or a stroke increase dramatically with age, especially for men. By contrast, risk of ever having been diagnosed with asthma decreases with age and is greater for women than for men. Since the risks shown in the upper right panel of Figure 3.1 are cumulative, i.e., they represent ever having been diagnosed with asthma, they can only decrease with age if asthma risks have been increasing over time, so that younger people are more likely to have received an asthma diagnosis than older people. Given the marked effects of age and sex on disease rates, it is important to adjust for them in studying the effects of other variables on health outcomes.
Smoking Effects
Figure 3.2 shows how the fraction of female respondents reporting different health conditions varies with age and smoking status. As expected, smoking (here defined so that “No smoke” indicates fewer than 100 cigarettes (five packs) in a lifetime to date and “Smoked” indicates 100 cigarettes or more in a lifetime to date) is associated with increased risks of stroke, heart attack, and asthma at every age level. Similar effects hold for men, but heart attack risks are larger (and effects of smoking greater) and asthma risks are smaller (and effects of smoking smaller). The interaction plot for effects of smoking on age-specific asthma fraction in men is shown in the bottom right panel of Figure 3.2, beneath the corresponding diagram for women in the upper right panel. In these and subsequent analyses, people over 95 are excluded, as they are relatively few (hence have wide confidence bands) and may have exceptionally low risks of heart attack and stroke (Figure 3.1).
Fig. 3.2. Fraction of respondents reporting ever having had a heart attack (upper left panel), asthma (right panels) or stroke (lower left panel) vs. age (horizontal axes, age in years rounded to the nearest decade) for non-smokers (blue circles) and smokers (red squares). Vertical bars are 95% confidence intervals. Adverse health effects are greater among smokers than among non-smokers, for all age groups. These figures are for women except for the lower right panel, which shows asthma for male smokers and non-smokers.
Income Effects
Figure 3.3 shows striking effects of income on risks of adverse health effects. The top two panels show how age-specific fractions of women reporting different health effects vary with three different annual income levels: $10k-$15k (income code 2 on the survey); $20k-$25k (income code 4); and greater than $75k (income code 8). For both heart attacks (upper left panel) and asthma (upper right panel), risks are much smaller for respondents with high incomes than for respondents with low incomes for age groups 50-80. This income effect is attenuated at older ages for asthma, although not for heart attack (or for stroke, not shown). The bottom two panels of Figure 3.3 show all income levels across the horizontal axes and the fraction of respondents reporting all three health endpoints (heart attack, stroke, asthma) on the vertical axis, for smoking and non-smoking subpopulations (left and right sub-panels) for women and men (left and right lower panels, respectively). It is clear that the reductions in risk associated with higher income, for all three health effects, are greatest for smokers. The reduction in asthma risk associated with never smoking compared to ever smoking (i.e., fewer than 100 cigarettes in a lifetime to date vs. 100 cigarettes or more to date) is large among respondents in the lower income categories, especially for women, but is much smaller or even non-existent at the highest income levels. This is in contrast to heart attack risk, which shows elevated risk among ever-smokers compared to never-smokers at all income levels. Thus, interactions among income, smoking, sex, and age are important for asthma risk. Table 3.4 shows risks of the different health effects for various combinations of these four factors, each at only two levels. The effects of smoking on asthma (0 = never, 1 = ever) are clearest at low incomes (0.19 vs. 0.29 for never- vs. ever-smoking women aged 55-65, i.e., in age decade 60, and in income code 2, $10k-$15k per year) but are negligible in income code 8 (> $75k/yr.).
Fig. 3.3. UPPER PANELS: Fractions of women reporting ever having had a heart attack (upper left panel) or asthma (upper right panel) vs. age (horizontal axes, age in years rounded to the nearest decade) for low income (blue circles), medium income (red squares), and high income (small green circles). Vertical bars are 95% confidence intervals. LOWER PANELS: Asthma, heart attack, and stroke risks for men (lower right) and women (lower left) by income category (horizontal axes, details in text, 1 = lowest income, 8 = highest income) and smoking status (left half of panel for non-smokers, right for smokers).
Table 3.4. Health risks for different combinations of sex, age (rounded to nearest decade), income (2 = lower level, 8 = highest level), and smoking (0 = no, 1 = yes); the values of these variables are in the four left-most (shaded) columns. Each row represents one such combination. The fractions of respondents in each row reporting ever being diagnosed with asthma, heart attack, or stroke, and the total number of respondents in each row, are shown in the four right-most columns, respectively.
Effects of Education and Ethnicity
Education has a negative association with heart attack risk but a positive association with asthma risk for women (and men) at every income level, as shown in the top row of Figure 3.4. However, education is also associated with age (younger people are more likely to have graduated from college), income (college education is associated with a nearly $20k higher annual income than high school education at all ages), and smoking (women over 75 who are college graduates are more likely to be ever-smokers than those who are high school graduates; but women college graduates under 75 are less likely to be ever-smokers than those who are high school graduates, suggesting that being a college graduate used to be associated with smoking but has more recently become associated with not smoking). Thus, the effects of education on health risks are complicated by confounding due to age, smoking, and income. The bottom row of Figure 3.4 shows heart attack risks (left panel) and asthma risks (right panel) for different income levels and for high school and college graduates specifically for women never-smokers aged 55-65, thus controlling for possible confounding effects of age, smoking, and income by conditioning on specific levels for each. In the lower right diagram, college graduates are still more likely to report having been diagnosed with asthma than high school graduates at each income level, even after conditioning on age and never-smoked status. In the lower left diagram, however, there is no longer a clear effect of education on heart attack risk at all income levels. Similarly, although Hispanic ethnicity is associated with increased risk of asthma for women in different income groups, the association disappears after conditioning on age and smoking status (Hispanic women in this data set are more likely to be younger, and hence to have higher risk of asthma, than other women). Figure 3.5 illustrates this pattern. In summary, it appears that college education is associated with increased risk of reporting having been diagnosed with asthma, but Hispanic ethnicity per se is not, after conditioning on age and smoking.
Fig. 3.4. Fractions of women reporting ever having had a heart attack (left) or asthma (right) vs. income category (horizontal axes, details in text, 1 = lowest income, 8 = highest income) and education level (blue circles for high school graduates, red squares for college graduates). The bottom panels are for younger (55-65 years old) non-smoking women. Vertical bars are 95% confidence intervals. College education is associated with lower heart attack risk but higher asthma risk for women than high school education at every level of annual income.
Fig. 3.5. Fractions of women reporting ever having had asthma vs. income category (horizontal axes, details in text, 1 = lowest income, 8 = highest income) and ethnicity (blue circles for not Hispanic, red squares for Hispanic). The left panel is for all women; the right panel is for younger (55-65 years old) non-smoking women. Vertical bars are 95% confidence intervals. The association between Hispanic ethnicity and asthma for all but the lowest income level (left panel) disappears after conditioning on age and smoking status (right panel).
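The way an association can disappear after conditioning on a third variable can be reproduced with a small stratified comparison. The Python sketch below uses hypothetical toy records, constructed so that asthma depends only on age while ethnicity is correlated with age; the function and field names are likewise hypothetical. It compares mean outcomes across ethnicity first marginally and then within age strata. It is an illustration of conditioning by stratification only, and a real analysis would also need to assess statistical significance.

```python
from collections import defaultdict

def mean_by(records, y, keys):
    """Mean of y for every observed combination of the variables in `keys`."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in records:
        cell = sums[tuple(r[k] for k in keys)]
        cell[0] += r[y]
        cell[1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

def max_spread_within_strata(records, y, x, given):
    """Largest difference in mean y across levels of x within any stratum."""
    cell_means = mean_by(records, y, list(given) + [x])
    by_stratum = defaultdict(list)
    for key, m in cell_means.items():
        by_stratum[key[:-1]].append(m)     # drop x to recover the stratum
    return max(max(ms) - min(ms) for ms in by_stratum.values())

def make(hispanic, age, n, cases):
    """Build n toy records, the first `cases` of which report asthma."""
    return [{"hispanic": hispanic, "age_decade": age, "asthma": int(i < cases)}
            for i in range(n)]

# Toy data (NOT BRFSS data): asthma rate is 0.5 at age 60 and 0.25 at age 80
# regardless of ethnicity, but the Hispanic group is younger on average.
toy = (make(1, 60, 8, 4) + make(1, 80, 4, 1)
       + make(0, 60, 4, 2) + make(0, 80, 8, 2))

marginal = max_spread_within_strata(toy, "asthma", "hispanic", [])
conditional = max_spread_within_strata(toy, "asthma", "hispanic", ["age_decade"])
print(round(marginal, 4), conditional)
```

In the toy data the marginal difference between groups is positive, yet the difference within each age stratum is zero, which is the pattern described for Hispanic ethnicity and asthma in the text.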
Effects of Marital Status
Figure 3.6 shows that divorced status is associated with higher risks of asthma, heart attack, and stroke than married status. Divorce is most strongly associated with asthma for women; the association is much weaker for men (upper left panel of Figure 3.6). The effect of divorce on heart attack risks is greater than the effect on asthma for men 50-75 years old, but the effects are larger for asthma than for heart attacks (or stroke) for women.
Fig. 3.6. Fractions of men (red squares) and women (blue circles) reporting ever having had asthma (upper left), heart attack (upper right) or stroke (lower left) vs. age (main horizontal axis) and married/divorced status (left and right sides). Vertical bars are 95% confidence intervals. Divorced status is associated with higher health risks than married status for both sexes and most ages.
Effects of Fine Particulate Matter (PM2.5) and Ozone (O3) Air Pollution
Table 3.5 shows the 10th percentile, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and 90th percentile of the frequency distributions of PM2.5 and O3 for the BRFSS respondents for all 140 counties and 60 months covered in the data set, where each respondent is assigned the average daily PM2.5 and O3 values recorded by the EPA for the county of residence during the year and month of the interview. Monthly averages of daily PM2.5 concentrations ranged mainly between 5 and 14 µg/m3, and O3 concentrations ranged mainly between 0.025 and 0.05 ppm. The following figures use the codes 10 (meaning values less than or equal to the 10th percentile), 25 (between the 10th and 25th percentile), 50, 75, 90, and 100 (meaning between the 90th and 100th percentile).
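The percentile coding just described can be sketched as follows. This minimal Python example computes the five cut points from a set of concentration values and assigns each value one of the codes 10, 25, 50, 75, 90, or 100. The function names and the toy values are hypothetical, and two details are assumptions: the interpolation rule used to compute percentiles and the treatment of values that fall exactly on a cut point (here assigned to the lower code, consistent with "less than or equal to the 10th percentile" for code 10).

```python
import bisect
import statistics

CODES = [10, 25, 50, 75, 90, 100]
PERCENTS = [10, 25, 50, 75, 90]

def percentile_cuts(values):
    """Return the 10th, 25th, 50th, 75th, and 90th percentiles of `values`."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    q = statistics.quantiles(values, n=100, method="inclusive")
    return [q[p - 1] for p in PERCENTS]

def percentile_code(value, cuts):
    """Code a value by the percentile band it falls in (upper bound inclusive)."""
    return CODES[bisect.bisect_left(cuts, value)]

# Hypothetical concentrations 0, 1, ..., 100 (NOT EPA data), chosen so that
# the p-th percentile is simply p.
toy_values = [float(v) for v in range(101)]
cuts = percentile_cuts(toy_values)
print(cuts)
print([percentile_code(v, cuts) for v in (3.0, 10.0, 10.5, 60.0, 95.0)])
```

Coding a continuous exposure into percentile bands in this way allows it to be used as a factor on the horizontal axis of an interaction plot, at the cost of discarding variation within each band.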