If air pollution levels have enough random variation over time so that pollution levels at the time of the interview contain no useful information about past pollution levels relevant for calculating associations with health effects, then associations between contemporaneous pollution levels and self-reported health conditions would not be meaningful. However, calculating the Spearman’s rank correlation coefficients between the rankings of all 140 counties by average PM2.5 concentrations for all five years 2008-2012 reveals that all 10 of the Spearman’s rank correlations are positive (each of 5 years paired with each of 4 other years yields 20/2 = 10 associations), meaning that counties that rank relatively high in PM2.5 one year are likely to do so in other years; thus, the pollution levels at the time of the interview have the potential to be informative about levels over multi-year time spans.
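The year-to-year rank comparison described above can be sketched in R as follows. This is a minimal illustration on simulated data, not the original analysis: the matrix `pm`, the persistent county effect, and the noise level are all assumptions made for the example.

```r
# Hypothetical data: annual mean PM2.5 for 140 counties, 2008-2012.
# A persistent county-level component plus year-specific noise (both assumed).
set.seed(1)
n_counties   <- 140
county_level <- rnorm(n_counties, mean = 9, sd = 2)
pm <- sapply(2008:2012, function(yr) county_level + rnorm(n_counties, sd = 1))
colnames(pm) <- as.character(2008:2012)

# Spearman rank correlation for every pair of years
rho <- cor(pm, method = "spearman")

# Keep each unordered pair once: choose(5, 2) = 10 correlations
pair_rho <- rho[upper.tri(rho)]
length(pair_rho)        # 10
all(pair_rho > 0)       # are county rankings positively associated across years?
```

With real data, `pm` would hold one row per county and one column per year of average PM2.5 concentrations.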
Figure 3.7 plots adverse health effect prevalence against the quantiles of the PM2.5 (left side) and O3 (right side) frequency distributions for men and women ever-smokers and never-smokers. The upper row shows the fractions of individual respondents in each of these groups reporting each health condition (heart attack, stroke, asthma) at each exposure concentration quantile (corresponding concentrations are shown in Table 3.5). The lower row repeats this exercise specifically for the relatively large subpopulation of respondents aged 55-65 (age group 60, to the nearest decade) with incomes over $75k/year. Conditioning on age and income helps to control for any potential confounding effects of these variables. For PM2.5, the upper left quadrant of Figure 3.7 suggests that individual respondents from counties with relatively high ambient concentrations of PM2.5 (right ends of the asthma-vs.-PM2.5 curves) have significantly lower risks of asthma than individuals from counties with relatively low ambient concentrations (left ends of these curves), but that smokers (both men and women) have higher risks of heart attack when ambient PM2.5 is high than when it is low. Significant associations between ozone and asthma are not apparent. The interaction plots in the lower left quadrant of Figure 3.7 suggest that the positive association between PM2.5 and heart attack risk and the negative association between PM2.5 and asthma risk are not explained away by confounding by age or income.
Fig. 3.7. Fractions of respondents reporting asthma (blue circles), heart attack (red squares), and stroke (small green circles) vs. PM2.5 percentile (left side) and O3 percentile (right side) for male and female ever-smokers and never-smokers, without and with conditioning on age and income (top and bottom rows, respectively; bottom row is for people aged 55-65 in the top income category). Vertical bars represent 95% confidence intervals.
To obtain a more quantitative description of how health effects differ by PM2.5 concentration after controlling for age, sex, income, and smoking history, Table 3.6 tabulates the mean prevalence of each health condition for all combinations of each of two levels of sex, income, smoking, and PM2.5, for respondents aged 55-65 (the largest age group). While even this large data set has limited power to resolve differences when so finely partitioned, a standard binomial test rejects the null hypothesis that the proportions of respondents reporting adverse health conditions are independent of PM2.5 level in favor of the alternative hypotheses that heart attack rates are elevated at the high compared to the low level of PM2.5 (p = 0.009) for women ever-smokers with high incomes; and asthma rates are reduced at the high compared to the low level of PM2.5 for women never-smokers with high incomes (p = 0.0475). While heart attack rates are also elevated and asthma rates reduced in other groups at high compared to low PM2.5 levels – most notably, for women with low income, both ever-smokers and never-smokers – the sample sizes (N in the right-most column of Table 3.6) are too small to make these individual differences statistically significant at the conventional 0.05 level (e.g., p = 0.13 for the difference in asthma prevalence, 0.25 vs. 0.18, and p = 0.07 for the difference in heart attack prevalence, 0.05 vs. 0.01, among female never-smokers with low income). Nonetheless, the pooled data for all of the larger groups supports the conclusion that the higher PM2.5 levels are associated with increased risk of heart attack and decreased risk of asthma.
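A comparison of two prevalence proportions of the kind reported above can be sketched in R as follows. The counts are hypothetical (the prevalences 0.25 and 0.18 echo the text, but the group sizes of 100 are invented), and `prop.test` and `fisher.test` are used here as stand-ins, since the text does not say which implementation of the binomial test was used.

```r
# Hypothetical counts, NOT taken from Table 3.6: respondents reporting asthma
# in a low-PM2.5 group and a high-PM2.5 group of the same sex/income/smoking stratum.
cases <- c(low_pm = 25, high_pm = 18)
n     <- c(low_pm = 100, high_pm = 100)

# Two-sample test of equal proportions (two-sided by default;
# alternative = "greater" or "less" gives a one-sided test)
res <- prop.test(x = cases, n = n)
res$estimate   # observed prevalences
res$p.value

# Exact alternative on the 2 x 2 table of cases vs. non-cases
tab <- rbind(cases = cases, noncases = n - cases)
fisher.test(tab)$p.value
```

With groups this small, a difference of 0.25 vs. 0.18 is not significant at the 0.05 level, which illustrates why finely partitioned strata have limited power.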
Table 3.6. Comparison of health effects prevalence in respondents aged 55-65 from counties with less than 5.9 µg/m3 (PM2.5 quantile = 10) or more than 13.95 µg/m3 (PM2.5 quantile = 100) PM2.5, matched for sex, income, and smoking. The four left-most (shaded) columns show different combinations of sex, income (2 = lower level, 8 = highest level), smoking (0 = no, 1 = yes), and PM2.5 (10 = bottom 10th percentile, 100 = top 10th percentile). Each row represents one such combination. The fractions of respondents in each row reporting ever being diagnosed with asthma, heart attack, or stroke, and the total number of respondents in each row, are shown in the four right-most columns, respectively.
Quantitatively, the largest reductions in asthma risk (from 0.25 to 0.18 for never-smokers and from 0.35 to 0.28 for ever-smokers) occur among women with low incomes. By contrast, at high income levels, the corresponding changes are small to negligible (from 0.15 to 0.12 for never-smokers and from 0.13 to 0.13 for ever-smokers, respectively). Thus, high income appears to greatly reduce or eliminate the association between PM2.5 and reduced asthma risk, as well as reducing the absolute risk of asthma and the positive association of smoking with asthma. From this perspective, income appears to be of central importance not only for asthma risk, as shown in Figure 3.4, but also for modifying (specifically, reducing) the associations between other factors (smoking and PM2.5) and asthma risks.
Additional Interaction Analyses
To help visualize interactions among risk factors, we created the following dichotomous risk factors based on the preceding analyses: young = 1 for age less than 64 years (the median age), else 0 for respondents 64 or older; female = 0 for men, 1 for women; lowIncome = 1 for respondents with incomes below the median (6 on the 1-8 scale), else 0 for respondents at or above the median income; NotCollegeGrad = 1 for Education less than 6 (6 = graduated from college), else 0; divorced = 0 for married, 1 for separated or divorced; smoke = 1 for ever-smokers, 0 for never-smokers; lowPM2.5 = 1 for PM2.5 < 8.91 µg/m3 (the median level of PM2.5), else 0 for greater values of PM2.5; and lowO3 = 1 for O3 < 0.04 ppm (the median level of O3), else 0 for greater levels of O3. Figure 3.8 shows how these dichotomous risk factors (other than income) and the three health condition indicators (bottom three curves, for asthma, heart attack, and stroke) vary with income levels 1-8. Other risk factors are strongly (although not always linearly) associated with income and with each other; most extremely, the probability of being divorced or separated is about seven times greater for respondents in the lowest income levels than for respondents in the top income group (level 8, >$75k per year, constituting the top quartile of the income distribution), and is even higher for ever-smokers than for never-smokers in every income category (not shown). However, despite this very strong association between low income and high prevalence of divorce (or separation), the association between divorce and asthma is not explained away by confounding by low income. Asthma is significantly more prevalent among divorced or separated respondents than among married ones at every income level, as shown on the left side of Figure 3.9, with the effect being greatest at lower income levels.
For heart attacks, however, risk is lower among separated/divorced respondents at income levels of 4 or more (and this is true for each sex separately, although that is not shown in Figure 3.9).
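The dichotomization just described can be sketched in R as follows. The data frame `d`, its column names, and its codings are hypothetical stand-ins for the survey data, and the marital-status indicator is omitted because its original coding is not given in the text; the cut points are those stated above.

```r
# Hypothetical respondent-level data; column names and codings are assumptions.
set.seed(2)
n <- 1000
d <- data.frame(
  Age       = sample(18:90, n, replace = TRUE),
  Sex       = sample(c("M", "F"), n, replace = TRUE),
  Income    = sample(1:8, n, replace = TRUE),   # 1 = lowest, 8 = highest
  Education = sample(1:6, n, replace = TRUE),   # 6 = college graduate
  Smoking   = rbinom(n, 1, 0.45),               # 1 = ever-smoker
  PM2.5     = runif(n, 3, 20),                  # in micrograms per cubic meter
  O3        = runif(n, 0.02, 0.06)              # in ppm
)

# Dichotomous indicators using the cut points given in the text
d$young          <- as.integer(d$Age < 64)
d$female         <- as.integer(d$Sex == "F")
d$lowIncome      <- as.integer(d$Income < 6)
d$NotCollegeGrad <- as.integer(d$Education < 6)
d$lowPM2.5       <- as.integer(d$PM2.5 < 8.91)
d$lowO3          <- as.integer(d$O3 < 0.04)

# Fraction of respondents with each attribute at each income level,
# i.e., the kind of summary plotted against income group
by_income <- aggregate(
  cbind(young, female, NotCollegeGrad, Smoking, lowPM2.5, lowO3) ~ Income,
  data = d, FUN = mean
)
by_income
```

Because the simulated columns are independent, the fractions here are flat across income levels; in the survey data they vary strongly with income.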
Fig. 3.8. Fractions of respondents reporting various attributes vs. income group (1 = lowest income, 8 = highest income, details in text). All factors vary with income level. Vertical bars indicating 95% confidence intervals are very narrow, reflecting large sample sizes.
Fig. 3.9. Fractions of married (blue circles) or divorced/separated (red squares) respondents reporting asthma (left panel) or heart attack (right panel) vs. income group (1 = lowest income, 8 = highest income, details in text). Divorce or separation is positively associated with asthma risk at each income level, but is negatively associated with heart attack risk at the upper income levels.
Results of Logistic Regression Analysis
The results presented so far have not fit specific parametric models to the data. Table 3.7 shows the odds ratios for different risk factors in logistic regression models for Asthma and Heart Attack risks (developed using the R script LR <- glm(Asthma ~ young + female + lowIncome + divorced + NotCollegeGrad + Smoking + PM2.5 + O3, family=binomial(link='logit')); exp(cbind(OR = coef(LR), confint(LR))); summary(LR). By default, the lowest level of each categorical variable is used as the reference level.) These logistic regression models reinforce many of the findings from the previous interaction plots, but focus on main effects (one coefficient for each risk factor), and hence are less informative about nonlinearities and interactions among predictors.
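The model specification quoted above can be run end to end on simulated data, as in the following sketch. The data frame `d` and all coefficients used to simulate the outcome are invented for illustration. One deliberate substitution: the sketch uses Wald intervals from `confint.default` for speed and determinism, whereas the quoted script's `confint(LR)` computes profile-likelihood intervals.

```r
# Simulated stand-in for the survey data; every number here is an assumption.
set.seed(3)
n <- 5000
d <- data.frame(
  young          = rbinom(n, 1, 0.5),
  female         = rbinom(n, 1, 0.5),
  lowIncome      = rbinom(n, 1, 0.5),
  divorced       = rbinom(n, 1, 0.2),
  NotCollegeGrad = rbinom(n, 1, 0.6),
  Smoking        = rbinom(n, 1, 0.45),
  PM2.5          = runif(n, 3, 20),
  O3             = runif(n, 0.02, 0.06)
)

# Made-up log-odds model used only to generate an outcome
eta <- -2 + 0.2 * d$young + 0.4 * d$female + 0.3 * d$lowIncome +
  0.2 * d$divorced + 0.2 * d$Smoking - 0.01 * d$PM2.5
d$Asthma <- rbinom(n, 1, plogis(eta))

# Main-effects logistic regression, as in the quoted script
LR <- glm(Asthma ~ young + female + lowIncome + divorced + NotCollegeGrad +
            Smoking + PM2.5 + O3,
          data = d, family = binomial(link = "logit"))

# Odds ratios with Wald 95% limits (swapped in for profile-likelihood limits)
OR <- exp(cbind(OR = coef(LR), confint.default(LR)))
round(OR, 3)
```

In a run like this, the interval for O3 is typically very wide, because a one-unit (1 ppm) change is far larger than the range over which O3 actually varies. The same scaling issue is worth keeping in mind when reading per-unit odds ratios for O3.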
Being divorced or separated is confirmed as a highly significant predictor of asthma risk, but not of heart attack risk. Not being a college graduate is a highly significant predictor of heart attack risk, but not of asthma risk. Smoking is a risk factor for both; ozone has a borderline significant positive association with heart attack risk (as suggested by the lower right panel of Figure 3.7) but not with asthma risk; and PM2.5 has a significant negative association with asthma risk but no significant association (after conditioning on the other risk factors) with heart attack risk. (PM2.5 and O3 are continuous variables, unlike the other risk factors; hence the odds ratio of 0.99 for PM2.5 and asthma simply indicates that an increase of 1 µg/m3 in PM2.5 only slightly reduces risk of asthma. For a 10 µg/m3 increase in PM2.5, the odds ratio would be 0.94, i.e., about a 6% reduction in the odds of asthma.) The heart attack odds ratio of 11.3 for ozone is startlingly high. However, when continuous risk factors such as age and income are no longer dichotomized, but all values of all predictors are used, the odds ratio for O3 as a predictor of heart attack risk falls to 0.28 (95% CI 0.016-4.9, p value = 0.38), suggesting that the high odds ratio with dichotomized variables may be due to residual confounding or issues caused by dichotomizing continuous predictors. Income, age, sex, education, and smoking all remain highly significant predictors. The odds ratio for PM2.5 is 0.99 (95% CI 0.98 to 1.00, p value 0.07).
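The rescaling of a per-unit odds ratio to a 10-unit change can be checked with a few lines of R. This is an illustrative calculation only. It assumes the reported 0.99 is a rounded value, which is an inference from the arithmetic rather than something the text states, and the baseline prevalence of 0.15 is a hypothetical value chosen for the example.

```r
# In a logistic model, the odds ratio for a k-unit increase is the
# per-unit odds ratio raised to the power k.
or_per_unit <- 0.99
or_per_10   <- or_per_unit^10
or_per_10                      # about 0.904 if 0.99 were exact

# Conversely, an odds ratio of 0.94 per 10 units implies this per-unit value,
# which rounds to 0.99 at two decimal places
implied_per_unit <- 0.94^(1 / 10)
implied_per_unit               # about 0.9938

# An odds ratio acts on odds, not directly on probabilities.
# Hypothetical baseline prevalence of 0.15:
p0    <- 0.15
odds1 <- (p0 / (1 - p0)) * 0.94
p1    <- odds1 / (1 + odds1)
1 - p1 / p0                    # relative reduction in probability, about 5%
```

The gap between 0.904 and 0.94 is consistent with an unrounded per-unit odds ratio near 0.994.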
Table 3.7. Logistic regression odds ratios, 95% confidence intervals [in brackets], and p-values for various risk factors (left column) as predictors of Asthma (middle column) and Heart Attack (right column) risks

Risk factor | Asthma Odds Ratio (OR) [95% confidence limits], p value | Heart Attack OR [95% confidence limits], p value
young | 1.25 [1.20, 1.30], p < 2E-16 | 0.44 [0.41, 0.46], p < 2E-16
female | 1.54 [1.48, 1.60], p < 2E-16 | 0.40 [0.37, 0.42], p < 2E-16
lowIncome | 1.36 [1.30, 1.42], p < 2E-16 | 2.0 [1.89, 2.12], p < 2E-16
divorced | 1.25 [1.20, 1.31], p < 2E-16 | 0.99 [0.93, 1.05], p = 0.77
NotCollegeGrad | 1.01 [0.97, 1.05], p = 0.70 | 1.41 [1.33, 1.49], p < 2E-16
Smoking | 1.19 [1.14, 1.24], p < 2E-16 | 1.74 [1.65, 1.84], p < 2E-16
PM2.5 | 0.99 [0.98, 0.997], p = 0.003 | 1.00 [0.99, 1.01], p = 0.58
O3 | 1.66 [0.26, 10.6], p = 0.59 | 11.3 [0.97, 131.8], p = 0.053
The results for Stroke (not shown) are easily summarized: being young or female are associated with significantly reduced risks of stroke (ORs of 0.47 and 0.79, respectively); being low income, divorced, an ever-smoker, or not a college graduate are all associated with significantly increased risk of stroke (ORs of 2.15, 1.18, 1.36, and 1.35, respectively); and neither PM2.5 nor O3 is significantly associated with risk of stroke (OR of 1.00 for PM2.5 with 95% CI from 0.99 to 1.01; and OR of 2.0 for O3 with 95% CI from 0.10 to 42) after controlling for the other variables in Table 3.7.
Dichotomizing continuous or ordered categorical variables risks distorting regression results. For this data set, however, the key findings just summarized are robust to alternative model specifications. For example, treating Age as a continuous variable (or using ordered categories for age to the nearest decade), treating Income on a scale from 1 to 8 as either continuous or categorical, and treating Marital Status and Education as categorical variables [e.g., using the R model specification LR <- glm(Asthma ~ Age + female + as.factor(Income) + as.factor(Marital) + as.factor(Education) + Smoking + PM2.5 + O3, family=binomial(link='logit'))] does not change the conclusions that asthma risk decreases significantly with age, income, and PM2.5 concentration and increases significantly with smoking and being female. Ozone has no significant association with asthma risk. If education is entered as a continuous variable (on a scale from 1 to 6), then there is a significant positive association (logistic regression coefficient) between education and asthma risk (OR = 1.03, 95% CI = [1.01, 1.05], p = 0.002), consistent with the discussion of Figure 3.4. Various interaction terms (not shown in the main-effects model in Table 3.7) are also significant (e.g., the female:smoking interaction for asthma has an odds ratio of 1.09, p = 0.04), consistent with Figure 3.7 and previous figures.
Results of Bayesian Network and Partial Correlation Analysis
Regression models treat the different variables in a data set asymmetrically by distinguishing a single dependent variable that is to be explained or predicted from the values of one or more independent variables. By contrast, Bayesian Network (BN) models show how different variables can be used to help explain each other's values, treating each variable as both a potential explanatory variable and a potential dependent variable in relation to the rest. (See Chapter 2 for a fuller discussion.) Figure 3.10 shows the graph structure of a BN learned from the data using the default settings of the bnlearn package in R. The arrows in Figure 3.10 indicate statistical dependency relations, and should not necessarily be interpreted causally. An arrow between two variables can be read as “is not independent of” (or, more briefly, as “depends on,” if it is understood that statistical dependence can go in either direction). Thus, arrows only show that the values of some variables depend on (i.e., are not statistically independent of) the values of others, which is a necessary but not sufficient condition for causality. In many cases, as in the relation between age and sex, or either of these and smoking, the arrows can be directed either way (via Bayes’ rule) to correctly show that these variables have statistically dependent values, but this does not imply any clear causal interpretation. (Technically, the directions of arrows in a BN indicate one way to decompose the joint distribution of all variables into a directed acyclic graph (DAG) model with marginal distributions at input nodes (those with only outward-pointing arrows) and conditional probability distributions at other nodes.)
The only implication for causality is that direct causes of a variable must be neighbors of it in such a DAG, i.e., linked to it by an arrow (in either direction), in order to satisfy the information condition constraint that effects depend on, and hence are not conditionally independent of, their causes.
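The point that an arrow can be directed either way via Bayes' rule can be illustrated with a two-variable example in R. The joint probabilities below are hypothetical and chosen only to make the variables dependent; the sketch shows that both arrow directions factor the same joint distribution, so the direction alone carries no causal information.

```r
# Hypothetical joint distribution of two binary variables (values are assumptions)
joint <- matrix(c(0.40, 0.10,
                  0.20, 0.30),
                nrow = 2, byrow = TRUE,
                dimnames = list(Smoking = c("no", "yes"),
                                Asthma  = c("no", "yes")))

p_smoking <- rowSums(joint)   # marginal distribution of Smoking
p_asthma  <- colSums(joint)   # marginal distribution of Asthma

# Direction 1: Smoking -> Asthma, i.e., P(S) * P(A | S)
p_asthma_given_smoking <- sweep(joint, 1, p_smoking, "/")
joint_dir1 <- sweep(p_asthma_given_smoking, 1, p_smoking, "*")

# Direction 2: Asthma -> Smoking, i.e., P(A) * P(S | A)
p_smoking_given_asthma <- sweep(joint, 2, p_asthma, "/")
joint_dir2 <- sweep(p_smoking_given_asthma, 2, p_asthma, "*")

# Both factorizations recover the same joint distribution
all.equal(joint_dir1, joint)
all.equal(joint_dir2, joint)

# The variables are dependent: the joint differs from the product of marginals
max(abs(joint - outer(p_smoking, p_asthma)))
```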
Fig. 3.10. Bayesian Network (BN) showing dependence relations among variables, generated using the bnlearn package in R. An arrow between two variables indicates that they are informative about each other: the frequency distribution of each variable varies significantly with the values of the variables to which it is connected, but, given those, it is conditionally independent of the values of variables to which it is not connected.
The structure of the BN in Figure 3.10 shows that asthma and heart attack risks are directly linked to each other (having any of Asthma_Ever, Heart_Attack_Ever, or Stroke_Ever makes the others more likely) and also to smoking, sex, age, education, and income, but not to PM2.5 or O3. Insofar as health outcomes are conditionally independent of exposure variables PM2.5 and O3, given the values of other variables such as smoking and education (which also depend on each other), this data set does not support a causal interpretation for exposure-response associations, but rather explains them in terms of mutual dependence of exposure and response variables on other (confounding) variables. For example, higher education levels (college and above) are associated with lower ozone and PM2.5 exposure concentrations, and also with higher incomes, and lower heart attack risks than lower levels of education, but this is not because lower ozone and PM2.5 directly cause these beneficial outcomes, as they are not adjacent to them in the DAG model.
Figure 3.10 in conjunction with the signs of different dependencies also suggests an explanation for the negative association between PM2.5 and asthma in regression modeling. As measured by partial correlation coefficients (i.e., correlation coefficients for each pair of variables after adjusting for others by multiple linear regression), shown in Table 3.8, PM2.5 exposure is negatively associated with smoking, and smoking is positively associated with asthma; also, PM2.5 (and O3) are negatively associated with education, and education is positively associated with asthma. Since the only paths connecting PM2.5 and asthma in Figure 3.10 involve smoking and education, and the associations are negative along each path, the total association between PM2.5 and asthma is negative. (The magnitudes of many of these correlation coefficients are attenuated by the use of binary variables such as Smoking, Sex, and Asthma_Ever, but they are still significantly different from zero.)
Table 3.8. Partial correlation coefficients (top) between pairs of variables (rows and columns) after adjusting for all other variables via multiple linear regression, and their p-values (bottom)
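Partial correlation coefficients of the kind tabulated here can be computed in base R as sketched below. The simulated variables and their coefficients are assumptions chosen for illustration, and the variables are continuous rather than binary. The sketch uses the standard identity linking partial correlations to the inverse covariance matrix and cross-checks one entry against the regression-residual definition given in the text.

```r
# Simulated continuous stand-ins for the survey variables (all values assumed)
set.seed(4)
n <- 2000
education <- rnorm(n)
pm25      <- -0.3 * education + rnorm(n)
smoking   <- -0.3 * education - 0.1 * pm25 + rnorm(n)
asthma    <-  0.2 * smoking + 0.1 * education + rnorm(n)
X <- cbind(pm25, smoking, education, asthma)

# Partial correlation of each pair given all other columns:
# minus the scaled off-diagonal entries of the inverse covariance matrix
partial_cor <- function(X) {
  precision <- solve(cov(X))
  pc <- -cov2cor(precision)
  diag(pc) <- 1
  dimnames(pc) <- list(colnames(X), colnames(X))
  pc
}
pc <- partial_cor(X)
round(pc, 3)

# Cross-check one entry: correlate the residuals of each variable
# after linear regression on the remaining variables
r_pm25   <- resid(lm(pm25 ~ smoking + education))
r_asthma <- resid(lm(asthma ~ smoking + education))
cor(r_pm25, r_asthma)   # matches pc["pm25", "asthma"]
```

In this simulation PM2.5 has no direct effect on asthma, so its partial correlation with asthma is near zero even though the two are marginally associated through smoking and education.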
Results of Regression Tree and Random Forest Analyses
A sense of the magnitudes of the multivariate dependencies and interactions among variables can be gained from nonparametric regression tree analyses, the outputs of which are shown in Figure 3.11 for the three adverse health effects, Asthma_Ever (upper left), Heart_Attack_Ever (upper right), and Stroke_Ever (bottom). Including all of the relevant variables (especially age, income, and education, which all have many levels) makes the trees too crowded to read, but these partial trees suffice to illustrate that heart attack risk and stroke risk are conditionally independent of PM2.5 given the values of other variables, while asthma has a negative association with PM2.5 for men (the Sex > 0 branches of the trees). Each regression tree shows combinations of value ranges for the explanatory variables that lead to significantly different conditional distributions for the dependent variable. The trees are read as follows. Each shaded “leaf” node at the bottom of the tree shows the conditional mean value of the dependent variable, given the ranges of values for the variables in the path leading to that leaf, e.g., asthma risk is y = 0.129 among the 16,983 women with Smoking = 0 (for nonsmokers, represented in the tree as Smoking < 0) and Divorced = 0 (for non-divorced women, coded as Divorced < 0) for the left-most node in the tree for Asthma_Ever (upper left). This compares to a risk of 0.20 among the 4,528 women who are divorced smokers. Non-leaf nodes show the p values from F tests for rejecting the null hypothesis that the conditional distributions are not different from each other on the left and right of each split.
Fig. 3.11. Regression trees for adverse health effects (asthma in the upper left tree, heart attack in the upper right, stroke at the bottom). For binary variables, > 0 = yes and < 0 = no. Each tree shows how the fraction of respondents reporting each condition (the y values in the shaded leaf nodes) depends on the conditions specified by the paths leading to the leaf nodes. The number of respondents (the numbers n in the shaded leaf nodes), p values for tests of significantly different y values at intermediate nodes, and conditions defining each branch are also shown. See text for further explanation and interpretation.
Growing regression trees on different random subsets of the data can produce different trees. A more robust approach averages the results of ensembles of hundreds of trees fit to random subsets of the data to predict how the dependent variable varies as a single independent variable is varied; this generates a partial dependence plot. Figures 3.10 and 3.11 and Tables 3.7 and 3.8 show that logistic regression, Bayesian Networks, regression trees, and partial correlation analyses all found no significant positive association between PM2.5 (or O3) and any of the three adverse health effects after conditioning on other variables, despite significant unconditional correlations created by confounders such as income. However, one can still generate partial dependence plots to estimate any residual associations. Because of interactions and residual confounding (e.g., due to differences in incomes or education or smoking within a single discrete coding level of these categorical variables), such residual patterns are expected. Quantifying their sizes puts a bound on the plausible magnitude of any undetected causal relations that might contribute to them. Accordingly, Figure 3.12 shows partial dependence plots for each health effect vs. PM2.5 (generated by the randomForest package with its default ensemble size of 500 trees; a 10% simple random sample of all 142,081 records with complete data for these variables was used to satisfy the memory requirements of the package). After conditioning on levels of other variables (via trees), disease probabilities vary only slightly as PM2.5 changes. Asthma_Ever has values in the range 0.145 ± 0.01, Heart_Attack_Ever has values in the range 0.077 ± 0.003, and Stroke_Ever has values in the range 0.0055 ± 0.0004 as PM2.5 concentrations range from less than 5 µg/m3 to over 60 µg/m3.
The plots suggest U-shaped residual associations between PM2.5 and disease risks, but whether these slight variations reflect causal impacts, residual confounding, selection biases, or random noise is unclear. In any event, an increase in PM2.5 concentration across its range by more than 50 µg/m3 changes corresponding health risks relatively little, if at all.
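The logic of a partial dependence calculation can be sketched in base R as follows. This is not the random forest analysis behind Figure 3.12: to keep the example self-contained, a logistic regression stands in for the tree ensemble, and the data, coefficients, and the helper `partial_dependence` are all hypothetical. The averaging step is the same for any fitted model: set the variable of interest to a fixed value for every record, predict, and average.

```r
# Simulated data in which PM2.5 has no effect on the outcome (by construction)
set.seed(5)
n <- 3000
d <- data.frame(
  Age     = runif(n, 20, 90),
  Smoking = rbinom(n, 1, 0.45),
  PM2.5   = runif(n, 3, 60)
)
d$Heart_Attack_Ever <- rbinom(n, 1, plogis(-5 + 0.04 * d$Age + 0.5 * d$Smoking))

# Stand-in model (a random forest would be used in the same way)
fit <- glm(Heart_Attack_Ever ~ Age + Smoking + PM2.5,
           data = d, family = binomial)

# Hypothetical helper: average predicted probability when `var` is forced
# to each grid value while all other columns keep their observed values
partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] <- v
    mean(predict(model, newdata = data, type = "response"))
  })
}

grid <- seq(5, 60, by = 5)
pd <- partial_dependence(fit, d, "PM2.5", grid)
range(pd)   # spread of average predicted risk across the PM2.5 grid
plot(grid, pd, type = "b", xlab = "PM2.5", ylab = "Average predicted risk")
```

A main-effects logistic model can only produce a monotone curve, so it cannot reproduce the U-shaped patterns described above; a tree ensemble imposes no such restriction.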