METHODS
Our analyses emphasize directly plotting and examining the data wherever possible, using simple descriptive statistics and visualizations including histograms, interaction plots, and scatter plots. These graphs (Figures 4.1-4.7, 4.9 and 4.10), with vertical bars around data points indicating 95% confidence intervals, as well as linear correlation and multiple linear regression results and nonparametric regression curves fit to the scatter plots via locally weighted scatterplot smoothing (lowess), were all generated in Statistica (www.statsoft.com/Products/STATISTICAFeatures), a commercial statistical software environment marketed by Quest Software Inc. (formerly marketed by StatSoft). These plots suffice to establish our key findings. We also performed more sophisticated analyses using computational statistics packages from the CRAN repository for the R project, https://cran.r-project.org/, as follows:

Bayesian network learning algorithms were run using the R package bnlearn (www.bnlearn.com/) with all settings at their default values. This package uses nonparametric machine learning algorithms to discover statistical dependencies and conditional independence relationships among variables, revealing which variables are informative about each other. It was used to generate Figure 4.8.
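
The idea of a conditional independence relationship can be illustrated with a minimal sketch. It is not the algorithm used by bnlearn: it assumes a linear, Gaussian setting and uses partial correlation, whereas the package applies its own tests and search procedures. The data are synthetic, generated from a hypothetical chain X -> Y -> Z, so that X and Z are correlated overall but carry no further information about each other once Y is known.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, z, given):
    """Correlation of x and z after linearly adjusting both for `given`."""
    r_xz = pearson(x, z)
    r_xg = pearson(x, given)
    r_zg = pearson(z, given)
    return (r_xz - r_xg * r_zg) / math.sqrt((1 - r_xg ** 2) * (1 - r_zg ** 2))

# Hypothetical chain X -> Y -> Z (synthetic data, not the study data)
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [a + rng.gauss(0, 1) for a in x]
z = [b + rng.gauss(0, 1) for b in y]

marginal = pearson(x, z)            # X and Z are dependent overall
conditional = partial_corr(x, z, y)  # but nearly independent given Y
print(f"corr(X, Z)     = {marginal:.2f}")
print(f"corr(X, Z | Y) = {conditional:.2f}")
```

In a learned network, a pattern of this kind would appear as X and Z being linked only through Y rather than by a direct edge.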

Random forest model ensembles were generated by the R package randomForest (https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) to quantify multivariate statistical dependencies among variables while controlling for the levels of other variables using multiple nonparametric classification and regression tree (CART) models. Random forest provides a nonparametric alternative to parametric regression modeling. It deals with model uncertainties about which variables to include as predictors, and what functional forms to specify to relate predictors to a dependent variable, by fitting hundreds of CART trees to random subsets of the data and averaging their predictions. It was used to generate Table 4.6 and several results mentioned in the accompanying discussion.
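
The averaging idea can be sketched in a few lines. This is a deliberately simplified toy, not the randomForest package: each "tree" is a single-split regression stump fit to a bootstrap sample on one randomly chosen predictor, whereas real random forests grow deep trees and sample candidate predictors at every split. All names and data below are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree on a single feature.

    Returns (feature, threshold, left_mean, right_mean), with the threshold
    chosen to minimize the sum of squared errors.
    """
    mean_all = sum(targets) / len(targets)
    best = (float("inf"), float("inf"), mean_all, mean_all)  # fallback: no split
    values = sorted(set(r[feature] for r in rows))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= thr]
        right = [t for r, t in zip(rows, targets) if r[feature] > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, thr, lm, rm)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, targets, n_trees=200, seed=0):
    """Fit stumps to bootstrap samples, each on a randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)          # random predictor for this stump
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Average the predictions of all stumps in the ensemble."""
    preds = [lm if row[f] <= thr else rm for f, thr, lm, rm in forest]
    return sum(preds) / len(preds)

# Hypothetical data: the response jumps when predictor 0 exceeds 2.5;
# predictor 1 is pure noise. No functional form is specified in advance.
rng = random.Random(1)
rows = [(rng.uniform(0, 5), rng.uniform(0, 5)) for _ in range(100)]
targets = [(10.0 if x0 > 2.5 else 2.0) + rng.gauss(0, 0.5) for x0, _ in rows]

forest = fit_forest(rows, targets)
low = predict(forest, (1.0, 2.5))
high = predict(forest, (4.0, 2.5))
print(f"prediction at x0 = 1.0: {low:.1f}; at x0 = 4.0: {high:.1f}")
```

The ensemble picks up the step in the response without being told which predictor matters or what shape the relationship takes, which is the property the text relies on.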
Bayesian networks and random forests, as well as CART trees, are also available in most current commercial statistics and machine learning software packages, as well as in free R packages and Python's scikit-learn; the above-cited documentation provides details for the R implementations we used. We used additional R packages (car, MASS, leaps) for multiple linear regression analyses; the results are briefly mentioned, but not discussed in detail, as any multiple linear regression program will produce the same results. To facilitate easy replication and extension of our analyses, we accessed all R packages and displayed the results (Figure 4.8 and Table 4.6) using the Causal Analysis Toolkit (CAT) described in Chapter 2.
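
For readers who wish to reproduce error bars of the kind described above, a 95% confidence interval for a mean can be approximated as in the following sketch. It assumes a normal approximation (a multiplier of about 1.96); statistical packages commonly use the t distribution instead, which gives somewhat wider intervals for small samples. The measurements are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    m = mean(values)
    se = stdev(values) / math.sqrt(len(values))  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for a 95% level
    return m - z * se, m + z * se

# Hypothetical measurements (not the study data)
sample = [12.1, 15.3, 9.8, 14.2, 11.7, 13.5, 10.9, 16.0, 12.8, 13.1]
low, high = mean_ci(sample)
print(f"mean = {mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```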
RESULTS
Descriptive Statistics
Table 4.2 summarizes several aspects of the frequency distributions of the variables in this data set. For variables such as weight and height, the mean, median, geometric mean, and harmonic mean are all closely similar. Figure 4.1 illustrates that air benzene concentrations (AB) have a distribution with a long right tail (skewness = 5), leading to a mean value that is more than triple the median value and more than twice the geometric mean. Metabolite distributions are even more skewed (Table 4.1).
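
The way a long right tail separates these summary measures can be seen in a small sketch. The sample is hypothetical, drawn from a lognormal distribution with arbitrarily chosen parameters, and the skewness formula is the simple moment-based version, which may differ slightly from the adjusted estimator reported by a given statistics package.

```python
import random
from statistics import geometric_mean, harmonic_mean, mean, median, pstdev

def skewness(values):
    """Moment-based skewness: the average cubed standardized deviation."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical right-skewed sample (not the study data)
rng = random.Random(42)
sample = [rng.lognormvariate(0.5, 1.2) for _ in range(2000)]

summary = {
    "mean": mean(sample),
    "geometric mean": geometric_mean(sample),
    "harmonic mean": harmonic_mean(sample),
    "median": median(sample),
    "skewness": skewness(sample),
}
for name, value in summary.items():
    print(f"{name:>15}: {value:.2f}")
```

For positive, right-skewed data the harmonic mean falls below the geometric mean, which falls below the arithmetic mean; for nearly symmetric variables such as height the three nearly coincide.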
Table 4.2. Descriptive statistics for all observations (pooled individual-days of data)
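| Variable | N | Mean | Geometric mean | Harmonic mean | Median | Minimum | Maximum | Lower quartile | Upper quartile | Std. dev. | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 620 | 0.3 |  |  | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.64 |
|  | 620 | 29.8 | 28.7 | 27.6 | 28.0 | 18.0 | 52.0 | 22.0 | 37.0 | 8.3 | 0.55 |
|  | 620 | 0.2 |  |  | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.4 | 1.29 |
|  | 613 | 2.2 |  |  | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 5.3 | 2.83 |
|  | 620 | 0.7 |  |  | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.4 | 1.11 |
|  | 620 | 1.6 | 1.4 | 1.3 | 1.0 | 1.0 | 3.0 | 1.0 | 3.0 | 0.9 | 0.88 |
|  | 611 | 60.7 | 59.8 | 59.0 | 60.0 | 39.0 | 114.0 | 53.0 | 66.0 | 10.6 | 0.96 |
|  | 611 | 164.5 | 164.3 | 164.2 | 164.0 | 146.0 | 185.0 | 160.0 | 169.0 | 6.9 | 0.48 |
|  | 366 | 4.9 | 1.9 | 1.1 | 1.5 | 0.3 | 88.9 | 0.7 | 4.7 | 10.4 | 5.00 |
|  | 620 | 1219.6 | 67.8 | 0.8 | 114.9 | 0.0 | 54782.2 | 7.8 | 506.2 | 4947.0 | 8.00 |
|  | 620 | 27.5 | 6.9 | 2.0 | 6.6 | 0.1 | 651.0 | 1.9 | 23.2 | 66.1 | 5.39 |
|  | 615 | 1.0 | 0.1 | 0.0 | 0.1 | 0.0 | 71.4 | 0.0 | 0.4 | 4.0 | 10.55 |
|  | 620 | 248.6 | 122.9 | 75.5 | 116.6 | 7.0 | 6604.9 | 60.7 | 211.8 | 558.4 | 7.69 |
|  | 620 | 32.1 | 18.7 | 13.1 | 17.6 | 2.1 | 860.9 | 10.3 | 29.6 | 63.9 | 8.17 |
|  | 620 | 31.8 | 14.7 | 8.7 | 12.9 | 0.6 | 742.8 | 7.4 | 26.0 | 67.2 | 5.92 |
|  | 620 | 1.3 | 1.2 | 1.0 | 1.3 | 0.1 | 3.3 | 1.0 | 1.7 | 0.6 | 0.31 |
|  | 370 | 14.0 | 7.6 | 4.4 | 7.7 | 0.6 | 174.3 | 3.4 | 14.7 | 20.3 | 3.96 |
|  | 619 | 71.3 | 19.0 | 3.9 | 30.6 | 0.3 | 1415.9 | 3.1 | 98.2 | 122.0 | 5.16 |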

Fig. 4.1. Histograms showing the skewed distributions of air benzene (AB) in factories 1 (left) and 2 (right). Lognormal distributions (continuous curves) fit to these data appear to underestimate their skewness (heavy right tails). Note the different vertical scales. High concentrations are less frequent in Factory 1 than in Factory 2.
Metabolites vs. benzene concentrations in air
Perhaps the most interesting research question is whether metabolite concentrations and DSM ratios vary nonlinearly with benzene concentrations between 0 ppm and 3 ppm. To start to address this key question without introducing modeling assumptions, Figures 4.2 and 4.3 plot mean levels of benzene metabolites against air benzene concentrations rounded to the nearest ppm. These plots pool over all individual-days of observation, so that some individuals who were measured on multiple days contribute multiple data points; as discussed later, individual data points are not strongly correlated between different measurement days, and including multiple days of data separately when they are available avoids the need to choose a single summary measure. Figure 4.2 shows these plots for phenol (PH) (left panel) and urinary benzene (right panel) levels averaged over all exposed workers in Factories 1 and 2 at each level of air benzene (AB) from 0 to 5 ppm, rounded to the nearest ppm. Figure 4.3 shows a similar plot for the benzene metabolites hydroquinone (HQ), catechol (CA), and muconic acid (MA), which have similar enough values to be plotted together on the same axes. These plots do not show strongly nonlinear metabolism or supralinearity between 0 and 3 ppm of air benzene; rather, metabolism below 3 ppm appears to be approximately linear.
Fig. 4.2. Plot of phenol (PH) concentrations (left) and urinary benzene (right) concentrations vs. air benzene (AB) for all exposed workers in Factories 1 and 2. The plots do not clearly suggest a nonlinear or supralinear relationship between benzene and phenol for AB between 0 and 3 ppm.
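
The binning behind plots of this kind, mean metabolite level at each air benzene concentration rounded to the nearest ppm, can be sketched as follows. The function name and the data pairs are hypothetical, and rounding halves upward is an assumption, since the text does not say how values exactly halfway between two integers were handled.

```python
from collections import defaultdict
from statistics import mean

def mean_by_rounded_exposure(pairs):
    """Group (exposure, response) pairs by exposure rounded to the nearest
    integer and return {rounded exposure: mean response}."""
    groups = defaultdict(list)
    for exposure, response in pairs:
        # Round half up; valid for non-negative exposures
        groups[int(exposure + 0.5)].append(response)
    return {level: mean(vals) for level, vals in sorted(groups.items())}

# Hypothetical (air benzene in ppm, metabolite level) pairs, not study data
observations = [
    (0.2, 10.0), (0.4, 12.0), (0.9, 21.0), (1.3, 19.0),
    (1.8, 30.0), (2.2, 31.0), (2.6, 41.0), (3.1, 39.0),
]
binned = mean_by_rounded_exposure(observations)
for level, avg in binned.items():
    print(f"AB ~ {level} ppm: mean metabolite = {avg:.1f}")
```

Plotting the binned means against the rounded exposure levels, with no model fitted, is what allows linearity to be judged by eye before any modeling assumptions are introduced.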