Causal Analytics for Applied Risk Analysis Louis Anthony Cox, Jr




METHODS
Our analyses emphasize plotting and examining the data directly wherever possible, using simple descriptive statistics and visualizations including histograms, interaction plots, and scatter plots. These graphs (Figures 4.1-4.7, 4.9, and 4.10), in which vertical bars around data points indicate 95% confidence intervals, were all generated in Statistica (www.statsoft.com/Products/STATISTICA-Features), a commercial statistical software environment marketed by Quest Software Inc. (formerly by StatSoft), as were the linear correlation and multiple linear regression results and the non-parametric regression curves fit to the scatter plots by locally weighted scatterplot smoothing (lowess). These plots suffice to establish our key findings. We also performed more sophisticated analyses using computational statistics packages from the CRAN repository for the R project (https://cran.r-project.org/), as follows:

  • Bayesian network learning algorithms were run using the R package bnlearn (www.bnlearn.com/), with all settings at their default values. This package uses nonparametric machine learning algorithms to discover statistical dependencies and conditional independence relationships among variables, revealing which variables are informative about each other. It was used to generate Figure 4.8.

  • Random forest model ensembles were generated by the R package randomForest (https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) to quantify multivariate statistical dependencies among variables while controlling for the levels of other variables, using multiple nonparametric classification and regression tree (CART) models. Random forest provides a non-parametric alternative to parametric regression modeling. It deals with model uncertainty about which variables to include as predictors, and about what functional forms should relate predictors to a dependent variable, by fitting hundreds of CART trees to random subsets of the data and averaging their predictions. It was used to generate Table 4.6 and several results mentioned in the accompanying discussion.
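The idea of fitting many trees to random subsets of the data and averaging their predictions can be illustrated with a minimal pure-Python sketch. This is not the randomForest package's implementation: the function names, the synthetic grid data, and the settings (100 trees, depth 3, one candidate feature per split) are all assumptions chosen for illustration.

```python
import random
from statistics import fmean

def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = fmean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, features):
    """Return the (feature, threshold) pair with the lowest squared error, or None."""
    best, best_err = None, float("inf")
    for f in features:
        values = sorted({x[f] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            err = (sse([y for x, y in rows if x[f] <= t])
                   + sse([y for x, y in rows if x[f] > t]))
            if err < best_err:
                best, best_err = (f, t), err
    return best

def grow_tree(rows, mtry, depth, rng):
    """Grow a regression tree; leaves are floats, internal nodes are tuples."""
    ys = [y for _, y in rows]
    if depth == 0 or len(set(ys)) == 1:
        return fmean(ys)
    n_features = len(rows[0][0])
    # Consider only a random subset of mtry features at this node.
    split = best_split(rows, rng.sample(range(n_features), mtry))
    if split is None:
        return fmean(ys)
    f, t = split
    left = [(x, y) for x, y in rows if x[f] <= t]
    right = [(x, y) for x, y in rows if x[f] > t]
    if not left or not right:
        return fmean(ys)
    return (f, t, grow_tree(left, mtry, depth - 1, rng),
            grow_tree(right, mtry, depth - 1, rng))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def fit_forest(X, y, n_trees=100, mtry=1, max_depth=3, seed=0):
    """Fit each tree to a bootstrap resample of the rows."""
    rng = random.Random(seed)
    rows = list(zip(X, y))
    return [grow_tree([rng.choice(rows) for _ in rows], mtry, max_depth, rng)
            for _ in range(n_trees)]

def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return fmean(predict_tree(tree, x) for tree in forest)

# Synthetic data (hypothetical): y depends on the first variable only.
X = [(i / 10, j / 10) for i in range(11) for j in range(11)]
y = [10.0 if x0 > 0.5 else 0.0 for x0, _ in X]
forest = fit_forest(X, y)
high = predict_forest(forest, (0.9, 0.5))
low = predict_forest(forest, (0.1, 0.5))
```

In this toy example the averaged prediction should be much higher at the first point than at the second, even though no functional form relating the predictors to the response was specified in advance.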

Bayesian networks, random forests, and CART trees are also available in most current commercial statistics and machine learning software, as well as in free R packages and Python's scikit-learn; the documentation cited above provides details for the R implementations we used. We used additional R packages (car, MASS, leaps) for multiple linear regression analyses; these results are mentioned briefly but not discussed in detail, as any multiple linear regression program will produce the same results. To facilitate replication and extension of our analyses, we accessed all R packages and displayed the results (Figure 4.8 and Table 4.6) using the Causal Analysis Toolkit (CAT) described in Chapter 2.
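For readers unfamiliar with the lowess curves mentioned at the start of this section, the following is a simplified sketch of locally weighted linear smoothing. It assumes tricube weights and a span of half the data, and it omits the robustness iterations of full lowess; it is not the Statistica routine, and the function name and parameters are hypothetical.

```python
def lowess_sketch(xs, ys, frac=0.5):
    """Fit a tricube-weighted straight line around each x and evaluate it there."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of neighbours in each local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# On exactly linear data the smoother reproduces the line.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
smooth = lowess_sketch(xs, ys)
```

Because each fitted value uses only nearby points, the curve can follow nonlinear trends in a scatter plot without a parametric model being specified.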


RESULTS

Descriptive Statistics

Table 4.2 summarizes several aspects of the frequency distributions of the variables in this data set. For variables such as weight and height, the mean, median, geometric mean, and harmonic mean are nearly equal. Figure 4.1 illustrates that air benzene concentrations (AB) have a distribution with a long right tail (skewness = 5), leading to a mean value (4.9) that is more than triple the median (1.5) and more than twice the geometric mean (1.9). Metabolite distributions are even more skewed (Table 4.2).


Table 4.2. Descriptive statistics for all observations (pooled individual-days of data)


| Variable | Valid N | Mean | Geometric Mean | Harmonic Mean | Median | Minimum | Maximum | Lower Quartile | Upper Quartile | Std.Dev. | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 620 | 0.3 | | | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.64 |
| Age | 620 | 29.8 | 28.7 | 27.6 | 28.0 | 18.0 | 52.0 | 22.0 | 37.0 | 8.3 | 0.55 |
| Smoke | 620 | 0.2 | | | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.4 | 1.29 |
| Cig. | 613 | 2.2 | | | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 5.3 | 2.83 |
| ExpCat | 620 | 0.7 | | | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.4 | -1.11 |
| Factory | 620 | 1.6 | 1.4 | 1.3 | 1.0 | 1.0 | 3.0 | 1.0 | 3.0 | 0.9 | 0.88 |
| Weight | 611 | 60.7 | 59.8 | 59.0 | 60.0 | 39.0 | 114.0 | 53.0 | 66.0 | 10.6 | 0.96 |
| Height | 611 | 164.5 | 164.3 | 164.2 | 164.0 | 146.0 | 185.0 | 160.0 | 169.0 | 6.9 | 0.48 |
| AB | 366 | 4.9 | 1.9 | 1.1 | 1.5 | 0.3 | 88.9 | 0.7 | 4.7 | 10.4 | 5.00 |
| UB | 620 | 1219.6 | 67.8 | 0.8 | 114.9 | 0.0 | 54782.2 | 7.8 | 506.2 | 4947.0 | 8.00 |
| MA | 620 | 27.5 | 6.9 | 2.0 | 6.6 | 0.1 | 651.0 | 1.9 | 23.2 | 66.1 | 5.39 |
| SPMA | 615 | 1.0 | 0.1 | 0.0 | 0.1 | 0.0 | 71.4 | 0.0 | 0.4 | 4.0 | 10.55 |
| PH | 620 | 248.6 | 122.9 | 75.5 | 116.6 | 7.0 | 6604.9 | 60.7 | 211.8 | 558.4 | 7.69 |
| CA | 620 | 32.1 | 18.7 | 13.1 | 17.6 | 2.1 | 860.9 | 10.3 | 29.6 | 63.9 | 8.17 |
| HQ | 620 | 31.8 | 14.7 | 8.7 | 12.9 | 0.6 | 742.8 | 7.4 | 26.0 | 67.2 | 5.92 |
| Creat | 620 | 1.3 | 1.2 | 1.0 | 1.3 | 0.1 | 3.3 | 1.0 | 1.7 | 0.6 | 0.31 |
| AT | 370 | 14.0 | 7.6 | 4.4 | 7.7 | 0.6 | 174.3 | 3.4 | 14.7 | 20.3 | 3.96 |
| UT | 619 | 71.3 | 19.0 | 3.9 | 30.6 | 0.3 | 1415.9 | 3.1 | 98.2 | 122.0 | 5.16 |
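As an illustration of how the summary measures in Table 4.2 relate to one another, the sketch below computes them for a single variable using Python's standard library. The sample values are hypothetical, not study data. Two choices are assumptions rather than documented Statistica behavior: skewness is computed as the simple moment coefficient, and the quartiles use the library's default method, so results from other packages may differ slightly. Geometric and harmonic means are reported only for strictly positive data.

```python
from statistics import (fmean, geometric_mean, harmonic_mean,
                        median, quantiles, stdev)

def describe(values):
    """Summary measures for one variable, skipping missing (None) entries."""
    x = [v for v in values if v is not None]
    n = len(x)
    mean = fmean(x)
    lower_q, _, upper_q = quantiles(x, n=4)
    m2 = sum((v - mean) ** 2 for v in x) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n  # third central moment
    positive = all(v > 0 for v in x)
    return {
        "valid_n": n,
        "mean": mean,
        "geometric_mean": geometric_mean(x) if positive else None,
        "harmonic_mean": harmonic_mean(x) if positive else None,
        "median": median(x),
        "minimum": min(x),
        "maximum": max(x),
        "lower_quartile": lower_q,
        "upper_quartile": upper_q,
        "std_dev": stdev(x),
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
    }

# Hypothetical right-skewed sample with one missing value.
sample = [0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 4.7, 10.0, 88.9, None]
stats = describe(sample)

# Hypothetical 0/1 indicator: geometric and harmonic means are not reported.
indicator = describe([0, 1, 0, 0, 1, 0])
```

For a long right tail such as this one, the harmonic mean falls below the geometric mean, which falls below the arithmetic mean, and the mean exceeds the median; this is the pattern the table shows for AB and the metabolites.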



Fig. 4.1. Histograms showing the skewed distributions of air benzene (AB) in factories 1 (left) and 2 (right). Log-normal distributions (continuous curves) fit to these data appear to underestimate their skewness (heavy right tails). Note the different vertical scales. High concentrations are less frequent in Factory 1 than in Factory 2.


Metabolites vs. benzene concentrations in air
Perhaps the most interesting research question is whether metabolite concentrations and DSM ratios vary nonlinearly with benzene concentrations between 0 ppm and 3 ppm. To start to address this key question without introducing modeling assumptions, Figures 4.2 and 4.3 plot mean levels of benzene metabolites against air benzene concentrations rounded to the nearest ppm. These plots pool all individual-days of observation, so individuals who were measured on multiple days contribute multiple data points. As discussed later, individual data points are not strongly correlated between measurement days, and including each available day separately avoids the need to choose a single summary measure. Figure 4.2 shows this plot for phenol (PH) (left panel) and urinary benzene (UB) (right panel) levels averaged over all exposed workers in Factories 1 and 2 at each level of air benzene (AB) from 0 to 5 ppm, rounded to the nearest ppm. Figure 4.3 shows a similar plot for the benzene metabolites hydroquinone (HQ), catechol (CA), and muconic acid (MA), whose values are similar enough to be plotted on the same axes. These plots do not show strongly nonlinear metabolism or supra-linearity between 0 and 3 ppm of air benzene; rather, metabolism below 3 ppm appears to be approximately linear.
Fig. 4.2. Plots of phenol (PH) (left) and urinary benzene (UB) (right) concentrations vs. air benzene (AB) for all exposed workers in Factories 1 and 2. The plots do not clearly suggest a nonlinear or supra-linear relationship between benzene and phenol for AB between 0 and 3 ppm.
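The binning behind plots of this kind can be sketched as follows: round each air benzene value to the nearest ppm, pool the metabolite values in each bin, and report the bin mean with an approximate 95% confidence interval. The normal-approximation interval (mean plus or minus 1.96 standard errors), the half-up rounding rule, the function name, and the example numbers are all assumptions made for illustration; they are not taken from the study.

```python
import math
from collections import defaultdict
from statistics import fmean, stdev

def binned_means(exposure, response):
    """Map each rounded exposure level to (n, mean, ci_low, ci_high)."""
    bins = defaultdict(list)
    for ab, y in zip(exposure, response):
        if ab is None or y is None:
            continue  # skip individual-days with a missing measurement
        # floor(x + 0.5) rounds halves up; Python's round() rounds halves to even.
        bins[math.floor(ab + 0.5)].append(y)
    summary = {}
    for level in sorted(bins):
        ys = bins[level]
        mean = fmean(ys)
        if len(ys) > 1:
            half = 1.96 * stdev(ys) / math.sqrt(len(ys))
        else:
            half = float("nan")  # no interval from a single observation
        summary[level] = (len(ys), mean, mean - half, mean + half)
    return summary

# Hypothetical values: air benzene in ppm and a metabolite concentration.
ab = [0.2, 0.4, 0.6, 1.4, 1.5, 2.4, None]
ph = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
result = binned_means(ab, ph)
```

Plotting the bin means against the bin levels, with the intervals as vertical bars, gives a model-free view of whether the response rises roughly in proportion to exposure.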


