We followed the same significance test used by Kleinstreuer et al. (2013): “Assay-endpoint pairs were considered significant if the CI for the pair did not include 1.0 (i.e., an OR of ‘no evidence of association’), and if the point estimate of the OR was outside of the 95% permutation test-derived CI for the endpoint.” Applying their additional filter of including only cases with three or more true positives (n11 ≥ 3), we replicated the data underlying their Figure 1 (an odds-ratio forest plot), reproduced as our Figure 8.1. Our Table 8.4 summarizes the computed values underlying Figure 8.1.
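The two-part significance criterion, combined with the true-positive filter, can be sketched as follows. This is a minimal illustration with invented values; the function and variable names are ours, not from the original data files:

```python
# Sketch of the significance test described above (hypothetical values).
def is_significant(or_point, or_ci, perm_ci, n11, min_true_pos=3):
    """An assay-endpoint pair is retained only if:
    1. the OR confidence interval excludes 1.0 (evidence of association),
    2. the OR point estimate falls outside the endpoint-specific
       permutation-test 95% CI, and
    3. the pair has at least min_true_pos true positives (n11 >= 3).
    """
    lo, hi = or_ci
    perm_lo, perm_hi = perm_ci
    excludes_one = not (lo <= 1.0 <= hi)
    outside_perm = not (perm_lo <= or_point <= perm_hi)
    return excludes_one and outside_perm and n11 >= min_true_pos

# Hypothetical pair: OR = 4.2, CI (1.8, 9.6), permutation CI (0.5, 3.1),
# five true positives -> passes all three conditions.
print(is_significant(4.2, (1.8, 9.6), (0.5, 3.1), 5))  # True
# CI (0.9, 4.4) includes 1.0 -> fails the first condition.
print(is_significant(2.0, (0.9, 4.4), (0.5, 3.1), 5))  # False
```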
Table 8.4. Replication of Significant Assay Endpoint Pairs and OR Values
Figure 8.1 (from Kleinstreuer et al. 2013). Forest plot showing the mean OR and CIs for each significant association between in vitro assay and in vivo endpoint. Only associations with three or more true positives are shown. The colored circles give the point estimate of the OR and whiskers give the 95% CI. The gray bars indicate the endpoint-specific permutation test 95% CI. The linkage to types of processes is indicated by the color of the OR circle: dark gray is cancer hallmark-related, light gray is XME-related, and white is other. The assay name is listed at the far left. The associated gene, gene-related process, species, cancer type, and cancer severity level (2 = preneoplastic lesions, 3 = neoplastic lesions) are indicated to the right. A darker line indicates overlap of the assay-specific and the endpoint CIs.
Our replication identified two significant assay-endpoint combinations that were not included in the Kleinstreuer et al. article or in the results data files that they provided. These two combinations are excerpted below:
They were not included in Figure 1. However, in their file of all endpoint computations, ORforestData_assay.txt, similar combinations are coded as follows:
Note that the OR-related values agree between the files. The permutation CIs for the endpoints (theirs not shown here) also agree. It is the “#” symbol in the file that appears to have caused these combinations to be excluded from their calculations. We conclude that miscoding or corruption in the data files probably produced the mistaken “#” designation. Fortunately, the practical impact is minor: oxyfluorfen (CASRN 42874-03-3), shown in their Table 1, should have a total score of 11 rather than the 9 shown, and the Liver 2 and Liver 3 columns in that row should each be increased by 1, making them 3 and 2 respectively. This does not have a significant impact on the results and will be discussed further below.
Overall, we were able to reproduce most of the variable-selection process used by Kleinstreuer et al. (2013). However, we do not endorse this approach to variable selection based on odds ratios. We recommend instead using Bayesian Model Averaging, cross-validation, or other model ensemble methods to help overcome model selection, multiple comparison, and over-fitting biases that can inflate false-positive rates and reduce generalization accuracy. These sources of avoidable bias and error do not appear to have been adequately controlled in the OR-based selection procedure applied by Kleinstreuer et al. (2013).
The identified “significant” assay-endpoint combinations were used as predictor variables for calculating a cancer hazard score for each chemical. The score for each chemical is defined simply as a count of how many assays were activated, i.e., had non-zero AC50 values, that coincide with a significant (rat-related) assay-endpoint pair. Table 1 of Kleinstreuer et al. (2013) lists the resulting total scores, along with a breakdown by endpoint type for each of the 60 chemicals. Using our Python software, we reproduced all of the scores as shown, with the exception of the two table values just discussed.
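The counting step just described can be sketched as follows. The assay and endpoint names here are placeholders of our own invention, not identifiers from the Kleinstreuer et al. data files:

```python
# Sketch of the hazard-score count (hypothetical assay-endpoint pairs).
significant_pairs = {("AssayA", "Liver2"), ("AssayB", "Liver3"), ("AssayC", "Liver2")}

def hazard_score(activated_assays, pairs=significant_pairs):
    """Count the significant assay-endpoint pairs whose assay was
    activated (i.e., had a non-zero AC50) for this chemical."""
    return sum(1 for assay, endpoint in pairs if assay in activated_assays)

# A chemical activating AssayA and AssayC coincides with two significant pairs.
print(hazard_score({"AssayA", "AssayC"}))  # 2
```

Note that an assay appearing in several significant pairs would contribute once per pair under this counting scheme.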
Replicating the Comparison of the Model-Predicted Cancer Hazard Scores to EPA’s Binary Cancer Classifications
The key research question that the previous steps are intended to provide data to answer is: how well do the computed cancer hazard scores predict the externally derived 0/1 cancer potential classifications provided by EPA-OPP? As previously discussed, the hazard scores were generated by applying the scoring procedure (counting the number of relevant activated assays), developed from the training set of 232 chemicals with (rat and mouse) in-vivo endpoint information, to a test set of 60 chemicals without in-vivo endpoint information (of which 33 had externally provided cancer classification data). To compare the predictive scores (counts) to the externally provided EPA-OPP binary classifications, Kleinstreuer et al. (2013) performed a Mann-Whitney test (also known as the Mann-Whitney-Wilcoxon rank sum test, or MWW) to assess the statistical significance of the association between the 33 cancer hazard scores and the binary cancer classifications. The objective of MWW is to test the null hypothesis that two populations have the same distribution of scores against the alternative hypothesis that one tends to have larger values than the other. The MWW test was originally devised for continuous variables (here we have binary and integer variables), but in practice it has been applied to ordered categorical data as well. The authors reported a significant correlation (value not provided) with a significance level of 0.024 from their MWW test. Since this is well under the conventional 0.05 level, they concluded that their methodology is significantly predictive in the external validation test set: “We have demonstrated an approach to identify and test molecular pathways or processes that, when perturbed by a chemical, raise the likelihood that the chemical will be a carcinogen. … A simple scoring function built from these associated genes was significantly predictive of cancer hazard classifications for an external test set.”
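In structural terms, this comparison splits the hazard scores into two groups by the binary classification and applies a rank-sum test. The sketch below illustrates the setup with scipy; the score values are invented, and Kleinstreuer et al.'s actual implementation is unknown:

```python
# Sketch of the MWW comparison (hypothetical data, not the Table 1 scores).
from scipy import stats

scores_pos = [3, 5, 5, 7, 9, 11]  # scores for chemicals classified 1 (invented)
scores_neg = [0, 0, 2, 3, 3, 5]   # scores for chemicals classified 0 (invented)

# With tied values (e.g., the repeated 3s and 5s across groups), an exact
# MWW p-value cannot be computed; implementations fall back to a normal
# approximation with a tie correction.
u_stat, p_mww = stats.mannwhitneyu(scores_pos, scores_neg, alternative="two-sided")
print(u_stat, p_mww)
```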
We performed an independent analysis to check this conclusion. First, we noted that the MWW test is not applicable when there are ties in the ranks of the values, of which there are many here. It cannot be relied upon to determine a correct significance value under these conditions. (It is not clear what implementation of MWW Kleinstreuer et al. used, but the R function “wilcox.test” issues the warning “In wilcox.test.default(x = c(21.4, 18.7, 18.1, 14.3, 24.4, 22.8, : cannot compute exact p-value with ties” when the data contain tied values.) We therefore applied a different test – the Kendall tau-b correlation test – that adjusts for tied values. We used the standard implementation available in the Python scipy library (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.kendalltau.html). We first performed the Kendall tau-b test (with adjustments for ties) on the data as reported in Table 1 of Kleinstreuer et al. (2013), without corrections. This test yielded a correlation coefficient of 0.30 and a barely significant p-value of 0.0453, larger than their reported p-value of 0.024 but still slightly less than 0.05. (For comparison, we were able to derive 0/1 classification data for 176 of the 232 chemicals in the training data set using the EPA/OPP document, and found that the test yielded a correlation of 0.20 with a p-value of 0.00188, that is, a highly significant correlation.)
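The tie-adjusted check can be sketched as follows, using scipy's kendalltau, which computes tau-b by default and so adjusts for ties in either variable. The data values here are hypothetical, not the actual Table 1 scores:

```python
# Sketch of the Kendall tau-b check (hypothetical data with many ties,
# as in the real score data).
from scipy import stats

classifications = [1, 1, 1, 1, 0, 0, 0, 0]  # invented EPA-OPP 0/1 labels
hazard_scores = [7, 5, 5, 3, 3, 2, 0, 0]    # invented integer scores

# kendalltau computes tau-b, which corrects the denominator for tied
# values in either variable, unlike the uncorrected MWW exact test.
tau, p_value = stats.kendalltau(classifications, hazard_scores)
print(tau, p_value)
```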
We next performed the same test on the corrected data, with methylene bis(thiocyanate) (MITC) removed from the external validation test set, etridiazole added, and the minor score correction made for oxyfluorfen. This test yielded a correlation coefficient of 0.20 and a p-value of 0.09, which is not significant at the conventional 5% significance level (although it is at the 10% level). If both MITC and etridiazole are included, the Kendall tau-b test yields a correlation coefficient of 0.25 and a p-value of 0.09. Thus, Kleinstreuer et al.’s rejection, at the conventional 5% significance level, of the null hypothesis of no association between scores and binary external classifications appears to depend crucially on the mistaken classifications of two chemicals and on use of the MWW test without needed corrections for ties.
An Alternative Comparison of the Model-Predicted Cancer Hazard Scores to EPA’s Binary Cancer Classifications