Assembly and binning

Reads were assembled using IDBA-UD (Peng et al., 2012) with default settings. For co-assemblies, reads from all samples were combined into a single file prior to assembly. Assemblies were binned using CONCOCT (Alneberg et al., 2013) with default parameters. The taxonomy of each genome was determined using Centrifuge (Kim et al., 2016), and the output was parsed by dRep version 0.1. Bins with less than 50% completeness, more than 25% contamination, or more than 25% strain heterogeneity, as estimated by CheckM (Parks et al., 2015), were removed from the analysis.
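
The bin quality filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the BinQuality record, the function name, and the example values are hypothetical, and in practice the three percentages would be read from CheckM output.

```python
from dataclasses import dataclass


@dataclass
class BinQuality:
    """Hypothetical record holding CheckM-style estimates (all in percent)."""
    name: str
    completeness: float
    contamination: float
    strain_heterogeneity: float


def passes_quality_filter(b, min_completeness=50.0,
                          max_contamination=25.0, max_strain_het=25.0):
    """Keep a bin unless it is below 50% complete, above 25% contaminated,
    or above 25% strain heterogeneity (thresholds taken from the text)."""
    return (b.completeness >= min_completeness
            and b.contamination <= max_contamination
            and b.strain_heterogeneity <= max_strain_het)


# Made-up example values, for illustration only.
bins = [
    BinQuality("bin_A", 97.2, 1.1, 0.0),
    BinQuality("bin_B", 42.0, 0.5, 0.0),    # removed: too incomplete
    BinQuality("bin_C", 88.0, 31.0, 10.0),  # removed: too contaminated
]
kept = [b.name for b in bins if passes_quality_filter(b)]
```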

Testing the effect of genome completeness

A genome of Escherichia coli (NCBI accession GCA_000988385.1) was subset into fractions of 10%, 20%, 30%, … 90%, and 100% of the genome. This was done by fragmenting the genome using a gamma distribution model and choosing a subset of the pieces, as described previously (Brown et al., 2016). All ten fractions were then compared to each other in a pair-wise manner using gANI, ANIm, and Mash.
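
A minimal sketch of this kind of subsetting is given below. It is not the procedure of Brown et al. (2016): the gamma shape and scale parameters, the function names, and the random toy sequence are all assumptions made for illustration.

```python
import random


def fragment_genome(sequence, shape, scale, rng):
    """Cut a sequence into consecutive pieces with gamma-distributed lengths.
    The shape and scale values are illustrative assumptions."""
    pieces = []
    start = 0
    while start < len(sequence):
        length = max(1, int(rng.gammavariate(shape, scale)))
        pieces.append(sequence[start:start + length])
        start += length
    return pieces


def subset_to_fraction(pieces, fraction, rng):
    """Randomly choose pieces until at least `fraction` of the total length is kept."""
    target = fraction * sum(len(p) for p in pieces)
    shuffled = pieces[:]
    rng.shuffle(shuffled)
    kept, kept_len = [], 0
    for piece in shuffled:
        if kept_len >= target:
            break
        kept.append(piece)
        kept_len += len(piece)
    return kept


rng = random.Random(0)
# Toy random sequence standing in for the E. coli genome.
genome = "".join(rng.choice("ACGT") for _ in range(200_000))
pieces = fragment_genome(genome, shape=2.0, scale=2_000.0, rng=rng)
subsets = {pct: subset_to_fraction(pieces, pct / 100, rng)
           for pct in range(10, 101, 10)}
```

Because whole pieces are chosen, each subset slightly overshoots its target fraction; the 100% subset contains every piece.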

Simulating run-times of de-replication algorithms

Mash has previously been reported to make 54,118 comparisons in 0.9 seconds (Ondov et al., 2016), and gANI to make 86.5 million comparisons in 190,000 hours (Varghese et al., 2015). This corresponds to 2.7 x 10^-7 and 0.13 minutes per comparison, respectively. These rates were used to estimate the run-time of each algorithm assuming pair-wise comparison of all genomes in the list. Because the run-time of dRep depends on the number and size of primary clusters, a model was built to predict the number of secondary comparisons performed for a given number of input genomes. Each of the infant datasets used in this analysis, as well as all of them de-replicated together, was used to generate a line of best fit assuming a second-order polynomial model. The best fit line is:

c = (2.953E-2 * g²) + (4.309 * g) + 1.060

where c is the number of secondary comparisons performed and g is the number of input genomes. The CPU time was then calculated using the same per-comparison time estimates as above, assuming pair-wise Mash comparisons of all genomes and c gANI comparisons.
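
The arithmetic behind these estimates can be reproduced with the short sketch below. The function names are hypothetical, and counting the pair-wise comparisons for g genomes as g * g is an assumption; the text does not state whether self-comparisons or both directions are included.

```python
# Per-comparison rates derived from the reported benchmarks.
MASH_MIN_PER_COMPARISON = 0.9 / 54_118 / 60      # close to the 2.7 x 10^-7 min quoted above
GANI_MIN_PER_COMPARISON = 190_000 * 60 / 86.5e6  # about 0.13 min


def secondary_comparisons(g):
    """Best-fit number of secondary comparisons for g input genomes."""
    return 2.953e-2 * g ** 2 + 4.309 * g + 1.060


def estimated_minutes(g):
    """Estimated CPU minutes for g genomes under three strategies.
    Assumption: an all-versus-all comparison costs g * g comparisons."""
    n_pairs = g * g
    return {
        "mash_all_pairs": n_pairs * MASH_MIN_PER_COMPARISON,
        "gani_all_pairs": n_pairs * GANI_MIN_PER_COMPARISON,
        "drep": (n_pairs * MASH_MIN_PER_COMPARISON
                 + secondary_comparisons(g) * GANI_MIN_PER_COMPARISON),
    }


estimate = estimated_minutes(1000)
```

Under these assumptions, the dRep estimate for 1,000 genomes is dominated by the roughly 33,840 secondary gANI comparisons rather than by the one million Mash comparisons.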

S. aureus genome alignment

S. aureus bins obtained from individual assemblies and co-assemblies were aligned to a complete reference genome of Staphylococcus aureus (accession GCA_001549675.1) using Geneious (Kearse et al., 2012) with a minimum scaffold length of 3 kb. Results were visualized using Circos (Krzywinski et al., 2009). Scaffolds from the co-assembly were aligned to a single scaffold from the individual assembly using Geneious. Reads from all samples combined were also aligned to the individual scaffold using Bowtie 2 (Langmead and Salzberg, 2012) and visualized using Geneious.

dRep

dRep is an open-source Python program available under an MIT license. The up-to-date program and manual are available at https://github.com/MrOlm/drep and drep.readthedocs.io/en/latest/, respectively.

Filtering genomes based on completeness

The first step in the dRep pipeline is filtering the input genomes. This step is essential because Mash is unable to accurately compare highly incomplete genomes (see supplemental program manual for more information). Filtering is performed based on genome completeness estimates determined using CheckM, and may also be performed based on genome length, contamination, and/or strain heterogeneity.

Clustering genomes based on ANI

Genomes are first compared in a pair-wise manner using Mash. Hierarchical clustering of the resulting distance matrix is then used to group genomes into “primary clusters.” The default threshold is 90% ANI, but this value can be set by the user, and it will need to be lowered if more incomplete genomes are allowed. Next, the genomes within each primary cluster are compared pair-wise using a slower, more sensitive algorithm; the two options available in dRep v0.3 are ANIm and gANI. Hierarchical clustering is then performed within each primary cluster to form “secondary clusters,” whose members are regarded as identical genomes for the purposes of the program. The secondary clustering threshold can also be adjusted by the user.
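
The two-step logic can be illustrated with the sketch below. It is not dRep's implementation: it groups genomes by a plain ANI cutoff, which is equivalent to cutting a single-linkage dendrogram at that threshold, whereas the linkage method used by dRep is not stated here. The function names, the toy ANI values, and the 99% secondary threshold are assumptions.

```python
from itertools import combinations


def threshold_clusters(names, ani, min_ani):
    """Group genomes so that any pair with ANI >= min_ani shares a cluster
    (single-linkage grouping via union-find; an assumed linkage method)."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if ani(a, b) >= min_ani:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


def two_step_clusters(names, fast_ani, slow_ani, primary_ani, secondary_ani):
    """Primary clusters from a fast estimate, then secondary clusters from a
    slower estimate computed only within each primary cluster."""
    result = []
    for primary in threshold_clusters(names, fast_ani, primary_ani):
        result.extend(threshold_clusters(primary, slow_ani, secondary_ani))
    return sorted(result)


# Toy ANI values, invented for illustration.
TOY_ANI = {
    frozenset(("g1", "g2")): 0.995,
    frozenset(("g1", "g3")): 0.93,
    frozenset(("g2", "g3")): 0.93,
    frozenset(("g1", "g4")): 0.75,
    frozenset(("g2", "g4")): 0.75,
    frozenset(("g3", "g4")): 0.75,
}


def toy_ani(a, b):
    return TOY_ANI[frozenset((a, b))]


clusters = two_step_clusters(["g1", "g2", "g3", "g4"], toy_ani, toy_ani,
                             primary_ani=0.90, secondary_ani=0.99)
```

In a real run the fast estimate would come from Mash and the slow one from ANIm or gANI; the toy example reuses one lookup table for both. The slow comparison is only evaluated for pairs that share a primary cluster, which is where the run-time saving comes from.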

Choosing representative genomes

To choose the best genome from each secondary cluster, a scoring system based on the results of CheckM is used. The score is calculated as:

score = A(completeness) + B(log₁₀(N50)) – C(contamination) – D(strain heterogeneity) + E(log₁₀(genome size))

where A, B, C, D, and E are user-defined weights (default values 1, 1, 5, 1, and 0, respectively). The genome with the highest score is selected as the representative of its secondary cluster and is the only genome from that cluster included in the de-replicated genome list.
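
As a worked example, the score can be computed as in the sketch below. The function name, the dictionary layout, and the two example bins are hypothetical; the weights default to the values given above.

```python
import math


def genome_score(completeness, contamination, strain_heterogeneity,
                 n50, genome_size, A=1, B=1, C=5, D=1, E=0):
    """Score a genome using the weighted formula given in the text."""
    return (A * completeness
            + B * math.log10(n50)
            - C * contamination
            - D * strain_heterogeneity
            + E * math.log10(genome_size))


# Two made-up members of one secondary cluster.
cluster = {
    "bin_A": dict(completeness=95.0, contamination=2.0,
                  strain_heterogeneity=0.0, n50=100_000, genome_size=2_800_000),
    "bin_B": dict(completeness=98.0, contamination=4.0,
                  strain_heterogeneity=0.0, n50=10_000, genome_size=2_900_000),
}
scores = {name: genome_score(**stats) for name, stats in cluster.items()}
representative = max(scores, key=scores.get)
```

With the default weights, bin_A scores 95 + 5 - 10 = 90 and bin_B scores 98 + 4 - 20 = 82, so the less complete but less contaminated bin_A is chosen. This reflects the heavier default weight on contamination.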

Visualization

A dendrogram is generated for the initial primary clustering, as well as all secondary clusters (Supplemental Figure S2). This helps the user visualize the de-replication process. When using secondary clustering algorithms that do not always result in an ANI value of 1 when comparing a genome to itself (like ANIm), the highest ANI value of all self-comparisons in a primary cluster is drawn as a red dotted line to aid in visualizing the “limit of detection” for that primary cluster. The secondary clustering threshold is also shown.

Warnings

Warnings alert the user to possible de-replication errors and can optionally be generated at the end of the run. They are generated when a genome is narrowly included in or excluded from a primary or secondary cluster, and when two de-replicated genomes from different primary clusters have an ANI value above a certain threshold.
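
One way such checks could be expressed is sketched below. This is an assumption-laden illustration rather than dRep's warning code: the function names, the margin used to define "narrowly", and all ANI values are hypothetical.

```python
def near_threshold_pairs(pair_ani, threshold, margin):
    """Pairs whose ANI lies within `margin` of a clustering threshold, i.e.
    genomes that were narrowly included in or excluded from a cluster."""
    return sorted(pair for pair, ani in pair_ani.items()
                  if abs(ani - threshold) <= margin)


def similar_winners(winner_ani, primary_cluster_of, warn_ani):
    """Pairs of de-replicated genomes from different primary clusters whose
    ANI is at or above `warn_ani`."""
    return sorted((a, b) for (a, b), ani in winner_ani.items()
                  if primary_cluster_of[a] != primary_cluster_of[b]
                  and ani >= warn_ani)


# Made-up values for illustration.
pair_ani = {("g1", "g2"): 0.9895, ("g1", "g3"): 0.93, ("g3", "g4"): 0.9905}
close_calls = near_threshold_pairs(pair_ani, threshold=0.99, margin=0.002)

winner_ani = {("g1", "g5"): 0.96, ("g1", "g6"): 0.80}
primary_cluster_of = {"g1": 1, "g5": 2, "g6": 3}
suspicious = similar_winners(winner_ani, primary_cluster_of, warn_ani=0.95)
```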

Supplemental References

Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, et al. (2013). CONCOCT: clustering contigs on coverage and composition. arXiv preprint arXiv:1312.4038. http://arxiv.org/abs/1312.4038 (Accessed July 7, 2016).

Brown CT, Olm MR, Thomas BC, Banfield JF. (2016). Measurement of bacterial replication rates in microbial communities. Nat Biotech 34: 1256–1263.

Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, et al. (2012). Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28: 1647–1649.

Kim D, Song L, Breitwieser FP, Salzberg SL. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res 26: 1721–1729.

Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, et al. (2009). Circos: an information aesthetic for comparative genomics. Genome Res 19: 1639–1645.

Langmead B, Salzberg SL. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9: 357–359.

Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17. e-pub ahead of print, doi: 10.1186/s13059-016-0997-x.

Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25: 1043–1055.

Peng Y, Leung HCM, Yiu SM, Chin FYL. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28: 1420–1428.

Varghese NJ, Mukherjee S, Ivanova N, Konstantinidis KT, Mavrommatis K, Kyrpides NC, et al. (2015). Microbial species delineation using whole genome sequences. Nucleic Acids Res 43: 6761–6771.
