JGI MICROBIAL ISOLATE ASSEMBLY IMPROVED QUALITY DRAFT QC AND ASSEMBLY REPORT: 05/07/18
-------------------------------------------------------------------------------------------------

Sequencing Project Id: 1189086
Sequencing Project Name: Caloramator sp. E03
Organism: Caloramator sp.

1) RAW DATA:
Library Name: COYBP
Raw Reads: 449,293
Read Type: Pacbio
Average Read length: 3,165.8 +/- 2,854.5

2) PACBIO READ FILTERING STATS:
Raw Reads: 449,293
Filtered SubReads: 284,615
Error Correct Reads: 11,044
Raw Reads > 5kbp: 73,243
Filtered SubReads > 5kbp: 12,990
Error Correct Reads > 5kbp: 3,199

3) NCBI SCREENING STATS
This step identifies potential contaminants screened by NCBI for submission.
** No potential contaminants found.

4) ASSEMBLY STATS:
Assembly stats of the HGAP assembly.

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3432  0.1561  0.1576  0.3431  0.0000  0.0000  0.0000  0.3138  0.0039

Main genome scaffold total:             8
Main genome contig total:               8
Main genome scaffold sequence total:    3024286
Main genome contig sequence total:      3024286  0.000% gap
Main genome scaffold N/L50:             2/731679
Main genome contig N/L50:               2/731679
Main genome scaffold N/L90:             6/224463
Main genome contig N/L90:               6/224463
Max scaffold length:                    901806
Max contig length:                      901806
Number of scaffolds > 50 KB:            8
% main genome in scaffolds > 50 KB:     100.00%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
All             8               8               3024286         3024286         100.00%
50000           8               8               3024286         3024286         100.00%
100000          7               7               2966196         2966196         100.00%
250000          5               5               2557788         2557788         100.00%
500000          2               2               1633485         1633485         100.00%

Mapped Coverage: 167.1X

Estimated Percent Genome Recovery using checkM:
archaea: 48.32%
bacteria: 99.04%
lineage workflow: 99.55%

5) ASSEMBLERS USED:
Assembler: HGAP
Assembler params: smrtanalysis/2.3.0_p5, HGAP 3

6) ASSESSED GENOME PROJECT STANDARD:
Improved High-Quality Draft

7) QC Report: FAIL - low library quality, too many problematic/unresolved regions - potential contamination (?)

Assembly size = 3.0 Mb; 8 contig(s) (0 circular)
Average mapped coverage: 167

Confirmed target genome:
NT:
COYBP_unitig_35|arrow  gi|219857456|ref|NR_025044.1|:Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Caloramator;Caloramator_viterbiensis  0.0  1483  99.393  183945  1480  N/A
COYBP_unitig_29|arrow  gi|219857456|ref|NR_025044.1|:Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Caloramator;Caloramator_viterbiensis  0.0  1483  99.326  385875  1480  N/A
COYBP_unitig_6|arrow  gi|219857456|ref|NR_025044.1|:Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Caloramator;Caloramator_viterbiensis  0.0  1426  99.229  250158  1480  N/A
COYBP_unitig_5|arrow  gi|219857456|ref|NR_025044.1|:Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Caloramator;Caloramator_viterbiensis  0.0  1485  99.125  731679  1480  N/A
COYBP_unitig_36|arrow  gi|848258874|gb|CP011803.1|:Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium_carboxidivorans  0.0  1361  90.301  288270  5732880  N/A
COYBP_unitig_33|arrow  gi|1114582714|gb|CP018335.1|:Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium_kluyveri  6.33e-158  403  92.804  901806  4454353  N/A
COYBP_unitig_30|arrow  gi|1230880998|gb|CP016893.1|:Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales;Thermoanaerobacterales_Family_III._Incertae_Sedis;Thermoanaerobacterium;Thermoanaerobacterium_thermosaccharolyticum  5.12e-49  170  90.000  224463  2895726  N/A
SILVA:
COYBP_unitig_5|arrow  tid|112729|SSU_HC491092.1.1481:Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Caloramator;Caloramator_viterbiensis  0.0  1486  99.125  731679  1481  N/A
COYBP_unitig_36|arrow  tid|1313|SSU_CVKP01000051.3319.4841:Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus_pneumoniae  5.63e-79  224  91.518  288270  1515  N/A
Collab_16S:
COYBP_unitig_29|arrow  1189086_Caloramator_sp__E03_175679_16s  0.0  1576  99.619  385875  1577  N/A
Megan: correct Family-Genus

Check for the additional (non-target) genome: FAIL
bi-modal gc-coverage distribution, bi-modal Megan peaks
single-copy: mean > 1, median = 1; antifam hits = 3

8) Repeats removal
1. Repeats are identified by running nucmer on the assembly produced by the HGAP assembler
   (nucmer -maxmatch --nosimplify -c 1000 assembly.fasta assembly.fasta; show-coords -r -c -l ).
2. The following sequences are identified as likely artifacts and moved to chaff.fasta:
   a) Contigs <40Kb, with >2/3 of the contig length >99% identical to the sequences on contigs >40Kb.
      The contig header in chaff.fasta is appended with '|dedup'.
   b) Terminal repeats (defined as repeats on the ends of the same contig, >99% identical).
      The shorter copy of the repeat is cut from the contig and moved to chaff.fasta.
      The sequence header is appended with '|term_repeat'.
   - If no artifacts were identified, chaff.fasta is empty and is not provided with the assembly release.
3. The assembly is re-polished with Arrow using the default settings.

9) Frequently Asked Questions
Q1: Why does my assembly have more than one contig?
A1: It is not unusual for PacBio Improved Draft assemblies to contain more than one contig.
    The JGI specification is 10 contigs or fewer, which represents a reasonable quality versus cost compromise.
    JGI also has a coverage specification which ensures we obtain sufficient data to produce a high quality PacBio assembly.
    As a result, adding additional data from existing libraries is not expected to improve the outcome.
    The most common reasons for PacBio microbe assemblies to contain more than one contig are:
    1. Genome contains multiple replicons; chromosomes and/or plasmids, therefore a single contig
       assembly is neither possible nor desirable.
    2. Genome contains very long repeats; longer than the PacBio read length from the library used
       and a better assembly can only be produced given better, that is longer, DNA.
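
The '|dedup' rule in section 8, step 2a, can be illustrated with a minimal Python sketch. It is not the JGI pipeline code: the Hit record, its field names, and the function names are hypothetical, and it assumes the nucmer self-alignment rows have already been parsed into such records. Only the three thresholds (40 Kb, 99% identity, 2/3 of contig length) come from the report.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One self-alignment row (hypothetical record, not show-coords column names)."""
    query: str       # contig being tested as a possible duplicate
    q_start: int     # 1-based inclusive coordinates on the query contig
    q_end: int
    q_len: int       # full length of the query contig
    ref: str         # contig the query aligns to
    ref_len: int     # full length of the reference contig
    identity: float  # percent identity of the alignment


SIZE_CUTOFF = 40_000   # "Contigs <40Kb" vs "contigs >40Kb"
MIN_IDENTITY = 99.0    # ">99% identical"
MIN_FRACTION = 2 / 3   # ">2/3 of the contig length"


def covered_bases(intervals):
    """Return the length of the union of 1-based inclusive intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Disjoint from the current run: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def dedup_candidates(hits):
    """Return names of short contigs that meet the '|dedup' thresholds."""
    spans = defaultdict(list)
    lengths = {}
    for h in hits:
        if h.query == h.ref:
            continue  # self-hits say nothing about duplication elsewhere
        if h.q_len >= SIZE_CUTOFF or h.ref_len <= SIZE_CUTOFF:
            continue  # query must be short, reference must be long
        if h.identity <= MIN_IDENTITY:
            continue
        lo, hi = sorted((h.q_start, h.q_end))  # reverse-strand hits list end first
        spans[h.query].append((lo, hi))
        lengths[h.query] = h.q_len
    return sorted(
        name for name, ivs in spans.items()
        if covered_bases(ivs) > MIN_FRACTION * lengths[name]
    )
```

Overlapping alignments are merged before the covered fraction is computed, so a region matched twice is not counted twice. Whether the actual pipeline merges hits this way is not stated in the report.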

10) DOE AUSPICE STATEMENT FOR PUBLICATION
The work conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported under Contract No. DE-AC02-05CH11231.
The data was generated for JGI Proposal #503161.

11) NOTES
If you are parsing this file to get metrics, please contact your JGI Project Manager to let them know what you need.
This report format is subject to change without notice.
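
As a cross-check on the N/L50 and N/L90 lines in section 4, the following Python sketch recomputes them from contig lengths. Two assumptions are made here and are not stated in the report: seven of the lengths are taken from the fourth numeric column of the hit table in section 7, which appears to be the contig length, and the eighth (58090) is inferred by subtracting their sum from the 3024286 total. The function name is hypothetical. Note that this report writes the pair as count/length.

```python
def n_l_stat(lengths, fraction):
    """Return (count, length) for the given fraction of the assembly.

    count is the smallest number of longest contigs whose summed length
    reaches fraction * total; length is the shortest contig in that set.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    raise ValueError("no contig lengths given")


# Seven lengths read from the section 7 hit table; 58090 is inferred
# (3024286 minus the sum of the other seven), not reported directly.
CONTIG_LENGTHS = [901806, 731679, 385875, 288270, 250158, 224463, 183945, 58090]

print("N/L50: %d/%d" % n_l_stat(CONTIG_LENGTHS, 0.5))
print("N/L90: %d/%d" % n_l_stat(CONTIG_LENGTHS, 0.9))
```

Under these assumptions the sketch reproduces the reported values 2/731679 and 6/224463.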