Merge pull request #94 from bacterial-genomics/dev-store-downsample-data

Dev store downsample data
bacterial-genomics · Nov 15, 2024 · 6231bc5
2 parents fb05266 + 0a877b4

Showing 81 changed files with 4,756 additions and 1,651 deletions.
diff --git a/.devcontainer/devcontainer.json b/.devcontainer/devcontainer.json
@@ -21,7 +21,11 @@
             },
 
             // Add the IDs of extensions you want installed when the container is created.
-            "extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"]
+            "extensions": [
+                "ms-python.python",
+                "ms-python.vscode-pylance",
+                "nf-core.nf-core-extensionpack"
+            ]
         }
     }
 }
diff --git a/.nf-core.yml b/.nf-core.yml
@@ -44,7 +44,6 @@ repository_type: pipeline
 
 template:
   name: bacterial-genomics/wf-paired-end-illumina-assembly
-
-update:
-  https://github.com/nf-core/modules.git:
-    nf-core: False
+# update:
+#   https://github.com/nf-core/modules.git:
+#     nf-core: False
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,65 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## v3.0.0 - November 15, 2024
+
+### `Added`
+
+- Consistent metrics reported for each read cleaning step (@chrisgulvik)
+- Added SeqFu for FastQ format validation (@chrisgulvik)
+- Checksum (SHA-512) reporting of intermediate and output files (@chrisgulvik)
+- Report full input paths for each sample (@chrisgulvik)
+- For assembly depth reporting, added stdev depth metrics; added total paired+single mapped stats (@chrisgulvik)
+
+### `Changed`
+
+- Default uses SeqKit rather than SeqTk for downsampling (@chrisgulvik)
+- Output structure and filenames revised (@chrisgulvik)
+- For MLST, exclude by default all MLST databases named \*\_<int> where the integer is > 1 (e.g., excludes leptospira_2 and leptospira_3). This ensures the original MLST database version is used for each taxon and avoids inconsistent versions within a run, which would occasionally assign one sample leptospira and a different sample leptospira_3, making it impossible to immediately compare between samples. (@chrisgulvik)
+- For MLST, store novel FastA when that situation occurs (@chrisgulvik)
+- Sample name in outputs and file content no longer contains assembler name (@chrisgulvik)
+- Changed RDP output to exclude unnecessary data columns such as "Phylum\nphylum", "Genus\ngenus" (@chrisgulvik)
+- Use both R1 and R2 and only Phred30 for estimate bp input for more accurate estimation of genome size (@chrisgulvik)
+- Stats and FastA of contigs discarded during biopython filtering are now stored by default (always on) (@chrisgulvik)
+- Output filenames within `pipeline_info/` changed to show month by name and include day of the week (@chrisgulvik)
+
+### `Fixed`
+
+- Order of operations in Trimmomatic process now ensures final output reads have minimum sequence length (default: 50 bp) (@chrisgulvik)
+- Fixed issue with missing column header names in the .kraken_summary.tsv output files (@chrisgulvik)
+- Fixed trailing tab character in Kraken1 and Kraken2 output TSV summaries, which made pandas XLSX conversion fail due to different column numbers in header and data (@chrisgulvik)
+- Fixed VERSION reporting RDP bug by removing spaces (@chrisgulvik)
+
+### `Updated`
+
+- Coloring of workflow process now corresponds to tab color in XLSX output summary sheet (@chrisgulvik)
+- Docker container version updates (@chrisgulvik)
+- Updated description on output files based on new files created as well as some renamed output files (@chrisgulvik)
+
+### `Deprecated`
+
+- Removed gene calling from QUAST output summary (@chrisgulvik)
+
+## v2.4.0 - August 28, 2024
+
+### `Added`
+
+- Statistics output files during the downsampling routine are all stored as output files under their CleanedReads/<package name>/ (@chrisgulvik)
+- FastQ outputs for individual steps are now saved (perhaps temporary and to a non-default option) (@chrisgulvik)
+- SeqKit downsampling option (already had Seqtk) (@chrisgulvik)
+- Use of a default seed value for both SeqKit and Seqtk subsampling (@chrisgulvik)
+
+### `Fixed`
+
+- RDP Classifier always failed once before succeeding on retry, so a higher RAM request was given as label to make it succeed on first attempt for speed (@chrisgulvik)
+- RDP Classifier data output is now tab-delimited instead of space-delimited (@chrisgulvik)
+
+### `Updated`
+
+- TSV output data files have header names with no spaces; underscores replaced all of them (@chrisgulvik)
+
+### `Deprecated`
+
 ## v2.3.1 - August 23, 2024
 
 ### `Added`
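
Review note on the CHANGELOG hunk above: the v3.0.0 MLST entry describes excluding databases named `*_<int>` when the integer is above 1. A minimal Python sketch of that naming rule follows. It is an illustration only, not the pipeline's implementation: the function names and the regular expression are assumptions, and only `leptospira`, `leptospira_2`, and `leptospira_3` come from the changelog text.

```python
import re

# Assumed pattern: a scheme name ending in an underscore followed by digits.
_NUMBERED_SUFFIX = re.compile(r"_(\d+)$")


def is_excluded(scheme: str) -> bool:
    """Return True when the name ends in _<int> and that integer is > 1."""
    match = _NUMBERED_SUFFIX.search(scheme)
    return match is not None and int(match.group(1)) > 1


def split_schemes(schemes):
    """Partition scheme names into (kept, excluded) lists, preserving order."""
    kept = [s for s in schemes if not is_excluded(s)]
    excluded = [s for s in schemes if is_excluded(s)]
    return kept, excluded


if __name__ == "__main__":
    kept, excluded = split_schemes(["leptospira", "leptospira_2", "leptospira_3"])
    print("kept:", kept)
    print("excluded:", excluded)
```

Under this reading of "(> 1)", a name ending in `_1` would be kept; whether the pipeline treats that case the same way is not stated in the changelog.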

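Review note on the v3.0.0 `Fixed` entry about trailing tabs in the Kraken summaries: a data row that ends in a tab splits into one more field than its header, which is the kind of column-count mismatch the changelog describes. The sketch below illustrates this with the Python standard library. The helper names and the sample rows are hypothetical and are not taken from the actual `.kraken_summary.tsv` files.

```python
def field_counts(lines):
    """Return the set of per-line field counts for tab-delimited lines."""
    return {len(line.rstrip("\n").split("\t")) for line in lines}


def strip_trailing_tabs(lines):
    """Drop tab characters at the end of each line (newline removed too).

    Caveat: this also removes genuinely empty trailing fields, so it only
    suits files where the last column is never blank.
    """
    return [line.rstrip("\n").rstrip("\t") for line in lines]


if __name__ == "__main__":
    # Hypothetical rows: the header has 3 fields, the data row ends in a tab.
    raw = ["Sample\tReads\tPercent\n", "s1\t100\t95.0\t\n"]
    print(field_counts(raw))                       # header and data disagree
    print(field_counts(strip_trailing_tabs(raw)))  # counts agree after stripping
```

Checking that `field_counts` returns a single value is a cheap guard to run before handing a TSV to a spreadsheet converter.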
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,80 +10,133 @@
 
 ## Pipeline tools
 
-- [SPAdes](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519/)
+- [AWK](https://www.thriftbooks.com/w/the-awk-programming-language_brian-w-kernighan_alfred-v-aho/254416/?resultid=8ae6bc1c-db73-4eb0-b297-9d28a08ee38f#edition=2350735&idiq=2920572)
 
-  > Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. May 2012;19(5):455-77. doi:10.1089/cmb.2012.0021
+  > Aho AV, Kernighan BW, Weinberger PJ. The AWK programming language. 224 pp. Pearson. ISBN-13: 978-0201079814
 
-- [Trimmomatic](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/)
+- [Bakta](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8743544/)
 
-  > Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. Aug 1 2014;30(15):2114-20. doi:10.1093/bioinformatics/btu170
+  > Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom. Nov 2021;7(11):000685. doi: 10.1099/mgen.0.000685
+
+- [BEDTools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4213956/)
+
+  > Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. Sep 8 2014;47:11.12.1-34. doi:10.1002/0471250953.bi1112s47
 
 - [BLAST+](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421)
 
   > Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009/12/15 2009;10(1):421. doi:10.1186/1471-2105-10-421
 
-- [GTDB-Tk](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710552/)
+- [BUSCO](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8476166/)
 
-  > Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics. Nov 30 2022;38(23):5315-5316. doi:10.1093/bioinformatics/btac672
+  > Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. Sep 27 2021;38(10):4647-4654. doi:10.1093/molbev/msab199
 
-- [BioPython](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/)
+- [BWA-MEM](https://arxiv.org/abs/1303.3997)
 
-  > Cock PJ, Antao T, Chang JT, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. Jun 1 2009;25(11):1422-3. doi:10.1093/bioinformatics/btp163
+  > Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. May 26 2013; arXiv:1303.3997. doi:10.48550/arXiv.1303.3997
 
-- [QUAST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624806/)
+- [BioPython](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/)
 
-  > Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. Apr 15 2013;29(8):1072-5. doi:10.1093/bioinformatics/btt086
+  > Cock PJ, Antao T, Chang JT, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. Jun 1 2009;25(11):1422-3. doi:10.1093/bioinformatics/btp163
 
-- [PubMLST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6192448/)
+- [CAT](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6805573/)
 
-  > Jolley KA, Bray JE, Maiden MCJ. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res. 2018;3:124. doi:10.12688/wellcomeopenres.14826.1
+  > von Meijenfeldt FAB, Arkhipova K, Cambuy DD, Coutinho FH, Dutilh BE. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biol. Oct 22 2019;20(1):217. doi: 10.1186/s13059-019-1817-x
 
-- [RNAmmer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1888812/)
+- [CheckM2](https://pubmed.ncbi.nlm.nih.gov/37500759/)
 
-  > Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8. doi:10.1093/nar/gkm160
+  > Chklovski A, Parks DH, Woodcroft BJ, Tyson GW. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods. Aug 2023;20(8):1203-1212. doi: 10.1038/s41592-023-01940-w
 
-- [SAMtools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/)
+- [Fastp](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6129281/)
 
-  > Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. Aug 15 2009;25(16):2078-9. doi:10.1093/bioinformatics/btp352
+  > Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. Sep 1 2018;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560
 
-- [BWA-MEM](https://arxiv.org/abs/1303.3997)
+- [Fastp latest](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10989850/)
 
-  > Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. May 26 2013; arXiv:1303.3997. doi:10.48550/arXiv.1303.3997
+  > Chen S. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. Imeta. May 8 2023;2(2):e107. doi: 10.1002/imt2.107
 
 - [FLASH](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198573/)
 
   > Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. Nov 1 2011;27(21):2957-63. doi:10.1093/bioinformatics/btr507
 
-- [BUSCO](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8476166/)
+- [GTDB-Tk](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710552/)
 
-  > Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. Sep 27 2021;38(10):4647-4654. doi:10.1093/molbev/msab199
+  > Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics. Nov 30 2022;38(23):5315-5316. doi:10.1093/bioinformatics/btac672
 
-- [BEDTools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4213956/)
+- [Hostile](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10749771/)
 
-  > Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. Sep 8 2014;47:11.12.1-34. doi:10.1002/0471250953.bi1112s47
+  > Constantinides B, Hunt M, Crook DW. Hostile: accurate decontamination of microbial host sequences. Bioinformatics. Dec 1 2023;39(12):btad728. doi: 10.1093/bioinformatics/btad728
+
+- [Kraken](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/)
+
+  > Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology. 2014/03/03 2014;15(3):R46. doi:10.1186/gb-2014-15-3-r46
+
+- [Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0)
+
+  > Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biology. 2019/11/28 2019;20(1):257. doi:10.1186/s13059-019-1891-0
 
 - [mlst](https://github.com/tseemann/mlst)
 
   > Seemann T. mlst. GitHub: <https://github.com/tseemann/mlst>
 
+- [Pilon](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237348/)
+
+  > Walker BJ, Abeel T, Shea T, et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLOS ONE. 2014;9(11):e112963. doi:10.1371/journal.pone.0112963
+
+- [PubMLST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6192448/)
+
+  > Jolley KA, Bray JE, Maiden MCJ. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res. 2018;3:124. doi:10.12688/wellcomeopenres.14826.1
+
 - [Prokka](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517)
 
   > Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. Jul 15 2014;30(14):2068-9. doi:10.1093/bioinformatics/btu153
 
+- [QUAST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624806/)
+
+  > Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. Apr 15 2013;29(8):1072-5. doi:10.1093/bioinformatics/btt086
+
+- [QUAST latest](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022658/)
+
+  > Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics. Jul 1 2018;34(13):i142-i150. doi: 10.1093/bioinformatics/bty266
+
+- [RDP Classifier](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11008197/)
+
+  > Wang Q, Cole JR. Updated RDP taxonomy and RDP Classifier for more accurate taxonomic classification. Microbiol Resour Announc. Apr 11 2024;13(4):e0106323. doi: 10.1128/mra.01063-23
+
+- [RNAmmer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1888812/)
+
+  > Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8. doi:10.1093/nar/gkm160
+
+- [SAMtools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/)
+
+  > Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. Aug 15 2009;25(16):2078-9. doi:10.1093/bioinformatics/btp352
+
+- [SeqFu](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8148589/)
+
+  > Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering (Basel). May 7 2021;8(5):59. doi: 10.3390/bioengineering8050059
+
+- [SeqKit](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051824/)
+
+  > Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. Oct 5 2016;11(10):e0163962. doi: 10.1371/journal.pone.0163962
+
+- [SeqKit latest](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11183193/)
+
+  > Shen W, Sipos B, Zhao L. SeqKit2: A Swiss army knife for sequence and alignment processing. Imeta. Apr 5 2024;3(3):e191. doi: 10.1002/imt2.191
+
 - [SKESA](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1540-z)
 
   > Souvorov A, Agarwala R, Lipman DJ. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology. 2018/10/04 2018;19(1):153. doi:10.1186/s13059-018-1540-z
 
-- [Pilon](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237348/)
+- [SPAdes](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519/)
 
-  > Walker BJ, Abeel T, Shea T, et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLOS ONE. 2014;9(11):e112963. doi:10.1371/journal.pone.0112963
+  > Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. May 2012;19(5):455-77. doi:10.1089/cmb.2012.0021
 
-- [Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0)
+- [SPAdes latest](https://pubmed.ncbi.nlm.nih.gov/32559359/)
 
-  > Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biology. 2019/11/28 2019;20(1):257. doi:10.1186/s13059-019-1891-0
+  > Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics. Jun 2020;70(1):e102. doi: 10.1002/cpbi.102
 
-- [Kraken](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/)
-  > Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology. 2014/03/03 2014;15(3):R46. doi:10.1186/gb-2014-15-3-r46
+- [Trimmomatic](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/)
+
+  > Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. Aug 1 2014;30(15):2114-20. doi:10.1093/bioinformatics/btu170
 
 ## Software packaging/containerisation tools
 

diff --git a/README.md b/README.md
@@ -181,8 +181,7 @@ PhiX reference [NC_001422.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_001422.1) c
                         [Default: NaN]
 ```
 
-> [!NOTE]
-> _If user does not specify inputs for parameters with a default set to `NaN`, these options will not be performed during workflow analysis._
+> [!NOTE] > _If user does not specify inputs for parameters with a default set to `NaN`, these options will not be performed during workflow analysis._
 
 ### Additional parameters
 
@@ -201,9 +200,9 @@ nextflow run \
 The most well-tested and supported is a Univa Grid Engine (UGE) job scheduler with Singularity for dependency handling.
 
 1. UGE/SGE
-    - Additional tips for UGE processing are [here](docs/HPC-UGE-scheduler.md).
+   - Additional tips for UGE processing are [here](docs/HPC-UGE-scheduler.md).
 2. No Scheduler
-    - It has also been confirmed to work on desktop and laptop environments without a job scheduler using Docker with more tips [here](docs/local-device.md).
+   - It has also been confirmed to work on desktop and laptop environments without a job scheduler using Docker with more tips [here](docs/local-device.md).
 
 ## Output