Merge pull request #94 from bacterial-genomics/dev-store-downsample-data
Dev store downsample data
chrisgulvik authored Nov 15, 2024
2 parents fb05266 + 0a877b4 commit 6231bc5
Showing 81 changed files with 4,756 additions and 1,651 deletions.
6 changes: 5 additions & 1 deletion .devcontainer/devcontainer.json
@@ -21,7 +21,11 @@
},

// Add the IDs of extensions you want installed when the container is created.
"extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"]
"extensions": [
"ms-python.python",
"ms-python.vscode-pylance",
"nf-core.nf-core-extensionpack"
]
}
}
}
7 changes: 3 additions & 4 deletions .nf-core.yml
@@ -44,7 +44,6 @@ repository_type: pipeline

template:
name: bacterial-genomics/wf-paired-end-illumina-assembly

update:
https://github.com/nf-core/modules.git:
nf-core: False
# update:
# https://github.com/nf-core/modules.git:
# nf-core: False
59 changes: 59 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,65 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v3.0.0 - November 15, 2024

### `Added`

- Consistent metrics reported for each read cleaning step (@chrisgulvik)
- Added SeqFu for FastQ format validation (@chrisgulvik)
- Checksum (SHA-512) reporting of intermediate and output files (@chrisgulvik)
- Report full input paths for each sample (@chrisgulvik)
- For assembly depth reporting, added stdev depth metrics; added total paired+single mapped stats (@chrisgulvik)
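
The SHA-512 checksum reporting listed above can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function names `sha512_of_file` and `checksum_rows` and the 1 MiB chunk size are assumptions made only for illustration.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for large FastQ/FastA files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_rows(paths):
    """Pair each file name with its digest, e.g. for writing a TSV summary."""
    return [(Path(p).name, sha512_of_file(p)) for p in paths]
```

Each digest is 128 hexadecimal characters, so two runs can be compared file by file without re-reading the sequence data.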

### `Changed`

- Default uses SeqKit rather than SeqTk for downsampling (@chrisgulvik)
- Output structure and filenames revised (@chrisgulvik)
- For MLST, exclude by default all MLST databases named \*\_<int> where <int> is greater than 1 (e.g., leptospira_2 and leptospira_3), so that the original MLST database version is used for each taxon. This avoids inconsistent versions within a run, where one sample would occasionally be assigned leptospira and another leptospira_3, making the samples impossible to compare immediately. (@chrisgulvik)
- For MLST, store novel FastA when that situation occurs (@chrisgulvik)
- Sample name in outputs and file content no longer contains assembler name (@chrisgulvik)
- Changed RDP output to exclude unnecessary data columns such as "Phylum\nphylum", "Genus\ngenus" (@chrisgulvik)
- Use both R1 and R2, and only Phred30, as input for the bp estimate to give a more accurate estimation of genome size (@chrisgulvik)
- Stats and FastA of contigs discarded during biopython filtering are now always stored by default (@chrisgulvik)
- Output filenames within `pipeline_info/` changed to show month by name and include day of the week (@chrisgulvik)
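
The MLST exclusion rule described above can be sketched as a simple name filter. This assumes the rule means "names ending in an underscore followed by an integer greater than 1"; the function name `schemes_to_exclude` and the scheme name `example_1` are hypothetical, and the pipeline's actual implementation may differ.

```python
import re

# Matches a trailing "_<int>" suffix such as the "_2" in "leptospira_2".
_NUMBERED_SUFFIX = re.compile(r"_(\d+)$")


def schemes_to_exclude(scheme_names):
    """Return scheme names whose trailing integer suffix is greater than 1."""
    excluded = []
    for name in scheme_names:
        match = _NUMBERED_SUFFIX.search(name)
        if match and int(match.group(1)) > 1:
            excluded.append(name)
    return excluded


names = ["leptospira", "leptospira_2", "leptospira_3", "example_1"]
print(schemes_to_exclude(names))
```

With this rule every sample of a taxon is typed against the same unnumbered database, which is what makes results comparable across samples in a run.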

### `Fixed`

- Order of operations in Trimmomatic process now ensures final output reads have minimum sequence length (default: 50 bp) (@chrisgulvik)
- Fixed issue with missing column header names in the .kraken_summary.tsv output files (@chrisgulvik)
- Fixed trailing tab character in Kraken1 and Kraken2 output TSV summaries, which made pandas XLSX conversion fail due to different column numbers in header and data (@chrisgulvik)
- Fixed VERSION reporting RDP bug by removing spaces (@chrisgulvik)
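
The trailing-tab problem in the Kraken summaries comes down to the header and data rows splitting into different numbers of columns. The sketch below reproduces the mismatch and one way to remove it using only the standard library; the helper names and sample rows are made up for illustration and are not taken from the pipeline.

```python
def strip_trailing_tabs(lines):
    """Drop line endings and any trailing tab characters from each line.

    Caveat: this also removes genuinely empty trailing fields, so it only
    suits files where the last column is always populated.
    """
    return [line.rstrip("\r\n").rstrip("\t") for line in lines]


def column_counts(lines):
    """Return the set of distinct tab-separated column counts."""
    return {len(line.split("\t")) for line in lines if line}


raw = ["Sample\tReads\n", "s1\t100\t\n"]  # data row ends with a stray tab
print(column_counts(raw))                       # two different counts
print(column_counts(strip_trailing_tabs(raw)))  # a single count
```

A table reader that expects a rectangular table sees one consistent column count once the stray tab is gone.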

### `Updated`

- Coloring of workflow process now corresponds to tab color in XLSX output summary sheet (@chrisgulvik)
- Docker container version updates (@chrisgulvik)
- Updated description on output files based on new files created as well as some renamed output files (@chrisgulvik)

### `Deprecated`

- Removed gene calling from QUAST output summary (@chrisgulvik)

## v2.4.0 - August 28, 2024

### `Added`

- Statistics files produced during the downsampling routine are all stored as output files under their respective CleanedReads/<package name>/ directories (@chrisgulvik)
- FastQ outputs for individual steps are now saved (perhaps temporary and to a non-default option) (@chrisgulvik)
- SeqKit downsampling option (already had Seqtk) (@chrisgulvik)
- Use of a default seed value for both SeqKit and Seqtk subsampling (@chrisgulvik)
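
A fixed seed matters for subsampling because it makes the selected reads reproducible between runs. The sketch below shows the idea with Python's standard library only; it is not how SeqKit or Seqtk are implemented, and the function name `subsample_pairs` and the seed value 42 are hypothetical rather than the pipeline's defaults.

```python
import random


def subsample_pairs(read_pairs, fraction, seed=42):
    """Keep each read pair with probability `fraction`, reproducibly.

    One random draw is made per pair, so R1 and R2 stay in sync.
    """
    rng = random.Random(seed)  # private generator; global state is untouched
    return [pair for pair in read_pairs if rng.random() < fraction]


pairs = [(f"read{i}/1", f"read{i}/2") for i in range(10)]
first = subsample_pairs(pairs, fraction=0.5)
second = subsample_pairs(pairs, fraction=0.5)
print(first == second)  # same seed, same selection
```

Changing the seed changes which pairs are kept, while repeating a run with the same seed and input order reproduces the same subset.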

### `Fixed`

- RDP Classifier always failed once before succeeding on retry, so a higher RAM request was given as label to make it succeed on first attempt for speed (@chrisgulvik)
- RDP Classifier data output are now tab-delimited instead of space-delimited (@chrisgulvik)

### `Updated`

- TSV output data files now have header names with no spaces; underscores replace all of them (@chrisgulvik)

### `Deprecated`

## v2.3.1 - August 23, 2024

### `Added`
109 changes: 81 additions & 28 deletions CITATIONS.md
@@ -10,80 +10,133 @@
## Pipeline tools

- [SPAdes](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519/)
- [AWK](https://www.thriftbooks.com/w/the-awk-programming-language_brian-w-kernighan_alfred-v-aho/254416/?resultid=8ae6bc1c-db73-4eb0-b297-9d28a08ee38f#edition=2350735&idiq=2920572)

> Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. May 2012;19(5):455-77. doi:10.1089/cmb.2012.0021
> Aho AV, Kernighan BW, Weinberger PJ. The AWK programming language. 224 pp. Pearson. ISBN-13: 978-0201079814
- [Trimmomatic](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/)
- [Bakta](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8743544/)

> Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. Aug 1 2014;30(15):2114-20. doi:10.1093/bioinformatics/btu170
> Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom. Nov 2021;7(11):000685. doi: 10.1099/mgen.0.000685
- [BEDTools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4213956/)

> Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. Sep 8 2014;47:11.12.1-34. doi:10.1002/0471250953.bi1112s47
- [BLAST+](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421)

> Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009/12/15 2009;10(1):421. doi:10.1186/1471-2105-10-421
- [GTDB-Tk](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710552/)
- [BUSCO](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8476166/)

> Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics. Nov 30 2022;38(23):5315-5316. doi:10.1093/bioinformatics/btac672
> Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. Sep 27 2021;38(10):4647-4654. doi:10.1093/molbev/msab199
- [BioPython](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/)
- [BWA-MEM](https://arxiv.org/abs/1303.3997)

> Cock PJ, Antao T, Chang JT, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. Jun 1 2009;25(11):1422-3. doi:10.1093/bioinformatics/btp163
> Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. May 26 2013; arXiv:1303.3997. doi:10.48550/arXiv.1303.3997
- [QUAST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624806/)
- [BioPython](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/)

> Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. Apr 15 2013;29(8):1072-5. doi:10.1093/bioinformatics/btt086
> Cock PJ, Antao T, Chang JT, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. Jun 1 2009;25(11):1422-3. doi:10.1093/bioinformatics/btp163
- [PubMLST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6192448/)
- [CAT](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6805573/)

> Jolley KA, Bray JE, Maiden MCJ. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res. 2018;3:124. doi:10.12688/wellcomeopenres.14826.1
> von Meijenfeldt FAB, Arkhipova K, Cambuy DD, Coutinho FH, Dutilh BE. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biol. Oct 22 2019;20(1):217. doi: 10.1186/s13059-019-1817-x
- [RNAmmer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1888812/)
- [CheckM2](https://pubmed.ncbi.nlm.nih.gov/37500759/)

> Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8. doi:10.1093/nar/gkm160
> Chklovski A, Parks DH, Woodcroft BJ, Tyson GW. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods. Aug 2023;20(8):1203-1212. doi: 10.1038/s41592-023-01940-w
- [SAMtools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/)
- [Fastp](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6129281/)

> Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. Aug 15 2009;25(16):2078-9. doi:10.1093/bioinformatics/btp352
> Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. Sep 1 2018;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560
- [BWA-MEM](https://arxiv.org/abs/1303.3997)
- [Fastp latest](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10989850/)

> Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. May 26 2013; arXiv:1303.3997. doi:10.48550/arXiv.1303.3997
> Chen S. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. Imeta. May 8 2023;2(2):e107. doi: 10.1002/imt2.107
- [FLASH](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198573/)

> Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. Nov 1 2011;27(21):2957-63. doi:10.1093/bioinformatics/btr507
- [BUSCO](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8476166/)
- [GTDB-Tk](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710552/)

> Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. Sep 27 2021;38(10):4647-4654. doi:10.1093/molbev/msab199
> Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics. Nov 30 2022;38(23):5315-5316. doi:10.1093/bioinformatics/btac672
- [BEDTools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4213956/)
- [Hostile](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10749771/)

> Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. Sep 8 2014;47:11.12.1-34. doi:10.1002/0471250953.bi1112s47
> Constantinides B, Hunt M, Crook DW. Hostile: accurate decontamination of microbial host sequences. Bioinformatics. Dec 1 2023;39(12):btad728. doi: 10.1093/bioinformatics/btad728
- [Kraken](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/)

> Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology. 2014/03/03 2014;15(3):R46. doi:10.1186/gb-2014-15-3-r46
- [Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0)

> Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biology. 2019/11/28 2019;20(1):257. doi:10.1186/s13059-019-1891-0
- [mlst](https://github.com/tseemann/mlst)

> Seemann T. mlst. Github: <https://github.com/tseemann/mlst>
- [Pilon](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237348/)

> Walker BJ, Abeel T, Shea T, et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLOS ONE. 2014;9(11):e112963. doi:10.1371/journal.pone.0112963
- [PubMLST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6192448/)

> Jolley KA, Bray JE, Maiden MCJ. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res. 2018;3:124. doi:10.12688/wellcomeopenres.14826.1
- [Prokka](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517)

> Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. Jul 15 2014;30(14):2068-9. doi:10.1093/bioinformatics/btu153
- [QUAST](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624806/)

> Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. Apr 15 2013;29(8):1072-5. doi:10.1093/bioinformatics/btt086
- [QUAST latest](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022658/)

> Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics. Jul 1 2018;34(13):i142-i150. doi: 10.1093/bioinformatics/bty266
- [RDP Classifier](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11008197/)

> Wang Q, Cole JR. Updated RDP taxonomy and RDP Classifier for more accurate taxonomic classification. Microbiol Resour Announc. Apr 11 2024;13(4):e0106323. doi: 10.1128/mra.01063-23
- [RNAmmer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1888812/)

> Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-8. doi:10.1093/nar/gkm160
- [SAMtools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/)

> Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. Aug 15 2009;25(16):2078-9. doi:10.1093/bioinformatics/btp352
- [SeqFu](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8148589/)

> Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering (Basel). May 7 2021;8(5):59. doi: 10.3390/bioengineering8050059
- [SeqKit](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051824/)

> Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. Oct 5 2016;11(10):e0163962. doi: 10.1371/journal.pone.0163962
- [SeqKit latest](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11183193/)

> Shen W, Sipos B, Zhao L. SeqKit2: A Swiss army knife for sequence and alignment processing. Imeta. Apr 5 2024;3(3):e191. doi: 10.1002/imt2.191
- [SKESA](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1540-z)

> Souvorov A, Agarwala R, Lipman DJ. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology. 2018/10/04 2018;19(1):153. doi:10.1186/s13059-018-1540-z
- [Pilon](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237348/)
- [SPAdes](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519/)

> Walker BJ, Abeel T, Shea T, et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLOS ONE. 2014;9(11):e112963. doi:10.1371/journal.pone.0112963
> Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. May 2012;19(5):455-77. doi:10.1089/cmb.2012.0021
- [Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0)
- [SPAdes latest](https://pubmed.ncbi.nlm.nih.gov/32559359/)

> Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biology. 2019/11/28 2019;20(1):257. doi:10.1186/s13059-019-1891-0
> Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics. Jun 2020;70(1):e102. doi: 10.1002/cpbi.102
- [Kraken](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/)
> Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology. 2014/03/03 2014;15(3):R46. doi:10.1186/gb-2014-15-3-r46
- [Trimmomatic](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/)

> Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. Aug 1 2014;30(15):2114-20. doi:10.1093/bioinformatics/btu170
## Software packaging/containerisation tools

7 changes: 3 additions & 4 deletions README.md
@@ -181,8 +181,7 @@ PhiX reference [NC_001422.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_001422.1) c
[Default: NaN]
```

> [!NOTE]
> _If user does not specify inputs for parameters with a default set to `NaN`, these options will not be performed during workflow analysis._
> [!NOTE] > _If user does not specify inputs for parameters with a default set to `NaN`, these options will not be performed during workflow analysis._
### Additional parameters

@@ -201,9 +200,9 @@ nextflow run \
The most well-tested and supported is a Univa Grid Engine (UGE) job scheduler with Singularity for dependency handling.

1. UGE/SGE
- Additional tips for UGE processing are [here](docs/HPC-UGE-scheduler.md).
- Additional tips for UGE processing are [here](docs/HPC-UGE-scheduler.md).
2. No Scheduler
- It has also been confirmed to work on desktop and laptop environments without a job scheduler using Docker with more tips [here](docs/local-device.md).
- It has also been confirmed to work on desktop and laptop environments without a job scheduler using Docker with more tips [here](docs/local-device.md).

## Output
