The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Consistent metrics reported for each read cleaning step (@chrisgulvik)
- Added SeqFu for FastQ format validation (@chrisgulvik)
- Checksum (SHA-512) reporting of intermediate and output files (@chrisgulvik)
- Report full input paths for each sample (@chrisgulvik)
- For assembly depth reporting, added stdev depth metrics; added total paired+single mapped stats (@chrisgulvik)
- Default uses SeqKit rather than SeqTk for downsampling (@chrisgulvik)
- Output structure and filenames revised (@chrisgulvik)
- For MLST, exclude by default all MLST databases named with a *_ numeric suffix (> 1), e.g., leptospira_2 and leptospira_3, so that the original MLST database version is used for each taxon. This avoids inconsistent versions within a run, which would occasionally assign one sample leptospira and a different sample leptospira_3, making it impossible to compare samples directly. (@chrisgulvik)
- For MLST, store novel FastA when that situation occurs (@chrisgulvik)
- Sample name in outputs and file content no longer contains assembler name (@chrisgulvik)
- Changed RDP output to exclude unnecessary data columns such as "Phylum\nphylum", "Genus\ngenus" (@chrisgulvik)
- Use both R1 and R2, and only Phred30 bases, as input for estimating bp, giving a more accurate estimate of genome size (@chrisgulvik)
- Storing stats and FastA of discarded contigs during biopython filtering is now always on by default (@chrisgulvik)
- Output filenames within `pipeline_info/` changed to show month by name and include day of the week (@chrisgulvik)
- Order of operations in Trimmomatic process now ensures final output reads have minimum sequence length (default: 50 bp) (@chrisgulvik)
- Fixed issue with missing column header names in the .kraken_summary.tsv output files (@chrisgulvik)
- Fixed trailing tab character in Kraken1 and Kraken2 output TSV summaries, which made pandas XLSX conversion fail due to different column numbers in header and data (@chrisgulvik)
- Fixed VERSION reporting RDP bug by removing spaces (@chrisgulvik)
- Coloring of workflow process now corresponds to tab color in XLSX output summary sheet (@chrisgulvik)
- Docker container version updates (@chrisgulvik)
- Updated description on output files based on new files created as well as some renamed output files (@chrisgulvik)
- Removed gene calling from QUAST output summary (@chrisgulvik)
- Statistics output files during the downsampling routine are all stored as output files under their CleanedReads// (@chrisgulvik)
- FastQ outputs for individual steps are now saved (perhaps temporary and to a non-default option) (@chrisgulvik)
- SeqKit downsampling option (already had Seqtk) (@chrisgulvik)
- Use of a default seed value for both SeqKit and Seqtk subsampling (@chrisgulvik)
- RDP Classifier always failed once before succeeding on retry, so a higher RAM request was given as label to make it succeed on first attempt for speed (@chrisgulvik)
- RDP Classifier data output is now tab-delimited instead of space-delimited (@chrisgulvik)
- TSV output data files have header names with no spaces; underscores replace all of them (@chrisgulvik)
- #92 Refactored profiles to align with Seqera Tower requirements and added a Tower yaml for reporting (@slsevilla).
- #88 Resolved issue where SPAdes fails when a non-integer is used with the memory parameter (@gregorysprenger).
- #90 Add SKESA parameters to main params.config file to avoid `null` values (@gregorysprenger).
- #82 Added CAT and CheckM2 to assess the assembly file formed (@chrisgulvik and @gregorysprenger).
- #83 Added fastp as an alternative to Trimmomatic (@chrisgulvik and @gregorysprenger).
- #73 Check if input FastQ files are corrupted before proceeding to downstream analyses (@gregorysprenger).
- #74 Fix emit statements to catch output files from modules (@gregorysprenger).
- #69 Replace baseDir with projectDir due to deprecation in latest version of nextflow (@gregorysprenger).
- #71 Updated path to Kraken1 database in bash wrapper script (@gregorysprenger).
- #77 Set max resources for release.yml GitHub action (@gregorysprenger).
- #37 Allow customization of the MLST module, such as specifying schemes to include/exclude and minimum thresholds (@gregorysprenger).
- #52 Concatenate kraken (1 and 2) output summaries and place into the output Summaries directory (@gregorysprenger).
- #47 Added RDP Classifier as another tool to classify 16S ribosomal RNA (@taylorpaisie).
- #51 Added header to BLAST output file before it is compressed (@gregorysprenger).
- #35 Added parameters to customize Trimmomatic parameters (@gregorysprenger).
- #61 Summarize all QA outputs and place into Summaries output directory, and added the ability to convert TSV files to XLSX and create a final Excel report workbook (@gregorysprenger).
- #50 Host removal now runs all samples, QC file checks are implemented for host removal, fixed minimum file size param for host removal, and fixed kraken2 database issue (@gregorysprenger).
- #67 Added configuration file for gitleaks so that FastA files are ignored when checking for security tokens with MegaLinter (@gregorysprenger).
- #55 Replace linting GitHub workflows with MegaLinter to allow for more linters to be used, autofix issues and commit, etc. (@gregorysprenger).
- #45 Updated output documentation to allow for better readability and interpretation (@gregorysprenger).
- #56 Change how images are displayed based on light and dark themes due to GitHub changes (@gregorysprenger).
- #62 Ignore warnings for RDP parameters (@taylorpaisie).
- #23 Added a miscellaneous issue template (@taylorpaisie).
- #25 Fix broken repository links (@gregorysprenger).
- #24 Update .nfcore.yml file to ignore specific nf-core file checks that do not apply to this workflow (@gregorysprenger)
- #41 Added ability to run `hostile` in remove_host_hostile channel (@chrisgulvik)
- #41 Added ability to run NCBI's SRA Human Scrubber (`scrub.sh`) in remove_host_sra_human_scrubber channel (@chrisgulvik)
- Added ability to remove the broken sister reads from scrub.sh in the remove_broken_pairs_bbtools_repair channel
- Added new prepare_db_sra_human_scrubber channel for future proofing in case a DB file retrieved eventually gets gzip compressed
- Added new update_db_sra_human_scrubber channel to fetch the human_filter.db file and avoid requiring it inside the container environment
- #41 Added 2 new test config files to locally run the miniburk test data with and without host removal (@chrisgulvik)
- #41 Added a new subworkflow for human host removal `subworkflows/host_removal.nf` (@chrisgulvik)
- #26 Updated database preparation modules to use `meta.id` to correctly rename .command.{out,err} files in the log directory (@gregorysprenger).
- #30 Fixed the parsing of PhiX removal information from BBDuk to be added to the summary files (@gregorysprenger).
- #31 Removed the possibility of scientific notation in `Summary.CleanedReads-Bases.tab` file (@gregorysprenger).
- #29 Updated nf-core GTDB-Tk and BUSCO modules to latest release and added `mash_db` parameter for GTDB-Tk (@gregorysprenger).
- #36 Updated readme and markdown files in `docs/`, and separated SKESA and SPAdes options in nextflow schema file (@gregorysprenger).
- #41 Updated usage to include the new `--host_remove {both,hostile,skip,sra_human_scrubber}`. Default uses "both" (SRA Human Scrubber first and then hostile), but options also exist to "skip" host removal entirely, or invoke just one as "hostile" or "sra_human_scrubber" (@chrisgulvik)
- #41 Updated `workflows/assembly.nf` to run the host removal subworkflow after infile handling and prior to downsampling (@chrisgulvik)
- Added ability to use nf-core lint on pipeline.
- Validation parameters added to params.config for nf-core lint.
- Removal of duplicate entries in errors.tsv when using the bash runner scripts.
- Process name is now displayed when QC failures occur.
- Assembler name is appended to sample name in QC filechecks.
- Barrnap module now receives all biopython outputs.
- Renamed container cache parameter and added it to conda, docker, and singularity profiles.
- Added missing parameters to nextflow_schema.json.
- Removed system.exit(1) from extract.record.from.genbank.py script.
- Updated error catching of biopython, barrnap, and QC failures that are added to errors.tsv when using bash runner scripts.
- Use exit 1 instead of exit 2 in barrnap module.
- Minimum nextflow version in github actions now matches the minimum version in nextflow.config.
- Markdown files now use `:::tip` or `:::note` for tips and notes.
- Removed genbank file input for barrnap module as it is not used.
- Updated .nf-core.yml file to ignore certain lints when using nf-core lint.
- Replaced `gregorysprenger` workflow prefixes with `bacterial-genomics`.
- "Bigdata" parameter is no longer used as resources are different for everyone - create a custom config instead.
- Catching of errors in bash runner scripts for BBDuk and SPAdes.
- Add sample name and assembler name to outputs.
- Added and sorted error codes in bash runner script to ignore if process is resubmitted and completes.
- Allow output of "fastq" or "fq" files from subsampled reads module.
- Within rosalind_hpc.config, resubmit SPAdes if the exit code is anything other than zero.
- Fixed parsing of 16S BLAST database input so that results have species names.
- Quast and Kraken1 software versions information is now parsed correctly.
- Bash runner scripts are updated to correctly parse run information for errors.
- Fixed publishDir for SPAdes and Kraken1 so that output files are placed into outdir.
- Concatenation of QC filechecks to create one large summary.
- Removal of input reads if reads are subsampled, to avoid passing both original and subsampled reads to downstream processes.
- Ignore header when counting number of lines in genome coverage summary file within bash runner script.
- Changed number of retries for SPAdes in rosalind_hpc.config to 5.
- Removed non-working loop in SPAdes module as the process should be resubmitted to abide by resource constraints.
- Changed rosalind_hpc.config high memory parameter and allow memory to be increased for resubmissions.
- Consolidated input and outputs for each module.
- Updated QC filecheck function to remove failed inputs from downstream processes without terminating entire workflow.
- Parsed out meta information when collecting outputfiles to create a concatenated summary file.
- Added headers to QC filechecks within each module.
- Dropped forced reformatting of FastQ files within BBDuk module.
- Added collect operators to database/reference channels to allow for all inputs to be analyzed for each process
- Removed extra slashes '/' from run_assembly.uge-nextflow scripts
- Updated paths for summary email when using run_assembly.uge-nextflow scripts
- Made bin directory executable when added to github so Nextflow can use them
- Changed log directory to pipeline_info directory in run_assembly.uge-nextflow scripts
- Removed --gtdb_db from run_assembly.uge-nextflow scripts until GTDB-Tk module version is updated
- nf-core Styling
- Allow samplesheet (XLSX, CSV, TSV) OR directory as input
- Use SKESA instead of SPAdes for assembling contigs
- Downsampling of input FastQ files
- Contig taxonomic classification using GTDB-Tk
- Intra-contig gene information using BUSCO
- Ability to use a BUSCO config file
- Always reformat FastQ files before removing PhiX using BBDuk
- Database prep modules to decompress and extract databases
- Function that checks whether QC File Checks fail and places them in the QC log directory, removing the hassle of passing these files to each subsequent module
- Comments in input/output channels in subworkflows to make it easier to reference for future changes/additions
- Added "-<assembler>" to outputs after contigs are assembled
- Added headers to each output file
- Option to change the "mode" of SPAdes (e.g., isolate, meta)
- Option to use different k-mer sizes for SPAdes
- Added evalue similarity cut-off to Prokka
- Ability to use a curated protein file along with Prokka's database
- Errors in run_assembly.uge-nextflow scripts
- Handle "lock" issues when using the same work directory with run_assembly.uge-nextflow scripts
- Remove duplicate rows and junk information in errors.tsv if workflow fails due to a locked session
- Updated "INFO" messages to be more readable
- Docker containers to ensure they all have built in tests and removed built-in databases
- More comments and whitespace to separate information in conf/params.config
- Separated kraken1 and kraken2 modules
- Use channels for databases instead of passing params
- Use downloadable database .tar.gz files instead of using built-in databases in docker images
- Converted spades and skesa workflows into one workflow and use subworkflows "downsampling.nf" and "assemble_contigs.nf"
- Renamed assemble_{spades,skesa} to assemble_contigs_{spades,skesa}
- CHANGED OUTPUT FILE STRUCTURE: Output files are generally under the name of the tool that produced them
- Updated docs/output.md to reflect new file structure
- Removed QC file checking from modules and use checkQCFileChecks function in workflows/assembly.nf and subworkflows/assemble_contigs.nf
- Renamed emit names to hopefully clarify what each output means
- Moved all `publishDir` information to conf/modules.config
- Set filter contig length to 1000
- Place .command.{out,err} on one line for each module
- Added content section to main README.md to allow for easier navigation
- Added BUSCO and GTDB-Tk references to CITATIONS.md
- Parsing of BBDuk output to add more information to a PhiX Summary file
- Fixed github actions to properly run
Initial release of wf-paired-end-illumina-assembly.