Merge pull request #78 from phac-nml/dev

Dev
phac-nml · May 14, 2024 · f1efb35
2 parents bb93c35 + 90ce217
Showing 121 changed files with 3,912 additions and 1,032 deletions.
diff --git a/.github/workflows/linting_comment.yml b/.github/workflows/linting_comment.yml
@@ -11,7 +11,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Download lint results
-        uses: dawidd6/action-download-artifact@f6b0bace624032e30a85a8fd9c1a7f8f611f5737 # v3
+        uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe # v3
         with:
           workflow: linting.yml
           workflow_conclusion: completed

diff --git a/.nf-core.yml b/.nf-core.yml
@@ -1,4 +1,5 @@
 repository_type: pipeline
+nf_core_version: "2.14.1"
 lint:
   files_exist:
     - CODE_OF_CONDUCT.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,32 +3,62 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v0.1.2 - [2024-05-02]
+## v0.2.0 - [2024-05-14]
+
+### `Added`
+
+- Updated documentation for params. See [PR 66](https://github.com/phac-nml/mikrokondo/pull/66)
+
+- Fixed param typos in schema, config and docs. See [PR 66](https://github.com/phac-nml/mikrokondo/pull/66)
+
+- Added parameter to skip length filtering of sequences. See [PR 66](https://github.com/phac-nml/mikrokondo/pull/66)
+
+- Added locidex for allele calling. See [PR 62](https://github.com/phac-nml/mikrokondo/pull/62)
+
+- Updated directory output structure and names. See [PR 66](https://github.com/phac-nml/mikrokondo/pull/66)
+
+- Added tests for Kraken2 contig binning. See [PR 66](https://github.com/phac-nml/mikrokondo/pull/66)
+
+### `Fixed`
+
+- If you select to filter contigs by length, those contigs will now be used for subsequent analysis. See [PR 66](https://github.com/phac-nml/mikrokondo/pull/66)
+
+- Matched ECTyper and SISTR parameters to what is set in the current IRIDA. See [PR 68](https://github.com/phac-nml/mikrokondo/pull/68)
+
+- Updated StarAMR point finder DB selection to resolve an error in DB selection when a database is not selected. See [PR 74](https://github.com/phac-nml/mikrokondo/pull/74)
 
-### Added
+- Fixed calculation of SeqtkBaseCount value to include counts for both pairs of paired-end reads. See [PR 65](https://github.com/phac-nml/mikrokondo/pull/65).
+
+### `Changed`
+
+- Changed the specific files and metadata to store within IRIDA Next. See [PR 65](https://github.com/phac-nml/mikrokondo/pull/65)
+
+- Added separate report fields for (PASSED|FAILED|WARNING) values and for the actual value. See [PR 65](https://github.com/phac-nml/mikrokondo/pull/65)
+
+- Updated StarAMR to version 0.10.0. See [PR 74](https://github.com/phac-nml/mikrokondo/pull/74)
+
+## v0.1.2 - [2024-05-02]
 
 ### Changed
 
-- Changed default values for database parameters `--dehosting_idx`, `--mash_sketch`, `--kraken2_db`, and `--bakta_db` to null.
-- Enabled checking for existance of database files in JSON Schema to avoid issues with staging non-existent files in Azure.
-- Set `--kraken2_db` to be a required parameter for the pipeline.
-- Hide bakta parameters from IRIDA Next UI.
+- Changed default values for database parameters `--dehosting_idx`, `--mash_sketch`, `--kraken2_db`, and `--bakta_db` to null. See [PR 71](https://github.com/phac-nml/mikrokondo/pull/71)
+- Enabled checking for existence of database files in JSON Schema to avoid issues with staging non-existent files in Azure. See [PR 71](https://github.com/phac-nml/mikrokondo/pull/71).
+- Set `--kraken2_db` to be a required parameter for the pipeline. See [PR 71](https://github.com/phac-nml/mikrokondo/pull/71)
+- Hide bakta parameters from IRIDA Next UI. See [PR 71](https://github.com/phac-nml/mikrokondo/pull/71)
 
 ## v0.1.1 - [2024-04-22]
 
-### Added
-
 ### Changed
 
-- Switched the resource labels for **parse_fastp**, **select_pointfinder**, **report**, and **parse_kat** from `process_low` to `process_single` as they are all configured to run on the local Nextflow machine.
+- Switched the resource labels for **parse_fastp**, **select_pointfinder**, **report**, and **parse_kat** from `process_low` to `process_single` as they are all configured to run on the local Nextflow machine. See [PR 67](https://github.com/phac-nml/mikrokondo/pull/67)
 
 ## v0.1.0 - [2024-03-22]
 
 Initial release of phac-nml/mikrokondo. Mikrokondo currently supports: read trimming and quality control, contamination detection, assembly (isolate, metagenomic or hybrid), annotation, AMR detection and subtyping of genomic sequencing data targeting bacterial or metagenomic data.
 
 - Bumped version number to 0.1.0
 
-- Updated docs to include awesome-page plugin and restructured readme. 
+- Updated docs to include awesome-page plugin and restructured readme.
 
 - Updated coverage defaults for Shigella, Escherichia and Vibrio
 
@@ -49,11 +79,3 @@ Initial release of phac-nml/mikrokondo. Mikrokondo currently supports: read trim
 - Changed salmonella default default coverage to 40
 
 - Added integration testing using [nf-test](https://www.nf-test.com/).
-
-### `Added`
-
-### `Fixed`
-
-### `Dependencies`
-
-### `Deprecated`
diff --git a/README.md b/README.md
@@ -8,6 +8,35 @@
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
 <!-- [![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/mk-kondo/mikrokondo) -->
 
+- [Introduction](#introduction)
+  * [What is mikrokondo?](#what-is-mikrokondo-)
+  * [Is mikrokondo right for me?](#is-mikrokondo-right-for-me-)
+  * [Citation](#citation)
+    + [Contact](#contact)
+- [Installing mikrokondo](#installing-mikrokondo)
+  * [Step 1: Installing Nextflow](#step-1--installing-nextflow)
+  * [Step 2: Choose a Container Engine](#step-2--choose-a-container-engine)
+    + [Docker or Singularity?](#docker-or-singularity-)
+  * [Step 3: Install dependencies](#step-3--install-dependencies)
+    + [Dependencies listed](#dependencies-listed)
+  * [Step 4: Further resources to download](#step-4--further-resources-to-download)
+    + [Configuration and settings:](#configuration-and-settings-)
+- [Getting Started](#getting-started)
+  * [Usage](#usage)
+    + [Data Input/formats](#data-input-formats)
+    + [Output/Results](#output-results)
+  * [Run example data](#run-example-data)
+  * [Testing](#testing)
+    + [Install nf-test](#install-nf-test)
+    + [Run tests](#run-tests)
+  * [Troubleshooting and FAQs:](#troubleshooting-and-faqs-)
+  * [References](#references)
+  * [Legal and Compliance Information:](#legal-and-compliance-information-)
+  * [Updates and Release Notes:](#updates-and-release-notes-)
+
+<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>
+
+
 # Introduction
 
 ## What is mikrokondo?
@@ -127,18 +156,21 @@ For more information see the [useage docs](https://phac-nml.github.io/mikrokondo
 
 ### Output/Results
 
-All output files will be written into the `outdir` (specified by the user). More explicit tool results can be found in both the [Workflow](workflows/CleanAssemble/) and [Subworkflow](subworkflows/) sections of the docs. Here is a brief description of the outdir structure:
-
-- **annotations** - dir containing all annotation tool output.
-- **assembly** - dir containing all assembly tool related output, including quality, 7 gene MLST and taxon determination.
-- **pipeline_info** - dir containing all pipeline related information including software versions used and execution reports.
-- **ReadQuality** - dir containing all read tool related output, including contamination, fastq, mash, and subsampled read sets (when present)
-- **subtyping** - dir containing all subtyping tool related output, including SISTR, ECtyper, etc.
-- **SummaryReport** - dir containing collated results files for all tools, including: 
-   - Individual sample flatted json reports
-   - **final_report** - All tool results for all samples in both .json (including a flattened version) and .tsv format
-- **bco.json** - data providence file generated from the nf-prov plug-in
-- **manifest.json** - data providence file generated from the nf-prov plug-in
+All output files will be written into the `outdir` (specified by the user). More explicit tool results can be found in both the [Workflow](workflows/CleanAssemble/) and [Subworkflow](subworkflows/) sections of the docs. Here is a brief description of the outdir structure (in general, the deeper a directory sits in the structure, the later in the workflow its tool was run):
+
+- **Assembly** - contains all output files generated as a result of read assembly and tools using assembled contigs as input
+	- **Annotation** - contains output files generated from tools applying annotation and/or gene characterization from assembled contigs
+	- **Assembling** - contains output files generated as a part of the assembly process in nested order
+	- **FinalAssembly** - this directory will always contain the final output contig files from the last step in the assembly process (will take into account any skip flags in the process)
+	- **PostProcessing** - contains output files from intermediary tools that run after assembly but before annotation takes place in the workflow
+	- **Quality** - contains all output files generated as a result of quality tools after assembly
+- **Subtyping** - contains all output files from workflow subtyping tools, based off assembled contigs
+- **FinalReports** - contains assorted reports including aggregated and flat reports
+- **pipeline_info** - includes tool versions and other pipeline specific information
+- **Reads** - contains all output files generated as a result of read processing and tools using reads as input
+	- **FinalReads** - this directory will contain the final output read files from the last step in read processing (taking into account any skip flags used in the run)
+	- **Processing** - contains output files from tools run to process reads in nested order
+	- **Quality** - contains all output files generated from read quality tools
 
 ## Run example data
 

diff --git a/bin/kraken2_bin.py b/bin/kraken2_bin.py
@@ -13,6 +13,7 @@
 from collections import defaultdict
 import os
 import sys
+import re
 
 
 kraken2_classifiers = frozenset(["U", "R", "D", "K", "P", "C", "O", "F", "G", "S"])
@@ -355,7 +356,7 @@ def write_fastas(self, sequences):
         """
         for k, v in sequences.items():
             with open(
-                f"{k.strip().replace(' ', '_').replace('(', '_').replace(')', '_').replace('.', '_')}_binned.fasta",
+                "{}.binned.fasta".format(re.sub(r'[^A-Za-z0-9\-_]', '_', k)),
                 "w",
                 encoding="utf8",
             ) as out_file:
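
The filename change in `write_fastas` above can be illustrated in isolation. The sketch below is not part of the commit: the helper names `old_binned_name` and `new_binned_name` are hypothetical, and the sample labels are made up. It only reproduces the two expressions from the hunk so their behaviour can be compared side by side.

```python
import re


def old_binned_name(label: str) -> str:
    # Previous expression from the hunk: strips whitespace, then replaces
    # only spaces, parentheses, and dots with underscores.
    cleaned = (
        label.strip()
        .replace(" ", "_")
        .replace("(", "_")
        .replace(")", "_")
        .replace(".", "_")
    )
    return f"{cleaned}_binned.fasta"


def new_binned_name(label: str) -> str:
    # New expression from the hunk: every character that is not an ASCII
    # letter, digit, hyphen, or underscore becomes an underscore.
    return "{}.binned.fasta".format(re.sub(r"[^A-Za-z0-9\-_]", "_", label))


if __name__ == "__main__":
    # Hypothetical taxon labels, chosen only to exercise the two expressions.
    for label in ["Escherichia coli (strain K-12)", "Salmonella/Escherichia"]:
        print(repr(label), "->", old_binned_name(label), "|", new_binned_name(label))
```

Two differences are visible from the expressions themselves: the new form also replaces characters such as `/` that the chained `replace` calls left untouched, and it no longer calls `strip()`, so leading or trailing whitespace in a label becomes underscores rather than being removed. The suffix also changes from `_binned.fasta` to `.binned.fasta`.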