Release V2.0.0 #94

Merged
merged 71 commits into from
Dec 12, 2024
Commits
11d7c6c
towards v2.0.0 species ID wit new mash gtdb+enterobase+refseq sketch…
kbessonov1984 Nov 28, 2023
56e9ba3
pathotyping module first release
kbessonov1984 Feb 7, 2024
19bb24b
updated EPEC rules
kbessonov1984 Feb 8, 2024
b7919fa
added ND (not detected) flag for pathotype
kbessonov1984 Feb 8, 2024
54c5fd3
soreted pathotype outputs
kbessonov1984 Feb 8, 2024
8bcfada
added new outputs to the report, %id, %cov, top gene hits and other
kbessonov1984 Feb 9, 2024
28c9ff3
added gene length ratio field to output.tsv
kbessonov1984 Feb 11, 2024
d9448ea
fixed bug on additional fields for non-E.coli genomes. Made species I…
kbessonov1984 Feb 12, 2024
2a18cb9
f-string issue resolved
kbessonov1984 Feb 13, 2024
ac3395c
duplicated filenames in batch mode error plus removed tempfile depend…
kbessonov1984 Feb 13, 2024
e6b1ae3
updated patho database with aggA gene
kbessonov1984 Feb 16, 2024
77ba347
added options to specify thresholds for pathotype blastn results
kbessonov1984 Feb 16, 2024
1e6e08d
small fixes
kbessonov1984 Feb 16, 2024
eb90af3
fixed local variable 'predictions_pathotype_dict' where it is not ass…
kbessonov1984 Feb 18, 2024
775d454
fixed db missing id field
kbessonov1984 Feb 18, 2024
abb23c5
Shiga toxin typing module implemented and improved datbase of pathoty…
kbessonov1984 Feb 20, 2024
3a58180
added hemolysin exhA, hlyA and hlyE typing and accessions field
kbessonov1984 Feb 21, 2024
8ca9e60
ehxA mislabelled
kbessonov1984 Feb 22, 2024
5fd1dbb
ehxA gene name correction in DB
kbessonov1984 Feb 24, 2024
09d0efe
fixed non-typable files being counted as non-E.coli genomes
kbessonov1984 Feb 26, 2024
741fff0
Added Dockerfile with Species ID DB init
kbessonov1984 Mar 2, 2024
e4827dc
Added Dockerfile with Species ID DB init
kbessonov1984 Mar 2, 2024
1923691
fixed empty fasta or non-fasta file with pathotype flag edge case
kbessonov1984 Mar 8, 2024
2c6bf12
Updated species ID sketch download and added ectyper_init
kbessonov1984 Mar 22, 2024
2a19b84
init.py added
kbessonov1984 Mar 22, 2024
2f55ae8
updated speciesID download link
kbessonov1984 Mar 22, 2024
decee61
fixed speciesID dbpath reduntant variable
kbessonov1984 Mar 22, 2024
4c2362d
Added Singularity.def file
kbessonov1984 Mar 23, 2024
3d8c291
Fixed empty O- and H- antigen BLAST output file found in metagenomic …
kbessonov1984 Apr 18, 2024
8b0281b
Fixed empty or invalid FASTA file error generated from FASTQ and impr…
kbessonov1984 Apr 20, 2024
d5ca377
big update for Shiga toxin module typing supporting reporting of mult…
kbessonov1984 May 11, 2024
b7ba367
shiga toxin module several improvements and application of a threshold
kbessonov1984 May 28, 2024
890f78b
Fixed contig naming reporting for FASTA headers with | symbol replace…
kbessonov1984 May 29, 2024
476cb3c
stx2l sequence for AM904726.1 corrected based on https://doi.org/10.3…
kbessonov1984 Jun 6, 2024
adbec57
fixed --verify bug when pathotyping is run on non-E.coli species but …
kbessonov1984 Jun 19, 2024
9f5344b
added gunzip compression support, added pathotype gene count, sorting…
kbessonov1984 Jul 17, 2024
003c114
Report EHEC as EHEC-STEC, add number of genes linked to pathotype(s)
kbessonov1984 Jul 17, 2024
434316a
renamed ectyper_pathotyping_database_v2.json pathotype database to ec…
kbessonov1984 Jul 17, 2024
3670514
added PMID to pathotype rules in the database
kbessonov1984 Jul 19, 2024
f92a8dd
github actions pytest CI feature added
kbessonov1984 Jul 22, 2024
0e65432
github actions pytest CI feature added
kbessonov1984 Jul 22, 2024
eacee16
github actions pytest CI feature added
kbessonov1984 Jul 22, 2024
86e1542
github actions pytest CI feature added
kbessonov1984 Jul 22, 2024
33eac5d
github actions pytest CI feature added
kbessonov1984 Jul 22, 2024
58b956e
Added STEC O157:H7 GCA_000181775.1 gzip compressed for testing
kbessonov1984 Jul 22, 2024
f8cd859
fixed multiprocess Pool freeze on subprocess error and added better m…
kbessonov1984 Jul 29, 2024
74edb81
added wildcard characters in path specification and also multiple dir…
kbessonov1984 Jul 29, 2024
22b6a98
Making tool and its module more robust to errors
kbessonov1984 Jul 30, 2024
4ef8526
Fixed pytest small error with test_emtpy_BLAST_antigen_hits
kbessonov1984 Jul 30, 2024
6cca035
Fixing small pytest errors
kbessonov1984 Jul 30, 2024
41a5093
fixed batch mode error when one of the files fails FASTA test causing…
kbessonov1984 Jul 31, 2024
8002091
fixed pytest issues
kbessonov1984 Jul 31, 2024
5d2da4d
Fixed bugs related to batch mode output files displaying only last sa…
kbessonov1984 Aug 1, 2024
d70a025
Fixed bugs related to batch mode output files displaying only last sa…
kbessonov1984 Aug 1, 2024
1d3ba78
Updated README.md to follow DAAD format and also updated url to Speci…
kbessonov1984 Sep 12, 2024
fe0de54
Updated README.md with species ID module description
kbessonov1984 Sep 18, 2024
5f2c847
Updated documentation on Shiga toxin and pathotyping, also fixed bug …
kbessonov1984 Sep 18, 2024
b420a54
updated stx database STX subunit A and B sequences. Only kept STX com…
kbessonov1984 Sep 20, 2024
776c2fb
Updated pathotyping and toxin typing database by adding protein seque…
kbessonov1984 Oct 3, 2024
02bc7c7
Updated pathotyping and toxin typing database by adding protein seque…
kbessonov1984 Oct 3, 2024
7481376
Added test of no directory specified and testing for multiple files i…
kbessonov1984 Oct 3, 2024
b2c7c5c
added new species ID fields that provide info on species hashes to re…
kbessonov1984 Oct 28, 2024
ee4db4e
added Shigella_sonnei test sample
kbessonov1984 Oct 28, 2024
c203a3b
Updated bowtie2 setting to be compatible with long reads
kbessonov1984 Nov 7, 2024
6e38b40
sorted for O mixed antigen is working
kbessonov1984 Nov 7, 2024
3963d92
--longreads option added for FASTQ long reads inputs such as PacBio, …
kbessonov1984 Nov 7, 2024
ea75384
--longreads option added for FASTQ long reads inputs such as PacBio, …
kbessonov1984 Nov 7, 2024
f1f6707
--longreads option added for FASTQ long reads inputs such as PacBio, …
kbessonov1984 Nov 7, 2024
9cd8494
release v2.0.0
kbessonov1984 Dec 12, 2024
9277821
updated README.md
kbessonov1984 Dec 12, 2024
33dd165
Merge branch 'master' into v2.0.0
kbessonov1984 Dec 12, 2024
41 changes: 41 additions & 0 deletions .github/workflows/github-actions.yaml
@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python application

on:
  push:
    branches: [ "master", "v2.0.0" ]
  pull_request:
    branches: [ "master", "v2.0.0" ]

permissions:
  contents: read

jobs:
  build:

    runs-on: ubuntu-22.04

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python 3.12
      uses: actions/setup-python@v4
      with:
        python-version: "3.12"
    - name: Install dependencies
      run: |
        sudo apt-get update
        sudo apt-get install samtools bowtie2 mash bcftools ncbi-blast+ seqtk libcurl4-openssl-dev libssl-dev ca-certificates -y
        sudo apt-get install python3-pip python3-dev python3-pandas python3-requests python3-biopython -y
        python3 -m pip install --upgrade pip setuptools
        pip3 install pytest
        if [ -f requirements.txt ]; then
          pip3 install -r requirements.txt;
        else
          pip3 install -e .
        fi
        ectyper_init
    - name: Test with pytest
      run: |
        pytest -o log_cli=true --basetemp=tmp-pytest
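Before running the test suite outside CI, it can help to confirm that the external tools installed by the workflow are on `PATH`. The following is a minimal sketch, not part of ectyper: the helper name `missing_tools` is hypothetical, and the executable names are assumptions based on the apt packages above (for example, `ncbi-blast+` is assumed to provide `blastn`).

```python
import shutil

# Executable names assumed to be provided by the apt packages in the workflow
# (assumption: the ncbi-blast+ package provides `blastn`).
REQUIRED_TOOLS = ["samtools", "bowtie2", "mash", "bcftools", "blastn", "seqtk"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be located on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```

The `which` parameter is injectable only so the check can be exercised without the tools installed.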
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -38,4 +38,15 @@ specified and MASH distance to RefSeq genomes fails
* Now species verification via MASH distance to RefSeq genomes and E.coli specific alleles is done only if `--verify`
parameter is specified.
* If `--verify` is not specified, all input genomes are treated as E.coli without doing any species verification

**v2.0.0**
* Updated the species identification module, which is now based on a GTDB plus custom *Escherichia* and *Shigella* sketch covering all known bacterial species
* Implemented pathotyping covering 7 diarrheagenic *Escherichia coli* (DEC) pathotypes (`DAEC`, `EAEC`, `EHEC`, `EIEC`, `EPEC`, `ETEC` and `STEC`), with support for multiple signatures present at once (e.g. `ETEC/STEC`). Note that `EHEC` is reported as `EHEC-STEC` because it is a more severe subtype of `STEC`.
* Implemented Shiga toxin 1 and 2 typing, with support for multiple toxin signatures present in a single sample.
  * A total of 4 *stx1* subtypes are supported: `stx1a`, `stx1c`, `stx1d` and `stx1e`.
  * A total of 15 *stx2* subtypes are supported: `stx2a`, `stx2b`, `stx2c`, `stx2d`, `stx2e`, `stx2f`, `stx2g`, `stx2h`, `stx2i`, `stx2j`, `stx2k`, `stx2l`, `stx2m`, `stx2n` and `stx2o`.
* New database of pathotypes and toxins in a transparent JSON format, composed of the key virulence factors drawn from both BioNumerics and literature sources
* Support for gzip-compressed inputs (`fastq.gz` and `fasta.gz`), saving storage and increasing versatility
* Other toxin typing covering enterohemolysin A (`ehxA`), hemolysin E (`hlyE`) and hemolysin A (`hlyA`)
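
To illustrate what gzip input support involves, here is a minimal sketch of reading a FASTA file whether or not it is gzip-compressed. It is not ectyper's implementation: the helpers `open_text` and `count_fasta_records` are hypothetical, and detecting compression by the gzip magic bytes rather than by file extension is an assumption made for this example.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # first two bytes of every gzip stream
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Checking the magic bytes means a compressed file is still handled when its name does not end in `.gz`.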


13 changes: 13 additions & 0 deletions Dockerfile
@@ -0,0 +1,13 @@
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND="noninteractive" TZ="America/New_York"
RUN apt update && apt install git python3-pip -y
RUN apt install libcurl4-openssl-dev libssl-dev -y
RUN pip3 install Cython numpy
RUN apt install mash ncbi-blast+ bowtie2 seqtk samtools bcftools -y
RUN git clone https://github.com/phac-nml/ecoli_serotyping.git
# install the tool and initialize its species ID MASH database
RUN cd ecoli_serotyping && git checkout v2.0.0 && pip3 install .
RUN ectyper_init

#build image: docker build --tag ectyper:2.0.0 .
#type a sample: docker run -it --rm -v $PWD:/mnt ectyper:2.0.0 ectyper -i /mnt/assembly.fasta -o /mnt/temp/ --pathotype
288 changes: 228 additions & 60 deletions README.md

Large diffs are not rendered by default.

16 changes: 16 additions & 0 deletions Singularity.def
@@ -0,0 +1,16 @@
Bootstrap: docker
From: ubuntu:22.04

%environment
DEBIAN_FRONTEND="noninteractive" TZ="America/New_York"

%post
apt update && apt install git python3-pip -y
apt install libcurl4-openssl-dev libssl-dev -y
pip3 install Cython numpy
apt install mash ncbi-blast+ bowtie2 seqtk samtools bcftools -y
git clone https://github.com/phac-nml/ecoli_serotyping.git
cd ecoli_serotyping && git checkout v2.0.0 && pip3 install .
ectyper_init
# To build an image run the following. Might use --remote flag if no sudo/admin priv.
# singularity build ectyper_v2.0.0_22032024.sif Singularity.def
5,876 changes: 5,876 additions & 0 deletions ectyper/Data/ectyper_patho_stx_toxin_typing_database.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion ectyper/__init__.py
@@ -1 +1 @@
-__version__ = "1.0.0"
+__version__ = "2.0.0"
59 changes: 48 additions & 11 deletions ectyper/commandLineOptions.py
@@ -40,7 +40,7 @@ def checkdbversion():
     dbversion = checkdbversion()

     parser = argparse.ArgumentParser(
-        description='ectyper v{} database v{} Prediction of Escherichia coli serotype from '
+        description='ectyper v{} antigen database v{}. Prediction of Escherichia coli serotype, pathotype and Shiga toxin typing from '
                     'raw reads'
                     ' or assembled genome sequences. The default settings are recommended.'.format(__version__, dbversion)
     )
@@ -57,9 +57,25 @@ def checkdbversion():
         "--input",
         help="Location of E. coli genome file(s). Can be a single file, a \
         comma-separated list of files, or a directory",
-        required=True
+        required=True,
+        nargs="+"
     )

+    parser.add_argument(
+        "--longreads",
+        action="store_true",
+        default=False,
+        help="Enable for raw long reads FASTQ inputs (ONT, PacBio, other sequencing platforms). [default %(default)s]"
+    )
+
+    parser.add_argument(
+        "--maxdirdepth",
+        help="Maximum number of directories to descend when searching an input directory of files [default %(default)s levels]. Only works on path inputs not containing '*' wildcard",
+        default=0,
+        type=int,
+        required=False
+    )
+
     parser.add_argument(
         "-c",
         "--cores",
@@ -73,7 +89,7 @@ def checkdbversion():
         "--percentIdentityOtype",
         type=check_percentage,
         help="Percent identity required for an O antigen allele match [default %(default)s]",
-        default=90
+        default=95
     )

     parser.add_argument(
@@ -88,15 +104,15 @@ def checkdbversion():
         "-opcov",
         "--percentCoverageOtype",
         type=check_percentage,
-        help="Minumum percent coverage required for an O antigen allele match [default %(default)s]",
+        help="Minimum percent coverage required for an O antigen allele match [default %(default)s]",
         default=90
     )

     parser.add_argument(
         "-hpcov",
         "--percentCoverageHtype",
         type=check_percentage,
-        help="Minumum percent coverage required for an H antigen allele match [default %(default)s]",
+        help="Minimum percent coverage required for an H antigen allele match [default %(default)s]",
         default=50
     )

@@ -114,14 +130,13 @@ def checkdbversion():

     parser.add_argument(
         "-r",
-        "--refseq",
-        help="Location of pre-computed MASH RefSeq sketch. If provided, "
+        "--reference",
+        default=definitions.SPECIES_ID_SKETCH,
+        help="Location of pre-computed MASH sketch for species identification. If provided, "
              "genomes "
              "identified as non-E. coli will have their species identified "
              "using "
-             "MASH. For best results the pre-sketched RefSeq archive "
-             "https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh "
-             "is recommended"
+             "MASH dist"
     )

     parser.add_argument(
@@ -140,7 +155,29 @@ def checkdbversion():

     parser.add_argument(
         "--dbpath",
-        help="Path to a custom database of O and H antigen alleles in JSON format.\nCheck Data/ectyper_database.json for more information"
+        help="Path to a custom database of O and H antigen alleles in JSON format.\n"
     )
+
+    parser.add_argument(
+        "--pathotype",
+        action="store_true",
+        help="Predict E.coli pathotype and Shiga toxin subtype(s) if present\n"
+    )
+
+    parser.add_argument(
+        "-pathpid",
+        "--percentIdentityPathotype",
+        type=check_percentage,
+        help="Minimum percent identity required for a pathotype reference allele match [default: %(default)s]",
+        default=90
+    )
+
+    parser.add_argument(
+        "-pathpcov",
+        "--percentCoveragePathotype",
+        type=check_percentage,
+        help="Minimum percent coverage required for a pathotype reference allele match [default: %(default)s]",
+        default=50
+    )

     if args is None:
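
The effect of the new options can be tried in isolation. The sketch below rebuilds only a few of the arguments from this diff in a standalone parser. The body of `check_percentage` is not shown in the diff, so the validator here is a hypothetical stand-in (it accepts any number from 0 to 100 and returns a float), and `build_parser` is a name invented for this example.

```python
import argparse


def check_percentage(value):
    """Hypothetical stand-in for ectyper's validator: accept 0-100."""
    number = float(value)
    if not 0 <= number <= 100:
        raise argparse.ArgumentTypeError(
            f"{value} is not a percentage between 0 and 100")
    return number


def build_parser():
    """Build a parser holding a subset of the options touched by this diff."""
    parser = argparse.ArgumentParser(description="Sketch of the v2.0.0 options")
    # nargs="+" collects one or more space-separated paths into a list
    parser.add_argument("-i", "--input", required=True, nargs="+")
    parser.add_argument("--pathotype", action="store_true")
    parser.add_argument("-pathpid", "--percentIdentityPathotype",
                        type=check_percentage, default=90)
    parser.add_argument("-pathpcov", "--percentCoveragePathotype",
                        type=check_percentage, default=50)
    return parser


args = build_parser().parse_args(
    ["-i", "a.fasta", "b.fasta", "--pathotype", "-pathpid", "95"])
print(args.input, args.pathotype, args.percentIdentityPathotype)
```

With `nargs="+"`, `args.input` is always a list, even for a single path, so downstream code that previously received one string has to iterate over it.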
26 changes: 18 additions & 8 deletions ectyper/definitions.py
@@ -11,8 +11,10 @@
 WORKPLACE_DIR = os.getcwd()

 SEROTYPE_ALLELE_JSON = os.path.join(DATA_DIR, 'ectyper_alleles_db.json')
+PATHOTYPE_ALLELE_JSON = os.path.join(DATA_DIR, 'ectyper_patho_stx_toxin_typing_database.json')
+SPECIES_ID_SKETCH = os.path.join(DATA_DIR, 'EnteroRef_GTDBSketch_20231003_V2.msh')
 #ECOLI_MARKERS = os.path.join(DATA_DIR, 'ecoli_specific_markers.fasta')
-REFSEQ_SUMMARY = os.path.join(DATA_DIR, 'assembly_summary_refseq.txt')
+#REFSEQ_SUMMARY = os.path.join(DATA_DIR, 'assembly_summary_refseq.txt')
 OSEROTYPE_GROUPS_DICT = {'1': ['O20','O137'],
                          '2': ['O28','O42'],
                          '3': ['O118','O151'],
@@ -30,12 +32,20 @@
                          '15':['O89','O101','O162'],
                          '16':['O169','O183']
                          }
-MASH_URLS = ["https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh",
-             "https://share.corefacility.ca/index.php/s/KDhSNQfhE6npIyo/download",
-             "https://gitlab.com/kbessonov/ectyper/raw/master/ectyper/Data/refseq.genomes.k21s1000.msh"]
-assembly_summary_refseq_url_dict = {"assembly_summary_refseq.txt":
-                                    "http://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt"
-                                    }
+MASH_URLS = ["https://zenodo.org/records/13969103/files/EnteroRef_GTDBSketch_20231003_V2.msh?download=1"]
+
 HIGH_SIMILARITY_THRESHOLD_O = 0.00771 # alleles that are 99.23% apart will be reported as mixed call ~ 8 nt difference on average
 MIN_O_IDENTITY_LS = 95 #low similarity group O antigen min identity threshold to pre-filter BLAST output (identical to global threshold)
-MIN_O_COVERAGE_LS = 48 #low similarity group O antigen min coverage threshold to pre-filter BLAST output (based on cross-talk study results)
+MIN_O_COVERAGE_LS = 48 #low similarity group O antigen min coverage threshold to pre-filter BLAST output (based on cross-talk study results)
+PATHOTYPE_TOXIN_FIELDS = ['pathotype', 'pathotype_count', 'pathotype_genes', 'pathotype_gene_names', 'pathotype_accessions', 'pathotype_allele_id',
+                          'pathotype_pident', 'pathotype_pcov','pathotype_length_ratio', 'pathotype_rule_ids', 'pathotype_gene_counts', 'pathotype_database',
+                          'stx_genes', 'stx_accessions', 'stx_allele_ids', 'stx_genes_full_name', 'stx_pidents', 'stx_pcovs', 'stx_gene_lengths', 'stx_contigs', 'stx_gene_ranges']
+OUTPUT_TSV_HEADER = ['Name','Species', 'SpeciesMashRatio', 'SpeciesMashDistance','SpeciesMashTopID','O-type','H-type','Serotype','QC',
+                     'Evidence','GeneScores','AlleleKeys','GeneIdentities(%)',
+                     'GeneCoverages(%)','GeneContigNames','GeneRanges',
+                     'GeneLengths','DatabaseVer','Warnings','Pathotype', 'PathotypeCounts', 'PathotypeGenes', 'PathotypeGeneNames', 'PathotypeAccessions', 'PathotypeAlleleIDs',
+                     'PathotypeIdentities(%)','PathotypeCoverages(%)','PathotypeGeneLengthRatios','PathotypeRuleIDs', 'PathotypeGeneCounts', 'PathoDBVer',
+                     'StxSubtypes','StxAccessions','StxAlleleIDs','StxAlleleNames', 'StxIdentities(%)','StxCoverages(%)','StxLengths',
+                     'StxContigNames','StxCoordinates']
+OUTPUT_FILES_LIST = ['blastn_output_alleles.txt', 'blastn_pathotype_alleles_overall.txt', 'mash_output.txt',
+                     'stx1_allhits_annotated_df.txt', 'stx2_allhits_annotated_df.txt']
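
The new pathotype columns can be read back from the tabular report with the standard library. This is a minimal sketch under stated assumptions: the report is taken to be tab-separated with the column names listed in `OUTPUT_TSV_HEADER`, and multiple pathotypes are taken to be joined with `/`, as in the CHANGELOG example `ETEC/STEC`. The sample rows and the helper `pathotypes_by_sample` are made up for illustration and use only a subset of the columns.

```python
import csv
import io

# Column names come from OUTPUT_TSV_HEADER; the two data rows are invented.
SAMPLE_TSV = (
    "Name\tSpecies\tSerotype\tPathotype\tStxSubtypes\n"
    "sample1\tEscherichia coli\tO157:H7\tEHEC-STEC\tstx2a\n"
    "sample2\tEscherichia coli\tO6:H16\tETEC/STEC\tstx2e\n"
)


def pathotypes_by_sample(tsv_text):
    """Map each sample name to its list of reported pathotypes."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Assumption: co-occurring pathotypes are separated by '/'
    return {row["Name"]: row["Pathotype"].split("/") for row in reader}


if __name__ == "__main__":
    for name, pathotypes in pathotypes_by_sample(SAMPLE_TSV).items():
        print(name, pathotypes)
```

Looking columns up by name rather than by position keeps such a reader working if further fields are appended to the header in later releases.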