Merge pull request #27 from phac-nml/dev
Version 0.20 Release
emarinier authored Jun 26, 2024
2 parents bfbeffe + fdf5162 commit 50949d1
Showing 26 changed files with 507 additions and 83 deletions.
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,20 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] - 2024-06-26

### Added

- Support for mismatched IDs between the samplesheet ID and the ID listed in the corresponding allele file.

### Changed

- Updated ArborView to v0.0.7-rc1.

### Fixed

- The scaled distance thresholds provided when using `--pd_distm scaled` and `--gm_thresholds` are now correctly understood as percentages in the range [0.0, 100.0].

## [0.1.0] - 2024-05-28

Initial release of the Genomic Address Service Clustering pipeline to be used for distance-based clustering of cg/wgMLST data.
@@ -13,3 +27,4 @@ Initial release of the Genomic Address Service Clustering pipeline to be used fo
- Output of a dendrogram, cluster codes, and visualization using [profile_dists](https://github.com/phac-nml/profile_dists), [gas mcluster](https://github.com/phac-nml/genomic_address_service), and ArborView.

[0.1.0]: https://github.com/phac-nml/gasclustering/releases/tag/0.1.0
[0.2.0]: https://github.com/phac-nml/gasclustering/releases/tag/0.2.0
22 changes: 19 additions & 3 deletions README.md
@@ -24,12 +24,28 @@ The main parameters are `--input` as defined above and `--output` for specifying

In order to customize metadata headers, the parameters `--metadata_1_header` through `--metadata_8_header` may be specified. These parameters are used to re-name the headers in the final metadata table from the defaults (e.g., rename `metadata_1` to `country`).

## Profile dists
## Distance Method and Thresholds

The Genomic Address Service Clustering workflow can use two distance methods: Hamming or scaled.

### Hamming Distances

Hamming distances are integers representing the number of differing loci between two sequences and will range between [0, n], where `n` is the total number of loci. When using Hamming distances, you must specify `--pd_distm hamming` and provide Hamming distance thresholds as integers between [0, n]: `--gm_thresholds "10,5,0"` (10, 5, and 0 loci).

### Scaled Distances

Scaled distances are floats representing the percentage of differing loci between two sequences and will range between [0.0, 100.0]. When using scaled distances, you must specify `--pd_distm scaled` and provide percentages between [0.0, 100.0] as thresholds: `--gm_thresholds "50,20,0"` (50%, 20%, and 0% of loci).
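
As a rough illustration of the two units described above, the following Python sketch computes both distances for a pair of allele profiles. It is not the [profile_dists][] implementation: the function names are hypothetical, and it assumes both profiles cover the same loci with no missing alleles (how missing data is handled is not covered here).

```python
from typing import Sequence


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the loci at which two allele profiles differ (range [0, n])."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for x, y in zip(a, b) if x != y)


def scaled_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Express the differing loci as a percentage of all loci (range [0.0, 100.0])."""
    if len(a) == 0:
        raise ValueError("profiles must contain at least one locus")
    return 100.0 * hamming_distance(a, b) / len(a)


# Hypothetical profiles over four loci; two loci differ.
profile_a = ["1", "4", "2", "7"]
profile_b = ["1", "5", "2", "9"]

print(hamming_distance(profile_a, profile_b))  # 2
print(scaled_distance(profile_a, profile_b))   # 50.0
```

Under these assumptions, a Hamming threshold of `2` and a scaled threshold of `50` describe the same cut-off for a four-locus scheme, which is why the threshold values must match the unit selected with `--pd_distm`.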

### Thresholds

The `--gm_thresholds` parameter is used to set thresholds for each cluster level, which in turn are used to assign cluster codes at each level. When specifying `--pd_distm hamming` and `--gm_thresholds "10,5,0"`, all sequences that have no more than 10 loci differences will be assigned the same cluster code for the first level, no more than 5 for the second level, and only sequences that have no loci differences will be assigned the same cluster code for the third level.
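
To make the level-by-level assignment concrete, here is a minimal Python sketch that derives one code per sample from a distance matrix and a list of thresholds. It is an illustration only, not how [gas mcluster][] works internally: it assumes single linkage (samples are grouped when connected by a chain of distances at or below the threshold), numbers clusters in order of first appearance, and joins levels with `.`. The function names and the example matrix are hypothetical.

```python
from typing import List


def cluster_at_threshold(dist: List[List[float]], threshold: float) -> List[int]:
    """Single-linkage labels: samples joined by distances <= threshold share a label."""
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the set representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(j)] = find(i)

    # Number clusters 1, 2, 3, ... in order of first appearance.
    labels = {}
    result = []
    for i in range(n):
        root = find(i)
        if root not in labels:
            labels[root] = len(labels) + 1
        result.append(labels[root])
    return result


def cluster_codes(dist: List[List[float]], thresholds: List[float],
                  delimiter: str = ".") -> List[str]:
    """Build one code per sample, with one field per threshold level."""
    per_level = [cluster_at_threshold(dist, t) for t in thresholds]
    return [delimiter.join(str(level[i]) for level in per_level)
            for i in range(len(dist))]


# Hypothetical Hamming distance matrix for four samples.
dist = [
    [0, 0, 8, 12],
    [0, 0, 8, 12],
    [8, 8, 0, 11],
    [12, 12, 11, 0],
]

codes = cluster_codes(dist, [10, 5, 0])
print(codes)  # ['1.1.1', '1.1.1', '1.2.2', '2.3.3']
```

In this example the first three samples share a first-level code because they are within 10 loci of each other, while only the two identical samples share codes at the 5 and 0 levels. The actual codes produced by the pipeline may be numbered differently and depend on the `--gm_method` linkage setting.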

## profile_dists

The following can be used to adjust parameters for the [profile_dists][] tool.

- `--pd_outfmt`: The output format for distances. For this pipeline the only valid value is _matrix_ (required by [gas mcluster][]).
- `--pd_distm`: The distance method/unit, either _hamming_ or _scaled_. For _hamming_ distances, the distance values will be a non-negative integer. For _scaled_ distances, the distance values are between 0 and 1.
- `--pd_distm`: The distance method/unit, either _hamming_ or _scaled_. For _hamming_ distances, the distance values will be a non-negative integer. For _scaled_ distances, the distance values are between 0.0 and 100.0. Please see the [Distance Method and Thresholds](#distance-method-and-thresholds) section for more information.
- `--pd_missing_threshold`: The maximum proportion of missing data per locus for a locus to be kept in the analysis. Values from 0 to 1.
- `--pd_sample_quality_threshold`: The maximum proportion of missing data per sample for a sample to be kept in the analysis. Values from 0 to 1.
- `--pd_file_type`: Output format file type. One of _text_ or _parquet_.
@@ -48,7 +64,7 @@ The following can be used to adjust parameters for the [profile_dists][] tool.
The following can be used to adjust parameters for the [gas mcluster][] tool.
- `--gm_thresholds`: Thresholds delimited by `,`. Values should match units from `--pd_distm` (either _hamming_ or _scaled_).
- `--gm_thresholds`: Thresholds delimited by `,`. Values should match units from `--pd_distm` (either _hamming_ or _scaled_). Please see the [Distance Method and Thresholds](#distance-method-and-thresholds) section for more information.
- `--gm_method`: The linkage method to use for clustering. Value should be one of _single_, _average_, or _complete_.
- `--gm_delimiter`: Delimiter desired for nomenclature code. Must be alphanumeric or one of `._-`.