phac-nml/LegioVue : Clustering

This document provides the necessary steps to visualize the cgMLST output from LegioVue. At the moment, these steps must be run separately using the outputs of LegioVue. Pending updates, these steps will be incorporated directly into the Nextflow workflow.

Visualizations of cgMLST data can be generated with or without clustering. Both options are presented below, though partitioning and visualization with ReporTree is the recommended approach if you are able to install and run ReporTree on the command-line.

Visualization-only with PHYLOViZ GUI

Use this option if you are unable to install ReporTree, or if you simply want to visualize relative allele differences between isolates without setting cluster/partition thresholds:

Navigate to https://online2.phyloviz.net/index in a browser window.
Scroll down and click on "Login-free upload" under Test PHYLOViZ Online. This will take you to a page where you can upload and visualize your cgMLST profile without storing any data in the application. Note that navigating away from this page will erase your data.
From the Possible Input Formats dropdown menu, select "Profile Data".
Under Input Files, upload your results/chewbbaca/allele_calls/cgMLST/cgMLST100.tsv file from LegioVue as Profile Data. Upload a .tsv metadata file as Auxiliary Data. Note: the sample-name column header in the metadata file (usually the first column) must match the corresponding header in the cgMLST output for the metadata to be visualized correctly. Rename it to "FILE" to match the profile data, or vice versa.
Select "Core Analysis" as the Analysis Method.
Provide a name and optional description for your dataset and click on Launch Tree. In a minute or two you will be redirected to a visualization of your data.
On the left sidebar, navigate to Assign Colors > By Auxiliary Data and select the appropriate metadata column (e.g., ST). Node and branch labels can be added by selecting the "Add Labels" checkbox under Graphic Properties > Nodes or Links.
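
The header-matching requirement in the upload step can be checked before opening PHYLOViZ. The following is a minimal sketch, assuming the sample identifiers sit in the first column of both files; the helper names and the in-memory example data are hypothetical and are not part of LegioVue or PHYLOViZ.

```python
import csv
import io


def align_metadata_header(metadata_tsv, id_column="FILE"):
    """Return the metadata text with its first column header renamed to id_column."""
    rows = list(csv.reader(io.StringIO(metadata_tsv), delimiter="\t"))
    rows[0][0] = id_column
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


def sample_ids(tsv_text):
    """Collect the values in the first column, skipping the header row."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return {row[0] for row in rows[1:] if row}


# Made-up example data; real files would be read from disk instead.
profile = "FILE\tlocus_1\tlocus_2\nsampleA\t1\t4\nsampleB\t1\t7\n"
metadata = "sample name\tST\nsampleA\t1\nsampleC\t62\n"

fixed = align_metadata_header(metadata)
missing_from_metadata = sample_ids(profile) - sample_ids(fixed)
missing_from_profile = sample_ids(fixed) - sample_ids(profile)

print(fixed.splitlines()[0])
print(sorted(missing_from_metadata), sorted(missing_from_profile))
```

Any identifier reported as missing from either file will not be paired with its metadata, so it is worth resolving before upload.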

Important: In this Minimum Spanning Tree (MST), branch (or "link") lengths represent the number of alleles that differ between linked isolates. The default schema that the pipeline uses for cgMLST determination has a maximum of 1521 possible alleles. Branch lengths tend to increase when the profile data were generated from many inferred (INF) alleles and fewer exact (EXC) alleles, a balance that is in turn affected by underlying data quality. The INF and EXC counts can be found in the overall.qc.tsv output of the main pipeline and should be taken into consideration when interpreting the visualization of the profile data.
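
To make the meaning of a link length concrete, the sketch below counts the loci at which two allele profiles differ. It is an illustration only: the function name and the example profiles are hypothetical, and treating "0" or an empty value as a missing call that is skipped is an assumption, not a description of how PHYLOViZ or GrapeTree handle missing data.

```python
def allele_differences(profile_a, profile_b, missing=("0", "")):
    """Count loci where both isolates have a call and the calls differ."""
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        # Assumption: loci with a missing call in either isolate are ignored.
        if a not in missing and b not in missing and a != b
    )


# Made-up allele calls for four loci in three isolates.
isolate_1 = ["1", "4", "2", "9"]
isolate_2 = ["1", "7", "2", "3"]
isolate_3 = ["1", "4", "0", "9"]

print(allele_differences(isolate_1, isolate_2))  # 2
print(allele_differences(isolate_1, isolate_3))  # 0
```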

Partitioning and Visualization with ReporTree

ReporTree can be used to partition the MST of isolates according to different thresholds, which may be useful for epidemiological investigation.

First, install ReporTree either with Conda or Docker according to the installation instructions in the Readme file on their GitHub page.
Prepare a metadata file with columns for sample and any other data you wish to include for downstream visualization.
Activate ReporTree and run the grapetree analysis, using as input the cgMLST profile data and the metadata file prepared in the previous step. An example command for use with the test dataset is shown below:

reportree.py -m <PATH-TO-METADATA>/metadata.tsv \
-a <PATH-TO-PIPELINE-DIR>/results/chewbbaca/allele_calls/cgMLST/cgMLST100.tsv -thr 0-5 --columns_summary_report ST,n_ST \
--method MSTreeV2 --loci-called 1.0 --matrix-4-grapetree --analysis grapetree

You may wish to modify certain values depending on your analysis:

-thr indicates the threshold(s) to use for cluster partitioning. Setting -thr 0-5 will request that ReporTree assign samples to clusters at six different allele thresholds, ranging from 0 allele differences to 5. You may also select distinct threshold values, for example -thr 5,10,15,20, for more exploratory analysis.
--loci-called should correspond to the cgMLST profile used as input, i.e., --loci-called 0.95 should be used if the input profile is cgMLST95.tsv.
--columns_summary_report indicates columns from the metadata file that should be described for each cluster. For example, ST,n_ST requests that for each cluster, the ST and number of STs included in that cluster should be reported in the output. This information can help you investigate different clustering thresholds.
--out can be added to the above command to specify an existing directory and prefix for the output files. For example, --out reportree/TD1 will write the output files to the reportree directory and add "TD1" as a prefix to each file name.
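
The two conventions described above, how a -thr value expands into individual thresholds and how --loci-called follows from the profile file name, can be sketched as follows. This is an illustration and not ReporTree's own argument parsing; both helper functions are hypothetical.

```python
import re
from pathlib import Path


def expand_thresholds(spec):
    """Expand a spec such as '0-5' or '5,10,15,20' into a list of integers."""
    thresholds = []
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            thresholds.extend(range(start, end + 1))  # ranges are inclusive
        else:
            thresholds.append(int(part))
    return thresholds


def loci_called_for(profile_path):
    """Derive the --loci-called value from a file name like cgMLST95.tsv."""
    match = re.fullmatch(r"cgMLST(\d+)", Path(profile_path).stem)
    if match is None:
        raise ValueError(f"unexpected profile name: {profile_path}")
    return int(match.group(1)) / 100


print(expand_thresholds("0-5"))
print(expand_thresholds("5,10,15,20"))
print(loci_called_for("results/chewbbaca/allele_calls/cgMLST/cgMLST95.tsv"))
```

With these inputs, "0-5" expands to six thresholds and cgMLST95.tsv maps to 0.95, matching the descriptions above.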

Once you have your output files from ReporTree, navigate to the local implementation of GrapeTree to visualize the MST data.
Under Inputs/Outputs, select "Load Files" and upload both *.nwk and *_metadata_w_partitions.tsv.
Under Tree Layout, you can customize the MST visualization, including exploring different partitions by selecting MST-### under Node Style > Colour By.
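
Before loading the files into GrapeTree, it can help to see which samples fall into the same cluster at a given threshold. The sketch below groups samples by one partition column of a *_metadata_w_partitions.tsv-style table. The column names ("sample", "MST-1", "MST-5") and cluster labels in the example data are hypothetical placeholders; check the header of your own output file for the actual names.

```python
import csv
import io
from collections import defaultdict


def clusters_at(partitions_tsv, partition_column, sample_column):
    """Group sample names by their value in one partition column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(partitions_tsv), delimiter="\t"):
        groups[row[partition_column]].append(row[sample_column])
    return dict(groups)


# Made-up table; column names and labels are placeholders, not real output.
example = (
    "sample\tST\tMST-1\tMST-5\n"
    "sampleA\t1\tcluster_1\tcluster_1\n"
    "sampleB\t1\tcluster_1\tcluster_1\n"
    "sampleC\t62\tsingleton_1\tcluster_1\n"
)

for column in ("MST-1", "MST-5"):
    print(column, clusters_at(example, column, "sample"))
```

In this toy table, sampleC is separate at the lower threshold and joins the other two samples at the higher one, which is the kind of change the Colour By partitions make visible.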
