EasyFuse is a pipeline to detect fusion transcripts from paired-end RNA-seq data with high accuracy. The current version of EasyFuse uses three fusion gene detection tools, STAR-Fusion, Fusioncatcher and Arriba along with a powerful read filtering strategy, stringent re-quantification of supporting reads and machine learning for highly accurate predictions.
Please have a look at environment.yml. The conda environment to run nextflow can be installed with the following command:
conda env create -f environment.yml --prefix conda_env/
Before running EasyFuse the following reference annotation data needs to be downloaded (~104 GB).
# Download reference archive
wget ftp://easyfuse.tron-mainz.de/easyfuse_ref_v4.tar.gz
# Extract reference archive
tar xvfz easyfuse_ref_v4.tar.gz
There are two alternatives, manually install the workflow or let Nexftlow handle this via the GitHub repository.
To install manually:
git clone https://github.com/TRON-Bioinformatics/EasyFuse.git
cd EasyFuse
# In order to run the test script you have to move the reference folder to test/easyfuse_ref/
mv ../easyfuse_ref_v4/ test/easyfuse_ref/
To install with Nextflow (only available from release 2.0.1 onwards):
nextflow run tron-bioinformatics/easyfuse -r x.y.z --help
where x.y.z corresponds to an EasyFuse release.
Provide your downloaded reference data with the parameter --reference
Generate a tab-delimited input table with your matching FASTQs. The format of the table is: sample_name, fq1, fq2 (without headers). E.g.:
sample_01 /path/to/sample_01_R1.fastq.gz /path/to/sample_01_R2.fastq.gz
Start the pipeline as follows if you installed manually
nextflow run main.nf \
-profile conda \
--reference /path/to/reference/folder \
--input_files /path/to/input_table_file \
--output /path/to/output_folder
Or as follows if you installed it via Nextflow (only available from release 2.0.1 onwards):
nextflow run tron-bioinformatics/easyfuse -r x.y.z \
-profile conda \
--reference /path/to/reference/folder \
--input_files /path/to/input_table_file \
--output /path/to/output_folder
Note: If you want to use a custom profile (e.g. for running jobs on a cluster), please refer to https://www.nextflow.io/docs/latest/config.html for further information.
EasyFuse creates an output folder for each input sample containing the following files:
fusions.csv
fusions.pass.csv
Within the files, each line describes a candidate fusion transcript. The file fusions.csv
contains all candidate fusions with annotated features, the prediction probability assigned by the EasyFuse model, and the corresponding prediction class (positive or negative). The file fusions.pass.csv
contains only positive predicted gene fusions.
Overview of all features/columns annotated by EasyFuse:
-
BPID: The BPID (breakpoint ID) is an identifier composed of
chr1:position1:strand1_chr2:position2:strand2
and is used as the main identifier of fusion breakpoints throughout the EasyFuse publication. In the BPID,chr
andposition
are 1-based genomic coordinates (GRCh38 reference) of the two breakpoint positions. -
context_sequence_id: The context sequence id is a unique identifier (hash value) calculated from
context_sequence
, the fusion transcript sequence context (400 upstream and 400 bp downstream from the breakpoint position). -
FTID: The FTID is a unique identifier composed of
GeneName1_chr1:position1:strand1_transcript1_GeneName2_chr2:position2:strand2_transcript2
. All transcript combinations are considered. -
Fusion_Gene: Fusion Gene is a combination of the gene symbols of the involved genes in the form:
GeneName1_GeneName2
. -
Breakpoint1: Breakpoint1 is a combination of the first breakpoint position in the form:
chr1:position1:strand1
. -
Breakpoint2: Breakpoint2 is a combination of the second breakpoint position in the form:
chr2:position2:strand2
. -
context_sequence_100_id: The context sequence 100 id is a unique identifier (hash value) calculated from 200 bp context sequence (100 upstream and 100 bp downstream from the breakpoint).
type: EasyFuse identifies six different types of fusion genes. The type describes the configuration of the involved genes to each other with respect to location on chromosomes and transcriptional strands:cis_near
: Genes on the same chromosome, same strand, order of genes matches reading direction, genomic distance < 1Mb (read-through likely)cis_far
: Genes on the same chromosome, same strand, order of genes matches reading direction, genomic distance >= 1Mbcis_trans
: Genes on the same chromosome, same strand, but the order of genes does not match the reading directioncis_inv
: Genes on the same chromosome but on different strandstrans
: Genes on different chromosomes, same strandtrans_inv
: Genes on different chromosomes, different strands
-
exon_nr: Number of exons involved in the fusion transcript
-
ft1_exon_nr: Exon number of fusion partner 1 that is invoved in building the transcript breakpoint
-
ft2_exon_nr: Exon number of fusion partner 2 that is invoved in building the transcript breakpoint
-
exon_starts: Genomic starting positions of involved exons
-
exon_ends: Genomic end positions of involved exons
-
exon_boundary1: Exon boundary of the breakpoint in Gene1
left_boundary
is 5' in strand orientationright_boundary
is 3' in strand orientationwithin
means breakpoint is inside exon)
-
exon_boundary2: Exon boundary of the breakpoint in Gene2 (
left_boundary
is 5' in strand orientation,right_boundary
is 3' in strand orientation,within
means breakpoint is inside exon) -
exon_boundary: describes which of the partner genes (gene 1 + gene 2) has their breakpoint on an exon boundary:
both
:left_boundary
+right_boundary
5prime
:left_boundary
+within
3prime
:within
+right_boundary
no_match
:within
+within
-
bp1_frame: Reading frame of translated peptide at breakpoint for fusion transcript1 (-1 is non-coding region/no frame; 0,1,2 is coding region with indicated offset for reading frame)
-
bp2_frame: Reading frame of translated peptide at breakpoint for fusion transcript2 (-1 is none-coding region/no frame; 0,1,2 is coding region with indicated offset for reading frame)
-
frame: Type of frame for translation of fusion gene:
in_frame
: translation of wild type peptide sequences without frameshift after breakpoint (both coding frames are equal,bp1_frame
==bp2_frame
!=-1
)neo_frame
: translation of none-coding region after breakpoint leads to novel peptide sequence (bp1_frame
is 0, 1, or 2 andbp2_frame
is -1)no_frame
: no translation (bp1_frame
is -1)out_frame
: out of frame translation after breakpoints leads to novel peptide sequence (bp1_frame
!=bp2_frame
!= -1)
-
context_sequence: The fusion transcript sequence downstream and upstream from the breakpoint (default 800 bp, shorter if transcript start or end occurs within the region)
-
context_sequence_bp: Position of breakpoint in context sequence
-
neo_peptide_sequence: Translated peptide sequence of context sequence starting at 13 aa before breakpoint until 13 aa after breakpoint (for in-frame transcripts) or until next stop codon (for out frame and neo frame). This is to consider only the region around the breakpoint that may contain neo-epitopes.
-
neo_peptide_sequence_bp: Breakpoint on translated peptide sequence.
-
fusion_protein_sequence: Full-length protein sequence
-
fusion_protein_sequence_bp: Position of breakpoint in full-length protein seqeunce
-
toolname_detected: 1 if breakpoint was detected by respective tool, 0 if not
-
toolname_junc: Junction read count (reads covering breakpoint) reported by toolname
-
toolname_span: Spanning read count (read pairs with each partner on one side of breakpoint) reported by toolname
-
tool_frac: Fraction of tools detecting the fusion gene breakpoint
-
category_bp: Location of breakpoint on context sequence (400 for an 800 bp context sequence). Whereby category describes (here and in the following columns) the reference sequence to which the reads were mapped and quantified:
ft
: context_sequence of fusion transcriptwt1
: corresponding sequence of fusion partner 1 (wild type 1)wt2
: corresponding sequence of fusion partner 2 (wild type 2)
-
category_junc: Fraction of read counts from 1 million reads that map to sequence and overlap breakpoint by at least 10 bp
-
category_span: Fraction of read pairs from 1 million sequenced read pairs, that map to both sides of breakpoint position
-
category_anch: Maximal read anchor size across all junction reads, where the anchor size for a given read is defined as the minimum distance between read start and breakpoint or read end and the breakpoint.
-
category_junc_cnt: Number of reads that map to sequence and overlap breakpoint by at least 10 bp
-
category_span_cnt: Number of read pairs, that map to both sides of breakpoint position
-
category_anch_cnt: Maximal read anchor size across all junction reads, where the anchor size for a given read is defined as the minimum distance between read start and breakpoint or read end and the breakpoint.
-
prediction_prob: The predicted probability according to the machine learning model that the fusion candidate is a true positive.
-
prediction_class: The predicted class (
negative
orpositive
) according to the machine learning model. This classification relies on a user-defined threshold (default 0.5) applied to theprecition_prob
column.
If you use EasyFuse, please cite: Weber D, Ibn-Salem J, Sorn P, et al. Nat Biotechnol. 2022