Note This is a re-factoring of the code from raphael-group/NAIBR to make the tool a bit more user-friendly and adds some additional features. This means that the resulting output might differ from what you would get from
raphael-group/NAIBR
. See Benchmarks for a comparison.
NAIBR (Novel Adjacency Identification with Barcoded Reads) identifies novel adjacencies (NAs) created by structural variation events such as deletions, duplications, inversions, and complex rearrangements using linked-read whole-genome sequencing data produced by 10X Genomics. Please refer to the publication for details about the method.
NAIBR takes as in put a phased BAM, such as those produced by 10X Genomic's Long Ranger pipeline, and outputs a BEDPE file containing predicted novel adjacencies and a likelihood score for each adjacency.
Install using pip:
pip install git+https://github.com/pontushojer/NAIBR.git
This fork of NAIBR is also available from binconda under the name naibr-plus
. Install using conda:
conda install naibr-plus
NAIBR can be run using the following command:
naibr <configfile>
A template config file can be found in example/example.config. The following parameters can be set in the config file:
bam_file
: Input phased BAM file with barcode tagged (BX
tag) reads (REQUIRED)min_mapq
: Minimum mapping quality for a read to be included in analysis (default: 40)outdir
: Output directory (default:.
i.e. current workding directory)prefix
: Prefix for output files, set to empty sting to matchbam_file
name (default:NAIBR_SVs
)d
: The maximum distance in basepairs between reads in a linked-read (default: 10000)blacklist
: BED-file with regions to be excluded from analysiscandidates
: BEDPE-file with novel adjacencies to be scored by NAIBR. This will override automatic detection of candidate novel adjacenciesthreads
: Number of threads (default: 1)min_sv
: Minimum size of a structural variant to be detected (default:lmax
, i.e. the 95th percentile of the paired-end read insert size distribution)k
: minimum number of barcode overlaps supporting a candidate NA (default = 3)
Example files are provided in the 'example' directory. Running
naibr example/example.config
will produce the files example/NAIBR_SVs.bedpe
, example/NAIBR_SVs.reformat.bedpe
and example/NAIBR_SVs.vcf
.
Scored novel adjacencies are outputted in three different file formats to the output directory
BEDPE-like file in same format as outputted in raphael-group/NAIBR. Note that this file does not follow any actual BEDPE format. The file has a header with column descriptions.
Columns:
- Chr1: Chromosome of 1st breakpoint
- Break1: Position of 1st breakpoint
- Chr2: Chromosome of 2nd breakpoint
- Break2: Position of 2nd breakpoint
- Split molecules: Number of split molecule supporting NA
- Discordant reads: Number of discordant reads supporting NA
- Orientation: Orientation of NA. "+-" ~ DEL, "++"/"--" ~ INV, "-+" ~ "DUP"
- Haplotype
- Score: log-likelihood score for NA
- Pass filter: "PASS" if passes filter threshold, otherwice "FAIL"
Reformatted BEDPE that follows the 10x Genomics BEDPE format suitable for IGV visualization
NAs translated to the VCF format. Useful for operation using e.g. truvari.
The repository includes a utility script bedpe_to_vcf.py
for converting from NAIBR's BEDPE format to a VCF. For detail on how to use it run:
bedpe_to_vcf.py --help
Benchmarking this fork (labeled main_<commit>
) versus raphael-group/NAIBR (labeled master
). Master includes raphael-group#22 merged as this removes a lot of FP calls introduced by a code error.
Run NAIBR on 10x Genomics data for idividual HG002 (NA24385). Dataset from GIAB (source) has 51.7x coverage and was run through the Long Ranger pipeline using reference genome GRCh37.
NAIBR configs used
min_mapq=40
d=10000
min_sv=1000
threads=10
k=3
min_reads=2
min_discs=2
Compared deletions (DELs) to GIAB benchmark set Tier1
(v0.6) using truvari (v3.1.0). Compared both all entries (named *_all
) and only those that passed filters (labeled *_pass
).
Command for *_all
:
truvari bench -b HG002_SVs_Tier1_v0.6.withchr.DELs.vcf.gz -c master/NAIBR_SVs.DELs.vcf.gz -o truvari_bench/master_all --includebed HG002_SVs_Tier1_v0.6.withchr.noY.bed --pctsim 0.5 --pctsize 0.5 --reference /Users/pontus.hojer/analysis/references/hg19_longranger/genome.fa --sizemin 1000 --sizemax 20000000 --multimatch --giabreport
Command for *_pass
:
truvari bench -b HG002_SVs_Tier1_v0.6.withchr.DELs.vcf.gz -c master/NAIBR_SVs.DELs.vcf.gz -o truvari_bench/master_pass --includebed HG002_SVs_Tier1_v0.6.withchr.noY.bed --pctsim 0.5 --pctsize 0.5 --reference /Users/pontus.hojer/analysis/references/hg19_longranger/genome.fa --sizemin 1000 --sizemax 20000000 --multimatch --giabreport --passonly
Results
Sample | Precision | Recall | F1 score | TP-base | TP-call | FP | FN |
---|---|---|---|---|---|---|---|
main_cbbdf01_all |
67.0% | 59.6% | 63.1% | 480 | 478 | 235 | 325 |
main_cbbdf01_pass |
99.3% | 77.4% | 87.0% | 411 | 411 | 3 | 120 |
master_all |
34.3% | 28.6% | 31.2% | 230 | 229 | 439 | 575 |
master_pass |
96.3% | 14.9% | 25.8% | 79 | 79 | 3 | 452 |
This fork, i.e. main_cbbdf01_pass
, had much higher F1 score of 87.0% than the master branch at 25.8%. Also the main_cbbdf01_all
seems to include a lot of TP calls that are filtered out by NAIBR, possibly filtering should be re-optimized.
Tag the commit with the new version number, e.g. v0.1.0
:
git tag <version>
Push tag
git push origin <version>
Build package
python3 -m pip install --upgrade build
python3 -m build
This generates a file dist/NAIBR-<version>.tar.gz
.
Create a release on GitHub and upload the dist/NAIBR-<version>.tar.gz
for the new version.
To cite this particular software please use this identifier:
The algorithm behind NAIBR is described in the article below:
Elyanow, Rebecca, Hsin-Ta Wu, and Benjamin J. Raphael. Identifying structural variants using linked-read sequencing data. Bioinformatics (2017).
@article{elyanow2017identifying,
title={Identifying structural variants using linked-read sequencing data},
author={Elyanow, Rebecca and Wu, Hsin-Ta and Raphael, Benjamin J},
journal={Bioinformatics},
year={2017}
}