Test using pyfaidx instead of Biopython for loading reference FASTA files #295

creisle · 2022-01-21T01:32:00Z

Loading FASTA files with biopython is quite sloe (~30s). Test whether we can replace the biopython load with using the pyfaidx library instead

I copied the current function that uses biopython and modified it to use pyfaidx and then tested loading various reference genome fasta files (3x replication) over 6 human genomes (hg38, hg19. h18, no alt, etc)

import os
import re
import time

import pandas as pd
from Bio import SeqIO
from pyfaidx import Fasta


def load_reference_genome(*filepaths: str):
    reference_genome = {}
    for filename in filepaths:
        with open(filename, 'rU') as fh:
            for chrom, seq in SeqIO.to_dict(SeqIO.parse(fh, 'fasta')).items():
                if chrom in reference_genome:
                    raise KeyError('Duplicate chromosome name', chrom, filename)
                reference_genome[chrom] = seq

    names = list(reference_genome.keys())

    # to fix hg38 issues
    for template_name in names:
        if template_name.startswith('chr'):
            truncated = re.sub('^chr', '', template_name)
            if truncated in reference_genome:
                raise KeyError(
                    'template names {} and {} are considered equal but both have been defined in the reference'
                    'loaded'.format(template_name, truncated)
                )
            reference_genome.setdefault(truncated, reference_genome[template_name].upper())
        else:
            prefixed = 'chr' + template_name
            if prefixed in reference_genome:
                raise KeyError(
                    'template names {} and {} are considered equal but both have been defined in the reference'
                    'loaded'.format(template_name, prefixed)
                )
            reference_genome.setdefault(prefixed, reference_genome[template_name].upper())
        reference_genome[template_name] = reference_genome[template_name].upper()

    return reference_genome


def fast_load_reference_genome(*filepaths: str):
    reference_genome = {}
    for filename in filepaths:
        seqs = Fasta(filename, sequence_always_upper=True, build_index=False)
        for chrom, seq in seqs.items():
            if chrom in reference_genome:
                raise KeyError('Duplicate chromosome name', chrom, filename)
            reference_genome[chrom] = seq

    names = list(reference_genome.keys())

    # to fix hg38 issues
    for template_name in names:
        if template_name.startswith('chr'):
            truncated = re.sub('^chr', '', template_name)
            if truncated in reference_genome:
                raise KeyError(
                    'template names {} and {} are considered equal but both have been defined in the reference'
                    'loaded'.format(template_name, truncated)
                )
            reference_genome.setdefault(truncated, reference_genome[template_name])
        else:
            prefixed = 'chr' + template_name
            if prefixed in reference_genome:
                raise KeyError(
                    'template names {} and {} are considered equal but both have been defined in the reference'
                    'loaded'.format(template_name, prefixed)
                )
            reference_genome.setdefault(prefixed, reference_genome[template_name])
        reference_genome[template_name] = reference_genome[template_name]

    return reference_genome

creisle · 2022-01-21T01:32:28Z

Results

creisle · 2022-01-21T01:35:57Z

Not unexpected since the pyfaidx will use the index where it exists and biopython presumably is not. Will also need to test how much impact this has on an overall runtime

zhemingfan · 2022-01-22T01:21:11Z

Seems like there's also some performance advantages to pyfaidx compared to the samtools version

creisle · 2022-01-22T19:34:57Z

So there definitely ARE performance boosts but I did run into an annoying issue. As far as I can tell you can't read a fasta without an index. Mostly this would be fine except that if you don't have write permissions to the place where the reference genome lives you can't build one either. Not sure how to resolve this so I am putting this on hold for now

creisle added the enhancement label Jan 21, 2022

creisle added the performance To do with improving speed or resource usage label Jan 21, 2022

creisle changed the title ~~Use pyfaidx instead of Biopython for loading reference FASTA files~~ Test using pyfaidx instead of Biopython for loading reference FASTA files Jan 21, 2022

creisle mentioned this issue Jan 22, 2022

Performance/pyfaidx #297

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test using pyfaidx instead of Biopython for loading reference FASTA files #295

Test using pyfaidx instead of Biopython for loading reference FASTA files #295

creisle commented Jan 21, 2022

creisle commented Jan 21, 2022

creisle commented Jan 21, 2022

zhemingfan commented Jan 22, 2022

creisle commented Jan 22, 2022

Test using pyfaidx instead of Biopython for loading reference FASTA files #295

Test using pyfaidx instead of Biopython for loading reference FASTA files #295

Comments

creisle commented Jan 21, 2022

creisle commented Jan 21, 2022

creisle commented Jan 21, 2022

zhemingfan commented Jan 22, 2022

creisle commented Jan 22, 2022