Loading FASTA files with Biopython is quite slow (~30s). Test whether we can replace the Biopython load with the pyfaidx library instead.
I copied the current function that uses Biopython, modified it to use pyfaidx, and then tested loading various reference genome FASTA files (3x replication) across 6 human genomes (hg38, hg19, hg18, no alt, etc.).
import os
import re
import time

import pandas as pd
from Bio import SeqIO
from pyfaidx import Fasta


def load_reference_genome(*filepaths: str):
    reference_genome = {}
    for filename in filepaths:
        # 'rU' was removed in Python 3.11; plain 'r' already uses universal newlines
        with open(filename, 'r') as fh:
            for chrom, seq in SeqIO.to_dict(SeqIO.parse(fh, 'fasta')).items():
                if chrom in reference_genome:
                    raise KeyError('Duplicate chromosome name', chrom, filename)
                reference_genome[chrom] = seq

    names = list(reference_genome.keys())

    # to fix hg38 issues
    for template_name in names:
        if template_name.startswith('chr'):
            truncated = re.sub('^chr', '', template_name)
            if truncated in reference_genome:
                raise KeyError(
                    'template names {} and {} are considered equal but both have been defined in the reference '
                    'loaded'.format(template_name, truncated)
                )
            reference_genome.setdefault(truncated, reference_genome[template_name].upper())
        else:
            prefixed = 'chr' + template_name
            if prefixed in reference_genome:
                raise KeyError(
                    'template names {} and {} are considered equal but both have been defined in the reference '
                    'loaded'.format(template_name, prefixed)
                )
            reference_genome.setdefault(prefixed, reference_genome[template_name].upper())
        reference_genome[template_name] = reference_genome[template_name].upper()

    return reference_genome


def fast_load_reference_genome(*filepaths: str):
    reference_genome = {}
    for filename in filepaths:
        # sequence_always_upper replaces the explicit .upper() calls of the Biopython version
        seqs = Fasta(filename, sequence_always_upper=True, build_index=False)
        for chrom, seq in seqs.items():
            if chrom in reference_genome:
                raise KeyError('Duplicate chromosome name', chrom, filename)
            reference_genome[chrom] = seq

    names = list(reference_genome.keys())

    # to fix hg38 issues
    for template_name in names:
        if template_name.startswith('chr'):
            truncated = re.sub('^chr', '', template_name)
            if truncated in reference_genome:
                raise KeyError(
                    'template names {} and {} are considered equal but both have been defined in the reference '
                    'loaded'.format(template_name, truncated)
                )
            reference_genome.setdefault(truncated, reference_genome[template_name])
        else:
            prefixed = 'chr' + template_name
            if prefixed in reference_genome:
                raise KeyError(
                    'template names {} and {} are considered equal but both have been defined in the reference '
                    'loaded'.format(template_name, prefixed)
                )
            reference_genome.setdefault(prefixed, reference_genome[template_name])

    return reference_genome
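The timing code itself is not included in this issue. As a rough sketch of how a 3x replicated comparison like the one described could be run, assuming only the standard library (the helper names `time_loader` and `summarize` are hypothetical and not part of the original test):

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable, List


def time_loader(
    loader: Callable[[str], object],
    filepaths: Iterable[str],
    replicates: int = 3,
) -> Dict[str, List[float]]:
    """Call `loader` on each file `replicates` times and record seconds per run."""
    timings: Dict[str, List[float]] = {}
    for path in filepaths:
        runs = []
        for _ in range(replicates):
            start = time.perf_counter()
            loader(path)
            runs.append(time.perf_counter() - start)
        timings[path] = runs
    return timings


def summarize(timings: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean wall-clock seconds per file."""
    return {path: mean(runs) for path, runs in timings.items()}
```

Either loader can be passed in, for example `summarize(time_loader(load_reference_genome, ['hg38.fa']))`, where the file name is a placeholder. Note that repeated runs on the same file may benefit from the operating system's file cache, so the first replicate can be slower than the rest.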
creisle changed the title from "Use pyfaidx instead of Biopython for loading reference FASTA files" to "Test using pyfaidx instead of Biopython for loading reference FASTA files" on Jan 21, 2022
This is not unexpected, since pyfaidx will use the index where it exists and Biopython presumably does not. We will also need to test how much impact this has on the overall runtime.
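To illustrate why an index helps, here is a simplified, standard-library sketch of offset-based access. It is not pyfaidx's implementation and does not read or write the real `.fai` format (which stores name, length, offset, bases per line, and bytes per line); the functions `build_offsets` and `fetch` are hypothetical. The point is only that once byte offsets are known, one sequence can be read with a seek rather than by parsing the whole file.

```python
def build_offsets(fasta_path):
    """Scan the file once; map each sequence name to [start_offset, byte_span].

    Assumes a well-formed FASTA where every header has a non-empty name.
    """
    index = {}
    name = None
    with open(fasta_path, 'rb') as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b'>'):
                # the name is the first whitespace-delimited token of the header
                name = line[1:].split()[0].decode()
                index[name] = [fh.tell(), 0]
            elif name is not None:
                index[name][1] += len(line)
    return index


def fetch(fasta_path, index, name):
    """Read a single sequence by seeking to its offset, skipping all others."""
    start, span = index[name]
    with open(fasta_path, 'rb') as fh:
        fh.seek(start)
        raw = fh.read(span)
    return raw.replace(b'\n', b'').replace(b'\r', b'').decode().upper()
```

With the offsets stored on disk, later loads can skip the initial scan entirely, which is consistent with the speedup being attributed to the index here.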
So there definitely ARE performance boosts, but I did run into an annoying issue. As far as I can tell, pyfaidx can't read a FASTA file without an index. Mostly this would be fine, except that if you don't have write permissions to the directory where the reference genome lives, you can't build one either. I'm not sure how to resolve this, so I am putting this on hold for now.
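One possible workaround, sketched here and not tested against this project, is to symlink the FASTA into a writable directory and open the link, so that the index is created beside the link rather than beside the original. This assumes pyfaidx writes the `.fai` file next to the path it is given; the helper name `indexable_fasta_path` and the `cache_dir` parameter are hypothetical.

```python
import os
import tempfile


def indexable_fasta_path(fasta_path, cache_dir=None):
    """Return a path to the FASTA whose directory can hold a .fai index.

    The original path is returned when an index already exists or its
    directory is writable; otherwise a symlink in `cache_dir` is returned.
    """
    fasta_path = os.path.abspath(fasta_path)
    if os.path.exists(fasta_path + '.fai'):
        return fasta_path  # an index is already present, nothing to build
    if os.access(os.path.dirname(fasta_path), os.W_OK):
        return fasta_path  # the index can be built in place
    if cache_dir is None:
        cache_dir = tempfile.mkdtemp(prefix='fasta_index_')
    os.makedirs(cache_dir, exist_ok=True)
    link = os.path.join(cache_dir, os.path.basename(fasta_path))
    if not os.path.lexists(link):
        os.symlink(fasta_path, link)
    return link
```

The returned path would then be passed to the loader in place of the original. A persistent `cache_dir` would let the index be reused between runs, whereas the temporary-directory default rebuilds it each time.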