Question

how to get sequences by location when pyfaidx.Fasta(read_long_names=True) creates keys from FastaRecord (dmel r6 genome build)?

0

Entering edit mode

11 months ago

mk ▴ 300

I'm attempting to use pyfaidx to index the dmel r6 genome build so that I can get actual sequences from tuples like (chromosome, start,end).

In the pyfaidx documentation, they describe this process using pyfaidx.Fasta, which is able to access sequences using discrete chromosome locations:

>>> genes = Fasta('tests/data/genes.fasta')
>>> genes['NM_001282543.1'][200:230]
NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG

Using the [potentially wrong] genome build (dmel-all-aligned-r6.16.fasta) from the flybase r6 page, I'm able to index the build with the following script. Note that without the read_long_names=True option, the following call to pyfaidx.Fasta throws error: ValueError: Duplicate key ":516332_sim4":

from pyfaidx import Fasta

# Path to the Drosophila genome reference file
genome_fasta_file = "dmel-all-aligned-r6.16.fasta"

# Load the reference genome
genome = Fasta(genome_fasta_file, read_long_names=True)

However, instead of giving discrete keys like in the pyfaidx tutorial, I end up with keys that are FastaRecord objects, is there some alternate way to access these indices?

In particular, how would I reference just the loc attribute of the following (indentation is mine):

>>> next(iter(genome))
FastaRecord(
":1539792_tblastxwrap_masked type=match; 
program=tblastxwrap_masked; 
sourcename=na_baylorf1_scfchunk.dpse; 
version=1.0; 
target=Dpf1glom15-1 21561 21701 -; 
loc=2L:complement(22278645..22278785); 
ID=:1539792_tblastxwrap_masked; 
name=Dpf1glom15-1-AE002603.4_22263978_22335911-na_baylorf1_scfchunk.d; 
MD5=f5bf1c797886d935433d8ef40a9ffed3; length=141; release=r6.16; species=Dmel;")

selene_sdk pyfaidx fasta python flybase • 1.1k views

ADD COMMENT • link updated 11 months ago by Wayne ★ 2.1k • written 11 months ago by mk ▴ 300

0

Entering edit mode

The documentation you quote is outdated. With the equivalent of the code you attempt you get an ordered dict object when you use gene.keys() and so then genes['NM_001282543.1'][200:230] fails. This shows the ordered dict so that genes['NM_001282543.1'] would fail even:

>>> genes = Fasta('pyfaidx/tests/data/genes.fasta')
>>> genes.keys()
odict_keys(['gi|563317589|dbj|AB821309.1|', 'gi|557361099|gb|KF435150.1|', 'gi|557361097|gb|KF435149.1|', 'gi|543583796|ref|NR_104216.1|', 'gi|543583795|ref|NR_104215.1|', 'gi|543583794|ref|NR_104212.1|', 'gi|543583788|ref|NM_001282545.1|', 'gi|543583786|ref|NM_001282543.1|', 'gi|543583785|ref|NM_000465.3|', 'gi|543583740|ref|NM_001282549.1|', 'gi|543583738|ref|NM_001282548.1|', 'gi|530384540|ref|XM_005249645.1|', 'gi|530384538|ref|XM_005249644.1|', 'gi|530384536|ref|XM_005249643.1|', 'gi|530384534|ref|XM_005249642.1|', 'gi|530373237|ref|XM_005265508.1|', 'gi|530373235|ref|XM_005265507.1|', 'gi|530364726|ref|XR_241081.1|', 'gi|530364725|ref|XR_241080.1|', 'gi|530364724|ref|XR_241079.1|'])

You can get around that using Python. Here's one option:

>>> NM_001282543_key = [key for key in genes.keys() if 'NM_001282543.1' in key][0]
>>> genes[NM_001282543_key][200:230]
>gi|543583786|ref|NM_001282543.1|:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG

See how I get the full first key that contains a match to the string NM_001282543.1. Then I use that key to do the updated equivalent of genes['NM_001282543.1'][200:230]

ADD REPLY • link 11 months ago by Wayne ★ 2.1k

score 0 · Answer 1 · 2023-12-15

0

Entering edit mode

11 months ago

mk ▴ 300

Turns out I should have been using the dmel-all-chromosome-r6.16.fasta.gz instead of the dmel-all-aligned fasta. There are no duplicate locations in the all-chromosome file so the issue goes away.

ADD COMMENT • link 11 months ago by mk ▴ 300