I'm attempting to use pyfaidx to index the dmel r6 genome build so that I can get actual sequences from tuples like (chromosome, start,end)
.
In the pyfaidx documentation, they describe this process using pyfaidx.Fasta
, which is able to access sequences using discrete chromosome locations:
>>> genes = Fasta('tests/data/genes.fasta')
>>> genes['NM_001282543.1'][200:230]
NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG
Using the [potentially wrong] genome build (dmel-all-aligned-r6.16.fasta
) from the flybase r6 page, I'm able to index the build with the following script. Note that without the read_long_names=True
option, the following call to pyfaidx.Fasta
throws error: ValueError: Duplicate key ":516332_sim4"
:
from pyfaidx import Fasta
# Path to the Drosophila genome reference file
genome_fasta_file = "dmel-all-aligned-r6.16.fasta"
# Load the reference genome
genome = Fasta(genome_fasta_file, read_long_names=True)
However, instead of giving discrete keys like in the pyfaidx tutorial, I end up with keys that are FastaRecord
objects, is there some alternate way to access these indices?
In particular, how would I reference just the loc
attribute of the following (indentation is mine):
>>> next(iter(genome))
FastaRecord(
":1539792_tblastxwrap_masked type=match;
program=tblastxwrap_masked;
sourcename=na_baylorf1_scfchunk.dpse;
version=1.0;
target=Dpf1glom15-1 21561 21701 -;
loc=2L:complement(22278645..22278785);
ID=:1539792_tblastxwrap_masked;
name=Dpf1glom15-1-AE002603.4_22263978_22335911-na_baylorf1_scfchunk.d;
MD5=f5bf1c797886d935433d8ef40a9ffed3; length=141; release=r6.16; species=Dmel;")
The documentation you quote is outdated. With the equivalent of the code you attempt you get an ordered dict object when you use
gene.keys()
and so thengenes['NM_001282543.1'][200:230]
fails. This shows the ordered dict so thatgenes['NM_001282543.1']
would fail even:You can get around that using Python. Here's one option:
See how I get the full first key that contains a match to the string
NM_001282543.1
. Then I use that key to do the updated equivalent ofgenes['NM_001282543.1'][200:230]