Let's say I have a GenBank ID for a genome (BA000007.3) and a list with genome coordinates for blocks:
blocks = [(1443423, 1444553), ...]
I need to find out what genes (or parts of what genes) are covered by these blocks.
I've tried this:
from Bio import Entrez, SeqIO
Entrez.email = "some@example.com"
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="BA000007.3") as handle:
seq_record = SeqIO.read(handle, "gb")
sub_record = seq_record[1443423:1444553]
print(len(sub_record.features))
It returns 0, as no genes are fully covered by this block. However, if you open the NCBI record in a browser and change region shown you will see that there are 2 genes that are partially covered by this block.
I've also tried substracting with SeqFeature and FeatureLocation but it didn't work either (a new sequence doesn't have features at all).
So is there any way to do this without manually checking each block for each genome?
Your approach should work, but you may have to come up with some new logic if you’re interested in partial coverage too. This is exactly the approach I’ve used myself for extracting genes within ranges:
https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py
Off the top of my head, you could check something like (pseudocode):
Thank you for the answer! I've thought of this but as I have many blocks for many genomes with many features I was wondering if there is more elegant way. I guess such a task may be quite common.
I’m not sure what you mean? What is the exact bit that’s stumping you?
I'm worried that due to a large number of features and blocks this approach will take a lot of time. That's why I was asking for a more faster way (maybe there is a non-obvious way to do this using BioPython).
The sequence subsetting should be pretty rapid. You might encounter memory issues if you try to hold all the results in memory at once.
Do you need the entrez bit in your code or can you do it all locally? The entrez queries will likely be much slower than the block querying.
If that’s still too slow, (bio)python is not the language for you.