Question

Extracting blocks features by their coordinates in a genome (using BioPython)?

0

Entering edit mode

5.4 years ago

little_more ▴ 70

Let's say I have a GenBank ID for a genome (BA000007.3) and a list with genome coordinates for blocks:

blocks = [(1443423, 1444553), ...]

I need to find out what genes (or parts of what genes) are covered by these blocks.

I've tried this:

from Bio import Entrez, SeqIO

Entrez.email = "some@example.com"
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="BA000007.3") as handle: 
    seq_record = SeqIO.read(handle, "gb")

sub_record = seq_record[1443423:1444553]
print(len(sub_record.features))

It returns 0, as no genes are fully covered by this block. However, if you open the NCBI record in a browser and change region shown you will see that there are 2 genes that are partially covered by this block.

I've also tried substracting with SeqFeature and FeatureLocation but it didn't work either (a new sequence doesn't have features at all).

So is there any way to do this without manually checking each block for each genome?

biopython • 1.8k views

ADD COMMENT • link updated 5.4 years ago by vkkodali_ncbi ★ 3.8k • written 5.4 years ago by little_more ▴ 70

0

Entering edit mode

Your approach should work, but you may have to come up with some new logic if you’re interested in partial coverage too. This is exactly the approach I’ve used myself for extracting genes within ranges:

https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py

Off the top of my head, you could check something like (pseudocode):

if feature.loc.start() or feature.loc.end() in range(block):
    print(feature)

ADD REPLY • link 5.4 years ago by Joe 21k

0

Entering edit mode

Thank you for the answer! I've thought of this but as I have many blocks for many genomes with many features I was wondering if there is more elegant way. I guess such a task may be quite common.

ADD REPLY • link 5.4 years ago by little_more ▴ 70

0

Entering edit mode

I’m not sure what you mean? What is the exact bit that’s stumping you?

ADD REPLY • link 5.4 years ago by Joe 21k

0

Entering edit mode

I'm worried that due to a large number of features and blocks this approach will take a lot of time. That's why I was asking for a more faster way (maybe there is a non-obvious way to do this using BioPython).

ADD REPLY • link 5.4 years ago by little_more ▴ 70

0

Entering edit mode

The sequence subsetting should be pretty rapid. You might encounter memory issues if you try to hold all the results in memory at once.

Do you need the entrez bit in your code or can you do it all locally? The entrez queries will likely be much slower than the block querying.

If that’s still too slow, (bio)python is not the language for you.

ADD REPLY • link 5.4 years ago by Joe 21k

score 0 · Answer 1 · 2019-07-04

You can use the Entrez Direct tool efetch as follows:

efetch -db nuccore -id BA000007.3 -format gb -seq_start 1443423 -seq_stop 1444553

This will return a genbank flat file with the two partial gene features. This translates to the following in biopython:

with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="BA000007.3", seq_start="1443423", seq_stop="1444553") as handle: 
    seq_record = SeqIO.read(handle, "gb")

Now, the seq_record object has the two gene features:

>>> print(seq_record.features)
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1131), strand=1), type='source'), 
SeqFeature(FeatureLocation(BeforePosition(0), AfterPosition(1131), strand=1), type='misc_feature'), 
SeqFeature(FeatureLocation(BeforePosition(0), ExactPosition(497), strand=1), type='gene'), 
SeqFeature(FeatureLocation(BeforePosition(0), ExactPosition(497), strand=1), type='CDS'), 
SeqFeature(FeatureLocation(ExactPosition(868), AfterPosition(1131), strand=1), type='gene'), 
SeqFeature(FeatureLocation(ExactPosition(868), AfterPosition(1131), strand=1), type='CDS')]