Question

Is there a specific option in BioPython that lets you call the gene name given an address of a nucleotide?

0

Entering edit mode

8.7 years ago

Tom ▴ 20

So I have a simple task, but I cannot find the specific method on BioPython documentation. I have a genbank of a bacteria. I have a list numbers, which are locations of specific SNPs on the genome:

234
1135
51515
63475
...etc

I want to know what gene each of these addresses correspond to. What series of methods do I use?

As a clear example, the final output should look like this:

234 : Is within Dehydroxylate III
1135 : Is within Dehydroxylate III
51515 : Is within Collagen Repeat factor alpha
63475 : is within DNA Polymerase III
94818 : is within DNA Helicase subunit 1
etc...

BioPython • 2.3k views

ADD COMMENT • link updated 8.7 years ago by Steven Lakin ★ 1.8k • written 8.7 years ago by Tom ▴ 20

0

Entering edit mode

You know the name of your bacteria, am I right?

See the following post, find my answer to that bacterial question.

C: where can I get environmental bacteria genome in fasta format (as many as possib

Current version of NCBI provides a file with urls for all bacteria in the database.

Find your bacterium name, press on it, copy this url and apply in some browser.

You will easily find all files available in NCBI for your bacterium.

faa-files have a list of all bacterial genes, and they provide coordinates of these genes

ffn-files have the same gene order but no coordinates, unfortunately.

What is nice - these files are collinear. Most bacteria have a single chromosome.

First, you find your gene in faa-file (they have their coordinates in their header).

So this is a simple search in protein fasta-file headers,

If .fna file is provided, you can map the corresponding pretranslated gene (.ffn)

having needed coordinate from your list to the chromosome *.fna-file.

I would try to do something like that. It doesn't look easy,

so I hope some biopython option does exist.

ADD REPLY • link 8.7 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

No , this is for custom genbanks.

I need an output file that will look like this

234 : Is within Dehydroxylate III
1135 : Is within Phosponate gamma
51515 : Is within Collagen Repeat factor alpha
63475 : is within DNA Polymerase III
...etc

ADD REPLY • link 8.7 years ago by Tom ▴ 20

0

Entering edit mode

There were the following posts in Biostars, I hope they may help you.

A: Identify Snps For Bacteria - Annotation

A: Variant Discovery In Bacteria

SOLVED***EDIT*** BioPython: how to edit a fasta file to include a large deletion

Get gene names from rs SNP ids

ADD REPLY • link 8.7 years ago by natasha.sernova ★ 4.0k

score 2 · Answer 1 · 2016-04-03

You're looking for this section in the biopython tutorial and cookbook

4.3.2.4 Location testing

You can use the Python keyword in with a SeqFeature or location object to see if the base/residue for a parent coordinate is within the feature/location or not.

For example, suppose you have a SNP of interest and you want to know which features this SNP is within, and lets suppose this SNP is at index 4350 (Python counting!). Here is a simple brute force solution where we just check all the features one by one in a loop:

>>> from Bio import SeqIO
>>> my_snp = 4350
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> for feature in record.features:
...     if my_snp in feature:
...         print("%s %s" % (feature.type, feature.qualifiers.get('db_xref')))
...
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
Note that gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons – they do not cover any introns.