Question

How To Fetch Genomic Sequence Using Coordinates In Biopython

9

13.3 years ago

dustar1986 ▴ 380

Hi everyone,

I'm new to Biopython. My question may be stupid, but I would appreciate your help.

I want to use the chromosome number, start position, end position, and strand to fetch the corresponding sequence from the mouse genome.

How can this be done with Biopython, connecting to the NCBI database? Could anyone help me, please?

Thanks a lot.

biopython sequence retrieval entrez database • 20k views

ADD COMMENT • link updated 13.3 years ago by Steve Moss 2.3k • written 13.3 years ago by dustar1986 ▴ 380

0

Thanks a lot for your editing and rephrasing, Eric.

ADD REPLY • link 13.3 years ago by dustar1986 ▴ 380

score 22 · Answer 1 · 2011-08-25

22

13.3 years ago

Alex ★ 1.5k

This is very simple, but you need the sequence's GI number rather than the chromosome number. You can find the GI in NCBI's Nucleotide database.

For example, mouse chromosome 6 has GI = 307603377, and suppose you want the plus-strand sequence from 4000100 to 4000200:

from Bio import Entrez, SeqIO
Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
# Fetch only the requested slice of the record, in FASTA format
handle = Entrez.efetch(db="nucleotide",
                       id="307603377",     # GI of mouse chromosome 6
                       rettype="fasta",
                       strand=1,           # 1 = plus strand, 2 = minus strand
                       seq_start=4000100,
                       seq_stop=4000200)
record = SeqIO.read(handle, "fasta")
handle.close()
print(record.seq)

Parameter descriptions from NCBI's efetch help:

strand - what strand of DNA to show (1 = plus or 2 = minus)
seq_start - show sequence starting from this base number
seq_stop - show sequence ending on this base number
complexity - gi is often a part of a biological blob, containing other gis
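
As a rough sketch of how the inputs from the question (start, end, strand) could be mapped onto these parameters: the helper below, `efetch_region_kwargs`, is a hypothetical name and not part of Biopython. It only builds the keyword arguments and leaves the network call to `Entrez.efetch`. It assumes 1-based, inclusive coordinates and a "+" or "-" strand symbol.

```python
def efetch_region_kwargs(seq_id, start, end, strand="+"):
    """Build keyword arguments for Entrez.efetch for one genomic interval.

    Hypothetical helper (not part of Biopython). Assumes 1-based,
    inclusive coordinates, as expected by seq_start/seq_stop.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    return {
        "db": "nucleotide",
        "id": str(seq_id),
        "rettype": "fasta",
        "strand": 1 if strand == "+" else 2,  # efetch: 1 = plus, 2 = minus
        "seq_start": start,
        "seq_stop": end,
    }


# Same region as in the example above, but on the minus strand
kwargs = efetch_region_kwargs("307603377", 4000100, 4000200, "-")
print(kwargs)
```

The resulting dictionary could then be passed on as `Entrez.efetch(**kwargs)` and the handle read with `SeqIO.read(handle, "fasta")`, after setting `Entrez.email` as usual.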

ADD COMMENT • link 13.3 years ago by Alex ★ 1.5k

0

This is great.

I'm looking at some miRNA sequences for TFBS and was going to ask a similar question being a python newbie myself (although the Biopython cookbook was helping). Anyway, great timing!

ADD REPLY • link 13.3 years ago by Duff ▴ 670

0

Extremely helpful. Thanks a lot.

ADD REPLY • link 13.3 years ago by dustar1986 ▴ 380

0

Very helpful. I'm also working on promoter analysis of TFBS. Thanks!

ADD REPLY • link 13.1 years ago by Andrius • 0

0

The Human chromosomes follow this pattern: "NC_000001", "NC_000002", ..., "NC_000023" (X), "NC_000024" (Y)
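
A minimal sketch of that naming pattern in code, for human chromosomes only. The function name `human_chromosome_accession` is hypothetical. It returns the unversioned accession; an unversioned accession resolves to the current sequence version, so pinning a particular assembly would need the versioned accession for that build, which this sketch does not look up.

```python
def human_chromosome_accession(chrom):
    """Return the unversioned RefSeq accession for a human chromosome.

    Hypothetical helper. Follows the NC_0000NN pattern described above:
    chromosomes 1-22 use their own number, X maps to 23 and Y to 24.
    """
    name = str(chrom).upper()
    if name.startswith("CHR"):
        name = name[3:]  # accept UCSC-style names such as "chrX"
    special = {"X": 23, "Y": 24}
    if name in special:
        number = special[name]
    elif name.isdigit() and 1 <= int(name) <= 22:
        number = int(name)
    else:
        raise ValueError(f"unsupported chromosome: {chrom!r}")
    return f"NC_{number:06d}"


for chrom in (1, "chrX", "Y"):
    print(chrom, human_chromosome_accession(chrom))
```

The returned string can be used as the `id` argument of `Entrez.efetch` in place of a GI number.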

ADD REPLY • link 11.8 years ago by Leandro Lima ▴ 970

0

How can we get sequences for a specific genome build and group label? For example, for Homo sapiens, hg19 (GRCh37.p10)? Thanks.

ADD REPLY • link 10.8 years ago by burcakotlu ▴ 40

score 2 · Answer 2 · 2011-08-25

2

13.3 years ago

Leszek 4.2k

Another homework?
Please use a combination of googling and reading. Here you are: the Biopython cookbook.

ADD COMMENT • link 13.3 years ago by Leszek 4.2k

4

@Leszek: This should have been a comment, not an answer.

ADD REPLY • link 13.3 years ago by Thaman ★ 3.3k

3

No, it's not homework. Thanks for your suggestion. I'm currently doing some research on 3' UTR regions. I got the 3' UTR coordinates from UCSC and need the corresponding sequences. I know this can be done using Galaxy. As Galaxy is written in Python, I just wonder whether there is a module within Biopython that can do the same job.
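
One detail worth checking when taking coordinates from UCSC to efetch: UCSC tables and BED files store intervals as 0-based, half-open (the start is 0-based, the end is exclusive), whereas seq_start and seq_stop are 1-based and inclusive. A minimal sketch of the conversion, assuming a tab-separated BED-style line; the function name `bed_to_one_based` is hypothetical.

```python
def bed_to_one_based(bed_line):
    """Convert one BED-style line to 1-based, inclusive coordinates.

    Hypothetical helper. Assumes tab-separated fields in BED order:
    chrom, start (0-based), end (exclusive), name, score, strand.
    If the strand column is missing, "+" is assumed.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom = fields[0]
    start0 = int(fields[1])
    end0 = int(fields[2])
    strand = fields[5] if len(fields) > 5 else "+"
    # Only the start shifts by one; the exclusive end equals the inclusive end
    return chrom, start0 + 1, end0, strand


region = bed_to_one_based("chr6\t4000099\t4000200\tutr_example\t0\t-")
print(region)
```

The interval length is unchanged by the conversion: `end0 - start0` bases in BED terms, `end - start + 1` bases in 1-based terms.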

ADD REPLY • link 13.3 years ago by dustar1986 ▴ 380

score 1 · Answer 3 · 2011-08-25

1

13.3 years ago

Steve Moss 2.3k

I think you can also use Ensembl (and, I believe, NCBI) via the PyCogent toolkit to do this in Python.

Check out http://pycogent.sourceforge.net/ - the examples and cookbook contain some decent code that may be helpful :-)