Question

Difficult To Download Gene Sequences From NCBI

0

11.7 years ago

jcastrofigueroa ▴ 140

Hello everyone: I'm having a problem downloading gene sequences from the Gene database on the NCBI website using Biopython. I started with a basic test search for two gene records in the "gene" database for S. coelicolor (txid100226).

from Bio import Entrez
Entrez.email = "chief@marsstation.com"
handle = Entrez.esearch(db="gene",term="txid100226[Organism]",retmax=2)
record = Entrez.read(handle)

The first ID for the first hit on this search is:

record_list = record["IdList"]
print(record_list[0])
1096915

So this first ID was used to download the gene of interest by using this:

seq = Entrez.efetch(db="gene",id=record_list[0],rettype="fasta").read()

However the result stored in "seq" is the following:


http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd">
<Entrezgene-Set>SCO1489 – DNA-binding protein [Streptomyces coelicolor A3(2)]
DNA-binding protein
Other Aliases: SCO1489, SC9C5.13, bldD
Genomic context: Chromosome
Annotation: NC_003888.3 (1592381..1592884)
ID: 1096915
</Entrezgene-Set>

If I put db="protein" instead of gene I get the correct protein sequence.

I realize that one way to get the DNA sequence is to extract it manually from the contig NC_003888.3 of S. coelicolor at positions 1592381..1592884 for this particular ID. That information is stored in "seq".
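The accession and coordinates mentioned here can be pulled out of the returned text with a regular expression. This is only a minimal sketch: it assumes the summary contains a fragment of the form `ACCESSION (start..stop)`, as in the single output shown above, which is not a documented format. The helper name `parse_location` is hypothetical, and genes on the reverse strand (which may be printed differently) are not handled.

```python
import re

# Matches e.g. "NC_003888.3 (1592381..1592884)". The pattern is an assumption
# derived from one example output, not from an NCBI format specification.
LOCATION_RE = re.compile(r"([A-Z]{1,2}_?\d+(?:\.\d+)?)\s*\((\d+)\.\.(\d+)")


def parse_location(summary_text):
    """Return (accession, start, stop) found in a Gene summary, or None."""
    match = LOCATION_RE.search(summary_text)
    if match is None:
        return None
    accession, start, stop = match.groups()
    return accession, int(start), int(stop)


example = "Annotation: \nNC_003888.3 (1592381..1592884)\nID:\n 1096915"
print(parse_location(example))  # ('NC_003888.3', 1592381, 1592884)
```

Returning `None` rather than raising leaves it to the caller to decide how to treat records that have no genomic location line.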

So here is the question: Is there any method (or trick) to download that DNA sequence using Biopython? How can I solve this problem?

JFC

biopython entrez • 5.9k views

ADD COMMENT • link updated 11.7 years ago by Leandro Lima ▴ 970 • written 11.7 years ago by jcastrofigueroa ▴ 140

score 1 · Answer 1 · 2013-04-26

1

11.7 years ago

Neilfws 49k

The short answer is that rettype="fasta" is not a valid return type for the Gene database. Please refer to Table 1 in the EFetch section of the NCBI EUtils documentation.

The longer answer - how to solve this problem - I'll edit this answer later, no time to write it just now.

ADD COMMENT • link 11.7 years ago by Neilfws 49k

0

Even if I change the rettype, it doesn't work. The gene sequence in this example lies within a contig sequence, so the GI for this sequence points to the contig. I don't know how to solve it, but thank you for your answer.

ADD REPLY • link 11.7 years ago by jcastrofigueroa ▴ 140

0

Well no, changing rettype won't work. The only valid rettype for db=Gene is gene_table; valid retmodes are asn.1, xml and text. In short: sequences cannot be retrieved from the Gene database.

ADD REPLY • link 11.7 years ago by Neilfws 49k

score 0 · Answer 2 · 2013-04-26

0

11.7 years ago

Ashutosh Pandey 12k

Well I am not used to using Entrez gene but I think you are retrieving the Entrez gene page information instead of the sequence information. You should try either "genbank" or "nucleotide" instead of "gene" and see if it helps.

ADD COMMENT • link 11.7 years ago by Ashutosh Pandey 12k

0

Thanks for your answer, but it didn't work :( If I use "gene bank" it raises an error, and if I try the nucleotide database, I get the whole contig. As for Entrez Gene, I'm sure that I'm not retrieving the information page, because I get a protein sequence.

ADD REPLY • link 11.7 years ago by jcastrofigueroa ▴ 140

score 0 · Answer 3 · 2013-04-26

0

11.7 years ago

Leandro Lima ▴ 970

Hello! I think this could help you.

problem when downloading large number of sequences from Genbank

ADD COMMENT • link 8.3 years ago by Leandro Lima ▴ 970

0

Not really, since FASTA cannot be retrieved from the Gene database.

ADD REPLY • link 11.7 years ago by Neilfws 49k

1

In this case, db="nuccore"

ADD REPLY • link 11.7 years ago by Leandro Lima ▴ 970
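To make the db="nuccore" suggestion concrete: EFetch accepts `seq_start` and `seq_stop` parameters for sequence databases, so a sub-range of the contig can be requested instead of the whole record. The following is a minimal sketch, not a tested recipe from this thread. It assumes the accession and 1-based coordinates reported in the question (NC_003888.3, 1592381..1592884) and that the gene is on the plus strand. The function names `build_efetch_args` and `fetch_region` are hypothetical.

```python
def build_efetch_args(accession, start, stop, strand=1):
    """Build keyword arguments for Entrez.efetch on a sequence sub-range."""
    if start < 1 or stop < start:
        raise ValueError("expected 1-based coordinates with start <= stop")
    return {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        "strand": strand,  # 1 = plus strand, 2 = minus strand
    }


def fetch_region(accession, start, stop, email, strand=1):
    """Download one region as FASTA text (requires network access)."""
    # Imported here so build_efetch_args can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.efetch(**build_efetch_args(accession, start, stop, strand))
    try:
        return handle.read()
    finally:
        handle.close()
```

Under these assumptions, calling `fetch_region("NC_003888.3", 1592381, 1592884, "you@example.org")` should return a single FASTA record of 504 bases (1592884 - 1592381 + 1). If the gene turns out to be annotated on the minus strand, passing `strand=2` requests the reverse complement.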