Difficult To Download Gene Sequences From NCBI
11.6 years ago

Hello everyone: I'm having a problem trying to download gene sequences from the Gene database on the NCBI website using Biopython. I started with a basic test search for two gene records in the "gene" database for S. coelicolor (txid100226).

from Bio import Entrez
Entrez.email = "chief@marsstation.com"
handle = Entrez.esearch(db="gene",term="txid100226[Organism]",retmax=2)
record = Entrez.read(handle)

The first ID for the first hit on this search is:

record_list = record["IdList"]
print(record_list[0])
1096915

So this first ID was used to download the gene of interest by using this:

seq = Entrez.efetch(db="gene",id=record_list[0],rettype="fasta").read()

However the result stored in "seq" is the following:


http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd">
<Entrezgene-Set>

SCO1489 –DNA-binding protein [Streptomyces coelicolor A3(2)]

DNA-binding protein

Other Aliases:
SCO1489, SC9C5.13, bldD
Genomic context:
Chromosome
Annotation:
NC_003888.3 (1592381..1592884)
ID:
1096915
</Entrezgene-Set>

If I put db="protein" instead of gene I get the correct protein sequence.

I realize that one way to get the DNA sequence is to do it manually: take positions 1592381..1592884 directly from the S. coelicolor contig NC_003888.3 for this particular ID. That information is stored in "seq".

So here is the question: Is there any method (or trick) to download that DNA sequence using biopython? How can I solve this problem?

JFC

biopython entrez • 5.8k views
11.6 years ago
Neilfws 49k

The short answer is that rettype="fasta" is not a valid return type for the Gene database. Please refer to Table 1 in the EFetch section of the NCBI EUtils documentation.

The longer answer - how to solve this problem - I'll edit this answer later, no time to write it just now.


Even if I change the rettype, it doesn't work. The gene sequence for this example lies within the contig sequence, so the GI code for this sequence directs you to the contig. I don't know what to do to solve it, but thank you for your answer.


Well no, changing rettype won't work. The only valid rettype for db=Gene is gene_table; valid retmodes are asn.1, xml and text. In short: sequences cannot be retrieved from the Gene database.

11.6 years ago

Well I am not used to using Entrez gene but I think you are retrieving the Entrez gene page information instead of the sequence information. You should try either "genbank" or "nucleotide" instead of "gene" and see if it helps.


Thanks for your answer, but it didn't work :( If I use "gene bank" it displays an error and if I try with nucleotide database, what I get is the whole contig. Hmm, about using Entrez gene I'm sure that I'm not retrieving the information page, because I get a protein sequence.
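
If the whole contig has already been downloaded from the nucleotide database, the gene can also be cut out locally. The following is only a minimal sketch, not something posted in this thread: read_fasta_sequence and slice_region are hypothetical helper names, the coordinates are assumed to be 1-based and inclusive as shown on the Gene page, and the short demo sequence is made up for illustration.

```python
# Translation table for complementing DNA; other characters (e.g. N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta_sequence(fasta_text):
    """Return the sequence of a single-record FASTA string, without the header line."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def slice_region(sequence, start, stop, reverse_complement=False):
    """Cut out positions start..stop (1-based, inclusive) from a contig sequence."""
    if not 1 <= start <= stop <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    region = sequence[start - 1:stop]  # convert to Python's 0-based, half-open slice
    if reverse_complement:
        # For genes annotated on the complement strand.
        region = region.translate(_COMPLEMENT)[::-1]
    return region


# Made-up demo record, standing in for the contig FASTA returned by efetch.
demo_fasta = ">demo\nAACCGGTT\nACGT\n"
demo_sequence = read_fasta_sequence(demo_fasta)
demo_region = slice_region(demo_sequence, 3, 6)
```

For the gene in the question, the call would be slice_region(contig_sequence, 1592381, 1592884), where contig_sequence is the sequence read from the NC_003888.3 FASTA.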

11.6 years ago
Leandro Lima ▴ 970

Hello! I think this could help you.

problem when downloading large number of sequences from Genbank


Not really since fasta cannot be retrieved from the Gene database.


In this case, db="nuccore"
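
A possible way to apply this, sketched under some assumptions: EFetch accepts seq_start, seq_stop and strand parameters for sequence databases, and Biopython's Entrez.efetch passes extra keyword arguments through to EFetch, so only the gene's region needs to be requested rather than the whole contig. parse_location and fetch_gene_region are hypothetical helper names, and the "accession (start..stop)" annotation format is assumed from the Gene record shown in the question.

```python
import re


def parse_location(annotation):
    """Turn 'NC_003888.3 (1592381..1592884)' into EFetch keyword arguments.

    A trailing ', complement' inside the parentheses is assumed to mark
    the minus strand.
    """
    match = re.match(
        r"\s*(\S+)\s*\((\d+)\.\.(\d+)(?:,\s*(complement))?\)", annotation
    )
    if match is None:
        raise ValueError("unrecognised annotation: %r" % annotation)
    accession, start, stop, complement = match.groups()
    return {
        "id": accession,
        "seq_start": int(start),  # 1-based first position
        "seq_stop": int(stop),    # 1-based last position
        "strand": 2 if complement else 1,  # EFetch: 1 = plus, 2 = minus
    }


def fetch_gene_region(annotation, email):
    """Download only the annotated region as FASTA text (needs network access)."""
    # Imported here so that parse_location can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.efetch(
        db="nuccore", rettype="fasta", retmode="text", **parse_location(annotation)
    )
    try:
        return handle.read()
    finally:
        handle.close()


params = parse_location("NC_003888.3 (1592381..1592884)")
```

Calling fetch_gene_region("NC_003888.3 (1592381..1592884)", "chief@marsstation.com") should then return a FASTA record for just that stretch of the contig, assuming the coordinates in the Gene record are current.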
