Sequence Parsed Through Biopython Different Than In The Database
1
0
Entering edit mode
11.3 years ago
Mateusz ▴ 70

Hi,

I am trying to access gene sequences of a given organism using BioPython. After search on a Gene database using e.g. "Escherichia coli[organism]" as query, I am downloading XML file containing genes locus and identifiers. Now, using those information I want to retrieve sequence from the entry in nucleotide database. For this purpose I type:

In [144]: handle = Entrez.efetch(db="nucleotide", id="NC_005327", rettype='gb', retmode='text')
In [145]: rec = SeqIO.read(handle, 'genbank')
In [146]: rec.seq
Out[146]: UnknownSeq(92353, alphabet = IUPACAmbiguousDNA(), character = 'N')
In [147]: print rec.seq.tostring()
Out[147]: NNNNNNNNNNNNNNNNNNNN (.....) NNNNNNNNNNNNNNNNNNNNNNNNN

When I am accessing nucleotide db entry of a given accession through the website, sequence is there and respective genes can be mapped onto it. What's wrong? Am I screwing something during parsing?

Thanks in advance for any insight!

sequence parsing biopython entrez • 3.6k views
ADD COMMENT
1
Entering edit mode
11.3 years ago
Peter 6.0k

The NCBI changed the Entrez behaviour so now retype='gb' can give your the annotation WITHOUT the actual sequence (GenBank lite?). Change it to retype='gbwithparts'.

You can see this for yourself by:

>>> print Entrez.efetch(db="nucleotide", id="NC_005327", rettype='gb', retmode='text').read()
...
     CDS             91849..92250
                     /gene="pemK"
                     /locus_tag="pC15-1a_102"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_957647.1"
                     /db_xref="GI:41057027"
                     /db_xref="GeneID:2716471"
                     /translation="MLKYQLKNENGWMHRRLVRRKSDMERGEIWLVSLDPTAGHEQQG
                     TRPVLIVTPAAFNRVTRLPVVVPVTSGGNFARTAGFAVSLDGVGIRTTGVVRCDQPRT
                     IDMKARGGKRLERVPETIMNEVLGRLSTILT"
CONTIG      join(AY458016.1:1..92353)
//

Versus:

>>> print Entrez.efetch(db="nucleotide", id="NC_005327", rettype='gbwithparts', retmode='text').read()
...
    92221 cttggccgcc tgtccactat tctgacttga acatggggtt tgaggggcaa ctggatgaaa
    92281 acgtacgatt tagggcacta aaaccgctgt tgtcccacca ttctggtgat tcccaaacgt
    92341 tatttggcta aaa
//
ADD COMMENT
0
Entering edit mode

Works perfectly, thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1892 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6