Question

NCBI Entrez not returning the correct DNA sequence for all entries

0

Entering edit mode

7.1 years ago

ahir29 ▴ 10

I wrote a python definition that accepts a multi-fasta file (usually after performing a blast search and downloading the fasta file of all the hits) and basically searches the "identical protein groups" database at ncbi, finds the corresponding "nuccore" database accession number, and the start and stop position of the gene from the full genome. I use this information to do another efetch request of the nuccore database and therefore retrieve the corresponding DNA sequence that codes for the specific protein. It works perfectly for some proteins but fails completely in others! But, when I access the nuccore database manually , I see that the gene is actually correct, it is just my script has failed to retrieve it (and infact has outputted a DNA sequence that is not even a truncation).

Why is it failing for some and working for others in the same loop?! Wondering if anyone can spot my mistake. I give you a sample multi-fasta that I tried, with the first one that doesnt work ,and the second one that does, for example.

gi|1205103960|ref|WP_087811211.1| NAD(P)-dependent alcohol dehydrogenase [Streptomyces sp. CS113] MKALQYRTIGAPPEVVTVPDPEPGSGQVLLKVTAAGVCHSDIAVMGWPAEGFPYELPLTL GHEGVGTVAALGAGVTGLAEGDAVAVYGPWGCGTCAKCAEGKENYCLRADELGIRPPGLG RPGSMAEYLLIDDPRHLVPLDGLDPVAAVPLTDAGLTPYHAIKRSLPKLVPGSTAVVIGT GGLGHVAIQLLRALTSARVVALDVSEEKLRLAREVGAHEAVLSDAKAADAVRRITGGLGA EAVFDFVGVAPTVKTAGAVAAVEGDVTLVGI-GGGSLPVGFGILPFEVSVNAPYWGSRSE LIEVLNLARSGAVSVHTETYSLDDAPLAYERLHEGKVNGRAVILP

gi|494710617|ref|WP_007446488.1| NAD(P)-dependent alcohol dehydrogenase [Streptomyces coelicoflavus] MKALQYRTIGAPPEVVTVPDPEPGPGQVLLKVTAAGVCHSDIAVMGWPAEGFPYELPLTL GHEGVGTVAALGAGVTGVAEGDAVAVYGPWGCGTCAKCAEGKENYCLRADELGIRPPGLG RPGSMAEYLLIDDPRHLVPLDGLDPVAAVPLTDAGLTPYHAIKRSLPKLVPGSTAVVIGT GGLGHVAIQLLRALTSARVVALDVSEEKLRLARAVGAHEAVLSDAKAADAVRELTGGLGA EAVFDFVGAAPTVRTAGAVAAVEGDVTLVGI-GGGSLPVGFGVLPFEVSVNAPYWGSRSE LIEVLNLARSGAVSVHTETYSLDDAPLAYERLHEGRVNGRAVILP

def protein_to_genomicdna(blast_fasta_file):
    handle = open(blast_fasta_file,"r")
    records = SeqIO.parse(handle, "fasta")
    records_dict = SeqIO.to_dict(records)
    handle.close()
    filename = blast_fasta_file.split(".")[0]
    dnafile = filename+"dna.fasta"
    with open("./"+dnafile,"w") as dna_file:
        for k in records_dict:
            description  = str(records_dict[k].description)
            prot_id = k.split("|")[3].strip()
            print "Acquiring NCBI DNA of", prot_id
            Entrez.email = "I entered my email"
            Entrez.tool = "MyLocalScript"
            Entrez.api_key = "I entered my API key"
            handle = Entrez.efetch(db = "protein", id=prot_id,rettype="ipg",retmode="xml")
            record = Entrez.parse(handle)

            first_rec = next(record)
            for item in (first_rec[u'ProteinList']):
                accession_check = item.attributes
                if accession_check[u'accver'] == prot_id:
                    facts = item[u'CDSList']
                    facts0 = facts[0]
                    facts_attr = facts0.attributes
                    dnaacc = str(facts_attr[u'accver']).replace('u','')
                    start1 = str(facts_attr[u'start']).replace('u','')
                    stop1 = str(facts_attr[u'stop']).replace('u','')
                    print dnaacc,start1,stop1
                    dnahandle= Entrez.efetch(db = "nuccore", id=dnaacc,rettype="fasta",retmode="txt",seq_start=start1,seq_stop=stop1)
                    print dnahandle
                    dnarecord = SeqIO.parse(dnahandle,"fasta")
                    dnarecord_dict = SeqIO.to_dict(dnarecord)
                    header = (">" + description + "\n")
                    print header
                    print dnarecord_dict

The output for these two sequences , if I ran the definition on that fasta file is :

> Acquiring NCBI DNA of WP_087811211.1 NZ_KZ195574.1 8125684 8126724
> <addinfourl at 53963392 whose fp = <socket._fileobject object at
> 0x03384CF0>>
> >gi|1205103960|ref|WP_087811211.1| NAD(P)-dependent alcohol dehydrogenase [Streptomyces sp. CS113]
> 
> {'NZ_KZ195574.1:8125684-8126724':
> SeqRecord(seq=Seq('TCAGCCGTGGGGCAGGATCACCGCGCGGCCGTTGACCTTGCCCTCGTGCAGCCG...CAT',
> SingleLetterAlphabet()), id='NZ_KZ195574.1:8125684-8126724',
> name='NZ_KZ195574.1:8125684-8126724',
> description='NZ_KZ195574.1:8125684-8126724 Streptomyces sp. CS113
> scaffold00001, whole genome shotgun sequence', dbxrefs=[])}

> Acquiring NCBI DNA of WP_007446488.1 NZ_AHGS01001332.1 5588 6628
> <addinfourl at 53945328 whose fp = <socket._fileobject object at
> 0x03384E70>>
> >gi|494710617|ref|WP_007446488.1| NAD(P)-dependent alcohol dehydrogenase [Streptomyces coelicoflavus]
> 
> {'NZ_AHGS01001332.1:5588-6628':
> SeqRecord(seq=Seq('ATGAAGGCACTGCAGTACCGCACCATCGGGGCCCCGCCCGAGGTCGTCACCGTC...TGA',
> SingleLetterAlphabet()), id='NZ_AHGS01001332.1:5588-6628',
> name='NZ_AHGS01001332.1:5588-6628',
> description='NZ_AHGS01001332.1:5588-6628 Streptomyces coelicoflavus
> ZG0656 contig01375, whole genome shotgun sequence', dbxrefs=[])}

If you notice, the first sequence does not start at ATG (start codon) and end with a stop codon, even though it has searched the nuccore database perfectly. If you go manually to that endpoint , you see clearly that the proper full gene does exist, so why didnt it get it programmatically?

I have this same scenario on multiple sequences, some work fine, some dont - in the same definition..... Help me!

ncbi entrez sequence retrieval • 2.2k views

ADD COMMENT • link updated 7.1 years ago by piet ★ 1.9k • written 7.1 years ago by ahir29 ▴ 10

score 1 · Answer 1 · 2018-10-19

dnahandle= Entrez.efetch(db = "nuccore", id=dnaacc,rettype="fasta",retmode="txt",seq_start=start1,seq_stop=stop1)

You are passing the "accession" to "id". While accession is an alphanumerical string, the id is a pure number. The id of the "nuccore" table is also called GI. In NCBI Genbank, such numerical ID have been assign to all tables in their database (nuccore, protein, taxonomy, assembly, biosample ...) as primary keys. The Entrez design is completely bases on using these numerical primary keys.