Tool:[share] Script to bulk retrieve DNA sequences by protein IDs from GenBank
4
8
Entering edit mode
10.0 years ago
qiyunzhu ▴ 130

Hi all,

I recently wrote a very simple Python script, and guess it would be useful for some people. You throw in a list of protein accessions or GIs, and the script will download the corresponding DNA sequences (CDS) from NCBI GenBank.

Here is the download link: https://drive.google.com/uc?export=download&id=0BxcrsszS5z4FMGt4UklSeGo4UWs

Have fun!

Best

python genbank sequence cds protein • 6.9k views
ADD COMMENT
1
Entering edit mode

sorry to ruin the party, but there are already the Entrez Direct E-utilities tools that allow you to do so. Nice work anyway!

ADD REPLY
0
Entering edit mode

Oh they look quite handy! I knew E-utilities before but I didn't know that they can be run as unix commands. Thanks for sharing.

ADD REPLY
0
Entering edit mode

Actually I find it extremely helpful - thanks for sharing it!

EDIT: After testing the script with more sequences, it is failing to retrieve some 10% of the CDS. I tried to re-run these which failed, but they failed again. "Supplied id parameter is empty." was retrieved for them instead of their sequence. I checked for these protein IDs their GenBank entries and the CDS information is there. So either there are some hidden problems in the GenBank or your script has some flaws.

ADD REPLY
0
Entering edit mode

Thanks for your comment! Can you provide some examples (protein accession no.s) that result in "supplied id parameter is empty"? Let me see if I can fix it.

ADD REPLY
0
Entering edit mode

Sure, e.g.: XP_012247746 XP_012248212 XP_012245590 (I tried to run the python script twice so it should not be just a temporal NCBI hiccup)

ADD REPLY
0
Entering edit mode

While using your script again after some time I found that all the sequences for which retrieval fails ('Supplied id parameter is empty.'), are label in GenBank as "partial mRNA", i.e. not full CDS while it did not fail for any of my protein IDs for which there is a corresponding full mRNA in the database.

UPDATE: the incomplete CDS have ">" or "<" characters in their header to indicate their incompletness at the 3' resp 5' end, i.e. XP_012174533.1|XM_012319143.2:<1..942 or XP_012174003.2|XM_012318613.2:20..>1197 ...maybe handling these characterscauses some problems to the script...

ADD REPLY
2
Entering edit mode
10.0 years ago
lelle ▴ 830

I was just working on something similar, so I gave it a try.

Here is some feedback:

It does not adhere to the eutils user guideline It is very likely that it will make more than 3 requests per second.

It does not work for all entries (e.g. this one).

It should give some indication if it can not find a DNA sequence for a protein and not fail silently like it is now.

ADD COMMENT
0
Entering edit mode

Thanks for your comments. All of them are helpful to me. I should have spent a little more time on error handling.

ADD REPLY
0
Entering edit mode
10.0 years ago
5heikki 11k

Here's a script that utilizes Entrez Direct and protein GIs:

#!/bin/bash
IFS=$'\n'

if [ -n "$1" ]
then
  split -l 500 $1 input.

  for f in input.*
  do
    query=$(cat $f | tr "\n" ",")
    efetch -db protein -id $query -format fasta > $f.faa.tmp
    rm $f
  done

  cat *.faa.tmp | grep . > $1.faa
  rm *.faa.tmp

else
  echo "Usage: fetchSeqs listOfNcbiGis"
  echo "Requires Entrez Direct in \$PATH: http://www.ncbi.nlm.nih.gov/books/NBK179288/"
fi

edit. Actually, it doesn't do what OP's script does. Just proteins with protein GIs. If you want nucleotides with protein GIs, replace the efetch line with:

epost -db protein -id $query | elink -target nuccore | efetch -format fasta_cds_na > $f.faa.tmp

edit2. actually, I think this fetches all CDS of the related nt molecule. I linked gi to cds_nt before but can't remember right now how to do it :(

ADD COMMENT
0
Entering edit mode

This also fails for my example record above, but I start to think that is because the information is just not there.

ADD REPLY
0
Entering edit mode

NCBI records seem to have lots of issues. For example, some sequences point to an organism, but its taxonomic information is obsolete. It is indeed tricky to work with NCBI data though.

ADD REPLY
0
Entering edit mode
10.0 years ago

Here is python script that can also be used to download GenBank files given accession numbers.

Copy the python code below and save it as fetch_accession.py. Create a text file and give it a name say accessions.txt. Write these accession numbers (AP010935.1, AE014074.1, CP001129.1) to the file and save.

Run it as follows:

./fetch_accession.py -a accessions.txt

The output should files with the accession id followed by .gb and .fasta extensions.

#!/usr/bin/env python
import os, sys, glob
from Bio import SeqIO, Entrez
import argparse

def main():
    opts=argparse.ArgumentParser(sys.argv[0],description="fetch genbank files given accessions",
        prefix_chars="-",add_help=True,epilog="2014")
    opts.add_argument("-a",action="store",nargs=1,metavar="ACCESSION FILE",dest="ACCESSION",help="specify accessions file",required=True)
    options=opts.parse_args()
    input_file=options.ACCESSION[0]
    indata=""
    ncounter=0
    try:
        indata=[str(j).strip() for j in open(str(input_file),"rU")]
    except IOError as inError:
        print "Error: Failed to open input file "+str(input_file)+"..."
        print "\n\nException caught: \n",inError
        sys.exit()

    try:
        for I in list(indata):
            ncounter=ncounter+1
            print "Parsing...\t"+str(i)+"\t"+str(ncounter)+"/"+str(len(indata))
            seqfile=Entrez.efetch(email="example@email.com",db="nucleotide",id=str(i),retmode="text",rettype="gb")
            gb_record=SeqIO.read(seqfile,"genbank")
            SeqIO.write(gb_record,str(i)+".gb","genbank")
            SeqIO.write(gb_record,str(i)+".fasta","fasta")
    except Exception as exError:
        print "Error: Problem encountered while downloading files..."
        sys.exit()
if __name__ == "__main__":
    main()
ADD COMMENT
0
Entering edit mode
10.0 years ago
qiyunzhu ▴ 130

Thanks guys for your insightful comments and contributions to this issue!

ADD COMMENT
0
Entering edit mode

Hello every one

I have a list of genes accession number and i want to download corresponding CDS sequence for each from NCBI in fasta format kindly guide me how can i do so ??? I'm new in bioinformatics ..

and the following link is daed .. :(

Here is the download link: https://drive.google.com/uc?export=download&id=0BxcrsszS5z4FMGt4UklSeGo4UWs

ADD REPLY

Login before adding your answer.

Traffic: 2957 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6