Here is python script that can also be used to download GenBank files given accession numbers.
Copy the python code below and save it as fetch_accession.py. Create a text file and give it a name say accessions.txt. Write these accession numbers (AP010935.1, AE014074.1, CP001129.1) to the file and save.
Run it as follows:
./fetch_accession.py -a accessions.txt
The output should files with the accession id followed by .gb
and .fasta
extensions.
#!/usr/bin/env python
import os, sys, glob
from Bio import SeqIO, Entrez
import argparse
def main():
opts=argparse.ArgumentParser(sys.argv[0],description="fetch genbank files given accessions",
prefix_chars="-",add_help=True,epilog="2014")
opts.add_argument("-a",action="store",nargs=1,metavar="ACCESSION FILE",dest="ACCESSION",help="specify accessions file",required=True)
options=opts.parse_args()
input_file=options.ACCESSION[0]
indata=""
ncounter=0
try:
indata=[str(j).strip() for j in open(str(input_file),"rU")]
except IOError as inError:
print "Error: Failed to open input file "+str(input_file)+"..."
print "\n\nException caught: \n",inError
sys.exit()
try:
for I in list(indata):
ncounter=ncounter+1
print "Parsing...\t"+str(i)+"\t"+str(ncounter)+"/"+str(len(indata))
seqfile=Entrez.efetch(email="example@email.com",db="nucleotide",id=str(i),retmode="text",rettype="gb")
gb_record=SeqIO.read(seqfile,"genbank")
SeqIO.write(gb_record,str(i)+".gb","genbank")
SeqIO.write(gb_record,str(i)+".fasta","fasta")
except Exception as exError:
print "Error: Problem encountered while downloading files..."
sys.exit()
if __name__ == "__main__":
main()
sorry to ruin the party, but there are already the Entrez Direct E-utilities tools that allow you to do so. Nice work anyway!
Oh they look quite handy! I knew E-utilities before but I didn't know that they can be run as unix commands. Thanks for sharing.
Actually I find it extremely helpful - thanks for sharing it!
EDIT: After testing the script with more sequences, it is failing to retrieve some 10% of the CDS. I tried to re-run these which failed, but they failed again. "Supplied id parameter is empty." was retrieved for them instead of their sequence. I checked for these protein IDs their GenBank entries and the CDS information is there. So either there are some hidden problems in the GenBank or your script has some flaws.
Thanks for your comment! Can you provide some examples (protein accession no.s) that result in "supplied id parameter is empty"? Let me see if I can fix it.
Sure, e.g.: XP_012247746 XP_012248212 XP_012245590 (I tried to run the python script twice so it should not be just a temporal NCBI hiccup)
While using your script again after some time I found that all the sequences for which retrieval fails ('Supplied id parameter is empty.'), are label in GenBank as "partial mRNA", i.e. not full CDS while it did not fail for any of my protein IDs for which there is a corresponding full mRNA in the database.
UPDATE: the incomplete CDS have ">" or "<" characters in their header to indicate their incompletness at the 3' resp 5' end, i.e. XP_012174533.1|XM_012319143.2:<1..942 or XP_012174003.2|XM_012318613.2:20..>1197 ...maybe handling these characterscauses some problems to the script...