Best way to obtain info from genbank given accession codes
18 months ago

Hi all,

I have a lot of accession codes (on the order of a million) for which I need to fetch information from GenBank. I found them by BLASTing nucleotide sequences against the nt database. I'd like to know whether they have associated protein sequences and, if so, fetch them. I'm currently doing this with Biopython's Entrez module. I was hoping to process the codes in parallel in batches of 100, because fetching the info for one code can take up to a minute, but I'm limited to 3 queries per second. Also, I have to fetch the entire entry each time, even though knowing just the organism and protein name (if any) would often be enough to filter out the codes that won't be helpful.
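For reference, the batched approach described above might look roughly like the sketch below. It assumes Biopython is installed and that `rettype="fasta_cds_aa"` on the nuccore database returns only the translated CDS features, which avoids downloading whole entries. The names `batches` and `fetch_cds_proteins` and the e-mail address are placeholders, not part of any library.

```python
from typing import Iterable, Iterator, List


def batches(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Split accession codes into lists of at most `size` items."""
    batch: List[str] = []
    for acc in ids:
        batch.append(acc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def fetch_cds_proteins(batch: List[str], email: str) -> str:
    """Return protein FASTA for the annotated CDS features of a batch.

    Needs Biopython and network access; the import is local so the
    batching helper above can be used without either.
    """
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(
        db="nuccore",
        id=",".join(batch),       # one request covers the whole batch
        rettype="fasta_cds_aa",   # translated CDS features only
        retmode="text",
    )
    try:
        return handle.read()
    finally:
        handle.close()


# Hypothetical usage (not run here because it needs network access):
# for batch in batches(accessions, size=100):
#     fasta_text = fetch_cds_proteins(batch, email="you@example.org")
```

Since one request carries a whole batch, a million codes in batches of 100 comes to about 10,000 requests, so the 3-queries-per-second limit matters much less than with one request per code.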

I also have access to a local copy of BLAST+ and the most up-to-date databases. I can use blastdbcmd to get the organism name and nucleotide sequence, but this doesn't give me info about whether the associated protein sequence is known.

What are my options at this point?

Thanks!

ncbi genbank python biopython • 886 views
18 months ago

I would say that the quickest way to find matches is to blast your nucleotides against the nr protein database and select the exact matches.

Then you can download the GenBank data from here

https://ftp.ncbi.nlm.nih.gov/genbank/

and parse out what you need.
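As a rough illustration of the parsing step, here is a minimal standard-library sketch that streams through a GenBank flat file and reports, for each wanted accession, the organism and any `/protein_id` qualifiers it finds. It assumes the standard flat-file layout (`VERSION`, `ORGANISM`, `//` as record terminator) and that your BLAST hits are in `accession.version` form; `scan_genbank` and `select_records` are made-up names. A full parser such as Biopython's `SeqIO.parse(handle, "genbank")` would be the more robust choice.

```python
from typing import Dict, Iterable, Iterator, Set


def _new_record() -> Dict[str, object]:
    return {"accession": None, "organism": None, "protein_ids": []}


def scan_genbank(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one small summary dict per record in a GenBank flat file."""
    record = _new_record()
    for line in lines:
        if line.startswith("VERSION"):
            # "VERSION     AB000100.1" -> "AB000100.1"
            parts = line.split()
            if len(parts) > 1:
                record["accession"] = parts[1]
        elif line.startswith("  ORGANISM"):
            parts = line.split(None, 1)
            if len(parts) > 1:
                record["organism"] = parts[1].strip()
        elif line.lstrip().startswith('/protein_id="'):
            # A CDS with a protein_id has a protein record attached
            record["protein_ids"].append(line.split('"')[1])
        elif line.startswith("//"):
            # End of record: emit the summary and start a new one
            yield record
            record = _new_record()


def select_records(lines: Iterable[str], wanted: Set[str]) -> Iterator[Dict[str, object]]:
    """Keep only the records whose accession.version is in `wanted`."""
    for record in scan_genbank(lines):
        if record["accession"] in wanted:
            yield record


# Hypothetical usage on a downloaded division file:
# import gzip
# with gzip.open("gbbct1.seq.gz", "rt") as handle:
#     for rec in select_records(handle, wanted={"AB000100.1"}):
#         print(rec["accession"], rec["organism"], rec["protein_ids"])
```

Keeping the wanted accessions in a set makes each lookup constant time, so a single pass over each file is enough even with a million codes.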

PS: I got curious to see what performance you could get with, say, Python using bio (pip install bio):

# Compressed 91MB, uncompressed 450MB, contains 100K entries
wget https://ftp.ncbi.nlm.nih.gov/genbank/gbbct1.seq.gz

# Parses each GenBank and extracts FASTA
time bio fasta gbbct1.seq.gz | wc -l

prints

3224390
real    0m59.794s

Thanks! I should have blasted my nucleotides against nr, but I have already processed a ton by blasting against nt... It's a learning process I guess :)

I will try downloading the database and searching through it locally.
