extract accessions from blastdbcmd
0
2
Entering edit mode
6.7 years ago
navela78 ▴ 70

My copy of the nr database has about 148 million sequences which i found using blastdbcmd command using the -info option.

However when I try to extract the accession numbers using the following command:

blastdbcmd -db nr -entry all -outfmt "%a"

I am getting much more than 148 million accessions. I am guessing it is because multiple genbank accessions having the exact same sequence will be represented with only one sequence but with multiple accessions in the nr database.

What is the best way to get only one accession per sequence using blastdbcmd? I donn't care which accession it is but the number of accessions returned by blastdbcmd should match the number of sequences in it. How do I do this?

blastdbcmd blast • 6.3k views
ADD COMMENT
0
Entering edit mode

Can I ask why you want to do this? You can find the fasta files (which are used to build the nr database) here. You could get the ID's either way and process them to leave just the first entry in fasta header.

ADD REPLY
0
Entering edit mode

Thanks genomax. I just thought it will be easier to extract accessions this way than to write a script. Looks like there is no straight forward way in blastdbcmd to extract one accession per sequence? In that case I can work with the fasta file like you mentioned.

My ultimate goal is to see how blastdbs perform for random access of sequence data. I am developing a software that needs sequences to be quickly extracted given a list of accessions. Any thoughts on this? Should I start a new question on indexing strategies? I am thinking of testing blastdb, Python seqio indexing, and samtools (I read somewhere that samtools can do this). Not sure which one will be faster. My db will have about 500 million - 1 billion sequences.

ADD REPLY
0
Entering edit mode

blastdbcmd to extract one accession per sequence?

Does it matter if there are other homologs in the sequence header? You could either re-write the header on the fly to just include the accession you searched with.

Edit (looks like there are 102 30S ribosomal protein S18):

$ blastdbcmd -db nr -entry AE006448_5 -outfmt "%i" | wc -l
102
$ blastdbcmd -db nr -entry OEU38668.1 -outfmt "%i" | wc -l
102

You get all identifiers if you asked for the fasta sequence with any one of those IDs.

samtools faidx, seqtk subseq, @Matt Shirley's pyfaidx can all be used to rapidly extract sequences from fasta files. There is faSomeRecords from Jim Kent's UCSC utilities that is very fast as well and does not need sequences to be indexed. SeqKit by @shenwei may also be another option.

You can also start a new question if you have thoughts about indexing strategies.

ADD REPLY
0
Entering edit mode

Thanks genomax. Sorry for the miscommunication ... but i was looking to extract one accession per sequence for all sequences in nr. Looks like there is no blastdbcmd option for it ... so like you mentioned I created it on the fly like so:

blastdbcmd -db nr -entry all -outfmt "%f" | awk '/>(.+) / {print substr($1,2);}'

this worked for me for now ... however it is slow because it is parsing through the sequences as well (outfmt is %f). I looked through the documentation and I don't think there is a way to extract only the def line

Thanks for your pointers on indexing ... I will look into them and start a separate thread if I have questions

ADD REPLY
0
Entering edit mode

interesting (and confirmed this with our own local nr db ;)) .

I hope it's correct to assume they mean same sequence present in different DBs right, and not as in a sequence that is 100% identical but from different species?

ADD REPLY
1
Entering edit mode

See for yourself

blastdbcmd -db nr -entry OEU38668.1 -outfmt "%f"

or

blastdbcmd -db nr -entry AE006448_5 -outfmt "%f"
ADD REPLY
0
Entering edit mode

was doing that while you were writing your reply :)

and yes, it's thus a combination of both: exact same seq from diff species as well as same entry from diff DBs

ADD REPLY

Login before adding your answer.

Traffic: 2016 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6