I've tried to make a subset of a pre-formatted BLAST database with blastdb_aliastool
from ncbi-blast-2.3.0+. It failed on nt but succeeded on nr. I'm pretty sure the files are intact, because I've checked the md5sum.
Here's a quick sample:
#retrieve data
wget http://ftp.ncbi.nlm.nih.gov/blast/db/nt.00.tar.gz
wget http://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz
tar -xf nt.00.tar.gz
tar -xf nr.00.tar.gz
#get some gi to test
blastdbcmd -db nr.00 -entry all|head|grep "^>"|sed -e 's/>gi|//g' -e 's/|.*//g' > nr_gi.txt
#success
blastdb_aliastool -gilist nr_gi.txt -db nr.00 -out nr_gi
#check alias db content
blastdbcmd -db nr_gi -entry all
#get some gi to test
blastdbcmd -db nt.00 -entry all|head|grep "^>"|sed -e 's/>gi|//g' -e 's/|.*//g' > nt_gi.txt
#failed
blastdb_aliastool -gilist nt_gi.txt -db nt.00 -out nt_gi
#check alias db content
blastdbcmd -db nt_gi -entry all
It failed with this message:
Converted 2 GIs from nt_gi.txt to binary format in nt_gi.p.gil
BLAST Database error: BLASTDB alias file creation failed. Some referenced files may be missing
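One thing I have not ruled out: the binary GI list in the message is named nt_gi.p.gil, and the .p. suggests the tool treated the database as protein. As far as I can tell from the help output, blastdb_aliastool has a -dbtype option that defaults to prot, but that is my assumption and should be checked with blastdb_aliastool -help. The sketch below guesses the molecule type from the index files on disk so it could be passed explicitly; guess_dbtype is a hypothetical helper, not part of BLAST+.

```shell
#!/bin/sh
# Sketch: guess a BLAST database's molecule type from the files on disk.
# guess_dbtype is a hypothetical helper, not part of BLAST+.
guess_dbtype() {
    base=$1
    # Nucleotide volumes/aliases use .nin/.nal; protein ones use .pin/.pal.
    if [ -e "$base.nin" ] || [ -e "$base.nal" ]; then
        echo nucl
    elif [ -e "$base.pin" ] || [ -e "$base.pal" ]; then
        echo prot
    else
        echo "no BLAST index found for $base" >&2
        return 1
    fi
}

# Intended use (requires BLAST+ on PATH, so left commented out here):
# blastdb_aliastool -gilist nt_gi.txt -db nt.00 -dbtype "$(guess_dbtype nt.00)" -out nt_gi
```

If the guess for nt.00 comes back as nucl, passing it through -dbtype would at least test whether the protein default is what breaks the nt case.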
Why does blastdb_aliastool only work on nr? Some posts say that specifying -parse_seqids when running makeblastdb should make it work (that case was also nr). So I tried:
# try makeblastdb first
blastdbcmd -db nr.00 -entry all|head -n 1000 > nr_test.fa
makeblastdb -in nr_test.fa -dbtype prot -parse_seqids -out nr_test
#success
blastdb_aliastool -gilist nr_gi.txt -db nr_test -out nr_gi_test
#check alias db content
blastdbcmd -db nr_gi_test -entry all
blastdbcmd -db nt.00 -entry all|head -n 1000 > nt_test.fa
makeblastdb -in nt_test.fa -dbtype nucl -parse_seqids -out nt_test
#failed again
blastdb_aliastool -gilist nt_gi.txt -db nt_test -out nt_gi_test
#check alias db content
blastdbcmd -db nt_gi_test -entry all
It's still not working. I found another post in which nt seems to work as well. Is it related to the BLAST+ version? How do I correctly make an alias database on nt with blastdb_aliastool?
That's not the reason it failed. I've downloaded the whole nt database. The code above just uses the smallest data set that reproduces the issue. With or without the whole database the result is the same: nr works but nt does not. A GI file is a list of GIs, which looks like:
You can get example GI files by running the code above. Sequences can be pulled out with a GI list using blastdbcmd, for both nr and nt. So it's quite weird.
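For reference, the GI extraction step from the commands above can be written as a small function. This is only a sketch: extract_gi is a made-up name, the sample header is invented for the demo, and unlike the original pipeline it keeps only headers that start with >gi|.

```shell
#!/bin/sh
# Sketch: pull the first GI out of each FASTA header read from stdin.
# extract_gi is a hypothetical name; the sed expressions mirror the pipeline above.
extract_gi() {
    # Keep header lines that start with ">gi|", then strip everything but the number.
    grep '^>gi|' | sed -e 's/^>gi|//' -e 's/|.*//'
}

# Self-contained demo with a made-up header (not a real record):
printf '>gi|12345|ref|XX_000001.1| made-up example\nACGT\n' | extract_gi

# With BLAST+ installed, the same function would slot into the original steps:
# blastdbcmd -db nt.00 -entry all | head | extract_gi > nt_gi.txt
# blastdbcmd -db nt.00 -entry_batch nt_gi.txt
```

The last commented line is how I pull sequences with a GI list: blastdbcmd reads one identifier per line from the file given to -entry_batch.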