Hi - I downloaded a copy of the nr.fasta database from NCBI a couple months ago. I used the program DIAMOND to compare a set of sequences to that database and generate a blast tabular output file. DIAMOND requires the initial database to be in fasta, from which one makes a special DIAMOND formatted database. From that resulting blast tabular file, I can easily get a column of accession ids from hits in a text file like so:
XP_013992594.1
KKF19911.1
XP_006633669.1
Now, I want the description lines for each accession in a column. Once I have that I can merge it with the original tabular output file to have all the fields I want in one file. I read about using the blastdbcmd to do this, but from what I understand that needs a pre-formatted NCBI database. My options were to generate the nr database from the nr.fasta file using makeblastdb, or I could download the nr database again as NCBI pre-formatted. Since NCBI says it is more efficient to use the pre-formatted database, I opted to download the complete nr database yesterday. I then used the following command:
blastdbcmd -db /home/aaron/nr_db/nr -entry_batch ids.txt -outfmt '%g %t' -target_only -out test.txt
Because I am comparing hits from the earlier database to the newer download of nr, some of the accessions records have been removed, and I get the error:
Error: XP_006633669.1: OID not found
That's fine, except I now have an original blast tabular file with say 1,000,000 lines, and I want to merge it with a list of descriptions in a column of say 999,000 lines because 1000 entries were no longer in the database. I would like a way of printing to -out from blastdbcmd "Entry not found" whenever it encounters a missing entry, but I don't see this as an option. Doing that would give me two files of the same number of columns, and I would know what entries were missing based on their description. I assume there is a simple way to script this at the command line but I can't figure it out. I tried this:
blastdbcmd -db /home/aaron/nr_db/nr -entry_batch ids.txt -outfmt '%g %t' | awk '{if($0=="") print "Sequence removed from db"; else print $0}'
but I know that is wrong. I don't know what blastdbcmd is "seeing" when it encounters a missing record. Could someone please provide some guidance? The obvious easy answer would be to just format the original nr.fasta into a blastable database, but I am concerned with that approach NCBI recommends against making the nr database that way versus downloading the pre-formatted version. I appreciate your time - Thanks -
Thanks for the response. I need to clarify what I wrote above in that I made a mistake in what I wrote about output I am getting. If I run:
I get the following error:
If I run with the
-target_only
flag:I get this error:
which is repeated 26 times without any associated accession number. I am using blast version 2.2.29+ if that makes a difference. If I append
2> dbcmd_err
as suggested, the resulting file entries have no accession number associated with them, just 'Error: Entry not found in BLAST database' repeated 26 times. Any idea why the error message I am getting is not the same?I used blast v. 2.3.0 and I get the following error for each accession not found in the database (last two entries are fake used for testing) with the command line posted above.
Which includes the accession #. Can you upgrade to 2.3.0?
Okay, upgrading to the latest version fixed that, and I get the same output as you do now. I can run the commands as you suggested and append the output to the end of the tabular blast file and have the same number rows in each column.
However, my goal was to be able to take the newly generated column of accessions and descriptions and merge it to the tabular file and have the rows line up in perfect original order. The way I have the columns structured now, the orders of data in the columns would not be the same because missing accessions with their descriptions would all be at the bottom of the new column that I am appending to the original tabular blast file. Without merging the tables in R or perl or something, I could have lines in the tabular blast file that are matched with the wrong descriptions.
This newest version of blast prints to screen exactly how I want the output, where it goes record by record and when it sees a missing accession, it prints the error. The earlier version I had before did not do that it just printed errors. I tried the following which directs the output to a file but also prints it to screen:
Now the file output.txt has the error message printed into the output where it occurs instead of having to append it on the end afterwards. That gives me the data in the order I want but it all prints to screen and to the file. Can I alter the command to just print to the file and turn off printing to the screen?
Thanks again for your help.
The following should write only to the file and suppress writing to screen.