how to extract best hits of blast in terms of e-value, identity,...?
2
4
Entering edit mode
9.6 years ago
seta ★ 1.9k

Hi everybody,

I used the following code to do blastx (by last version of ncbi-blast+) of my transcriptome assembly against the proteome reference,

blastx -query file1 -db pr_database -out file1_1.out -e-value 1e-3 -max_target_seqs 20 -outfmt 6

Now, I have two questions; given my command, are all blast hit is the best or I look at also other parameters, like identity and alignment length? Sharing your factors to select the best hit would be highly appreciated.

Also, could you please provide me how to extract desired hit in terms of e-value and identity or alignment length (if it is necessary)?

Thanks

blast RNA-Seq alignment • 17k views
ADD COMMENT
8
Entering edit mode
9.6 years ago
5heikki 11k
export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr output | sort -u -k1,1 --merge > bestHits

Sort by 1. query name, 2. bitscore, 3. evalue, 4. nucleotide identity, and extract the best line for each query (bitscore more important than evalue, evalue more important than nucleotide identity). I've been using en_US.UTF-8 locale, but I think this should also work with C (and be somewhat faster).

ADD COMMENT
0
Entering edit mode

Just giving this a try and it doesn't output the correct columns for my last BLAST output -- is this for standard BLAST output? What flags did you set in your BLAST for this to work?

ADD REPLY
1
Entering edit mode

Standard outfmt 6 plus some extra fields after that. I'm guessing it's not really working if you used blastx (instead of IMO the far superior protein prediction + blastp approach), meaning that all your query names from the same contigs (or whatever) are identical. Tabular blastx output is pretty much unsuitable for automated sorting, unless input is short reads and you're not really expecting more than 1 protein per query sequence..

ADD REPLY
0
Entering edit mode

+1, You're right -- I did use blastx. I'll check on the next BLAST run. Thanks for all your help.

ADD REPLY
0
Entering edit mode
Don't forget the outfmt 6 flag. I recommend prodigal for protein prediction..
ADD REPLY
0
Entering edit mode

Thanks 5heikki,

It works for me, for me -outfmt 6.

Just for clarification, as I used -max_target_seqs 20 in my blast command, I expected that all hits were best hit, but using suggested command I got about 27000 hit from 32000 hits as best hit. Please let me know how to explain this difference? Sorry for this question, I'm a new this filed and may be have a stupid question in your professional view! Thanks

ADD REPLY
0
Entering edit mode

-max_target_seqs 20 for each query contig/read.

ADD REPLY
0
Entering edit mode

thanks dear friend

ADD REPLY
0
Entering edit mode
6.5 years ago

WARNING: sort -u --merge appears to fail spectacularly and silently on MacOS (at least for me):

$ sort --version
2.3-Apple (99)
$ echo -e "foo\nbar\neggs" | sort -u --merge
$

i.e. it produces no output. Compare with the sort on (my) Ubuntu:

$ sort --version
sort (GNU coreutils) 8.21
[other details suppressed]
$ echo -e "foo\nbar\neggs" | sort -u --merge
foo
bar
eggs

This means the sort -u --merge-based commands for keeping the top blast hit, i.e. for doing this kind of thing

$ echo -e "foo,1\nbar,\nfoo,2\neggs," | sort -t, -k1,1 -k2,2 | sort -t, -k1,1 -u --merge
bar,
eggs,
foo,1

(here keeping the line with the smallest value of the second comma-separated field) work with that linux sort but simply produce nothing with the MacOS sort.

To do what we want robustly I wrote https://github.com/ChrisHIV/shiver/blob/master/tools/KeepBestLinesInDataFile.py, which you call from the command line telling it which fields to sort & merge by.

ADD COMMENT

Login before adding your answer.

Traffic: 1946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6