Question

how to extract best hits of blast in terms of e-value, identity,...?

4

Entering edit mode

10.0 years ago

seta ★ 1.9k

Hi everybody,

I used the following code to do blastx (by last version of ncbi-blast+) of my transcriptome assembly against the proteome reference,

blastx -query file1 -db pr_database -out file1_1.out -e-value 1e-3 -max_target_seqs 20 -outfmt 6

Now, I have two questions; given my command, are all blast hit is the best or I look at also other parameters, like identity and alignment length? Sharing your factors to select the best hit would be highly appreciated.

Also, could you please provide me how to extract desired hit in terms of e-value and identity or alignment length (if it is necessary)?

Thanks

blast RNA-Seq alignment • 18k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by seta ★ 1.9k

Ram · Answer 1 · 2015-06-01

8

Entering edit mode

10.0 years ago

5heikki 11k

export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr output | sort -u -k1,1 --merge > bestHits

Sort by 1. query name, 2. bitscore, 3. evalue, 4. nucleotide identity, and extract the best line for each query (bitscore more important than evalue, evalue more important than nucleotide identity). I've been using en_US.UTF-8 locale, but I think this should also work with C (and be somewhat faster).

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by 5heikki 11k

0

Entering edit mode

Just giving this a try and it doesn't output the correct columns for my last BLAST output -- is this for standard BLAST output? What flags did you set in your BLAST for this to work?

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by Josh Herr 5.8k

1

Entering edit mode

Standard outfmt 6 plus some extra fields after that. I'm guessing it's not really working if you used blastx (instead of IMO the far superior protein prediction + blastp approach), meaning that all your query names from the same contigs (or whatever) are identical. Tabular blastx output is pretty much unsuitable for automated sorting, unless input is short reads and you're not really expecting more than 1 protein per query sequence..

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by 5heikki 11k

0

Entering edit mode

+1, You're right -- I did use blastx. I'll check on the next BLAST run. Thanks for all your help.

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by Josh Herr 5.8k

0

Entering edit mode

Don't forget the outfmt 6 flag. I recommend prodigal for protein prediction..

ADD REPLY • link 10.0 years ago by 5heikki 11k

0

Entering edit mode

Thanks 5heikki,

It works for me, for me -outfmt 6.

Just for clarification, as I used -max_target_seqs 20 in my blast command, I expected that all hits were best hit, but using suggested command I got about 27000 hit from 32000 hits as best hit. Please let me know how to explain this difference? Sorry for this question, I'm a new this filed and may be have a stupid question in your professional view! Thanks

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by seta ★ 1.9k

0

Entering edit mode

-max_target_seqs 20 for each query contig/read.

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by 5heikki 11k

0

Entering edit mode

thanks dear friend

ADD REPLY • link 10.0 years ago by seta ★ 1.9k

Ram · Answer 2 · 2018-07-05

WARNING: sort -u --merge appears to fail spectacularly and silently on MacOS (at least for me):

$ sort --version
2.3-Apple (99)
$ echo -e "foo\nbar\neggs" | sort -u --merge
$

i.e. it produces no output. Compare with the sort on (my) Ubuntu:

$ sort --version
sort (GNU coreutils) 8.21
[other details suppressed]
$ echo -e "foo\nbar\neggs" | sort -u --merge
foo
bar
eggs

This means the sort -u --merge-based commands for keeping the top blast hit, i.e. for doing this kind of thing

$ echo -e "foo,1\nbar,\nfoo,2\neggs," | sort -t, -k1,1 -k2,2 | sort -t, -k1,1 -u --merge
bar,
eggs,
foo,1

(here keeping the line with the smallest value of the second comma-separated field) work with that linux sort but simply produce nothing with the MacOS sort.

To do what we want robustly I wrote https://github.com/ChrisHIV/shiver/blob/master/tools/KeepBestLinesInDataFile.py, which you call from the command line telling it which fields to sort & merge by.