Find overlapping blast hits and keep the highest evalue
3
0
Entering edit mode
4.7 years ago

Hi everyone,

I am performing tblastn with a set of >1000 proteins as queries against a genome.

I am trying to keep every regions of my genome that match a query protein (evalue > 1e-10) but in many cases, 1 genome region will have many hits (several queries in the same region). This is mostly due that my proteins are all similar (same gene family)

For example :

query1 hit scaffold 1 from coordinates 60 to 120 (E = 1e-5)

query2 hit scaffold1 from coordinates 70 to 110 (E = 1e-3)

To filter those results, i would like to find a way to : 1) Find regions with overlapping queries 2) Keep only the best-hit on these regions (based on e-value)

(here, i would keep coordinates 60 to 120 on scaffold 1)

I have a tabular output from blast (outfmt 6) but i can't find an efficient way to apply such filters.

I would prefer something in R or bash but i could try to understand other languages.

Thanks for your help,

Maxime

blast genome R • 3.5k views
ADD COMMENT
6
Entering edit mode
4.7 years ago

Thanks for your answers !

I finally did it using R and the package GenomicRanges as suggested by Macolm.Cook.

I provide the code below :

blast_rslt <- read.table("blast_results.tblastn", header=FALSE, sep="\t")

colnames(blast_rslt) <- c("query", "sseqid", "pident", "length", "mismatch", 
                          "gapopen", "qstart", "qend", "sstart", "send",
                          "evalue", "bitscore")

test <- blast_rslt %>% dplyr::select(sseqid, sstart, send, evalue) 



test <- test %>% mutate(i_start = case_when(
  sstart < send ~ sstart,
  send < sstart ~ send
))


test <- test %>% mutate(i_end = case_when(
  sstart < send ~ send,
  send < sstart ~ sstart
))

test <- test %>% mutate(strand = case_when(
  sstart < send ~ "+",
  send < sstart ~ "-"
))


test <- test %>% select(sseqid, i_start, i_end, evalue, strand)

colnames(test) <- c("seqnames", "start", "end", "evalue", "strand")



test_irange <- test %>% as_granges()


test_disjoin <- reduce(test_irange,with.revmap=TRUE)


list_revmap <- as.data.frame(mcols(test_disjoin))



filtered_data <- c()
for(i in 1:nrow(list_revmap)){
  filtered_data <- c(filtered_data, (slice(list_revmap, i) %>% unlist(use.names=FALSE))[which.min(slice(test,  slice(list_revmap, i) %>% unlist(use.names=FALSE))$evalue)])
}


Best_hits <- slice(test, filtered_data)
ADD COMMENT
0
Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work. You can also accept your own answer(as an exception) in this case since you provided actual code which was implied in @Malcom's answer.
Upvote|Bookmark|Accept

ADD REPLY
0
Entering edit mode

This is very helpful. I'm having the exact same problem and this Rscript is a game changer. Thank you so much for sharing!

ADD REPLY
4
Entering edit mode
4.7 years ago
Malcolm.Cook ★ 1.5k

Here's an approach using R:

The GenomicRanges package provides function disjoin with which you can compute all disjoint (non-overlapping) ranges of your hits. When passed optional argument with.revmap=TRUE, the results not only include the desired disjoint ranges, but also the corresponding indices of the hits which overlap them, which you can use to index back into your hits, and minimize on e-value (probably using which.min).

ADD COMMENT
2
Entering edit mode
4.7 years ago

If you reformat your BLAST output into sorted BED files, with five columns (i.e. with the score data in the fifth column), you could use BEDOPS bedmap to group regions by scaffold ("chromosome") and then use its --min-element argument to get the lowest-scoring value on overlapping regions.

To get the outfmt 6-formatted BLAST hit data into BED, you could use awk to select columns from the one-indexed BLAST output file (say, a file called blastHits.mtx), and write them to a zero-indexed, sorted BED file:

$ awk -v FS="\t" -v OFS="\t" '{print($2, ($7-1), $8, $1, $11)}' blastHits.mtx | sort-bed - > blastHits.bed

The blastHits.bed file will be sorted and contain the expectation score in the fifth column.

Let's also say the regions of interest over your genome are collected in a sorted BED file called roi.bed.

You could use the following command in a bash shell to get the lowest-scored element over regions-of-interest:

$ bedmap --echo --min-element --skip-unmapped roi.bed blastHits.bed > answer.bed

The file answer.bed will contain, for each region-of-interest that has an overlapping BLAST hit, the region of interest, and its lowest-scored BLAST hit that overlaps by one or more bases.

If a region-of-interest has no overlapping BLAST hits, it is not included in the output file.

The key is that the names of the scaffolds or chromosomes in both blastHits.bed and roi.bed must be identical, in order for the map step to work correctly. So just check the first column of both files to make sure you're using the same naming scheme.

This gives results for overlaps of one or more bases. If you want more stringency, there are options for that. Take a look at bedmap --help for the overlap options, or read the online documentation for more detail.

ADD COMMENT
0
Entering edit mode

Just a note that there has been a *-float128 version of BEDOPS tools since v2.4.32, which offers greater precision than the default set of tools:

https://bedops.readthedocs.io/en/latest/content/revision-history.html#v2-4-32

This "float128" version of the toolkit would be useful if you need to distinguish hits with very low E-values (anything lower than 1e-323).

It's been a while, but I think you could get very low E-values with sufficiently long BLAST queries, so it could be something to be aware of, depending on the makeup of your data and what kinds of E-values you are seeing in the hits.

Another option is to do a -log10 transformation on the E-value, and swap out --min-element for --max-element in the bedmap statement, to instead get the highest scoring element.

Additional precision should not be needed, in this case, because the negative-log transformation brings values into ranges that are easy to work with, with the standard precision toolkit.

ADD REPLY
0
Entering edit mode

On reading the edit to your question, it looks likeĀ all your E-values are greater than 1e-10, so you would not need to use the float128 binaries. The regular binaries would work fine here.

ADD REPLY

Login before adding your answer.

Traffic: 1032 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6