Hi everyone,
I am performing tblastn with a set of >1000 proteins as queries against a genome.
I am trying to keep every region of my genome that matches a query protein (evalue > 1e-10), but in many cases one genome region has many hits (several queries hitting the same region). This is mostly because my proteins are all similar (same gene family).
For example:
query1 hits scaffold1 from coordinates 60 to 120 (E = 1e-5)
query2 hits scaffold1 from coordinates 70 to 110 (E = 1e-3)
To filter those results, I would like to find a way to: 1) find regions where hits from several queries overlap, and 2) keep only the best hit in each of these regions (based on E-value).
(Here, I would keep coordinates 60 to 120 on scaffold1.)
I have tabular output from BLAST (outfmt 6), but I can't find an efficient way to apply such filters.
I would prefer something in R or bash, but I could try to understand other languages.
Thanks for your help,
Maxime
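The two steps described in the question (group overlapping hits, then keep the lowest E-value in each group) can be sketched in Python, which is not one of the preferred languages but is short to read. This is only a rough sketch under several assumptions: the input uses the default 12 outfmt 6 columns (subject start, subject end, E-value and bit score in columns 9 to 12), strand is ignored, ties on E-value are broken by bit score, and hits that overlap only transitively (A overlaps B, B overlaps C) end up in the same region. The function name `best_hits_per_region` and all values in `example` other than the coordinates and E-values from the question are made up for illustration.

```python
import csv


def best_hits_per_region(lines):
    """Group overlapping hits per scaffold and keep the lowest-E-value hit of each group.

    `lines` is any iterable of tab-separated outfmt 6 lines (a list or an open file).
    """
    hits = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        sstart, send = int(row[8]), int(row[9])
        hits.append({
            "query": row[0],
            "scaffold": row[1],
            # tblastn reports sstart > send for minus-strand hits, so normalise
            "start": min(sstart, send),
            "end": max(sstart, send),
            "evalue": float(row[10]),
            "bitscore": float(row[11]),
        })

    # Sorting by scaffold and start lets one pass detect overlaps
    hits.sort(key=lambda h: (h["scaffold"], h["start"], h["end"]))

    kept = []
    best = None          # best hit of the region currently being built
    region_end = None    # right-most coordinate of that region
    for h in hits:
        same_region = (
            best is not None
            and h["scaffold"] == best["scaffold"]
            and h["start"] <= region_end
        )
        if same_region:
            region_end = max(region_end, h["end"])
            # Lower E-value wins; higher bit score breaks ties (assumption)
            if (h["evalue"], -h["bitscore"]) < (best["evalue"], -best["bitscore"]):
                best = h
        else:
            if best is not None:
                kept.append(best)
            best = h
            region_end = h["end"]
    if best is not None:
        kept.append(best)
    return kept


# Illustrative input: the first two lines mirror the example in the question,
# the third is a made-up minus-strand hit elsewhere on the scaffold.
example = [
    "query1\tscaffold1\t45.0\t20\t11\t0\t1\t20\t60\t120\t1e-5\t50.1",
    "query2\tscaffold1\t40.0\t14\t8\t0\t1\t14\t70\t110\t1e-3\t40.2",
    "query3\tscaffold1\t60.0\t30\t12\t0\t1\t30\t500\t410\t1e-20\t90.0",
]

for hit in best_hits_per_region(example):
    print(hit["query"], hit["scaffold"], hit["start"], hit["end"], hit["evalue"])
```

For a real table, an open file handle can be passed instead of the list, since the function only iterates over lines. If transitive chaining merges too much, a stricter rule (for example requiring a minimum overlap length) would need a different grouping step.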
If an answer was helpful, you should upvote it; if an answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work. In this case you can also accept your own answer (as an exception), since you provided the actual code that was only implied in @Malcom's answer.
This is very helpful. I'm having the exact same problem and this Rscript is a game changer. Thank you so much for sharing!