I am working with a few protein sequences and a genome, and I want the output to consist only of the sequences with identity >=50 and coverage >=50. I cannot find a command-line option for this. Any suggestions on how to do this would be a great help to me!
Thank you.
The identity threshold you can specify as a parameter, but not the % coverage.
One approach to get it done: get the tabular output of blast and do some post-processing on it (Python? awk? Perl? ...). Keep in mind that even then it won't be very straightforward (but doable nonetheless) to get an accurate result, due to the nature of your blast search (the protein will be split over different HSPs/"exons" along your genomic sequence), so take that into account.
EDIT (in reference to @genomax's comment): what I'm writing above is recommended/required if you want the stats per protein. The ones on a per-HSP basis you can define through blast parameters.
I was confusing it with the blast tabular output options you can request: for that output you can ask for pident or ppos, the percentage of identical and positive matches respectively (which are then suitable for post-processing).
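To make the per-protein post-processing concrete, here is a rough Python sketch. It assumes the search was run with a tabular format whose columns are, in this order, qseqid sseqid pident length qstart qend qlen (my choice of columns, not something stated in this thread). It merges overlapping HSP intervals on the query so that residues are not counted twice, and uses a length-weighted mean of pident as an approximate per-protein identity. Grouping by query/subject pair is a simplification: HSPs from distant loci or opposite strands on the same subject sequence end up lumped together, so treat the numbers as a first pass.

```python
from collections import defaultdict

# Assumed column order (hypothetical run):
# -outfmt "6 qseqid sseqid pident length qstart qend qlen"

def parse_hits(lines):
    """Yield one tuple per HSP line, skipping blank and comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        q, s, pident, length, qstart, qend, qlen = line.split("\t")
        yield q, s, float(pident), int(length), int(qstart), int(qend), int(qlen)

def merged_length(intervals):
    """Number of query positions covered by possibly overlapping intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block (if any) and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def per_protein_stats(lines, min_ident=50.0, min_cov=50.0):
    """Return {(query, subject): (identity, coverage)} for pairs passing both cutoffs."""
    groups = defaultdict(list)
    for q, s, pident, length, qstart, qend, qlen in parse_hits(lines):
        groups[(q, s)].append((pident, length, min(qstart, qend), max(qstart, qend), qlen))
    kept = {}
    for key, hsps in groups.items():
        qlen = hsps[0][4]
        cov = 100.0 * merged_length([(a, b) for _, _, a, b, _ in hsps]) / qlen
        # Length-weighted mean identity across HSPs (an approximation).
        ident = sum(p * l for p, l, _, _, _ in hsps) / sum(l for _, l, _, _, _ in hsps)
        if ident >= min_ident and cov >= min_cov:
            kept[key] = (round(ident, 2), round(cov, 2))
    return kept
```

You would call per_protein_stats on an open file handle of the tabular output; the two cutoffs default to 50 to match the question.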
Hey, I found this really time-saving and helpful for anyone who doesn't want to code.
1. Export the data to an Excel sheet.
2. Go to a new column.
3. Go to the formula menu and insert an "IF" statement; a dialogue box will pop up on the screen with:
if (statement): A1 < 50 (note that this assumes the identity or coverage is in column A)
if this is true: "less_than_50"
if this is false: "not_less_than_50"
and then click on done.
Drag the formula all the way down to the last sequence.
Now we have a column which is filled with "less_than_50" and "not_less_than_50".
After this, we can select the filter option, click on the column and select "less_than_50".
That shows only the rows which have identity < 50, so they can be deleted.
The only parameter you could potentially use for tblastn is [...]. You will need to post-process the results for any other filtering you want to do.
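For the per-HSP case, that post-processing can be as small as a row filter. The sketch below is only an illustration and assumes tab-separated output with the columns qseqid sseqid pident length qstart qend qlen in that order (an assumption on my part, adjust the indices to whatever columns you requested). It keeps a line only if the HSP's percent identity and its coverage of the query both reach the cutoff.

```python
def filter_hsps(lines, min_ident=50.0, min_cov=50.0):
    """Return the tabular lines whose HSP passes both thresholds."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not in the assumed 7-column layout
        pident = float(fields[2])
        qstart, qend, qlen = int(fields[4]), int(fields[5]), int(fields[6])
        # Share of the query covered by this single HSP, in percent.
        hsp_cov = 100.0 * (abs(qend - qstart) + 1) / qlen
        if pident >= min_ident and hsp_cov >= min_cov:
            kept.append(line)
    return kept

# Example use with a hypothetical file name:
# with open("hits.tsv") as fh:
#     for row in filter_hsps(fh):
#         print(row)
```

Note that this judges each HSP on its own, so a protein spread over several exons can fail the coverage cutoff even when the HSPs together cover most of it.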
oh okay, I will try that. Thank you!