Dear Bioinformaticians,
I would like to ask how to choose a threshold for filtering tblastn results by sequence identity (pident, %).
I have a table of tblastn results in Galaxy containing about 800,000 sequences. I would like to filter them by sequence identity, but if I filter at 98% I lose almost all of them. I would like to know what the accepted filtering level is, considering that this is protein data. I think it should not be as strict as blastn filtering (commonly 98 or 99%). Please give me advice and point me to any publication that recommends a proper percentage.
All answers are greatly appreciated. :)
Thend
It is impossible to say without knowing any details about the project. Why do you need to filter the sequences?
Even knowing the details, there is probably no perfect threshold. It is often a trade-off between removing artifacts (I guess this is what you want to do) and not losing too much information.
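To get a feel for that trade-off, one option is to count how many hits survive at several identity cutoffs before committing to one. Below is a minimal Python sketch, assuming the Galaxy table is tab-separated and follows the default BLAST tabular column order (pident in the third column); your table may be laid out differently. The function name, the cutoffs, and the example rows are made up for illustration, and none of the cutoffs is a recommendation.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
# Assumption: the Galaxy table uses this same order.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
PIDENT_INDEX = COLUMNS.index("pident")


def count_hits_by_threshold(handle, thresholds):
    """Return {threshold: number of hits with pident >= threshold}."""
    counts = {t: 0 for t in thresholds}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment/header lines starting with '#'.
        if not row or row[0].startswith("#"):
            continue
        pident = float(row[PIDENT_INDEX])
        for t in thresholds:
            if pident >= t:
                counts[t] += 1
    return counts


# Made-up rows, only to show the input format; not real results.
example = "\n".join([
    "q1\tcontig_1\t99.1\t120\t1\t0\t1\t120\t300\t659\t1e-60\t230",
    "q1\tcontig_2\t72.5\t118\t32\t1\t2\t119\t10\t360\t1e-35\t140",
    "q2\tcontig_3\t48.0\t100\t50\t2\t5\t104\t900\t1196\t1e-12\t70.1",
    "q3\tcontig_4\t35.2\t90\t55\t3\t1\t90\t50\t316\t1e-05\t45.0",
])

# Hypothetical cutoffs chosen only to illustrate the sweep.
counts = count_hits_by_threshold(io.StringIO(example), [30, 50, 70, 98])
for threshold, n_hits in counts.items():
    print(f"pident >= {threshold}%: {n_hits} hits kept")
```

For a real file, pass an open file handle instead of the `io.StringIO` example. Looking at how quickly the counts drop between cutoffs shows how much information each choice would discard.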
Hey endretoth, I am doing similar work. Can you tell me how you filtered the sequences? Did you do it manually, or did you use a programming language?
It depends on what kind of sequences you are using in the search. You should also consider which BLOSUM or PAM matrix you used in BLAST, depending on the divergence between the query sequence and the sequences you are looking for.