My dataset contains ~20K multi-domain proteins that exhibit extensive length variation. As a result, homology detection tools are not generating high-quality clusters of truly homologous sequences: the sequence alignments of proteins within each inferred cluster can be very gappy and crappy. With that as context, here are my questions for forum members:
1. Are there useful run parameters to set when performing my all-by-all BLASTp?
2. Are there useful post-processing tools to apply to the tabular output of my all-by-all BLASTp?
I have come across suggestions elsewhere, such as requiring query coverage >= 90% and subject coverage >= 90%.
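For reference, a coverage filter like that could be sketched in Python as below. This is only a minimal sketch under assumptions: the BLASTp output is assumed to be tab-separated with the columns listed in `COLUMNS` (so `qlen` and `slen` must have been requested in the output format), coverage is computed per HSP without merging multiple HSPs for the same pair, and the function name `filter_by_coverage` and the toy records are made up for illustration.

```python
import csv
import io

# Assumed column order of the BLASTp tabular output (an assumption, not a given).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
           "sstart", "send", "evalue", "bitscore", "qlen", "slen"]


def filter_by_coverage(handle, min_qcov=0.9, min_scov=0.9):
    """Yield (query, subject, bitscore) for HSPs passing both coverage cutoffs."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        if rec["qseqid"] == rec["sseqid"]:
            continue  # drop self hits (a choice made for this sketch)
        # Fraction of each sequence covered by this single HSP.
        qcov = (int(rec["qend"]) - int(rec["qstart"]) + 1) / int(rec["qlen"])
        scov = (abs(int(rec["send"]) - int(rec["sstart"])) + 1) / int(rec["slen"])
        if qcov >= min_qcov and scov >= min_scov:
            yield rec["qseqid"], rec["sseqid"], float(rec["bitscore"])


# Toy input: the first hit covers both sequences well, the second does not.
example = io.StringIO(
    "p1\tp2\t45.0\t190\t5\t194\t1\t190\t1e-50\t180.0\t200\t195\n"
    "p1\tp3\t60.0\t80\t10\t89\t1\t80\t1e-20\t90.0\t200\t400\n"
)
for query, subject, weight in filter_by_coverage(example):
    print(query, subject, weight, sep="\t")
```

The three-column lines printed here (two labels and a weight) are in the label format that MCL can read with its `--abc` option; using the bit score as the edge weight is just one possible choice.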
Please let me know how I might improve the quality of the tabular output I pass to MCL so that the clusters contain homologous sequences, no more, no less. I realize my dataset is very complex, so I am hoping to improve my results without any delusional hopes of a perfect solution!
You could try changing the substitution matrix. BLOSUM62 is the default and tends to give good results for low similarities, but you could also try BLOSUM45, which should be more suitable for very low sequence similarities.
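As a rough illustration of switching the matrix, here is a small Python sketch that assembles a BLAST+ `blastp` command line. The file names, the e-value cutoff, and the thread count are placeholders rather than recommendations, and the helper `build_blastp_cmd` is hypothetical. The output format also requests `qlen` and `slen` so that coverage can be computed from the table afterwards.

```python
import shlex

# Tabular output with sequence lengths appended to the standard columns.
OUTFMT = ("6 qseqid sseqid pident length qstart qend "
          "sstart send evalue bitscore qlen slen")


def build_blastp_cmd(query, db, out, matrix="BLOSUM45", evalue=1e-5, threads=4):
    """Return the blastp argument list for an all-by-all search."""
    return [
        "blastp",
        "-query", query,          # FASTA file with all proteins
        "-db", db,                # BLAST database built from the same proteins
        "-out", out,
        "-matrix", matrix,        # e.g. BLOSUM45 instead of the default BLOSUM62
        "-evalue", str(evalue),   # placeholder cutoff, adjust as needed
        "-outfmt", OUTFMT,
        "-num_threads", str(threads),
    ]


# Placeholder file names, for illustration only.
cmd = build_blastp_cmd("proteins.faa", "proteins_db", "all_vs_all.tsv")
print(shlex.join(cmd))
# To actually run it (requires BLAST+ to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list makes it easy to rerun the search with a different matrix and compare the resulting clusters.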
As an alternative, if possible, you could try to assign your proteins to existing gene families using HMM profiles, e.g. for animal proteins, you could use TreeFam HMMs and the treefamscan tool.