BLASTp all-by-all run parameters & post-processing tools

8.9 years ago

Anand Rao ▴ 640

My dataset contains ~20K proteins that are multi-domain and vary extensively in length. As a result, homology detection tools are not producing high-quality clusters of truly homologous sequences: the sequence alignment of the proteins within each inferred cluster can be very gappy and crappy. With that as context, here are my questions for forum members:

1. Are there useful run parameters while performing my BLASTp all-by-all?

2. Are there useful post-processing tools after generating my BLASTp all-by-all tabular output?

I have come across suggestions elsewhere, such as % query coverage >= 90% and % subject coverage >= 90%.
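For illustration, the coverage cutoffs mentioned above could be applied to the tabular output roughly as follows. This is only a sketch under assumptions: it assumes the all-by-all search was run with `-outfmt "6 qseqid sseqid pident length qstart qend sstart send evalue bitscore qlen slen"` so that the sequence lengths are in the last two columns, it computes coverage per HSP (it does not merge multiple HSPs for the same pair), and the function name `filter_by_coverage` is made up for this example.

```python
# Assumed column order, matching the -outfmt string named in the text above.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
           "sstart", "send", "evalue", "bitscore", "qlen", "slen"]


def filter_by_coverage(lines, min_qcov=0.9, min_scov=0.9):
    """Yield (query, subject, bitscore) for HSPs passing both coverage cutoffs.

    Coverage is the aligned span divided by the full sequence length,
    computed separately for the query and the subject.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        qcov = (int(row["qend"]) - int(row["qstart"]) + 1) / int(row["qlen"])
        scov = (int(row["send"]) - int(row["sstart"]) + 1) / int(row["slen"])
        if qcov >= min_qcov and scov >= min_scov:
            yield row["qseqid"], row["sseqid"], float(row["bitscore"])


# Two made-up hits: the first covers 95% of both sequences, the second
# covers only 40% of the query and 10% of the subject.
sample = [
    "p1\tp2\t55.0\t95\t1\t95\t3\t97\t1e-30\t120\t100\t100",
    "p1\tp3\t60.0\t40\t1\t40\t10\t49\t1e-10\t60\t100\t400",
]
kept = list(filter_by_coverage(sample))
for q, s, score in kept:
    print(f"{q}\t{s}\t{score}")
```

The three-column output (query, subject, weight) is the kind of label format that MCL can read with its `--abc` option; whether bitscore or a transformed E-value makes the better edge weight for this dataset is left open here.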

Please let me know how I can improve the quality of the tabular output I pass to MCL, so that the clusters contain homologous sequences - no more, no less. I realize my dataset is very complex, so I am hoping to improve my results without any delusional hopes of a perfect solution!

BLAST coverage length clustering homology • 1.9k views

You could try changing the substitution matrix. BLOSUM62 is the default and tends to give good results for low similarities, but you could also try BLOSUM45, which should be more suitable for very low sequence similarities.
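As a rough sketch of the matrix suggestion, the snippet below assembles a BLAST+ `blastp` command line that swaps in BLOSUM45 via the `-matrix` option. The file names (`proteins.faa`, `proteins_db`, `all_vs_all.tsv`), the E-value cutoff, the thread count, and the helper name `build_blastp_cmd` are all placeholders chosen for illustration, not values from this thread; the database is assumed to have been built beforehand with `makeblastdb`.

```python
import shlex

# Tabular output with query and subject lengths appended, so that coverage
# can be computed afterwards (column choice is an assumption for this sketch).
OUTFMT = "6 qseqid sseqid pident length qstart qend sstart send evalue bitscore qlen slen"


def build_blastp_cmd(fasta, db, out, matrix="BLOSUM45", evalue=1e-5, threads=4):
    """Return the blastp argument list for an all-by-all search."""
    return [
        "blastp",
        "-query", fasta,
        "-db", db,
        "-matrix", matrix,        # BLOSUM62 is the blastp default
        "-evalue", str(evalue),
        "-outfmt", OUTFMT,
        "-out", out,
        "-num_threads", str(threads),
    ]


cmd = build_blastp_cmd("proteins.faa", "proteins_db", "all_vs_all.tsv")
print(shlex.join(cmd))  # shell-quoted command, ready to copy or log

# To actually run it (requires BLAST+ on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list keeps the quoted `-outfmt` string intact when it is handed to `subprocess.run`, which avoids shell-quoting mistakes.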

As an alternative, if possible, you could try assigning your proteins to existing gene families using HMM profiles. For animal proteins, for example, you could use the TreeFam HMMs and the treefamscan tool.

Reply 8.9 years ago by Jean-Karim Heriche 27k