I have the task of clustering ~18K plant proteins, with the ultimate goal of inferring gene gain and loss, which necessitates inference of orthology. This has been a nightmare because the proteins are multi-domain and one of the domains is highly promiscuous. Sequences in my orthogroups have poor, gappy alignments, and their trees have several branches with little to no bootstrap support.
Therefore, I don't really want to bother any more about finding 'orthologs' as much as I want to simply gather sets of 'homologous protein sequences'. The only domain common to all these proteins, the promiscuous one, is relatively short (48 aa) and poorly conserved. So I can't use domain-only alignment or phylogeny for obvious reasons.
Rather than using BLAST's local search algorithm, I've started wondering about ggsearch36 from the FASTA package by Bill Pearson. It employs a global-global search algorithm. It also allows the option of producing output in BLAST format (I think tabulated). If I can reproduce global-global FASTA search results in BLAST's -m 8 tabular format, then it should work with MCL, correct?
Other than the workflow logistics, more importantly, would this be scientifically unacceptable for any reason? Or come with any big caveats? I can think of a few, but I'll wait for your responses.
I suppose I'd have to define what a 'sequence homolog' would be for this approach? For example, could I use cutoffs of 90% sequence identity and +/-10% sequence length variation? Any thoughts? Thanks!
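To make the proposed cutoffs concrete, here is a minimal Python sketch of the pairwise test. The function name `is_candidate_homolog` and its parameters are hypothetical, and it assumes that "+/-10% length variation" means the two lengths differ by at most 10% of the longer sequence; other definitions are equally possible.

```python
def is_candidate_homolog(pct_identity, len_a, len_b,
                         min_identity=90.0, max_len_diff=0.10):
    """Return True if a sequence pair passes both cutoffs.

    Assumption: length variation is measured relative to the longer
    of the two sequences. The defaults mirror the 90% / 10% values
    suggested in the question and are not recommendations.
    """
    longer = max(len_a, len_b)
    shorter = min(len_a, len_b)
    length_ok = (longer - shorter) / longer <= max_len_diff
    return pct_identity >= min_identity and length_ok
```

With these defaults, a pair at 92% identity with lengths 500 and 460 would pass, while the same pair with lengths 500 and 400 would not.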
MCL operates on the adjacency matrix of a (preferably undirected) graph, so even if your output is not identical to BLAST's -m 8, you can always post-process your data to get a matrix of similarities between your sequences in one of the formats that mcl accepts as input.
To infer gains/losses, you would still need a tree. You could use the clusters as starting points, as in the TreeFam strategy (see the first paper and the update).
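As an illustration of the post-processing step for MCL, here is a minimal Python sketch that collapses tabular hits into undirected, weighted edges in the label-based "ABC" format (one `nodeA<TAB>nodeB<TAB>weight` line per edge). It assumes the search output follows the standard 12-column BLAST tabular layout with the bit score in the last column, and it symmetrizes by keeping the higher of the two directional scores; both choices, and the function name `tabular_to_abc`, are assumptions for this sketch rather than anything prescribed by MCL.

```python
import csv
import io


def tabular_to_abc(handle, score_col=11):
    """Collapse BLAST-style tabular hits into undirected ABC edges.

    Assumes tab-separated rows with query in column 0, subject in
    column 1, and the score to use as edge weight in `score_col`
    (column 11 is the bit score in the 12-column layout).
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment headers
        query, subject = row[0], row[1]
        if query == subject:
            continue  # self-hits carry no clustering information
        score = float(row[score_col])
        # Sort the pair so A->B and B->A map to the same edge.
        key = tuple(sorted((query, subject)))
        if score > best.get(key, float("-inf")):
            best[key] = score
    return [f"{a}\t{b}\t{w}" for (a, b), w in sorted(best.items())]


if __name__ == "__main__":
    demo = (
        "A\tB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\n"
        "B\tA\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    )
    for line in tabular_to_abc(io.StringIO(demo)):
        print(line)
```

The resulting lines can be written to a file and passed to mcl with its `--abc` option, for example `mcl graph.abc --abc -o clusters.txt`. Whether bit score, E-value, or identity makes the best edge weight for global alignments is a separate question that this sketch does not settle.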