Question

Feeding FASTA-ggsearch36 results for MCL clustering

9.6 years ago

Anand Rao ▴ 640

I have the task of clustering ~18K plant proteins, with the ultimate goal of inferring gene gain and loss, which requires inferring orthology. This has been a nightmare because the proteins are multi-domain and one of the domains is highly promiscuous. Sequences in my orthogroups give poor, gappy alignments, and their trees have several branches with little to no bootstrap support.

Therefore, I don't really want to bother any more with finding 'orthologs'; I simply want to gather sets of 'homologous protein sequences'. The only domain common to all these proteins, the promiscuous one, is relatively short (48 aa) and poorly conserved, so I can't use domain-only alignment or phylogeny.

Rather than using BLAST's local search algorithm, I've started wondering about ggsearch36 from the FASTA package by Bill Pearson. It employs a global-global search algorithm. It also allows the option of producing output in BLAST format (tabular, I think). If I can reproduce the global-global FASTA search results in BLAST's -m 8 tabular format, then it should work with MCL, correct?

Other than the workflow logistics, more importantly, would this be scientifically unacceptable for any reason? Or come with any big caveats? I can think of a few, but I'll wait for your responses.

I suppose I'd have to define what a 'sequence homolog' would be for this approach? For example, could I use cutoffs of 90% sequence identity and +/-10% sequence length variation? Any thoughts? Thanks!
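
To make the proposed cutoffs concrete, here is a minimal sketch of a pairwise filter. The function name `passes_cutoffs` and its parameters are hypothetical, and "+/-10% length variation" is interpreted here as "the shorter sequence is within 10% of the longer one", which is only one possible reading. Sequence lengths are not part of the 12-column tabular output, so they would have to come from the FASTA file.

```python
def passes_cutoffs(pct_identity, query_len, subject_len,
                   min_identity=90.0, max_len_diff=0.10):
    """Return True if a hit meets the identity and length-difference cutoffs.

    Assumption: the length difference is measured relative to the longer
    of the two sequences.
    """
    longer = max(query_len, subject_len)
    shorter = min(query_len, subject_len)
    length_ok = (longer - shorter) <= max_len_diff * longer
    return pct_identity >= min_identity and length_ok


# Example: a 300 aa query against 280 aa and 260 aa subjects at 92% identity.
print(passes_cutoffs(92.0, 300, 280))
print(passes_cutoffs(92.0, 300, 260))
```

Whether thresholds this strict are appropriate is a separate question; the sketch only shows how they would be applied to each hit.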

FASTA global local cluster • 2.4k views


MCL operates on the adjacency matrix of a (preferably undirected) graph, so even if your output is not identical to BLAST's -m 8, you can always post-process it into a matrix of pairwise similarities between your sequences in one of the formats that mcl accepts as input.
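
As a rough sketch of that post-processing step, the following assumes the search output is 12-column BLAST-style tabular text (query, subject, ..., E-value, bit score) and converts it to label-pair ("ABC") lines of the form `a b weight`. The function names are hypothetical, and using the bit score as the edge weight and keeping the larger of the two directions to make the graph undirected are illustrative choices, not a recommendation.

```python
import csv
import io


def tabular_to_edges(handle):
    """Collapse tabular hits into undirected edges {(a, b): weight}."""
    edges = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines, and malformed rows.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        query, subject, bits = row[0], row[1], float(row[11])
        if query == subject:
            continue  # ignore self-hits
        key = tuple(sorted((query, subject)))
        # Keep the stronger of the a->b and b->a hits.
        edges[key] = max(edges.get(key, 0.0), bits)
    return edges


def write_abc(edges, out):
    """Write one 'a<TAB>b<TAB>weight' line per edge."""
    for (a, b), weight in sorted(edges.items()):
        out.write(f"{a}\t{b}\t{weight:.2f}\n")


# Tiny made-up example with reciprocal hits and a self-hit.
demo = (
    "p1\tp2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180.0\n"
    "p2\tp1\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-49\t175.0\n"
    "p1\tp1\t100.0\t100\t0\t0\t1\t100\t1\t100\t0\t200.0\n"
)
buffer = io.StringIO()
write_abc(tabular_to_edges(io.StringIO(demo)), buffer)
print(buffer.getvalue(), end="")
```

The resulting file can then be given to mcl in label mode, for example `mcl graph.abc --abc -o clusters.txt`, where the file names are placeholders.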

To infer gains/losses, you would still need a tree. You could use the clusters as starting points, as in the TreeFam strategy (see the first paper and the update).

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Jean-Karim Heriche 27k