Feeding FASTA-ggsearch36 results for MCL clustering
8.9 years ago
Anand Rao ▴ 640

I have the task of clustering ~18K plant proteins, with the ultimate goal of inferring gene gain and loss, which in turn requires inferring orthology. This has been a nightmare because the proteins are multi-domain and one of the domains is highly promiscuous. Sequences in my orthogroups align poorly, with many gaps, and their trees have several branches with little to no bootstrap support.

Therefore, I no longer want to worry about finding 'orthologs' so much as simply gather sets of 'homologous protein sequences'. The only domain common to all these proteins, the promiscuous one, is relatively short (48 aa) and poorly conserved, so I can't use domain-only alignment or phylogeny for obvious reasons.

Rather than using BLAST's local search algorithm, I've started wondering about ggsearch36 from the FASTA package by Bill Pearson. It employs a global-global search algorithm. It also has an option to produce output in BLAST format (tabular, I think). If I can reproduce global-global FASTA search results in BLAST's -m 8 tabular format, then it should work with MCL, correct?

Other than the workflow logistics, more importantly, would this be scientifically unacceptable for any reason? Or come with any big caveats? I can think of a few, but I'll wait for your responses.

I suppose I'd have to define what a 'sequence homolog' would be for this approach? For example, could I use cutoffs of 90% sequence identity and +/-10% sequence length variation? Any thoughts? Thanks!
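To make such cutoffs concrete, here is a minimal sketch of a pairwise filter. It assumes percent identity comes from the search output and that the two sequence lengths are looked up separately (for example from the FASTA file). The function name, the parameter names, and the choice to measure length difference relative to the longer sequence are illustrative assumptions, not an established definition of a homolog.

```python
def is_homolog_pair(pident, len_a, len_b, min_identity=90.0, max_len_diff=0.10):
    """Return True if a hit passes both illustrative cutoffs.

    pident       -- percent identity reported for the alignment (0-100)
    len_a, len_b -- full lengths of the two sequences (assumed known)
    The length test is relative to the longer sequence (an assumption;
    "+/-10%" could also be defined against the query or the mean length).
    """
    longer = max(len_a, len_b)
    shorter = min(len_a, len_b)
    length_ok = (longer - shorter) <= max_len_diff * longer
    return pident >= min_identity and length_ok


# Example: 95% identity, lengths 100 and 92 (8% difference) passes;
# the same identity with lengths 100 and 80 (20% difference) does not.
print(is_homolog_pair(95.0, 100, 92))
print(is_homolog_pair(95.0, 100, 80))
```

Cutoffs this strict would likely keep only very close relatives, so the thresholds are worth treating as parameters to explore rather than fixed values.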

FASTA global local cluster • 2.3k views

MCL operates on the adjacency matrix of a (preferably undirected) graph so even if your output is not identical to BLAST's -m 8, you can always post-process your data to get a matrix of similarities between your sequences in one of the formats that mcl accepts as input.
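As a rough sketch of that post-processing, the function below turns 12-column BLAST-style tabular lines into the three-column label format ("abc": node, node, weight) that mcl can read with its --abc option. It assumes the standard -m 8 column order (query, subject, ..., E-value in column 11, bit score in column 12), uses -log10(E-value) as the edge weight, and keeps one undirected edge per pair with the larger of the two directional weights. The function name, the weight transform, and the cap of 200 are illustrative choices, not something prescribed by MCL or FASTA.

```python
import math


def tabular_to_abc(lines, weight_cap=200.0):
    """Convert BLAST -m 8 style lines into symmetric 'abc' edge lines.

    Each pair of sequences yields one undirected edge whose weight is
    -log10(E-value), capped at weight_cap (assumed value). Self-hits,
    comment lines, and hits with E-value >= 1 are skipped.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        query, subject = fields[0], fields[1]
        if query == subject:
            continue
        evalue = float(fields[10])
        if evalue <= 0.0:
            weight = weight_cap  # E-value reported as 0
        else:
            weight = min(weight_cap, -math.log10(evalue))
        if weight <= 0.0:
            continue
        # Sort the pair so A-B and B-A map to the same undirected edge.
        key = tuple(sorted((query, subject)))
        if weight > best.get(key, 0.0):
            best[key] = weight
    return [f"{a}\t{b}\t{w:.2f}" for (a, b), w in sorted(best.items())]


hits = [
    "A\tB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "B\tA\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t180",
    "A\tA\t100.0\t100\t0\t0\t1\t100\t1\t100\t0.0\t250",
]
for edge in tabular_to_abc(hits):
    print(edge)
```

The resulting lines can be written to a file and passed to mcl, for example as `mcl graph.abc --abc -o clusters.txt`, where the file names are placeholders.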

To infer gains/losses, you would still need a tree. You could use the clusters as starting points, as in the TreeFam strategy (see the first paper and the update).
