How To Find Protein Clusters/Families?
14.1 years ago
Panos ★ 1.8k

I have a metagenomic dataset with several hundred thousand relatively small genes, and I want to find the protein families present in it (based on their amino acid sequences).

I've tried cdhit with a 40% identity threshold, as well as mclblastline with inflation values ranging from 3.0 to 10.0. cdhit produces smaller families (fewer members each) that are more homogeneous than mcl's, judging by the domains present and the KEGG orthology.
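To make a comparison like this reproducible, the two cluster files can be loaded and their size distributions tabulated. The following is a minimal sketch, assuming the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by member lines containing `>id...`) and MCL's label output (one cluster per line, tab-separated ids). The function names are made up for illustration.

```python
import re
from collections import Counter


def parse_cdhit_clstr(lines):
    """Return clusters (lists of sequence ids) from CD-HIT .clstr lines."""
    clusters = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster starts here.
            clusters.append([])
        else:
            # Member lines look like: "0\t120aa, >gene_1... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match and clusters:
                clusters[-1].append(match.group(1))
    return clusters


def parse_mcl_clusters(lines):
    """Return clusters from MCL label output: one cluster per line, tab-separated."""
    return [line.rstrip("\n").split("\t") for line in lines if line.strip()]


def size_histogram(clusters):
    """Map cluster size -> number of clusters of that size."""
    return Counter(len(cluster) for cluster in clusters)
```

Calling `size_histogram` on both results gives a quick view of how much more finely one tool splits the data than the other.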

I think I prefer cdhit's clusters; the only thing I don't like is that I can't lower the threshold below 40% (which I think may be necessary for such a task). On the other hand, mcl is used for exactly this kind of task, right?
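For reference, mcl clusters a weighted similarity graph rather than applying an identity cutoff, so it is not bound by the 40% limit. Here is a rough sketch of turning all-vs-all BLAST tabular output (the standard 12-column format, E-value in column 11) into mcl's ABC edge list, using -log10(E-value) as the edge weight. The E-value cutoff and the cap of 200 are assumptions for illustration, not values from this thread, and this is not necessarily what mclblastline does internally.

```python
import math


def blast_to_abc(blast_lines, max_evalue=1e-5, cap=200.0):
    """Yield (query, subject, weight) edges from BLAST tabular lines.

    weight = -log10(E-value), capped at `cap` so an E-value of 0 stays finite.
    Self-hits and hits weaker than `max_evalue` are skipped.
    """
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > max_evalue:
            continue
        weight = cap if evalue == 0.0 else min(cap, -math.log10(evalue))
        yield query, subject, weight


def write_abc(edges, path):
    """Write edges as tab-separated 'label label weight' lines."""
    with open(path, "w") as handle:
        for query, subject, weight in edges:
            handle.write(f"{query}\t{subject}\t{weight:.2f}\n")
```

The resulting file can then be clustered with something like `mcl graph.abc --abc -I 3.0 -o clusters.mcl`, where `-I` is the inflation value; lower inflation tends to give coarser clusters.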

Is there anyone who has done such an analysis so that he/she can give me some advice? I'm particularly interested in people's opinion about mcl.

clustering protein • 4.8k views

Are you trying to find a specific family or classify your whole set? Is it enough for you to classify your genes into known protein families, or do you want the unknowns clustered too? I think there are quite a few possible approaches, depending on what you are trying to achieve.


Thanks for the comment Michael!

There are two things I want to do:

(a) find sub-families within families identified by HMMer3 (e.g. different groups of ABC transporters), and

(b) find families that don't exist in pfam.
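Both goals start from the same split: genes with a Pfam hit are grouped by family so each group can be sub-clustered for (a), and genes without a hit are set aside for de novo clustering for (b). A minimal sketch follows, assuming `hmmscan --tblout` output, where the first whitespace-separated field is the model name, the third is the query (gene) name, and the fifth is the full-sequence E-value. Assigning each gene only to its single best family is a simplification for multi-domain proteins, and the E-value cutoff is an arbitrary placeholder.

```python
from collections import defaultdict


def best_family_per_gene(tblout_lines, max_evalue=1e-3):
    """Map gene id -> (family, E-value) for the best hit at or below max_evalue."""
    best = {}
    for line in tblout_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        family, gene, evalue = fields[0], fields[2], float(fields[4])
        if evalue > max_evalue:
            continue
        # Keep the hit with the lowest E-value for each gene.
        if gene not in best or evalue < best[gene][1]:
            best[gene] = (family, evalue)
    return best


def split_by_family(all_gene_ids, best):
    """Return ({family: [genes]}, [genes with no accepted Pfam hit])."""
    by_family = defaultdict(list)
    unassigned = []
    for gene in all_gene_ids:
        if gene in best:
            by_family[best[gene][0]].append(gene)
        else:
            unassigned.append(gene)
    return dict(by_family), unassigned
```

Each per-family gene list can then be clustered on its own to look for sub-families, while the unassigned list is the input for finding candidate new families.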

14.1 years ago
Casbon ★ 3.3k

I wrote a paper on clustering protein families a while back: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1409676/ This figure compares TribeMCL to a spectral method; the target groups are SCOP structural families. There is now an implementation available (I haven't tried it myself): http://www.paccanarolab.org/software/scps/index.html

Nevertheless, I would treat this as a classification problem rather than a clustering problem. I would use hmmer3 with Pfam models to classify the genes into known Pfam families, unless you have reason to suspect that your data contains previously unseen families.


Thanks for the reply!

I have already done HMMer prediction using the entire pfam database but there are still lots of genes containing no domain(s) at all...

14.1 years ago
Rm 8.3k

Try Sarah Teichmann's "GENEFAMMER"; it's a protein family clustering package.

It's an old program, and I could not find a direct download link, but you can write to the authors (Jong Park & Sarah Teichmann).

Other info: http://www.ncbi.nlm.nih.gov/proteinclusters

UCLUST: http://www.drive5.com/usearch/intro.html
