I have a metagenomic dataset with several hundred thousand relatively small genes, and I want to find the protein families present in it (based on their amino acid sequences).
I've tried cdhit with a 40% identity threshold as well as mclblastline with inflation values ranging from 3.0 to 10.0. cdhit creates smaller (fewer members per family) but more homogeneous families than mcl, judging by the domains and KEGG orthology assignments present in each family.
I think I prefer cdhit's clusters; the only thing I don't like about it is that I can't lower the threshold below 40% (which I think may be necessary for such a task). On the other hand, mcl is meant for exactly this kind of task, right?
Has anyone done such an analysis who could give me some advice? I'm particularly interested in people's opinions about mcl.
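One way to put a number on "more homogeneous" when comparing the cdhit and mcl families is a per-cluster purity score: the fraction of annotated members that share the most common label. The sketch below is only an illustration under assumptions; the function name `cluster_purity`, the dictionary layout, and the toy gene and label values are all hypothetical, and the labels stand in for whatever annotation is used (KEGG orthology IDs, domain names).

```python
from collections import Counter


def cluster_purity(clusters, annotation):
    """Return {cluster_id: purity}, where purity is the fraction of
    annotated members carrying the cluster's most common label.
    Clusters with no annotated members get None."""
    purity = {}
    for cluster_id, members in clusters.items():
        # Keep only members that actually have an annotation.
        labels = [annotation[m] for m in members if m in annotation]
        if not labels:
            purity[cluster_id] = None
            continue
        top_count = Counter(labels).most_common(1)[0][1]
        purity[cluster_id] = top_count / len(labels)
    return purity


# Toy input (hypothetical): cluster id -> member gene ids.
clusters = {
    "c1": ["g1", "g2", "g3"],
    "c2": ["g4", "g5", "g6", "g7"],
    "c3": ["g8"],
}
# Toy annotation (hypothetical labels): gene id -> label; g8 is unannotated.
annotation = {
    "g1": "KO_A", "g2": "KO_A", "g3": "KO_A",
    "g4": "KO_A", "g5": "KO_B", "g6": "KO_B", "g7": "KO_C",
}

purity = cluster_purity(clusters, annotation)
for cluster_id, value in purity.items():
    print(cluster_id, value)
```

Running the same function on the families from each tool (and each inflation value) would give distributions that can be compared alongside cluster sizes, since very small clusters are trivially pure.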
Are you trying to find a specific family or classify your whole set? Is it enough for you to classify your genes into known protein families, or do you want unknowns clustered too? I think there are quite a few possible approaches, depending on what you are trying to achieve.
Thanks for the comment Michael!
There are two things I want to do:
(a) find sub-families within families identified by HMMER3 (e.g., different groups of ABC transporters), and
(b) find families that don't exist in Pfam.
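As a rough sketch of how the two goals could be separated in practice: genes can first be grouped by their best Pfam hit, so that each group is sub-clustered on its own for (a), while genes without any acceptable hit are set aside for (b). The code assumes `hmmscan --tblout` output, where the first column is the model name, the third is the query sequence name, and the fifth is the full-sequence E-value. The function name, the `1e-5` cutoff, and the toy lines (including the accession versions) are hypothetical.

```python
from collections import defaultdict


def group_by_best_family(tblout_lines, max_evalue=1e-5):
    """Group gene ids by their best-scoring (lowest E-value) family.
    Assumes hmmscan --tblout column order: model name, model accession,
    query name, query accession, full-sequence E-value, ..."""
    best = {}  # gene id -> (evalue, family name)
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        family, gene, evalue = fields[0], fields[2], float(fields[4])
        if evalue > max_evalue:
            continue  # hit too weak to assign the gene to this family
        if gene not in best or evalue < best[gene][0]:
            best[gene] = (evalue, family)
    groups = defaultdict(list)
    for gene, (_, family) in best.items():
        groups[family].append(gene)
    return dict(groups)


# Toy tblout lines (hypothetical values, truncated to the first 7 columns).
toy_lines = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "ABC_tran PF00005.30 gene1 - 1e-30 100.0 0.1",
    "ABC_tran PF00005.30 gene2 - 2e-12 45.0 0.0",
    "AAA_21 PF13304.9 gene2 - 3e-08 30.0 0.0",
    "ABC_tran PF00005.30 gene3 - 0.5 5.0 0.0",
]

groups = group_by_best_family(toy_lines)
all_genes = {"gene1", "gene2", "gene3", "gene4"}
assigned = {g for members in groups.values() for g in members}
unassigned = sorted(all_genes - assigned)
print(groups)
print(unassigned)
```

Each family's member list could then be clustered separately (for example with mcl at different inflation values), and the unassigned genes clustered as a pool to look for candidate new families.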