I have a set of organisms, and I want an approximation of the distance matrix between them (I don't need the tree). My plan was to take several COGs and concatenate them into a single sequence per organism, which, after alignment, would give me the p-distance (or any other distance derived from it).
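For concreteness, here is a minimal sketch of the p-distance between two aligned sequences, i.e. the proportion of compared sites at which they differ. The function name `p_distance` and the choice to skip any column where either sequence has a gap are assumptions made for illustration, not something fixed by the question.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption;
    other gap treatments are possible).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # site not comparable for this pair
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        return math.nan  # no shared sites: distance is undefined
    return mismatches / compared


# Four comparable sites (the last column has a gap), one mismatch.
print(p_distance("ACDE-", "ACDFG"))
```

Returning NaN when no sites are shared keeps "undefined" distinct from a true distance of 0.0.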
The problem is that in the COG dataset I am using, there is no "universal COG". That is to say, for every COG there are a few organisms which are left out. One option is to ignore the organisms which are left out of at least one COG and work with the rest. Another option, which I've just thought of, is to build a distance matrix for each separate COG, and finally compute the average matrix of all the n individual results. Obviously, if a distance is not defined in one of the matrices, I would add 0.0 to the sum and divide by n-1 instead of by n.
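The averaging idea could be sketched roughly as follows. This assumes each per-COG matrix is a NumPy array over the same organism ordering, with NaN marking pairs for which that COG gives no distance; the function name `average_distance_matrices` is hypothetical. Each entry is divided by the number of COGs in which that pair is actually defined, which generalises the "divide by n-1" rule to any number of missing COGs.

```python
import numpy as np


def average_distance_matrices(matrices):
    """Entry-wise mean of per-COG distance matrices, ignoring NaN entries.

    Returns (mean, counts), where counts[i, j] is the number of COGs
    that contributed a distance for the pair (i, j).
    """
    stack = np.stack([np.asarray(m, dtype=float) for m in matrices])
    defined = ~np.isnan(stack)

    counts = defined.sum(axis=0)
    totals = np.where(defined, stack, 0.0).sum(axis=0)  # missing adds 0.0

    # Pairs never observed in any COG stay NaN instead of becoming 0.
    mean = np.full(totals.shape, np.nan)
    np.divide(totals, counts, out=mean, where=counts > 0)
    return mean, counts


nan = np.nan
# Toy example: the third organism is absent from the first COG.
cog1 = [[0.0, 0.2, nan], [0.2, 0.0, nan], [nan, nan, nan]]
cog2 = [[0.0, 0.4, 0.6], [0.4, 0.0, 0.5], [0.6, 0.5, 0.0]]

mean, counts = average_distance_matrices([cog1, cog2])
print(mean)
print(counts)
```

Keeping `counts` alongside the mean makes it easy to see which distances rest on only a handful of COGs.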
I think the solution is very naive, but I still have a question. Do you think this approach is trustworthy? Is it a standard thing (if it is, could you give a reference where it is used)? Would you propose another alternative? Note that I want an approximation of the distance, not a perfectly computed tree.
What kind of organisms do you have? How closely related are they?
I'm assuming you mean three domains (Eukaryotes, Eubacteria, and Archaea)? If this is the case, using COGs may be problematic, as COG is built around bacterial representation. You could use OrthoMCL definitions, but they are split on a finer scale than you may want when it comes to co-orthologs, inparalogs, etc. If you meant three kingdoms as in Plants, Metazoa, and Fungi, they aren't THAT unrelated. I'd recommend Homologene if that is the case.
Many organisms (>300, including all classical model organisms), from the three kingdoms (and therefore very unrelated).
I mean domains, yes. In Ciccarelli et al., "Toward automatic reconstruction of a highly resolved tree of life", Science. 2006 May 5;312(5774):697, they use a set of 31 COGs (not KOGs) to build a "universal" tree of life for around 100 species. The tree seems to be very accurate compared with previous findings and beliefs, and it's a highly cited paper (>500). What I basically want to do is the same, though maybe not as accurately (a rough approximation should be good enough). My problem is that there is no single COG covering all the species of my dataset, hence my initial question :)...
Michael, basically all STRING core species :).
The lack of a universal COG isn't a problem; it is dealt with in phylogenomic analyses all the time these days.