Phylogenetic Distance From Incomplete Dataset
13.4 years ago
Alf ▴ 490

I have a set of organisms, and I want an approximation of the distance matrix between them (I don't need the tree). My plan was to take several COGs and concatenate them into single sequences, which, after alignment, would give me the p-distance (or any other distance derived from it).

The problem is that in the COG dataset I am using, there is no "universal COG". That is to say, a few organisms are missing from every COG. One option is to ignore the organisms that are missing from at least one COG and work with the rest. Another option, which I've just thought of, is to build a distance matrix for each COG separately and then compute the average of the n individual matrices. Obviously, if a distance is not defined in one of the matrices, I would add 0.0 to the sum for that pair and divide by n-1 instead of by n.
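
The averaging idea above can be sketched roughly as follows. This is only a minimal illustration, not an established method: the function name `average_distances`, the `pair` helper, and the toy organisms and values are all made up for the example. It generalizes the "divide by n-1" rule by dividing each pair's sum by the number of COGs in which that pair actually has a distance.

```python
from typing import Dict, FrozenSet, List

# One distance table per COG. A pair with no defined distance simply has no key.
# Keys are frozensets so that (a, b) and (b, a) refer to the same entry.
Pair = FrozenSet[str]


def pair(a: str, b: str) -> Pair:
    """Build an order-independent key for two organism names."""
    return frozenset((a, b))


def average_distances(per_cog: List[Dict[Pair, float]]) -> Dict[Pair, float]:
    """Average each pairwise distance over the COGs in which it is defined."""
    totals: Dict[Pair, float] = {}
    counts: Dict[Pair, int] = {}
    for table in per_cog:
        for key, dist in table.items():
            totals[key] = totals.get(key, 0.0) + dist
            counts[key] = counts.get(key, 0) + 1
    # Pairs that are undefined in every COG are absent from the result.
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical toy data: "Human" is missing from the second COG.
cog_a = {
    pair("E.coli", "Yeast"): 0.60,
    pair("E.coli", "Human"): 0.70,
    pair("Yeast", "Human"): 0.40,
}
cog_b = {pair("E.coli", "Yeast"): 0.50}

avg = average_distances([cog_a, cog_b])
```

Here the E.coli/Yeast distance is averaged over two COGs, while the pairs involving Human come from the single COG that contains it. Pairs supported by very few COGs are noisier, so it may be worth keeping the per-pair counts as well.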

I think the solution is very naive, but I still have a question. Do you think this approach is trustworthy? Is it a standard thing (if so, could you give a reference where it is used)? Would you propose another alternative? Note that I want an approximation of the distance, not a perfectly computed tree.

phylogenetics distance • 4.0k views

What kind of organisms do you have? How closely related are they?

I'm assuming you mean the three domains (Eukaryotes, Eubacteria, and Archaea)? If so, using COGs may be problematic, as COG is built around bacterial representation. You could use OrthoMCL definitions, but they are split on a finer scale than you may want when it comes to co-orthologs, inparalogs, etc. If you meant three kingdoms as in Plants, Metazoa, and Fungi, they aren't THAT unrelated. I'd recommend HomoloGene if that is the case.

Many organisms (>300, including all classical model organisms), from the three kingdoms (therefore, very unrelated)

I mean domains, yes. In Ciccarelli et al., "Toward automatic reconstruction of a highly resolved tree of life", Science. 2006 May 5;312(5774):697, they use a set of 31 COGs (not KOGs) for building a "universal" tree of life for around 100 species. The tree seems to be very accurate compared with previous findings and beliefs, and it's a highly cited paper (>500). What I basically want to do is the same, though maybe not as accurately (a rough approximation should be good enough). My problem is that there is no single COG covering all the species of my dataset, hence my initial question :)...

Michael, basically all STRING core species :).

The lack of a universal COG isn't a problem; it is routinely dealt with in phylogenomic analyses.

13.4 years ago
DG 7.3k

I think the solution you propose is reasonable, as it is an approximation. It's really not that different from how joint estimation of branch lengths is done with full trees on concatenated alignments/supermatrices where you have missing data.

Depending on the distance metric you want to use, it may already handle missing data if you concatenate all the sequences together, using gap characters where a taxon doesn't have a gene as part of that COG.
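
As a rough illustration of that point, a p-distance with pairwise deletion simply skips any alignment column where either sequence has a gap, so a taxon that lacks a whole gene is compared only on the genes it does have. The sketch below is an assumption-laden toy: the function name `p_distance`, the choice of `-` and `?` as missing-data symbols, and the short sequences are invented for the example and do not come from any particular package.

```python
from typing import Optional


def p_distance(seq_a: str, seq_b: str, missing: str = "-?") -> Optional[float]:
    """Proportion of differing sites, ignoring columns with missing data.

    Returns None when the two sequences share no comparable column.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in missing or b in missing:
            continue  # pairwise deletion: skip this column for this pair only
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else None


# Hypothetical concatenation of two aligned genes; sp2 lacks the second gene,
# so that block is filled with gap characters.
sp1 = "MKTAYIAK" + "QRQISF"
sp2 = "MKSAYLAK" + "------"
sp3 = "MKTAYIAR" + "QRQLSF"

d12 = p_distance(sp1, sp2)  # compared on the first gene only (8 columns)
d13 = p_distance(sp1, sp3)  # compared on both genes (14 columns)
```

Note that distances computed this way rest on different numbers of sites for different pairs, which is the same caveat as averaging per-COG matrices.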

13.4 years ago
Lyco ★ 2.3k

Is there a particular reason why you want to use COGs? Usually, when people want to make species trees they focus on ribosomal RNAs, which are present in all organisms and are clearly related and alignable, even over large evolutionary distances. Actually, there are quite a few rRNA databases that mainly serve this purpose (http://www.arb-silva.de or http://rdp.cme.msu.edu).

There are situations where using protein is appropriate. If you strictly need a protein-based distance matrix, you should focus on proteins that are found everywhere, e.g. ribosomal core subunits.

Why would it be difficult to get the rRNA sequence, or why should it be harder to get the rRNA than to get a protein sequence? The rRNAs are by far the easiest message to detect. Or are you talking about metagenomics data? But then, your multi-COG approach would also be impossible.

I'm not sure why you would have a protein fasta file where you don't know what species the sequence belongs to. If it is public data and it just happens to be missing from the file you obtained/were given, a simple BLASTP search to find the identical record in NCBI is trivial and would give you the taxonomic assignment.

As for Lyco's question of why to use COGs in the first place versus the rRNA distance, COGs seem a natural choice for ortholog clusters in bacteria, and doing a distance based on a concatenated set of COGs would give you a more robust distance in a more phylogenomic context.

Generate clusters of homologous sequences, pre-build trees using something like FastTree or RAxML, and go from there. Either way, you're going to have to do some sort of profile-based searching to figure out which clusters of sequences the user-supplied sequences match; you can then calculate p-distances just for those matches and average them if you don't want to make trees.

If you're adding a new species to a set of reference species, I'd use a given, more exact tree (like the Ciccarelli tree), figure out the closest neighbor in this tree, and use this as a proxy. Building trees is hard; e.g., see this nice review: http://www.biology-direct.com/content/6/1/32
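
One way to read the nearest-neighbor proxy suggestion is sketched below, under several assumptions: the reference distances are available as a nested dictionary (for example, derived from a trusted tree), the new species has at least a few rough distances to reference species, and the name `proxy_from_nearest` and all values are hypothetical.

```python
from typing import Dict, Tuple


def proxy_from_nearest(
    new_to_ref: Dict[str, float],
    ref_matrix: Dict[str, Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Pick the closest reference species and borrow its distance row."""
    if not new_to_ref:
        raise ValueError("need at least one distance to a reference species")
    # The reference with the smallest rough distance to the new species.
    nearest = min(new_to_ref, key=new_to_ref.get)
    # Copy the neighbor's row so the reference matrix is left untouched.
    proxy = dict(ref_matrix[nearest])
    # Design choice for this sketch: keep the measured distance to the
    # neighbor itself instead of the 0.0 found on the matrix diagonal.
    proxy[nearest] = new_to_ref[nearest]
    return nearest, proxy


# Hypothetical reference matrix and partial distances for a new species.
ref_matrix = {
    "E.coli": {"E.coli": 0.0, "Yeast": 0.6, "Human": 0.7},
    "Yeast": {"E.coli": 0.6, "Yeast": 0.0, "Human": 0.4},
    "Human": {"E.coli": 0.7, "Yeast": 0.4, "Human": 0.0},
}
new_to_ref = {"E.coli": 0.65, "Yeast": 0.20}  # no estimate for Human

nearest, proxy = proxy_from_nearest(new_to_ref, ref_matrix)
```

In this toy case the new species gets a distance to Human even though none was measured, because it inherits the value from its closest reference.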

The idea is to add new species afterwards, in many cases unknown ones (just add the fasta file). So, getting the rRNA is not easy, but I can get a COG if there is a mutual best match in the COG database (again approximately)... It is probable that there is not a good candidate for some of the COGs, so the new guy would only have distances in some of the matrices...
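
The "mutual best match" step could look roughly like this, assuming the similarity scores (for instance, bit scores from a search in each direction) have already been collected into dictionaries. The function names, protein labels, COG member labels, and scores are all hypothetical placeholders.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # query -> {subject: similarity score}


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (queries with no hits are skipped)."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def reciprocal_best_hits(forward: Scores, reverse: Scores) -> Dict[str, str]:
    """Keep query/subject pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}


# Hypothetical scores: new proteins searched against COG members, and back.
forward = {
    "newP1": {"COG0012_ecoli": 310.0, "COG0016_ecoli": 95.0},
    "newP2": {"COG0016_ecoli": 120.0},
    "newP3": {"COG0012_ecoli": 80.0},
}
reverse = {
    "COG0012_ecoli": {"newP1": 305.0, "newP3": 78.0},
    "COG0016_ecoli": {"newP2": 118.0, "newP1": 90.0},
}

rbh = reciprocal_best_hits(forward, reverse)
```

Here `newP3` is dropped because its best COG member prefers `newP1`, which is the situation where the new species ends up with no candidate for a COG and therefore no distance in that matrix.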

It's not metagenomics, for sure. The scenario is the following: imagine I got a protein fasta file, and I don't know which species it belongs to. Maybe I am missing something, but how do I get the rRNA? Sorry for the newbie questions :)

The thing is that I want to build a server for a certain algorithm which uses, as a parameter, the phylogenetic distance of the newly added species to all of the species for which I have data in the server. So, if I use a newly sequenced species, for example, I may have no access to rRNA. And even if I had, I wouldn't like to ask the user for the name of the species. I would prefer an automatic distance computation that depends only on the sequence given by the user.
