I'm having a hard time finding a solution to a problem with a set of phylogenetic trees. I'm getting a Newick tree from an online database, and I need to pare it down to match an alignment that I created myself. The tree's tip labels are GenBank accessions in this format: ACCESSION.1.XXXX. The sequences in my alignment are also labeled with GenBank accessions, but with only the ACCESSION portion of the above.
The simplest way to filter the tree to match my alignment is to strip the '.1.XXXX' portion from the tree tip names, and then prune the tree to remove accessions not present in the alignment. This is easy to achieve with existing tools (bash, QIIME, etc.).
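To make the strip-and-filter step concrete, here is a minimal Python sketch. It assumes the accession itself never contains a dot, so everything before the first dot is the ACCESSION portion. The function names and the example labels are made up for illustration, and the actual pruning is left to whatever tree tool is used.

```python
def base_accession(tip_label):
    """Return the part of a tip label before the first dot.

    Assumption: labels look like ACCESSION.1.XXXX and the accession
    itself contains no dot.
    """
    return tip_label.split(".", 1)[0]


def tips_to_keep(tip_labels, alignment_ids):
    """Return the full tip labels whose accession occurs in the alignment."""
    wanted = set(alignment_ids)
    return [label for label in tip_labels if base_accession(label) in wanted]


# Hypothetical labels, for illustration only.
tips = ["AB000001.1.1500", "AB000002.1.1432", "AB000001.1.1498"]
alignment = ["AB000001"]

# Both AB000001 tips survive the filter, which is where duplicates come from.
print(tips_to_keep(tips, alignment))
```

Filtering on the full labels rather than renaming the tips first keeps every tip uniquely identifiable until a decision about duplicates has been made.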
The problem is that removing the last portion of the tree tip name results in tree tips with non-unique labels. I'd like to figure out how to trim the tree so that only one tip remains for each duplicated label.
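One way to see which accessions would collide is to group the full tip labels by their stripped form before renaming anything. This is only a sketch under the same assumption as above (the accession is everything before the first dot); `find_collisions` and the example labels are hypothetical.

```python
from collections import defaultdict


def find_collisions(tip_labels):
    """Map each accession to its full tip labels, keeping only accessions
    that occur more than once after stripping the suffix."""
    groups = defaultdict(list)
    for label in tip_labels:
        accession = label.split(".", 1)[0]  # assumes no dot inside the accession
        groups[accession].append(label)
    return {acc: labels for acc, labels in groups.items() if len(labels) > 1}


# Hypothetical labels, for illustration only.
tips = ["AB000001.1.1500", "AB000002.1.1432", "AB000001.1.1498"]
print(find_collisions(tips))
```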
I'd have to establish some rules for choosing which tips to preserve and which to delete, but for now I think I'd just prefer to keep the one with the highest confidence value. I can always adjust later if needed once I have some kind of framework to do the pruning in the first place.
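A rough framework for the selection rule might look like the sketch below. It assumes a confidence value has already been obtained for every tip, supplied here as a plain dictionary. How that value is read from the tree is left open: Newick usually stores support values on internal nodes, so it might, for example, be the support of a tip's parent node. The function name, the tie-breaking behaviour (first tip seen wins) and the example values are all illustrative assumptions.

```python
def choose_representatives(tip_confidence):
    """Pick one tip per accession: the one with the highest confidence.

    tip_confidence maps full tip labels (ACCESSION.1.XXXX) to a number.
    Returns (keep, drop) as sorted lists of full tip labels.
    On a tie, the tip encountered first is kept.
    """
    best = {}  # accession -> (full label, confidence)
    for label, confidence in tip_confidence.items():
        accession = label.split(".", 1)[0]  # assumes no dot inside the accession
        if accession not in best or confidence > best[accession][1]:
            best[accession] = (label, confidence)
    keep = sorted(label for label, _ in best.values())
    drop = sorted(set(tip_confidence) - set(keep))
    return keep, drop


# Hypothetical confidence values, for illustration only.
tips = {
    "AB000001.1.1500": 0.71,
    "AB000001.1.1498": 0.93,
    "AB000002.1.1432": 0.88,
}
keep, drop = choose_representatives(tips)
print(keep)  # ['AB000001.1.1498', 'AB000002.1.1432']
print(drop)  # ['AB000001.1.1500']
```

The `drop` list can then be passed to any pruning tool while the tips still carry their full, unique labels, and the suffix can be stripped afterwards. Swapping in a different rule later only means changing the comparison inside the loop.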