How To Replace Sequence IDs In A Text (Tree) File With Taxonomy Strings From A Corresponding Tab-Delimited Taxonomy File
11.4 years ago

Hi!

I have a phylogenetic tree file, which is basically a text file that looks like this:

((((((EF019104:0.08997,(EU135350:0.05132,((EU135320:0.04788,FJ592807:0.04392)0.465:0.01790,(EU134780:0.05467,(EU135316:0.03496,((EU135328:0.04185,((AY456902:0.01911,HM445090:0.00744)0.991:0.01269,(EU135341:0.03902,((EU135339:0.01451,EU135345:0.01593)0.726:0.00145,JN580049:0.05998)0.771:0.00197)0.783:0.00175)0.862:0.00467)0.689:0.01098,HM445243:0.03376)0.871:0.01490)0.701:0.00326)0.813:0.01180)0.818:0.00986)0.954:0.02106)0.942:0.01986,(HM187318:0.10108,(HM186778:0.03166,(EU132538:0.05510,(

Next I have a corresponding taxonomy file that looks like this:

JN178341    Bacteria;__Verrucomicrobia;__OPB35_soil_group;__o;__f;__g
GQ898616    Bacteria;__Firmicutes;__Clostridia;__Clostridiales;__Ruminococcaceae;__Incertae_Sedis

The identifiers in the first column of the tab-delimited taxonomy file also appear in the tree file. The goal is to replace each identifier in the tree file with the corresponding taxonomy string from the taxonomy file.

I'm guessing that this isn't a complicated problem for experienced bioinformaticians, but I'm currently drawing a blank here.

Any help would be greatly appreciated!

perl taxonomy identifiers tree search • 7.5k views
11.4 years ago
Kenosis ★ 1.3k

Given your data sets, perhaps the following will help:

use strict;
use warnings;

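# The tree file is the last argument; set it aside so that <> reads
# only the taxonomy file first.
my $treeFile = pop;

# Map each identifier to its taxonomy string, skipping lines that do not match.
my %taxonomy = map { /^(\S+)\s+(.+)/ ? ( $1 => $2 ) : () } <>;

# Now let <> read the tree file.
push @ARGV, $treeFile;

while ( my $line = <> ) {
    # \Q...\E makes each identifier match literally.
    $line =~ s/\b\Q$_\E\b/$taxonomy{$_}/g for keys %taxonomy;
    print $line;
}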

Usage: perl script.pl taxonomyFile phylogeneticTreeFile [>outFile]

The optional redirection at the end sends the output to a file instead of the terminal.

First, the tree file name is popped off (the implicit) @ARGV and saved for later. Then the taxonomy file is read into a hash, where the identifiers become the keys and the taxonomy strings become the values. Next, the tree file is read a line at a time, and for each line, every identifier key found in that line is globally replaced with its associated value. Finally, the line is printed.
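
For large taxonomy tables, running one substitution per key on every line can get slow. As a rough sketch of the same hash-lookup idea done in a single pass, here is a Python version. The names `load_taxonomy`, `relabel`, and `LEAF` are made up for illustration, and the pattern assumes leaf labels consist only of letters, digits, `_`, or `.` and directly follow `(` or `,` with no whitespace in between.

```python
import re


def load_taxonomy(lines):
    """Build an {identifier: taxonomy} dict from whitespace-delimited lines."""
    taxonomy = {}
    for line in lines:
        # First field is the identifier, the rest of the line is the taxonomy.
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            taxonomy[parts[0]] = parts[1]
    return taxonomy


# Assumed label shape: a run of letters, digits, '_' or '.' that directly
# follows '(' or ','. Support values follow ')' and branch lengths follow ':',
# so neither is matched. Adjust this to your own identifiers.
LEAF = re.compile(r"(?<=[(,])[A-Za-z0-9_.]+")


def relabel(newick, taxonomy):
    """Replace every known leaf label in one pass; unknown labels are kept."""
    return LEAF.sub(lambda m: taxonomy.get(m.group(0), m.group(0)), newick)
```

Calling `relabel(tree_text, load_taxonomy(open("taxonomyFile")))` would return the relabelled tree as a string. Each label is looked up once in the dictionary, so the cost no longer grows with the number of taxonomy entries.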


Works perfectly! Thanks


perfect! Thanks much!


Works perfectly. Thank you!

11.4 years ago
jhc ★ 3.0k

You could use any of the phyloinformatics toolkits available. A possible approach using Python and the ETE toolkit would look like this:

    from ete3 import Tree  # ETE 3, which runs on Python 3

    # Load the taxonomy table (assuming the ID and taxonomy columns
    # are tab delimited) as a Python dictionary
    with open("taxonomyTable.tab") as table:
        name2tax = dict(
            [field.strip() for field in line.split("\t", 1)]
            for line in table if line.strip()
        )

    # Load your tree
    t = Tree("myTreeFile.nw")
    print(t)
    #     /-JN178341
    #----|
    #     \-GQ898616

    # Replace leaf names, keeping any name that is missing from the table
    for leaf in t.iter_leaves():
        leaf.name = name2tax.get(leaf.name, leaf.name)

    # Check that everything is OK
    print(t)
    t.show()  # opens the interactive viewer (needs a GUI)
    #     /-Bacteria;__Verrucomicrobia;__OPB35_soil_group;__o;__f;__g
    #----|
    #     \-Bacteria;__Firmicutes;__Clostridia;__Clostridiales;__Ruminococcaceae;__Incertae_Sedis

    # Export Newick (note that you may need to replace some special
    # characters in node names, e.g. ";")
    t.write(outfile="newTree.nw")
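
In Newick, `;` ends the tree, and parentheses, commas, colons, brackets, quotes, and whitespace are structural as well, so taxonomy strings like the ones above may need cleaning before export. Below is a minimal sketch of one way to do that. The helper `newick_safe` and the choice of `|` as the replacement are illustrative assumptions, not part of ETE.

```python
import re

# Characters with structural meaning in Newick, plus whitespace and quotes.
RESERVED = re.compile(r"[\s()\[\],:;'\"]+")


def newick_safe(label, replacement="|"):
    """Collapse each run of reserved characters into one replacement string."""
    cleaned = RESERVED.sub(replacement, label)
    # Drop replacement characters left dangling at either end of the label.
    return cleaned.strip(replacement)
```

It could be applied while renaming, for example `leaf.name = newick_safe(name2tax[leaf.name])`, so that the exported file parses cleanly in other tools.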

Interesting! Seems like something I should check out, this ETE toolkit. Thank you
