Newick 2 Json Converter (Preferably In Perl)
12.4 years ago
Fabsta ▴ 120

Hi! Does anyone know a Perl snippet that converts a tree in Newick format into a JSON string?

I had a look at Bio::Phylo, but could not really find a solution.

Any help is much appreciated.

An example Newick string could be (sorry for the length):

(Capsaspora_owczarzaki,(Proterospongia,Monosiga_brevicollis)Codonosigidae,(Amphimedon_queenslandica,Trichoplax_adhaerens,(((((((((((((Tupaia_belangeri,((Cavia_porcellus,(Ictidomys_tridecemlineatus,(Rattus_norvegicus,Mus_musculus)Murinae,Dipodomys_ordii)Sciurognathi)Rodentia,(Oryctolagus_cuniculus,Ochotona_princeps)Lagomorpha)Glires,((Otolemur_garnettii,Microcebus_murinus)Strepsirrhini,((((Nomascus_leucogenys,(Pongo_abelii,(Homo_sapiens,Pan_troglodytes,Gorilla_gorilla)Homininae)Hominidae)Hominoidea,Macaca_mulatta)Catarrhini,Callithrix_jacchus)Simiiformes,Tarsius_syrichta)Haplorrhini)Primates)Euarchontoglires,(Procavia_capensis,Loxodonta_africana,Echinops_telfairi)Afrotheria,((Pteropus_vampyrus,Myotis_lucifugus)Chiroptera,Equus_caballus,(Vicugna_pacos,Bos_taurus,Sus_scrofa,Tursiops_truncatus)Cetartiodactyla,(Felis_catus,(Ailuropoda_melanoleuca,Canis_lupus_familiaris)Caniformia)Carnivora,(Sorex_araneus,Erinaceus_europaeus)Insectivora)Laurasiatheria,(Dasypus_novemcinctus,Choloepus_hoffmanni)Xenarthra)Eutheria,(Monodelphis_domestica,Macropus_eugenii,Sarcophilus_harrisii)Metatheria)Theria,Ornithorhynchus_anatinus)Mammalia,(Anolis_carolinensis,(Taeniopygia_guttata,(Meleagris_gallopavo,Gallus_gallus)Phasianidae)Neognathae)Sauria)Amniota,Xenopus_tropicalis)Tetrapoda,((((Tetraodon_nigroviridis,Takifugu_rubripes)Tetraodontidae,(Gasterosteus_aculeatus,Oryzias_latipes)Smegmamorpha)Percomorpha,Gadus_morhua)Holacanthopterygii,Danio_rerio)Clupeocephala)Euteleostomi,Petromyzon_marinus)Vertebrata,Branchiostoma_floridae,(Ciona_savignyi,Ciona_intestinalis)Ciona)Chordata,Strongylocentrotus_purpuratus)Deuterostomia,(Lottia_gigantea,(Ixodes_scapularis,((((Atta_cephalotes,Apis_mellifera)Aculeata,(((Drosophila_virilis,Drosophila_mojavensis)Drosophila,Drosophila_grimshawi,(Drosophila_willistoni,(Drosophila_pseudoobscura,Drosophila_persimilis)pseudoobscura_subgroup,((Drosophila_yakuba,Drosophila_simulans,Drosophila_sechellia,Drosophila_melanogaster,Drosophila_erecta)melanogaster_subgroup,Drosophila_ananassae)melanogaster_group)Sophophora)Drosophila,(Anopheles_gambiae,(Culex_quinquefasciatus,Aedes_aegypti)Culicinae)Culicidae)Diptera,Bombyx_mori,Tribolium_castaneum)Endopterygota,(Pediculus_humanus,Acyrthosiphon_pisum)Paraneoptera)Neoptera,(Parhyale_hawaiensis,Daphnia_pulex)Crustacea)Pancrustacea)Arthropoda,(Capitella_teleta,Helobdella_robusta)Annelida)Protostomia)Coelomata,(Pristionchus_pacificus,(Caenorhabditis_japonica,Caenorhabditis_brenneri,Caenorhabditis_remanei,Caenorhabditis_elegans,Caenorhabditis_briggsae)Caenorhabditis)Chromadorea,Schistosoma_mansoni)Bilateria,(Nematostella_vectensis,Hydra_magnipapillata)Cnidaria)Eumetazoa)Metazoa,(Spizellomyces_punctatus,Allomyces_macrogynus,Saccharomyces_cerevisiae,Phycomyces_blakesleeanus)Fungi)Opisthokonta;

Thanks a lot in advance, Fabian

JSON is just a general data structure specification, whereas Newick is used specifically for trees. I don't think there are any standardized rules for representing Newick in JSON, so you'll have to tailor something to how you want to use the JSON data. Why do you want to convert it to JSON?
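
To make the point about layouts concrete, here is a minimal Python sketch of two equally plausible ways to encode the small tree (A,B,(C,D)E)F; as JSON. Both layouts and the helper names `leaves_explicit` and `leaves_compact` are assumptions chosen for illustration, not an established convention.

```python
import json

# Assumed layout 1: every node is an object with explicit keys.
explicit = {
    "name": "F",
    "children": [
        {"name": "A"},
        {"name": "B"},
        {"name": "E", "children": [{"name": "C"}, {"name": "D"}]},
    ],
}

# Assumed layout 2: an internal node is a one-key object mapping its
# label to a list of children; a leaf is a bare string.
compact = {"F": ["A", "B", {"E": ["C", "D"]}]}


def leaves_explicit(node):
    """Collect leaf names from the explicit layout, left to right."""
    if "children" not in node:
        return [node["name"]]
    return [leaf for child in node["children"] for leaf in leaves_explicit(child)]


def leaves_compact(node):
    """Collect leaf names from the compact layout, left to right."""
    if isinstance(node, str):
        return [node]
    (children,) = node.values()  # exactly one key per internal node
    return [leaf for child in children for leaf in leaves_compact(child)]


print(json.dumps(explicit))
print(json.dumps(compact))
print(leaves_explicit(explicit) == leaves_compact(compact))
```

Both objects describe the same topology; which one is preferable depends on what consumes the JSON.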

Sounds fun. Do you have a sample file, please?

I smell a round of code golf happening...

Yes, I updated the post. Looking forward to a solution :-)

How do you expect the JSON to look?

12.4 years ago
asjo ▴ 120

Here is an attempt, with the caveat that your example is really complicated, and that you haven't really defined what the JSON result should look like:

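#!/usr/bin/perl

use strict;
use warnings;

use Bio::Phylo::IO;
use JSON;

# Parse every tree in the Newick file into a forest object.
my $forest = Bio::Phylo::IO->parse(-file => "example.newick", -format => "newick");

while (my $tree = $forest->next) {
    my $out      = [];      # top-level list that receives the root node
    my $children = $out;    # list that newly visited nodes are pushed onto
    my $cur;                # most recently visited node
    $tree->visit_breadth_first(
        # On visiting a node: record its name and attach it to the current list.
        -pre => sub { $cur = { name => shift->get_name }; push @$children, $cur; },
        # Before descending to daughters: start a children list on the latest node.
        -pre_daughter => sub { $cur->{children} = []; $children = $cur->{children} },
    );
    print JSON->new->pretty->encode($out);
}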

If I run it on a smaller example (from Wikipedia's entry on the Newick format):

(A,B,(C,D)E)F;

I get this:

[
   {
      "name" : "F",
      "children" : [
         {
            "name" : "A"
         },
         {
            "name" : "B"
         },
         {
            "name" : "E",
            "children" : [
               {
                  "name" : "C"
               },
               {
                  "name" : "D"
               }
            ]
         }
      ]
   }
]

But again, I don't know how you want the JSON output formatted.
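
For readers without Bio::Phylo at hand, the same name/children layout can be sketched with a small recursive-descent parser. This is a Python illustration under stated assumptions, not a port of the Perl above: the function name `newick_to_dict` is made up, and the parser only handles plain labels (no branch lengths, quoted labels or comments).

```python
import json


def newick_to_dict(text):
    """Parse a label-only Newick string into nested name/children dicts."""
    text = text.strip().rstrip(";")
    pos = 0  # current read position in text

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # this clade is complete
                    break
                else:
                    raise ValueError("unexpected character at %d" % pos)
        # The label runs up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        node = {"name": text[start:pos].strip()}
        if children:
            node["children"] = children
        return node

    root = parse_node()
    if pos != len(text):
        raise ValueError("trailing characters after the root node")
    return root


print(json.dumps(newick_to_dict("(A,B,(C,D)E)F;"), indent=3))
```

Unlike the Perl output, this returns the root object itself rather than a one-element list; wrap the result in a list if the consumer expects one.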

Thanks a lot, asjo, for the quick and elegant answer. The output format is exactly what I need.

12.4 years ago

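What's great about using Python to output JSON is that native Python lists and dictionaries map directly onto JSON arrays and objects. Note, however, that print str(myStructure) gives the Python repr, which uses single-quoted strings and is therefore not strictly valid JSON; json.dumps(myStructure) from the standard library produces conforming output.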
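
One caveat is easy to check directly: `str()` on a dictionary yields Python's repr with single quotes, which a strict JSON parser rejects, while `json.dumps` emits double-quoted JSON. A short sketch (the variable names are illustrative):

```python
import json

tree = {"F": ["A", "B", {"E": ["C", "D"]}]}

as_repr = str(tree)         # Python repr: single-quoted strings
as_json = json.dumps(tree)  # JSON: double-quoted strings

# Try to read the repr back with a strict JSON parser.
try:
    json.loads(as_repr)
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

print(as_repr)
print(as_json)
print("repr parses as JSON:", repr_is_json)
```

The round trip through `json.loads(as_json)` gives back the original structure, which is the property a downstream JSON consumer relies on.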

As I said previously, I am not sure how you want the JSON to look, as there are no standardized rules for writing Newick in JSON, so I just made it output a simple key:value structure. For your sample, the output would be something like this (I took out a bunch of data in the middle so I don't go over the post character limit):

{'Opisthokonta': ['Capsasporaowczarzaki', {'Codonosigidae': ['Proterospongia', 'Monosigabrevicollis']}, {'Metazoa': .....[bunch of stuff]}, {'Fungi': ['  Spizellomycespunctatus', 'Allomycesmacrogynus', 'Saccharomycescerevisiae', 'Phycomyces_blakesleeanus']}]}

Here is something in Python without using BioPython (yes, I was bored):

Edit: This is just for fun. Use the BioPython/BioPerl solutions from the other answers if you want accurate results.

I'm not sure it works 100% with everything; however, it does work with your sample. It requires a root node.

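import json

def parseNode(nwString):
    """Split "(child1,child2,...)label" into (label, [child strings])."""
    parenCount = 0
    key = ''
    processed = ''
    for index, char in enumerate(nwString):
        if char == "(":
            parenCount += 1
            if parenCount == 1:
                continue  # skip the opening parenthesis of this node
        elif char == ")":
            parenCount -= 1
            if parenCount == 0:
                # everything after the closing parenthesis is the node label
                key = nwString[index + 1:]
                break

        if char == "," and parenCount != 1:
            # mask commas inside nested clades so the split below ignores them
            processed += "|"
        else:
            processed += char

    data = [item.replace('|', ',') for item in processed.split(',')]

    return (key.strip(), data)

def recurseBuild(nwString):
    if nwString.find('(') == -1:
        # no parentheses left: a single leaf or a plain list of leaves
        if len(nwString.split(',')) == 1:
            return nwString
        else:
            return nwString.split(',')
    else:
        key, data = parseNode(nwString)

        dataArray = []
        for item in data:
            dataArray.append(recurseBuild(item))

        return {key: dataArray}

# Placeholder input (small example from Wikipedia); replace with your own tree.
myNewickstring = "(A,B,(C,D)E)F;"

# Drop the terminating semicolon so it does not end up in the root label.
result = recurseBuild(myNewickstring.strip().rstrip(';'))

print(json.dumps(result))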

Thank you very much for sharing.

I adapted these functions to handle the optional branch lengths of the Newick format. Each node/leaf now has three keys: a "label", a "distance" and a "tree" ("tree" contains the nested clades). Code available here: http://pastebin.com/Pk717Uc2
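
The idea of carrying branch lengths along can be sketched as follows. This is not the linked pastebin code, whose contents are not reproduced here; it is an independent, stack-based illustration that assumes the three keys named above, a hypothetical function name `newick_with_lengths`, and `None` as the distance of nodes without a length.

```python
import json
import re

# Structural characters are single tokens; anything else is "label[:length]".
TOKEN = re.compile(r"[(),;]|[^(),;]+")


def _new_node():
    return {"label": "", "distance": None, "tree": []}


def newick_with_lengths(text):
    """Parse Newick with optional branch lengths using an explicit stack."""
    root = _new_node()
    current = root
    stack = []  # ancestors of the node currently being filled in
    for token in TOKEN.findall(text.strip()):
        if token == "(":
            child = _new_node()
            current["tree"].append(child)
            stack.append(current)
            current = child
        elif token == ",":
            if not stack:
                raise ValueError("comma outside of any clade")
            sibling = _new_node()
            stack[-1]["tree"].append(sibling)
            current = sibling
        elif token == ")":
            if not stack:
                raise ValueError("unbalanced parentheses")
            current = stack.pop()  # label/length that follow belong to the parent
        elif token == ";":
            break
        else:
            label, _, length = token.partition(":")
            current["label"] = label.strip()
            current["distance"] = float(length) if length.strip() else None
    if stack:
        raise ValueError("unbalanced parentheses")
    return root


print(json.dumps(newick_with_lengths("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"), indent=2))
```

An explicit stack avoids deep recursion on very unbalanced trees; quoted labels and bracketed comments are deliberately left out of this sketch.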

12.4 years ago

Here is a C lex/yacc solution: https://gist.github.com/3056165

The bison parser: https://raw.github.com/gist/3056165/54220d50bcad1a72462bdb00dc258d6d472c7cfd/newick.y

The flex lexer: https://raw.github.com/gist/3056165/407f2924a533793b33c9c5b0d144ca7e1dd24e31/newick.l

The Makefile:

all:
	bison -d newick.y
	flex newick.l
	gcc -Wall -O3 newick.tab.c lex.yy.c

Test:

./a.out < input.newick.txt | fold -w 60

{"label":"Opisthokonta","children":[{"label":"Capsasporaowcz
arzaki"},{"label":"Codonosigidae","children":[{"label":"Prot
erospongia"},{"label":"Monosigabrevicollis"}]},{"label":"Met
azoa","children":[{"label":"Amphimedonqueenslandica"},{"labe
l":"Trichoplaxadhaerens"},{"label":"Eumetazoa","children":[{
"label":"Bilateria","children":[{"label":"Coelomata","childr
en":[{"label":"Deuterostomia","children":[{"label":"Chordata
(...)
[{"label":"Caenorhabditisjaponica"},{"label":"Caenorhabditis
brenneri"},{"label":"Caenorhabditisremanei"},{"label":"Caeno
rhabditiselegans"},{"label":"Caenorhabditisbriggsae"}]}]},{"
label":"Schistosomamansoni"}]},{"label":"Cnidaria","children
":[{"label":"Nematostellavectensis"},{"label":"Hydramagnipap
illata"}]}]}]},{"label":"Fungi","children":[{"label":"Spizel
lomycespunctatus"},{"label":"Allomycesmacrogynus"},{"label":
"Saccharomycescerevisiae"},{"label":"Phycomyces_blakesleeanu
s"}]}]}

Speed?

$ time (x=1; while [ $x -le 1000 ]; do  ./a.out < input.newick.txt > /dev/null ; x=$(( $x + 1)); done )

real    0m2.876s
user    0m0.192s
sys    0m0.516s