Question

Automatic Data Extraction From Timetree

1

12.3 years ago

Biojl ★ 1.7k

Does anyone know how to programmatically extract information from http://timetree.org/?

I have to build a 40x40 matrix of species divergence times, and my wrist is starting to hurt because I have to enter all the pairwise combinations manually.

UPDATE: The solutions provided below have stopped working.

evolution tree • 4.9k views

updated 8.8 years ago by Biostar 20 • written 12.3 years ago by Biojl ★ 1.7k

0

Any chance you have or know of a new solution to this problem? Would love to get some of the data off the site.

10.0 years ago by UnivStudent ▴ 440

0

No, sorry. I stopped using timetree.org: without permission to extract data automatically, it is of little use in science, just a curiosity to show friends on a phone.
You can give DateLife.org a try (see the last answer). It didn't work for me, and I don't know whether it's still in development. Test it and report your results!

10.0 years ago by Biojl ★ 1.7k

score 4 · Answer 1 · 2013-05-08

12.3 years ago

Pierre Lindenbaum 166k

Say you have a text file containing a list of organisms:

$ cat input.txt
Homo Sapiens
Drosophila melanogaster
Canis lupus familiaris
Escherichia coli

The following bash script sends a request with curl for each pair and extracts the distance with xmllint's --xpath option:

#!/bin/bash
# Split on newlines only, so multi-word species names stay intact.
IFS="
"
# Encode spaces as "+" for the query string, then loop over all pairs.
tr " " "+" < input.txt | while read -r O1
do
    tr " " "+" < input.txt | while read -r O2
    do
        # Query each unordered pair once (skips self-pairs and mirrored pairs).
        if [[ "${O1}" < "${O2}" ]]
        then
            # Fetch the page, pull the date out of the HTML, and emit an SQL insert.
            curl -s "http://timetree.org/index.php?taxon_a=${O1}&taxon_b=${O2}&submit=Search" |
            xmllint --html --format --xpath 'concat("insert into SPECIES(org1,org2,dist) values (__QUOTE____A____QUOTE__,__QUOTE____B____QUOTE__,__QUOTE__",normalize-space(//span[@class="panel year block"][h1]),"__QUOTE__);#")' - 2> /dev/null |
            tr "#" "\n" |
            sed -e "s/__A__/${O1}/g" |
            sed -e "s/__B__/${O2}/g" |
            sed -e "s/__QUOTE__/'/g" |
            tr "+" " "
        fi
    done
done

Result:

~$ bash organisms.sh 
insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Homo Sapiens','782.7 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Escherichia coli','2535.8 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Homo Sapiens','94.2 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Drosophila melanogaster','782.7 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Escherichia coli','2535.8 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Escherichia coli','Homo Sapiens','2535.8 Million Years Ago');
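
Since the original goal was a 40x40 matrix, pairwise results like those above could be pivoted into a table. The following is only a sketch under a few assumptions: the results were saved as tab-separated lines (org1, org2, distance) rather than SQL statements, the function name pairs_to_matrix is made up for illustration, the diagonal is filled with 0, and any pair without a value is written as NA.

```shell
# Read tab-separated "org1<TAB>org2<TAB>distance" lines on stdin and print a
# symmetric, tab-separated distance matrix with a header row.
pairs_to_matrix() {
    awk -F '\t' '
        # Remember each species name in first-seen order.
        function add(name) {
            if (!(name in seen)) { seen[name] = 1; order[++n] = name }
        }
        # Store every distance in both directions.
        { add($1); add($2); d[$1, $2] = $3; d[$2, $1] = $3 }
        END {
            line = ""
            for (j = 1; j <= n; j++) line = line "\t" order[j]
            print line
            for (i = 1; i <= n; i++) {
                line = order[i]
                for (j = 1; j <= n; j++) {
                    if (i == j) v = 0                      # assumed diagonal value
                    else if ((order[i], order[j]) in d) v = d[order[i], order[j]]
                    else v = "NA"                          # pair was never queried
                    line = line "\t" v
                }
                print line
            }
        }'
}
```

It could be used as: pairs_to_matrix < pairs.tsv > matrix.tsv (both file names are placeholders).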

0

That's awesome! Unfortunately it's not working for me, and I'm trying to figure out why. I suspect the problem is the --xpath argument to xmllint: I don't see it in the manual, and I can't work out what it is supposed to do.

12.3 years ago by Biojl ★ 1.7k

1

$ xmllint --version
xmllint: using libxml version 20708
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib

12.3 years ago by Pierre Lindenbaum 166k

0

OK. Apparently I have version 20706. I'll update it!

12.3 years ago by Biojl ★ 1.7k

0

I'm not sure that will fix it. I have seen some versions of xmllint that lack the '--xpath' argument. But there are many other ways to extract this information: XSLT, /usr/bin/xpath, a simple grep "Million Years", etc.

12.3 years ago by Pierre Lindenbaum 166k
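
The grep route could look roughly like the sketch below. It assumes the result page still contains the date as plain text in the form seen in the script output earlier in this thread (e.g. "782.7 Million Years Ago"). The function name extract_mya is hypothetical, and the page layout may have changed since this thread was written.

```shell
# Read an HTML page on stdin and print the first "<number> Million Years Ago" match.
# -o prints only the matching text (supported by GNU and BSD grep).
extract_mya() {
    grep -o -E '[0-9]+(\.[0-9]+)? Million Years Ago' | head -n 1
}
```

It would be fed from curl, for example: curl -s "<query URL>" | extract_mya. This avoids depending on an xmllint build that has --xpath, at the cost of breaking if the wording on the page changes.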

0

In the end I decided to implement it in Python. It might be slower, but the output is exactly what I want. Your solution was my inspiration, thank you!

12.3 years ago by Biojl ★ 1.7k

score 4 · Answer 2 · 2013-05-08

12.3 years ago

David W 4.9k

There is no official way to automate this process, but check out the URLs:

http://timetree.org/index.php?taxon_a=homo&taxon_b=pongo&submit=Search

It should be straightforward to pick your favourite scripting language, build URLs for each comparison and (maybe with a bit more difficulty) parse out the dates from the resulting pages.

It's just a matter of deciding whether the time spent writing the scripts is worth avoiding the pain in your wrist.
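
As a rough sketch of the URL-building step, the function below reads species names from standard input and prints one query URL per unordered pair, using the URL pattern shown in this answer. The name pair_urls is made up for illustration, and the pattern may no longer match the live site. For 40 species this gives 40 × 39 / 2 = 780 URLs.

```shell
# Read species names (one per line) on stdin and print one TimeTree query URL
# per unordered pair. URL pattern taken from this thread; it may be outdated.
# The body runs in a subshell, so its variables do not leak into the caller.
pair_urls() (
    # Encode spaces as "+" and drop empty lines.
    names=$(tr ' ' '+' | sed '/^$/d')
    i=0
    printf '%s\n' "$names" | while IFS= read -r a; do
        i=$((i + 1))
        # Pair the i-th name only with the names that come after it.
        printf '%s\n' "$names" | tail -n +"$((i + 1))" | while IFS= read -r b; do
            printf 'http://timetree.org/index.php?taxon_a=%s&taxon_b=%s&submit=Search\n' "$a" "$b"
        done
    done
)
```

It could be run as: pair_urls < input.txt > urls.txt (file names are placeholders). Parsing the dates out of each fetched page is a separate step.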

3

Whoops, my scant answer crossed with Pierre's much more complete one. Should change mine to "do what Pierre says" :-)

12.3 years ago by David W 4.9k

score 4 · Answer 3 · 2013-05-08

12.3 years ago

omeara.brian ▴ 50

Note that TimeTree asks that you don't do this; from the bottom of their page: "Currently large scale, automated, data-mining is not permitted". I haven't tested to see if it's possible (I imagine it would be, though an easy thing to do on their end would be to block your IP eventually), but they don't want you to.

We've been building a more open alternative to TimeTree called DateLife.org. It still needs more trees (TimeTree is much better populated) but we encourage scraping, downloading the source, downloading the set of trees, etc. Let me know if you have patches or more trees for it.

0

Very good initiative, I'll take a look. I fail to see why TimeTree does not provide tools to mine their database. To me it's a terrible mistake that discourages researchers from using it.

12.2 years ago by Biojl ★ 1.7k