I have discovered DNA regulatory motifs that are 4-9bp long by fetching ~2kb of DNA upstream of every gene in 21 newly sequenced/assembled species. Then, separately for each species, I used DREME (a faster, short-motif relative of MEME from the MEME Suite) to detect over-represented 6-9mer sequences in each of these 21 "promoteromes".
I have a Newick tree built by comparing the similarity (Euclidean distance) of the Position Weight Matrices (PWMs, also called PSSMs) of the DNA motifs that DREME discovered. I pooled all the discovered motifs from the 21 species into this single tree (similar to what you would do when generating orthologous gene families). An interactive version of the tree is up on iTOL (here), which you can freely play with; just press "update tree" after setting your parameters:
My specific goal: Now that I have a set of ~100 motifs from each of the 21 species, I need to determine which of them are orthologous to one another. That is, I want to generate orthologous "regulatory motif families", so that we can tell which of the motifs discovered independently in each species play the same biological role (have the same function).
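To make the comparison step concrete, here is a minimal sketch of a Euclidean distance between two PWMs. It is not taken from my script: the function names are hypothetical, each PWM is assumed to be a list of columns holding A/C/G/T probabilities, and motifs of different widths are handled by sliding the shorter one along the longer one without gaps and keeping the best offset. Reverse complements are ignored here.

```python
import math


def column_distance(col_a, col_b):
    """Euclidean distance between two PWM columns (A, C, G, T probabilities)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(col_a, col_b)))


def pwm_distance(pwm_a, pwm_b):
    """Smallest mean per-column distance over all ungapped offsets.

    Assumption: the shorter motif must fit entirely inside the longer one.
    """
    short, long_ = sorted((pwm_a, pwm_b), key=len)
    best = float("inf")
    for offset in range(len(long_) - len(short) + 1):
        total = sum(
            column_distance(col, long_[offset + i])
            for i, col in enumerate(short)
        )
        best = min(best, total / len(short))
    return best


# Toy motifs (made-up data): "AC", "TAC" and "TT" as one-hot columns.
motif_ac = [[1, 0, 0, 0], [0, 1, 0, 0]]
motif_tac = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
motif_tt = [[0, 0, 0, 1], [0, 0, 0, 1]]

print(pwm_distance(motif_ac, motif_tac))  # "AC" is contained in "TAC"
print(pwm_distance(motif_ac, motif_tt))
```

Taking the mean over columns keeps short and long motifs on a comparable scale; whether that normalisation is appropriate for orthology calls is itself part of my question.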
My code and data: A link to my Python script is here. I have heavily commented it, and it will generate the tree data and plot above for you (use the arguments d_from, d_to and d_step to explore the distance cut-offs, X). You will need to install ete2, which takes just these two bash commands if you have easy_install and Python:
apt-get install python-setuptools python-numpy python-qt4 python-scipy python-mysqldb python-lxml
easy_install -U ete2
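For readers who do not want to run the full script, the idea of exploring distance cut-offs can be sketched without ete2. This is a simplified stand-in, not what my script does: it assumes pairwise motif distances are already available in a dictionary, groups motifs by single linkage (any two motifs closer than the cut-off end up in the same family), and only borrows the parameter names d_from, d_to and d_step. The motif names and distances are made up.

```python
def group_by_cutoff(names, dist, cutoff):
    """Single-linkage families: join every pair whose distance is <= cutoff."""
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), d in dist.items():
        if d <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


def sweep_cutoffs(names, dist, d_from, d_to, d_step):
    """Return {cutoff: families} for cut-offs d_from, d_from + d_step, ..., d_to."""
    steps = int(round((d_to - d_from) / d_step))
    results = {}
    for i in range(steps + 1):
        cutoff = round(d_from + i * d_step, 10)  # avoid float drift in keys
        results[cutoff] = group_by_cutoff(names, dist, cutoff)
    return results


# Hypothetical motifs from three species and their pairwise distances.
names = ["sp1_m1", "sp2_m1", "sp3_m1", "sp1_m2"]
dist = {
    ("sp1_m1", "sp2_m1"): 0.10,
    ("sp2_m1", "sp3_m1"): 0.30,
    ("sp1_m1", "sp3_m1"): 0.35,
    ("sp1_m1", "sp1_m2"): 0.90,
    ("sp2_m1", "sp1_m2"): 0.95,
    ("sp3_m1", "sp1_m2"): 1.00,
}

results = sweep_cutoffs(names, dist, d_from=0.2, d_to=1.0, d_step=0.4)
for cutoff, families in results.items():
    print(cutoff, len(families), families)
```

Watching how the number of families changes across the sweep is the same exercise as changing X on the tree: too small a cut-off splits true families, too large a one merges unrelated motifs.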