I've done a DNA motif study in which I found motifs for every promoter in my organism of interest, ~20,000 promoters in total. The motif finding was done with MEME, on each promoter individually. This resulted in a large dataset of motifs, many of which are similar and some of which are identical.
My problem is that I would like to create groups of similar motifs, where a group is defined by a threshold on some similarity measure, so that I can do a reasonable comparison of promoter profiles. I've been able to count the identical motifs in my dataset, but for motifs that are not identical, the computation gets quite intensive.
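To make the grouping idea concrete, here is a minimal sketch of threshold-based grouping using single linkage: two motifs end up in the same group if a chain of pairs, each within the threshold, connects them. The function name group_by_threshold and the distance callback are hypothetical, and single linkage is only one possible definition of a group. Note that this still evaluates every pair, so it does not reduce the number of comparisons; collapsing identical motifs first would shrink the input.

```python
from itertools import combinations


def group_by_threshold(items, distance, threshold):
    """Group items by single linkage: a pair with distance <= threshold
    is merged into one group (hypothetical helper, not part of MEME)."""
    parent = list(range(len(items)))  # union-find forest, one node per item

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: O(n^2) calls to distance().
    for i, j in combinations(range(len(items)), 2):
        if distance(items[i], items[j]) <= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())
```

With real motifs, distance would be whatever motif comparison is chosen; the toy call group_by_threshold([1, 2, 10, 11, 30], lambda a, b: abs(a - b), 1.5) separates the numbers into three groups.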
I tried to use TOMTOM (standalone) to perform an all-pairs comparison of the motifs, with my complete motif set as the database. However, TOMTOM didn't agree with me, and I got a segmentation fault. I also did a quick test with TAMO and its Euclidean distance measure, but the computation time was huge.
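For illustration, a Euclidean comparison of two motif matrices could be sketched as below. This is not TAMO's implementation, only an assumption of what such a measure might look like: the mean per-column Euclidean distance, minimised over all ungapped offsets with at least min_overlap aligned columns. The names column_distance, min_offset_distance and min_overlap are hypothetical, and reverse complements are not considered. On raw log-odds values the large negative entries would dominate the distance, so converting to probabilities first (not shown) may be more sensible.

```python
import math


def column_distance(col_a, col_b):
    """Euclidean distance between two matrix columns (one value per base)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(col_a, col_b)))


def min_offset_distance(m1, m2, min_overlap=5):
    """Smallest mean per-column distance over all ungapped offsets.

    m1 and m2 are lists of rows, one row per motif position.
    Returns math.inf if no offset gives at least min_overlap columns.
    """
    best = math.inf
    # offset is the position in m1 at which m2[0] is placed.
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        pairs = [
            (m1[i], m2[i - offset])
            for i in range(len(m1))
            if 0 <= i - offset < len(m2)
        ]
        if len(pairs) >= min_overlap:
            mean = sum(column_distance(a, b) for a, b in pairs) / len(pairs)
            best = min(best, mean)
    return best
```

Each pair costs roughly (w1 + w2) column alignments, which is one reason an all-pairs run over a large collection takes so long.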
I really like the TOMTOM way of comparing motifs, but as stated, I get a segfault when trying to run it. I suspect it has something to do with the size of my motif collection, since it works for smaller versions of it.
I'm aware that a lot of comparisons have to be made to achieve this, but before I actually start running any of my scripts, I would like to know whether anyone has done something similar in the past, so that my one-week run doesn't end up as garbage. Suggestions on alternative programs/algorithms, or on why I get a segfault from TOMTOM, are highly appreciated!
Edit: I should mention that I store the motifs as PSSMs.
Edit: An example set of my motifs, added upon request:
MOTIF 1
log-odds matrix: alength= 4 w= 15 E= 0
-1045 -1045 212 -1045
-92 131 -10 -191
-191 31 31 40
-1045 -1045 212 -1045
40 -1045 131 -191
-92 -1045 -1045 167
-1045 63 131 -191
-92 -1045 177 -191
-33 31 -69 40
-1045 -10 -1045 154
-1045 -1045 177 -33
-1045 -69 190 -1045
-1045 -1045 190 -92
-92 -1045 190 -1045
-1045 63 63 8
MOTIF 2
log-odds matrix: alength= 4 w= 20 E= 0
-923 212 -923 -923
-69 53 112 -923
-923 -923 185 -69
-923 -47 -923 163
30 -923 -923 130
-923 -923 185 -69
-923 -923 185 -69
-923 53 -923 130
-923 -923 185 -69
-923 -923 185 -69
163 -923 -923 -69
-923 112 -923 89
-923 -923 212 -923
-923 -923 53 130
-923 112 112 -923
189 -923 -923 -923
-923 -47 185 -923
-923 -923 -47 163
-69 -923 185 -923
-923 -923 212 -923
MOTIF 3
log-odds matrix: alength= 4 w= 20 E= 0
-92 190 -945 -945
-945 -945 -945 189
140 -69 -945 -92
167 -945 -69 -945
-945 31 163 -945
-92 -69 163 -945
67 -945 -945 108
-945 -945 -945 189
-945 -945 31 140
-92 -69 -69 108
-945 -69 -945 167
-945 -69 -945 167
-92 -945 -945 167
-945 163 -945 8
189 -945 -945 -945
-92 -945 131 8
-945 131 -945 67
-945 -945 -945 189
-945 -945 -945 189
-945 -69 -945 167
MOTIF 4
log-odds matrix: alength= 4 w= 15 E= 0
-965 193 -965 -111
-965 212 -965 -965
189 -965 -965 -965
-111 144 -965 -11
-965 -965 -965 189
-965 -965 12 147
-965 193 -88 -965
170 -965 -88 -965
47 -965 144 -965
-111 144 -965 -11
-965 193 -965 -111
-111 -965 -88 147
-111 -965 144 -11
-965 193 -88 -965
-965 12 170 -965
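For anyone who wants to experiment with the example above, here is a minimal sketch of a reader for exactly the layout shown: a "MOTIF <id>" line, a "log-odds matrix:" header carrying "w= <width>", then that many rows of four scores. The function name parse_log_odds_motifs is hypothetical, and the sketch assumes nothing beyond what appears in the example; a complete MEME file contains additional header sections that it does not handle.

```python
def parse_log_odds_motifs(text):
    """Parse motifs in the layout shown above into {id: rows}.

    Each motif becomes a list of rows, one per position, with one
    score per letter of the alphabet.
    """
    motifs = {}
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        if not lines[i].startswith("MOTIF"):
            i += 1
            continue
        name = lines[i].split()[1]
        # Header looks like: log-odds matrix: alength= 4 w= 15 E= 0
        header = lines[i + 1].split()
        width = int(header[header.index("w=") + 1])
        rows = [
            [float(x) for x in ln.split()]
            for ln in lines[i + 2 : i + 2 + width]
        ]
        if len(rows) != width:
            raise ValueError(f"motif {name}: expected {width} rows")
        motifs[name] = rows
        i += 2 + width
    return motifs
```

Applied to the four motifs above, this would return a dictionary with keys "1" to "4" holding 15, 20, 20 and 15 rows respectively.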
Just a guess, it might be that TOMTOM is written in C and that there's a fixed-size array somewhere with a size less than 20,000. So perhaps a little diving into the source code can help. I've had to do this in the past for another program.
I've skimmed through the source, and I found a struct for the motif database, but no limitations on its size. However, I ran TOMTOM through gdb so I could see where it actually got the segmentation fault, and it was in some call to strcpy, but that on the other hand isn't ever called in the TOMTOM source (probably called in some included file, there are many...). I will try to compile the MEME suite with the debug flag on my local machine and see if I can find the reason for the crash.
Yep, it's written in C. Thanks for the suggestion, I'll look into it and see what I can find.
Can you post up an example set of motifs?
Also, there is a web version of TOMTOM which you could use to verify whether it's an install problem.
I've added the first four motifs in my collection. The web version of TOMTOM doesn't allow you to submit your own database (as far as I have found, at least).