Creating Groups Of Similar DNA Motifs
12.6 years ago
Maehler ▴ 80

I've done a DNA motif study where I have gotten motifs for every promoter in my organism of interest, ~20 000 in total. The motif finding was done with MEME, and it was done on each promoter individually. This resulted in a large dataset of motifs where many of them are similar and some are identical.

My problem is that I would like to create groups of similar motifs, where a group is defined by a threshold on some similarity measure, so that I can do a reasonable comparison of promoter profiles. I've been able to find the number of identical motifs in my dataset, but for motifs that are not identical, the computation gets quite intensive.
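To make the idea of threshold-defined groups concrete, here is a minimal sketch of single-linkage grouping with a union-find structure. It is not taken from any of the tools mentioned in this thread; `group_by_threshold`, `mismatches` and the toy strings are made-up names and data for illustration, and the distance function is a placeholder for whatever motif comparison ends up being used.

```python
from itertools import combinations


def group_by_threshold(items, distance, threshold):
    """Single-linkage grouping: any two items whose distance is at most
    `threshold` end up in the same group (hypothetical helper)."""
    parent = list(range(len(items)))

    def find(i):
        # Walk up to the group representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-against-all comparison: n * (n - 1) / 2 distance evaluations.
    for i, j in combinations(range(len(items)), 2):
        if distance(items[i], items[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def mismatches(a, b):
    # Toy distance for the demo only: mismatches between equal-length strings.
    return sum(x != y for x, y in zip(a, b))


toy = ["GCCGG", "GCGGG", "CTAAG", "CTAAC", "TTTTT"]
groups = group_by_threshold(toy, mismatches, threshold=1)
```

With ~20 000 motifs the loop evaluates roughly 2 × 10^8 pairs, so the cost of a single comparison dominates the run time. Single linkage can also chain dissimilar motifs together through intermediates, so a stricter grouping rule may be preferable.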

I tried to use TOMTOM (standalone) to perform an all-against-all comparison of the motifs, with my complete motif set as the database. However, TOMTOM didn't agree with me, and I got a segmentation fault. I also did a quick test with TAMO and its Euclidean distance measure, but the computation time was huge.

I really like the TOMTOM way of comparing motifs, but as stated, I get a segfault when trying to run it. I suspect it has something to do with the size of my motif collection, since it works for smaller versions of it.

I'm aware that a lot of comparisons have to be made to achieve this, but before I actually start running any of my scripts, I would like to know if anyone has done something similar in the past, so that my one-week run doesn't end up as garbage. Suggestions on alternative programs/algorithms, or on why I get a segfault from TOMTOM, are highly appreciated!

Edit: I should mention that I store the motifs as PSSMs.

Edit: An example set of my motifs added upon request

MOTIF 1

log-odds matrix: alength= 4 w= 15 E= 0
 -1045  -1045    212  -1045 
   -92    131    -10   -191 
  -191     31     31     40 
 -1045  -1045    212  -1045 
    40  -1045    131   -191 
   -92  -1045  -1045    167 
 -1045     63    131   -191 
   -92  -1045    177   -191 
   -33     31    -69     40 
 -1045    -10  -1045    154 
 -1045  -1045    177    -33 
 -1045    -69    190  -1045 
 -1045  -1045    190    -92 
   -92  -1045    190  -1045 
 -1045     63     63      8 

MOTIF 2

log-odds matrix: alength= 4 w= 20 E= 0
  -923    212   -923   -923 
   -69     53    112   -923 
  -923   -923    185    -69 
  -923    -47   -923    163 
    30   -923   -923    130 
  -923   -923    185    -69 
  -923   -923    185    -69 
  -923     53   -923    130 
  -923   -923    185    -69 
  -923   -923    185    -69 
   163   -923   -923    -69 
  -923    112   -923     89 
  -923   -923    212   -923 
  -923   -923     53    130 
  -923    112    112   -923 
   189   -923   -923   -923 
  -923    -47    185   -923 
  -923   -923    -47    163 
   -69   -923    185   -923 
  -923   -923    212   -923 

MOTIF 3

log-odds matrix: alength= 4 w= 20 E= 0
   -92    190   -945   -945 
  -945   -945   -945    189 
   140    -69   -945    -92 
   167   -945    -69   -945 
  -945     31    163   -945 
   -92    -69    163   -945 
    67   -945   -945    108 
  -945   -945   -945    189 
  -945   -945     31    140 
   -92    -69    -69    108 
  -945    -69   -945    167 
  -945    -69   -945    167 
   -92   -945   -945    167 
  -945    163   -945      8 
   189   -945   -945   -945 
   -92   -945    131      8 
  -945    131   -945     67 
  -945   -945   -945    189 
  -945   -945   -945    189 
  -945    -69   -945    167 

MOTIF 4

log-odds matrix: alength= 4 w= 15 E= 0
  -965    193   -965   -111 
  -965    212   -965   -965 
   189   -965   -965   -965 
  -111    144   -965    -11 
  -965   -965   -965    189 
  -965   -965     12    147 
  -965    193    -88   -965 
   170   -965    -88   -965 
    47   -965    144   -965 
  -111    144   -965    -11 
  -965    193   -965   -111 
  -111   -965    -88    147 
  -111   -965    144    -11 
  -965    193    -88   -965 
  -965     12    170   -965
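For anyone who wants to experiment with matrices in the layout shown above, here is a minimal parsing sketch. It assumes each motif is a `MOTIF <name>` line followed by a `log-odds matrix:` header whose `w=` field gives the number of rows, and that the four columns are in A, C, G, T order (an assumption; check it against your MEME output). `parse_log_odds` is a made-up helper name, and the sample text is just the first three rows of MOTIF 1 with `w` shortened to match.

```python
import re


def parse_log_odds(text):
    """Return {motif_name: [[a, c, g, t], ...]} from MOTIF / log-odds blocks
    (hypothetical helper; column order A, C, G, T is assumed)."""
    motifs = {}
    name = None
    rows_left = 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("MOTIF"):
            name = line.split()[1]
            motifs[name] = []
        elif line.startswith("log-odds matrix:"):
            # The w= field says how many matrix rows follow.
            match = re.search(r"w=\s*(\d+)", line)
            rows_left = int(match.group(1)) if match else 0
        elif line and rows_left > 0 and name is not None:
            motifs[name].append([float(x) for x in line.split()])
            rows_left -= 1
    return motifs


# Shortened sample: first three rows of MOTIF 1, with w reduced to 3.
sample = """
MOTIF 1

log-odds matrix: alength= 4 w= 3 E= 0
 -1045  -1045    212  -1045
   -92    131    -10   -191
  -191     31     31     40
"""

motifs = parse_log_odds(sample)
```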
dna motif similarity pssm • 4.9k views

Just a guess, it might be that TOMTOM is written in C and that there's a fixed-size array somewhere with a size less than 20,000. So perhaps a little diving into the source code can help. I've had to do this in the past for another program.


I've skimmed through the source and found a struct for the motif database, but no limit on its size. However, I ran TOMTOM through gdb to see where the segmentation fault actually occurs: it is in a call to strcpy, which is never called directly in the TOMTOM source (it is probably called in one of the many included files). I will try to compile the MEME suite with the debug flag on my local machine and see if I can find the reason for the crash.


Yep, it's written in C. Thanks for the suggestion, I'll look into it and see what I can find.


Can you post an example set of motifs?

Also, there is a web version of TOMTOM which you could use to verify whether it's an install problem.


I've added the first four motifs in my collection. As far as I can tell, the web version of TOMTOM doesn't let you submit your own database.

12.6 years ago

I would calculate a consensus sequence for every motif, and then align or compare the consensus sequences directly instead of the matrices. That would save you a lot of computation time; the results may be less precise, but that should be fine as long as your objective is just to make groups.

For example, the example motifs that you posted could be translated as:

MOTIF 1 : 
GC[CG]GGT[CG]GCT....

MOTIF 2: 
C[CG]GTTGGT......

MOTIF 3: 
CTA[CG]GTTTTT.....

MOTIF 4:
CCACTTCA[AG]......

Once you have these sequences, you can just align them all against all, or use some unsupervised learning algorithm to group them together.
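A minimal sketch of how such consensus strings could be derived and compared, assuming the matrix columns are in A, C, G, T order and that a letter is kept whenever its log-odds score is above a cutoff (`min_score`, a made-up parameter). The helper names are hypothetical, and because the result depends on the cutoff, the output will not necessarily match the hand-written strings above.

```python
ALPHABET = "ACGT"  # assumed column order of the log-odds matrices


def consensus_sets(matrix, min_score=0):
    """One set of allowed letters per position: every letter scoring above
    min_score, or the single best letter if none does."""
    sets = []
    for row in matrix:
        letters = {ALPHABET[i] for i, s in enumerate(row) if s > min_score}
        if not letters:
            letters = {ALPHABET[row.index(max(row))]}
        sets.append(letters)
    return sets


def render(sets):
    """Bracket notation, e.g. [{'G'}, {'C', 'G'}] -> 'G[CG]'."""
    return "".join(
        next(iter(s)) if len(s) == 1 else "[" + "".join(sorted(s)) + "]"
        for s in sets
    )


def best_match_fraction(a, b, min_overlap=5):
    """Slide b along a without gaps and return the best fraction of
    overlapping positions whose letter sets share at least one letter."""
    best = 0.0
    for offset in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        pairs = [(a[i], b[i - offset])
                 for i in range(len(a)) if 0 <= i - offset < len(b)]
        if len(pairs) >= min_overlap:
            hits = sum(1 for x, y in pairs if x & y)
            best = max(best, hits / len(pairs))
    return best


# First ten rows of MOTIF 1 from the question.
motif1_head = [
    [-1045, -1045, 212, -1045],
    [-92, 131, -10, -191],
    [-191, 31, 31, 40],
    [-1045, -1045, 212, -1045],
    [40, -1045, 131, -191],
    [-92, -1045, -1045, 167],
    [-1045, 63, 131, -191],
    [-92, -1045, 177, -191],
    [-33, 31, -69, 40],
    [-1045, -10, -1045, 154],
]

sets1 = consensus_sets(motif1_head)
text1 = render(sets1)
```

The value returned by `best_match_fraction` could then be turned into a distance (for example one minus the fraction) and fed to whatever grouping step is used. Reverse-complement matches are not considered in this sketch.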

An alternative would be to set all the negative values to 0. For every position in each motif, only the positive scores really matter: the difference between a log-odds score of -140 and one of -2 is not very meaningful biologically, at least in my opinion. So, if you reset the negative scores to 0, the computation should be faster. It won't be much different from taking the consensus sequence.
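As a rough sketch of this second idea: clip the negative scores to zero, then compare two matrices position by position over every ungapped offset. The Euclidean column distance and the `min_overlap` parameter are assumptions chosen for illustration, not something prescribed in this thread, and the helper names are made up.

```python
import math


def clip_negative(matrix):
    """Replace every negative log-odds score with 0."""
    return [[max(0, s) for s in row] for row in matrix]


def column_distance(r1, r2):
    # Euclidean distance between two matrix rows (one motif position each).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))


def motif_distance(m1, m2, min_overlap=5):
    """Smallest mean per-position distance over all ungapped offsets that
    overlap in at least min_overlap positions (inf if none exists)."""
    best = math.inf
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        pairs = [(m1[i], m2[i - offset])
                 for i in range(len(m1)) if 0 <= i - offset < len(m2)]
        if len(pairs) >= min_overlap:
            d = sum(column_distance(x, y) for x, y in pairs) / len(pairs)
            best = min(best, d)
    return best


# First five rows of MOTIF 4 from the question.
motif4_head = [
    [-965, 193, -965, -111],
    [-965, 212, -965, -965],
    [189, -965, -965, -965],
    [-111, 144, -965, -11],
    [-965, -965, -965, 189],
]

clipped = clip_negative(motif4_head)
self_distance = motif_distance(clipped, clipped)
```

Clipping mainly removes the large negative values from the distance, so that two motifs are not pushed apart just because one matrix uses -945 and another -1045 for "essentially never". Reverse-complement comparisons are left out here.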


I guess it would be much faster than comparing the PSSMs as they are. I will try this approach, and when I have the groups I can do a small test on the PSSMs to see if the groups make sense in that aspect as well.

12.6 years ago

If you were using regular expressions, instead of PSSMs, then I would suggest comparimotif: http://bioware.ucd.ie/~testing/biowareweb/Server_pages/comparimotif.html

You could make the regular expression using slimmaker, which takes the set of instances of the motif. http://bioware.soton.ac.uk/slimmaker.html

Though I appreciate this isn't exactly what you want, I thought the information might be of some use.


You're right, it's not exactly what I'm after. But I'll definitely take it into consideration!


The SLIM maker is such a neat and generic tool, and I had never heard of it. I'd like to add it to the tool section, or perhaps you'd like to do so. A paragraph or so about it would suffice. Let me know if you want to do so.
