Question

Best practice to find regulatory motifs in a set of sequences

1

4.9 years ago

jeni ▴ 90

Hi!

I am trying to find out whether a set of sequences that I am studying contains regulatory regions. I have seen many tools for this, such as JASPAR, TRANSFAC, and MEME, as well as the option of intersecting the coordinates of my sequences with those of the motifs annotated in the Genome Browser. Are there any best practices for this kind of analysis? I do not have much experience with it.

Thanks!

regulatory motif finding • 1.8k views

ADD COMMENT • link updated 4.9 years ago by Mensur Dlakic ★ 29k • written 4.9 years ago by jeni ▴ 90

0

About how many regions are we talking? Hundreds, thousands?

ADD REPLY • link 4.9 years ago by ATpoint 88k

0

About 5 thousand.

ADD REPLY • link 4.9 years ago by jeni ▴ 90

0

Then I would simply perform a motif enrichment analysis, with either MEME or HOMER, using the whole genome as background.
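As a rough illustration of what "enrichment against a background" means, here is a minimal sketch that counts how many target and background sequences contain an exact motif string and scores the difference with a one-sided hypergeometric test. This is not how MEME or HOMER are implemented; real tools score position weight matrices and correct for sequence composition. The function names and the toy sequences are assumptions made up for the example.

```python
from math import comb


def count_with_motif(seqs, motif):
    """Count sequences containing at least one exact match to `motif`."""
    motif = motif.upper()
    return sum(motif in s.upper() for s in seqs)


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    N: total sequences (targets + background)
    K: sequences in the whole pool that contain the motif
    n: number of target sequences
    k: target sequences that contain the motif
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy data (hypothetical): 4 target and 6 background sequences.
targets = ["ACGTGACC", "TTCACGTG", "GGGGCCCC", "CACGTGAA"]
background = ["AAAAAAAA", "CCCCGGGG", "TTTTACGT",
              "GATTACAG", "CACGTGTT", "TGCATGCA"]
motif = "CACGTG"

k = count_with_motif(targets, motif)
K = k + count_with_motif(background, motif)
p_value = hypergeom_enrichment_p(k, len(targets), K,
                                 len(targets) + len(background))
print(f"{k}/{len(targets)} targets contain {motif}, p = {p_value:.3f}")
```

With only ten sequences the p-value is not informative; the point is the shape of the comparison. With thousands of regions and many candidate motifs, the p-values would also need multiple-testing correction.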

ADD REPLY • link 4.9 years ago by ATpoint 88k

0

Thanks! And what do you think about intersecting the coordinates of my sequences with the coordinates of the TFBS annotated in the Genome Browser?

ADD REPLY • link 4.9 years ago by jeni ▴ 90

1

I do not think this is meaningful. Motifs occur all over the genome simply through random nucleotide co-occurrence. This is why it is so important to use proper statistics and to control false positives (FDR). To check whether the motifs in your regions stand out from such random occurrences, you have to perform an enrichment analysis. If you run it against the genome as background, you will exclude a lot of common motifs that are found pretty much everywhere. A simple intersection will probably give you an excessive number of motifs, many of them present just by chance and without any biological function.
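To put a number on "random nucleotide co-occurrence", here is a minimal back-of-the-envelope sketch. It assumes independent, equally frequent bases and an exact match to a single non-palindromic k-mer, which real genomes and degenerate motifs do not satisfy, so treat the result as an order of magnitude only. The function name is made up for the example.

```python
def expected_chance_hits(seq_length, motif_length, both_strands=True):
    """Expected number of exact matches to one fixed k-mer in random sequence.

    Assumes each base is A, C, G or T with probability 0.25, independently.
    """
    positions = max(seq_length - motif_length + 1, 0)
    per_strand = positions * 0.25 ** motif_length
    return 2 * per_strand if both_strands else per_strand


# An 8 bp motif in a genome of about 3 billion bp, scanning both strands.
hits = expected_chance_hits(3_000_000_000, 8)
print(f"expected chance matches: {hits:,.0f}")
```

Under these assumptions an 8 bp site is expected roughly 90,000 times in a human-sized genome by chance alone, which is why a raw overlap count says little without a background comparison.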

ADD REPLY • link 4.9 years ago by ATpoint 88k

0

But, if I do the intersection with conserved TFBS coordinates then couldn't I assume those "reference" motifs exist in my sequences?

ADD REPLY • link 4.9 years ago by jeni ▴ 90

score 0 · Answer 1 · 2020-06-15

0

4.9 years ago

Mensur Dlakic ★ 29k

This sounds like you are trying to find all regulatory motifs in a whole genome at once, and judging by the size it seems to be a prokaryotic one. That can't be done, because there will not be enough statistical support for whatever is found, and it may not be tractable in terms of time. Even if I am wrong in my guess about your intentions, it would be difficult to find a regulatory motif in a set of sequences this large unless there are literally hundreds of motif occurrences. Bottom line, I think you will need to narrow down your set of sequences.

ADD COMMENT • link 4.9 years ago by Mensur Dlakic ★ 29k

0

No, it is not prokaryotic. It is a set of human sequences that I am interested in, and I would like to know whether it contains any interesting regulatory motifs.

ADD REPLY • link 4.9 years ago by jeni ▴ 90

1

Still the same advice, except that now the argument is even stronger for not doing it on such a large dataset. At least in prokaryotes one could safely assume that a motif is likely palindromic and at least 10 bp wide, which amounts to a decent signal. Eukaryotic motifs are typically shorter and non-palindromic, so the signal tends to be weaker.
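For readers unfamiliar with the term, "palindromic" here means that a site reads the same on both strands, i.e. it equals its own reverse complement. A minimal sketch of that check follows; the helper names are made up for the example and only unambiguous A/C/G/T bases are handled.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_palindromic(site):
    """True if the site equals its own reverse complement."""
    s = site.upper()
    return s == reverse_complement(s)


print(is_palindromic("GAATTC"))  # EcoRI recognition site
print(is_palindromic("TATAAA"))  # TATA-box-like hexamer
```

A palindromic site constrains both halves at once, so a motif finder effectively has fewer free positions to fit, which is part of why such motifs give a stronger signal.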

What you are trying to do is very difficult as a de novo motif-finding problem, and I would be very surprised if you found anything other than motifs for general transcription factors (TATA-box, etc). It may be a better strategy to look for co-occurrence of motifs in clusters, and this Google search may give you some ideas.
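One simple way to read "co-occurrence of motifs in clusters" is to take motif hits that have already been mapped (for example by scanning with known matrices) and look for short windows containing hits from several different motifs. The sketch below is a greedy, minimal version of that idea, not a published method; the function name, the 200 bp window, the motif labels and the positions are all assumptions made up for the example.

```python
def find_motif_clusters(hits, window=200, min_distinct=2):
    """Greedily group hits into windows holding several distinct motifs.

    hits: iterable of (position, motif_name) on one sequence.
    Returns a list of (first_pos, last_pos, sorted_motif_names).
    """
    hits = sorted(hits)
    clusters = []
    i = 0
    while i < len(hits):
        start = hits[i][0]
        # Extend j over every hit within `window` bp of the anchor hit.
        j = i
        while j < len(hits) and hits[j][0] - start <= window:
            j += 1
        names = {name for _, name in hits[i:j]}
        if len(names) >= min_distinct:
            clusters.append((start, hits[j - 1][0], sorted(names)))
            i = j  # continue after this cluster
        else:
            i += 1  # try the next hit as anchor
    return clusters


# Hypothetical hits on one sequence.
example_hits = [(100, "SP1"), (150, "NFKB"), (160, "SP1"),
                (1000, "SP1"), (1050, "SP1"),
                (5000, "AP1"), (5100, "NFKB")]
for cluster in find_motif_clusters(example_hits):
    print(cluster)
```

Whether any such cluster is meaningful would still need the same kind of background comparison as single motifs, since dense regions of chance matches will also produce clusters.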

ADD REPLY • link 4.9 years ago by Mensur Dlakic ★ 29k