Question

Appropriately Establishing FIMO background

0

2.8 years ago

gkunz ▴ 30

I am using FIMO to identify motifs present in H3K27ac ChIP-seq peaks that are differential between experimental groups. In some cases I have as few as 1 peak (~200 nt), and in other cases a few thousand sequences of variable length. I am unsure of the best practices for establishing the background model for FIMO.

My understanding is that the background should be biologically similar to the peaks I am asking about but should not contain instances of the motif of interest. I have run fasta-get-markov on the following files:

The sequences in which I am trying to identify motifs
Sequences (peaks) that are common to all my experimental groups
The entire genome

Each yields a different background model, and the FIMO results vary greatly depending on which one I use; my understanding is that this is expected. Given that this was a histone modification IP and that in some cases I am asking about only a single sequence, is one of these methods best, or is there an alternative approach I should take to generate the background? I am simply trying to make sure the motifs identified are as accurate as possible, and I am struggling to find a clear answer on which approach is the most accurate and best supported. If anyone has experience using FIMO and has a strong justification for how they set the background, I would greatly appreciate some insight!
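To make the dependence on the input set concrete, here is a minimal sketch of an order-0 (single-nucleotide) background estimate. It is not fasta-get-markov itself; the function name `order0_background`, the pooling of both strands, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def order0_background(seqs, pseudocount=0.0):
    """Estimate A/C/G/T frequencies pooled over both strands (an assumption)."""
    counts = Counter()
    for seq in seqs:
        for base in seq.upper():
            if base in COMPLEMENT:              # skip N and other ambiguity codes
                counts[base] += 1
                counts[COMPLEMENT[base]] += 1   # count the reverse strand too
    total = sum(counts.values()) + 4 * pseudocount
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {b: (counts[b] + pseudocount) / total for b in "ACGT"}

# Made-up toy sequences, only to show that the input set changes the model.
differential_peak = ["GCGCGGCCATGCGCCGGA"]
genome_like = ["ATTATAGCATTAATCGTATA"]

bg_peak = order0_background(differential_peak)
bg_genome = order0_background(genome_like)
print(bg_peak)
print(bg_genome)
```

With a single ~200 nt peak, every frequency rests on a few hundred observations, so an estimate like `bg_peak` would be much noisier than one drawn from thousands of peaks or from the genome.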

Thanks in advance for any assistance.

H3K27ac FIMO MEME-SUITE IDENTIFICATION MOTIF • 1.6k views

ADD COMMENT • link updated 2.8 years ago by Malcolm.Cook ★ 1.5k • written 2.8 years ago by gkunz ▴ 30

score 0 · Answer 1 · 2022-02-20

0

2.8 years ago

Malcolm.Cook ★ 1.5k

I see you already received a response from the MEME team to your parallel post on Google Groups: Clarification regarding best practices for FIMO background selection

I note that nothing in the answer provided there pertains directly to the differential analysis you intend to perform, for which you will probably want to reach for AME or SEA. In that case you might present the sequences from all experimental groups combined as a common control, and interrogate the sequences in each individual experimental group (do you have more than 2?) in turn.

ADD COMMENT • link 2.8 years ago by Malcolm.Cook ★ 1.5k

0

Hello,

My experiment has 10 total experimental groups. There are 45 potential comparisons to be made, and 90 directional comparisons. I have obtained differentially enriched peak sets for all of the comparisons of interest and narrowed those down such that the bed files I am looking to analyze contain the putative NFR regions that make up each of the differential peak sets. Some of the comparisons have quite a large number of differential peaks, while others have only a few. My question now becomes 'What transcription factor binding sites are located within the differential peak sets, and how might these differences in accessibility contribute to the biological differences between the groups being compared?'.
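As a quick check of those counts, a short sketch that enumerates the pairwise comparisons for 10 groups. The group labels are hypothetical placeholders.

```python
from itertools import combinations, permutations

groups = [f"group{i}" for i in range(1, 11)]   # hypothetical group labels

unordered = list(combinations(groups, 2))      # A vs B, direction ignored
directional = list(permutations(groups, 2))    # A-over-B and B-over-A counted separately

print(len(unordered), len(directional))        # 45 90
```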

From the reading I did, I thought FIMO would be the best tool to accomplish this, because it is not looking for enriched motifs in the way that HOMER or other motif-finding programs might be. Based on the IP and the data processing up to this point, I do not think looking for enriched motifs over a background generated from the genome is sensible. Instead, FIMO simply asks which motifs are present above a set statistical threshold and returns a count with location information. Maybe there is a better way to ask this question within MEME-SUITE, like AME or SEA? I will check those out!

As respectfully as possible, I have some disagreement with the background recommendation made by the MEME team. Setting the background in the way that they suggested (using the statistics of the peaks being input) drastically reduced the likelihood of any GC-rich motif being identified and greatly increased the likelihood of any AT-containing motif being called, because the sequences are somewhat GC-rich. As such, setting the background in this way eliminates the identification of well-conserved binding sites for transcription factors that we would expect to see from an IP such as this, such as CTCF. In some cases, I would also be establishing the background from as few as ~200 bp, which I think also presents an issue.
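The effect described here can be illustrated with a plain log-odds score, which is the general idea behind background-adjusted motif scoring. This is a simplified sketch, not FIMO's implementation: FIMO's p-value computation also depends on the background and is not reproduced. The 4-column matrix and both background tables are invented for illustration and are not a real CTCF motif.

```python
import math

def log_odds_score(site, pwm, background):
    """Sum of log2(motif probability / background probability) over the site."""
    return sum(
        math.log2(column[base] / background[base])
        for base, column in zip(site, pwm)
    )

# Hypothetical GC-rich motif, one dict of base probabilities per position.
pwm = [
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}   # e.g. estimated from GC-rich peaks

site = "CGCG"
score_uniform = log_odds_score(site, pwm, uniform)
score_gc_rich = log_odds_score(site, pwm, gc_rich)
print(score_uniform, score_gc_rich)
```

The same perfect match scores lower against the GC-rich background, because C and G are less surprising there; an AT-rich site would shift in the opposite direction.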

Really, I am just trying to find the most supported and justifiable way to proceed with motif identification and analysis. I would love to talk about this further and am happy to provide further detail regarding my experiment and data processing up to this point if beneficial.

Thanks

ADD REPLY • link 2.8 years ago by gkunz ▴ 30

0

To repeat and perhaps clarify, my suggestion is that you evaluate motif enrichment under the peaks from each of your 10 groups in turn, individually comparing each group's motifs to the motifs under the combined peaks of all experimental groups pooled.

If such an approach addresses the questions you seek to answer, I expect that AME or SEA might make implementing it possible. (Aside: I would be hesitant to interpret any resulting p-values as anything more than a value by which you might rank candidates).
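The group-versus-pooled comparison can be sketched as a one-sided hypergeometric test on the number of sequences carrying at least one motif hit. This is a generic enrichment test under stated assumptions, not the AME or SEA implementation; the function name and the counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n sequences are drawn from N, of which K carry the motif."""
    hi = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, hi + 1))
    return favorable / comb(N, n)

# Hypothetical counts: pooled peaks from all groups serve as the control.
N = 5000   # pooled peaks across all groups
K = 400    # pooled peaks with at least one hit for some motif
n = 120    # peaks in one experimental group
k = 25     # peaks in that group with a hit

p_value = hypergeom_upper_tail(k, n, K, N)
print(p_value)
```

Treating each group as a draw from the pool is itself an approximation, which is one more reason to use the resulting values only for ranking candidate motifs.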

Let us know your thoughts, whether you follow my suggestion, and if you do, whether it sheds light on your science.

ADD REPLY • link 2.8 years ago by Malcolm.Cook ★ 1.5k

0

So I have performed the analysis you are suggesting using HOMER, where the background was generated from peaks that were consistent across all ten experimental groups and kept the same for each group. This is fine in that it gives me quite a bit of information regarding which TFBSs are potentially enriched within my peak sets, but the results tell me little about what might be happening at the differentially enriched sites, unless I attribute the differences in TFBS enrichment solely to the peaks that are differential between the groups.

My goal was, and is, to probe the differentially enriched peaks between the groups and see exactly which TFs might be regulating expression at these sites. Additionally, I want to compare across comparisons to see whether some of my groups differ from each other in the same ways that other groups differ from one another, or in different ways.

ADD REPLY • link 2.8 years ago by gkunz ▴ 30

0

I am unclear from your description whether your strategy for selection of background comports with my suggestion.

Nevertheless it sounds like it provided an approach you feel addresses your earlier stated aim, to:

identify motifs present in H3K27ac ChIP-seq peaks that are differential between experimental groups

If so, I'd appreciate an upvote or thumbs-up.

I am unfamiliar with your experiment design and unable now to comment on your further aims.

ADD REPLY • link 2.8 years ago by Malcolm.Cook ★ 1.5k