Hi all,
I have a series of peaks stored in a .txt file (chr / start / end) and would like to know whether there are TF motifs enriched in each of the individual peaks.
For example, I am looking for an output that will eventually look something like this:
chr | start | end | tf motif
1 | 100024 | 100288 | GATA1
1 | 153313 | 155590 | RUNX1
.
.
.
Each row is a unique peak, and the TF motif is the one most significantly enriched in that peak.
I downloaded the JASPAR2022 core collection to get a set of PWMs for different TFs, which I then concatenated into a single .meme file (following this post: Finding individual motif occurrences with FIMO from the MEME suite) and have started using the FIMO command-line tool. However, I can only figure out how to query a single FASTA sequence at a time:
fimo --parse-genomic-coord /path/to/meme/combined.meme input.fa
Is there a way to do this such that I can query all 15,000 peaks at once, instead of doing them individually?
Thanks in advance.
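For what it's worth, FIMO reads every record in a multi-sequence FASTA file, and --parse-genomic-coord looks for UCSC-style coordinates (for example chr1:100-200) in each header. So one option is to put all peaks into a single FASTA file with headers of that form. Below is a minimal Python sketch of that idea. The function name peaks_to_fasta and the in-memory toy genome are made up for illustration; on real data a tool such as bedtools getfasta would normally do the extraction, and it is worth checking how your chromosome names (1 vs chr1) and 0-based vs 1-based starts are interpreted.

```python
def peaks_to_fasta(peak_lines, genome):
    """Build one multi-record FASTA string from tab-separated chr/start/end lines.

    `genome` is a hypothetical dict mapping chromosome name -> sequence string.
    Starts are treated as 0-based and ends as exclusive (BED convention),
    which is an assumption about the input file.
    """
    records = []
    for line in peak_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end = line.split("\t")[:3]
        start, end = int(start), int(end)
        seq = genome[chrom][start:end]
        # Header of the form chrom:start-end, one record per peak.
        records.append(f">{chrom}:{start}-{end}\n{seq}")
    return "\n".join(records) + "\n"


# Toy data only, not real genomic sequence.
toy_genome = {"chr1": "ACGTACGTAGATAAGGCCTT"}
toy_peaks = ["chr1\t0\t8", "chr1\t8\t16"]

fasta = peaks_to_fasta(toy_peaks, toy_genome)
print(fasta, end="")
```

The resulting file would then be passed to the same fimo command in place of input.fa, so all peaks are scanned in one run.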
Thanks - this is exactly what I was looking for! I appreciate your help.
I corrected my answer. The FIMO usage I described generates a BED file of potential TF binding sites; it does not take one as input.
Ok, this makes more sense. Thank you!
Out of curiosity, is it "better" to do it this way (i.e. scanning the entire genome for TF motifs, then intersecting the hits with my peaks) as opposed to generating a FASTA sequence for each of the 15,000 peaks individually and then scanning those for TF motifs? I guess another issue would be deciding what to use as the background model if I were to follow the latter route.
If those are the regions you're interested in, then use those regions and create the background from them.
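As a rough illustration of building a background from the regions themselves, here is a small Python sketch that computes zero-order (single-nucleotide) frequencies from a set of peak sequences. The function name zero_order_background and the toy sequences are hypothetical. In practice the MEME Suite's fasta-get-markov utility is the usual way to produce a background file for FIMO's --bfile option, and this sketch does not try to reproduce that file format.

```python
from collections import Counter


def zero_order_background(sequences):
    """Return A/C/G/T frequencies pooled over all sequences.

    Ambiguous bases such as N are ignored (an assumption of this sketch).
    """
    counts = Counter()
    for seq in sequences:
        for base in seq.upper():
            if base in "ACGT":
                counts[base] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {base: counts[base] / total for base in "ACGT"}


# Toy sequences only, standing in for the extracted peak sequences.
bg = zero_order_background(["ACGTACGT", "AGATAAGG", "NNNN"])
for base, freq in bg.items():
    print(base, freq)
```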
Generally, I would do a whole-genome FIMO scan, using the reference genome (with UCSC blacklisted regions removed, say) as the background. This takes slightly longer but creates a set of FIMO hits that I can bedmap against any number of regions, peaks, or whatever else, whenever I need to. But it may depend on what you're trying to do. If you're just doing a one-off query, then the above advice is probably fine. If you might do this again on other sets of peaks, then a whole-genome set of hits may be a useful resource.
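To make the "map hits against peaks" step concrete, here is a minimal Python sketch that assigns each peak the overlapping motif hit with the smallest p-value, which gives a table like the one in the question. Everything here is an assumption for illustration: the function name best_motif_per_peak, the tuple layout, and the toy coordinates and p-values. Note also that the lowest p-value hit is only a proxy; FIMO reports individual motif occurrences, so a true per-peak enrichment would need a separate statistical test. For 15,000 peaks and a genome-wide hit set, bedmap or a similar interval tool will be far faster than this nested loop.

```python
from collections import defaultdict


def best_motif_per_peak(peaks, hits):
    """For each peak, return the overlapping hit's motif with the lowest p-value.

    peaks: iterable of (chrom, start, end)
    hits:  iterable of (chrom, start, end, motif, pvalue)
    Intervals are treated as half-open [start, end), an assumption of this sketch.
    """
    hits_by_chrom = defaultdict(list)
    for chrom, start, end, motif, pvalue in hits:
        hits_by_chrom[chrom].append((start, end, motif, pvalue))

    result = []
    for chrom, p_start, p_end in peaks:
        best = None  # (motif, pvalue) of the best hit seen so far
        for h_start, h_end, motif, pvalue in hits_by_chrom.get(chrom, []):
            overlaps = h_start < p_end and h_end > p_start
            if overlaps and (best is None or pvalue < best[1]):
                best = (motif, pvalue)
        result.append((chrom, p_start, p_end, best[0] if best else None))
    return result


# Toy data: peak coordinates borrowed from the question, hits invented.
toy_peaks = [("1", 100024, 100288), ("1", 153313, 155590), ("2", 500, 900)]
toy_hits = [
    ("1", 100100, 100111, "GATA1", 1e-6),
    ("1", 100150, 100161, "RUNX1", 1e-4),
    ("1", 154000, 154011, "RUNX1", 2e-5),
    ("1", 200000, 200011, "GATA1", 1e-7),  # outside every peak
]

table = best_motif_per_peak(toy_peaks, toy_hits)
for row in table:
    print(" | ".join(str(field) for field in row))
```

Peaks with no overlapping hit get None in the motif column, so they stay visible in the output rather than silently disappearing.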
Got it, this makes sense. Thanks again for your advice!