Hello, I need some help understanding how HOMER motif finding works. I have ChIP-seq data for a transcription factor in which I want to look for enriched motifs. I use HOMER (findMotifsGenome.pl) to look for all possible enriched motifs in my dataset:
findMotifsGenome.pl ./target.bed mm10 ./output -size 200 -mask
The command runs fine and gives me the typical lists of enriched known and de novo motifs. From these lists I picked out a motif of interest, which is reported as both a 'known' motif and a 'de novo' motif. Out of a total of ~18000 input regions, ~3400 regions are reported as containing this motif (the actual numbers in the 'known motif' and 'de novo homer motif' results differ slightly, but not dramatically, which makes sense).
Since HOMER's default settings do not report the actual list of sequences that contain the motif of interest, I reran the above command with the additional -find option:
findMotifsGenome.pl ./target.bed mm10 ./output -size 200 -mask -find ./motifs/[motif_name].motif >>./output.txt
Note that the .motif file is from the included HOMER database, i.e. the same motif that is identified in the 'known motif' output. Again the command runs fine and extracts sequences containing the query motif. The only thing that confuses me is that it gives me a different number of sequences: ~4200 instead of ~3400.
It does not make sense to me that the number of sequences found with this motif differs. From what I understand, in the first run HOMER scans the entire input sequence set and looks for common motif patterns (and it must actually count how many sequences have each motif, because otherwise how could it have a '# Target Sequences with Motif' column in its output?). In the second run, instead of scanning for all motifs in its database, HOMER scans only for the query motif, which it should already have done as part of the first run. Am I misunderstanding something about the way HOMER works? And how should I interpret the results from each command? I eventually want to do more comparative analysis with these results, so I would like to know exactly which sequences contain the motif.
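One possibility I have not ruled out is that the -find output lists one row per motif occurrence rather than one row per region, so a region with several hits would be counted more than once. Below is a minimal sketch for checking this. It assumes the -find output is tab-delimited with a header line and the region ID in the first column. The function name, the column layout, and the example rows are my own assumptions for illustration, not something confirmed from the HOMER output here.

```python
import csv
import io


def count_hits_and_regions(handle, id_column=0, has_header=True):
    """Return (total motif hits, number of distinct region IDs).

    Assumes a tab-delimited table with one row per motif occurrence and
    the region ID in `id_column` (an assumption about the -find output).
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line if there is one
    total_hits = 0
    region_ids = set()
    for row in reader:
        if not row or len(row) <= id_column:
            continue  # skip blank or truncated lines
        total_hits += 1
        region_ids.add(row[id_column])
    return total_hits, len(region_ids)


# Made-up rows for illustration only: peak_1 carries two hits, peak_2 one.
example = (
    "PositionID\tOffset\tSequence\tMotif Name\tStrand\tMotifScore\n"
    "peak_1\t-12\tACGTACGT\texample_motif\t+\t8.1\n"
    "peak_1\t35\tACGTACGA\texample_motif\t-\t7.4\n"
    "peak_2\t4\tACGTACGT\texample_motif\t+\t8.1\n"
)
hits, regions = count_hits_and_regions(io.StringIO(example))
print(hits, regions)  # 3 hits spread over 2 distinct regions
```

To run it on the real file, the same function can be called with `open("./output.txt")` in place of the in-memory example. If the number of distinct regions comes out near ~3400 while the number of rows is ~4200, multiple hits per region would account for at least part of the difference.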
Any help would be much appreciated.
I hope you have seen this post; however, I have not found the answer there yet either.