You could do a FIMO scan with (for example) the JASPAR or UniPROBE databases, to give you "hits" or positions for motifs in those databases. (You could also do a FIMO scan against your own database of motifs of interest, assuming you bring your own position frequency matrix file.)
Once you have those positions, you can take a BED file of gene annotations and use the BEDOPS tools bedops and bedmap to look for any hits that map to a region upstream of the gene.
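The BEDOPS steps need the FIMO hits as a sorted BED file. As a rough sketch of that conversion, the snippet below assumes FIMO's tab-separated output with columns in the order motif_id, motif_alt_id, sequence_name, start, stop, strand, score (older FIMO versions lay the columns out differently, so check your own header line), that the sequence names are chromosome names, and that FIMO coordinates are 1-based and inclusive. The file names fimo_toy.tsv and motif_hits_toy.bed and the two rows of data are invented for illustration.

```shell
# Toy FIMO-style table; column order is an assumption, verify it against your own file.
printf 'motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n' > fimo_toy.tsv
printf 'MA0001.1\tAGL3\tchr1\t101\t110\t+\t12.5\t1e-5\t0.01\tCCATATATGG\n' >> fimo_toy.tsv
printf 'MA0002.1\tRUNX1\tchr1\t51\t61\t-\t10.1\t2e-5\t0.02\tTTTGTGGTTTT\n' >> fimo_toy.tsv

# Skip the header, comment lines and short lines, then emit six-column BED:
# chrom, 0-based start, end, motif ID, score, strand.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 || /^#/ || NF < 7 { next }
     { print $3, $4 - 1, $5, $1, $7, $6 }' fimo_toy.tsv \
  | LC_ALL=C sort -k1,1 -k2,2n -k3,3n > motif_hits_toy.bed

cat motif_hits_toy.bed
```

On real data you would probably pipe through BEDOPS sort-bed instead of sort; plain sort is used here only so the sketch runs without BEDOPS installed.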
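For example, here's one way to look 1000 bases upstream of stranded genes for any overlapping motif hits, returning the gene and a list of unique motif names associated with the gene's upstream region (because the padded interval still contains the gene body, hits inside the gene itself are reported too):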
$ awk '$6=="+"' genes.bed > genes.for.bed
$ awk '$6=="-"' genes.bed > genes.rev.bed
$ bedops --range -1000:0 --everything genes.for.bed \
| bedmap --echo --echo-map-id-uniq - motif_hits.bed \
| bedops --range 1000:0 --everything - \
> answer.for.bed
$ bedops --range 0:1000 --everything genes.rev.bed \
| bedmap --echo --echo-map-id-uniq - motif_hits.bed \
| bedops --range 0:-1000 --everything - \
> answer.rev.bed
$ bedops --everything answer.*.bed > answer.bed
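If only one motif matters, the combined result can be filtered for genes that carry it. The following is a minimal sketch, assuming bedmap's default output layout (the gene's BED columns, a "|" delimiter, then motif IDs separated by ";"). The file answer_toy.bed stands in for answer.bed, and its gene names, coordinates and motif IDs are invented.

```shell
# Toy stand-in for answer.bed: six BED columns, "|", then ";"-separated motif IDs.
printf 'chr1\t5000\t9000\tgeneA\t0\t+|MA0001.1;MA0002.1\n' >  answer_toy.bed
printf 'chr1\t20000\t26000\tgeneB\t0\t-|MA0002.1\n'        >> answer_toy.bed
printf 'chr2\t700\t4200\tgeneC\t0\t+|\n'                   >> answer_toy.bed

motif="MA0002.1"   # hypothetical motif of interest

# Split each line at "|", then test every motif ID for an exact match.
awk -v motif="$motif" 'BEGIN { FS = "|" }
{
    split($1, bed, "\t")          # bed[4] is the gene name
    n = split($2, ids, ";")       # n is 0 when the gene has no hits
    for (i = 1; i <= n; i++) {
        if (ids[i] == motif) { print bed[4]; break }
    }
}' answer_toy.bed > genes_with_motif.txt

total=$(wc -l < answer_toy.bed | tr -d ' ')
found=$(wc -l < genes_with_motif.txt | tr -d ' ')
echo "$found of $total genes have $motif upstream"
```

An exact string comparison is used rather than a regular expression so that an ID such as MA0002.1 does not also match MA0002.10.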
If you don't know what motifs you're expecting to find, you can extract genomic regions upstream of your stranded genes, get the FASTA sequences with bed2fasta and run that through MEME:
$ awk -v window=1000 ' \
BEGIN { OFS = "\t"; } \
{ \
if ($6 == "+") { \
start = ($2 > window) ? ($2 - window) : 0; \
print $1, start, ($2 + 1), $4, $5, $6; \
} \
else { \
print $1, ($3 - 1), ($3 + window), $4, $5, $6; \
} \
} \
' genes.bed > upstream_reg.bed
$ bed2fasta.pl upstream_reg.bed /path/to/fasta/seqs > upstream_reg.fa
$ meme upstream_reg.fa <meme_search_options...>
The MEME output can then be used with FIMO as described above to retrieve a more detailed annotation of hits in upstream regions.
Thanks a lot, this is very informative :)
I am particularly interested in just one motif: I want to look it up across my gene set and see whether it is shared. If I am not wrong, the approach above would be great for finding all the known motifs in the gene set, wouldn't it?
If you have data for your own custom motif, you could make a MEME-formatted file from that data and use FIMO to scan over the upstream regions to look for that motif.
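As a rough illustration of what such a file can look like, the sketch below writes a minimal MEME-format motif from a plain A/C/G/T consensus by turning each position into a one-hot probability row. The consensus TGACGTCA, the motif name my_motif, the uniform background and the output file name are all placeholders. A one-hot matrix is a crude stand-in for a real frequency matrix, and degenerate IUPAC codes are not handled here. The MEME Suite also ships an iupac2meme utility intended for this kind of conversion, so check the documentation for your installed version before relying on a hand-written file.

```shell
consensus="TGACGTCA"   # hypothetical consensus; replace with your own

awk -v seq="$consensus" -v name="my_motif" 'BEGIN {
    n = length(seq)
    # Header of the minimal MEME motif format, with an assumed uniform background.
    print "MEME version 4\n"
    print "ALPHABET= ACGT\n"
    print "strands: + -\n"
    print "Background letter frequencies"
    print "A 0.25 C 0.25 G 0.25 T 0.25\n"
    print "MOTIF " name
    printf "letter-probability matrix: alength= 4 w= %d nsites= 1 E= 0\n", n
    # One row per position, columns in A C G T order.
    for (i = 1; i <= n; i++) {
        b = toupper(substr(seq, i, 1))
        if (index("ACGT", b) == 0) {
            print "unsupported base: " b > "/dev/stderr"
            exit 1
        }
        printf "%.6f %.6f %.6f %.6f\n", (b == "A"), (b == "C"), (b == "G"), (b == "T")
    }
}' > my_motif.meme

cat my_motif.meme

# With the MEME Suite installed, the scan itself would then look something like:
#   fimo --oc fimo_out my_motif.meme upstream_reg.fa
```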
Hi,
I am trying to run a FIMO scan, and I know the consensus sequence for the motif. Do you know how I can get a MEME-formatted input motif file?
Thanks in advance!
Hello Alex, can you please guide me with this question: A: Prediction of TF binding sites at genome wide scale