Extracting the exact motif location from FIMO
2
0
Entering edit mode
6.6 years ago
rbronste ▴ 420

So I am using FIMO on some small datasets to look for motifs of interest and it works quite well to scan a PWM against some fasta file I chose, however the result basically outputs something like the following:

MA0041.1    Foxd3   chr1:146119062-146119793    +   24  35  2.03e-07    0.0191  GATTGTTTGTTT
MA0041.1    Foxd3   chr9:95744293-95745031  -   12  23  3.27e-07    0.0191  GAATGTTTATTT
MA0041.1    Foxd3   chr16:39712307-39712704 +   29  40  5.52e-07    0.0191  gtatgtttgtTT
MA0041.1    Foxd3   chr17:55425949-55426275 +   119 130 7.77e-07    0.0191  gtttgtttgttt
MA0041.1    Foxd3   chr17:55425949-55426275 +   123 134 7.77e-07    0.0191  gtttgtttgttt

This is great but as you can see in the last two rows it gives the same interval as having the PWM match in multiple places. What I am trying to do is take a FIMO output such as this and actually pull out intervals corresponding to the exact site of the motif within the original input intervals. I wonder how this can possibly be done? Thanks!

meme fimo motif ChIP-Seq • 2.5k views
ADD COMMENT
0
Entering edit mode

Perhaps one reason that you get a second hit over the same interval is that the DNA that the TF binds to is a repeat (gttt three times). The model might decide to identify neighboring regions as hits if they contain repeats that would still allow TF binding.

ADD REPLY
2
Entering edit mode
6.6 years ago

From the FIMO documentation:

The HTML and plain text output contain the following columns:

• The motif identifier
• The (optional) alternate identifier for the motif
• The sequence identifier
• The strand '+' indicates the motif matched the forward strand, '-' the reverse strand, and '.' indicates strand is not applicable (as for amino acid sequences).
• The start position of the motif occurrence (closed, 1-based coordinates, unless genomic coordinates are provided)
• The end position of the motif occurrence (closed, 1-based coordinates, unless genomic coordinates are provided)
...

Parse the sequence identifier (e.g., chr1:146119062-146119793) and add 24 and 35 to the start position to get a modified interval: chr1:146119086-146119097.

Here's one way to make a sorted, 0-indexed BED file of binding site "hits":

$ awk -vOFS="\t" '{ n = split($3, a, /[:-]/); print a[1],a[2]+$5-1,a[2]+$6,$1; }' fimo.out | sort-bed - > fimo.bed

Add or edit columns, as desired. You might add the p-value into the score column for thresholding, etc.

You can confirm this modified interval matches the sequence-matched-to-the-motif in the tenth column (e.g., GATTGTTTGTTT) by opening up the UCSC Genome Browser (assuming mm10 from gene name):

http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm10&&position=chr1%3A146119086-146119097

The sequence track at the top of the browser window should match what FIMO reports (and does).

The FIMO documentation notes that it is using a 1-based index, and the UCSC Genome Browser renders data with the same 1-based index. Keep this in mind if you plan using the coordinates for set operations with BEDOPS etc., so that you don't accidentally mix 1- and 0-based indexed sets.

You might also look at converting the FIMO GFF output to BED with BEDOPS and using that output to get back to binding sites. I haven't done this myself but convert2bed deals with indexing, as BED is commonly 0-indexed, half-open and GFF is 1-based. Could be worth investigating.

ADD COMMENT
1
Entering edit mode
6.6 years ago
h.mon 35k

Modifying the awk command from Alex Reynolds previous post:

cat fimo.out | awk 'BEGIN { OFS="\t"; } { n = split($3, a, /[:-]/); print a[1]":"a[2]+$5"-"a[2]+$6; }'

Assuming you have an indexed mm10 genome (and assuming mm10 because Alex assumed mm10):

samtools faidx genome.fa $(awk 'BEGIN { OFS="\t"; } { n = split($3, a, /[:-]/); print a[1]":"a[2]+$5"-"a[2]+$6; }' fimo.out)

>chr1:146119086-146119097

GATTGTTTGTTT

>chr9:95744305-95744316

AAATAAACATTC

>chr16:39712336-39712347

gtatgtttgtTT

>chr17:55426068-55426079

gtttgtttgttt

>chr17:55426072-55426083

gtttgtttgttt

ADD COMMENT

Login before adding your answer.

Traffic: 2647 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6