Scan a large genome for the ocurrence of a single sequence motif?
3
2
Entering edit mode
9.8 years ago
qwerty ▴ 50

Hello,

I want to compile a BED file with genomic coordinates of all occurrences of a particular sequence motif in the complete genome. One obvious online tool to do this task, FIMO, can not be applied because it has a limit of 1 million nucleotides. I need to scan the whole genome (~1e9 bp). Any software suggestions?

Thanks!

sequence next-gen genome motif ChIP-Seq • 4.3k views
ADD COMMENT
1
Entering edit mode

What does your motif look like? Is it a sequence you guys are working on (i.e. not in databases already) or a previously known motif?

You could try IGV http://www.broadinstitute.org/igv/motif_finder

ADD REPLY
0
Entering edit mode

It's not a known TF motif, it's a novel motif determined from ChIP-seq

ADD REPLY
2
Entering edit mode
9.8 years ago
thackl ★ 3.0k

If I understand correctly, you can download and install MEME/FIMO locally (http://meme.nbcr.net/meme/meme-download.html, http://meme.nbcr.net/meme/doc/fimo.html). That might solve the 1Mbp limit. If it doesn't you could chop up your sequence like this:

bedtools makewindow -g GEN.fa -w 1000000 > GEN.bed
bedtools getfasta -fi GEN.fa -bed GEN.bed -fo GEN-win.fa
csplit -z GEN-win.fa '/>/' '{*}'
for FILE in GEN-win.fa.*; do FIMO; done

You would have to adjust coordinates by 1Mbp * split.count for each result file

ADD COMMENT
1
Entering edit mode
9.8 years ago
Fidel ★ 2.0k

Use TRAP. What you need is the genome in fasta format and the motif matrix. It is possible to use Transfac and Jaspar or use your own motif as a position-specific scoring matrix.

You may want to take a look at the Nature Protocols paper describing the tool.

ADD COMMENT
0
Entering edit mode

TRAP has a limit of 5000 bp :(

ADD REPLY
1
Entering edit mode

It can scan whole genomes -that is the idea- and it does it very fast.

This is how I ran it using drosophila genome:

ANNOTATE_v3.04 -s dm3.fa --pssm meme.pssm -g 0.5 --ttype balanced -name 'MOTIF_1' -d > trap.txt
ADD REPLY
0
Entering edit mode

Thank you. Could you please tell me an approximate duration of such a job? As you have suggested, I am running a command line with the same parameters as you just indicated. My motif length is 27 nucleotides, and the genome is human. It took already more than 3 hours, still running, and no indications of the state of the job...

ADD REPLY
0
Entering edit mode

Fidel, So I waited for 24 hours for the program to finish, and then killed the task. Any ideas how long should it take normally?

ADD REPLY
0
Entering edit mode

24 hours is too much! The problem may lie on the --ttype balanced, to solve this replace it by -t 5.0 (I just asked my boss Thomas Manke who participated in the development of the tool). The -t (threshold) is the score used to define hits. Higher numbers result in fewer hits that are of higher confidence. How did you compute the pssm matrix?

ADD REPLY
0
Entering edit mode

The PSSM matrix was reported by MEME

ADD REPLY
0
Entering edit mode

The MEME output only include the frequency counts. You need a matrix that has log ratios summing up the probabilities of finding a particular base against a background. Try converting your meme matrix using this tool: http://rsat.ulb.ac.be/convert-matrix_form.cgi

We just tried and a local run took only 10 minutes using the human genome.

ADD REPLY
0
Entering edit mode

Hi Fidel,

Could you please tell which input and output matrix formats should be selected at RSAT?

Thanks!

ADD REPLY
0
Entering edit mode

rsat seems to be down for several days now. They have a nice manual page explaining what you want.

ADD REPLY
0
Entering edit mode
9.8 years ago
qwerty ▴ 50

So finally I ended up mapping the fixed motif by Bowtie using the "-a" option. It took just several seconds.

FIMO did not install locally after several attempts and upgrades of the Perl and Python. Some compilation errors...

TRAP did install locally, but it did not finish the job, and the format of the input matrix required for TRAP remains unclear.

ADD COMMENT

Login before adding your answer.

Traffic: 2489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6