Hi all,
I am trying to detect novel and known motifs in a set of 43 genes that are tissue specific.
What is the optimal length of upstream sequence per gene to include in the FASTA file for MEME?
Also, is n=43 enough for significance? I can increase n if needed.
Previous attempts:
1) first nucleotide is 50 positions downstream of TSS, last nucleotide is 200 positions upstream of TSS, total length 250
2) first nucleotide is 50 positions downstream of TSS, last nucleotide is 400 positions upstream of TSS, total length 450
3) first nucleotide is 50 positions downstream of TSS, last nucleotide is 600 positions upstream of TSS, total length 650
4) first nucleotide is 50 positions downstream of TSS, last nucleotide is 800 positions upstream of TSS, total length 850
5) first nucleotide is 50 positions downstream of TSS, last nucleotide is 1000 positions upstream of TSS, total length 1050
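For anyone wanting to reproduce windows like these, here is a minimal sketch of how such a region could be cut out in Python. It is not taken from any MEME tooling: the function names, the 0-based `tss` index, and the choice to count the TSS base as the first downstream base are all assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def promoter_window(chrom_seq, tss, strand, upstream, downstream):
    """Return `upstream` bases before the TSS plus `downstream` bases from it onward.

    Assumptions: `tss` is a 0-based index into `chrom_seq`, and the TSS base
    itself counts as the first downstream base. The result reads 5'->3' on
    the gene's own strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        # On the minus strand, upstream lies at higher plus-strand coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    # Clip to the chromosome ends rather than wrapping around.
    start, end = max(start, 0), min(end, len(chrom_seq))
    window = chrom_seq[start:end]
    return window if strand == "+" else reverse_complement(window)
```

With this convention, `promoter_window(chrom, tss, "+", 200, 50)` would give the 250-nucleotide region of attempt 1, provided the gene is not within 200 bases of the chromosome start.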
Each of these attempts returns either no motifs or poor results - even the short run of 250 length close to the TSS. I'd have expected to produce something from that short run, so I just don't trust that I'm doing it right.
Now assume for a moment that the optimal length is 100. To identify all motifs within 1 kb upstream of the TSS, should I actually be doing the following?
1) first nucleotide is 50 positions downstream of TSS, last nucleotide is 49 positions upstream of TSS, total length 100
2) first nucleotide is 50 positions upstream of TSS, last nucleotide is 149 positions upstream of TSS, total length 100
3) first nucleotide is 150 positions upstream of TSS, last nucleotide is 249 positions upstream of TSS, total length 100
4) first nucleotide is 250 positions upstream of TSS, last nucleotide is 349 positions upstream of TSS, total length 100
5) first nucleotide is 350 positions upstream of TSS, last nucleotide is 449 positions upstream of TSS, total length 100
and so on...
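The tiling above can be written down as a small Python sketch. The offset convention (0 is the TSS, negative is upstream, half-open intervals) and the function name are assumptions of mine, and the exact boundaries may differ by one base from the list above depending on how the TSS position itself is counted.

```python
def tile_windows(downstream=50, upstream=1000, width=100, step=100):
    """Return (start, end) offsets relative to the TSS as half-open intervals.

    Assumed convention: offset 0 is the TSS, negative offsets are upstream,
    positive offsets are downstream. Windows are listed starting at the
    downstream end and moving upstream. A step smaller than the width
    gives overlapping windows.
    """
    windows = []
    end = downstream
    # Stop once a full-width window would reach past the upstream limit.
    while end - width >= -upstream:
        windows.append((end - width, end))
        end -= step
    return windows
```

With the defaults, the first two windows are `(-50, 50)` and `(-150, -50)`, which matches the scheme in the list; passing a smaller `step` would make neighbouring windows overlap so that a motif straddling a boundary is not split.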
I am currently going through the previous meme questions and can't find an answer to this as yet apart from "Which is why the recommendation [for meme-chip] is short sequences of less than 500bp" here. Also, here: "DREME works best with lots of short (~100bp) sequences". So, I am starting to think that the shorter the better.
If someone could give me a definitive answer with a reference, that would be great, but even some experienced advice would help, because the amount of time I'm wasting on big MEME runs is just silly now.
Thanks all in advance, Kenneth.
EDIT: I don't want to confuse the scope of the question. It is only MEME I am considering, not MEME-ChIP. I am aware of what MEME-ChIP does to long sequences, but that is not my concern. My only question is: what is the optimal input length for MEME?
EDIT2: The original paper gives some clues in the section "sensitivity to noise", but I'd still like some input from those who are experienced with using MEME.
EDIT3: I have managed to find many significant motifs by using a sequence length of 100. I first create multiple input files. Each file contains 100 nucleotides from each gene, all at the same distance from the TSS, e.g. file 1 = 43 x (0-100 nucleotides upstream of TSS), file 2 = 43 x (80-160 nucleotides upstream of TSS), so all files overlap. Using this as a basis, I have identified 46 significant motifs with MEME (p<0.05). The runtime is also dramatically reduced. It appears I have my answer unless anyone knows any better.
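For readers who want to try the same approach, this is a rough sketch of how the per-window input files could be generated. It is not the script I actually used: the function name, the file naming, and the step of 80 are assumptions, and it expects the upstream regions to have been extracted already, all with the same length and orientation relative to the TSS.

```python
import os


def write_window_fastas(regions, width=100, step=80, out_dir="."):
    """Write one FASTA file per window offset and return the paths written.

    `regions` maps gene name -> sequence. All sequences are assumed to have
    equal length and the same orientation relative to the TSS, so the same
    slice covers the same TSS-relative window in every gene.
    """
    lengths = {len(seq) for seq in regions.values()}
    if len(lengths) != 1:
        raise ValueError("all regions must have the same length")
    total = lengths.pop()
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    # A step smaller than the width makes consecutive files overlap.
    for start in range(0, total - width + 1, step):
        end = start + width
        path = os.path.join(out_dir, f"window_{start}_{end}.fa")
        with open(path, "w") as handle:
            for gene, seq in regions.items():
                handle.write(f">{gene}_{start}_{end}\n{seq[start:end]}\n")
        paths.append(path)
    return paths
```

Each resulting file holds one short sequence per gene and can be given to MEME as a separate run.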
This article says the following:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4175909/
"Limitations on sequence length and number
The MEME-ChIP web server supports analysis of data sets of up to 50 Mb, but it performs some of its analyses on subsets of these data. Most notably, it performs motif discovery (using MEME and DREME) on the central 100 bp of sequences, and MEME uses only 600 sequences. Using the central 100 bp works very well with ChIP-seq and CLIP-seq data, but a different length may be preferable for other applications. The sampling of 600 sequences for MEME is necessary to limit CPU usage per MEME-ChIP job on the (free) web server. If you wish to change either of these aspects of MEME-ChIP, you can do so if you install and run MEME-ChIP on your own computer (Box 4)."
But it seems to me you have already seen it...
Thanks, but I was aware of that already. I don't want to confuse the discussion with references to MEME-ChIP. It is only MEME I am considering.