Question

PWMs and FDR

8.6 years ago

Opt ▴ 50

I'm using PWM (position weight matrix) scores to determine whether a TF binds to DNA sequences across the genome. However, I also need to correct for multiple hypothesis testing. I was thinking of Benjamini-Hochberg FDR correction, but doesn't that assume independence of the test statistics (here, the PWM scores)? And aren't scores on overlapping sequences going to be correlated?

multiple-hypothesis-testing fdr pwm • 2.5k views

updated 8.6 years ago by Santosh Anand 5.8k • written 8.6 years ago by Opt ▴ 50

What is a PWM score? Pulse-width modulation doesn't seem to make sense here. What is TD binding? When all else fails, there is Bonferroni correction, although it is overconservative.

8.6 years ago by karl.stamm 4.1k

Oops, was typing on my phone. TD is TF (transcription factor). The PWM (position weight matrix) score is a binding score computed by multiplying, across all positions of the motif, the probability that the TF's PWM assigns to the base observed at each position. Bonferroni is definitely too conservative, but I wanted to make sure I wasn't violating any assumptions of Benjamini-Hochberg either.
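As a minimal sketch of the scoring described here: the 3-position matrix `PWM` below is made up for illustration, and the uniform background of 0.25 in the log-odds variant is an assumption, not something taken from a real motif database. Sliding the motif one base at a time also shows why neighbouring scores are correlated, since adjacent windows share all but one base.

```python
import math

# Hypothetical 3-position PWM: one dict of base probabilities per motif position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def pwm_probability(site, pwm):
    """Multiply, over motif positions, the probability of the observed base."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    prob = 1.0
    for base, column in zip(site, pwm):
        prob *= column[base]
    return prob


def pwm_log_odds(site, pwm, background=0.25):
    """Sum of log2(p / background); assumes a uniform background model."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    return sum(math.log2(column[base] / background)
               for base, column in zip(site, pwm))


def scan(sequence, pwm):
    """Score every window of motif length along the sequence.

    Adjacent windows overlap in all but one base, so their scores
    are not independent.
    """
    width = len(pwm)
    return [pwm_probability(sequence[i:i + width], pwm)
            for i in range(len(sequence) - width + 1)]


if __name__ == "__main__":
    for offset, score in enumerate(scan("AAGCT", PWM)):
        print(offset, score)
```

In practice the log-odds form is usually preferred over the raw product because it avoids numerical underflow on long motifs.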

8.6 years ago by Opt ▴ 50

Thanks for the clarification.

8.6 years ago by karl.stamm 4.1k

score 0 · Answer 1 · 2016-11-23

So you're going to have a lot of scores that are indeed somewhat non-independent. Benjamini-Hochberg can work, but you don't really have any statistical hypothesis tests going on: a raw PWM score is not a p-value until you compare it against a null distribution.
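For reference, here is a rough sketch of the Benjamini-Hochberg adjustment itself, assuming you already have one p-value per candidate site. The function name and the `arbitrary_dependence` flag are my own. With the flag set, it applies the Benjamini-Yekutieli correction, which multiplies by the harmonic sum 1 + 1/2 + ... + 1/m and does not rely on independence, at the cost of being more conservative.

```python
def benjamini_hochberg(pvalues, arbitrary_dependence=False):
    """Return adjusted p-values (q-values) in the same order as the input.

    arbitrary_dependence=False: standard Benjamini-Hochberg.
    arbitrary_dependence=True:  Benjamini-Yekutieli, which scales by the
                                harmonic sum and holds under any dependence.
    """
    m = len(pvalues)
    if m == 0:
        return []
    factor = sum(1.0 / k for k in range(1, m + 1)) if arbitrary_dependence else 1.0

    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * factor / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(p))
    print(benjamini_hochberg(p, arbitrary_dependence=True))
```

Sites whose adjusted value falls below your chosen FDR level (say 0.05) would be the ones you report.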

Why not just collect a big list of scores and use the top few? It depends on how you intend to use the result. The top 5% of locations in a 4-gigabase reference is still 200 million bases.

I think what you need is a statistical control. You want to say the TF binds "well" to the location, but need to define "binding well" as compared to something. That would be the null distribution of your measure.

I would possibly take random fake TF sequences and collect their score lists for comparison. Scores above the 99th percentile of the random-sequence scores (that is, higher than the top 1% cutoff) could be considered real. This matters because you don't know at the outset whether the TF truly binds to dozens or to millions of genomic locations.
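A rough sketch of that kind of control, under one reading of the suggestion: I'm assuming here that the null comes from scoring uniformly random DNA sequences with the same scoring function (shuffled motifs would be another option). `toy_score` is a hypothetical stand-in for a real PWM scorer, and every name and default below is illustrative only.

```python
import random


def random_sequence(length, rng):
    """Uniformly random DNA string; a real control might match GC content."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def empirical_threshold(score_fn, motif_length, n_random=10000,
                        top_fraction=0.01, seed=0):
    """Score n_random random sequences and return the cutoff at the top fraction."""
    rng = random.Random(seed)
    null_scores = sorted(score_fn(random_sequence(motif_length, rng))
                         for _ in range(n_random))
    # Index of the (1 - top_fraction) quantile, clamped to the last element.
    cutoff_index = min(int((1.0 - top_fraction) * n_random), n_random - 1)
    return null_scores[cutoff_index]


def toy_score(site):
    """Hypothetical scorer: number of positions matching the consensus 'AGC'."""
    return sum(1 for a, b in zip(site, "AGC") if a == b)


if __name__ == "__main__":
    cutoff = empirical_threshold(toy_score, motif_length=3, n_random=2000)
    genome = "TTAGCAAGCGT"
    hits = [i for i in range(len(genome) - 2)
            if toy_score(genome[i:i + 3]) >= cutoff]
    print("cutoff:", cutoff, "hit offsets:", hits)
```

With a coarse score like this toy one there are many ties at the cutoff, so whether to use `>` or `>=` is a judgment call. Real PWM scores are much finer grained.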

score 0 · Answer 2 · 2016-11-23

8.6 years ago

Santosh Anand 5.8k

The MEME Suite provides a program to convert p-values to q-values: http://meme-suite.org/doc/qvalue.html
