Question

How to interpret scores from a PWM match

9.9 years ago

Affan ▴ 310

I created a PWM from the true binding sites in the Riken 4 database (hg18).
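For readers who want to see what building a PWM from aligned sites involves, here is a minimal base-R sketch. It uses only the four example sites listed later in this question, and it assumes a pseudocount of 1 and a uniform background of 0.25 per base; both are illustrative choices, not necessarily what was used here. The helper name `score_window` is hypothetical, and a package may scale its scores differently, so these values are not expected to match the scores shown below.

```r
# Four example sites copied from the question; a real run would use all 1875.
sites <- c("CTATAAATAG", "CTTTGTTTAG", "CTCTTAATAG", "GTATTTTTAA")
bases <- c("A", "C", "G", "T")

# One row per site, one column per position.
letters_mat <- do.call(rbind, strsplit(sites, ""))

# Position frequency matrix: rows are bases, columns are positions.
pfm <- vapply(seq_len(ncol(letters_mat)), function(j) {
  tabulate(match(letters_mat[, j], bases), nbins = length(bases))
}, integer(length(bases)))
rownames(pfm) <- bases

# Assumptions: pseudocount of 1 and uniform background (0.25 per base).
pseudocount <- 1
background <- 0.25
prob <- (pfm + pseudocount) / (length(sites) + 4 * pseudocount)
pwm <- log2(prob / background)

# Hypothetical helper: log-odds score of one window of the same width as the PWM.
score_window <- function(pwm, window) {
  idx <- match(strsplit(window, "")[[1]], rownames(pwm))
  sum(pwm[cbind(idx, seq_along(idx))])
}

score_window(pwm, "CTCTTAATAG")
```

With only four sites the matrix is very coarse; the point is the shape of the computation, not the numbers.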

To test my PWM, I picked a random true binding site and extracted a ~1000 bp neighbourhood around it, with the binding site as close to the center of the segment as possible (i.e., the binding site is not necessarily EXACTLY in the middle of the segment). Running my PWM on this segment gives the following results:

-----------------
True binding site: CTCTTAATAG 

Views on a 1011-letter DNAString subject
subject: AGTGCACTTGCTAAAACAAAAGGAGGCCTGAGCGGCCGCAGGGCACCGCGGCG...TAACAGATTACCAACTGTTAATTTCAAACTAATTTCTTACCCACCCACAATTA
views:
    start end width
[1]   648 659    12 [TTTATTTCAAAG]
[2]   748 759    12 [TTTGTTTAAAAA]
[3]   885 896    12 [GCTTTAAATAAA]
[4]   940 951    12 [TCAATTTTTATG]
DataFrame with 4 rows and 1 column
      score
  <numeric>
1 0.8362088
2 0.8342433
3 0.8309675
4 0.8779209

In this particular example, the true binding site is located at:

views:
    start end width
[1]   456 465    10 [CTCTTAATAG]

My first task is to measure the accuracy (sensitivity and specificity) of the PWM. To do this, I need a way to detect false positives, and I am unsure how to go about it. According to the results above, 4 sites scored higher than 40%. How do I incorporate this data into the accuracy analysis?
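One possible way to label the hits, sketched in base R with the coordinates copied from the output above: a hit counts as a true positive if it overlaps the annotated site and as a false positive otherwise. This rests on the assumption that the neighbourhood contains no other real (unannotated) binding sites, which may not hold in practice.

```r
# Hit coordinates and scores copied from the output in the question.
hits <- data.frame(
  start = c(648, 748, 885, 940),
  end   = c(659, 759, 896, 951),
  score = c(0.8362088, 0.8342433, 0.8309675, 0.8779209)
)
true_site <- c(start = 456, end = 465)

# Two closed intervals overlap when each one starts before the other ends.
hits$overlaps_true <- hits$start <= true_site[["end"]] &
                      hits$end   >= true_site[["start"]]

n_tp <- sum(hits$overlaps_true)   # hits that recover the annotated site
n_fp <- sum(!hits$overlaps_true)  # hits elsewhere in the window
site_missed <- n_tp == 0          # TRUE means the annotated site was not recovered

hits
```

Under this labelling, none of the four hits overlaps positions 456-465, so this window would contribute four false positives and one missed site.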

Background information: I have exactly 1875 true binding sites. I can repeat the above analysis for each of them (i.e., grab the neighbourhood, apply the PWM, and analyze the scores). All programming is done in R.
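A rough sketch of how the per-window results could be pooled into sensitivity and specificity. The definitions here are assumptions, not an established convention for this dataset: sensitivity is the fraction of annotated sites recovered by at least one overlapping hit, and specificity is computed over candidate start positions that do not overlap the annotated site. The function name `summarise_window` is hypothetical, and only one strand is considered.

```r
# Hypothetical helper: summarise one scanned window.
summarise_window <- function(seq_length, pwm_width, site_start, site_end, hit_starts) {
  starts <- seq_len(seq_length - pwm_width + 1)   # every possible PWM start
  overlaps_site <- starts <= site_end &
                   (starts + pwm_width - 1) >= site_start
  is_hit <- starts %in% hit_starts
  data.frame(
    site_found = any(is_hit & overlaps_site),     # annotated site recovered?
    fp         = sum(is_hit & !overlaps_site),    # hits away from the site
    negatives  = sum(!overlaps_site)              # positions that should not be hit
  )
}

# The single example from the question: 1011 bp window, 12 bp hits,
# annotated site at 456-465, hits starting at 648, 748, 885 and 940.
per_window <- summarise_window(1011, 12, 456, 465, c(648, 748, 885, 940))

# With all 1875 windows, rbind the per-window rows before this step.
sensitivity <- mean(per_window$site_found)
specificity <- 1 - sum(per_window$fp) / sum(per_window$negatives)
```

Repeating this over a range of score thresholds would give the points of an ROC curve.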

Secondary question: Do I need to account for strand information? My true binding site data looks like this:

>chr1:6585537-6585547
CTATAAATAG
>chr1:6767854-6767864
CTTTGTTTAG
>chr1:8686282-8686292
CTCTTAATAG
>chr1:10660923-10660933
GTATTTTTAA
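Since the FASTA headers above carry no strand field, one option is to also consider the reverse complement of each site (or of each scanned window). A minimal base-R sketch follows; the function name `reverse_complement` is my own, and it assumes the sequences contain only upper-case A, C, G and T.

```r
# Hypothetical helper: reverse complement of each DNA string in a character vector.
# Assumes upper-case A/C/G/T only; other letters pass through uncomplemented.
reverse_complement <- function(dna) {
  complemented <- chartr("ACGT", "TGCA", dna)
  vapply(strsplit(complemented, ""),
         function(chars) paste(rev(chars), collapse = ""),
         character(1))
}

sites <- c("CTATAAATAG", "CTTTGTTTAG", "CTCTTAATAG", "GTATTTTTAA")
data.frame(forward = sites, reverse = reverse_complement(sites))
```

Biostrings also provides `reverseComplement()`, which may be more convenient when the sequences are already DNAString objects.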

PWM tfbs • 2.5k views

updated 2.7 years ago by Ram 44k • written 9.9 years ago by Affan ▴ 310