I built a position weight matrix (PWM) from true binding sites in the Riken 4 database (hg18).
To test the PWM, I picked a random true binding site and extracted a ~1000 bp neighbourhood around it, with the center of the segment as close as possible to the binding site (i.e., the binding site is not EXACTLY in the middle of the segment). Running the PWM on this segment gives the following results:
-----------------
True binding site: CTCTTAATAG
Views on a 1011-letter DNAString subject
subject: AGTGCACTTGCTAAAACAAAAGGAGGCCTGAGCGGCCGCAGGGCACCGCGGCG...TAACAGATTACCAACTGTTAATTTCAAACTAATTTCTTACCCACCCACAATTA
views:
    start end width
[1]   648 659    12 [TTTATTTCAAAG]
[2]   748 759    12 [TTTGTTTAAAAA]
[3]   885 896    12 [GCTTTAAATAAA]
[4]   940 951    12 [TCAATTTTTATG]
DataFrame with 4 rows and 1 column
      score
  <numeric>
1 0.8362088
2 0.8342433
3 0.8309675
4 0.8779209
In this particular example, the true binding site is located at:
views:
    start end width
[1]   456 465    10 [CTCTTAATAG]
My first task is to measure the accuracy (sensitivity and specificity) of the PWM. To do this, I need a way to detect false positives, and I am unsure how to go about it. According to the results above, 4 sites scored higher than 40%, and none of them coincides with the true binding site. How do I incorporate this data into an accuracy analysis?
Background information: I have exactly 1875 true binding sites. I can repeat the above analysis for each of them (i.e., grab the neighbourhood, apply the PWM, and analyze the scores). All programming is done in R.
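To make the question concrete, here is a rough base R sketch of the bookkeeping I have in mind for one neighbourhood. It rests on assumptions I am not sure about: a hit counts as a true positive only if it overlaps the known site, every other hit is a false positive, and the negatives are all 12 bp windows on the scanned strand that do not overlap the known site. `classify_hits` is a hypothetical helper, not a Biostrings function; the coordinates are copied from the output above.

```r
# Hypothetical helper: label each PWM hit as TP or FP by overlap with the known site.
classify_hits <- function(hit_start, hit_end, true_start, true_end) {
  overlaps <- hit_start <= true_end & hit_end >= true_start
  ifelse(overlaps, "TP", "FP")
}

# Hits and true site taken from the example output above.
hits <- data.frame(
  start = c(648, 748, 885, 940),
  end   = c(659, 759, 896, 951),
  score = c(0.8362088, 0.8342433, 0.8309675, 0.8779209)
)
true_start <- 456
true_end   <- 465

hits$label <- classify_hits(hits$start, hits$end, true_start, true_end)
n_tp <- sum(hits$label == "TP")
n_fp <- sum(hits$label == "FP")

# Was the known site recovered in this neighbourhood?
site_found <- n_tp > 0

# Assumption: every 12 bp window on one strand is a candidate;
# windows overlapping the known site are positives, the rest negatives.
seq_length   <- 1011
pwm_width    <- 12
n_windows    <- seq_length - pwm_width + 1
window_start <- seq_len(n_windows)
window_end   <- window_start + pwm_width - 1
is_positive  <- window_start <= true_end & window_end >= true_start
n_negative   <- sum(!is_positive)

# Specificity for this neighbourhood: negatives that were NOT called as hits.
specificity <- (n_negative - n_fp) / n_negative
```

Under these assumptions, sensitivity over all 1875 neighbourhoods would be the mean of `site_found`, and the per-neighbourhood counts could be summed before computing specificity. Whether overlap is the right criterion for a true positive is part of what I am asking.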
Secondary question: Do I need to take strand information into account? My true binding site data looks like this:
>chr1:6585537-6585547
CTATAAATAG
>chr1:6767854-6767864
CTTTGTTTAG
>chr1:8686282-8686292
CTCTTAATAG
>chr1:10660923-10660933
GTATTTTTAA
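The headers above carry coordinates but no strand. If strand does matter, one option I am considering is to score each sequence together with its reverse complement. A minimal base R sketch of that step follows; `reverse_complement` is a hypothetical helper written for illustration (Biostrings provides `reverseComplement()` for `DNAString` objects), and it assumes the sequences contain only A, C, G and T.

```r
# Hypothetical helper: reverse complement of plain character DNA strings.
reverse_complement <- function(dna) {
  complemented <- chartr("ACGT", "TGCA", dna)          # complement each base
  vapply(
    strsplit(complemented, ""),                        # split into single bases
    function(bases) paste(rev(bases), collapse = ""),  # reverse the order
    character(1)
  )
}

# The four example sites listed above.
sites <- c("CTATAAATAG", "CTTTGTTTAG", "CTCTTAATAG", "GTATTTTTAA")
sites_rc <- reverse_complement(sites)

# Side-by-side view of each site and its opposite-strand sequence.
both_strands <- data.frame(forward = sites, reverse = sites_rc)
print(both_strands)
```

If the PWM was built from sites that are all written on the plus strand, a motif sitting on the minus strand would only match after this conversion, which is why I am unsure whether I can ignore strand.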