I've created a PWM for the MEF2 transcription factor. I've done this using R. The MEF2 transcription factor binding sites were obtained from Riken database and aligned with Clustal Omega.
Parsing and experimenting with different matrix widths, I have the following "best" result. If I use my PWM to score all the binding sites from Riken (from which the PWM was created itself) I only get that 65% of the binding sites are high scoring. So in other words, I have 1875 binding sites and only 1200 sites score above a threshold.
I was hoping to get a number above 90%. If my PWM can't detect atleast 90% of the binding sites, I am not comfortable with saying I have an accurate PWM. I tried different methods to see if I can improve this, but to no avail. I will show all my work below.
To further check, I downloaded a bonafide Mef2 PWM given by different databases. (Using the MotifDb package in R, and searching for Mef in the database). Using this PWM, I get that 97% of the true binding sites (so ~1700/1875) scored really high. This is what I was looking for.
How do I improve my accuracy? Are their tools that will accept a multiple alignment file and produce a PWM? I feel like I am getting low accuracy because of all the unaligned bases, which get marked as "-".
What about applying those matrices to random sequences, how many false-positives will you get with them? I mean, in your case PWM acts as a classifier, so there is always a precision/recall characteristic which should be taken into account. Have you tried varying the threshold, alignment parameters, etc?
I was following your original message as I have never constructed a PWM from raw sequence. It is good to hear you worked it out. I was surprised there was not a single response, nor could I find a good guide. It may be obvious to some, but I was concerned about a lack of consensus. How did you get from the sequence alignment to frequency matrix to PWM?
Hi Ian, before I start, it will be good to keep UnivStudent's answer below in mind. Now, I downloaded the TFBS from Riken 4 database (hg18). I then used Clustal Omega (and Mafft) to align these sequences. Then I just counted the number of bases in each position and used that as a frequency count. I ofcourse tried to make it optimal by experimenting with the number of columns. For this, I decided to throw away columns that had "-" (a gap) has the highest frequency. For some columns, I redistributed the "-" counts according to the distribution of ACTG in the binding sites. At first I said 0.25 for ACTG but I later I calculated that I have more AT then GC so I changed it. That gave me a boost to 60%. I am kind of rambling on here, so it's probably confusing and you've already lost interest.
I used BioConductor and R's PWM() function to create the PWM.