How can I create a more accurate PWM?
9.9 years ago
Affan ▴ 310

I've created a PWM for the MEF2 transcription factor. I've done this using R. The MEF2 transcription factor binding sites were obtained from Riken database and aligned with Clustal Omega.

After parsing the alignment and experimenting with different matrix widths, this is my "best" result: if I use my PWM to score all the binding sites from Riken (the same sites the PWM was built from), only about 65% of them are high scoring. In other words, I have 1875 binding sites and only 1200 of them score above the threshold.

I was hoping to get a number above 90%. If my PWM can't detect at least 90% of the binding sites, I am not comfortable saying I have an accurate PWM. I tried different methods to improve this, but to no avail. I will show all my work below.

As a further check, I downloaded a bona fide Mef2 PWM from existing databases (using the MotifDb package in R and searching for "Mef"). Using this PWM, I get that 97% of the true binding sites (so ~1700/1875) score really high. This is what I was looking for.

How do I improve my accuracy? Are there tools that will accept a multiple alignment file and produce a PWM? I feel like I am getting low accuracy because of all the unaligned positions, which get marked as "-".
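The check described above (what fraction of the known sites score above a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the sites are made-up toy sequences, the sites are assumed to be gap-free and of equal length, the background is uniform, and "high scoring" is taken to mean a score of at least 80% of the way from the worst to the best possible score. The function names are hypothetical and this is not how R's PWM() necessarily scales its scores.

```python
import math

BASES = "ACGT"

def log_odds_pwm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds PWM (one dict per column) from equal-length, gap-free sites."""
    background = background or {b: 0.25 for b in BASES}
    width = len(sites[0])
    pwm = []
    for i in range(width):
        # Start from a pseudocount spread according to the background.
        counts = {b: pseudocount * background[b] for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background[b]) for b in BASES})
    return pwm

def relative_score(pwm, seq):
    """Raw score rescaled to [0, 1] between the worst and best possible sequence."""
    raw = sum(col[b] for col, b in zip(pwm, seq))
    lo = sum(min(col.values()) for col in pwm)
    hi = sum(max(col.values()) for col in pwm)
    return (raw - lo) / (hi - lo)

def recall(pwm, sites, threshold=0.8):
    """Fraction of sites whose relative score reaches the threshold."""
    hits = sum(relative_score(pwm, s) >= threshold for s in sites)
    return hits / len(sites)

# Toy, made-up sites (assumption: NOT the Riken data).
sites = ["CTAAAAATAG", "CTATTTATAG", "CTAAAAATAG", "TTATTTTTAG", "CTAAATATAA"]
pwm = log_odds_pwm(sites)
print(recall(pwm, sites, threshold=0.8))
```

Note that the number you get depends heavily on how the threshold is defined, so two PWMs are only comparable if they are scored with the same scaling.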


What about applying those matrices to random sequences: how many false positives do you get with them? In your case the PWM acts as a classifier, so there is always a precision/recall trade-off that should be taken into account. Have you tried varying the threshold, alignment parameters, etc.?
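A rough sketch of that comparison is below: score the known sites and a set of random sequences at several thresholds, and look at both hit rates side by side. Everything here is an assumption for illustration: the sites are toy sequences, the random sequences are drawn uniformly over ACGT (real genomic background is not uniform), and the cutoff is expressed as a fraction of the best possible raw score.

```python
import math
import random

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.25):
    """Log2-odds PWM against a uniform background, one dict per column."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / 0.25) for b in BASES})
    return pwm

def score(pwm, seq):
    return sum(col[b] for col, b in zip(pwm, seq))

def hit_rate(pwm, seqs, cutoff):
    """Fraction of sequences scoring at or above the cutoff."""
    return sum(score(pwm, s) >= cutoff for s in seqs) / len(seqs)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy, made-up sites (assumption: NOT real data).
true_sites = ["CTAAAAATAG", "CTATTTATAG", "CTAAAAATAG", "TTATTTTTAG", "CTAAATATAA"]
pwm = build_pwm(true_sites)
random_seqs = ["".join(rng.choice(BASES) for _ in range(len(pwm))) for _ in range(2000)]

best = sum(max(col.values()) for col in pwm)
rows = []
for frac in (0.5, 0.6, 0.7, 0.8, 0.9):
    cutoff = frac * best
    rows.append((frac, hit_rate(pwm, true_sites, cutoff), hit_rate(pwm, random_seqs, cutoff)))

for frac, recall, false_positive_rate in rows:
    print(f"cutoff={frac:.1f}*max  recall={recall:.2f}  random hits={false_positive_rate:.4f}")
```

A matrix that recovers 97% of the true sites is only better if its hit rate on random sequences stays low at the same cutoff.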


I was following your original message as I have never constructed a PWM from raw sequence. It is good to hear you worked it out. I was surprised there was not a single response, nor could I find a good guide. It may be obvious to some, but I was concerned about a lack of consensus. How did you get from the sequence alignment to frequency matrix to PWM?


Hi Ian, before I start, it is worth keeping UnivStudent's answer below in mind. I downloaded the TFBS from the Riken 4 database (hg18) and used Clustal Omega (and Mafft) to align these sequences. Then I just counted the number of each base at each position and used that as a frequency count. Of course, I tried to make it optimal by experimenting with the number of columns. For this, I decided to throw away columns in which "-" (a gap) had the highest frequency. For some columns, I redistributed the "-" counts according to the distribution of ACTG in the binding sites. At first I used 0.25 for each of ACTG, but later I calculated that I have more AT than GC, so I changed it. That gave me a boost to 60%. I am kind of rambling on here, so it's probably confusing and you've already lost interest.

I used BioConductor and R's PWM() function to create the PWM.
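The counting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the exact code that was used: the alignment is a made-up toy example, the function names are hypothetical, and the rule "drop a column when gaps outnumber every base" is one reading of "columns in which '-' had the highest frequency".

```python
from collections import Counter

BASES = "ACGT"

def background_from_sites(aligned):
    """Base composition of the aligned sites, ignoring gaps."""
    counts = Counter(ch for row in aligned for ch in row if ch in BASES)
    total = sum(counts[b] for b in BASES)
    return {b: counts[b] / total for b in BASES}

def count_matrix(aligned, background):
    """Per-column base counts; gap-dominated columns are dropped and the
    remaining gap counts are redistributed according to the background."""
    matrix = []
    for column in zip(*aligned):
        counts = Counter(column)
        gaps = counts["-"]
        if gaps > max(counts[b] for b in BASES):
            continue  # gap is the most frequent symbol: discard the column
        matrix.append({b: counts[b] + gaps * background[b] for b in BASES})
    return matrix

# Toy, made-up alignment (assumption: NOT the Riken sites).
aligned = [
    "-CTAAAAATAG-",
    "-CTATTTATAG-",
    "GCTAAAAATA--",
    "--TATTTTTAGC",
    "-CTAAATATAG-",
]
bg = background_from_sites(aligned)
matrix = count_matrix(aligned, bg)
for col in matrix:
    print({b: round(v, 2) for b, v in col.items()})
```

Because the gap counts are spread using the observed composition, every retained column still sums to the number of aligned sites, which keeps the later frequency-to-PWM step consistent across columns.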

9.9 years ago
UnivStudent ▴ 440

This probably has something to do with how you are generating the PWM. Without knowing exactly what type of sequences you have (length, ChIP peaks, etc.), it's hard to give advice. However, most PWMs like this are generated by motif finders and not simply by aligning sequences. I would suggest putting your sequences into motif-finding software such as MEME or some of the motif-discovery tools in RSAT (those are just a couple of examples) and looking at the PWMs you get from those. Alternatively, you can try to optimize the PWM you got from the alignment with methods like DiMo.


Ah thanks. I just used MEME today to discover motifs, and you are right it is much different than aligning them. May I ask what exactly the difference is and if there is literature on it?

I have tweaked my PWM so that 60% of the true binding sites now score high. This still means that my PWM does not detect 40% of the binding sites, which is a problem. Maybe I will use the frequency matrix MEME gives me. Or, better yet, I can just use a bona fide PWM from the JASPAR database for my transcription factor.


With PWMs and scanning you're most likely going to have many false positives and false negatives because of the nature of in vivo TF binding (this holds true for most TFs' PWMs, with the exception of CTCF). Here is a review on some of the most common methods of motif finding. I think the main reason alignment might not work is that you're presumably using long sequences to find a small aligned region (~10 bp wide) that would be your binding site, and that doesn't always provide the correct answer. It also might not provide a biochemically relevant answer, because it wouldn't model a TF's affinity for different sequences. This is why a PWM generated from in vitro binding assays like PBMs or SELEX can model a TF's specificity more directly.


Thanks, I have read that paper, it was quite insightful. I have a couple of questions though. I only have true binding sites, i.e., I am not using long sequences to find a small region. My true binding sites are 8-15 bp long. I used the MEME suite and got this as a result: MEME Results. I used the PSPM given there with the highest score and only got an accuracy of 8%! That is, when I scored the true binding sites with it, only 8% of them scored 80% or higher. Even my own alignment did better.

So either I have done the conversion from PSPM -> PWM wrong (which I don't think I did), or MEME actually requires a list of LONG sequences so that it can discover motifs inside them. Do you have any comments or insight?
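For reference, one common way to do the PSPM -> PWM step is a log-odds transform with a small pseudocount, sketched below. This is an assumption about the conversion, not a reconstruction of what was actually run: the matrix is a made-up toy, `nsites` and `pseudocount` are hypothetical parameters, the background is uniform, and the columns are assumed to be in A, C, G, T order with one row per motif position (worth double-checking against the file you are reading).

```python
import math

BASES = "ACGT"

def pspm_to_pwm(pspm, background=None, nsites=20, pseudocount=1.0):
    """Convert per-position base probabilities into log2-odds scores.

    pspm: list of rows, one per motif position, each [pA, pC, pG, pT].
    """
    background = background or dict.fromkeys(BASES, 0.25)
    pwm = []
    for row in pspm:
        if abs(sum(row) - 1.0) > 1e-3:
            raise ValueError("each PSPM row should sum to 1")
        col = {}
        for base, p in zip(BASES, row):
            # Smooth so that zero probabilities do not give log2(0).
            smoothed = (p * nsites + pseudocount * background[base]) / (nsites + pseudocount)
            col[base] = math.log2(smoothed / background[base])
        pwm.append(col)
    return pwm

# Toy, made-up probabilities (assumption: NOT the MEME output from this thread).
pspm = [
    [0.05, 0.80, 0.05, 0.10],
    [0.00, 0.00, 0.00, 1.00],
    [1.00, 0.00, 0.00, 0.00],
    [0.50, 0.00, 0.00, 0.50],
]
pwm = pspm_to_pwm(pspm)
for col in pwm:
    print({b: round(v, 2) for b, v in col.items()})
```

If the converted matrix scores its own training sites this poorly, it is worth checking the base order, the row/column orientation, and whether the sites are being scored at the same offset and width as the motif before concluding that the tool needs longer input.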


Do you get a different or better motif when you run MEME instead of GLAM? I'm not really sure what GLAM does. Do you believe that this TF has a gapped motif for some reason? Also, the GLAM manual says "It is harder to discover gapped motifs than gapless motifs: so when you use GLAM2, we recommend that you also do a simpler gapless analysis with MEME". I don't actually think most TF binding sites would be gapped; the example they use seems to me to be more about finding motifs that span an insertion or deletion in a protein. It also seems to produce a gapped output that is not similar to a PWM.
