Do I have to take strand information into account when creating a PWM?
9.8 years ago
Affan ▴ 310

I am afraid my lack of biology background may have led me to a couple of weeks of bad results for my thesis. Here it goes:

I am trying to construct a position weight matrix (PWM) for the transcription factor MEF2 from already known transcription factor binding sites (TFBS). Here are the TFBS from FANTOM 4.

>chr1:167603901-167603911(+)
TTATTTTAAA
>chr1:167603905-167603915(-)
CTACTTTAAA
>chr1:170967144-170967154(-)
ttatttttag
>chr1:179269507-179269517(-)
ATATTTATGG
>chr1:190811495-190811505(+)
CTATATAAAG

Now, what I did was send this information to Clustal (and MAFFT) to get an alignment back. The alignment was needed to get the frequency of a base in a position. More on the alignment at the end.
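For illustration, here is a minimal sketch of that counting step, using the five sites quoted above. It assumes the sites are already the same length and in the same orientation, so they are stacked directly without running an aligner; the function names and the pseudocount of 1 are assumptions made for this example, not part of FANTOM, Clustal, or MAFFT.

```python
from collections import Counter

# The five FANTOM 4 sites quoted above, upper-cased. They are all 10 bp long,
# so they can be stacked column by column without gaps.
sites = [
    "TTATTTTAAA",
    "CTACTTTAAA",
    "TTATTTTTAG",
    "ATATTTATGG",
    "CTATATAAAG",
]

def count_matrix(seqs):
    """Return one Counter per column holding the base counts seen there."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [Counter(s[i].upper() for s in seqs) for i in range(length)]

def frequency_matrix(counts, pseudocount=1.0):
    """Turn counts into per-column frequencies (the pseudocount is an assumption)."""
    freqs = []
    for col in counts:
        total = sum(col[b] for b in "ACGT") + 4 * pseudocount
        freqs.append({b: (col[b] + pseudocount) / total for b in "ACGT"})
    return freqs

counts = count_matrix(sites)
freqs = frequency_matrix(counts)
for position, col in enumerate(counts, start=1):
    print(position, {b: col[b] for b in "ACGT"})
```

With these five sites, the last column holds 2 A and 3 G, which is the kind of per-position count the PWM is built from.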

Problem and Question:

What do the + and - represent here? Obviously when looking at

chr1:167603901-167603911(+)

on the genome browser, I correctly see the sequence: TTATTTTAAA

On the other hand, if I look on the genome browser at

chr1:179269507-179269517(-)

I don't see ATATTTATGG but clearly the reverse complement of it (duh).

The question is: which sequence does the transcription factor bind to, in terms of the multiple alignment? Can we essentially stop talking about the negative strand and convert everything to the positive strand?

Should I send this as is to a multiple sequence alignment, OR should I first convert the minus-strand sequences to their reverse complements AND THEN try to align them? The table below highlights the reverse complement.

given sequence                  reverse complement-ed sequence
>chr1:167603901-167603911(+)    >chr1:167603901-167603911(+)
TTATTTTAAA                      TTATTTTAAA
>chr1:179269507-179269517(-)    >chr1:179269507-179269517(-) #this entry
ATATTTATGG                      CCATAAATAT
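The reverse complement in the right-hand column can be reproduced with a few lines of Python. This is only a sketch for checking a sequence against what the genome browser shows; `reverse_complement` is a helper name chosen for this example, and it handles only the four unambiguous bases.

```python
# Translation table for the four unambiguous bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATATTTATGG"))  # CCATAAATAT
```

Applying the function twice returns the original sequence, which is a quick sanity check.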

Note: I was fairly certain that a multiple sequence alignment program would take this into consideration, but here is the "alignment" that MAFFT returned to me:

>chr1:167603901-167603911(+)
------ttattttaaa-------
>chr1:179269507-179269517(-)
------atatttatgg-------

Obviously it has kept the label marking the sequence as being on the (-) strand, but the aligned sequence was never converted to the reverse complement.

This means that, for example, at the last position I counted 1 A and 1 G. If it had used the reverse complement, I would have counted something else entirely based on the alignment.

I hope I have made sense here.

PWM interpretation • 3.1k views

I understand your confusion, but I think everything is fine here! The binding motif should already be given in the correct orientation.


So when I align them, I just use what's given to me? I mean clearly if I search for >chr1:179269507-179269517(-) on the genome browser, I don't get ATATTTATGG. I get the reverse complement, so doesn't that mean the TF "binds" to the reverse complement instead of ATATTTATGG?

9.8 years ago

To expand on the comment by Michael Dondrup:

The strand information is there to tell you where the sequence is located in the genome. Very few tools actually parse sequence identifiers for strand information, so the (+) or (-) in the header will not be used in later steps. That is because it is not a standardized way of representing this information.

The sequence itself is (should be) the observed sequence. You should not need to reverse complement that yourself.


Thank you. So I don't need to worry about reverse complements when sending the sequences to alignment software, correct?

I have another question. After scanning chr3 with my PWM, I get thousands of "hits", i.e. potential binding sites. However, most of them will be false positives unless the hit is in or near the location of a known binding site.

So right now what I am thinking is:

  1. Scan chr3 (+) with the PWM, store the hits
  2. Go through each hit, and see if the hit's location overlaps a TRUE binding site on the PLUS strand.
  3. Repeat this: scan chr3 (-) with the PWM.
  4. Go through each hit and see if the hit's location overlaps a TRUE binding site on the NEGATIVE strand.

So the question is, is it necessary to scan both the +, - strands?

Example: I scanned both the positive and negative strands; here is part of my result:

   cstart  cend   score      strand  HYP   ISTRUE
1  35926   35935  0.9561291  +       TRUE  FALSE
2  35926   35935  0.9820160  -       TRUE  FALSE

As you can see, the start/end coordinates are the same in the positive and negative strand. How do I interpret this?


IMO, if you built your matrix on the actual binding sequences, you would need to scan both directions. If you get hits in both directions, it just means that the site scores high for both forward and reverse. That observation on its own does not mean anything is wrong; the motif might be palindromic of some sort.
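As a rough sketch of what scanning both directions can look like: each window is scored as read on the plus strand and again as its reverse complement, and both kinds of hit are reported at the same plus-strand coordinates. Everything here is an assumption for illustration (the log-odds scoring with a uniform background, the pseudocount, the threshold of 8.0, the toy sequence, and all function names); it is not the scoring used to produce the table in the question.

```python
import math

def reverse_complement(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix from equal-length, same-orientation sites."""
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds for one window."""
    return sum(col[base] for col, base in zip(pwm, window))

def scan_both_strands(pwm, seq, threshold):
    """Return (start, end, strand, score) tuples.

    Coordinates are 0-based, end-exclusive, and always on the plus strand.
    """
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        candidates = (("+", window), ("-", reverse_complement(window)))
        for strand, oriented in candidates:
            s = score(pwm, oriented)
            if s >= threshold:
                hits.append((start, start + width, strand, s))
    return hits

sites = ["TTATTTTAAA", "CTACTTTAAA", "TTATTTTTAG", "ATATTTATGG", "CTATATAAAG"]
pwm = log_odds_pwm(sites)

# Toy sequence: one site as read on the plus strand, one planted on the minus strand.
seq = "GGGG" + "TTATTTTAAA" + "CCCC" + reverse_complement("CTACTTTAAA") + "GGGG"
hits = scan_both_strands(pwm, seq, threshold=8.0)
for start, end, strand, s in hits:
    print(start, end, strand, round(s, 3))
```

A window whose sequence scores well in both orientations would show up twice with identical start and end but different strands, which is the pattern in the result table above.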


I did build my matrix on the original binding sequences. I sent them to Clustal Omega/MAFFT for alignment and then counted the frequency of each base.

So my workflow is correct then?

  1. I download the FANTOM 4 TFBS, which carry strand information that is more or less irrelevant here
  2. I send these binding sites to get an alignment (which, again, I am confused about, because it's going to align sequences from both the positive and negative strands?)
  3. I build my PWM.
  4. I scan my genome/chromosome
  5. If the hit is on the positive strand, I can check for overlap with the true binding site locations on the positive strand
  6. If the hit is on the negative strand, I can still just use the locations of the true binding sites on the negative strand.

Thank you so much for your help.


Alternative: would it be better to first create a PWM based ONLY on alignments from the positive strand, and then create another PWM based ONLY on alignments from the negative strand? I would have to run everything twice, once for the positive strand and once for the negative.


No, don't mix strands; that won't be correct. What you end up with is a matrix that is a mixture of valid and invalid motifs.

The only trick with strands is that, regardless of which strand you get a hit on, you always operate with the coordinates that correspond to the positive strand.

A hit on either the positive or the negative strand is a good candidate, and you take its position as the candidate site. Imagine a protein binding: it does not care which strand the motif is on, and it occupies the same position regardless of whether it bound to the forward or the reverse strand.
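To make the coordinate point concrete, here is a small sketch. It assumes 0-based, end-exclusive intervals, which may differ from the convention used in the FANTOM 4 headers, and all function names are made up for this example. If a hit was found by scanning the reverse-complemented chromosome, its coordinates are first mapped back to the plus strand; after that, the overlap check against the known sites ignores strand entirely.

```python
def minus_to_plus(start, end, seq_len):
    """Map an interval found on the reverse-complemented sequence
    back to plus-strand coordinates (0-based, end-exclusive)."""
    return seq_len - end, seq_len - start

def overlaps(a_start, a_end, b_start, b_end):
    """True if two end-exclusive intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def label_hits(hits, true_sites):
    """Pair each (start, end) hit with True/False for overlap with any true site.

    Both lists must already be in plus-strand coordinates.
    """
    return [
        (hit, any(overlaps(hit[0], hit[1], t[0], t[1]) for t in true_sites))
        for hit in hits
    ]

# A hit at [4, 14) on the reverse complement of a 32 bp sequence
# sits at [18, 28) on the plus strand.
plus_hit = minus_to_plus(4, 14, 32)
labelled = label_hits([plus_hit, (0, 10)], true_sites=[(20, 30)])
print(plus_hit, labelled)
```

If the scanner already reports plus-strand coordinates for both strands, the conversion step is not needed and only the overlap check applies.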
