I am afraid my lack of biology background may have led me to a couple of weeks of bad results for my thesis. Here goes:
I am trying to construct a position weight matrix for the binding sites (TFBS) of the transcription factor MEF2, based on already known positions. Here are the TFBS from FANTOM 4.
>chr1:167603901-167603911(+)
TTATTTTAAA
>chr1:167603905-167603915(-)
CTACTTTAAA
>chr1:170967144-170967154(-)
ttatttttag
>chr1:179269507-179269517(-)
ATATTTATGG
>chr1:190811495-190811505(+)
CTATATAAAG
I sent these sequences to Clustal (and MAFFT) to get an alignment back. I need the alignment to count the frequency of each base at each position. More on the alignment at the end.
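For reference, the counting step can be sketched in Python roughly as follows. This is only a sketch under assumptions: the five sites above are all 10 bp long, so they are treated as already aligned without gaps; lowercase bases are uppercased; and the names `sites` and `count_matrix` are made up for illustration. The (-) entries are used exactly as listed, since whether they should be reverse-complemented first is the question asked below.

```python
from collections import Counter

# The five FANTOM 4 sites listed above, used exactly as given.
sites = [
    "TTATTTTAAA",
    "CTACTTTAAA",
    "ttatttttag",
    "ATATTTATGG",
    "CTATATAAAG",
]

def count_matrix(seqs):
    """Return per-position base counts for equal-length, ungapped sequences."""
    seqs = [s.upper() for s in seqs]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    matrix = {base: [0] * length for base in "ACGT"}
    for pos in range(length):
        # Count the bases seen in this column across all sequences.
        column = Counter(s[pos] for s in seqs)
        for base in "ACGT":
            matrix[base][pos] = column[base]
    return matrix

counts = count_matrix(sites)
for base in "ACGT":
    print(base, counts[base])
```

Each row holds the counts for one base across the 10 positions; dividing by the number of sites would give frequencies.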
Problem and Question:
What do the + and - represent here? Obviously, when looking at
chr1:167603901-167603911(+)
on the genome browser, I correctly see the sequence: TTATTTTAAA
On the other hand, if I look on the genome browser at
chr1:179269507-179269517(-)
I don't see ATATTTATGG but clearly the reverse complement of it (duh).
The question is: which sequence does the transcription factor bind to, as far as the multiple alignment is concerned? Can we essentially stop talking about the negative strand and convert everything to the positive strand?
Should I send the sequences as they are to a multiple sequence alignment, OR should I first convert the minus-strand entries to their reverse complement AND THEN try to align them? The table below highlights the reverse complement.
given sequence                   reverse-complemented sequence
>chr1:167603901-167603911(+)     >chr1:167603901-167603911(+)
TTATTTTAAA                       TTATTTTAAA
>chr1:179269507-179269517(-)     >chr1:179269507-179269517(-)   #this entry
ATATTTATGG                       CCATAAATAT
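The conversion in the right-hand column can be sketched in Python as below. The helper name `as_seen_on_plus_strand` is hypothetical, and it assumes that the strand is encoded as a trailing `(+)` or `(-)` in the FASTA header, as in the entries above. It only performs the conversion; it does not settle whether the conversion should be applied before aligning.

```python
# Complement table covering upper- and lowercase bases, plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]

def as_seen_on_plus_strand(header, seq):
    """Hypothetical helper: reverse-complement entries whose header ends in (-)."""
    if header.rstrip().endswith("(-)"):
        return reverse_complement(seq)
    return seq

print(as_seen_on_plus_strand(">chr1:167603901-167603911(+)", "TTATTTTAAA"))
print(as_seen_on_plus_strand(">chr1:179269507-179269517(-)", "ATATTTATGG"))
```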
Note: I was fairly certain that a multiple sequence alignment program would take the strand into account, but here is the "alignment" that MAFFT returned to me:
>chr1:167603901-167603911(+)
------ttattttaaa-------
>chr1:179269507-179269517(-)
------atatttatgg-------
It has kept the (-) strand label in the header, but the aligned sequence was never converted to the reverse complement.
This means that, for example, at the last position I counted 1 A and 1 G. Had it used the reverse complement, I would have counted something else entirely based on the alignment.
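To make the discrepancy concrete, here is a small Python sketch that counts the bases in the last column for the two sequences, once as given and once with the (-) entry replaced by the reverse complement from the table above. The variable and function names are illustrative only.

```python
from collections import Counter

# The two sites exactly as listed in the FASTA entries.
as_given = ["TTATTTTAAA", "ATATTTATGG"]
# The same two sites, with the (-) entry replaced by its reverse complement.
minus_converted = ["TTATTTTAAA", "CCATAAATAT"]

def column_counts(seqs, pos):
    """Count the bases found at one position across all sequences."""
    return Counter(s[pos] for s in seqs)

last_as_given = column_counts(as_given, -1)
last_converted = column_counts(minus_converted, -1)
print(dict(last_as_given))
print(dict(last_converted))
```

The first count reproduces the 1 A and 1 G mentioned above, while the second gives 1 A and 1 T, so the choice of orientation changes the matrix.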
I hope I have made sense here.
I understand your confusion, but I think everything is fine here! The binding motif should already be given in the correct orientation.
So when I align them, I just use what's given to me? I mean, clearly, if I search for
>chr1:179269507-179269517(-)
on the genome browser, I don't get ATATTTATGG. I get the reverse complement, so doesn't that mean the TF "binds" to the reverse complement instead of ATATTTATGG?