Question

PWM scores oscillating at highly regular intervals after ATG start codon

0

Entering edit mode

8.0 years ago

ahtmatrix • 0

I have an R script that takes in a fasta containing all coding sequences in human RNA. I then extracted 9 bases before the start codon of each CDS and the 1 base after the start codon. Here is an example sequence: "ATCGTAGCT ATG A".

I then used Biostrings in R to construct a PWM as shown below.

#make a pwm of length 13
human = readDNAStringSet('extracted_rna.fasta)
human.kozak = DNAStringSet(human)
human.pwm    = PWM(human.kozak, type = 'log2probratio')

I then used the human mRNA PWM to score the first 99 positions of a fasta of viral coding sequences below.

subject.fasta = readDNAStringSet('viral.fasta')

pwm.score.dataframe = NULL

#loop through length of subject DNAStringSet which should be num of sequences in fasta
for (i in 1:length(subject.fasta)) {

  #using the human pwm score using a sliding window of 13 as defined by the PWM until position 99 in subject seq
  scores <- PWMscoreStartingAt(human.pwm, subject.fasta[[i]], starting.at = 1:99)

  pwm.score.dataframe = cbind(pwm.score.dataframe, scores)
}

write.table(pwm.score.dataframe, file = "humanPWM_viral.csv")

When plotted in Microsoft Excel, I averaged the score of each position and graphed the position vs score below. Regardless of the subset of viral sequences I used, after the ATG spike the PWM scores oscillate with a wavelength of 3 positions. What is causing this behavior? You'd expect that after the the ATG start codon there would be enough random noise to cancel out fluctuations in the PWM score.

enter image description here

R biostrings biopython pwm position weight matrix • 2.1k views

ADD COMMENT • link 8.0 years ago by ahtmatrix • 0

0

Entering edit mode

The result looks really strange to me. What were your reasons to do this analysis and what were you expecting to see?

ADD REPLY • link 8.0 years ago by Santosh Anand 5.8k

0

Entering edit mode

You say:

I then used the human mRNA PWM

But in your code I see:

#using the human pwm score using a sliding window of 13 as defined by the PWM until position 99 in subject seq
  scores <- PWMscoreStartingAt(virus.pwm, subject.fasta[[i]], starting.at = 1:99)

It seems instead of using human.pwm you are using virus.pwm. Could that be the source of your problems?