I have an R script that takes in a fasta containing all coding sequences in human RNA. I then extracted 9 bases before the start codon of each CDS and the 1 base after the start codon. Here is an example sequence: "ATCGTAGCT ATG A".
I then used Biostrings in R to construct a PWM as shown below.
#make a pwm of length 13
human = readDNAStringSet('extracted_rna.fasta)
human.kozak = DNAStringSet(human)
human.pwm = PWM(human.kozak, type = 'log2probratio')
I then used the human mRNA PWM to score the first 99 positions of a fasta of viral coding sequences below.
subject.fasta = readDNAStringSet('viral.fasta')
pwm.score.dataframe = NULL
#loop through length of subject DNAStringSet which should be num of sequences in fasta
for (i in 1:length(subject.fasta)) {
#using the human pwm score using a sliding window of 13 as defined by the PWM until position 99 in subject seq
scores <- PWMscoreStartingAt(human.pwm, subject.fasta[[i]], starting.at = 1:99)
pwm.score.dataframe = cbind(pwm.score.dataframe, scores)
}
write.table(pwm.score.dataframe, file = "humanPWM_viral.csv")
When plotted in Microsoft Excel, I averaged the score of each position and graphed the position vs score below. Regardless of the subset of viral sequences I used, after the ATG spike the PWM scores oscillate with a wavelength of 3 positions. What is causing this behavior? You'd expect that after the the ATG start codon there would be enough random noise to cancel out fluctuations in the PWM score.
The result looks really strange to me. What were your reasons to do this analysis and what were you expecting to see?
You say:
But in your code I see:
It seems instead of using
human.pwm
you are usingvirus.pwm
. Could that be the source of your problems?Ah thats a typo I reran it with the corrections and it gives the same result