Question

Normalizing the count of a given motif by sequence length

8.8 years ago

kevinm ▴ 40

Hi everyone! I am a newbie at data processing and...

I am working on a data set of sequences (FASTA format) in which I found a motif by ab initio alignment. I now have a way to count the motif occurrences in each sequence of my FASTA file. I would like to know how to normalize the motif count per sequence by the length of the sequence because, correct me if I'm wrong, a motif is more likely to be found in a longer sequence.

For example, I am using a 4 nt motif (the binding motif of an RNA-binding protein), and I can easily see that longer sequences contain more motif occurrences than shorter ones... Can someone help me with this?

For reference, this is how I count the motif occurrences per sequence:

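library(Biostrings)
library(seqinr)

# Read each FASTA record as a single string
fasta <- read.fasta("X.fasta", as.string = TRUE)
pattern <- "tcaa"  # for example
# Dictionary of exact-match patterns (no mismatches allowed)
dict <- PDict(pattern, max.mismatch = 0)
seqs <- DNAStringSet(unlist(fasta))
# Count motif occurrences in every sequence
result <- vcountPDict(dict, seqs)
result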

It returns a matrix with one column per sequence, holding the number of motif occurrences found in that sequence.
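One simple way to account for length, sketched here in base R, is to divide each count by the number of positions where the motif could start, which is L - k + 1 for a sequence of length L and a motif of width k. The vectors `motif_counts` and `seq_lengths` and their values are hypothetical toy data, not taken from the file above; with the Biostrings objects, the lengths could come from `width()` applied to the `DNAStringSet`. This is only one possible normalization, not necessarily the best one for every data set.

```r
# Toy inputs (hypothetical values): one motif count and one length per sequence.
motif_counts <- c(seqA = 3, seqB = 12, seqC = 0)
seq_lengths  <- c(seqA = 200, seqB = 1000, seqC = 50)
motif_width  <- 4  # width of the motif, e.g. nchar("tcaa")

# A motif of width k can start at L - k + 1 positions in a sequence of length L.
n_positions <- pmax(seq_lengths - motif_width + 1, 0)

# Occurrences per possible start position; NA if the sequence is shorter than the motif.
density <- ifelse(n_positions > 0, motif_counts / n_positions, NA_real_)

# The same quantity rescaled to occurrences per 1000 positions, which is easier to read.
per_kb <- 1000 * density

print(round(per_kb, 2))
```

Dividing by L - k + 1 rather than by L makes little difference for long sequences, but it avoids overstating the number of opportunities in sequences that are only a few times longer than the motif.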

Thanks

rna-seq R sequence RNA-Seq • 1.7k views

updated 8.8 years ago by colindaven 7.7k • written 8.8 years ago by kevinm ▴ 40

score 1 · Answer 1 · 2016-09-21

8.8 years ago

colindaven 7.7k

You might want to look at zero-order Markov models, for example here:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009841
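As a rough illustration of the idea (a sketch under stated assumptions, not the method of the linked paper): a zero-order Markov model treats every base as drawn independently from the base frequencies of the sequence. The probability that the motif starts at a given position is then the product of the frequencies of its bases, and the expected count is that probability times the L - k + 1 possible start positions. The function names below are hypothetical, the code is base R only, and it assumes the sequence and the motif contain only a, c, g, t.

```r
# Expected count of `motif` in `sequence` under a zero-order (independent bases) model.
# Assumes both strings contain only a, c, g, t.
expected_motif_count <- function(sequence, motif) {
  bases <- strsplit(tolower(sequence), "")[[1]]
  motif_bases <- strsplit(tolower(motif), "")[[1]]
  # Base frequencies estimated from the sequence itself.
  freqs <- vapply(c(a = "a", c = "c", g = "g", t = "t"),
                  function(b) mean(bases == b), numeric(1))
  # Probability that the motif starts at any given position.
  p_motif <- prod(freqs[motif_bases])
  n_positions <- max(length(bases) - length(motif_bases) + 1, 0)
  n_positions * p_motif
}

# Observed count, including overlapping occurrences (lookahead regex).
observed_motif_count <- function(sequence, motif) {
  hits <- gregexpr(paste0("(?=", tolower(motif), ")"),
                   tolower(sequence), perl = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Toy sequence (hypothetical data).
toy_seq <- "tcaatcaaggtcaa"
observed <- observed_motif_count(toy_seq, "tcaa")
expected <- expected_motif_count(toy_seq, "tcaa")

# A ratio above 1 suggests more occurrences than length and base composition alone predict.
enrichment <- observed / expected
print(c(observed = observed, expected = expected, enrichment = enrichment))
```

Comparing observed to expected counts corrects for both sequence length and base composition, so a long or A/T-rich sequence is not flagged merely because it offers more chances for the motif to occur.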
