Hi everyone! I am new to data analysis.
I am working with a set of sequences in FASTA format, in which I found a motif by ab initio alignment. I now have a way to count the occurrences of the motif in each sequence of my FASTA file. Does anyone know how to normalize the motif count of each sequence by the length of that sequence? Correct me if I'm wrong, but a longer sequence has more positions where the motif can occur, so it is more likely to contain the motif by chance.
For example, I am using a 4 nt motif (the binding motif of an RNA-binding protein), and I can easily see that longer sequences contain more motifs than shorter ones. Can someone help me with this?
For reference, this is how I count the motifs per sequence:
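library(Biostrings)
library(seqinr)
# Read every FASTA record as a single character string
fasta <- read.fasta("X.fasta", as.string=TRUE)
pattern <- "tcaa" # for example
# Pattern dictionary for exact matching (no mismatches allowed)
dict <- PDict(pattern, max.mismatch=0)
seq <- DNAStringSet(unlist(fasta))
# Count the occurrences of the pattern in every sequence
result <- vcountPDict(dict, seq)
result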
It returns a matrix with one column per sequence (n columns for n sequences) and one row per pattern, so here a single row holding the motif count of each sequence.
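To make the question concrete, here is a minimal sketch of one possible length normalization, not necessarily the right one for this kind of data. It uses two made-up toy sequences instead of X.fasta, and the names `seqs`, `counts`, `per_kb`, `windows` and `expected` are only illustrative. It divides each count by the sequence length (motifs per 1000 nt) and also computes the count expected by chance, under the strong assumption that the four bases are equally frequent and independent.

```r
library(Biostrings)

# Toy sequences (made up for illustration), standing in for the FASTA file
seqs <- DNAStringSet(c(short = "TCAAGG", long = "TCAAGGTCAAGGTCAACC"))
pattern <- "TCAA"
k <- nchar(pattern)

# Exact-match counts: 1 row (the pattern) x n columns (the sequences)
counts <- as.vector(vcountPDict(PDict(pattern, max.mismatch = 0), seqs))
names(counts) <- names(seqs)

len <- width(seqs)               # sequence lengths in nt

# Count scaled by length: motifs per 1000 nt
per_kb <- counts / len * 1000

# Number of start positions where a k-mer fits in each sequence
windows <- pmax(len - k + 1, 0)

# Count expected by chance, ASSUMING equal and independent base frequencies
expected <- windows / 4^k

data.frame(length = len, count = counts, per_kb = per_kb, expected = expected)
```

In this toy case the long sequence has three times the count of the short one, but both have the same density per 1000 nt, which is the kind of comparison I am after.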
Thanks