Question

How compute the average distance between two distinct motifs insidse a list of DNA sequence?

2

Entering edit mode

8.2 years ago

kevinm ▴ 40

I've got a new issue...

I would like to compute the average distance between two distinct motifs in the same sequences list than before. Have you some clue on how manage this ???

I just come to do it for one motif like that :

source("motifOccurrence.R")

motif <- c("T", "C", "A", "A")

 motidist <- sapply(df, FUN=function(df, motif) {

      computeDistance(coordMotif(df, motif))

}, motif = motif)

This R code give me the average distance between a define motif inside all the sequences of my list. And I would like to do the same but with two motif... Can someone help me ?

To extend informations about what i wanted to do :

i worked with a fasta file in input :

'> 1

GACTCTACTATAAACGGGAGATAGCAATCTAACGCAGTGCTTCAACTCCTCCTCCATCTGAACACCCTTCAACCTTTGATACTCAGACGTTTTAGGTCGG

'> 2

ACCACCCCTTTGTCCAGAAATAGGACTCTTGGGCCTGTTGCCTGAATAAAGTCCAACCACCACAACCACTACACTACCATATGTAAGCTTCACTGATGGT

'> 3

CACCACAAGTGCGCGCCACGACGTGCATAGCCTCTAGATCGGCAACTCAGGCGAGAAGTGTTTTATTTCGGTGTGGCCGGTCCTGGGCATTTTACGGAAA

'> 4

GTTAGTGTACAAGTCCGAATAGAGTCACGAAAGACCCACACAACCACGTAATGACCTCGCTGTAATGAGATCAGTTGGCTCATGAAGGAAGAACGTAATG

'> 5

TGAGCGTTCGCCAATAACCATCCCTCTCGTTCCTTGTAACTGTACTATGATAGCGGGCGCCCCCCTAATTAAATAGCGGACGCCCTGACCTATTGTATGA

'> 6

TGATATATCTACTCGATAAGGATATAGAGGTCTAATTGTTGAGAAGTGTACCACCTTAGAGCACGAGTTTAGGATACTTAGTAGGTTCTTGCGAAGGATA

etcetc

Then the motidist object look like this :

                     1                               2

                     152                           94 

                     3                               4 

                      36                          138 

                     5                               6

                      92                          113

And the distances given by the function stand for one motif, and now, i would like to do the same but for the disatnce between two motifs like this :

atcgacatagacgactgatcgtcag MOTIF1 acggtagacagt MOTIF2 agcagatgacta # And this for all sequences in the file !

Thanks by advance

R sequence motif average distance • 3.8k views

ADD COMMENT • link updated 8.2 years ago by ddiez ★ 2.0k • written 8.2 years ago by kevinm ▴ 40

2

Entering edit mode

Can't you just loop over the sequence, record the index of the motif if found and subtract that when you find the next one? Additional question: what should be done if you see:

1) atcgacatagacgactgatcgtcag MOTIF1 acggtagacagt MOTIF2 agcagatgacta MOTIF2 acgtcgtagctgatgctcggct (twice motif 2 after each other)

2) atcgacatagacgactgatcgtcag MOTIF1 acggtagacagt MOTIF2 agcagatgactaacgtgtgtgtg MOTIF1 acgtcgtagctgatgctcggct (motif2 sandwhiched by motif1 with different lengths)

3) atcgacatagacgactgatcgtcag MOTIF1 acggtagacagtagcagatgactaacgtcgtagctgatgctcggct (just a motif1 without motif2)

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

0

Entering edit mode

What would you mean by loop over the sequence ?

And for the additional questions :

1) I would like to know only the distance between the closer motifs motif 1 and first motif 2.

2) In this case the two informations are interesting and in a first time the average distance will be a sufficient info.

3) If there is no motif 2, return me 0.

ADD REPLY • link 8.2 years ago by kevinm ▴ 40

1

Entering edit mode

You should have a look at the re python module, regular expression, and just get the index of the position(s) in which a motif is found. This should be pretty easy.

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

1

Entering edit mode

Can you detailled your answer with a running example

ADD REPLY • link 8.2 years ago by Macherki M E ▴ 120

1

Entering edit mode

I've edited my post with more details.

ADD REPLY • link 8.2 years ago by kevinm ▴ 40

1

Entering edit mode

Do you need it in R or any other language would work?

ADD REPLY • link 8.2 years ago by lakhujanivijay 5.9k

0

Entering edit mode

R would be the easier way for me, but i can handle with python or perl if you have an idea in those languages !

ADD REPLY • link 8.2 years ago by kevinm ▴ 40

1

Entering edit mode

What does source("motifOccurrence.R") does? What does df looks like? A minimum working example would help.

ADD REPLY • link 8.2 years ago by ddiez ★ 2.0k

0

Entering edit mode

For source("motifOccurrence.R") ==> https://www.r-bloggers.com/calculate-the-average-distance-between-a-given-dna-motif-within-dna-sequences-in-r/

And df is just DNAStringSet instance looking like this :

width seq                                               names               

[1]  3144 ATGGTCGAAAATAGTACGCAGAA...TGTTTGGCAATGACTTTGCTTAA NC_001133.9:48563...

[2]  1242 TTATAAGTACTTCGTAAAATTCG...GATTCAGGAGAATATAATCCCAT NC_001133.9:80709...

[3]   921 TTATTGTGCCACTTCTTGCTCAG...TGCCAAACAATTTCGTCGGACAT NC_001133.9:10022...

[4]  1404 ATGTTGAAGTCAGCCGTTTATTC...ACGACACTTTATTAAAACAGTAA NC_001133.9:22545...

[5]  1920 TTAGATATATCGTATCTTTTCAT...AATCCAGAATCAGATACCGGCAT NC_001134.8:13834...

[6]  2070 TTAATCAATTTTGTCGTACAATT...GATTGTGAACCTTGGGCTTGCAT NC_001134.8:32187...

ADD REPLY • link 8.2 years ago by kevinm ▴ 40

0

Entering edit mode

8.2 years ago

Macherki M E ▴ 120

I think that I already give answer here. Moreover, the distance between two motifs A and B is usually formulated as geometric distribution Where:

P(A,B,distance =k)=p(A)P(B)*(1-P(B))^k

You can simulate k (exemple 1:10000), and the Expected value will be the sum of pi*ni

ADD COMMENT • link 8.2 years ago by Macherki M E ▴ 120

score 4 · Accepted Answer · 2016-10-12

Assuming I understood your problem this is my attempt to a solution. I do not use the functions in the script you pointed out, but the Biostrings Bioconductor package. Also, I use the Bioconductor package BSgenome.Scerevisiae.UCSC.sacCer3 for the yeast genome as example sequence data.

# load required packages.
library(ggplot2) # for qplot().
library(Biostrings)
library(BSgenome.Scerevisiae.UCSC.sacCer3)

# get sequence data (DNAStringSet)
seq <- getSeq(Scerevisiae)

# create motif dictionary.
dict0 <- DNAStringSet(
  x = c("TCAA",
        "GGAT",
        "ATAT",
        "GGCC")
)
pdict0 <- PDict(dict0)

# search with the dictionary every sequence in seq.
res <- lapply(seq, function(s) matchPDict(pdict0, s))
res[[1]]

MIndex object of length 4
[[1]]
IRanges object with 1533 ranges and 0 metadata columns:
             start       end     width
         <integer> <integer> <integer>
     [1]       114       117         4
     [2]       161       164         4
     [3]       456       459         4
     [4]       719       722         4
     [5]       776       779         4
     ...       ...       ...       ...
  [1529]    228963    228966         4
  [1530]    229173    229176         4
  [1531]    229260    229263         4
  [1532]    229817    229820         4
  [1533]    229825    229828         4

...
<3 more elements>

# calculate combinations of motifs to compare
# assuming you want to compute different motifs to each other.
motifcomp <- t(combn(seq_len(length(pdict0)), 2))
colnames(motifcomp) <- c("motif_i", "motid_j")
motifcomp

     motif_i motid_j
[1,]       1       2
[2,]       1       3
[3,]       1       4
[4,]       2       3
[5,]       2       4
[6,]       3       4

# iterate over the comparison matrix. for each pair of motifs and
# compute the distance to the nearest one over all sequences.
# this gives a list with each elements being the distances.
foo <- lapply(seq_len(nrow(motifcomp)), function(i) {
  m <- motifcomp[i,]
  d <- sapply(res, function(r) {
    mcols(distanceToNearest(r[[m[1]]], r[[m[2]]]))$distance
  })
  unlist(d)
})

# we can check these distances are not normally distributed.
qplot(foo[[1]])

# now compute mean (and maybe other summaries, here standard deviation).
hoo <- lapply(foo, function(x) {
  data.frame(mean = mean(x), sd = sd(x))
})
hoo <- do.call(rbind, hoo)

data.frame(motifcomp, hoo)

  motif_i motid_j      mean        sd
1       1       2 150.15481 154.67573
2       1       3  71.63235  85.10428
3       1       4 358.76577 390.83777
4       2       3  69.26831  87.50602
5       2       4 348.63816 389.88737
6       3       4 378.64977 409.81555