Question

Indel coding in python

1

Entering edit mode

4.2 years ago

adrian18_07 ▴ 10

I have 2 sequences with gaps:

15888 CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG

I would like to encode indels in binary form. Gap is marked with 0 and no gap is 1. I want to receive something like this:

15888 0 0 1
42690 1 1 0

Thanks for any answer.

indel coding biopython • 2.1k views

ADD COMMENT • link updated 4.2 years ago by zx8754 12k • written 4.2 years ago by adrian18_07 ▴ 10

1

Entering edit mode

Did you try anything? The idea of online communities is typically not to be code factories but to support each other by providing advise, help and sharing experiences.

ADD REPLY • link 4.2 years ago by ATpoint 85k

0

Entering edit mode

I don't expect to get the code ready. I am asking for advice on how I could start, because I am a beginner.

ADD REPLY • link 4.2 years ago by adrian18_07 ▴ 10

0

Entering edit mode

You should expect that the people answring you will be willing to put in a fraction of the amount of work answering you as you have demonstrated you have put into solving your own problem. How much effort do you think you have demonstrated here?

ADD REPLY • link 4.2 years ago by swbarnes2 14k

0

Entering edit mode

Is it always 2 sequences? If not, is one always considered the reference? - since your result will be subjective on the frame of reference.

ADD REPLY • link 4.2 years ago by Joe 21k

0

Entering edit mode

First I want to do it for these 2 sequences and then expand it.

ADD REPLY • link 4.2 years ago by adrian18_07 ▴ 10

3

Entering edit mode

It's important to frame the larger question carefully, because a solution that works for 2 sequences may not easily generalise to more than 2, hence my questions.

What if, for example, the data looked like this:

15888 CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG
12345 CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG

Is that first gap a deletion in 1 and 3, or an insertion in 2? What about the new deletion I've added which isn't present in any of the other sequences and now changes the numbering? There is no easy way to decide what the output should be without stating your frame of reference.

ADD REPLY • link 4.2 years ago by Joe 21k

2

Entering edit mode

Or how would you notate:

15888 CATG----ACAGAGCGACCCGCG---ACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG
12345 CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG

That's two different gaps at the same coordinates.

This seems like one of those situations where rather than asking "How do I do X", you need to be able to articulate why are you are doing X, and ask "How do I do Y", where Y is your real analysis goal.

ADD REPLY • link 4.2 years ago by swbarnes2 14k

0

Entering edit mode

Ultimately, the code is to be for an unlimited number of sequences. For example: for 10 or 30 sequences. I generally treat insertion or deletion as indel.

ADD REPLY • link 4.2 years ago by adrian18_07 ▴ 10

0

Entering edit mode

What you're asking for is impossible without comparing to a specific reference.

ADD REPLY • link 4.2 years ago by Joe 21k

score 2 · Accepted Answer · 2020-08-24

As suggested, consider the the larger question. But, based on assumptions from your example, looks like an interesting problem.

Here is using R, not going to explain the code in detail. In short, find position of "-" in all strings, create groups based on consecutive positions, i.e.: make "indels" ranges, then compare those position for each string.

foo <- function(x){
  xSplit <- strsplit(x, "")

  # indels
  indels <- unique(
    unlist(
      lapply(xSplit, function(i){ 
        ix <- which(i == "-")
        g <- c(0, cumsum(diff(ix) > 1))
        split(ix, g)
      }),  recursive = FALSE))
  # sort by position
  indels <- indels[order(sapply(indels, `[`, 1), lengths(indels))]
  names(indels) <- paste0("indel_", seq(length(indels)))

  # return binary
  t(
    sapply(xSplit, function(i){
      sapply(indels, function(j){
        !all(i[ j ] == "-")
      })
    }) * 1
  )
}
# example data
x1 <- c("CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG",
        "CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG")
x2 <- c("CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG",
        "CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG",
        "CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG")

# test
foo(x1)
#      indel_1 indel_2 indel_3
# [1,]       0       0       1
# [2,]       1       1       0

foo(x2)
#      indel_1 indel_2 indel_3 indel_4
# [1,]       0       1       0       1
# [2,]       1       1       1       0
# [3,]       0       0       0       1