As suggested, consider the the larger question. But, based on assumptions from your example, looks like an interesting problem.
Here is using R, not going to explain the code in detail. In short, find position of "-"
in all strings, create groups based on consecutive positions, i.e.: make "indels" ranges, then compare those position for each string.
foo <- function(x){
xSplit <- strsplit(x, "")
# indels
indels <- unique(
unlist(
lapply(xSplit, function(i){
ix <- which(i == "-")
g <- c(0, cumsum(diff(ix) > 1))
split(ix, g)
}), recursive = FALSE))
# sort by position
indels <- indels[order(sapply(indels, `[`, 1), lengths(indels))]
names(indels) <- paste0("indel_", seq(length(indels)))
# return binary
t(
sapply(xSplit, function(i){
sapply(indels, function(j){
!all(i[ j ] == "-")
})
}) * 1
)
}
# example data
x1 <- c("CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG",
"CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG")
x2 <- c("CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG",
"CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG",
"CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG")
# test
foo(x1)
# indel_1 indel_2 indel_3
# [1,] 0 0 1
# [2,] 1 1 0
foo(x2)
# indel_1 indel_2 indel_3 indel_4
# [1,] 0 1 0 1
# [2,] 1 1 1 0
# [3,] 0 0 0 1
Did you try anything? The idea of online communities is typically not to be code factories but to support each other by providing advise, help and sharing experiences.
I don't expect to get the code ready. I am asking for advice on how I could start, because I am a beginner.
You should expect that the people answring you will be willing to put in a fraction of the amount of work answering you as you have demonstrated you have put into solving your own problem. How much effort do you think you have demonstrated here?
Is it always 2 sequences? If not, is one always considered the reference? - since your result will be subjective on the frame of reference.
First I want to do it for these 2 sequences and then expand it.
It's important to frame the larger question carefully, because a solution that works for 2 sequences may not easily generalise to more than 2, hence my questions.
What if, for example, the data looked like this:
Is that first gap a deletion in 1 and 3, or an insertion in 2? What about the new deletion I've added which isn't present in any of the other sequences and now changes the numbering? There is no easy way to decide what the output should be without stating your frame of reference.
Or how would you notate:
That's two different gaps at the same coordinates.
This seems like one of those situations where rather than asking "How do I do X", you need to be able to articulate why are you are doing X, and ask "How do I do Y", where Y is your real analysis goal.
Ultimately, the code is to be for an unlimited number of sequences. For example: for 10 or 30 sequences. I generally treat insertion or deletion as indel.
What you're asking for is impossible without comparing to a specific reference.