Question

Code design of DNA alignment degapping when annotations

0

Entering edit mode

8.9 years ago

Michael G ▴ 80

This is a question regarding a more efficient code design:

Assume three aligned DNA sequences (strings; seq1, seq2 and seq3) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.

# Input
alignment = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length.
             "seq2":"AT----GC",
             "seq3":"A--CA--C"}
posi_seq1 = {"gene1":[1,4], "gene2":[5,8]} # For simplicity, please assume counting starts with 1.
posi_seq2 = {"gene1":[1,4], "gene2":[5,8]}
posi_seq3 = {"gene1":[1,4], "gene2":[5,8]}

I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.

# Desired output
alignment = {"seq1":"ATGCATGC",
             "seq2":"ATGC",
             "seq3":"ACAC"}
posi_seq1 = {"gene1":[1,4], "gene2":[5,8]}
posi_seq2 = {"gene1":[1,2], "gene2":[3,4]}
posi_seq3 = {"gene1":[1,2], "gene2":[3,4]}

Obtaining the desired output is less trivial than it may seem. Below I wrote some line-numbered pseudocode for this problem, but surely there is a more elegant design.

1 measure lengths of each aligned gene # take any seq, since all aligned
2 list_lengths = list of gene lengths # order is important
3 for seq in alignment
4     outseq = ""
5     for each num in range(1, length(seq)) # weird for-loop is intentional
6         if seq[num] == "-"
7             current_gene = gene whose start/stop positions include num
8             subtract 1 from length of current_gene
9             subtract 1 from lengths of all genes following current_gene in list_lengths
10        else
11            append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable

Can anyone give me suggestions for a shorter, more direct algorithm/code design?

sequence alignment • 1.4k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by Michael G ▴ 80