Code design of DNA alignment degapping when annotations
0
0
Entering edit mode
8.9 years ago
Michael G ▴ 80

This is a question regarding a more efficient code design:

Assume three aligned DNA sequences (strings; seq1, seq2 and seq3) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.

# Input
alignment = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length.
             "seq2":"AT----GC",
             "seq3":"A--CA--C"}
posi_seq1 = {"gene1":[1,4], "gene2":[5,8]} # For simplicity, please assume counting starts with 1.
posi_seq2 = {"gene1":[1,4], "gene2":[5,8]}
posi_seq3 = {"gene1":[1,4], "gene2":[5,8]}

I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.

# Desired output
alignment = {"seq1":"ATGCATGC",
             "seq2":"ATGC",
             "seq3":"ACAC"}
posi_seq1 = {"gene1":[1,4], "gene2":[5,8]}
posi_seq2 = {"gene1":[1,2], "gene2":[3,4]}
posi_seq3 = {"gene1":[1,2], "gene2":[3,4]}

Obtaining the desired output is less trivial than it may seem. Below I wrote some line-numbered pseudocode for this problem, but surely there is a more elegant design.

1 measure lengths of each aligned gene # take any seq, since all aligned
2 list_lengths = list of gene lengths # order is important
3 for seq in alignment
4     outseq = ""
5     for each num in range(1, length(seq)) # weird for-loop is intentional
6         if seq[num] == "-"
7             current_gene = gene whose start/stop positions include num
8             subtract 1 from length of current_gene
9             subtract 1 from lengths of all genes following current_gene in list_lengths
10        else
11            append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable

Can anyone give me suggestions for a shorter, more direct algorithm/code design?

sequence alignment • 1.4k views
ADD COMMENT

Login before adding your answer.

Traffic: 2747 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6