Question

Sequence Alignment Utility Sought

3


14.8 years ago

Bryan Maloney ▴ 40

I seek a utility that will perform a multiple alignment of structured DNA sequences. The sequences are a set of degenerate repeat polymorphisms. The number of repeats ranges from 7 to 24 subunits in the polymorphic region. The repeat units have a four-subunit structure, and each subunit appears to vary in a non-random fashion, independently of the other three subunits within a repeat unit. All told, there are well over 150 distinct repeat units that can appear in the polymorphic region, and there are fewer of each subunit, meaning that this could also be seen as a permutation/combination problem.

While these are DNA sequences, I wish to do an alignment based purely on the repeat units. I can already construct similarity matrices for the repeat units to test different hypotheses. The problem is that a manual alignment would be prohibitively tedious with the number of sequences I have.

sequence multiple • 2.9k views

Updated 14.6 years ago by Iain ▴ 260 • written 14.8 years ago by Bryan Maloney ▴ 40

1


Is it possible to determine which repeats are orthologous based on the sequence? With normal repeats this is often impossible, but you say these are degenerate?

14.8 years ago by Dave Lunt ★ 2.0k

0


That specific detail I can do with fairly simple pairwise comparisons of all units, partitioned into subunits. Unfortunately, that still does not solve the problem of the multiple alignment of the actual complete repeat sequences (over 30 sequences).

14.8 years ago by Bryan Maloney ▴ 40

Ram · Answer 1 · 2011-07-18

Hi Bryan,

In case you haven't found a solution to your problem:

If it is possible to calculate all pairwise alignments of the units that you are interested in, you should be able to generate the multiple sequence alignment using the T-Coffee program.

All the pairwise alignments should be converted into a library, which T-Coffee can do. Such a library assigns a weight to each pair of aligned residues. The T-Coffee algorithm then seeks the multiple sequence alignment that maximizes the sum of these weights.
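To make the library idea concrete, here is a minimal Python sketch, not T-Coffee's own code or file format. It assumes each repeat unit is written as a sequence of four subunit labels, aligns every pair of sequences at the level of whole units, and records a weight for each aligned pair of units. The names `unit_similarity`, `align_units`, `build_library`, the gap penalty, and the scoring scheme are all hypothetical placeholders; you would substitute your own similarity matrices.

```python
from itertools import combinations

GAP = -1  # hypothetical gap penalty; tune for real data


def unit_similarity(u, v):
    """Placeholder score: shared subunits (0..4) shifted to the range -2..2."""
    return sum(x == y for x, y in zip(u, v)) - 2


def align_units(a, b, score=unit_similarity, gap=GAP):
    """Global (Needleman-Wunsch) alignment of two lists of repeat units.

    Returns the (i, j) index pairs of units placed in the same column.
    """
    n, m = len(a), len(b)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        best[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + score(a[i - 1], b[j - 1]), "diag"),
                (best[i - 1][j] + gap, "up"),
                (best[i][j - 1] + gap, "left"),
            ]
            # On ties, max() keeps the first option, so "diag" is preferred.
            best[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner, collecting matched units.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def build_library(seqs):
    """Map ((name_a, i), (name_b, j)) -> weight for every aligned unit pair."""
    library = {}
    for (na, a), (nb, b) in combinations(seqs.items(), 2):
        for i, j in align_units(a, b):
            # Weight = number of identical subunits (an illustrative choice).
            library[((na, i), (nb, j))] = sum(x == y for x, y in zip(a[i], b[j]))
    return library


# Toy input: each string is one repeat unit, each character one subunit label.
seqs = {"x": ["abcd", "efgh", "ijkl"], "y": ["abcd", "ijkl"]}
library = build_library(seqs)
```

The resulting dictionary plays the role of the weighted pairs described above; writing it out in a form T-Coffee accepts is left to the T-Coffee documentation.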

You might have to do some sequence manipulation to generate the initial alignments (or T-Coffee might handle this natively):

a. Extract the sequence for the subunits you want to align.
b. Calculate the alignment.
c. Append the sequence removed in step a back onto the alignment, so that the full-length sequences in the alignment match the input sequences, with only the subunits aligned.
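The extract-and-reattach steps could be sketched roughly as follows in Python. This assumes you already know the 0-based, half-open coordinates of the region to align in each sequence, and that the aligned regions come back from whatever aligner you use as gapped strings. The function names `split_region` and `reattach_flanks` are made up for illustration, and the alignment itself is not performed here.

```python
def split_region(seq, start, end):
    """Split seq into (left flank, region to align, right flank)."""
    return seq[:start], seq[start:end], seq[end:]


def reattach_flanks(flanks, aligned_cores, gap="-"):
    """Put the unaligned flanks back around each aligned region.

    flanks:        {name: (left, right)} as returned by split_region
    aligned_cores: {name: gapped region} from the aligner of your choice
    Flanks are padded with gaps so every output row has the same length.
    """
    left_width = max(len(left) for left, _ in flanks.values())
    right_width = max(len(right) for _, right in flanks.values())
    rows = {}
    for name, core in aligned_cores.items():
        left, right = flanks[name]
        rows[name] = left.rjust(left_width, gap) + core + right.ljust(right_width, gap)
    return rows


# Toy example with hypothetical coordinates and a hand-written alignment.
seqs = {"s1": "TTACGTACGTGG", "s2": "TACGTG"}
coords = {"s1": (2, 10), "s2": (1, 5)}
parts = {name: split_region(seqs[name], *coords[name]) for name in seqs}
flanks = {name: (left, right) for name, (left, _, right) in parts.items()}
aligned_cores = {"s1": "ACGTACGT", "s2": "ACGT----"}  # stand-in for aligner output
rows = reattach_flanks(flanks, aligned_cores)
```

A useful sanity check is that removing the gap characters from each output row gives back the original input sequence.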

The code is available here.