Question

Repeat Subunit Based Multiple Alignment Of Dna

14.7 years ago

Bryan Maloney ▴ 30

I want to align more than 50 sequences of a polymorphic stretch of promoter DNA. Each sequence consists of 14-31 repeat subunits drawn from a set of 174 incompletely homologous subunits. The subunits are 18-29 bases long and have a four-part internal structure. I want the alignment to be guided by the subunits more than by unstructured primary DNA sequence. Is there any software that can do this?

Thank you.

multiple alignment dna • 3.1k views

Ram · Answer 1 · 2010-03-05

14.7 years ago

Darked89 4.7k

I don't know exactly how to do this, but I can think of two routes to investigate:

- LASTZ has something called "quantum DNA": http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00.html#fmt_qdna
- Instead of using a "linear" aligner, go for graph-based ones:
  - POA: http://bioinfo.mbi.ucla.edu/poa/
  - AliWABA: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538870/

Small update: another aligner, Vmatch, can use an arbitrary alphabet (of fewer than 250 symbols): http://www.vmatch.de/
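
To make the arbitrary-alphabet idea concrete, here is a minimal Python sketch of recoding each sequence as a string of subunit symbols, so that a tool working on a custom alphabet would compare whole subunits rather than bases. It assumes the sequences have already been segmented into named subunits; the names (`A1`, `B3`, ...) and both helper functions are hypothetical, and the sketch does not produce the symbol-map file that any particular tool expects.

```python
def build_symbol_table(subunit_names):
    """Assign each distinct subunit name an integer code (0, 1, 2, ...)."""
    unique = sorted(set(subunit_names))
    # 174 subunit types fit comfortably under the 250-symbol limit mentioned above.
    if len(unique) >= 250:
        raise ValueError("alphabet must stay below 250 symbols")
    return {name: code for code, name in enumerate(unique)}


def encode(subunits, table):
    """Recode a list of subunit names as one byte per subunit."""
    return bytes(table[name] for name in subunits)


# Hypothetical input: each sequence is already split into named subunits.
sequences = {
    "allele_1": ["A1", "B3", "B3", "C2"],
    "allele_2": ["A1", "B3", "C2", "C7"],
}

table = build_symbol_table(
    name for subunits in sequences.values() for name in subunits
)
encoded = {seq_id: encode(subunits, table) for seq_id, subunits in sequences.items()}
```

A recoding like this treats every subunit type as equally different from every other, so the incomplete homology between subunits would have to be supplied separately, for example as a substitution score matrix over the symbols.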

Reply written 14.7 years ago by Darked89

Ram · Answer 2 · 2010-03-19

As correctly tagged, this is a multiple sequence alignment problem, so pairwise/database alignment tools (BLAST, or BLASTZ/LASTZ as mentioned above) are not an option.

Well-understood tools such as ClustalW, T-Coffee, or DIALIGN might already do the job. Try low gap opening/extension costs. I would try a well-known algorithm first and move on to the more 'esoteric' stuff only if that doesn't give good results.
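
To illustrate why the gap costs matter when sequences differ by whole inserted or deleted repeat units, here is a small Python sketch that scores a global pairwise alignment under two gap penalties. It is only a toy: it uses a single linear gap penalty instead of separate opening and extension costs, it is pairwise rather than multiple, and the function name, scoring values, and sequences are all made up for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty, keeping two rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Toy sequences: the second carries a two-base insertion in the middle.
short_seq = "ACGTACGT"
long_seq = "ACGTTTACGT"

low_cost = global_alignment_score(short_seq, long_seq, gap=-1)
high_cost = global_alignment_score(short_seq, long_seq, gap=-5)
```

With the cheap gap the insertion is absorbed at little cost and the score stays positive; with the expensive gap the same alignment is heavily penalised, which is the situation low gap costs are meant to avoid.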

DIALIGN may be better than ClustalW in your case, because it uses no gap costs:

This approach can be used for both global and local alignment, but it is particularly successful in situations where sequences share only local homologies. (From the BiBiServ description)

This seems to fit your case quite well.

Another possibility would be to mask out the portions of each sequence that are known a priori to be less conserved. That requires knowledge about each sequence and a manually designed mask. One could even use a vector of base-specific weights, but I don't know of any MSA tool that takes such input. If you find one, please post it here :)
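
A minimal Python sketch of the masking idea, assuming the less conserved regions are given by hand as 0-based, half-open intervals. The function name, the example sequence, and the coordinates are hypothetical, and how a given MSA tool treats the mask character `N` is something to check for that tool.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Return a copy of sequence with each (start, end) interval replaced by mask_char."""
    chars = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Overwrite the interval in place so the sequence length is preserved.
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


# Hypothetical example: mask positions 2-4 and 9-10 of a 12-base sequence.
masked = mask_regions("ACGTACGTACGT", [(2, 5), (9, 11)])
```

Because the masked sequence keeps its original length, positions in the resulting alignment can still be mapped back to the unmasked sequence.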