Again a case where I feel unable to help a colleague of mine, but I am sure that somebody here has an easy answer.
What we need is a program for multiple-alignment of DNA sequences. I am a protein person, so I had a look at Muscle and Mafft, which both can handle DNA sequences as well. However, this did not work well, since I don't see a way of tweaking the match/gap parameters the way we need them.
There are several alignment scenarios that need to be covered, one of them being DNA sequences with a relatively modest overlap. In this case, the alignment programs (treating end gaps like normal gaps) decided to find some non-existent 'similarity' between the DNAs and aligned them that way, rather than providing the correct alignment with enormous end gaps. It didn't even help to add a reference sequence to the alignment (note that the reference is not necessarily from the same species, so there are some mismatches, but clearly enough similarity to guide the alignment process).
As this problem resembles the 'assembly problem' common to the sequencing community (of which I am not a member), I had a look at things like phred/phrap (which is much too expensive for us) or bwa (which uses lots of funny terminology like 'color space', which is beyond my horizon). However, our sequences are not exactly genome-size but typically 1-10 kb pieces of genomic and mRNA sequence. Moreover, the 'assembly-type' software does not return anything that looks like a multiple alignment.
Can anybody recommend (free) software that either does conventional DNA multiple alignments but allows the end-gap penalty to be set to zero and allows very cheap gap extensions for accommodating splicing? Alternatively, is there free or cheap DNA assembly software that can produce multiple alignment files in a standard MSA format (FASTA, MSF, whatever)?
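To make the "end-gap penalty of zero" requirement concrete, here is a minimal sketch of what I mean, written as a plain dynamic-programming alignment in Python. It is only an illustration under several assumptions: it is pairwise rather than multiple, it uses a single linear gap penalty (so it does not model the cheap gap extensions needed for introns), and the function name `semiglobal_align`, the scoring values and the toy sequences are all made up for this example.

```python
def semiglobal_align(a, b, match=1, mismatch=-1, gap=-2):
    """Pairwise alignment in which leading and trailing end gaps cost nothing.

    Hypothetical helper for illustration; scoring values are arbitrary.
    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # First row and first column stay 0: leading end gaps are free.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,  # (mis)match
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a

    # Trailing end gaps are free: take the best cell in the last row or column.
    candidates = [(H[n][j], n, j) for j in range(m + 1)]
    candidates += [(H[i][m], i, m) for i in range(n + 1)]
    score, i, j = max(candidates)

    # Unaligned tails are padded with (free) end gaps.
    tail_a = a[i:] + "-" * (m - j)
    tail_b = "-" * (n - i) + b[j:]

    # Trace back until either sequence is exhausted.
    mid_a, mid_b = [], []
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            mid_a.append(a[i - 1]); mid_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            mid_a.append(a[i - 1]); mid_b.append("-"); i -= 1
        else:
            mid_a.append("-"); mid_b.append(b[j - 1]); j -= 1

    # Unaligned heads are padded with (free) end gaps as well.
    head_a = "-" * j + a[:i]
    head_b = b[:j] + "-" * i

    aligned_a = head_a + "".join(reversed(mid_a)) + tail_a
    aligned_b = head_b + "".join(reversed(mid_b)) + tail_b
    return score, aligned_a, aligned_b


if __name__ == "__main__":
    # Toy sequences sharing only a short overlap (ACGTACGA).
    score, x, y = semiglobal_align("TTTTTTACGTACGA", "ACGTACGACCCCCC")
    print(score)  # 8
    print(x)      # TTTTTTACGTACGA------
    print(y)      # ------ACGTACGACCCCCC
```

With end gaps scored like internal gaps, the two toy sequences would instead be forced into a full-length alignment full of mismatches, which is exactly the behaviour I am trying to avoid.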
What is the biggest contig that comes out of the assembly?
It is not really an assembly problem; I just guessed that it could be treated similarly. In a typical situation, a genomic sequence of 2-10 kb is aligned to a number of genomic and cDNA sequences, coming either from the same or from a closely related species. It would already be great if I had a solution for aligning one gene (exons, introns, everything) to e.g. 10 different cDNA fragments with a few mutations.