I'm interested in aligning a population of related protein sequences against a canonical reference. This doesn't seem to fit into either of the major categories of alignment packages: it's neither pairwise alignment nor multiple sequence alignment. The main problem is that I don't want to allow the aligner to "delete" residues in the reference: every aligned sequence should be the same length as the reference. Perhaps this can be done within existing MSA programs (like Clustal and MAFFT), but it's not obvious to me how to do it. Can someone help out?
An alignment is, first and foremost, an inference of homology. If the sequences are not homologous (for example, if they are different proteins, or even products of the same gene from different transcripts), it does not make sense to align them.
For the sake of argument, let's assume that they are. If every sequence is the same length as the reference, why do you need to align them? If every site is already homologous across all samples, wouldn't a more appropriate workflow be to confirm that everything is actually the same length and then simply write out the file?
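As a rough illustration of that length check, here is a minimal Python sketch. The function names (`parse_fasta`, `length_mismatches`) and the toy sequences are made up for this example, and it assumes plain FASTA text with no unusual formatting:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def length_mismatches(reference, records):
    """Return (name, length) for every sequence whose length differs from the reference."""
    return [(name, len(seq)) for name, seq in records if len(seq) != len(reference)]


# Hypothetical toy data: one sequence matches the reference length, one does not.
reference = "MKTAYI"
population = parse_fasta(">seq1\nMKSAYI\n>seq2\nMKTGGAYI\n")

for name, length in length_mismatches(reference, population):
    print(f"{name}: length {length}, reference length {len(reference)}")
```

If nothing is reported, the sequences can be written out as they are, on the assumption (and it is only an assumption) that equal length really does mean position-by-position homology.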
If a sequence has residues that are not present in the reference, that means there was either an insertion in that sequence or a deletion in the reference. Either way, the alignment has to introduce gaps into the reference, which lengthens it.
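To make that concrete, here is a small Python sketch with hand-written toy alignments (not the output of any real aligner). The helper name `project_onto_reference` is invented for this example. It shows that an insertion in the query puts gap columns into the reference row, and that the only way to get back to the reference length is to throw those columns away, which discards the inserted residues:

```python
def project_onto_reference(ref_aln, query_aln, gap="-"):
    """Drop every alignment column where the reference row has a gap.

    Both arguments are rows of the same pairwise alignment, so they must
    have equal length. The result has the ungapped reference's length;
    residues inserted relative to the reference are lost.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have the same length")
    return "".join(q for r, q in zip(ref_aln, query_aln) if r != gap)


# Toy case 1: the query carries a two-residue insertion ("GG").
ref_row = "MKT--AYI"    # reference row is lengthened by two gap columns
query_row = "MKTGGAYI"
projected = project_onto_reference(ref_row, query_row)
print(len(ref_row.replace("-", "")), len(ref_row), projected)

# Toy case 2: the query has a deletion, so the gap sits in the query row
# and the reference row keeps its original length.
print(project_onto_reference("MKTAYI", "MK-AYI"))
```

Whether dropping insertion columns like this is acceptable depends entirely on what the downstream analysis needs, so treat it as an illustration of the trade-off rather than a recommendation.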
If you could clarify your question, I'll be happy to weigh in - I've used just about every alignment program (and sometimes mappers) to accomplish various tasks. However, it seems to me that you're trying to align things that perhaps shouldn't be aligned in the first place.