If I have a FASTA file that gives aligned protein sequences for genes, e.g. a sequence for paralog A and a sequence for paralog B, is there a requirement that the aligned sequences be the same length? Do most alignment programs yield same length sequences at the end?
Yes, alignment programs usually output alignments with equal-length sequences.
Of course, if you align two sequences of different lengths, gap characters will be introduced to make up the difference (this is what a global alignment does).
It can be a little confusing when you look at your multi-FASTA file, because the gaps may be represented by two different symbols:
- at the beginning and end of sequences they may be represented by spaces
- inside the sequences they are represented by '-'
No matter what the lengths of A and B are, A' and B' (the aligned sequences) will be of the same length. This simply comes from the definition of an alignment. The characters (representing residues: amino acids for proteins, nucleotides for DNA/RNA) from the two sequences are arranged so as to minimize the differences between them, and the remaining empty positions (if any) are filled in with gaps (dash characters). These gaps are typically interpreted as evolutionary events between two homologous sequences, i.e. an insertion of residues into one sequence or a deletion of residues from the other (indels).
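If you want to confirm this for your own file, here is a minimal Python sketch that parses FASTA text and checks that every aligned sequence has the same length. The record names (paralog_A, paralog_B), the toy sequences, and the helper functions are all made up for illustration; for real work a library parser is probably a better choice.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Keep the line as-is: gap characters are part of the alignment.
            records[name] += line
    return records


def alignment_lengths(records):
    """Return the length of each aligned sequence, gaps included."""
    return {name: len(seq) for name, seq in records.items()}


def is_valid_alignment(records):
    """True if all sequences have the same aligned length."""
    return len({len(seq) for seq in records.values()}) <= 1


# Hypothetical toy alignment of two paralogs.
example = """>paralog_A
MKT-AYIAKQR
>paralog_B
MKTWAY--KQR
"""

records = read_fasta(example)
print(alignment_lengths(records))
print(is_valid_alignment(records))
```

Removing the gap characters (for example with `seq.replace("-", "")`) gives back the original unaligned sequences, which are free to differ in length.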
I agree with Peter and Daniel.
I prefer to use the same symbol for all gaps.
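A small Python sketch of that normalization, assuming spaces are the only alternative gap symbol in your file (the function name and the example sequence are hypothetical):

```python
def normalize_gaps(seq, gap="-", alt_gaps=" "):
    """Replace every alternative gap symbol in seq with a single gap character."""
    return "".join(gap if ch in alt_gaps else ch for ch in seq)


# Hypothetical aligned sequence with spaces as end gaps and '-' as an internal gap.
aligned = "  MKT-AY  "
print(normalize_gaps(aligned))
```

The aligned length does not change, since each gap symbol is swapped one-for-one.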