I'm quite new to bioinformatics, but I will try to describe my problem accurately.
I have a protein and its corresponding nucleotide sequence. I use multiple sequence alignment to align my original sequence with two other similar proteins (which differ slightly in solubility) to find conserved domains. But when I align the protein sequences and then the nucleotide sequences, I obtain different results. That is to be expected, since the alphabets differ (20 letters for proteins, 4 for nucleotides). So my question is: what should I use for the alignment, protein sequences or nucleotide sequences?
If the species are phylogenetically distant, it is better to use the protein sequence, which is more conserved (think of the third codon position, which can often be mutated without any consequence for the protein sequence).
If the species are close, or if you are comparing individuals of the same species, it is better to use the DNA sequence. This is because the protein sequences will be too similar, and you will see too few differences for the alignment to be informative.
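To make the third-codon-position point concrete, here is a minimal sketch in Python. The two coding sequences are made-up toy examples (not real genes), and the codon table contains only the handful of standard-code codons they use. It just shows that two sequences can differ at the DNA level while encoding an identical protein.

```python
# Toy subset of the standard genetic code; only the codons used below.
CODON_TABLE = {
    "ATG": "M",
    "AAA": "K", "AAG": "K",
    "CTG": "L", "CTC": "L",
    "GGT": "G", "GGC": "G",
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence using the toy table above."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences that differ only at third codon positions.
seq_a = "ATGAAACTGGGT"
seq_b = "ATGAAGCTCGGC"

dna_identity = identity(seq_a, seq_b)
protein_identity = identity(translate(seq_a), translate(seq_b))

print(f"DNA identity:     {dna_identity:.2f}")
print(f"Protein identity: {protein_identity:.2f}")
```

Here the DNA sequences differ at three of twelve positions while the translated proteins are identical, which is why DNA keeps some signal between close relatives when the proteins no longer do.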
Thanks! That's very useful to know. Are there any papers that discuss these problems in depth? I mean, ones that make some quantitative comparison of the two approaches?
Sorry, but this is something that was taught to me by earlier generations of bioinformaticians, and I do not know if there is an official source. I think it comes from the NCBI BLAST manual, and from experience. I'll try to find a good reference.
You should ALWAYS use protein sequences for a multiple sequence alignment when you have both and you are aligning a coding sequence. A DNA multiple alignment may be more useful for building evolutionary trees over shorter distances (< 100 million years), but the actual DNA alignment should be driven by the protein alignment. If the proteins are closely related, there will be few gaps (if any), and your alignment will be very accurate and robust. But a DNA alignment does not know about codons, so it may put gaps in inappropriate places. If the DNA and protein alignments differ, the protein alignment will almost certainly be more accurate, so use proteins.
Once you have a multiple protein sequence alignment, you can use it as a template to build the corresponding DNA sequence alignment. This ensures that every protein gap becomes a 3-nucleotide (codon-sized) DNA gap.
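The protein-to-DNA step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `thread_codons` is a hypothetical helper name, each coding sequence is assumed to be in frame and to match its protein exactly (optionally with one trailing stop codon), and the code does not check that the codons really translate to the aligned residues.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an ungapped coding sequence onto one gapped protein alignment row.

    Each residue consumes the next codon of `cds`; each protein gap
    becomes three gap characters in the DNA row.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)

    # Assumption: a single trailing stop codon may be present; drop it.
    if len(cds) == 3 * (n_residues + 1):
        cds = cds[:-3]
    if len(cds) != 3 * n_residues:
        raise ValueError("coding sequence length does not match the protein")

    columns = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            columns.append(gap * 3)
        else:
            columns.append(cds[pos:pos + 3])
            pos += 3
    return "".join(columns)


# Toy protein alignment (two rows) and the matching coding sequences.
protein_rows = ["MK-L", "MKAL"]
coding_seqs = ["ATGAAACTG", "ATGAAGGCTCTG"]

for row, cds in zip(protein_rows, coding_seqs):
    print(thread_codons(row, cds))
```

For real data, a dedicated tool for this conversion will also validate the translation and handle frameshifts, which this sketch does not attempt.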
Hmm. And now I am confused. According to Giovanni M Dall'Olio, I should use a nucleotide sequence alignment, because I'm comparing very closely related proteins, but according to you, I should stick to the protein sequence alignment.
I think Bill's point on the codons is right on. Alignment algorithms are not aware of codons when they're used in nucleotide alignment, which will make you lose information on sequence homology if codons are disrupted to optimize the scoring criteria employed by the algorithms.
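One rough way to see whether a nucleotide aligner has disrupted codons is to look for gap runs whose length is not a multiple of three. The sketch below does that for a single aligned row; `frame_breaking_gaps` is a hypothetical helper, and it only checks gap length, not whether a gap starts on a codon boundary.

```python
import re


def frame_breaking_gaps(aligned_dna: str, gap: str = "-") -> list[tuple[int, int]]:
    """Return (start, length) for every gap run whose length is not a multiple of 3."""
    pattern = re.escape(gap) + "+"
    return [
        (m.start(), len(m.group()))
        for m in re.finditer(pattern, aligned_dna)
        if len(m.group()) % 3 != 0
    ]


# A codon-sized gap is fine; a single-nucleotide gap shifts the frame.
print(frame_breaking_gaps("ATGAAA---CTG"))
print(frame_breaking_gaps("ATGA-AACTG"))
```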
I know it's a pretty late reply, but I hope it helps new readers coming to this post.
Both approaches are correct. It all depends on what you want from the MSA results.
I.e., if you want to know which amino acids (or the nucleotides that ultimately encode them) are conserved or not conserved, you should do it as Bill described.
But if you are dealing with nucleotide sequences that interact with miRNAs (just an example) or other small RNA families, you need to stick with a nucleotide MSA, as there is no point in doing it at the protein level.
The matter is complicated by the presence of alternative splicing. What you wrote is correct, but assumes that there is only one protein isoform per gene. Given that almost 90% of human genes have more than one isoform, this becomes a problem.
Let's imagine that a gene has 2 splicing isoforms in one species and 3 in another. Which isoforms would you choose to make the alignment? Which criteria would you use?
Let's imagine that a gene has multiple splicing isoforms, but some of these are not known. How would you correct for this in the alignment?
Most exons are about 200 bp in length, while introns are usually ~10,000 bp in length. If you align only the protein sequence, you lose all the information in the intronic and non-coding sequences: splicing signals, regulatory motifs, promoters, etc. If the species compared are sufficiently close, it may be more accurate to align the whole DNA sequence, rather than only the protein.