I'm trying to better understand methods of reaching a consensus sequence, while keeping the input as simple as possible.
For example, say there are 10 base positions numbered 0 to 9, and the bases listed within each pair of brackets are the bases observed at that position across the input sequences.
0[GG-GGA-GA-TCT-AC]
1[GGAG-GTAAC-TCG-TC]
2[AAAAAAAG]
3[AACTGGG-GAAAGATC]
4[A----ATGAT]
5[TG-TC-CC-GGCCTGA]
6[CCC-GA-TA-GA-CTC]
7[AG-CTA-AGC-G-GCT]
8[ATCAGCTGATGC]
9[GAAAAAATCTATTATA]
How would you reach and notate a consensus sequence?
I guess that if you are trying to get a consensus sequence, the starting sequences should look at least somewhat similar. That is not really the case here. Any special application in mind? Could you put up an example that makes more biological sense? On another note, what are you looking for exactly: a set of rules? An algorithm? A program? Cheers
From your comment to bilouweb, this feels like a question of definition: http://en.wikipedia.org/wiki/Consensus_sequence : "In molecular biology and bioinformatics, consensus sequence refers to the most common nucleotide or amino acid at a particular position after multiple sequences are aligned. A consensus sequence is a way of representing the results of a multiple sequence alignment, where related sequences are compared to each other, and similar functional sequence motifs are found. The consensus sequence shows which residues are most abundant in the alignment at each position."
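To make the "most common residue per position" definition concrete, here is a minimal Python sketch applied to the columns from the question. The names `COLUMNS`, `plurality_base`, and `consensus` are made up for illustration, and two choices are assumptions rather than part of the definition: gaps (`-`) are ignored by default, and a tie for first place is written as `N`.

```python
from collections import Counter

# Columns copied from the question: one string per base position (0-9).
COLUMNS = [
    "GG-GGA-GA-TCT-AC",
    "GGAG-GTAAC-TCG-TC",
    "AAAAAAAG",
    "AACTGGG-GAAAGATC",
    "A----ATGAT",
    "TG-TC-CC-GGCCTGA",
    "CCC-GA-TA-GA-CTC",
    "AG-CTA-AGC-G-GCT",
    "ATCAGCTGATGC",
    "GAAAAAATCTATTATA",
]


def plurality_base(column, count_gaps=False, tie_symbol="N"):
    """Return the most common symbol in one column.

    Assumptions (not part of the quoted definition): gaps are skipped
    unless count_gaps is True, and ties are reported as tie_symbol.
    """
    counts = Counter(column)
    if not count_gaps:
        counts.pop("-", None)
    if not counts:
        return "-"  # the column held nothing but gaps
    top = max(counts.values())
    winners = [base for base, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else tie_symbol


def consensus(columns, **kwargs):
    """Join the per-position winners into one consensus string."""
    return "".join(plurality_base(col, **kwargs) for col in columns)


print(consensus(COLUMNS))
```

With gaps ignored this gives `GGAAACCGNA` for the example; position 8 comes out as `N` because all four bases appear three times each. Passing `count_gaps=True` changes position 4 to `-`, which shows how much the result depends on the gap rule you pick.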
Of course, you could use IUPAC nucleotide codes: http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html You can represent any possible combination of nucleotides at one position using this code. Cheers. You talk about 'not losing' information. This method is not lossless either, in the sense that you lose the proportion of representation of each base. Hope this can help!
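As a rough sketch of the IUPAC idea, the Python snippet below maps the set of bases seen at one position to its ambiguity code. The function name `iupac_code` and the `min_fraction` cutoff are assumptions added for illustration (the cutoff just lets you drop rare bases before choosing a code); the code table itself is the standard IUPAC nucleotide alphabet.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(column, min_fraction=0.0):
    """Return the IUPAC code for the bases present in one column.

    Assumptions: gaps ("-") are ignored, input is upper-case A/C/G/T,
    and bases rarer than min_fraction of the non-gap bases are dropped.
    """
    counts = Counter(base for base in column if base != "-")
    total = sum(counts.values())
    if total == 0:
        return "-"  # gap-only column
    kept = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
    # Fall back to N if the cutoff removed everything or a symbol is unknown.
    return IUPAC.get(kept, "N")


print(iupac_code("AAAAAAAG"))        # position 2, every observed base kept
print(iupac_code("AAAAAAAG", 0.25))  # position 2, rare bases dropped
```

For position 2 the first call returns `R` (A or G) and the second returns `A`. Either way the 7-to-1 proportion is gone from the notation, which is the loss of information mentioned above.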
@Eric Normandeau: Are you saying the reference sequence would have more weight? If so, how so? Also, the sequences are meant to be examples of distribution in compact form; I didn't think bias was given to the stability of a given sequence, just to base position. See my comment on the answer below for additional info. If there's something that's still not clear, just ask and I'll do my best to address your question(s). Thanks, and cheers!!
@Eric Normandeau: Thanks for all your comments. My question was more of a generalization of the issue, since it's not my problem, it's someone else's; I'm just trying to get a feel for it. I am in fact aware of the IUPAC nucleotide codes, and agree that they're a good option, though they still do not account for the weight of each base within a position. Again, thanks for all your comments -- cheers!!
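One way to keep the per-base weights, sketched here in Python under my own assumptions, is to report the frequency of every symbol at a position instead of collapsing it to a single letter. The helper names `base_frequencies` and `format_profile`, and the `A:87.5%` style of notation, are invented for this example and are not an established standard.

```python
from collections import Counter


def base_frequencies(column, count_gaps=True):
    """Return {symbol: fraction} for one column, most common first.

    Assumption: gaps ("-") count as their own category unless
    count_gaps is False.
    """
    counts = Counter(column)
    if not count_gaps:
        counts.pop("-", None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.most_common()}


def format_profile(freqs):
    """Render the fractions as a compact, human-readable string."""
    return " ".join(f"{base}:{frac:.1%}" for base, frac in freqs.items())


print(format_profile(base_frequencies("AAAAAAAG")))  # position 2
```

For position 2 this prints `A:87.5% G:12.5%`. A list of such per-position tables keeps the proportions that a single consensus letter or IUPAC code throws away, at the cost of a longer notation.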