Question

Consensus Sequence Algorithms And Notation

5

Entering edit mode

13.5 years ago

Blunders ★ 1.1k

Trying to better understand methods of reaching a consensus sequence, while keeping the input as simple as possible.

For example, say there are 10 bases numbered 0123456789, and each base listed within the brackets is the base pulled from the same base position.

0[GG-GGA-GA-TCT-AC]
1[GGAG-GTAAC-TCG-TC]
2[AAAAAAAG]
3[AACTGGG-GAAAGATC]
4[A----ATGAT]
5[TG-TC-CC-GGCCTGA]
6[CCC-GA-TA-GA-CTC]
7[AG-CTA-AGC-G-GCT]
8[ATCAGCTGATGC]
9[GAAAAAATCTATTATA]

How would you reach and notate a consensus sequence?

consensus • 9.1k views

ADD COMMENT • link updated 13.5 years ago by brentp 24k • written 13.5 years ago by Blunders ★ 1.1k

2

Entering edit mode

I guess that if you are trying to get the consensus sequence, the starting sequences should look at least somewhat similar. It is not really the case here. Any special application in mind? Should you put up an example that makes more biological sens? On another note, what are you looking for exactly: a set of rules? an algorithm? a program? Cheers

ADD REPLY • link 13.5 years ago by Eric Normandeau 11k

1

Entering edit mode

From your comment to bilouweb, this feels like a question of definition: http://en.wikipedia.org/wiki/Consensus_sequence : "In molecular biology and bioinformatics, consensus sequence refers to the most common nucleotide or amino acid at a particular position after multiple sequences are aligned. A consensus sequence is a way of representing the results of a multiple sequence alignment, where related sequences are compared to each other, and similar functional sequence motifs are found. The consensus sequence shows which residues are most abundant in the alignment at each position."

ADD REPLY • link 13.5 years ago by Eric Normandeau 11k

1

Entering edit mode

Of course, you could use IUPAC nucleotide codes: http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html You can represent any possible combination of nucleotides at one position using this code. Cheers. You talk about 'not loosing' information. This method is not lossless either, in the sens that you loose the proportion of representation of each base. Hope this can help!

ADD REPLY • link 13.5 years ago by Eric Normandeau 11k

0

Entering edit mode

@Eric normandeau: Are you saying the reference sequence would have more weight? If so, how so? Also, the sequences are meant to be examples of distribution in compact form, didn't think bias was given to the stablity of a given sequence, just base position. See my comment on the answer below for additional info. If there's something that's still not clear just ask and I'll do my best to address your question(s). Thanks, and cheers!!

ADD REPLY • link 13.5 years ago by Blunders ★ 1.1k

0

Entering edit mode

+3 @Eric Normandeau: Thanks for all your comments, my question was more of a generalization of the issue, since it's not my problem, it's someone else; just trying to get a feel for it. I in fact am aware of the IUPAC nucleotide codes, and agree that they're a good option - though it still does not account for the weight of values within those bases. Again, thanks for all your comments -- cheers!!

ADD REPLY • link 13.5 years ago by Blunders ★ 1.1k

score 7 · Answer 1 · 2011-05-13

+1 To Michael's suggestion of using a sequence logo. In addition, it's hard to answer your question fully without knowing the context--are your bases coming from sequenced reads? Anyway, might have a look at Titus Brown's motility. Below is an example of its use-- generating PWM's and consensus sequences Note that the output (GGAAACCGNA) is the IUPAC sequence with the highest score from the generated position weight matrix:

import motility
import sys

def make_pwm(fname):
    """create a position weight matrix for the data in format like:
        GG-GGA-GA-TCT-AC
        GGAG-GTAAC-TCG-TC
        AAAAAAAG
        AACTGGG-GAAAGATC
        A----ATGAT
        TG-TC-CC-GGCCTGA
        CCC-GA-TA-GA-CTC
        AG-CTA-AGC-G-GCT
        ATCAGCTGATGC
        GAAAAAATCTATTATA

    '-' is converted to 'N'
    """
    matrix = []
    for line in open(fname):
        nts = line.upper().rstrip().replace("-", "N")
        freq = dict.fromkeys('ACGTN', 0.0)
        for nt in nts: freq[nt] += 1
        matrix.append(tuple((freq[bp] / len(nts)) for bp in "ACGT"))

    return motility.PWM(matrix)

def main():
    consensus_fn = sys.argv[1]
    pwm = make_pwm(consensus_fn)
    max_score = pwm.max_score()
    best_sites = pwm.generate_sites_over(max_score)
    # ('GGAAACCGAA', 'GGAAACCGCA', 'GGAAACCGGA', 'GGAAACCGTA')
    print motility.make_iupac_motif(best_sites)
    # GGAAACCGNA

if __name__ == "__main__":
    main()

Ram · Answer 2 · 2011-05-13

4

Entering edit mode

13.5 years ago

Michael Schubert ★ 7.1k

It seems you want a sequence logo instead of a consensus sequence (see Wikipedia as well).

However (1) Eric's comment is right on point and (2) if you gave us more information on what you are trying to do we could provide you with a better answer.

ADD COMMENT • link 13.5 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

@Michael Schubert: Thanks, I'd run across the sequence logo when reviewing the wiki for http://en.wikipedia.org/wiki/Consensus_sequence; which appear to be a twist on http://en.wikipedia.org/wiki/Tag_cloud. That's not what I'm looking for, since that's designed main for humans to read, I'm looking more for data structures that are semi-optimality for an algo, but still human readable; unable to think of an example of this off the top of my head.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 13.5 years ago by Blunders ★ 1.1k

score 3 · Answer 3 · 2011-05-12

3

Entering edit mode

13.5 years ago

Bilouweb ★ 1.1k

A first and simple algorithm is to take the amino acid mostly represented in each column of the multiple alignment. If there is an ambiguity it is common to put a question mark.

For your example, it gives something like : AG?AGATGATGCT?TCC

This web-site gives simple examples.

ADD COMMENT • link 13.5 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

@bilouweb: True, though using that algo, [AATGC], [A], [AAA], [AA-], etc. would all have the same result, right? Plus, what would [AAGG] be? There has to be a better way that condense info without losing info. Thinking something along the lines of relative probablity, entropy, etc.

ADD REPLY • link 13.5 years ago by Blunders ★ 1.1k

0

Entering edit mode

@bilouweb: Finally got a chance to look at the link you posted (I was on my phone until now) -- pretty cool, too bad the code for the "adv. consensus" is not open source. Thanks!

ADD REPLY • link 13.5 years ago by Blunders ★ 1.1k