Question

Code For Generating 'Consensus' Sequence From Pairwise Alignment

4

Entering edit mode

11.8 years ago

Woa ★ 2.9k

Hi All,

Let's say I have the following toy PAIRWISE GLOBAL alignment of two sequences:

aaaa----fghijk
||||.......***
aaaabcde---iij

I wish to generate the following 'consensus' or combined sequence

aaaabcdefghiij

The methodology is:where ever along the length of alignment, there is identity choose any one of the identical characters, when there is a gap in any one of the sequences, choose the non gap character from the other sequence, and when there is a mismatch, choose the corresponding character from the lower sequence.

The actual alignments are derived using Emboss "Needle". Are there scripts/modules available in (Bio)Perl / Python or any other tool to do this?

Thanks in advance

pairwise consensus • 8.7k views

ADD COMMENT • link updated 11.8 years ago by SES 8.6k • written 11.8 years ago by Woa ★ 2.9k

0

Entering edit mode

Have you tried this? https://metacpan.org/module/Bio::AlignIO::emboss

ADD REPLY • link 11.8 years ago by jing ▴ 10

score 3 · Answer 1 · 2013-10-15

3

Entering edit mode

11.8 years ago

Niek De Klein ★ 2.6k

I don't know an existing module, but in python you can do it with this function

def consensus(sequence_1, sequence_2):
    consensus_sequence = ''
    for index, amino_acid in enumerate(sequence_1):
        if amino_acid == sequence_2[index]:
            consensus_sequence += amino_acid
        elif amino_acid == '-' and sequence_2[index] != '-':
            consensus_sequence += sequence_2[index]
        elif sequence_2[index] == '-' and amino_acid != '-':
            consensus_sequence += amino_acid
        elif amino_acid != sequence_2[index]:
            consensus_sequence += sequence_2[index]
    return consensus_sequence

Then test it with

sequence_1 = 'aaaa----fghijk'
sequence_2 = 'aaaabcde---iij'
consensus_seq = consensus(sequence_1, sequence_2)
print consensus_seq, consensus_seq == 'aaaabcdefghiij'           # gives aaaabcdefghiij True

ADD COMMENT • link 11.8 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

This is a very useful piece of code. I want to ask if the consensus is to be desired for a multiple sequences alignment data, how will it be implemented?

ADD REPLY • link 6.8 years ago by mdsiddra ▴ 30

0

Entering edit mode

Haven't tested for edge cases but can try something like this

def consensus(*sequences):
    if(len(sequences) <= 1):
        raise RuntimeError('Only '+str(len(sequences))+' sequences given, need at least 2')

    consensus_sequence = ''
    for index, amino_acid in enumerate(sequences[0]):
        amino_acids = set([amino_acid])
        for sequence in sequences[1:]:
            amino_acids.add(sequence[index])
        amino_acids.discard('-')

        if len(amino_acids) == 1:
            consensus_sequence += amino_acids.pop()
        else:
            consensus_sequence += '-'
    return consensus_sequence
sequence_1 = 'aaaa----fghijk----'
sequence_2 = 'aaaabcde---iij----'
sequence_3 = 'aaaabcde---iij-l-n'
sequence_4 = 'aaaabcde---iijq-o-'
consensus_seq = consensus(sequence_1, sequence_2, sequence_3, sequence_4)
print (consensus_seq)

ADD REPLY • link 6.8 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

I was using a sequence file in fasta format. First I used it to extract sequences in a list, as follows:

fasta = {}
with open('seq.fas') as file_one:
    for line in file_one:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            active_sequence_name = line[1:]
            if active_sequence_name not in fasta:
                fasta[active_sequence_name] = []
            continue
        sequence = line
        fasta[active_sequence_name].append(sequence)
result_values = []
values = list(fasta.values()) # List of values
for val in values:
    result_values.append(val)
print (result_values)

and then I used your code for generating consensus:

def consensus(*sequences):
    if(len(sequences) <= 1):
        raise RuntimeError('Only '+str(len(sequences))+' sequences given, need at least 2')

    consensus_sequence = ''
    for index, amino_acid in enumerate(sequences[0]):
        amino_acids = set([amino_acid])
        for sequence in sequences[1:]:
            amino_acids.add(sequence[index])
        amino_acids.discard('-')

        if len(amino_acids) == 1:
            consensus_sequence += amino_acids.pop()
        else:
            consensus_sequence += '-'
    return consensus_sequence
consensus_seq = consensus(result_values)
print (consensus_seq)

Your code works well for the above condition with 4 sequences but it does not take the list as arguement which I as trying. Can you suggest something for this?

ADD REPLY • link 6.8 years ago by mdsiddra ▴ 30

score 1 · Answer 2 · 2013-10-15

Here is a one line BioPerl solution:

$ perl -MBio::AlignIO -e 'my $aln = Bio::AlignIO->new(-file => shift); my $str = $aln->next_aln(); print $str->consensus_string(50)' aln.fas
aaaabcdefghiij

If you want to consider DNA/RNA alignments, you will need to incorporate ambiguity codes. See How To Merge Consensus Motif To Degenerate Motif ? [Motif Merge Or Motif Clustering] for an example of how to expand this code into something for DNA/RNA.