Code For Generating 'Consensus' Sequence From Pairwise Alignment
2
4
Entering edit mode
11.1 years ago
Woa ★ 2.9k

Hi All,

Let's say I have the following toy PAIRWISE GLOBAL alignment of two sequences:

aaaa----fghijk
||||.......***
aaaabcde---iij

I wish to generate the following 'consensus' or combined sequence

aaaabcdefghiij

The methodology is:where ever along the length of alignment, there is identity choose any one of the identical characters, when there is a gap in any one of the sequences, choose the non gap character from the other sequence, and when there is a mismatch, choose the corresponding character from the lower sequence.

The actual alignments are derived using Emboss "Needle". Are there scripts/modules available in (Bio)Perl / Python or any other tool to do this?

Thanks in advance

pairwise consensus • 8.1k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
3
Entering edit mode
11.1 years ago
Niek De Klein ★ 2.6k

I don't know an existing module, but in python you can do it with this function

def consensus(sequence_1, sequence_2):
    consensus_sequence = ''
    for index, amino_acid in enumerate(sequence_1):
        if amino_acid == sequence_2[index]:
            consensus_sequence += amino_acid
        elif amino_acid == '-' and sequence_2[index] != '-':
            consensus_sequence += sequence_2[index]
        elif sequence_2[index] == '-' and amino_acid != '-':
            consensus_sequence += amino_acid
        elif amino_acid != sequence_2[index]:
            consensus_sequence += sequence_2[index]
    return consensus_sequence

Then test it with

sequence_1 = 'aaaa----fghijk'
sequence_2 = 'aaaabcde---iij'
consensus_seq = consensus(sequence_1, sequence_2)
print consensus_seq, consensus_seq == 'aaaabcdefghiij'           # gives aaaabcdefghiij True
ADD COMMENT
0
Entering edit mode

This is a very useful piece of code. I want to ask if the consensus is to be desired for a multiple sequences alignment data, how will it be implemented?

ADD REPLY
0
Entering edit mode

Haven't tested for edge cases but can try something like this

def consensus(*sequences):
    if(len(sequences) <= 1):
        raise RuntimeError('Only '+str(len(sequences))+' sequences given, need at least 2')

    consensus_sequence = ''
    for index, amino_acid in enumerate(sequences[0]):
        amino_acids = set([amino_acid])
        for sequence in sequences[1:]:
            amino_acids.add(sequence[index])
        amino_acids.discard('-')

        if len(amino_acids) == 1:
            consensus_sequence += amino_acids.pop()
        else:
            consensus_sequence += '-'
    return consensus_sequence
sequence_1 = 'aaaa----fghijk----'
sequence_2 = 'aaaabcde---iij----'
sequence_3 = 'aaaabcde---iij-l-n'
sequence_4 = 'aaaabcde---iijq-o-'
consensus_seq = consensus(sequence_1, sequence_2, sequence_3, sequence_4)
print (consensus_seq)
ADD REPLY
0
Entering edit mode

I was using a sequence file in fasta format. First I used it to extract sequences in a list, as follows:

fasta = {}
with open('seq.fas') as file_one:
    for line in file_one:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            active_sequence_name = line[1:]
            if active_sequence_name not in fasta:
                fasta[active_sequence_name] = []
            continue
        sequence = line
        fasta[active_sequence_name].append(sequence)
result_values = []
values = list(fasta.values()) # List of values
for val in values:
    result_values.append(val)
print (result_values)

and then I used your code for generating consensus:

def consensus(*sequences):
    if(len(sequences) <= 1):
        raise RuntimeError('Only '+str(len(sequences))+' sequences given, need at least 2')

    consensus_sequence = ''
    for index, amino_acid in enumerate(sequences[0]):
        amino_acids = set([amino_acid])
        for sequence in sequences[1:]:
            amino_acids.add(sequence[index])
        amino_acids.discard('-')

        if len(amino_acids) == 1:
            consensus_sequence += amino_acids.pop()
        else:
            consensus_sequence += '-'
    return consensus_sequence
consensus_seq = consensus(result_values)
print (consensus_seq)

Your code works well for the above condition with 4 sequences but it does not take the list as arguement which I as trying. Can you suggest something for this?

ADD REPLY
1
Entering edit mode
11.1 years ago
SES 8.6k

Here is a one line BioPerl solution:

$ perl -MBio::AlignIO -e 'my $aln = Bio::AlignIO->new(-file => shift); my $str = $aln->next_aln(); print $str->consensus_string(50)' aln.fas
aaaabcdefghiij

If you want to consider DNA/RNA alignments, you will need to incorporate ambiguity codes. See How To Merge Consensus Motif To Degenerate Motif ? [Motif Merge Or Motif Clustering] for an example of how to expand this code into something for DNA/RNA.

ADD COMMENT

Login before adding your answer.

Traffic: 2557 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6