Question

Filtering MSA by similiarity to a consensus sequence/motif

0

Entering edit mode

3.8 years ago

Jonathan Lefebre ▴ 70

Dear all,

anyone knows a good way of filtering or sorting a large multiple sequence alignment (~8000 sequences) by similarity to a given consensus sequence? A solution using python/biopython would be optimal.

Any help is appreciated!

Best Jonathan

biopython motif multiple alignment MSA python sequence search • 1.7k views

ADD COMMENT • link updated 3.8 years ago by Joe 22k • written 3.8 years ago by Jonathan Lefebre ▴ 70

score 3 · Accepted Answer · 2021-08-13

Here's the basics of some code that will do what you need. If you already have the consensus sequence in another file, you don't need to derive the consensus as I have here. I am also using an arbitrarily low identity and high hamming distance for illustration purposes:

from Bio import AlignIO
from Bio.Align import AlignInfo
import sys


def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


alignment = AlignIO.read(sys.argv[1], "fasta")
threshold = 0.10

summary = AlignInfo.SummaryInfo(alignment)
consensus_seq = summary.dumb_consensus(threshold)

filtered = []
for record in alignment:
    if hamming_distance(record.seq, consensus_seq) < 100:
        filtered.append(record)

[print(record.id) for record in filtered]

In short:

Read your alignment in with Bio.AlignIO.
Read in your consensus sequence (e.g. with SeqIO or derive it in the code).
Iterate each record in the alignment, and compare its sequence distance (in this case using the Hamming Distance) to the consensus.
If it passes whatever similarity threshold, add it to a list for recall later.