anyone knows a good way of filtering or sorting a large multiple sequence alignment (~8000 sequences) by similarity to a given consensus sequence? A solution using python/biopython would be optimal.
Here's the basics of some code that will do what you need. If you already have the consensus sequence in another file, you don't need to derive the consensus as I have here. I am also using an arbitrarily low identity and high hamming distance for illustration purposes:
from Bio import AlignIO
from Bio.Align import AlignInfo
import sys
def hamming_distance(s1, s2):
"""Return the Hamming distance between equal-length sequences"""
if len(s1) != len(s2):
raise ValueError("Undefined for sequences of unequal length")
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
alignment = AlignIO.read(sys.argv[1], "fasta")
threshold = 0.10
summary = AlignInfo.SummaryInfo(alignment)
consensus_seq = summary.dumb_consensus(threshold)
filtered = []
for record in alignment:
if hamming_distance(record.seq, consensus_seq) < 100:
filtered.append(record)
[print(record.id) for record in filtered]
In short:
Read your alignment in with Bio.AlignIO.
Read in your consensus sequence (e.g. with SeqIO or derive it in the code).
Iterate each record in the alignment, and compare its sequence distance (in this case using the Hamming Distance) to the consensus.
If it passes whatever similarity threshold, add it to a list for recall later.
Note that this is a very naive approach (as evidenced by the use of the dumb_consensus). This will not take in to consideration any amino acid similarity or any other factors other than absolute residue-for-residue identity.
Note that this is a very naive approach (as evidenced by the use of the
dumb_consensus
). This will not take in to consideration any amino acid similarity or any other factors other than absolute residue-for-residue identity.