This could be as simple as the below, but it has the following assumptions until OP clarifies:
- Sequences are aligned or already the same length.
- If they are aligned, they are provided as an aligned FASTA, rather than another alignment format.
If not, the Hamming functionality can be supplemented with an alignment step, but this will probably start to get slow with the OP's 400,000 sequences.
from Bio import SeqIO
import sys
def hamming_distance(s1, s2):
"""Return the Hamming distance between equal-length sequences"""
if len(s1) != len(s2):
raise ValueError("Undefined for sequences of unequal length")
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
for i, record in enumerate(SeqIO.parse(sys.argv[1], 'fasta')):
if i == 0:
reference = record
print("\t".join([reference.id, record.id, str(hamming_distance(reference.seq, record.seq))]))
Example input:
>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
>Fugu
ETVLSVGLSAETEISLSLNGSEPLEDTGQTLASCGIVSGDLIRVALIRAADAPDRDDGGG
HSEQVSQEAKLPDASGASTDSDQAPGPAASCWEPMLCSETDEGQAPWSLELLYHSAQVSG
PGDALVVAANLLMIETGFSPQDSQLKPAEMPAGWRCGGVYKLQYSHRLCGDSVVVMVAVS
MGSALIINGLLEVNQSADSVCKLCVDPSSYVTEWPGDSAAAAFKELNKLSRVFKDQVAYP
LITAARHAMALPVAFGLTALPPELLLRVFRLLDVRSVVMLSAVCRHFGAITRDTALWRHL
YCRDFRDSHAGSRDTDWKEVYRRSYKSRSAVRRSHECFLPPLYPNPRGVFTPPPPVPGII
GEYDQRPILPRPRYDPMSPFPDLDRQP
>Chicken
RALLAWGYSSDTEFSITLNGKDALTEDEKTLASYGIVPGDLICLLLEETDLPPPSSSPPS
LQNGKNGSSLEFPSGLVPEDVDLEEGTGSYPSEPMLCSEAADGEIPHSLEVLYLSAECTS
ATDALIVLVHLLMMETGYVPQGTEAKAVSMPEKWRGNGVYKLQYTHPLCEEGSAGLTCVP
LGDLVAINATLKINREIKGVKRIQLLPASFVCFQEPEKVAGVYKDLQKLSRLFKDQLVYS
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDVRSLISLSAVCRDLYAASNDQLLWRFM
YLRDFRDPIARPRDTDWKELYKKKLKQKEALRWRHMFLPPPFHPNPFYPSPFPIYPPMVI
GEYGERPSLIPPHFDPIGSLPGANPTL
>Zebra
SMTENRTAGSDTAFSVTLNRKDALTEDQKTLASYGIVSGDLICLLLEEPDLPPPPATPAP
LQNGNNGSSLEFPSGLVPEDADLEEGTGSYPSEPMLCSEAADGETPHSLEMLYLSAECTS
ATDALIVLVHLLMMETGYVPQGIEAKAVFMPEKWRGNGVYKLQYTHPLCGEGCAGLTCVP
LGDLIAINATLKINEEIRSVKRIQLLPSSFVCFQDPEKVAGVYKDLQKLSRLFKDQLVYS
LLAAARQALNLPDVFGLLVLPLELKLRIFRLLDVRSLISLSAVCRDLYTASNDQLLWRFM
YLRDFRDPIARPRDTDWKELYKKKLKQKEALRWRHMMLLPPFHPNPFYPNPFPIYPPMII
GEYDERPSLIPPHFDPIGSLPGANPML
>Anole
QALLSWGYSSETKFEITLNNKDSLVGDQDTLASFGIVSGDLICLILEDDASSPSSSLPSS
QSNHHSGPSQEFTSEGGPDDLDLQEATGSFPSEPMLCCEATDGQVPHSLQTLYHSAECTN
ANDALIVSIHLIMMETGYVPQGTEAKASSMPENWRNKGVYKLLYTHPLCENGFAVLTCVP
LGNLIVVNAMLKITSDIKSVKRLQLLPTSFICFQDSANVVGVYKDLQKLSRLFKDRLVYP
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDFRSLLSLSAVCHDLYAASNDQLLWRFI
YLRDFRDPVARSRDTDWKELYKKKMKQKDALRWRHMMFLPPLHPNPLYPNPFPLYPPMII
GEMDERPSLFPSHLDPFGSFQNPNPTL
>Human
QSLLTWGYSSNTRFTITLNYKDPLTGDEETLASYGIVSGDLICLILQDDIIPSSTSEHSS
LQNNSNGPSQNFEAESIQDNAHMAEGTGFYPSEPMLCSESVEGQVPHSLETLYQSADCSD
ANDALIVLIHLLMLESGYIPQGTEAKALSMPEKWKLSGVYKLQYMHPLCEGSSATLTCVP
LGNLIVVNATLKINNEIRSVKRLQLLPESFICKKLGENVANIYKDLQKLSRLFKDQLVYP
LLAFTRQALNLPDVFGLVVLPLELKLRIFRLLDVRSVLSLSAVCRDLFTASNDPLLWRFL
YLRDFRDNTVRVQDTDWKELYRKRHIQRESPKGRFVMLLPPFYPNPLHPRPFPRLPPGII
GEYDQRPSLIPPRFDPVGPLPGPNPIL
Example output:
Zebrafish Zebrafish 0
Zebrafish Fugu 214
Zebrafish Chicken 223
Zebrafish Zebra 226
Zebrafish Anole 233
Zebrafish Human 223
This sounds like genome assembly. Why are you doing this in BioPython and not using an existing tool?
I think Ram is right and you'll want to look for another tool for this. It won't be difficult in python, but it probably will be slow with that much data.
Do you need the sequences to be aligned, or are they already the same length/aligned?
Sounds like you just need multiple pairwise Hamming or Levenshtein values.