Hi All,
I have been using Biopython to explore the diversity in some of my antibody sequences using pairwise alignment. However, because insertions and deletions occur at established positions in antibody amino acid sequences, there are numbering schemes for these sequences that allow residues to be compared like-for-like.
I have provided a couple of examples where the sequences have been numbered and aligned to the Chothia scheme. Missing residues are padded with dashes, so the sequences are the same length.
seq1 = "QVQLVQSGAEVKKPGASVKVSCKASGYTFTV--FYIFWVRQAPGQGPEWMGWINP--NSGGTSYAQNFQGRVTMTRDTSVSTAYMELSRLTSDDTAVYFCARGRRGLITEF--------DYWGQGTLVTVSS"
seq2 = "QVQLVESGGGLVKPGGSLRLSCAASGFTFSD--YYMSWIRQAPGKGLEWVSYISS--SGSTIYYADSVKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYCARIAAAGKN----------DYWGQGTLVTVSS"
When I align these with Biopython's Align module, it inserts additional gaps into the sequences:
from Bio import Align
aligner = Align.PairwiseAligner()
alignments = aligner.align(seq1, seq2)
print(alignments[0])
The resulting alignment looks like this:
QVQLVQ-SGAE---VKKPGA-SVKV---SCKA-SGYTFTV-----FYIF---WV-RQAPGQ-GP-EWMGW---INP----NSGGTS--Y-AQNFQ----GRV-TMT--RDTSVST-A----YMEL---SRLTS---DDTAVYF-CARGRRGLITEF----------------DYWGQGTLVTVSS
QVQLV-ESG--GGLVK-PG-GS---LRLSC-AASG--FT-FSD---Y--YMSW-IRQAPG-KG-LEW---VSYI--SS---SG--STIYYA----DSVKGR-FT--ISRD-----NAKNSLY--LQMNS-L--RAED-TAVY-YCAR-----I---AAAGKN----------DYWGQGTLVTVSS
But I would like to obtain a pairwise alignment score using the original gaps in the sequences. How can this be achieved? Or is there another program which may help? Would a simple Hamming distance be sufficient?
Best, James
If you already have the alignment the way you want it, you could just score it yourself directly in Python with whatever rules you have in mind. A straight Hamming distance could be a simple one-liner like:
sum(x != y for x, y in zip(seq1, seq2))
That said, what's your end goal? There are a bunch of specialized immune receptor programs that might end up being a better fit than plain Python/Biopython, and some related posts here; for example, see this recent answer about sequence annotation and all those competing numbering schemes.
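To expand on scoring the existing alignment yourself, here is a minimal sketch in plain Python. The function name `score_prealigned`, the choice to skip columns that are gaps in both sequences, and the choice to count a residue opposite a gap as a difference are all my own assumptions, not an established scoring rule for Chothia-numbered sequences. The short sequences are made-up toy strings, not your real data.

```python
def score_prealigned(a, b, gap="-"):
    """Compare two sequences already padded to the same numbering scheme.

    Hypothetical helper: no realignment is done, columns are compared as given.
    """
    if len(a) != len(b):
        raise ValueError("pre-aligned sequences must have the same length")

    matches = mismatches = gap_diffs = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # position empty in both sequences: nothing to compare
        if x == gap or y == gap:
            gap_diffs += 1  # residue present in only one sequence
        elif x == y:
            matches += 1
        else:
            mismatches += 1

    compared = matches + mismatches + gap_diffs
    return {
        "hamming": mismatches + gap_diffs,  # positions that differ
        "compared": compared,               # columns with at least one residue
        "identity": matches / compared if compared else 0.0,
    }


# Toy example (made-up strings, padded to equal length with dashes)
toy1 = "QVQLV--SGAE"
toy2 = "QVQLE--SG-E"
result = score_prealigned(toy1, toy2)
print(result["hamming"], result["compared"], round(result["identity"], 3))
```

You would call it the same way on your own `seq1` and `seq2`. If you would rather weight conservative substitutions differently from radical ones, the mismatch branch is where a substitution-matrix lookup could go, but whether that suits your analysis depends on your end goal.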