Hi!
I am trying to perform 1:1 pairwise global alignments for a bunch of proteins in a fasta file. My end goal is to get a total number of identical hits (I.e. proteins which have a perfect match from start to end). So far I am using this code from Biopython to perform those alignments:
fasta = sys.argv[1]
with open(fasta, 'r') as f:
seqs = []
for line in f:
if not line.startswith('>'):
seqs.append(line.strip())
combos = itertools.combinations(seqs, 2)
for k,v in combos:
aln = pairwise2.align.globalxx(k,v)
print(format_alignment(*aln[0]))
and it outputs several lines that look like this with a score:
KLARQYMTRARINVLIWPSCSPDLNLIENVWSVLKH----*
|| | |
--------------------------IE------K-IEQQ*
Score=4
Is there a way to get the number of identical hits from this output? Or is there an entirely easier way to go about this problem? I am new to python so any tips are greatly appreciated. Thank you.
If you're only interested in perfect matches, with the
xx
algorithm, this should be equivalent to finding out which alignments have ascore == len(seq)
I think. There's no penalty for gaps (IIRC), even if it does insert them.You will just need to check that a gapped sequence cannot score identically to a full perfect match, then simply throw in a
if aln[0][3] == len(k)
check should do it I think.I havent tested this at all though so I might be missing something.
In short, you need to know what the maximal score possible for a given sequence is (no gaps, perfect matches), given your scoring system, and then just check for alignments with that score.