Question

Extract Identical hits from pairwise global alignments

0

Entering edit mode

5.2 years ago

mnsp088 ▴ 100

Hi!

I am trying to perform 1:1 pairwise global alignments for a bunch of proteins in a fasta file. My end goal is to get a total number of identical hits (I.e. proteins which have a perfect match from start to end). So far I am using this code from Biopython to perform those alignments:

fasta = sys.argv[1]
with open(fasta, 'r') as f:
    seqs = []
    for line in f:
        if not line.startswith('>'):
            seqs.append(line.strip())

combos = itertools.combinations(seqs, 2)

    for k,v in combos:
        aln = pairwise2.align.globalxx(k,v)
        print(format_alignment(*aln[0]))

and it outputs several lines that look like this with a score:

KLARQYMTRARINVLIWPSCSPDLNLIENVWSVLKH----*
                          ||      |     |
--------------------------IE------K-IEQQ*
  Score=4

Is there a way to get the number of identical hits from this output? Or is there an entirely easier way to go about this problem? I am new to python so any tips are greatly appreciated. Thank you.

alignment protein biopython • 1.2k views

ADD COMMENT • link updated 5.2 years ago by Joe 21k • written 5.2 years ago by mnsp088 ▴ 100

0

Entering edit mode

If you're only interested in perfect matches, with the xx algorithm, this should be equivalent to finding out which alignments have a score == len(seq) I think. There's no penalty for gaps (IIRC), even if it does insert them.

You will just need to check that a gapped sequence cannot score identically to a full perfect match, then simply throw in a if aln[0][3] == len(k) check should do it I think.

I havent tested this at all though so I might be missing something.

In short, you need to know what the maximal score possible for a given sequence is (no gaps, perfect matches), given your scoring system, and then just check for alignments with that score.

ADD REPLY • link 5.2 years ago by Joe 21k

score 3 · Accepted Answer · 2019-09-25

3

Entering edit mode

5.2 years ago

Joe 21k

To illustrate my comment, here's some example input, code, and output:

Input testseqs.fa:

>1
ATACTAGCTACGATCGGGCTATATGCGCGCT
>2
ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG
>3
ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG
>4
TTCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTA

Note that 2 and 3 are identical.

With the following identicalseqs.py script invoked as python identicalseqs.py testseqs.fa:

import sys
from Bio import SeqIO
from Bio import pairwise2
import itertools

for k,v in itertools.combinations(SeqIO.parse(sys.argv[1], 'fasta'), 2):
    aln = pairwise2.align.globalxx(k.seq, v.seq)
    print(">{0} (Len = {1}) -> {2} (Len = {3}):".formatk.id, str(len(k.seq)), v.id, str(len(v.seq))))
    if aln[0][2] == len(k.seq):
        print(aln[0])
    else:
        print("Non-identical")

Yields the following output (change as required, this prints explicity which comparison is being made and the aln object tuple for the first alignment:

>1 (Len = 31) -> 2 (Len = 40):
Non-identical
>1 (Len = 31) -> 3 (Len = 40):
Non-identical
>1 (Len = 31) -> 4 (Len = 38):
Non-identical
>2 (Len = 40) -> 3 (Len = 40):
('ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG', 'ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG', 40.0, 0, 40)
>2 (Len = 40) -> 4 (Len = 38):
Non-identical
>3 (Len = 40) -> 4 (Len = 38):
Non-identical

If you wanted to be really robust about it, you could calculate Hamming distances on each pair of strings in aln[0][0] and aln[0][1] and they should be 0 if they're identical post-alignment.

Pro tip:

itertools is designed to do exactly what it says on the tin, work with iterators, so you can put this directly in to the itertools.combinations() call above (as I have), which means you dont have to have everything stored in memory at once - could be useful if your fasta is very large as n choose k combinations can get really large really quickly.

ADD COMMENT • link 5.2 years ago by Joe 21k

0

Entering edit mode

This method worked perfectly. thank you for your thorough explanation and example code!

ADD REPLY • link 5.2 years ago by mnsp088 ▴ 100

0

Entering edit mode

Joe Just a side question out of curiosity - can one use blastp to do this task? I know it is used for local alignments but can we modify the command line arguments to make it perform a global alignment?

ADD REPLY • link 5.2 years ago by mnsp088 ▴ 100

1

Entering edit mode

You could, in theory, but probably shouldn't, make BLAST behave 'pseudo'-globally by reducing the penalties for gaps etc., but it will likely never perform the same as a true global aligner because of the seed strategy etc.

Your best bet would be to use a global aligner like needle - I think BLAT might also work but I forget whether it does global or not, but can be used as essentially a drop in replacement for BLAST. The approach above only works for the specific xx alignment above though, because the scoring matrix is simply 1 for a match, 0 for a mismatch, and gaps are unpenalised. If you switched to using a different scoring matrix, you'd have to know what those scores are and adjust accordingly, to whatever the theoretical maximum score for a sequence is.