Question

Getting Pairwise Sequence Alignment Score With Biopython

0

Entering edit mode

13.0 years ago

Lakshmi • 0

Hello,

I used the following code to run clustalw.I got an alignment file using this code. But I need to get pairwise sequence alignment score and also has to get distance matrix based on sequence identity.My aim is to do hierarchical clustering. How can I do these things with biopython.

import os
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\Program Files\clustalw2\clustalw2.exe"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile="C:\Users\pp\Desktop\myseqs.fasta") 
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
 stdout, stderr = clustalw_cline()
from Bio import AlignIO
align = AlignIO.read("C:\Users\pp\Desktop\myseqs.aln", "clustal")
print align

biopython alignment clustalw • 5.9k views

ADD COMMENT • link updated 18 months ago by Jonathan Lefebre ▴ 70 • written 13.0 years ago by Lakshmi • 0

0

Entering edit mode

So you basically want to build a phylogenetic tree?

ADD REPLY • link 13.0 years ago by Damian Kao 16k

0

Entering edit mode

Thanks for your comment. Yes I would like to create a dendrogram.

ADD REPLY • link 13.0 years ago by Lakshmi • 0

score 1 · Answer 1 · 2011-12-07

1

Entering edit mode

13.0 years ago

Damian Kao 16k

There are already plenty of software you can use to make a tree. You can try:

[?]EBI's ClustalW phylogeny tool[?]
[?]Mr Bayes[?]
[?]PHYLIP[?]
[?]A huge list of other phylogenetic tree generation software[?]

People usually tell me to us Mr Bayes.

ADD COMMENT • link 13.0 years ago by Damian Kao 16k

0

Entering edit mode

Thanks for your answer.I know about software. But I am trying to do this with Biopython.

ADD REPLY • link 13.0 years ago by Lakshmi • 0

score 1 · Answer 2 · 2023-06-06

1

Entering edit mode

18 months ago

Jonathan Lefebre ▴ 70

Maybe quite late, but in case other people are looking for this. Pairwise sequence identity can be easily calculated using basic python functionality and numpy

    #load your alignment as biopython object
    alignment = Bio.AlignIO.read('alignment_refined_1.fasta', "fasta")

    #get the number of sequences in the alignment
    Nseq = len(alignment)

    # Create a 2D NumPy array to store the PID values
    seq_id_array = numpy.zeros((Nseq, Nseq))

    # Iterate through each pair of sequences in the alignment
    for i in range(Nseq):
        for j in range(i+1, Nseq):
            seq1 = alignment_refined_1[i]
            seq2 = alignment_refined_1[j]

            # Calculate the number of identical positions in the two sequences, ignoring positions with a gap
            identical_residues = sum([1 for x, y in zip(seq1, seq2) if x == y and x != "-"])

            # Calculate the PID, ignoring positions with a gap
            pid = identical_residues / sum([1 for x in seq1 if x != "-"]) 

            # Store the PID value in the NumPy array
            seq_id_array[i, j] = pid
            seq_id_array[j, i] = pid

This will give you a numpy array that can then be used to do any kind of operation, as e.g. clustering.

Hope it helps.

Best,

Jonathan

ADD COMMENT • link 18 months ago by Jonathan Lefebre ▴ 70

0

Entering edit mode

Thanks for sharing ! Just these days I am trying to wrap my mind on protein alignments. I wonder: 1) "PID" in your post is "percentage of identical matches" ? Is it the same as here: What is "pident" (percentage of identical matches) in the "Diamond" protein alignment ? 2) If you would you be so kind to update your example to show how the data in file 'alignment_refined_1.fasta' can be obtained by BioPython alignments, that would be very helpful ?

ADD REPLY • link 18 months ago by Alexander ▴ 220

1

Entering edit mode

Dear Alexander,

to answer 2) first: This is just a fasta file that you can obtain essentially from every multiple sequence alignment software. I usually use MAFFT (https://mafft.cbrc.jp/alignment/software/), for which also a command line wrapper has been build within biopython (https://biopython.org/docs/1.76/api/Bio.Align.Applications.html). However, Clustal omega or muscle will also give you fasta files, if specified.

Regarding 1) In my example, i am looking at the total number of identical amino acids, apart from gapped positions, in the same position between two aligned sequences. This total number is then divided by the total number of sequence positions in the alignment. So with PID I mean pairwise identity. Sorry for the confusion.

ADD REPLY • link 18 months ago by Jonathan Lefebre ▴ 70