Getting Pairwise Sequence Alignment Score With Biopython
2
0
Entering edit mode
13.0 years ago
Lakshmi • 0

Hello,

I used the following code to run clustalw.I got an alignment file using this code. But I need to get pairwise sequence alignment score and also has to get distance matrix based on sequence identity.My aim is to do hierarchical clustering. How can I do these things with biopython.

import os
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\Program Files\clustalw2\clustalw2.exe"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile="C:\Users\pp\Desktop\myseqs.fasta") 
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
 stdout, stderr = clustalw_cline()
from Bio import AlignIO
align = AlignIO.read("C:\Users\pp\Desktop\myseqs.aln", "clustal")
print align
biopython alignment clustalw • 5.9k views
ADD COMMENT
0
Entering edit mode

So you basically want to build a phylogenetic tree?

ADD REPLY
0
Entering edit mode

Thanks for your comment. Yes I would like to create a dendrogram.

ADD REPLY
1
Entering edit mode
13.0 years ago

There are already plenty of software you can use to make a tree. You can try:

  • [?]EBI's ClustalW phylogeny tool[?]
  • [?]Mr Bayes[?]
  • [?]PHYLIP[?]
  • [?]A huge list of other phylogenetic tree generation software[?]

People usually tell me to us Mr Bayes.

ADD COMMENT
0
Entering edit mode

Thanks for your answer.I know about software. But I am trying to do this with Biopython.

ADD REPLY
1
Entering edit mode
18 months ago

Maybe quite late, but in case other people are looking for this. Pairwise sequence identity can be easily calculated using basic python functionality and numpy

    #load your alignment as biopython object
    alignment = Bio.AlignIO.read('alignment_refined_1.fasta', "fasta")

    #get the number of sequences in the alignment
    Nseq = len(alignment)

    # Create a 2D NumPy array to store the PID values
    seq_id_array = numpy.zeros((Nseq, Nseq))

    # Iterate through each pair of sequences in the alignment
    for i in range(Nseq):
        for j in range(i+1, Nseq):
            seq1 = alignment_refined_1[i]
            seq2 = alignment_refined_1[j]

            # Calculate the number of identical positions in the two sequences, ignoring positions with a gap
            identical_residues = sum([1 for x, y in zip(seq1, seq2) if x == y and x != "-"])

            # Calculate the PID, ignoring positions with a gap
            pid = identical_residues / sum([1 for x in seq1 if x != "-"]) 

            # Store the PID value in the NumPy array
            seq_id_array[i, j] = pid
            seq_id_array[j, i] = pid

This will give you a numpy array that can then be used to do any kind of operation, as e.g. clustering.

Hope it helps.

Best,

Jonathan

ADD COMMENT
0
Entering edit mode

Thanks for sharing ! Just these days I am trying to wrap my mind on protein alignments. I wonder: 1) "PID" in your post is "percentage of identical matches" ? Is it the same as here: What is "pident" (percentage of identical matches) in the "Diamond" protein alignment ? 2) If you would you be so kind to update your example to show how the data in file 'alignment_refined_1.fasta' can be obtained by BioPython alignments, that would be very helpful ?

ADD REPLY
1
Entering edit mode

Dear Alexander,

to answer 2) first: This is just a fasta file that you can obtain essentially from every multiple sequence alignment software. I usually use MAFFT (https://mafft.cbrc.jp/alignment/software/), for which also a command line wrapper has been build within biopython (https://biopython.org/docs/1.76/api/Bio.Align.Applications.html). However, Clustal omega or muscle will also give you fasta files, if specified.

Regarding 1) In my example, i am looking at the total number of identical amino acids, apart from gapped positions, in the same position between two aligned sequences. This total number is then divided by the total number of sequence positions in the alignment. So with PID I mean pairwise identity. Sorry for the confusion.

ADD REPLY

Login before adding your answer.

Traffic: 2175 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6