How to compare multiple protein sequences and obtain single point mutations and their positions
2.2 years ago
Francesco ▴ 20

Hi! I need your help. I'm working on an ML model and I need to compare a wild-type (wt) protein sequence to a list of 3000 mutated (mt) sequences. Every mt sequence contains a single point mutation. The aim is to find every mutation and create two columns, one for the mutation position and the other for the type of mutation (amino acid symbol). Can you help me, please?

alignment python pandas biopython • 1.2k views

You can do this easily with a multiple sequence alignment tool and the AlignIO module of Biopython.

Biostars is not a code writing service though. Judging by your tags you already know where to start, so please have a go and we will gladly help troubleshoot.

2.0 years ago
Alban Nabla ▴ 30

If you really want to stick with Python, and oversimplifying a ton, you could go to the EBI website and run a multiple sequence alignment with Clustal Omega, generate a 'dumb' consensus in Python, and then write a script that compares each sequence against that consensus. Something along the lines of:

from Bio import AlignIO
from Bio.Align import AlignInfo

# Read the Clustal Omega alignment and build a simple ('dumb') consensus
alignment = AlignIO.read('alignment.clustal', 'clustal')
summary_align = AlignInfo.SummaryInfo(alignment)
consensus = summary_align.dumb_consensus()

# Map each sequence ID to its differences from the consensus
variants = {}
for pep in alignment:
    variants[pep.id] = []
    for aa in range(len(pep.seq)):  # aa is a 0-based alignment column
        if pep.seq[aa] == '-':
            variants[pep.id].append([aa, 'deletion'])
        elif consensus[aa] == '-':
            variants[pep.id].append([aa, 'insertion'])
        elif pep.seq[aa] != consensus[aa]:
            variants[pep.id].append([aa, 'snp', pep.seq[aa]])
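
For the specific case in the question, where every mutant carries exactly one substitution, an alignment may not be needed at all. The following is only a minimal sketch under that assumption: it assumes each mutant has the same length as the wild type, and the function name `find_point_mutation`, the toy sequences, and the column names are all made up for illustration.

```python
def find_point_mutation(wt, mt):
    """Return (1-based position, wt residue, mutant residue) for one substitution."""
    if len(wt) != len(mt):
        # An indel shifts every downstream residue, so align first in that case
        raise ValueError("sequences differ in length; align them first")
    diffs = [(i + 1, a, b) for i, (a, b) in enumerate(zip(wt, mt)) if a != b]
    if len(diffs) != 1:
        raise ValueError(f"expected exactly 1 substitution, found {len(diffs)}")
    return diffs[0]


# Toy data, invented for this example
wt_seq = "MKTAYIAKQR"
mutants = {"mt1": "MKTAYIAKQK", "mt2": "MATAYIAKQR"}

# One row per mutant: position plus wild-type and mutant amino acid
rows = []
for name, seq in mutants.items():
    pos, wt_aa, mt_aa = find_point_mutation(wt_seq, seq)
    rows.append({"id": name, "position": pos, "wt_aa": wt_aa, "mt_aa": mt_aa})

for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame(rows)` should give the position and amino acid columns asked for. Sequences that fail the length check would need the alignment route above instead.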