I have multiple sequence alignment file in fasta format.
I want to extract the result like this with respect to reference sequence A,
B I4E
B V7F
C V7F
C F10W
How can I be able to extract? Thanks in Advance.
I have multiple sequence alignment file in fasta format.
I want to extract the result like this with respect to reference sequence A,
B I4E
B V7F
C V7F
C F10W
How can I be able to extract? Thanks in Advance.
Let's save the three sequences below in file seqs.fas
:
>A
MDIIFWVSTFVLF
>B
MDIEFWFSTFVLF
>C
MDIIFWFSTWVLF
Next, save the code below as pairwise_custom.py
and run:
python pairwise_custom.py seqs.fas
The output will be:
B I4E
B V7F
C V7F
C F10W
The code is a modified version from here - please upvote that post.
#!/usr/bin/env python
# Count how many gaps and mismatches using pairwise2 alignment
# python pairwise_custom.py your_fasta_file.fasta
import itertools, sys
from Bio import SeqIO, pairwise2
from Bio.SubsMat import MatrixInfo as matlist
def my_format_alignment(w, align1, align2, score, begin, end):
z = 1
for a, b in zip(align1[begin:end], align2[begin:end]):
if a == b:
pass
elif a == "-" or b == "-":
print(' ' + IDs[w + 1] + ' ' + a + str(z) + b)
else:
print(' ' + IDs[w + 1] + ' ' + a + str(z) + b)
z += 1
FastaFile = open(sys.argv[1], 'rU')
SEQs = []
IDs = []
for rec in SeqIO.parse(FastaFile, 'fasta'):
IDs.append rec.id)
SEQs.append(rec.seq)
combos = itertools.combinations(SEQs, 2)
z = 0
for k, v in combos:
if z < (len(SEQs) - 1):
aln = pairwise2.align.globalds(k, v, matlist.blosum62, -10, -1)
my_format_alignment(z, *aln[0])
z += 1
I have identified the problem, it comes from the bleach html santizing package by Mozilla, it is an older version that we use. It looks like it happens when the code starts with a parenthesis and contains .id
in the second part of the parenthesis). As always the hardest bugs are those that we did not create ourselves. Now I could update the library, and that would fix this behavior, but then in the meantime the TLS versions have also been updated, and I cannot connect to PyPI to get the new package until a rebuild OpenSSL. This latter would affect all software on the server, hence it is a major undertaking. I have to make sure nothing else breaks when I update OpenSSL. We are also readying the next version of Biostar - we'll see which one makes it in first.
A generic suggestion, assuming your sequences are the same length and highly similar: make a multiple alignment - with any suitable program - which will most likely be completely gapless. Then make a script in your programming language of choice that will compare A with other sequences one column at a time.
If you want a concrete suggestion, use pairwise alignment functionality in Biopython: compare A to all other sequences in pairwise fashion and print columns where the content is different between the two sequences.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
A: Tools to enumerate amino acid mutations per site?
Thank you very much. I need the information about the sequence headers name in addition to sequence position.
How can I modify that code?