Hello,
I am trying to learn biopython a little better and I am trying to write a python script that will take in some unaligned DNA sequences in fasta format from the command line, translate these to protein sequences, align these sequences with MUSCLE, and then write out the resulting protein alignment.
I run the script from the command line by typing python3 script.py input.fasta
Here is the script so far:
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
from sys import argv
from Bio.Align.Applications import MuscleCommandline
from Bio import AlignIO
import subprocess
import sys
def unique_seqs(sequences):
"""returns a list of SeqRecord objects with redundant sequences removed"""
unique_records = []
checksum_container = []
for seq in sequences:
checksum = seguid(seq.seq)
if checksum not in checksum_container:
checksum_container.append(checksum)
unique_records.append(seq)
return unique_records
def make_protein_record(nuc_record):
"""Returns a new SeqRecord with the translated sequence (default table)."""
return SeqRecord(seq=nuc_record.seq.translate(cds=False), id=nuc_record.id, description="translation using def table")
seqs = SeqIO.parse(argv[1], "fasta")
seqs = list(seqs)
prot = []
print("you input {} sequences".format(len(seqs)))
UNIQUE_Seqs = unique_seqs(seqs)
print("there are {} unique DNA sequences".format(len(UNIQUE_Seqs)))
for seq in UNIQUE_Seqs:
seq = make_protein_record(seq)
prot.append(seq)
protein_fasta_handle = "prot" + argv[1]
SeqIO.write(prot, protein_fasta_handle, "fasta")
muscle_cline = MuscleCommandline()
child = subprocess.Popen(str(muscle_cline),
stdin =subprocess.PIPE,
stdout =subprocess.PIPE,
stderr =subprocess.PIPE,
universal_newlines =True,
shell =(sys.platform != "win32"))
SeqIO.write(prot, child.stdin, "fasta")
child.stdin.close()
align_handle = "ALN" + protein_fasta_handle
align = AlignIO.read(child.stdout, "fasta")
AlignIO.write(align, align_handle, "fasta")
print(align)
This script works, however it only works when I give it a low number of starting sequences. I have successfully gotten it to work with a fasta containing 8 sequences, but when I tried to use 300 starting sequences the script freezes. I initially thought it was being non-responsive because the MUSCLE alignment was taking a long time. To test this I ran MUSCLE from outside of biopython and the alignment only took 20 seconds.
By inserting print statements at various places I learned that the problem arises on line 43:
align = AlignIO.read(child.stdout, "fasta")
I haven't seen any error messages or anything of that nature. Can anyone spot something I'm doing wrong here?
Thank you for your time.
Try adding parameters to the
MuscleCommandline()
function; remove the stdin parameter in thePopen()
function, after that add the line:stdout, stdout = child.communicate()
and remove the lineSeqIO.write(prot, child.stdin, "fasta")
which I don't really understand what are you doing with it because you are already saving the protein sequences in fasta format.Working directly with pipes like this (child.stdout etc) is much harder to debug. Try making temporary files instead.