Hello,
I'm trying to use BioPython's SeqIO parser to load a ~3GB Genbank file with ~20,000 whole bacterial genomes and write .fasta files containing one sequence each. My code is working fine for the first ~2,000 genomes, but then it hits an invalid sequence line and terminates. The initial error message showed up as a BiopythonParserWarning, but after using the following code
from Bio import BiopythonParserWarning
warnings.simplefilter('ignore', BiopythonParserWarning)
it now results in a generic ValueError: Sequence line mal-formed 'title>NCBI/ffsrv11 - WWW Error 500 Diagnostic</title>' I can't find any info about this in the BioPython documentation, and I was wondering if there was any way to handle this error without storing the entire file in a list (takes a lot of RAM and my data sets will soon make this completely nonviable).
Here is the entirety of the code I am using:
from Bio import SeqIO
import warnings
from Bio import BiopythonParserWarning
warnings.simplefilter('ignore', BiopythonParserWarning)
#input file (genbank), change path
records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
#output folder(change path)
output_folder = r'C:\Users\\aaa\aaa\folder\\'
organism_dict = {}
for record in records:
try:
#create a file name in case we need to create a fasta file for this sequence, uses the SeqFeature qualifiers "organism" and "strain"
doc_name = str(record.features[0].qualifiers["organism"][0]) + ' strain ' + str(record.features[0].qualifiers["strain"][0]) + '.fasta'
#remove characters that are invalid for file names and/or newick strings, replace with spaces
doc_name = doc_name.replace('(', ' ').replace(')', ' ').replace('/', ' ').replace('\\', ' ').replace(',', ' ')
#use this doc name to see if this strain exists in our dictionary
if doc_name in organism_dict:
#check to see if the current sequence for this strain is shorter than our new sequence
if len(record.seq) > organism_dict[doc_name]:
organism_dict[doc_name] = len(record.seq)
with open(output_folder + doc_name, 'w') as handle:
SeqIO.write(record, handle, 'fasta')
else:
organism_dict[doc_name] = len(record.seq)
with open(output_folder + doc_name, 'w') as handle:
SeqIO.write(record, handle, 'fasta')
except:
printrecord.id, record.features[0].qualifiers["organism"])
Thank you, Daniel
Do you know exactly which record it is failing on? Can you slice a small portion of the GenBank file and load it somewhere for testing?