Question

Mal-formed sequence line error in Bio.SeqIO

0

Entering edit mode

5.4 years ago

stueckmann.daniel • 0

Hello,

I'm trying to use BioPython's SeqIO parser to load a ~3GB Genbank file with ~20,000 whole bacterial genomes and write .fasta files containing one sequence each. My code is working fine for the first ~2,000 genomes, but then it hits an invalid sequence line and terminates. The initial error message showed up as a BiopythonParserWarning, but after using the following code

from Bio import BiopythonParserWarning
warnings.simplefilter('ignore', BiopythonParserWarning)

it now results in a generic ValueError: Sequence line mal-formed 'title>NCBI/ffsrv11 - WWW Error 500 Diagnostic</title>' I can't find any info about this in the BioPython documentation, and I was wondering if there was any way to handle this error without storing the entire file in a list (takes a lot of RAM and my data sets will soon make this completely nonviable).

Here is the entirety of the code I am using:

from Bio import SeqIO
import warnings
from Bio import BiopythonParserWarning
warnings.simplefilter('ignore', BiopythonParserWarning)

#input file (genbank), change path
records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
#output folder(change path)
output_folder = r'C:\Users\\aaa\aaa\folder\\'

organism_dict = {}
for record in records:
    try:
        #create a file name in case we need to create a fasta file for this sequence, uses the SeqFeature qualifiers "organism" and "strain"
        doc_name = str(record.features[0].qualifiers["organism"][0]) + ' strain ' + str(record.features[0].qualifiers["strain"][0]) + '.fasta'
        #remove characters that are invalid for file names and/or newick strings, replace with spaces
        doc_name = doc_name.replace('(', ' ').replace(')', ' ').replace('/', ' ').replace('\\', ' ').replace(',', ' ')
        #use this doc name to see if this strain exists in our dictionary
        if doc_name in organism_dict:
            #check to see if the current sequence for this strain is shorter than our new sequence 
            if len(record.seq) > organism_dict[doc_name]:
                organism_dict[doc_name] = len(record.seq)
                with open(output_folder + doc_name, 'w') as handle:
                    SeqIO.write(record, handle, 'fasta')
        else:
            organism_dict[doc_name] = len(record.seq)
            with open(output_folder + doc_name, 'w') as handle:
                    SeqIO.write(record, handle, 'fasta')
    except:
        printrecord.id, record.features[0].qualifiers["organism"])

Thank you, Daniel

Python BioPython Genbank • 1.4k views

ADD COMMENT • link 5.4 years ago by stueckmann.daniel • 0

0

Entering edit mode

Do you know exactly which record it is failing on? Can you slice a small portion of the GenBank file and load it somewhere for testing?

ADD REPLY • link 5.1 years ago by vkkodali_ncbi ★ 3.8k