I used to generate FASTA out of my GenBank source files using a simple conversion script:
#!/usr/bin/env python
import sys

from Bio import SeqIO


def wrap(text, width=80):
    """Yield successive chunks of text, each at most width characters long."""
    for i in range(0, len(text), width):
        yield text[i:i + width]


if __name__ == "__main__":
    for record in SeqIO.parse(sys.stdin, "genbank"):
        # Not every record carries a GI number; fall back to None.
        gi = record.annotations.get("gi")
        accession = record.id
        desc = record.description
        seq = record.seq
        locus = record.name
        # FASTA header line, followed by the sequence in 80-column blocks.
        print(">gi|%s|emb|%s|%s| %s" % (gi, accession, locus, desc))
        for block in wrap(seq):
            print(block)
When I switched to newer versions of the sequence files, some of the sequences in the resulting FASTA output consisted entirely of Ns. On closer inspection of the GenBank source files, it turns out that the new versions replace the ORIGIN block
ORIGIN
sequence...
with a CONTIG block, something like
CONTIG join(BX640437.1:1..347356,BX640438.1:51..347786,...)
It looks like GenBank is now using cross-references to build sequence entries. In my understanding this would require a multi-pass scan of the file. Is there a way to resolve this using BioPython?
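As a first step toward resolving the cross-references, the CONTIG expression can be split into the accessions and coordinate ranges it names. The sketch below is only that: `parse_contig` and `SEGMENT` are made-up names, it assumes segments of the plain `ACCESSION.VERSION:start..end` form shown above, and it does not interpret `complement(...)` or `gap(...)` elements. The input string uses just the two segments visible in the example line.

```python
import re

# One simple segment such as "BX640437.1:1..347356" (assumed form;
# complement(...) and gap(...) elements are not interpreted here).
SEGMENT = re.compile(r"([A-Za-z_]+\d+(?:\.\d+)?):(\d+)\.\.(\d+)")


def parse_contig(contig):
    """Return a list of (accession, start, end) tuples found in a CONTIG string."""
    return [(acc, int(start), int(end))
            for acc, start, end in SEGMENT.findall(contig)]


if __name__ == "__main__":
    example = "join(BX640437.1:1..347356,BX640438.1:51..347786)"
    for accession, start, end in parse_contig(example):
        print("%s\t%d\t%d" % (accession, start, end))
```

Each tuple names a referenced record and a 1-based, inclusive range within it, so the full sequence could in principle be rebuilt by looking up each accession and concatenating the slices in order.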
Data source: Since the sequences are from the RefSeq collection, I assume they should be self-contained. This problem occurs with the current RefSeq release 47 that you can download via FTP.
I was working with BioPython 1.52 and 1.57 (latest).
Thanks for your suggestions.
I checked with NCBI, and they told me that this is intended: you are supposed to download both the FASTA and the GenBank files.
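Following that advice, one way to keep the old workflow is to take the sequence from the companion FASTA file and everything else from the GenBank file. This is a rough sketch under several assumptions: all three function names are hypothetical, the FASTA IDs either look like `gi|<number>|<db>|<accession>|` or are a bare accession, and the GenBank `record.id` matches that accession including its version suffix. The ID used in the docstring is invented for illustration.

```python
def accession_from_fasta_id(fasta_id):
    """Extract the accession from an ID such as 'gi|12345|ref|NC_000000.1|'.

    IDs that do not follow the assumed gi|...| layout are returned unchanged.
    """
    fields = fasta_id.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    return fasta_id


def load_fasta_by_accession(fasta_path):
    """Read a FASTA file into a dict keyed by accession."""
    # Imported here so the ID helper above can be used without Biopython.
    from Bio import SeqIO
    return SeqIO.to_dict(
        SeqIO.parse(fasta_path, "fasta"),
        key_function=lambda rec: accession_from_fasta_id(rec.id),
    )


def genbank_with_sequences(genbank_path, fasta_path):
    """Yield (record, sequence) pairs, preferring the companion FASTA sequence."""
    from Bio import SeqIO
    sequences = load_fasta_by_accession(fasta_path)
    for record in SeqIO.parse(genbank_path, "genbank"):
        companion = sequences.get(record.id)
        # Fall back to the GenBank sequence when no FASTA entry matches.
        yield record, (companion.seq if companion is not None else record.seq)
```

The pairs returned by `genbank_with_sequences` could be fed to the header and line-wrapping logic of the original script in place of `record.seq`. For very large FASTA files, loading everything into a dictionary may use too much memory, in which case an on-disk index would be worth considering.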