I have several .gb files from Vector NTI that I need to convert to FASTA format. I figured it would be easy using Biopython. However, as we all know, there's always something.
Here's the first few lines of sequence from the .gb file:
ORIGIN
GTTGACATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATA
GCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGC
CCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAG
And here's the first few lines of sequence from a Googled example:
ORIGIN
1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
ORIGIN in the above "Googled example" block links here: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#OriginB
Here's the python code to convert:
#!/usr/bin/env python
from Bio import SeqIO
import sys
inp = sys.argv[1]
out = inp + ".fasta"
input_handle = open(inp, "rU")
output_handle = open(out, "w")
sequences = SeqIO.parse(input_handle, "genbank")
count = SeqIO.write(sequences, output_handle, "fasta")
output_handle.close()
input_handle.close()
And here's the error:
Traceback (most recent call last):
File "abi2fastq.py", line 3, in <module>
from Bio import SeqIO
File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/__init__.py", line 362, in <module>
from . import InsdcIO # EMBL and GenBank
File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 37, in <module>
from Bio.GenBank.Scanner import GenBankScanner, EmblScanner, _ImgtScanner
File "/usr/lib64/python2.6/site-packages/Bio/GenBank/__init__.py", line 52, in <module>
from .Scanner import GenBankScanner
File "/usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.py", line 38
different in layout to those produced by GenBank/DDBJ."""
^
IndentationError: expected an indented block
Does anyone know where the indentation error is here? My .gb file doesn't have position numbers at the beginning of each sequence line, and the sequence isn't split into spaced blocks either. Could both be the problem? I've found that other problems were with the Scanner.py source code from Biopython, but I updated to the newest release and am now getting this error. I could just copy and paste the sequence into a new file with a header, but I have several files in several directories to convert, so this first one is just a test. Btw, all of the Vector NTI .gb files look the same.
All help is appreciated.
There is clearly a formatting problem. Did you use Vector NTI on a Windows platform to produce your .gb files? In that case you just have to use the dos2unix command:
dos2unix winFile.gb unixFile.gb
to format the file properly.
I did not produce the file, however it most likely was produced on Windows. I tried dos2unix, and still got the same error as above. I've found forums from as far back as 2005 discussing this issue, but can't find a fix. I've come across some snippets of code to edit the Scanner.py source code, but it seems those edits have already been included in newer releases. So, I'm not quite sure what the problem is here.
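For anyone without dos2unix at hand, the same line-ending conversion can be sketched in a few lines of Python. This is only an illustration of the CRLF to LF step; the function names `to_unix_newlines` and `convert_file` are made up for this example, and it does nothing about the GenBank layout itself.

```python
def to_unix_newlines(data):
    """Return the given bytes with Windows CRLF line endings replaced by LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path, dst_path):
    """Read src_path as raw bytes and write an LF-only copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_newlines(data))
```

Calling `convert_file("winFile.gb", "unixFile.gb")` would leave the original untouched and write the converted copy next to it.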
File formats should be followed. It is not surprising that a parser crashes when something is missing from the file. So it seems that things like the position numbers at the beginning of each sequence line must be present in GenBank format.
If a correctly formatted GenBank file works with your Python script, it means that the problem comes from the files created by Vector NTI. I think the only way is to write a script that formats your files correctly.
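As a rough idea of what such a reformatting script might do for the sequence part, here is a sketch that lays raw sequence out in the style of the Googled ORIGIN example above: lowercase, 60 bases per line in blocks of 10, with a right-aligned start position. The function name `format_origin` is hypothetical, and this only covers the ORIGIN block; a file that is also missing or mangling its LOCUS and FEATURES sections would need more work.

```python
def format_origin(sequence, line_width=60, block=10):
    """Format a raw sequence string as a GenBank-style ORIGIN block."""
    # Drop any whitespace or line breaks and use lowercase, as in the NCBI sample.
    seq = "".join(sequence.split()).lower()
    lines = ["ORIGIN"]
    for start in range(0, len(seq), line_width):
        chunk = seq[start:start + line_width]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # Start position (1-based) right-aligned in 9 columns, then the blocks.
        lines.append("%9d %s" % (start + 1, " ".join(blocks)))
    lines.append("//")  # record terminator
    return "\n".join(lines) + "\n"


print(format_origin("GTTGACATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCAT"))
```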
Nevertheless ... it seems there is a problem in your code. It should be:
instead of
That was a copy/paste error (I changed it for something else, and forgot that portion). In my script it reads genbank. I'll see if I can write something to format it properly.
I tested it on this file here (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), and still got the same error. I'm not sure whether it's a file issue or a source code issue.
Hi, I tested the file (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) with your code (above) and it worked perfectly well for me. I used Python 2.7 and Biopython 1.64. Which version of Biopython do you use?
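A quick way to answer the version question on each machine is something like the sketch below. The helper name `describe_environment` is made up for this example; `Bio.__version__` is the attribute Biopython uses to report its version. The import is wrapped so that a failure while loading the package is reported as text instead of stopping the script.

```python
import sys


def describe_environment():
    """Report the Python version and, if it can be imported, the Biopython version."""
    info = {"python": sys.version.split()[0]}
    try:
        import Bio
        info["biopython"] = getattr(Bio, "__version__", "unknown")
    except Exception as exc:  # ImportError, or a syntax problem inside the package
        info["biopython"] = "import failed: %r" % (exc,)
    return info


print(describe_environment())
```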
Initially I was using a machine with python 2.6 and biopython 1.64. It did not work with the test file. When I used another machine with python 2.7.6 and biopython 1.64, it worked on the test file. However, neither worked with the Vector NTI file. The solution I came up with is below, since all I was interested in was creating a FASTA format file with the sequence, and I needed to do it for a lot of directories, containing several .gb files each.
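A minimal sketch of that kind of workaround (not necessarily the exact script referred to above) could look like the following. It assumes each Vector NTI file has an ORIGIN line followed by sequence lines, optionally ending with //, and it skips Biopython entirely. The function names and the choice of the file name as the FASTA header are assumptions made for this example.

```python
import os


def read_origin_sequence(path):
    """Return the sequence found after the ORIGIN line, up to '//' or end of file."""
    parts = []
    in_origin = False
    with open(path) as handle:
        for line in handle:
            stripped = line.strip()
            if not in_origin:
                if stripped.upper().startswith("ORIGIN"):
                    in_origin = True
                continue
            if stripped.startswith("//"):
                break
            # Keep letters only, so position numbers and spaces are dropped if present.
            parts.append("".join(ch for ch in stripped if ch.isalpha()))
    return "".join(parts)


def write_fasta(name, sequence, out_path, width=60):
    """Write a single-record FASTA file with the sequence wrapped at `width`."""
    with open(out_path, "w") as out:
        out.write(">%s\n" % name)
        for i in range(0, len(sequence), width):
            out.write(sequence[i:i + width] + "\n")


def convert_tree(root):
    """Convert every .gb file under root; return how many files were written."""
    converted = 0
    for dirpath, _dirs, files in os.walk(root):
        for filename in files:
            if filename.lower().endswith(".gb"):
                path = os.path.join(dirpath, filename)
                sequence = read_origin_sequence(path)
                if sequence:
                    name = os.path.splitext(filename)[0]
                    write_fasta(name, sequence, path + ".fasta")
                    converted += 1
    return converted
```

Running `convert_tree(".")` from the top-level directory would write a `.gb.fasta` file next to each input, matching the naming in the original script.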
As shown, that is in no way a GenBank formatted file, and it is not surprising that asking Biopython to parse it as a GenBank file fails. It doesn't even have a LOCUS line at the start, and as you noted, the sequence is not laid out in GenBank style either. It looks more like a FASTA file missing the special ">" character.
There are 'locus' and 'features' sections which I did not show. There seemed to be spacing errors in those sections as well. The file is named *.gb. It is not supposed to be FASTA format.
Next time at least show the first line and a ... indicating where you removed large chunks. Can you share the entire file, e.g. via gist.github.com or similar? We have in the past tweaked Biopython's GenBank parser to accept some broken GenBank files (with warnings), but further changes would depend on just how broken your Vector NTI output is.