Question

Bad Genbank format file from Vector NTI - Convert to FASTA

10.4 years ago

st.ph.n ★ 2.7k

I have several .gb files from Vector NTI that I need to convert to FASTA format. I figured it would be easy using Biopython. However, as we all know, there's always something.

Here's the first few lines of sequence from the .gb file:

ORIGIN
GTTGACATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATA
GCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGC
CCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAG

And here's the first few lines of sequence from a Googled example:

ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa

ORIGIN in the above "Googled example" block links here: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#OriginB

Here's the Python code to convert:

#!/usr/bin/env python

from Bio import SeqIO
import sys

inp = sys.argv[1]
out = inp + ".fasta"

input_handle = open(inp, "rU")
output_handle = open(out, "w")

sequences = SeqIO.parse(input_handle, "genbank")
count = SeqIO.write(sequences, output_handle, "fasta")

output_handle.close()
input_handle.close()

And here's the error:

Traceback (most recent call last):
  File "abi2fastq.py", line 3, in <module>
    from Bio import SeqIO
  File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/__init__.py", line 362, in <module>
    from . import InsdcIO  # EMBL and GenBank
  File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 37, in <module>
    from Bio.GenBank.Scanner import GenBankScanner, EmblScanner, _ImgtScanner
  File "/usr/lib64/python2.6/site-packages/Bio/GenBank/__init__.py", line 52, in <module>
    from .Scanner import GenBankScanner
  File "/usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.py", line 38
    different in layout to those produced by GenBank/DDBJ."""
                                                            ^
IndentationError: expected an indented block

Does anyone know where the indentation error is here? My .gb file doesn't have position numbers at the beginning of each sequence line, and the sequence isn't split into space-separated blocks either. Could both be the problem? I've seen other problems traced to the Scanner.py source code in Biopython, but I updated to the newest release and am now getting this error. I could just copy and paste the sequence into a new file with a header, but I have several files in several directories to convert, so this first one is just a test. Btw, all of the Vector NTI .gb files look the same.

All help is appreciated.

biopython vector_nti genbank fasta python • 4.9k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by st.ph.n ★ 2.7k

There is clearly a formatting problem. Did you use Vector NTI on Windows to produce your .gb files? If so, you just have to run dos2unix to fix the line endings (dos2unix -n winFile.gb unixFile.gb writes the converted copy to a new file).
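
If dos2unix is not installed, the same line-ending conversion can be sketched in Python. This is only a minimal sketch under the assumption that the problem is Windows (CRLF) line endings; the function name normalize_newlines and the file names in the usage comment are hypothetical.

```python
def normalize_newlines(src_path, dst_path):
    """Copy src_path to dst_path, converting CRLF and CR line endings to LF."""
    with open(src_path, "rb") as src:
        data = src.read()
    # Replace Windows (CRLF) endings first, then any remaining bare CR endings.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return data.count(b"\n")  # number of LF characters in the converted file


# Hypothetical usage:
# normalize_newlines("winFile.gb", "unixFile.gb")
```

Working on bytes avoids any dependence on how the interpreter translates newlines in text mode.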

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Juke34 9.0k

I did not produce the file, but it was most likely produced on Windows. I tried dos2unix and still got the same error as above. I've found forum threads from as far back as 2005 discussing this issue, but can't find a fix. I've come across some snippets of code for editing the Scanner.py source, but those edits seem to have already been incorporated into newer releases. So I'm not quite sure what the problem is here.

ADD REPLY • link 10.4 years ago by st.ph.n ★ 2.7k

File formats need to be followed. It is not surprising that a parser crashes when something is missing from the file. So it seems that things like the position numbers at the beginning of each sequence line must be present in GenBank format.

If a correctly formatted GenBank file works with your Python script, the problem comes from the files created by Vector NTI. I think the only way is to write a script that formats your files correctly.

Nevertheless ... it seems there is a problem in your code. It should be:

sequences = SeqIO.parse(input_handle, "genbank")

instead of

sequences = SeqIO.parse(input_handle, "fasta")

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Juke34 9.0k

That was a copy/paste error (I changed it for something else and forgot that portion). In my script it reads genbank. I'll see if I can write something to format it properly.

ADD REPLY • link 10.4 years ago by st.ph.n ★ 2.7k

I tested it on this file here (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and still got the same error. I'm not sure whether it's a file issue or a source code issue.

ADD REPLY • link 10.4 years ago by st.ph.n ★ 2.7k

Hi, I tested the file (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) with your code (above) and it worked perfectly well for me. I used Python 2.7 and Biopython 1.64. Which version of Biopython do you use?
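
A quick way to answer the version question on each machine is to print both versions from the interpreter that actually runs the script. This is a small sketch; report_versions is a hypothetical helper name, and it relies only on Biopython exposing its version as Bio.__version__.

```python
import sys


def report_versions():
    """Return (python_version, biopython_version), with None if Biopython is missing."""
    python_version = "%d.%d.%d" % sys.version_info[:3]
    try:
        import Bio  # Biopython's top-level package
        biopython_version = Bio.__version__
    except ImportError:
        biopython_version = None  # Biopython is not installed for this interpreter
    return python_version, biopython_version


print("Python %s, Biopython %s" % report_versions())
```

Running it with the same python executable named in the traceback shows which installation is being picked up.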

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Juke34 9.0k

Initially I was using a machine with python 2.6 and biopython 1.64. It did not work with the test file. When I used another machine with python 2.7.6 and biopython 1.64, it worked on the test file. However, neither worked with the Vector NTI file. The solution I came up with is below, since all I was interested in was creating a FASTA format file with the sequence, and I needed to do it for a lot of directories, containing several .gb files each.
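
Collecting the .gb files across many directories can be sketched with os.walk, so that a per-file converter can then be applied to each path. This is an illustrative sketch only; find_gb_files and the "vector_nti_exports" directory in the usage comment are hypothetical names.

```python
import os


def find_gb_files(root):
    """Yield the paths of all *.gb files under root, including subdirectories."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Match the extension case-insensitively (.gb, .GB).
            if name.lower().endswith(".gb"):
                yield os.path.join(dirpath, name)


# Hypothetical usage:
# for path in find_gb_files("vector_nti_exports"):
#     print(path)
```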

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by st.ph.n ★ 2.7k

As shown, that is in no way a GenBank formatted file, and it is not surprising that asking Biopython to parse it as a GenBank file fails. It doesn't even have a LOCUS line at the start, and as you noted, the sequence is not laid out in GenBank style either. It looks more like a FASTA file missing the special ">" character.
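
The two layout points above (a LOCUS line at the start, numbered sequence lines after ORIGIN) can be checked roughly before handing a file to a parser. This is a rough sketch, not the full GenBank specification and not what Biopython itself does; genbank_layout_problems is a hypothetical helper.

```python
def genbank_layout_problems(lines):
    """Return a list of layout problems found in the given lines (empty if none)."""
    problems = []
    stripped = [ln.rstrip("\r\n") for ln in lines]
    non_blank = [ln for ln in stripped if ln.strip()]
    if not non_blank or not non_blank[0].startswith("LOCUS"):
        problems.append("first line does not start with LOCUS")
    in_origin = False
    for number, ln in enumerate(stripped, 1):
        if ln.startswith("ORIGIN"):
            in_origin = True
        elif ln.startswith("//"):
            in_origin = False
        elif in_origin and ln.strip():
            # GenBank sequence lines begin with a position number.
            if not ln.split()[0].isdigit():
                problems.append(
                    "line %d: sequence line lacks a leading position number" % number
                )
                break  # one report is enough
    return problems
```

Passing it the result of readlines() on a file gives a quick list of what differs from the sample record layout.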

ADD REPLY • link 10.3 years ago by Peter 6.0k

There are 'locus' and 'features' sections which I did not show. There seemed to be spacing errors in those sections as well. The file is named *.gb. It is not supposed to be FASTA format.

ADD REPLY • link 10.3 years ago by st.ph.n ★ 2.7k

Next time at least show the first line and a ... indicating where you removed large chunks. Can you share the entire file, e.g. via gist.github.com or similar? We have in the past tweaked Biopython's GenBank parser to accept some broken GenBank files (with warnings), but further changes would depend on just how broken your Vector NTI output is.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Peter 6.0k

Ram · Accepted Answer · 2014-08-26

As Juke34 said, it's definitely a formatting issue. I tried to reformat the GenBank file with some Python code, which worked, but there were still some spacing issues that Biopython didn't like. Also, my 'Locus' header was too long. So I wrote this bit of code to simply pull the sequence out and write it to another file. This is much simpler than producing a 'normal' GenBank file, where there are position numbers at the beginning of each line and the bases are split into blocks of 10 bp, with only 60 per line. I was able to write code to produce such a GenBank file, but again Biopython was cranky; if anyone wants me to post that code, I can. For now, here's the quick solution I came up with to simply pull the sequence and put in a suitable identifier as the header.

#!/usr/bin/env python

import os
import sys

inp = sys.argv[1]  # python gbtofa.py <input_file>
# Remove the file extension, keeping the prefix to name the .fasta file.
# os.path.splitext also copes with dots elsewhere in the path (e.g. ./dir/file.gb).
prefix = os.path.splitext(inp)[0]

seqs = []
alllines = []

with open(inp, "r") as f:  # open input
    copy = False
    for line in f:
        alllines.append(line.strip())
        if line.strip() == "ORIGIN":  # Start collecting after ORIGIN
            copy = True
        elif line.strip() == "//":  # Stop at the '//' terminator
            copy = False
        elif copy:
            seqs.append(line.strip())

locus = alllines[0].split()  # Use portion of LOCUS line as header for .fasta
seq_id = locus[1].split('_')[1]  # Assumes the LOCUS name contains an underscore

# write() works on both Python 2 and 3, unlike "print >> out"
with open(prefix + ".fasta", "w") as out:
    out.write('>' + str(seq_id) + '\n')  # write header to new file
    for s in seqs:
        out.write(s + '\n')  # write sequence lines to new file
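
The numbered layout described above (a position number at the start of each line, 10-base blocks, 60 bases per line) could be produced roughly as follows. This is a sketch, not the reformatting code referred to in the answer; format_origin is a hypothetical helper, and the 9-character number field is taken from the NCBI sample record quoted in the question.

```python
def format_origin(sequence, line_width=60, block_width=10):
    """Return GenBank-style ORIGIN lines for a raw nucleotide string."""
    # Drop any whitespace or newlines and use lower case, as in the sample record.
    sequence = "".join(sequence.split()).lower()
    lines = ["ORIGIN"]
    for start in range(0, len(sequence), line_width):
        chunk = sequence[start:start + line_width]
        blocks = [chunk[i:i + block_width] for i in range(0, len(chunk), block_width)]
        # The position number is 1-based and right-aligned in a 9-character field.
        lines.append("%9d %s" % (start + 1, " ".join(blocks)))
    lines.append("//")  # record terminator
    return "\n".join(lines)
```

This only covers the sequence block; the LOCUS and FEATURES sections would still need their own column layout fixed before a strict parser accepts the file.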