Question

Fast parsing genbank files

0

Entering edit mode

3.7 years ago

rubic ▴ 270

Hi,

Does anyone know of a fast parser for genbank files that contains hundres of entries (e.g., all vertebrate_mammlian proteins from refseq)? Ive tried R's genbankr's readGenBank function and biofile's gbRecord function and both are very slow and insufficient for genbank files of a size of 100M.

My purpose is simply to parse for each protein it's transcript accession, gene accession, taxonomy ID, and all its conserved domain IDs (CDDs).

genbankr does have a faster parsing function: parseGenBank but it simply contains all features in an array from which it does not seem possible to map them back to their respective proteins.

Genbank parsing R • 2.5k views

ADD COMMENT • link updated 3.7 years ago by Istvan Albert 102k • written 3.7 years ago by rubic ▴ 270

0

Entering edit mode

There is probably a cool EntrezDirect answer for this but for now you should look around on the RefSeq Functional Elements page to see if you may be able to download an interesting file that can get you partway to what you need.

ADD REPLY • link 3.7 years ago by GenoMax 148k

score 2 · Accepted Answer · 2021-04-07

2

Entering edit mode

3.7 years ago

Istvan Albert 102k

The biopython GenBank parser is as good as I have seen in high-level programming languages. It can parse the human chromosome 1 in GenBank format in about 15 seconds.

The file clocks in at about 350 MB and has almost 5 million lines.

$ wc -l NC_000001.gb
4937618 NC_000001.gb

and

-rw-r--r-- 1 ialbert ialbert 347M Apr  7 19:52 NC_000001.gb

What I would recommend is to parse the GenBank into a JSON file as I do below with the bio package, once you do that the resulting JSON file can be loaded on demand into a full data structure in less than 5 seconds!

time bio view NC_000001.gb --json > out.json

real    0m18.480s

for details on how that conversion takes place see this:

https://github.com/ialbert/bio/blob/master/biorun/models/jsonrec.py#L503

ADD COMMENT • link 3.7 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks a lot @Istvan Albert.

I installed the bio package (following this documentation) but then running

bio view <my_file.gb> --json > <my_out>.json

I get:

bio: making bioinformatics fun again

Valid commands:

   bio data    : list or rename data
   bio fetch   : downloads data from repositories
   bio align   : performs sequence alignments
   bio taxon   : displays NCBI taxonomies
   bio define  : explains biological terms
   bio convert : converts data to different formats
   bio runinfo : prints sequencing run information

So it seems that the view option was not included in the installation.

Any idea?

ADD REPLY • link 3.7 years ago by rubic ▴ 270

0

Entering edit mode

I guess it should be convert instead of view according to the documentation:

bio convert <my_file.gb> --json > <my_out>.json

But when I run this command on one of the protein genbank files (vertebrate_mammalian.10.protein.gpff.gz - gunzipped of course), I get this error:

*** file format not recognized: vertebrate_mammalian.10.protein.gpff

And the same happens for vertebrate_mammalian.10.genomic.gpff

ADD REPLY • link 3.7 years ago by rubic ▴ 270

1

Entering edit mode

at this time the tool recognizes the format by file extension, the extension should be gb.gz, the conversion takes about 30 seconds and you can run it on the gziped file directly:

time bio convert vertebrate_mammalian.10.protein.gb.gz --json > out.json

real    0m32.184s

PS I guess I forgot to push the latest update (view/convert are the same commands). Maybe I should keep calling it convert. I keep going back and forth.

ADD REPLY • link 3.7 years ago by Istvan Albert 102k