Hi,
Does anyone know of a fast parser for GenBank files that contain hundreds of entries (e.g., all vertebrate_mammalian proteins from RefSeq)?
I've tried R's genbankr's readGenBank function and biofile's gbRecord function, and both are very slow and inadequate for GenBank files with a size of 100M.
My purpose is simply to parse, for each protein, its transcript accession, gene accession, taxonomy ID, and all of its conserved domain IDs (CDDs).
genbankr does have a faster parsing function, parseGenBank, but it simply collects all features in an array, from which it does not seem possible to map them back to their respective proteins.
There is probably a cool EntrezDirect answer for this, but for now you should look around on the RefSeq Functional Elements page to see whether you can download a file that gets you partway to what you need.