Question

How To Obtain Accession IDs From GenBank .gbf Files

3

11.8 years ago

rosarylimyt ▴ 70

Can anyone please help me with handling GenBank .gbf files? I recently generated Sequin (.sqn) and GenBank (.gbf) files, but I don't know how to use them to obtain the accession IDs of the translated nucleotide sequences so that I can find out the names of the proteins identified. The .gbf files look something like this when opened in Notepad:

LOCUS       Scaffold1            1325603 bp    DNA     linear       14-FEB-2013
DEFINITION  No definition line found.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Unknown.
  ORGANISM  Unknown.
            Unclassified.
FEATURES             Location/Qualifiers
     source          1..1325603
                     /organism="unknown"
                     /mol_type="genomic DNA"
     gene            complement(<1..555)
                     /locus_tag="asmbl_1"
     CDS             complement(<1..555)
                     /locus_tag="asmbl_1"
                     /codon_start=1
                     /transl_table=11
                     /product="tat (twin-arginine translocation) pathway signal
                     sequence domain protein"
                     /translation="MKEFHSTLSRRDFMKSLGVVGAGLGTMSAAAPVFHDLDEVTSST
                     LGINKNPWWVKERDFKNPTVPIDWSKVTRQPGVFQGLPRPTVADFTKAGVVGGTSTDL
                     ETPEMALTLYDAMAKEFPGWTPGYAGMGDTRTTALCNASKFMMFGAWPGNMEMGGKRV
                     NVIGAIMAAGGSPTFTPWLGPQLDT"
...
...

Does anyone here know of a software tool I can use to make sense of these files and generate accession IDs for them?

Thank you in advance!

id genbank annotation protein • 4.6k views

ADD COMMENT • link updated 11.8 years ago by Istvan Albert 101k • written 11.8 years ago by rosarylimyt ▴ 70

0

Not clear: is this a new .gbf, generated by you for some new sequence data? If so, there will not be any accessions or IDs; those are assigned after submission to GenBank and curation.

ADD REPLY • link 11.8 years ago by Neilfws 49k

score 1 · Answer 1 · 2013-02-15

1

11.8 years ago

Istvan Albert 101k

As Neilfws points out, if this is your own GenBank file then there won't be any accession numbers. Check your file for fields such as db_xref (see below):

 gene            1..626
                 /gene="Hbb-b1"
                 /gene_synonym="AA409645; beta1; HBB1; Hbbt1; Hbbt2"
                 /note="hemoglobin, beta adult major chain"
                 /db_xref="GeneID:15129"
                 /db_xref="MGI:96021"

If you do have those you can extract them in various ways, but before we get there let's make sure you have them in the first place.
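One quick way to make that check is sketched below, using only Python's standard library. This is a rough illustration, not a full GenBank parser: the function names `find_db_xrefs` and `count_by_database` are hypothetical, and the regular expression assumes each db_xref qualifier sits on a single line in the usual `/db_xref="Database:identifier"` form.

```python
import re
from collections import Counter

# Assumption: each qualifier looks like /db_xref="GeneID:15129" on one line.
# The database name is everything before the first colon.
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def find_db_xrefs(genbank_text):
    """Return (database, identifier) pairs in the order they appear."""
    return DB_XREF.findall(genbank_text)


def count_by_database(genbank_text):
    """Count how many cross-references point at each database."""
    return Counter(database for database, _ in find_db_xrefs(genbank_text))
```

An empty result from `find_db_xrefs` would suggest that the file carries no database cross-references at all, which is what one would expect for a freshly generated, unsubmitted annotation.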

ADD COMMENT • link 11.8 years ago by Istvan Albert 101k

0

No, I do not have any db_xref fields in the file. This .gbf file was generated by CloVR-Search (annotation). Could you please advise how I should go about obtaining accession IDs from it, apart from submitting to GenBank? Is there a program I can use to parse it so that I can isolate only the 'translation=....' parts? Thank you so much!

ADD REPLY • link 11.8 years ago by rosarylimyt ▴ 70

1

You will need to parse the file with a programming language, for example BioPython. See this section on parsing GenBank files: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc36
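To illustrate what isolating the translation parts could look like, here is a minimal sketch that uses only Python's standard library rather than BioPython. The names `iter_cds` and `to_fasta` are hypothetical, and the code assumes the standard flat-file layout seen in the question: feature keys such as CDS indented by five spaces, qualifiers on deeper-indented lines starting with `/`. It ignores escaped quotes inside values and keeps only the last value when a qualifier is repeated, so treat it as a starting point rather than a robust parser.

```python
import re

# Feature keys (source, gene, CDS, ...) start after exactly five spaces.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+\S")
# Qualifiers look like /key=value, or a bare /key flag.
QUALIFIER = re.compile(r"^/(\w+)(?:=(.*))?$")


def _is_open(value):
    """True while a quoted value has not yet reached its closing quote."""
    return value.startswith('"') and (len(value) == 1 or not value.endswith('"'))


def _finish(quals):
    """Strip the surrounding double quotes from every qualifier value."""
    return {k: v.strip('"') for k, v in quals.items()}


def iter_cds(genbank_text):
    """Yield one dict of qualifiers per CDS feature in GenBank flat-file text."""
    in_features = False
    quals = None  # qualifiers of the CDS being read, None outside a CDS
    key = None    # qualifier that a continuation line would extend
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features or not line.strip():
            continue
        feature = FEATURE_LINE.match(line)
        if feature or not line[0].isspace():
            # A new feature, or a new top-level section such as ORIGIN.
            if quals is not None:
                yield _finish(quals)
            quals = {} if feature and feature.group(1) == "CDS" else None
            key = None
            in_features = bool(feature)
            continue
        if quals is None:
            continue
        text = line.strip()
        qualifier = QUALIFIER.match(text)
        if key is not None and _is_open(quals[key]):
            # Wrapped value: protein sequences join without spaces.
            quals[key] += ("" if key == "translation" else " ") + text
        elif qualifier:
            key = qualifier.group(1)
            quals[key] = qualifier.group(2) or ""
    if quals is not None:
        yield _finish(quals)


def to_fasta(cds_list):
    """Format the translated CDS entries as protein FASTA text."""
    out = []
    for i, q in enumerate(cds_list, start=1):
        if "translation" in q:
            name = q.get("locus_tag", f"cds_{i}")
            header = f">{name} {q.get('product', '')}".rstrip()
            out.append(f"{header}\n{q['translation']}\n")
    return "".join(out)
```

A typical use would be to read the whole .gbf file into a string, pass it to `iter_cds`, and write the result of `to_fasta` to a new file, giving one protein record per CDS labelled with its locus_tag and product.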

ADD REPLY • link 11.8 years ago by Istvan Albert 101k

0

Hi, I tried parsing one of the translated nucleotide scaffold .gbf files with BioPython, as advised:

>>> from Bio import SeqIO
>>> record = SeqIO.read('./rosary/dataset/clovr/Scaffold6.gbf', 'genbank')
>>> record
SeqRecord(seq=Seq('ATGGTGGGCCATCTTGGTCTCGAACCAAGGACCTCAGTCTTATCAGCTCCAACG...TGG', IUPACAmbiguousDNA()), id='', name='Scaffold6', description='No definition line found.', dbxrefs=[])

but it still wouldn't provide me with any sort of identification for that translated scaffold =[

ADD REPLY • link 11.8 years ago by rosarylimyt ▴ 70