Question

Parser to provide references to publications and ID numbers from GenBank

10.4 years ago

Alice ▴ 320

Hello, biostars!

I have a list of GenBank specimen vouchers. Does anyone know of a parser or script that retrieves the accession numbers and publication information for these sequences?

To make a table like:

Vouchers: VL03F635, WE02491
IDs: EF104615, AY557140, AY556740

References: Wiemers, 2003; Wiemers et al., 2010

It's not difficult to write a Python script for this, but the problem lies in the GenBank records themselves: years and authors appear in different fields, and I do not know how to process such data.

python sequencing • 1.9k views

updated 3.1 years ago by Ram 44k • written 10.4 years ago by Alice ▴ 320

Ram · Accepted Answer · 2014-06-30

Hi Alice,

I was at a meeting last week where we decided to work on exactly this! If you wait a week or two we should have a tool to do this automatically in R.

Until then, you should check out Biopython, and especially its GenBank parser:

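from Bio import SeqIO
from Bio import Entrez

# NCBI asks for a contact address with E-utilities requests; use your own.
Entrez.email = "you@example.com"

gb_ids = ["EF104615", "AY557140", "AY556740"]

# Fetch all records as GenBank flat files in a single request.
query = Entrez.efetch(db="nucleotide", id=gb_ids, rettype="gb", retmode="text")

for rec in SeqIO.parse(query, "genbank"):
    print(rec.id)
    # Not every record is guaranteed to carry references, hence .get().
    for paper in rec.annotations.get("references", []):
        # Skip the submission entry so only publications are printed.
        if paper.title != "Direct Submission":
            print("{0} ({1})".format(paper.authors, paper.journal))

query.close()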

Check out the Bio.SeqFeature.Reference docs to see what these objects represent.
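On the "years and authors are different fields" problem: in GenBank records the year usually sits in parentheses at the end of the journal line, and authors are written as "Surname,Initials". Here is a rough sketch of how the two could be combined into a short "Author, Year" label. The helper name short_citation and the sample values are made up for illustration, and it assumes the usual GenBank author and journal layout, so expect edge cases (consortium authors, unpublished entries) that it does not handle.

```python
import re
from types import SimpleNamespace


def short_citation(ref):
    """Return an 'Author, Year' label for an object with .authors and .journal.

    Hypothetical helper: assumes GenBank-style fields, i.e. authors such as
    "Doe,J., Roe,R. and Poe,E." and a journal string ending in "(YYYY)".
    """
    # Authors are separated by ", " and a final " and "; the comma inside
    # "Surname,Initials" has no space after it, so it is not split on.
    names = [n for n in re.split(r",\s+|\s+and\s+", ref.authors or "") if n]
    surnames = [n.split(",")[0].strip() for n in names]

    if not surnames:
        lead = "Anonymous"
    elif len(surnames) == 1:
        lead = surnames[0]
    elif len(surnames) == 2:
        lead = "{0} and {1}".format(surnames[0], surnames[1])
    else:
        lead = "{0} et al.".format(surnames[0])

    # Take the last four-digit number in parentheses as the publication year.
    years = re.findall(r"\((\d{4})\)", ref.journal or "")
    year = years[-1] if years else "n.d."

    return "{0}, {1}".format(lead, year)


# Stand-in for a Bio.SeqFeature.Reference, with invented values.
example = SimpleNamespace(
    authors="Doe,J., Roe,R. and Poe,E.",
    journal="J. Example Stud. 12 (3), 45-67 (2010)",
)
print(short_citation(example))  # Doe et al., 2010
```

In the loop above you could call short_citation(paper) in place of the format() line, and collect the results in a set per voucher to remove duplicates.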