How To Extract Title And /Isolation_Source From A List Of Genbank Accession Numbers
2
0
Entering edit mode
11.4 years ago

Mighty Bioinformaticians,

I have a text file with roughly 350 GenBank accession numbers (each representing a specific 16S sequence), and I want to extract the title of the corresponding paper ("TITLE" in the GenBank record) and the isolation source (the "/isolation_source" qualifier under "FEATURES") for each accession number.

Input example:

HM789874
JN225528

Desired output:

TITLE, /isolation_source of HM789874
TITLE, /isolation_source of JN225528

Now I have searched previous questions and found all sorts of useful information on how I might solve this with BioPerl or Biopython modules, but as I am a biologist and not yet a bioinformatician, I find it a bit overwhelming and can't quite compile all the potentially useful tricks and tips into a working script or command. FYI, I am familiar with Perl/BioPerl (beginner), but I have yet to take my first steps in Python.

Any script or help would be greatly appreciated,

Sam

genbank bioperl biopython • 7.9k views
ADD COMMENT
1
Entering edit mode

A shell one-liner:

wget -q -O temporaryfile "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JN225528&rettype=gb&retmode=text" ; sed 's/^ *\//%/;s/ */ /;s/TITLE/%TITLE/;s/JOURNAL/%J/' temporaryfile | tr -d "\n"| tr "%" "\n" | grep -e "isolation_source" -e TITLE

OUTPUT:

TITLE     Direct Submission 
isolation_source="granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level"
ADD REPLY
2
Entering edit mode
11.4 years ago

Use the following XSLT stylesheet:
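
(The stylesheet's XML got mangled when the site was migrated, see the replies below. As a stopgap, here is a reconstruction sketch that produces the same tab-separated accession / title / isolation_source output from NCBI's GBSeq XML; it is not necessarily identical to the original gist.)

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.1">
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:apply-templates select="/GBSet/GBSeq"/>
</xsl:template>

<!-- one output line per reference: accession, title, isolation_source -->
<xsl:template match="GBSeq">
<xsl:variable name="gbseq" select="."/>
<xsl:for-each select="$gbseq/GBSeq_references/GBReference">
<xsl:value-of select="$gbseq/GBSeq_locus"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="GBReference_title"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="$gbseq/GBSeq_feature-table/GBFeature[GBFeature_key='source']/GBFeature_quals/GBQualifier[GBQualifier_name='isolation_source']/GBQualifier_value"/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>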

example:

$ echo  -e "HM789874\nJN225528" | while read ACN ; do xsltproc stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=${ACN}" ; done

HM789874    Biogeography and habitat modelling of high-alpine bacteria    Green Lakes Valley Talus
HM789874    Direct Submission    Green Lakes Valley Talus
JN225528    Direct Submission    granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level
ADD COMMENT
0
Entering edit mode

This is amazing, works perfectly!

ADD REPLY
0
Entering edit mode

Dear Pierre, Dear Bioinformaticians,

Do you know how to edit the xsltproc example above so that it reads the GenBank accession numbers from an input text file, instead of typing them all in manually like "HM789874\nJN225528"?

ADD REPLY
1
Entering edit mode
cat your.file | while read ...
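
For instance (a sketch, assuming one accession per line in your.file):

cat your.file | while read ACN ; do xsltproc stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=${ACN}" ; done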
ADD REPLY
0
Entering edit mode

Thank you. Unfortunately, it seems the xslt stylesheet doesn't work anymore 5 years later. It produces the following error:

cannot parse stylesheet.xsl stylesheet.xsl:2: namespace error : xmlns:xsl: '

ADD REPLY
1
Entering edit mode

It's because the 'new' Biostars engine ate my XML code. I've updated my post with a 'gist'.

ADD REPLY
0
Entering edit mode

Thank you. Unfortunately something changed at NCBI:

warning: failed to load external entity "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=AF184218"

Edit: this problem persists after I removed the Windows carriage returns.

ADD REPLY
1
Entering edit mode

Since this post, 5 years ago, NCBI has moved from HTTP to HTTPS. Try this:

$ echo  -e "HM789874\nJN225528" | awk '{printf("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=%s\n",$0);}' | while read URL ; do wget -O - -q  ${URL} | xsltproc --novalid stylesheet.xsl - ; done

HM789874    Biogeography and habitat modelling of high-alpine bacteria    Green Lakes Valley Talus
HM789874    Direct Submission    Green Lakes Valley Talus
JN225528    Direct Submission    granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level
ADD REPLY
1
Entering edit mode

Furthermore, NCBI now doesn't allow more than 'x' requests per minute, so you might have to add a 'sleep' to the command line.
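
For example (a sketch; the accessions.txt file name is made up, and NCBI currently allows roughly 3 requests per second without an API key):

cat accessions.txt | awk '{printf("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=%s\n",$0);}' | while read URL ; do
  wget -O - -q ${URL} | xsltproc --novalid stylesheet.xsl -
  sleep 1   # stay well under NCBI's request rate limit
done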

ADD REPLY
0
Entering edit mode

The above command works great. I am trying to extract PubMed ID (PMID) information for ~500 accession numbers that I downloaded from the RefSeq database. My goal is to identify which accession numbers are published and associated with a PubMed record. The 500 accessions include both finished (complete) and unfinished (contig) genomes. Here is the modified XSLT stylesheet:


<xsl:stylesheet xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform" version="1.1">
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:apply-templates select="/GBSet/GBSeq"/>
</xsl:template>

<xsl:template match="GBSeq">
<xsl:variable name="gbseq" select="."/>
<xsl:for-each select="$gbseq/GBSeq_references/GBReference/GBReference_pubmed">
<xsl:variable name="pubmed" select="./text()"/>
<xsl:value-of select="$gbseq/GBSeq_locus"/>
<xsl:text>    </xsl:text>
<xsl:value-of select="$pubmed"/>
<xsl:text>    </xsl:text>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

When I supply the list of only FINISHED GENOMES, I get the PMID for all the accession numbers without any error. However, when I supply a list with both FINISHED and UNFINISHED genomes, it gives an error after working fine for a few samples.

Example file for only finished genomes, one accession per line

CP007676
CP007690
CP009361
CP009423
CP009554
CP009681
CP009828
CP010295

Example file with both finished and unfinished genomes, one accession per line

CP007676
CP007690
CP009361
CP009423
CP009554
CP009681
CP009828
CP010295
LALV00000000.1
LALA00000000.1
LALB00000000.1
ASZO00000000.1
AUPV00000000.1
AUPT00000000.1

Here is my command:

cat finishedUnfinished.txt | awk '{printf("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&retmode=xml&id=%s\n",$0);}' | while read URL ; do wget -O - -q  ${URL} | xsltproc --novalid stylesheet_PMID.xsl - ; done

Here is the output:

CP007676 26048971
CP007690 24970829
CP009361 25377701
CP009681 27152133
CP010295 25767217
LALV01000000 26048971
LALA01000000 26048971
LALB01000000 26048971
-:1: parser error : Document is empty

^ unable to parse - -:1: parser error : Document is empty

Any help will be highly appreciated.

Tauqeer

ADD REPLY
0
Entering edit mode

And it works!! Thanks!!!

ADD REPLY
1
Entering edit mode
11.4 years ago
Leszek 4.2k

For Python lovers (just adding a few lines to Fetching genbank entries for list of accession numbers):

#!/usr/bin/env python
"""Fetch GenBank entries for given accessions and report publication title and isolation source. 

USAGE:
  python acc2title.epost.py A22237 A22239 A32021 A32022 A33397 > out.tsv
or
  cat ids | python acc2title.epost.py > out.tsv

DEPENDENCIES:
Biopython
"""

import sys
from Bio import Entrez, SeqIO

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"
batchSize    = 100
retmax       = 10**9

#load accessions from arguments
if sys.argv[1:]:
  accs = sys.argv[1:]
else: #load accessions from stdin
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#first get the GIs for the query accessions
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
query  = " ".join(accs)
handle = Entrez.esearch( db=db,term=query,retmax=retmax )
giList = Entrez.read(handle)['IdList']
sys.stderr.write( "Found %s GI: %s\n" % (len(giList), ", ".join(giList[:10])))
#post NCBI query
search_handle     = Entrez.epost(db=db, id=",".join(giList))
search_results    = Entrez.read(search_handle)
webenv,query_key  = search_results["WebEnv"], search_results["QueryKey"] 
#print header
sys.stdout.write("#accesion\ttitle\tisolation_source\n")
#fetch all results in batches of batchSize entries at a time
for start in range( 0,len(giList),batchSize ):
  sys.stderr.write( " %9i" % (start+1,))
  #fetch entries in batch
  handle = Entrez.efetch(db=db, rettype="gb", retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
  #parse gb entries
  for r in SeqIO.parse(handle, 'gb'):
      title = isolation_source = ""
      #get title
      if 'references' in r.annotations:
          ref   = r.annotations['references'][0]
          title = ref.title
      #get source features
      source = r.features[0]
      #get isolation_source
      if 'isolation_source' in source.qualifiers:
          isolation_source = source.qualifiers['isolation_source'][0]
      #print output to stdout
      sys.stdout.write("%s\t%s\t%s\n" % (r.name, title, isolation_source))
ADD COMMENT
0
Entering edit mode

Thank you. However, this does not seem to work for large numbers of GenBank accessions (77,000). I receive the following error message:

urllib2.HTTPError: HTTP Error 414: Request-URI Too Long
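
One possible workaround (an untested sketch that reuses accs, db and retmax from the script above): the script builds a single esearch URL containing all 77,000 accessions, which is what overflows here, so split the esearch into chunks and collect the GIs before the epost step, e.g. replace the esearch/giList lines with:

#run esearch in chunks so no single query URL gets too long
chunkSize = 200
giList    = []
for i in range(0, len(accs), chunkSize):
  handle  = Entrez.esearch(db=db, term=" ".join(accs[i:i+chunkSize]), retmax=retmax)
  giList += Entrez.read(handle)['IdList']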

ADD REPLY
