Bulk Download of NCBI Gene "Summary" Field
14.4 years ago

I would like to download or build a mapping from Entrez Gene IDs to the text that appears in the "Summary" field of an Entrez Gene query for the H. sapiens record of that gene. These short paragraphs are often useful for getting a first idea of what an unfamiliar gene does. The obvious approach of scouring ftp.ncbi.nih.gov/gene/ and ftp://ftp.ncbi.nih.gov/refseq/ for an appropriate file (e.g. gene_info.gz) didn't turn anything up. Thanks for any suggestions.

gene ncbi • 17k views
14.4 years ago

Thanks for the link, Pierre. This 2.6 GB file is very verbose and structured for human consumption rather than ease of retrieval. I wrote a quick and dirty Python parser to pull out the summaries and am posting it so someone else doesn't have to do it too. Note that the accessions are not Entrez Gene IDs; you have to map those separately.

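# Parse the uncompressed RefSeqGene GenBank flat file and print one
# "accession<TAB>summary" line per record (Python 3).
locus2comment = {}
in_comment = False
with open('refseqgene.genomic.gbff') as f:
    for line in f:
        if line.startswith("LOCUS"):
            # New record: remember its accession and reset the comment buffer.
            locus = line.split()[1]
            comment = ""
        elif line.startswith("COMMENT"):
            in_comment = True
            # The text starts at column 13, after the keyword and padding.
            comment += line[12:].strip() + " "
        elif line.startswith("PRIMARY"):
            # PRIMARY follows the COMMENT block, so the comment is complete.
            in_comment = False
            # Keep the text after "Summary:"; fall back to the whole comment.
            parts = comment.split("Summary:", 1)
            locus2comment[locus] = parts[1] if len(parts) > 1 else comment
        elif in_comment:
            # Continuation lines are indented by 12 spaces.
            comment += line[12:].strip() + " "
for locus in sorted(locus2comment):
    print(locus + '\t' + locus2comment[locus])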

Thanks, David. In case anyone needs it, the NG_ IDs can be mapped using this file: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene
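
For anyone doing that mapping in Python, here is a minimal sketch. It assumes the file is tab-delimited with a header row containing columns named GeneID and RSG (the NG_ accession); please check those names against the real header, since I am writing them from memory. It also assumes the LOCUS name printed by the parser above carries no version suffix, so the version is stripped from the key. The sample rows are made up.

```python
import csv
import io

def load_ng_to_gene_id(handle):
    """Return {unversioned NG_ accession: Entrez Gene ID} from an LRG_RefSeqGene-style table."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    # Assumed column names; verify them against the header of the real file.
    rsg_col = header.index("RSG")
    gene_col = header.index("GeneID")
    mapping = {}
    for row in reader:
        if len(row) <= max(rsg_col, gene_col):
            continue  # skip blank or truncated lines
        accession = row[rsg_col].split(".")[0]  # drop the ".N" version suffix
        mapping[accession] = row[gene_col]
    return mapping

# Tiny made-up sample in the assumed layout (not real data).
sample = io.StringIO(
    "#tax_id\tGeneID\tSymbol\tRSG\n"
    "9606\t111\tGENE1\tNG_000001.1\n"
    "9606\t222\tGENE2\tNG_000002.3\n"
)
ng_to_gene = load_ng_to_gene_id(sample)
print(ng_to_gene["NG_000002"])
```

For the real file, pass an open file handle instead of the StringIO sample.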


UPDATE: as of 30 April 2011, there are three files at ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/


As of 2017 you would be better off using:

wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene.*.genomic.gbff.gz

;)
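
Since those files are gzipped and split into several parts, it may be convenient to read them without decompressing first. A rough sketch, assuming the parts have been downloaded into one directory; iter_gbff_lines is just a helper name I made up, and the demo files are tiny fakes so the snippet runs on its own.

```python
import glob
import gzip
import os
import tempfile

def iter_gbff_lines(pattern):
    """Yield lines from every gzipped file matching pattern, in sorted order."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt") as handle:
            for line in handle:
                yield line

# Self-contained demo with two tiny made-up files (not real RefSeqGene data).
demo_dir = tempfile.mkdtemp()
for n, accession in enumerate(["NG_000001", "NG_000002"], start=1):
    name = os.path.join(demo_dir, "refseqgene.%d.genomic.gbff.gz" % n)
    with gzip.open(name, "wt") as out:
        out.write("LOCUS       %s\n//\n" % accession)

pattern = os.path.join(demo_dir, "refseqgene.*.genomic.gbff.gz")
loci = [line.split()[1] for line in iter_gbff_lines(pattern) if line.startswith("LOCUS")]
print(loci)
```

The generator can be fed to a line-by-line parser in place of an open file.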

10.0 years ago
alirezahkb ▴ 30

The best way is to use NCBI's E-utilities (eutils). The esummary endpoint returns the summary for a given gene ID.

Here is a Python script that I wrote to get summaries for the gene IDs listed in an input file.

import json
import sys
import urllib.request

import pandas as pd

if __name__ == '__main__':
    gene_info_file = sys.argv[1]
    output_file = sys.argv[2]
    open(output_file, 'w').close()  # truncate any previous output
    # Gene IDs come from the first column; the first row is treated as a header.
    gene_ids = pd.unique(pd.read_csv(gene_info_file).iloc[:, 0])
    chunk_size = 100
    cn = (len(gene_ids) + chunk_size - 1) // chunk_size  # number of chunks, rounded up
    for i in range(cn):
        chunk_genes = gene_ids[chunk_size * i:chunk_size * (i + 1)]
        gids = ','.join(str(s) for s in chunk_genes)
        print(i + 1, '/', cn)
        # Note: no stray "&" between "id=" and the comma-separated IDs.
        url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=' + gids + '&retmode=json'
        print(url)
        data = json.load(urllib.request.urlopen(url))
        result = []
        for g in chunk_genes:
            # Empty string when the ID or its summary is missing from the response.
            result.append([g, data['result'].get(str(g), {}).get('summary', '')])
        # Append each chunk to the output; write the header only once.
        pd.DataFrame(result, columns=['gene_id', 'summary']).to_csv(output_file, index=False, mode='a', header=(i == 0))

Usage:

python eutilsGetSummary.py gene_ids.csv gene_summary.csv

gene_ids.csv is a csv file with the first column holding the gene_ids you want to get the summaries for.

gene_summary.csv is the output.
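
If you want to check the URL building and the response handling without hitting NCBI, the two steps can be separated as in the sketch below. The function names are my own, and the fake response only mimics the result/summary layout that the script above relies on; it is not real NCBI output.

```python
import json

def build_esummary_url(gene_ids):
    """Build an esummary URL for a batch of Entrez Gene IDs."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    return base + "?db=gene&id=" + ",".join(str(g) for g in gene_ids) + "&retmode=json"

def extract_summaries(payload, gene_ids):
    """Return {gene_id: summary}, with '' for IDs missing from the response."""
    result = payload.get("result", {})
    return {g: result.get(str(g), {}).get("summary", "") for g in gene_ids}

# Hand-written stand-in for a response, so the example runs without network access.
fake_response = json.loads('{"result": {"uids": ["1"], "1": {"summary": "example text"}}}')
url = build_esummary_url([1, 2])
summaries = extract_summaries(fake_response, [1, 2])
print(url)
print(summaries)
```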


The code as originally posted had a bug: there was an extra "&" before the gene IDs. The URL should be:

 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=' + gids + '&retmode=json';
14.4 years ago

You may also take a look at GeneRIF from NCBI. You can download GeneRIFs from the NCBI FTP site.
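
A possible way to group the downloaded GeneRIFs by gene is sketched below. It assumes a tab-delimited file with five columns (tax ID, gene ID, PubMed IDs, timestamp, GeneRIF text) and a header line starting with "#"; that layout is my recollection, so please verify it against the actual download. The rows in the sample are invented.

```python
import collections
import io

def load_generifs(handle, tax_id="9606"):
    """Group GeneRIF texts by gene ID for one taxon (human by default)."""
    rifs = collections.defaultdict(list)
    for line in handle:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: tax ID, gene ID, PubMed IDs, timestamp, text.
        if len(fields) < 5 or fields[0] != tax_id:
            continue
        rifs[fields[1]].append(fields[4])
    return dict(rifs)

# Made-up rows in the assumed layout (not real GeneRIF data).
sample = io.StringIO(
    "#Tax ID\tGene ID\tPubMed ID (PMID) list\tlast update timestamp\tGeneRIF text\n"
    "9606\t111\t12345\t2010-01-01 00:00\tfirst made-up statement\n"
    "9606\t111\t12346\t2010-01-02 00:00\tsecond made-up statement\n"
    "10090\t333\t12347\t2010-01-03 00:00\tmouse entry, filtered out\n"
)
rifs = load_generifs(sample)
print(rifs)
```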

8.7 years ago
govardhanks ▴ 20

(Not an option for bulk downloads, though.) I would suggest the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables). You may need to enter refseqgene under groups, input your gene's NM_ IDs or standard HGNC symbol, and choose "selected fields" as the output format. Try it with the given examples and you should be able to figure out how to get the desired output. Hope this helps.

7.5 years ago

I was pointed to this thread some time ago, so, inspired by govardhanks's answer, I used UCSC hgTables to download gene summaries indexed by RefSeq mRNA. There is a special table with summaries: hgFixed.refSeqSummary.

The gbff files (for Homo sapiens) parsed with the script from weslfield's comment gave 6 661 unique summaries. The UCSC table returned 26 140 unique summaries, although these included mouse genes too (and possibly others)*. After mapping the summaries to the subset of human mRNAs I am currently working with, I got 12 574 unique summaries, which nearly doubles the gbff parsing coverage.

Also, UCSC returns data for QPCT, REV1 and NEB (not sure about NICK10; I couldn't find such a gene), which Dave Curtis mentioned as missing from the gbff files.

Feel free to use my gist for UCSC table retrieval: ucsc_download.sh. Here is how to use it:

source ucsc_download.sh
get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip

Summary: as of 2017, please use the UCSC tables; they are more complete and easier to fetch and parse. They did a really good job of making those tables.

(*) I am not sure why, but I know that it is not script-specific - I got the same when using the web interface; any advice how to avoid this would be appreciated.
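
One way to restrict the table to the mRNAs of interest, as described above, is sketched here. It assumes the downloaded table is tab-delimited with a header containing mrnaAcc and summary columns; I believe those are the field names of hgFixed.refSeqSummary, but treat them as an assumption. The sample rows are made up.

```python
import csv
import io

def load_summaries(handle, wanted_accessions):
    """Return {mRNA accession: summary} restricted to wanted_accessions."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(reader)]
    acc_col = header.index("mrnaAcc")    # assumed column name
    text_col = header.index("summary")   # assumed column name
    wanted = set(wanted_accessions)
    return {row[acc_col]: row[text_col]
            for row in reader
            if len(row) > max(acc_col, text_col) and row[acc_col] in wanted}

# Made-up rows in the assumed layout (not real UCSC data).
sample = io.StringIO(
    "#mrnaAcc\tcompleteness\tsummary\n"
    "NM_000001\tComplete5End\tmade-up human summary\n"
    "NM_999999\tUnknown\tmade-up summary for another species\n"
)
human = load_summaries(sample, ["NM_000001"])
print(human)
```

For the real download, open summary.tsv.gz with gzip.open(path, "rt") and pass that handle instead of the sample.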


Hi Michal, thank you for your awesome work. I tried your script and ran it as you described, but something went wrong. Could you help me?

Below is how I ran it, along with the output:

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll 

total 4.0K
-rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh 

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % source ucsc_download.sh 

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip
--2017-08-21 20:54:51--  http://genome.ucsc.edu/cgi-bin/hgTables
Resolving genome.ucsc.edu... 128.114.119.134, 128.114.119.135, 128.114.119.133, ...
Connecting to genome.ucsc.edu|128.114.119.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'STDOUT'
-                                                               [  <=>                                                                                                                                       ]  40.56K   109KB/s   in 0.4s
2017-08-21 20:54:52 (109 KB/s) - written to stdout [41537]
hgsid=604153833_Jsj2HlaA2zgpZxEPKxAxKQ0XgqWL&jsh_pageVerPos=0&posiion:chr21=33031597-33041570&clade=mammal&org=Human&db=hg19&hga_group=genes&hga_rack=refGene&hga_able=hgFixed.refSeqSummary&hga_regionType=genome&hga_oupuType=primaryTable&boolshad.sendToGalaxy=0&boolshad.sendToGrea=0&boolshad.sendToGenomeSpace=0&hga_ouFileName=oupu&hga_compressType=gzip&hga_doTopSubmi=ge+oupu
--2017-08-21 20:54:52--  http://genome.ucsc.edu/cgi-bin/hgTables
Resolving genome.ucsc.edu... 128.114.119.135, 128.114.119.133, 128.114.119.136, ...
Connecting to genome.ucsc.edu|128.114.119.135|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'summary.tsv.gz'
summary.tsv.gz                                                  [  <=> ]  41.50K  85.1KB/s   in 0.5s
2017-08-21 20:54:54 (85.1 KB/s) - 'summary.tsv.gz' saved [42497]

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll 
total 48K
-rw-r--r-- 1 gongjing staff  42K Aug 21 20:54 summary.tsv.gz
-rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh

The output looks good. Where is the problem? Are you able to open summary.tsv.gz?


Hi Michal,

I cannot unzip the file, and its content seems to be HTML. The file is also suspiciously small, so I don't think I got the correct result.

Here is the file information:

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll
total 48K
-rw-r--r-- 1 gongjing staff  42K Aug 21 20:54 summary.tsv.gz
-rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % gunzip summary.tsv.gz
gunzip: summary.tsv.gz: not in gzip format

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % head summary.tsv.gz                                                                                                                                                     
http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>

<meta http-equiv="Content-Security-Policy" content="default-src *; script-src 'self' 'unsafe-inline' 'nonce-BccRJqnamHSWfPUAF1g3aUeqFp0u' code.jquery.com www.google-analytics.com www.samsarin.com/project/dagre-d3/latest/dagre-d3.js cdnjs.cloudflare.com/ajax/libs/d3/3.4.4/d3.min.js cdnjs.cloudflare.com/ajax/libs/jquery/1.12.1/jquery.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.2.1/jstree.min.js cdnjs.cloudflare.com/ajax/libs/bowser/1.6.1/bowser.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.3.4/jstree.min.js login.persona.org/include.js ajax.googleapis.com/ajax maxcdn.bootstrapcdn.com/bootstrap d3js.org/d3.v3.min.js cdn.datatables.net; style-src * 'unsafe-inline'; font-src * data:; img-src * data:;">

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=iso-8859-1">
<META http-equiv="Content-Script-Type" content="text/javascript">
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
<META HTTP-EQUIV="Expires" CONTENT="-1">

That is strange indeed. It works for me and retrieves a gzipped summary file of about 5.4 MB. I tested it just now on Ubuntu 17.04 with GNU Wget 1.18 and GNU sed 4.4.

Looking at your logs, I found oupu instead of output and posiion instead of position. My guess is that either your shell or your version of sed does not interpret '\t' as a tab character and instead strips out some letter t's.

Wild guess: https://stackoverflow.com/questions/2610115/sed-not-recognizing-t-instead-it-is-treating-it-as-t-why There are many solutions, depending on your environment (OS, sed version). Please try some and let me know if it helped. I would start with a literal tab or a double-escaped '\t'. Remember to source ucsc_download.sh again afterwards!

Here are my logs:

$ get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip
--2017-08-24 20:00:10--  http://genome.ucsc.edu/cgi-bin/hgTables
Resolving genome.ucsc.edu (genome.ucsc.edu)... 128.114.119.136, 128.114.119.135, 128.114.119.133, ...
Connecting to genome.ucsc.edu (genome.ucsc.edu)|128.114.119.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

-                                                        [  <=>                                                                                                                 ]  40,56K   110KB/s    in 0,4s    

2017-08-24 20:00:11 (110 KB/s) - written to stdout [41537]

hgsid=604746209_UcaSA7yJymRhoxJhcNr5WhGvCavS&jsh_pageVertPos=0&position:chr21=33031597-33041570&clade=mammal&org=Human&db=hg19&hgta_group=genes&hgta_track=refGene&hgta_table=hgFixed.refSeqSummary&hgta_regionType=genome&hgta_outputType=primaryTable&boolshad.sendToGalaxy=0&boolshad.sendToGreat=0&boolshad.sendToGenomeSpace=0&hgta_outFileName=output&hgta_compressType=gzip&hgta_doTopSubmit=get+output
--2017-08-24 20:00:11--  http://genome.ucsc.edu/cgi-bin/hgTables
Resolving genome.ucsc.edu (genome.ucsc.edu)... 128.114.119.136, 128.114.119.135, 128.114.119.133, ...
Connecting to genome.ucsc.edu (genome.ucsc.edu)|128.114.119.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘summary.tsv.gz’

summary.tsv.gz                                           [                              <=>                                                                                     ]   5,37M  1,14MB/s    in 6,2s    

2017-08-24 20:00:18 (882 KB/s) - ‘summary.tsv.gz’ saved [5635263]
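
As the two logs show, a failed request still saves a file under the .gz name, only with HTML inside. A quick sanity check in Python, assuming nothing beyond the standard gzip magic bytes (0x1f 0x8b); looks_like_gzip is a name I made up, and the two demo files are created on the fly.

```python
import gzip
import os
import tempfile

def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"

# Demo: one genuine gzip file and one HTML page saved under a .gz name.
demo_dir = tempfile.mkdtemp()
good = os.path.join(demo_dir, "good.tsv.gz")
bad = os.path.join(demo_dir, "bad.tsv.gz")
with gzip.open(good, "wt") as out:
    out.write("NM_000001\tmade-up summary\n")
with open(bad, "w") as out:
    out.write("<HTML><HEAD></HEAD></HTML>\n")

good_ok = looks_like_gzip(good)
bad_ok = looks_like_gzip(bad)
print(good_ok, bad_ok)
```
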
12.8 years ago

This seems to work fine, but many genes seem to be missing. The same applies to many of the lists that can be downloaded from the Human Genome Browser. Off the top of my head, examples are QPCT, REV1, NICK1 and NEB. I don't see anything particularly wrong with these genes, but they seem to be absent from the refSeqGene files. Can anybody explain why this is or, better still, point to a more comprehensive source for this information? Thanks.


We at SolveBio have also run into this problem while parsing these flat files. There are some odd exceptions and omissions in these flat files. For instance, sometimes records appear with no summary or there are duplicate records. There are even instances where the gene is just absent. Here is an expansion on the code written by David Quigley that tries to account for some of this.

import sys
import gzip

def run(filepath):
    # Maps accession -> "summary<TAB>gene symbol<TAB>Entrez Gene ID".
    locus2comment = {}
    # Text mode ('rt') so that lines are str rather than bytes in Python 3.
    with gzip.open(filepath, 'rt') as f:
        in_comment = False
        first_time_symbol = False
        first_time_entrez = False
        real_gene = False
        for line in f:
            if line[0:5] == "LOCUS":
                # New record: reset all per-record state.
                real_gene = False
                first_time_symbol = True
                first_time_entrez = True
                locus = line.split()[1]
                comment = ""
            elif line[0:7] == "COMMENT":
                in_comment = True
                comment += line[12:].strip() + " "
            elif line[0:7] == "PRIMARY":
                in_comment = False
                parts = comment.split("Summary:", 1)
                # Records without a summary are flagged and skipped on output.
                locus2comment[locus] = parts[1] if len(parts) > 1 else "Remove Me"
            elif line[0:9] == '     gene' and 'complement' not in line:
                real_gene = True
            elif line[0:27] == '                     /gene=':
                # Append the first gene symbol (guard against records with no PRIMARY line).
                if first_time_symbol and real_gene and locus in locus2comment:
                    locus2comment[locus] += \
                        '\t' + line.strip().split('/gene=\"')[1][:-1]
                    first_time_symbol = False
            elif '/db_xref="GeneID:' in line:
                # Append the first Entrez Gene ID.
                if first_time_entrez and real_gene and locus in locus2comment:
                    locus2comment[locus] += \
                        '\t' + line.strip().split('GeneID:')[1][:-1]
                    first_time_entrez = False
            elif in_comment:
                # Continuation lines are indented by 12 spaces.
                comment += line[12:].strip() + " "

    # Strip the trailing ".gbff.gz" (8 characters) and write a gzipped TSV.
    with gzip.open(filepath[:-8] + '.tsv.gz', 'wt') as outfile:
        for locus in sorted(locus2comment):
            # The flag may be followed by symbol and ID fields, so match the prefix.
            if locus2comment[locus].startswith("Remove Me"):
                continue
            outfile.write(locus + '\t' + locus2comment[locus] + '\n')

if __name__ == '__main__':
    run(sys.argv[1])

SolveBio parses and versions this dataset along with many others popular in bioinformatics with a full API for easy access. Check it out, you may find it saves you a lot of time. https://www.solvebio.com/library/RefSeqGene

12.8 years ago
J_G • 0

I agree with Dave: in the three refseqgene.genomic.gbff files there are only about 4700 genes, a handful of them without an Entrez ID. Entrez itself and GeneCards show a RefSeq summary for all the missing genes I tried, so something is wrong.

11.6 years ago
kukumayas • 0

Go to ftp://ftp.ncbi.nlm.nih.gov/refseq/release for the full release. Good luck!

11.4 years ago
aliz0611 • 0

I tried the above link and downloaded the entire vertebrate-mammalian refseq data set from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_mammalian/, which represents data from 500+ species. However, it appears that the same human genes are missing as from the three refseqgene.genomic.gbff files. Has anyone found out where exactly the more complete dataset is?

6.1 years ago
hsiaoyi0504 ▴ 70

I would like to offer another solution: you can process the raw data yourself. Take a look at my repo: https://github.com/hsiaoyi0504/gene_dictionary.
