Automated Literature Search For List Of Genes
5
3
11.5 years ago
Davy ▴ 410

I have a smallish list of genes that I need to do some literature searching on. There are about 80 of them, so searching individually for each one and all of its aliases would be far too time-consuming. Does anyone know of a tool I could use to search PubMed (or another database) with a list of genes, other than building a very long query by hand?

literature genetics • 6.4k views
8
11.5 years ago
Tky ★ 1.0k

I modified a script from the E-utilities documentation years ago; it does not use BioPerl.

Please check the code and modify it to fit your needs :-)

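# 2010/11/29
# pubfetch.pl
# Code modified from the Entrez Programming Utilities (E-utilities) examples
# https://eutils.ncbi.nlm.nih.gov/
# Usage: perl pubfetch.pl
use 5.010;
use strict;
use warnings;
use LWP::Simple;    # HTTPS URLs also need LWP::Protocol::https installed
use URI::Escape;    # uri_escape() makes the keyword safe to put in a URL

print "Please Enter The Keyword for Fetch: ";    # ask for keyword to search
my $keyword = <STDIN>;
chomp $keyword;

my $utils  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";
my $db     = "pubmed";
my $report = "abstract";
my $query  = uri_escape($keyword);

open my $count_fh, '>', "$keyword.count.txt"
    or die "Cannot write $keyword.count.txt: $!";
open my $result_fh, '>:encoding(UTF-8)', "$keyword.result.txt"
    or die "Cannot write $keyword.result.txt: $!";

# Run one search per year, from 2000 to 2011
for my $year (2000 .. 2011) {
    # datetype=pdat (an addition to the original script) filters on publication date
    my $esearch = "$utils/esearch.fcgi?"
        . "db=$db&retmax=1&usehistory=y&datetype=pdat&mindate=$year&maxdate=$year&term=";
    my $output = get($esearch . $query);
    unless (defined $output) {
        warn "esearch failed for year $year\n";
        next;
    }

    # WebEnv and QueryKey identify the result set stored on the history server
    my ($web)   = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
    my ($key)   = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
    my ($count) = $output =~ /<Count>(\d+)<\/Count>/;
    $count //= 0;

    print "The total number of publications for $keyword in year $year is $count\n";
    print $count_fh "$year $count\n";

    if ($count > 0) {
        # Fetch the abstracts for this year's result set
        my $efetch = "$utils/efetch.fcgi?"
            . "rettype=$report&retmode=text&retstart=0&retmax=10000&"
            . "db=$db&query_key=$key&WebEnv=$web";
        my $efetch_result = get($efetch);
        print $result_fh $efetch_result if defined $efetch_result;
    }
    sleep 1;    # stay under NCBI's request-rate limit
}
close $count_fh;
close $result_fh;

# Collect the PMIDs from the fetched abstracts
my $idcount = 0;
open my $id_fh, '<', "$keyword.result.txt"
    or die "Cannot read $keyword.result.txt: $!";
open my $idresult_fh, '>', "$keyword.IDlist.txt"
    or die "Cannot write $keyword.IDlist.txt: $!";

while (<$id_fh>) {
    if (/^PMID:\s*(\d+)/) {
        print $idresult_fh "$1\n";
        $idcount++;
    }
}
close $id_fh;
close $idresult_fh;

print "The total number of publications fetched is $idcount\n";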
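
For the original question (a list of about 80 genes plus their aliases), the keyword could be built per gene instead of typed in. The following is only a sketch in Python, not part of the script above: the `gene_term` and `esearch_url` helpers, the `genes` dictionary and the choice of the `[Title/Abstract]` field tag are all assumptions made for illustration, and the alias list is a placeholder.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_term(symbol, aliases=()):
    """OR together a gene symbol and its aliases, each quoted and
    restricted to title/abstract (an assumed search strategy)."""
    names = [symbol, *aliases]
    return " OR ".join(f'"{name}"[Title/Abstract]' for name in names)


def esearch_url(term, retmax=20):
    """Build an ESearch URL; urlencode escapes the term for us."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


# Hypothetical input: gene symbol -> list of aliases (illustrative only)
genes = {"NOTCH2": [], "PRKCB1": ["PRKCB"]}

# One URL per gene, ready to be fetched with any HTTP client
urls = {symbol: esearch_url(gene_term(symbol, aliases))
        for symbol, aliases in genes.items()}
```

Each URL in `urls` can then be fetched one gene at a time, which keeps the per-gene hit counts separate instead of merging everything into one very long query.
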
0

Wow. Thanks so much! I think I should be able to modify this to do what I want.

8
11.5 years ago

a one-liner :-)

$ echo -e "NOTCH2\nPRKCB1" | while read G; do curl -s "https://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=${G}" | xsltproc <(echo "<x:stylesheet xmlns:x='http://www.w3.org/1999/XSL/Transform' version='1.0'><x:output method='text'/><x:template match='/'>https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&amp;retmode=text&amp;rettype=abstract&amp;id=<x:for-each select='eSearchResult/IdList/Id'><x:value-of select='.'/>,</x:for-each>
</x:template></x:stylesheet>") -  | xargs curl -s ; done

1. J Immunol. 2013 Jun 5. [Epub ahead of print]

Intrinsic Molecular Factors Cause Aberrant Expansion of the Splenic Marginal Zone
B Cell Population in Nonobese Diabetic Mice.

Stolp J, Mariño E, Batten M, Sierro F, Cox SL, Grey ST, Silveira PA.

Garvan Institute of Medical Research, Immunology Program, Darlinghurst, New South
Wales 2010, Australia.
1

I've grouped all the Ids into a single curl+efetch call. That won't work when esearch returns a large number of Ids (big retmax). But you could generate one URL per Id:

....  | xsltproc <(echo "<x:stylesheet xmlns:x='http://www.w3.org/1999/XSL/Transform' version='1.0'><x:output method='text'/><x:template match='/'><x:for-each select='eSearchResult/IdList/Id'>https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&amp;retmode=text&amp;rettype=abstract&amp;id=<x:value-of select='.'/>
</x:for-each></x:template></x:stylesheet>") -  | while read U; do curl -s "${U}" ; done
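
A middle ground between one call for all Ids and one call per Id is to fetch them in batches. This is a rough Python sketch of that idea; the `chunks` and `efetch_urls` helpers and the batch size of 200 are assumptions for illustration, not values taken from the NCBI documentation.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_urls(pmids, batch_size=200):
    """Return one EFetch URL per batch of PubMed Ids."""
    urls = []
    for batch in chunks(list(pmids), batch_size):
        params = {
            "db": "pubmed",
            "retmode": "text",
            "rettype": "abstract",
            "id": ",".join(batch),
        }
        # safe="," keeps the comma-separated Id list readable in the URL
        urls.append(f"{EFETCH}?{urlencode(params, safe=',')}")
    return urls
```

For example, `efetch_urls(["1", "2", "3"], batch_size=2)` gives two URLs, the first ending in `id=1,2` and the second in `id=3`.
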
0

Can the literature search code work for other identifiers as well, e.g. rs IDs or protein IDs?

0

For rs IDs, use NCBI ELink. For protein IDs, yes, but as with gene names, beware of ambiguities.
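
As a sketch of the ELink route for rs IDs (in Python, and under the assumption that linking from the `snp` database to `pubmed` with the numeric part of the rs ID is what is wanted here), the request URL could be built like this. The `snp_to_pubmed_url` helper is hypothetical.

```python
from urllib.parse import urlencode

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def snp_to_pubmed_url(rs_id):
    """Build an ELink URL from a dbSNP rs ID to linked PubMed records.

    The leading "rs" is dropped because the numeric part is used as the Id.
    """
    number = rs_id[2:] if rs_id.lower().startswith("rs") else rs_id
    if not number.isdigit():
        raise ValueError(f"not an rs ID: {rs_id!r}")
    params = {"dbfrom": "snp", "db": "pubmed", "id": number}
    return f"{ELINK}?{urlencode(params)}"
```

The XML that comes back lists the linked PubMed Ids, which can then be passed to efetch in the same way as the Ids returned by esearch.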

0

As usual, Pierre to the rescue!!!

0

Is it possible to display more than 20 items? PubMed lets you see 200 items per page.

0

Yes, see the NCBI documentation for the esearch retmax parameter. See also my first comment.
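
To make that concrete, here is a small Python sketch of asking for pages of 200 Ids with `retmax` and `retstart`, and of reading the Ids out of the esearch XML. The helper names are hypothetical and the sample XML uses made-up Ids; it only illustrates the shape of the response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_page_url(term, retstart=0, retmax=200):
    """Build an ESearch URL for one page of results."""
    params = {"db": "pubmed", "term": term,
              "retstart": retstart, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


def parse_esearch(xml_text):
    """Return (total hit count, list of Ids on this page)."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("IdList/Id")]
    return count, ids


# Made-up response, trimmed to the elements used above
sample = ("<eSearchResult><Count>3</Count><RetMax>2</RetMax>"
          "<RetStart>0</RetStart><IdList><Id>111</Id><Id>222</Id></IdList>"
          "</eSearchResult>")
total, page_ids = parse_esearch(sample)
```

When `total` is larger than the number of Ids on the page, the next page is requested by increasing `retstart` by `retmax`.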

6
11.5 years ago

option 1: the Publication track in UCSC

The UCSC browser recently added a new track, called Publications, containing literature related to a gene. You can therefore use the UCSC APIs to get all the references for a gene. For example, the following URL returns all the references for the gene "CD97":

http://genome.ucsc.edu/cgi-bin/hgc?hgsid=338677393&c=chr19&o=14491955&t=14519537&g=pubsMarkerGene&i=CD97

I guess that you can also connect to the MySQL server, but I am not 100% sure that the "articleId" field corresponds to the PubMed ID:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select * from hgFixed.pubsMarkerAnnot where markerId="CD97" limit 10'

# select only the Ids (less verbose output)
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select distinct articleId, markerId from hgFixed.pubsMarkerAnnot where markerId="CD97" limit 10'

option 2: getting citations from Uniprot

UniProt has some well-curated citations for genes. You can get all the references for a list of genes by using the "Retrieve" tool from the UniProt main page, and then parsing the RDF file.

option 3: use the eutils, but from another tool

If you do not want to spend time learning the BioPerl (or Biopython) APIs to eutils, you can try this Taverna workflow.

2
11.5 years ago
cts ★ 1.7k

BioPerl has this sort of functionality. I've never used it to query PubMed, but the following website contains snippets to help you on your way: http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#Simple_database_query

I think what you're looking for is the esearch or efetch utilities.

0

Thanks but I was hoping to avoid having to use the BioPerl EUtilities. :( I really should get around to familiarising myself with them but I abandoned Perl a long time ago.

1

Are you also averse to the other "bio" packages? I believe that Biopython and BioRuby have similar functionality (although I've never used them).

0
11.0 years ago

You can also try BioGyan (http://www.biogyan.com/). It is a comprehensive search tool designed specifically for biologists that supports searching, annotating and ranking scientific literature from public databases. It accepts multiple genes.
