Andrew, why on earth are you asking this when it's 00:01 here? It's time for me to sleep! :-) However, here is how I would do it:
- For each gene, fetch the XML definition of the gene using NCBI ESearch and NCBI EFetch.
- Find all the publications associated with that entry by looking for the `<pubmedid>` tags.
- Download each PubMed record as XML and extract the name of every author of the paper along with their affiliation (it often contains the e-mail).
- Get the name of the most frequent author (this is the most difficult part, because names can be ambiguous).
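The steps above could be sketched roughly as follows in Python. This is only a minimal sketch under stated assumptions: the helper names (`esearch_url`, `efetch_url`, `pmids_from_gene_xml`, `authors_from_pubmed_xml`, `most_frequent_author`) are made up for illustration, the PubMed ids are assumed to sit in `pubmedid` elements (matched case-insensitively), and the PubMed records are assumed to expose `Author/LastName` and `Author/ForeName`. Authors are counted as exact (last name, first name) pairs, so it does nothing to resolve ambiguous names.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term):
    """URL of an ESearch query, e.g. db='gene', term='PRNP'."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def efetch_url(db, ids):
    """URL of an EFetch request returning the given records as XML."""
    params = {"db": db, "id": ",".join(ids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


def fetch(url):
    """Download a URL and return its body as text (network access needed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def pmids_from_gene_xml(xml_text):
    """Collect the distinct PubMed ids found in a gene record."""
    root = ET.fromstring(xml_text)
    pmids = {
        el.text.strip()
        for el in root.iter()
        # Assumption: ids are in elements named 'pubmedid', in any letter case.
        if el.tag.lower() == "pubmedid" and el.text and el.text.strip()
    }
    return sorted(pmids, key=int)


def authors_from_pubmed_xml(xml_text):
    """Return one (last name, first name) pair per author per article."""
    root = ET.fromstring(xml_text)
    authors = []
    for article in root.iter("PubmedArticle"):
        for author in article.iter("Author"):
            last = author.findtext("LastName")
            fore = author.findtext("ForeName") or ""
            if last:  # skip collective/group authors without a last name
                authors.append((last, fore))
    return authors


def most_frequent_author(authors):
    """Return ((last, first), count) for the most frequent author, or None."""
    counts = Counter(authors)
    return counts.most_common(1)[0] if counts else None
```

A run for one gene would chain these: `fetch(esearch_url("gene", ...))` to get the gene id, `fetch(efetch_url("gene", [gene_id]))` to get the gene record, then `fetch(efetch_url("pubmed", pmids))` for the articles, ideally in batches when there are hundreds of them.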
Note: In 2007, I collected the names and e-mails of some bioinformaticians by scanning PubMed with Java. See my post.
UPDATE: OK, I've quickly written a program that does the job. It is available as a Gist at:
Here is an excerpt from the output for 3 genes: ZC3H7B, EIF4G1 and PRNP.
<?xml version="1.0" encoding="UTF-8"?>
<experts>
<gene name="ZC3H7B" geneId="23264" count-pmids="13">
<Person>
<firstName>Sumio</firstName>
<lastName>Sugano</lastName>
<pmid>8125298</pmid>
<pmid>9373149</pmid>
<pmid>14702039</pmid>
<affilitation>International and Interdisciplinary Studies, The University of Tokyo, Japan.</affilitation>
<affilitation>Institute of Medical Science, University of Tokyo, Japan.</affilitation>
<affilitation>Helix Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan.</affilitation>
</Person>
</gene>
<gene name="eif4G1" geneId="1981" count-pmids="106">
<Person>
<firstName>Nahum</firstName>
<lastName>Sonenberg</lastName>
<pmid>7651417</pmid>
<pmid>7935836</pmid>
<pmid>8449919</pmid>
(...)
<affilitation>Department of Biochemistry and McGill Cancer Center, McGill University, Montreal, H3G 1Y6, Quebec, Canada.</affilitation>
<affilitation>Department of Biochemistry, McGill University, Montreal, Quebec, Canada.</affilitation>
<affilitation>Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA.</affilitation>
(...)
</Person>
</gene>
<gene name="PRNP" geneId="5621" count-pmids="429">
<Person>
<firstName>John</firstName>
<lastName>Collinge</lastName>
<pmid>1352724</pmid>
<pmid>1677164</pmid>
<pmid>2159587</pmid>
<pmid>20583301</pmid>
(...)
<mail>j.collinge@ic.ac.uk</mail>
<affilitation>Krebs Institute for Biomolecular Research, Department of Molecular Biology and Biotechnology, University of Sheffield, Sheffield S10 2TN, UK.</affilitation>
<affilitation>MRC Prion Unit and Department of Neurogenetics, Imperial College School of Medicine at St. Mary's, London, United Kingdom. J.Collinge@ic.ac.uk</affilitation>
<affilitation>Division of Neuroscience (Neurophysiology), Medical School, University of Birmingham, Edgbaston, Birmingham, UK. sratte@pitt.edu</affilitation>
(...)
</Person>
</gene>
</experts>
In the case of ZC3H7B, the result is wrong. Dr Sugano (3 articles) just used this gene as part of a set of other genes. The expert would be D. Poncet, my former thesis advisor, but he has only 2 articles about this protein.
EIF4G1: I know that Dr Sonenberg is the expert. His e-mail wasn't found.
PRNP: Collinge seems to be the expert. His e-mail was found too.
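As the PRNP output shows, an affiliation can carry the address of a different author (`sratte@pitt.edu` ends up under Collinge). One possible way to pick the right e-mail is sketched below in Python. It is an assumption on my side rather than a description of what the program does: the function name `emails_for_author` is made up, and the rule "keep an address only if its local part contains the author's last name" is a simple heuristic that will miss addresses built from initials or nicknames.

```python
import re

# Deliberately simple e-mail pattern; good enough for affiliation strings.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def emails_for_author(affiliations, last_name):
    """Return lower-cased e-mails whose local part contains the last name."""
    wanted = last_name.lower()
    found = []
    for text in affiliations:
        for match in EMAIL_RE.findall(text):
            email = match.lower()
            local_part = email.split("@")[0]
            if wanted in local_part and email not in found:
                found.append(email)
    return found


affiliations = [
    "MRC Prion Unit and Department of Neurogenetics, Imperial College School "
    "of Medicine at St. Mary's, London, United Kingdom. J.Collinge@ic.ac.uk",
    "Division of Neuroscience (Neurophysiology), Medical School, University "
    "of Birmingham, Edgbaston, Birmingham, UK. sratte@pitt.edu",
]
print(emails_for_author(affiliations, "Collinge"))
# → ['j.collinge@ic.ac.uk']
```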
Update: the code is at https://gist.github.com/lindenb/740496
Actually, journal impact factor has nothing to do with the importance of individual articles. Common misconception :-) See http://altmetrics.org/manifesto/.
@Larry, most (all?) genes in GeneWiki are human genes.
A serious issue that emerges is gene synonyms. One may need to consider species as well because the same gene in different organisms will function differently. So, Pierre's step 2 needs some refinement but nonetheless gets my vote.
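For the species part, one option is to restrict the ESearch query to a single organism. Below is a minimal Python sketch, assuming the Entrez Gene field tags `[sym]` (gene symbol) and `[orgn]` (organism); the function name `gene_search_url` and the default organism are illustrative choices. It does not address synonyms, which would need a broader search field or a separate lookup.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(symbol, organism="Homo sapiens"):
    """Build an ESearch URL for one gene symbol limited to one organism."""
    # Assumption: [sym] and [orgn] are the Gene database field tags for
    # the gene symbol and the organism name.
    term = f'{symbol}[sym] AND "{organism}"[orgn]'
    return ESEARCH + "?" + urlencode({"db": "gene", "term": term})


print(gene_search_url("PRNP"))
```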
Pierre -- Science doesn't sleep, so neither should you... ;)
Nice Q/A. You could also take into account the journal impact factor. This way the authors would additionally be ranked by the "quality" of their work. Also, care should be taken that recipients do not mark the mail as spam.
As usual, I'm thoroughly impressed. I wonder how difficult it would be to adapt this to Google App Engine so we could call it as a web service. (On my first attempt to run locally, I get an error that is likely due to my complete Java ignorance...) Any GAE experts in the audience?
I don't think that Google App Engine would be the best place to run this service: there is a lot of I/O and it could be slow (for example, for PRNP, 429 articles were downloaded).
Bummer, was wondering if that would be an issue (as it is with my pubmed2wordle app). Anyway, thanks!
Amazing answer!!! Refining with Gene synonyms, Journal impact factor, Article views or downloads, Grants obtained by author (if any), works on similar genes, publication of invited reviews....