Dear all,
I have a list of about 1,000 human genes. I'd like to quantify their 'popularity' (i.e. to what extent these genes have been studied). I feel that counting the number of publications related to each gene could serve as a proxy for its popularity.
How would you approach this programmatically?
Thanks in advance,
Miquel
While I agree with Pierre's answer as a quick way of seeing which genes might be important, gene2pubmed seems to be purely co-citation based. You can estimate the importance or "hubness" of genes by counting their interactions, but gene2pubmed-derived "interactions", while freely available, are not very reliable.
This paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2641004/), for example, claims: "Only 28% of the co-occurred pairs in PubMed abstracts appeared in any of the commonly used human PPI databases (HPRD, BioGRID and BIND). On the other hand, of the known PPIs in HPRD, 69% showed co-occurrences in the literature, and 65% shared GO terms." Other papers have looked at publication networks and shown them to be of some use in certain cases, but with caveats.
If you read through some of the PubMed publications included in such co-citation lists, you'll find that the co-mentions are not always about the two genes or proteins interacting. There is also an issue with synonyms, since some genes have several slightly different names (though if you only look at HUGO gene symbols, this shouldn't be an issue?), among other problems.
So you might want to look at a meta-database of manually curated, non-citation-based protein-protein (and therefore gene-gene) interaction lists, such as BioGRID: http://thebiogrid.org/
As part of a course some time back, some others and I had to look at protein-protein interactions from such a meta-database, one that included the then-current BioGRID interactions plus some other smaller databases (specifically this one: http://cbg.garvan.unsw.edu.au/pina/interactome.stat.do). We found UBC (Ubiquitin C) to have by far the most interactions of any gene, followed by SUMO2 and then p53.
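
For what it's worth, here is a minimal Python sketch of the two counting ideas in this thread: publications per gene from gene2pubmed-style rows, and number of distinct interaction partners per gene. It assumes the gene2pubmed layout of three tab-separated columns (tax_id, GeneID, PubMed_ID) with human as tax_id 9606, so your gene list would need to be Entrez Gene IDs rather than symbols. The function names `publication_counts` and `interaction_degree` are my own, the interaction input is assumed to be plain (gene, gene) pairs that you have already extracted from a database export, and the toy rows below are made up for illustration only.

```python
from collections import defaultdict


def publication_counts(lines, gene_ids, tax_id="9606"):
    """Count distinct PubMed IDs per gene from gene2pubmed-style rows.

    Assumes tab-separated columns: tax_id, GeneID, PubMed_ID.
    Genes with no matching rows are reported with a count of 0.
    """
    wanted = {str(g) for g in gene_ids}
    pmids = defaultdict(set)
    for line in lines:
        # Skip the header/comment line and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        tax, gene, pmid = fields[:3]
        if tax == tax_id and gene in wanted:
            pmids[gene].add(pmid)  # a set ignores duplicated rows
    return {g: len(pmids.get(g, ())) for g in wanted}


def interaction_degree(pairs):
    """Count distinct interaction partners per gene.

    `pairs` is any iterable of (gene_a, gene_b) tuples. Self-interactions
    are ignored, and A-B is treated as the same interaction as B-A.
    """
    partners = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue
        partners[a].add(b)
        partners[b].add(a)
    return {gene: len(p) for gene, p in partners.items()}


# Toy input only: the PubMed IDs and interaction pairs are invented.
toy_rows = [
    "#tax_id\tGeneID\tPubMed_ID",
    "9606\t7157\t111",
    "9606\t7157\t222",
    "9606\t7157\t222",
    "9606\t7316\t111",
    "10090\t22059\t333",
]
toy_pairs = [("UBC", "TP53"), ("UBC", "SUMO2"), ("TP53", "UBC"), ("UBC", "UBC")]

print(publication_counts(toy_rows, ["7157", "7316", "6613"]))
print(interaction_degree(toy_pairs))
```

For the real file you would pass an open file handle instead of `toy_rows`, e.g. `with open("gene2pubmed") as fh: publication_counts(fh, my_gene_ids)`. Keep in mind that both numbers are rough proxies with the caveats described above.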