Question

Illustrating Our Growing Dependence On Gene Annotation

7


14.1 years ago

Andrew Su 4.9k

Most here probably would recognize that structured gene annotations underlie lots of modern, genome-scale science. For example, Gene Set Enrichment Analyses (GSEA) and various pathway analyses are dependent on having high-quality and comprehensive gene annotations, and this trend is only growing.

I want to have a slide that illustrates this point for scientists who don't do large-scale biology. You know, the biologists who are working on the one gene / one postdoc model. Ideas that had crossed my mind:

plotting citation counts of the original Gene Ontology paper, binned by year
plotting citation counts of the original GSEA paper, binned by year
plotting reference counts of a "gene ontology" PubMed query, binned by year
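For the PubMed idea, a minimal sketch of how the per-year counts might be collected, assuming NCBI's E-utilities ESearch endpoint and its `rettype=count`, `datetype=pdat`, `mindate`, and `maxdate` parameters. The function names (`build_count_url`, `parse_count`, `counts_by_year`) are made up for illustration, and the query string is just one possible choice.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(term, year):
    """Build an ESearch URL asking only for the hit count in one publication year."""
    params = {
        "db": "pubmed",
        "term": term,
        "rettype": "count",   # return only the <Count> element
        "datetype": "pdat",   # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


def parse_count(xml_text):
    """Extract the integer hit count from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


def counts_by_year(term, years, fetch=None):
    """Return {year: hit count}; `fetch` maps a URL to response text (injectable for testing)."""
    if fetch is None:
        def fetch(url):
            with urlopen(url) as resp:
                return resp.read().decode("utf-8")
    return {year: parse_count(fetch(build_count_url(term, year))) for year in years}
```

Something like `counts_by_year('"gene ontology"', range(1998, 2011))` would give the numbers to plot. For a long run of years it is worth pausing between requests, since NCBI limits unauthenticated clients to a few requests per second.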

Clearly I'm in a rut. Any other ideas that would illustrate the point?

annotation pubmed genomics • 2.9k views

updated 14.1 years ago by Mary 11k • written 14.1 years ago by Andrew Su 4.9k

score 5 · Answer 1 · 2010-11-12

Huh. Interesting problem to illustrate. I don't know that citation counts of GO or GSEA would make the case for me. I know how, er....inconsistently... people cite the resources they have used (because we have actually looked at this). And in those cases the people who would cite them would often be other software/database type providers rather than users of the citations themselves, ya know?

I have an idea. No, two ideas. But I am entirely unwilling to try this myself, so take it with that grain of salt. :)

Take the top __ impact papers in a year for the last 10-15 years. (I don't know what that number is. Maybe it's the top 20. 50. Dunno. Could be 500 for all I know.) Find out the average number of genes/proteins studied/evaluated in each paper. See if it ramped up over the years.
Take the papers from a curation source (or several of them, like MGI, SGD, etc). See how many genes/paper have been coming out of them over the years. I know that for MGI, for example, you could take a paper ID and find the genes it provided data for.
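Either idea boils down to the same calculation once you have the paper-to-gene links in hand. A rough sketch, assuming the curation data can be exported as `(paper_id, year, gene_id)` records; that record layout and the function name are hypothetical, not the format of MGI, SGD, or any other resource.

```python
from collections import defaultdict


def mean_genes_per_paper_by_year(records):
    """Average number of distinct genes per paper, grouped by publication year.

    `records` is an iterable of (paper_id, year, gene_id) tuples, one per
    curated paper-gene link (a hypothetical export format).
    """
    genes_by_paper = defaultdict(set)
    year_of_paper = {}
    for paper_id, year, gene_id in records:
        genes_by_paper[paper_id].add(gene_id)  # a set ignores repeated links
        year_of_paper[paper_id] = year

    # Collect the per-paper gene counts under each publication year.
    counts_per_year = defaultdict(list)
    for paper_id, genes in genes_by_paper.items():
        counts_per_year[year_of_paper[paper_id]].append(len(genes))

    return {
        year: sum(counts) / len(counts)
        for year, counts in sorted(counts_per_year.items())
    }
```

Plotting the resulting year-to-average mapping would show whether genes per paper has ramped up. A median might be a better summary than the mean if a handful of genome-scale papers dominate a given year.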

I know that you and I disagree on how to do annotation. But I don't think that affects these data as I have thought about it.

And maybe I'm totally off base. I'm going to keep thinking about it--what data would it take me to come around to your view ;) ?

score 4 · Answer 2 · 2010-11-12

Adding another answer because this is a different line of thought than the first one entirely, not just an edit.

I woke up today and asked myself this question a different way: why is this so hard to illustrate? And I realized that it goes back to a point I have made a number of other times elsewhere--the data isn't IN the papers anymore. Counting stuff via the literature won't get to the problem directly.

The data you need to be doing stuff at the bleeding edge (and for your grant proposals) is not in the papers. It's in the databases. The data from the "big data" projects goes straight there--does not pass Go, barely even makes it into the figure legends or the supplemental data of the papers. Think ENCODE, which was the specific instance I blogged about there. But also ICGC, which I have been poking around in. The marker paper came out, but the data is pouring in sans papers at this point. And 1000 Genomes: the paper referred to one compelling story about a SNP that seemed to be in an intriguing location, but that was just one example SNP. These projects have a paper they call the "marker" paper that they want everyone to cite. But that paper pretty much does the theory/background/sexy-examples-of-how-cool-we-are. It does not really contain the data itself.

So maybe the way to show this is with the "marker" papers somehow. From the "big data" marker papers, how much data in the UCSC/ICGC/1000G FTP sites + browsers is associated with them? [I'm not sure how to quantify that, btw.] That data relies on existing annotations where there are some, and could form a source of more annotations, but does not come from the traditional route of the literature.

I think this dovetails with both the points that Alastair and Larry made too.

score 3 · Answer 3 · 2010-11-12

3


14.1 years ago

Alastair Kerr 5.3k

Or look at it the other way around: what is the growth of projects that use techniques that need such analysis, e.g. high-throughput techniques such as proteomics, microarrays, and next-gen sequencing? There are tools at ResearcherID that would help with this.

On a related topic, I would be interested in the growth (or lack thereof) of primary research data that provides the underlying annotation to such resources. That would highlight the dependence on homology matching and stress the importance of knowing how an annotation was attributed. It is the sort of thing I stress in lectures but never really get numbers for.


0


I'm not sure how ResearcherID helps (clarification welcome), but showing the growth of genome-scale techniques is a good idea. Most people will be able to see that bridging the gap between the data and testable hypotheses is dependent on structured gene annotations. I like it...

14.1 years ago by Andrew Su 4.9k

0


There are (were? I'll double-check next week) tools to extract the number of papers on 'trends' such as proteomics.

14.1 years ago by Alastair Kerr 5.3k

score 3 · Answer 4 · 2010-11-12

When I was annotating the Arabidopsis thaliana genome in 1997-1999, fully 95% of the genes we described from genome sequencing and modeling of ab initio-predicted exons were new. Since then, yes, there has been an explosion in building chips and doing all kinds of genome-wide analyses. That work has led to amazing results on differences between cultivars (isolates) of the species (gene expression, methylation patterns, etc.). For this, one can look at the work done by Joe Ecker's lab.

Similarly, in humans, there have been huge advances in technology (the arrays and chips and the ability to assess protein and metabolite levels), with concomitant great leaps in the number of data points attached to genes. Now, I see that there is a return, to a degree, to the single or small gene set focus. For example, the recent paper by Wasserman et al. shows that GWAS hits far upstream of the MYC oncogene actually affect enhancer activity of that gene.

So, while there are more and more genome-wide studies undertaken, there are also increases in the number of papers looking at specifics of a small number of genes where attention is given to their biological role(s).

Specifically, one way to address the illustration you wish to make could center around a few genes and the number of papers published each year on each gene. Note when the gene was discovered and when its genomic sequence was deduced by that organism's genome sequencing project. You could also note on this timeline when certain tools came on-line, such as platforms for whole-genome analysis of mRNAs, proteins, methylation patterns, and genome variation.

As Mary wrote, these are just thoughts that came to mind quickly. They would be better refined in a face-to-face roundtable discussion...