Locations Of Plots Of Quantities Of Publicly Available Biological Data
14.1 years ago
Gotgenes ▴ 460

There's a cliché in talks and presentations these days: a plot demonstrating the rapid (typically exponential, or super-exponential) growth of publicly available biological data of one kind or another (e.g., sequence data, yeast two-hybrid data, etc.). These plots are frequently juxtaposed against a plot of Moore's law. You know the type. You have probably even used or made such a plot if you're at this site.

It's not always obvious where to find these plots. Surprisingly (disappointingly, even), major clearing houses for biological data such as GenBank and Gene Expression Omnibus (GEO) don't provide plots of their growth in any obvious location, let alone their front pages (where it makes the most sense to display such positive trends). Let's compile a list of where to find these plots, including, but not limited to:

  • Publications (decent)
  • Open-access publications (good)
  • Sites that provide up-to-date plots (better)
  • Scripts or programs that generate plots on the fly (excellent)

I think it would also be interesting to post code that can generate these plots. The data are often available, although often not in the best format, for those who'd like to try a roll-your-own approach.


Good to see you here!

14.1 years ago
Mary 11k

We started this the other day. See this thread: Exponentially Increasing Genomes Slide. Another one I like that hasn't come up yet is the growth of GeneTests, which tracks diseases for which testing is available: http://www.ncbi.nlm.nih.gov/projects/GeneTests/static/whatsnew/labdirgrowth.shtml


was about to write the same thing, you were 3 secs faster ;)


Thanks. I picked poor search terms when looking for an existing question. I don't know if we should close this question as a duplicate, as I'm interested in any type of (high-throughput) biological data.


then you may want to refine your question in order to not be a duplicate ;)

14.1 years ago

Data for the growth of the number of articles in MEDLINE can be found here:

http://www.nlm.nih.gov/bsd/licensee/baselinestats.html

There is some time lag in interpreting numbers from the MEDLINE baseline files. For example, good data on the growth of MEDLINE through 2008 can be found in the 2010 baseline statistics: http://www.nlm.nih.gov/bsd/licensee/2010_stats/2010_Totals.html

EDIT 1: Data for the growth of the number of GeneRIFs in Entrez Gene can be found here:

http://www.ncbi.nlm.nih.gov/projects/GeneRIF/stats/

EDIT 2: Data for the growth of the number of GWAS studies in the Human Genome Epidemiology database:

http://hugenavigator.net/HuGENavigator/startPageWatch.do

14.1 years ago

I already added sequence data growth in UniProt in the other question. As you are interested in various data categories, here is the exponential growth of the RCSB PDB from the 1970s to date. Kudos to the RCSB PDB team for providing the data and the graph in a convenient way.


EDIT by RamRS: Khader's link to his own answer is dead and does not point to a post on biostars.org because the post seems to have been lost before migration. Here is a link to an archived version of the post: https://web.archive.org/web/20111124051054/http://biostar.stackexchange.com/questions/2966/exponentially-increasing-genomes-slide/2973

[Image: picture of his answer]

14.1 years ago
Neilfws 49k

Just a brief note on a way to generate "growth of database" data yourself, at least for the Entrez databases.

Most of the Bio* projects include an EUtils library. BioRuby's Bio::NCBI::REST class has a useful method, esearch_count, which counts the number of results for a query. As an example, you could retrieve total publications in PubMed for the years 2000-2010 like this:

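#!/usr/bin/ruby
require "rubygems"
require "bio"

# NCBI asks E-utilities users to identify themselves; use your own address.
Bio::NCBI.default_email = "me@me.com"
ncbi = Bio::NCBI::REST.new

# Print one "year<TAB>count" line per publication year.
2000.upto(2010) do |year|
  # The [dp] tag restricts the search to the publication date field.
  count = ncbi.esearch_count("#{year}[dp]", {"db" => "pubmed"})
  puts "#{year}\t#{count}"
end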

Redirect the output to create a tab-delimited file of year and count. Here, we're searching the DP (Date of Publication) field in PubMed. You could substitute any Entrez database, search term(s) and years.
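Once you have the tab-delimited year + count output, one rough way to check the "exponential growth" claim is a least-squares fit of ln(count) against year, which gives a growth rate and a doubling time to set beside Moore's law. This is only a sketch using the Ruby standard library: the helper names parse_counts and fit_exponential are made up for illustration (they are not part of BioRuby), the sample numbers are invented rather than real PubMed counts, and the fit assumes every count is positive.

```ruby
# Hypothetical helpers for illustration; not part of BioRuby.

# Parse "year<TAB>count" lines into [[year, count], ...], skipping blank lines.
def parse_counts(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    year, count = line.split("\t")
    [Integer(year), Integer(count)]
  end
end

# Least-squares fit of ln(count) = a + rate * year.
# Returns [rate, doubling_time_in_years]; assumes all counts are positive.
def fit_exponential(points)
  n  = points.size.to_f
  xs = points.map { |year, _| year.to_f }
  ys = points.map { |_, count| Math.log(count) }
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  sxy = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  rate = sxy / sxx
  [rate, Math.log(2) / rate]
end

# Invented counts that double every two years (NOT real PubMed numbers).
sample = "2000\t100\n2002\t200\n2004\t400\n2006\t800\n"
rate, doubling = fit_exponential(parse_counts(sample))
puts format("growth rate: %.4f per year, doubling time: %.2f years", rate, doubling)
```

To use it on real output, replace the sample string with the contents of your redirected file, for example File.read("counts.tsv") (the file name is just a placeholder). For the invented sample the fit should report a doubling time of about two years.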

14.1 years ago
Bio_X2Y ★ 4.4k

The SILVA website plots the growth of ribosomal RNA databases, e.g. http://www.arb-silva.de/documentation/background/release-104/

14.1 years ago
Suk211 ★ 1.1k

SCOP lists the statistics of its release history in tabular form for the last 12 years.

SCOP Classification Statistics

I agree with Khader that PDB has done an excellent job of reporting statistics on its entries. They have something called the histogram menu, which can easily generate statistics on current entries based on various criteria.

Example: Source Organism (Gene Source) Histogram

14.1 years ago
Gotgenes ▴ 460

There is a news article from October 2010 in Science that has a plot of the growth of human SNP data, particularly with regard to the 1000 Genomes Project.


Bump! Not an OA article.

14.1 years ago
Rob ▴ 30

A recent paper with an updated "Growth of GEO" plot:

Le et al. Cross-species queries of large gene expression databases. Bioinformatics (2010) vol. 26 (19) pp. 2416-23
