Question

What Are The Classic Papers In Bioinformatics?

85

14.0 years ago

Casey Bergman 18k

A few years back, I asked a dozen or so colleagues for classic/important papers that every bioinformatician should read as a part of their training. I thought BioStar might be a good place to resuscitate this exercise to get a broader set of candidates and let the community weigh in on what papers make up the bioinformatics "canon".

Here are some of the papers that I use for teaching to start the ball rolling:

Altschul et al. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. http://www.ncbi.nlm.nih.gov/pubmed/2231712

Myers et al. A whole-genome assembly of Drosophila. Science. 2000 Mar 24;287(5461):2196-204. http://www.ncbi.nlm.nih.gov/pubmed/10731133

Burge & Karlin. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997 Apr 25;268(1):78-94. http://www.ncbi.nlm.nih.gov/pubmed/9149143

Lowe & Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997 Mar 1;25(5):955-64. http://www.ncbi.nlm.nih.gov/pubmed/9023104

Depending on the level of interest in this topic, perhaps we can put together a library of "bioinformatics classics" on CiteULike.

literature papers history • 23k views

Updated 3.6 years ago by Jeremy Leipzig 22k • written 14.0 years ago by Casey Bergman 18k

score 16 · Answer 1 · 2010-10-28

Oh, I did a blog post on one once. It was part of a "classic papers" blogging initiative that was really fun, actually.

Margaret Dayhoff, a founder of the field of bioinformatics

In it I think I found the first computational protein analysis:

In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.

The program was called COMPROTEIN (yes, it was all caps). But it was in fact a pipeline of several programs: MAXLAP, MERGE, PEPT, SEARCH, QLIST, and LOGRED.

Reference: Dayhoff, M. O. and R. S. Ledley. Comprotein: A Computer Program to Aid Primary Protein Structure Determination. In Proceedings of the Fall Joint Computer Conference, 1962, 262-274. Santa Monica, CA: American Federation of Information Processing Societies, 1962. http://doi.acm.org/10.1145/1461518.1461546

The link is now broken, though; I'll have to find out where the paper lives now.

This link seems to work: http://portal.acm.org/citation.cfm?id=1461546

score 13 · Answer 2 · 2010-10-28

Nobody cited the Smith & Waterman algorithm?

JMB 1981: Identification of common molecular subsequences T. F. Smith and M. S. Waterman http://dx.doi.org/10.1016/0022-2836(81)90087-5

and Needleman–Wunsch:

Needleman, Saul B.; and Wunsch, Christian D. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443–53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325.
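For readers who have not met these algorithms, here is a minimal sketch of the local-alignment recurrence in the spirit of Smith & Waterman. It is a simplification, not the paper's formulation: it assumes a linear gap penalty, and the function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative choices of mine. It returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty).

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local:
            # a bad prefix is dropped rather than carried along.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

A global alignment in the style of Needleman-Wunsch fills the same kind of matrix, but without the floor at 0 and with gap penalties accumulated along the first row and column.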

score 11 · Answer 3 · 2010-10-28

I wouldn't normally answer a question twice, but these are unrelated to my first answer.

Important papers to me personally:

Chothia C, Lesk AM. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J. 1986 Apr;5(4):823-6. http://www.ncbi.nlm.nih.gov/pubmed/3709526

Paving the way for homology modelling.

Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999 Dec;20(18):3551-67. http://www.ncbi.nlm.nih.gov/pubmed/10612281

The paper that outlined MASCOT, as important as BLAST for proteomics (though SEQUEST came earlier - Eng et al. (1994) J Am Soc Mass Spectrom 5: 976–989. doi:10.1016/1044-0305(94)80016-2).

score 10 · Answer 4 · 2011-12-14

These papers were usually classified by the fields of bioinformatics research of the 1990s, i.e., gene prediction (GENSCAN, GLIMMER, etc.), alignment (BLAST, Smith-Waterman, Needleman-Wunsch, etc.), protein structure prediction (Chou-Fasman, etc.), and phylogenetics (PHYLIP, etc.).

Here's a short list of alignment-related articles, in addition to the already listed Smith-Waterman and Needleman-Wunsch papers:

Wilson, A.C., Carlson, S.S., White, T.J. (1977) "Biochemical evolution." Ann. Rev. Biochem. 46:573-639.
Doolittle, R.F. (1981) "Similar amino acid sequences: chance or common ancestry?" Science 214:149-159.
Henikoff, S., Henikoff, J.G. (1992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919.
Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol. 162:705-708.
Fitch, W.M., Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl. Acad. Sci. USA 80:1382-1386.
Pearson, W.R., Lipman, D.J. (1988) "Improved tools for biological sequence comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
Gish, W., States, D.J. (1993) "Identification of protein coding regions by database similarity search." Nature Genet. 3:266-272.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.
Henikoff, S., Henikoff, J.G. (1994) "Position-based sequence weights." J. Mol. Biol. 243:574-578.
Lipman, D.J., Altschul, S.F., Kececioglu, J.D. (1989) "A tool for multiple sequence alignment." Proc. Natl. Acad. Sci. USA 86:4412-4415.
Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) "CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Res. 22:4673-4680.
Staden, R. (1989) "Methods for discovering novel motifs in nucleic acid sequences." Comput. Appl. Biosci. 5:293-298.
Stormo, G.D., Hartzell, G.W. III (1989) "Identifying protein-binding sites from unaligned DNA fragments." Proc. Natl. Acad. Sci. USA 86:1183-1187.
Schuler, G.D., Altschul, S.F., Lipman, D.J. (1991) "A workbench for multiple alignment construction and analysis." Proteins 9:180-190.
Karlin, S., Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.

In addition, there are the famous articles by Margaret Dayhoff on substitution matrices:

Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978) "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3," M.O. Dayhoff (ed.), pp. 345-352, Natl. Biomed. Res. Found., Washington, DC.

Schwartz, R.M., Dayhoff, M.O. (1978) "Matrices for detecting distant relationships." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3," M.O. Dayhoff (ed.), pp. 353-358, Natl. Biomed. Res. Found., Washington, DC.

score 8 · Answer 5 · 2010-10-28

Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehväslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8. http://www.ncbi.nlm.nih.gov/pubmed/12368254

Other Bio* library papers are available, but I think most would agree, BioPerl is the most "important".

score 7 · Answer 6 · 2010-10-28

Maybe the 1000 Genomes paper published yesterday will open a new era in bioinformatics.

This morning I attended a talk by one of the authors, who explained some of the challenges the 1000 Genomes consortium has faced. For the first time in history, the biggest datasets in biology are reaching the scale of the datasets in physics and astronomy. From now on, we will have to think more carefully about the tools we use: for example, physicists have developed an alternative to the Internet to share data, while we biologists are still using HTTP or FTP to download data, competing with people downloading MP3s. We need to look for alternatives for downloading the gigabytes of new data produced daily, such as shared cloud computing images. Moreover, the 1000 Genomes Project has also introduced many new formats, like BAM and SAM, and new tools to handle huge datasets.

score 6 · Answer 7 · 2010-10-28

14.0 years ago

Konrad ▴ 710

I would also add the first COG paper to the list:

A genomic perspective on protein families. Tatusov RL, Koonin EV, Lipman DJ; Science. 1997 Oct 24; 278(5338): 631-7

It offers interesting evolutionary insights, and I personally find the COG concept quite a helpful tool.

score 6 · Answer 8 · 2011-01-31

PLoS Computational Biology has recently launched a series of Perspectives called 'The roots of bioinformatics' to illustrate the seminal papers in each of the sub-fields of bioinformatics.

To date, only two articles of the series have been published:

Searls DB. The roots of bioinformatics. PLoS Comput Biol. 2010 Jun.
Doolittle RF. The roots of bioinformatics in protein evolution. PLoS Comput Biol. 2010 Jul 29;6(7):e1000875. Review. PubMed PMID: 20686682; PubMed Central PMCID: PMC2912333.

If you are interested, you can create a citation alert for '"roots of bioinformatics" PLoS Computational Biology' in Entrez.

score 4 · Answer 9 · 2011-02-01

The review by David Searls in June 2010 in PLoS Computational Biology on the roots of bioinformatics will certainly point you to some classic papers, including some you likely never thought of as belonging to this field. This review was very well written and was a joy to read. The paper is here.

I would also add the early papers of JW Fickett on gene modeling based on base composition and comparative approaches.

score 3 · Answer 10 · 2010-10-28

14.0 years ago

Pavid ▴ 160

Hey!

Interesting question! I'm just beginning to work in this field; I actually started a few months ago.

I've read some papers, but I quite enjoyed this one:

A Quick Guide for Developing Effective Bioinformatics Programming Skills

score 3 · Answer 11 · 2011-02-01

The Clustal papers, among the most cited papers in the world (across all scientific areas):

.. Thompson, JD; Gibson, TJ; Plewniak, F; Jeanmougin, F; Higgins, DG The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. NUCLEIC ACIDS RESEARCH, 25 (24): 4876-4882 DEC 15 1997

Chenna, R; Sugawara, H; Koike, T; Lopez, R; Gibson, TJ; Higgins, DG; Thompson, JD Multiple sequence alignment with the Clustal series of programs. NUCLEIC ACIDS RESEARCH, 31 (13): 3497-3500 JUL 1 200 ..

I think that is classic

score 3 · Answer 12 · 2011-02-01

13.8 years ago

Peter ▴ 90

Ruth Nussinov, George Pieczenik, Jerrold R. Griggs and Daniel J. Kleitman: Algorithms for Loop Matchings. In: SIAM Journal on Applied Mathematics. 35, No. 1, July 1978, pp. 68-82.

30 years ago, she came up with a beautiful dynamic programming algorithm for secondary structure prediction.
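To give a flavour of that dynamic programming idea, here is a minimal sketch of the base-pair maximization recurrence usually associated with this paper. It makes several simplifying assumptions: only Watson-Crick pairs are allowed, the function name is mine, and `min_loop` (the minimum number of unpaired bases enclosed by a pair) is a hypothetical parameter added for illustration. It returns the maximum number of nested pairs, not the structure itself.

```python
def nussinov_max_pairs(seq, min_loop=0):
    """Maximum number of nested base pairs in an RNA sequence.

    Assumes Watson-Crick pairs only; min_loop is an illustrative
    parameter, not part of the original formulation.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some base k, splitting the problem
            # into the part left of k and the part enclosed by (k, j).
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + 1 + inner)
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAAUCC"))
```

The triple loop makes this cubic in the sequence length, which is fine for short sequences but is one reason later methods moved to energy-based models with more careful bookkeeping.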

score 2 · Answer 13 · 2010-10-28

A method to identify protein sequences that fold into a known three-dimensional structure.

Prediction of protein interactions: metabolic enzymes are frequently involved in gene fusion.

A new approach to protein fold recognition

Comparative protein modelling by satisfaction of spatial restraints.