The following statistics come from Ohloh or from a cloc.pl count.
Project       Language   Code       Comment  Blank    Date/Ver     Source    FTEs
Bioclipse     Java       578,095    349,515  154,338  04/02/2011   Ohloh     ?
Bioconductor  R/C/C++    1,248,634  276,358  218,222  03/30/2011   cloc+awk  ?
BioJava       Java       272,864    129,237  59,074   03/30/2011   Ohloh     ?
BioMart       Java/Perl  98,637     43,231   24,346   03/30/2011   Ohloh     ?
BioPerl       Perl       323,007    258,987  167,907  03/30/2011   Ohloh     ?
BioPython     Python     120,824    39,085   22,183   03/30/2011   Ohloh     ?
BioRuby       Ruby       68,390     27,032   15,636   03/30/2011   Ohloh     ?
EMBOSS        C          633,014    258,265  215,110  04/02/2011   Ohloh     ?
flystockdb    JS/Ruby    7,845      ?        ?        ?            ?         1
JKsrc         C          827,908    111,490  105,524  03/31/2011   Ohloh     ?
Jmol          Java       213,645    58,930   28,784   03/30/2011   Ohloh     ?
ncbi_cxx      C++/C      1,112,817  318,441  250,134  Jun_15_2010  cloc.pl   ?
OpenMS        C++        219,835    77,201   51,512   04/02/2011   Ohloh     ?
SeqAn         C++/C      250,390    89,885   55,212   03/30/2011   Ohloh     ?
SHOGUN        C++/C      128,232    53,367   33,488   04/02/2011   Ohloh     ?
A few caveats apply to this table. As others have argued, these numbers are not a good indication of how large a project is; they only give a very rough idea.
EDIT 03/31/2011: JKsrc is now taken from Ohloh; its LOC counts are very similar to the cloc.pl results.
EDIT 04/02/2011: Updated EMBOSS with LOC counts from Ohloh (I modified its enlistment list because the old one pointed only to its documentation); added OpenMS (I modified its enlistment list because the old one included SVN tags and branches, whereas only trunk should be counted); added SHOGUN; added Ensembl to Ohloh, but Ohloh has problems analyzing its repository; updated Bioclipse, as Egon has updated its enlistment. Sorry to push this answer up; I just want to keep it updated.
To further demonstrate cloc.pl: I downloaded Jim Kent's source code (jksrc.zip), unzipped it, and counted the lines of code with the following command line:
find . -type f | grep -E '\.(c|h|cpp|cc|hpp|hh|java|py|pl|pm|rb|lua|html|htm|js|php|sql)$' > file.list; cloc.pl --list-file=file.list
The output is:
[?]
This jksrc.zip is one of the largest collections of C source code (if not the largest). It is the basis of the UCSC Genome Browser and of many other utilities, such as the famous BLAT.
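For anyone without find and egrep at hand, the file-list step of the command above can be sketched in Python. This is only a rough equivalent under my own assumptions: the function names `list_source_files` and `write_list_file` are made up for illustration, and the extension list is simply copied from the command. The resulting file.list can then be passed to cloc.pl with --list-file as before.

```python
import os

# Same extensions as in the pattern of the command above.
SOURCE_EXTENSIONS = {
    "c", "h", "cpp", "cc", "hpp", "hh", "java", "py", "pl", "pm",
    "rb", "lua", "html", "htm", "js", "php", "sql",
}
SUFFIXES = tuple("." + ext for ext in SOURCE_EXTENSIONS)


def list_source_files(root):
    """Return sorted paths of regular files under root with a listed extension.

    Hypothetical helper: mirrors `find -type f | egrep '\\.(...)$'`.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, as `find -type f` does.
            if name.endswith(SUFFIXES) and not os.path.islink(path):
                matches.append(path)
    return sorted(matches)


def write_list_file(root, list_path="file.list"):
    """Write one matching path per line and return how many were written."""
    paths = list_source_files(root)
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)
```

Calling `write_list_file(".")` in the unzipped directory would produce the same kind of file.list as the shell pipeline.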
Please include FTE estimates when available.
Comparing the number of lines in projects is no indication of the projects' size. For example, the line count in languages that require braces will be inflated significantly. Furthermore, whitespace and comments can also inflate the count.
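To make the inflation concrete, here is a minimal sketch that splits a snippet into blank, comment, and code lines and also reports how many "code" lines hold nothing but a brace. It is deliberately naive and is not how cloc.pl or Ohloh count: the function `classify_lines` is a made-up name, and it only recognizes whole-line comments starting with an assumed prefix.

```python
def classify_lines(source, comment_prefix="//"):
    """Naively classify each line as blank, comment, or code.

    Hypothetical helper for illustration only: block comments and
    trailing comments are not handled. "brace_only" is a subset of "code".
    """
    counts = {"code": 0, "comment": 0, "blank": 0, "brace_only": 0}
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(comment_prefix):
            counts["comment"] += 1
        else:
            counts["code"] += 1
            # A line made up only of '{' and '}' characters.
            if set(stripped) <= {"{", "}"}:
                counts["brace_only"] += 1
    return counts


SAMPLE = """\
// add two numbers
int add(int a, int b)
{
    return a + b;
}

"""

print(classify_lines(SAMPLE))
```

In this six-line sample only two lines carry logic (the signature and the return statement); the other four are a comment, a blank line, and two lone braces.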
An interesting though controversial question. For people who want to post answers, I would recommend using the same program to count lines of code. `wc -l` seems too primitive. I would recommend cloc.pl (a single Perl script): http://sourceforge.net/projects/cloc/files/cloc/v1.53/
I don't think that the number of lines of code is a good measure of the amount of effort that went into a project. Writing compact, efficient, reusable code takes orders of magnitude more time than writing bloated, inefficient code with lots of duplication due to bad design.
I disagree. Whilst both of you are right that LOC is an ambiguous metric, it is still a fantastic estimate of a project's size when looking at the order of magnitude. Obviously, a project with 1k LOC is significantly smaller than a 1M LOC project, no matter how you factor in braces, comments, and generous use of blank lines.
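As a rough illustration of the order-of-magnitude reading, the sketch below groups a few of the code line counts from the table in the answer by their power of ten. The helper names `order_of_magnitude` and `group_by_magnitude` are my own, and the choice of projects is arbitrary.

```python
# Code line counts copied from the table in the answer.
CODE_LINES = {
    "flystockdb": 7_845,
    "BioRuby": 68_390,
    "BioPython": 120_824,
    "EMBOSS": 633_014,
    "Bioconductor": 1_248_634,
}


def order_of_magnitude(loc):
    """Return the power of ten of a positive line count (e.g. 7,845 -> 3)."""
    if loc <= 0:
        raise ValueError("line count must be positive")
    # Counting digits avoids floating-point rounding in log10.
    return len(str(loc)) - 1


def group_by_magnitude(code_lines):
    """Map each power of ten to the projects whose LOC falls in it."""
    groups = {}
    for project, loc in code_lines.items():
        groups.setdefault(order_of_magnitude(loc), []).append(project)
    return groups


for magnitude, projects in sorted(group_by_magnitude(CODE_LINES).items()):
    print(f"10^{magnitude}: {', '.join(projects)}")
```

Within a single bucket (here BioPython and EMBOSS both land in 10^5) the counts say little about relative size; across buckets the difference is hard to explain away with braces and comments.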
I entirely agree LOC is not a perfect measurement (actually all my projects will be underestimated by LOC), but it is at least a measurement and frequently not so misleading. How can we prove a LIMS is the largest project without measuring it?
Using LOC is just fine for T-shirt sizing (S, M, L) of software projects. But it implies that you have access either to a published LOC stat or to the actual source code. Could you infer the size of projects based on published results or some other published metric?
What if you did this: 1. Do a Google search to get a list of bioinformatics software. 2. Create a Google Mashup to automatically search for each of the titles and record the hit count. 3. Use the number of Google hits as a metric to at least infer the popularity of the software.
Can some admin close & purge this question + answers? Apparently it has become a discussion about LOC, and no one really addresses Tomer's points about language usage and full-time employees.
Please do not close this question. The comments here are all about LOC, but the answers are not.
Thanks to everyone for your energetic replies. I think you have all won me over to biostar!
My coworker recommended the following LoC tool: http://www.dwheeler.com/sloccount/
This is the one of Linux kernel fame. I'm curious as to how it fares against cloc.pl & other tools.
I'm with everyone regarding LoC not being the end-all of software complexity/size/feature metrics, but it's a useful if imperfect one. @Ben, thanks for the recommended approach. While there might be some noise in that approach, I'll definitely add a column for measures of the user community.
cloc.pl uses source code from SLOCCount. I believe cloc.pl builds on what SLOCCount does.