Forum:Programming language use distribution from recent programs / articles
2
5
7.6 years ago

I would be interested to see a breakdown of the programming languages used for recently published bioinformatics programs.

I suspect such a breakdown isn't available, but is there a good enough source, or a quick and dirty way of assessing this? Ideally without having to go through a year of the journal Bioinformatics, downloading each paper and/or program to find what language each used.

Alternatively, what interesting online sources compile language use by year and, ideally, by sector?

Basically, I'm interested in something like the TIOBE index, but for bioinformatics: https://www.tiobe.com/tiobe-index/

programming-language programs • 12k views
ADD COMMENT
4

Not strictly related, as it's not bioinformatics, but I saw this on Twitter a few weeks ago. Kind of interesting. Some of the conclusions are perhaps a bit sketchy: for example, Python > Java > C might reflect an increase in programming skill rather than a trend toward a 'better' language.

You might be able to adapt their workflow idea, though?

https://erikbern.com/2017/03/15/the-eigenvector-of-why-we-moved-from-language-x-to-language-y.html?utm_content=buffer10d66&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

ADD REPLY
0

I saw it go by too but didn't look at it closely at the time. It is quite interesting :)

ADD REPLY
0

There may only be a few Java programs: the BBMap suite and FastQC.

Edit: That's the danger of responding to a poll like this. Things I don't generally use fade from mind.

ADD REPLY
1

GATK, Picard...

ADD REPLY
1

Trimmomatic is also in Java.

ADD REPLY
1

Mauve, Artemis, Qualimap(?)...

ADD REPLY
6
7.6 years ago

I've quickly written something: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/pubmed/PubmedCodingLanguages.java

With a simple algorithm, it's difficult to detect languages like 'C' (e.g. 'in C.' vs 'C. elegans').

output for 'bioinformatics 2017'

* update *

histogram for 'bioinformatics' [image]

(PHP is overcounted because many URLs end with '.php')
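
For anyone who wants to play with the idea without building jvarkit, here is a minimal Python sketch of the same kind of keyword matching. It is not the algorithm used in PubmedCodingLanguages.java: the language list, the cue words and the function names are all made up for illustration. It strips URLs first so that '.php' links don't count as PHP, and it only accepts single-letter names such as 'C' or 'R' right after a cue like "written in", which is one rough way to tell 'in C.' from 'C. elegans'.

```python
import re
from collections import Counter

# Hypothetical keyword list for illustration; the real tool hard-codes its own.
LANGUAGES = ["Python", "Java", "Perl", "PHP", "C++", "R", "C"]

# URLs are removed before matching so that 'index.php' is not counted as PHP.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Single-letter names are only accepted right after one of these cues (assumed list).
CUE = r"(?:written|implemented|coded|developed)\s+in\s+"


def _name_pattern(name):
    # Lookarounds instead of \b so that 'C++' works and 'C' does not match
    # inside 'C++' or 'C#', nor 'Java' inside 'JavaScript'.
    return r"(?<![\w+#])" + re.escape(name) + r"(?![\w+#])"


def detect_languages(abstract):
    """Return the set of language names mentioned in one abstract."""
    text = URL_RE.sub(" ", abstract)
    found = set()
    for name in LANGUAGES:
        if len(name) == 1:
            # Case-insensitive cue, case-sensitive single-letter name.
            pattern = re.compile("(?i:" + CUE + ")" + _name_pattern(name))
        else:
            pattern = re.compile(_name_pattern(name), re.IGNORECASE)
        if pattern.search(text):
            found.add(name)
    return found


def count_languages(abstracts):
    """Count how many abstracts mention each language."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(detect_languages(abstract))
    return counts
```

The cue requirement is deliberately strict, so it will miss phrasings such as "implemented in C++ and R"; loosening it to a bare "in" brings the 'in C. elegans' false positives back.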

ADD COMMENT
0

That's more like it ;) Nicely done!

What about newer languages, like Go or Rust? Do you need to add code to detect these?

ADD REPLY
0

Yes, it's hard-coded: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/pubmed/PubmedCodingLanguages.java#L201 (feel free to suggest some more). I can send you the table (pmid/lang/title/year/context) if you want.

ADD REPLY
0

I cloned and compiled (with make) jvarkit. How do I run the code for PubmedCodingLanguages? Java newbie here :)

EDIT: I also tried javac PubmedCodingLanguages.java but got errors. Not sure it is meant to be compiled by itself.

ADD REPLY
0

I'm refactoring my code these days, which is why I haven't compiled the documentation.

make pubmedcodinglang pubmeddump

(requires Oracle Java 8)

and then something like:

java -jar dist/pubmeddump.jar 'Bioinformatics' | java -jar dist/pubmedcodinglang.jar 
ADD REPLY
1

Working :)

I'll see if I can tweak the code to add languages or add things as needed.

ADD REPLY
0

Cool! I've commented out 'R'; 'PHP' needs to be separated from the URLs; I only look at the abstract (not the title); etc.

ADD REPLY
1
7.6 years ago
John 13k

Code posted on GitHub has a breakdown of the languages used in the project at the top. It should be possible to automate going from a GitHub URL to a CSV of language usage, probably with both relative percentages and absolute lines of code.

Then you could parse PubMed for GitHub URLs.
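
A rough sketch of those two steps in Python, under some assumptions: `find_github_repos` and `languages_to_csv` are hypothetical helper names, the regex is only a first approximation of what a GitHub URL looks like in an abstract, and the `sizes` dict is assumed to come from wherever you get the per-language numbers (GitHub's REST API has a repository languages endpoint, but it reports bytes of code rather than lines). Nothing here touches the network.

```python
import csv
import io
import re

# Matches github.com/<owner>/<repo>; trailing punctuation from prose is trimmed below.
GITHUB_RE = re.compile(r"github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE)


def find_github_repos(text):
    """Return sorted unique 'owner/repo' strings found in free text."""
    repos = set()
    for owner, repo in GITHUB_RE.findall(text):
        repo = repo.rstrip(".")       # sentence-final full stop
        if repo.endswith(".git"):     # clone-style URL
            repo = repo[:-4]
        repos.add(f"{owner}/{repo}")
    return sorted(repos)


def languages_to_csv(repo, sizes):
    """Render {language: size} for one repo as CSV text with percentages."""
    total = sum(sizes.values())
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["repo", "language", "size", "percent"])
    # Largest language first.
    for lang, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        pct = 100.0 * size / total if total else 0.0
        writer.writerow([repo, lang, size, f"{pct:.1f}"])
    return out.getvalue()
```

Running `find_github_repos` over each abstract and then `languages_to_csv` over each hit would give one long table that can be concatenated and summarised by year.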

ADD COMMENT
3

ADD REPLY
5

Your ability to get shit done (in under 5 minutes) will never cease to amaze me Pierre :D

ADD REPLY
1

I upvoted your answer and Pierre's code snippet because both are awesome, but this is not really what I am looking for. These are GitHub projects mentioning the word "bioinformatics" in the description (EDIT: or somewhere in the file or directory names). The gap between this and published programs seems too big for the count to be informative. GitHub has its own bias toward Python, and scripts or random repositories will also differ from published programs.

Still, I really like this! I'll take a look at GitHub's API.

ADD REPLY
1

Could you scan a repository's README for a DOI? That might be a way to quickly filter for published work.
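
Something along these lines might work as a first pass. This is only a sketch: `find_dois` is a made-up helper, and the regex covers the common modern DOI shape ("10.", a registrant code, a slash, a suffix) rather than every DOI that has ever been issued. It stops at closing brackets so that Markdown links are handled, which means the rare DOI containing parentheses would be cut short.

```python
import re

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Stops at whitespace, quotes, angle brackets and closing brackets (assumed heuristic).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>\])]+", re.IGNORECASE)


def find_dois(readme_text):
    """Return unique DOIs in order of first appearance, trailing punctuation trimmed."""
    seen = []
    for match in DOI_RE.findall(readme_text):
        doi = match.rstrip(".,;:")
        if doi not in seen:
            seen.append(doi)
    return seen
```

A repository whose README yields at least one DOI could then be kept as "probably published", with the obvious caveat that the DOI may point to a dependency or a dataset rather than to the tool itself.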

ADD REPLY
0

Here's a little Python function that will scrape GitHub for the code usage statistics:

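import requests
import lxml.etree  # "import lxml" alone does not load the etree submodule


def get_stats(github_url, pretty=False):
    """Scrape a GitHub repo's search page for per-language file counts."""
    files = {}
    # The search results sidebar lists each language with its file count.
    page = requests.get(github_url + '/search?l=markdown').content
    tree = lxml.etree.HTML(page)

    for language in tree.xpath("//span[@class='count']"):
        # Skip the parent's first text node, then read the count and the language name.
        info = language.getparent().itertext()
        next(info)
        count = int(next(info))
        lang = next(info).strip()
        files[lang] = count

    if not pretty:
        return files

    # Dividing the total by 100 up front turns counts / total_files into a percentage.
    total_files = sum(files.values()) / 100.
    print('Language    Files    Percentage')
    for language, counts in files.items():
        # .12g keeps 12 significant digits, like Python 2's default float printing.
        print(f'{language:<11} {counts:>5}    {counts / total_files:.12g}')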

It can either return a dict of language_name: raw_count pairs, or it can just print them (with percentages):

>>> stats = get_stats('https://github.com/broadinstitute/picard',True)
Language    Files    Percentage
XML            12    1.99004975124
Shell           3    0.497512437811
Java          513    85.0746268657
Text           60    9.95024875622
JavaScript      2    0.331674958541
Gradle          2    0.331674958541
R               9    1.49253731343
Dockerfile      1    0.16583747927
CSS             1    0.16583747927

It will return an empty dict if the GitHub repo doesn't exist. If you can make a list of bioinformatics repos to scan, pop this function in a loop and aggregate the data :) I couldn't get the code usage stats via the GitHub API unfortunately, so scraping HTML was all I could do.
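
The loop-and-aggregate step could look roughly like this. It is a sketch with a hypothetical helper name, `aggregate_language_counts`, and it takes the stats function as an argument so it works with `get_stats` above or with any other function that maps a repo URL to a dict of language: count. It keeps two tallies, because total files per language and number of repos using a language answer different questions.

```python
from collections import Counter


def aggregate_language_counts(repo_urls, get_stats):
    """Sum per-language file counts over many repos and count repos per language."""
    file_totals = Counter()   # language -> total files across all repos
    repo_totals = Counter()   # language -> number of repos containing it
    failed = []               # URLs that raised or returned nothing
    for url in repo_urls:
        try:
            stats = get_stats(url)
        except Exception:     # network errors, unexpected HTML, ...
            failed.append(url)
            continue
        if not stats:         # empty dict: repo missing or nothing parsed
            failed.append(url)
            continue
        file_totals.update(stats)
        repo_totals.update(stats.keys())
    return file_totals, repo_totals, failed
```

Usage would be something like `files, repos, failed = aggregate_language_counts(urls, get_stats)` followed by `repos.most_common()`; a short pause between requests is probably wise when scraping many pages.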

ADD REPLY
