Blog: Evolution of Biostars
5.8 years ago
venu 7.1k

Hi,

I have been planning to do this small data science project & finally got some time to actually do it.

I scraped Biostars data (question title, day asked & associated tags) from the site's launch until today (2009 to 23/02/2019) and asked some questions. Here is a brief overview:

1. Which programming language is most frequently used in bioinformatics?

Based on the tags used with the questions, I looked at R, Python & Perl. The result is obvious, and the major contribution goes to the Bioconductor project.
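A minimal sketch of how such tag counting could look in Python, assuming each scraped question is a record with a `tags` list. The sample records and the `count_language_tags` helper are made up for illustration; this is not the code behind the plot.

```python
from collections import Counter

# Hypothetical sample records; the real scraped data is not shown in this post.
questions = [
    {"title": "DESeq2 design formula", "year": 2018, "tags": ["R", "RNA-Seq"]},
    {"title": "Parse FASTA with Biopython", "year": 2018, "tags": ["python", "fasta"]},
    {"title": "Regex on GFF file", "year": 2012, "tags": ["perl", "gff"]},
    {"title": "ggplot2 heatmap", "year": 2019, "tags": ["r", "heatmap"]},
]

# Languages of interest, lower-cased so that "R" and "r" are treated alike.
LANGUAGES = {"r", "python", "perl"}


def count_language_tags(records):
    """Count questions per language, matching tags case-insensitively."""
    counts = Counter()
    for record in records:
        # A set avoids double counting when a question repeats a tag.
        tags = {tag.strip().lower() for tag in record["tags"]}
        for language in LANGUAGES & tags:
            counts[language] += 1
    return counts


language_counts = count_language_tags(questions)
print(language_counts.most_common())
```

Note that this only counts explicit language tags, so a question about an R package that is not tagged R would be missed.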

enter image description here

2. Number of questions per year

This should, in a way, reflect the number of new researchers getting into bioinformatics / bioinformatics becoming an essential component of life science research.

enter link description here

3. Frequency of tags

This was already discussed here a few days back (I'm posting it anyway because the data is more recent, but the results are mostly similar to those in that thread).

enter image description here

I will write a blog post with some additional analysis & share the code :)

P.S.: Mods, if this thread doesn't fit into the Blog category, feel free to change it / suggest an appropriate category.

meta Biostars • 4.7k views
ADD COMMENT
1

Nice! I feel like the first graph might also be interesting to see in relative terms, to check whether the fraction of each programming language changes over the years.

ADD REPLY
0

Thanks Wouter. Do you mean the fractional change compared to the previous year, or something else? This is a type of analysis I thought of but haven't done yet.

ADD REPLY
2

I'm not sure what I mean, but somehow you should take into account the popularity of Biostars. You could state that Perl always has about the same absolute number of questions, although in relative terms it lost massively.
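One way to make this concrete is to divide the number of questions carrying a language tag by the total number of questions asked that year. Below is a rough sketch with invented numbers (the `(year, tags)` records and the `language_share_by_year` helper are hypothetical, not the real Biostars data), chosen so that the absolute Perl count stays flat while its share drops.

```python
from collections import defaultdict

# Hypothetical (year, tags) pairs, one per question.
questions = [
    (2010, ["perl", "blast"]),
    (2010, ["r"]),
    (2010, ["alignment"]),
    (2010, ["perl"]),
    (2018, ["perl"]),
    (2018, ["perl", "fasta"]),
    (2018, ["r", "rna-seq"]),
    (2018, ["r"]),
    (2018, ["r", "deseq2"]),
    (2018, ["python"]),
    (2018, ["python", "pandas"]),
    (2018, ["rna-seq"]),
]


def language_share_by_year(records, language):
    """Return {year: fraction of that year's questions tagged with language}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, tags in records:
        totals[year] += 1
        if language in {tag.lower() for tag in tags}:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


perl_share = language_share_by_year(questions, "perl")
for year, share in perl_share.items():
    print(year, f"{share:.0%}")
```

In this toy data Perl has two questions in both years, but its share halves because the total number of questions doubled.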

ADD REPLY
1

Nice venu.

It would be good to see the distribution of the Tools tag over the years or by application.

ADD REPLY
0

Thanks Vijay. It may make sense to get an overview of Tools per year, to check how frequently new tools are being developed. But what do you mean by "with applications"? Each tool is flagged with more than one tag, so it's extremely complicated to assign a one-word application to each tool programmatically.

ADD REPLY
0

By "application" I mean RNA-seq, WGS, whole transcriptome, etc. But I agree that it will be chaotic, as you mentioned.

ADD REPLY
1

Strange how the number of questions became 'saturated' from 2016. What could be the cause of that? Had the field and everything that it encompasses already matured?

ADD REPLY
3

I'd guess many basic questions have been asked over the previous years and answered very well, so with a simple Google search the first hits land on Biostars threads. Also, many tool developers are documenting their tools with clean examples and responding to user queries. It might be saturated in that sense, but the applications of the field are widespread and growing?

This progress might be one of the reasons, but it's just my opinion.

ADD REPLY
2

It would likely require more in-depth analysis, but the number of truly unique topics/questions will probably show a trend opposite to the plot in #2. As @venu said, a large majority of questions likely have some pointers/answer(s) that already exist on Biostars or elsewhere.

ADD REPLY
0

Nice! This is "Evolution of Biostars" rather than history :-)

If you have the data parsed out, can you perhaps make animated/interactive gifs for the word cloud that walk through the top 100/50 terms for each year?

ADD REPLY
0

Aw, your title makes more sense.

Yes, I will try to make a per-year frequency of tags (a good way to see the troublesome topics per year :p).
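For anyone who wants to try the per-year idea, here is a small sketch that collects the top tags for each year, which could then feed one word cloud frame per year. The records and the `top_tags_by_year` helper are invented for illustration and assume the data is available as `(year, tags)` pairs.

```python
from collections import Counter, defaultdict

# Hypothetical (year, tags) pairs, one per question.
questions = [
    (2012, ["blast", "perl"]),
    (2012, ["blast"]),
    (2012, ["alignment", "blast"]),
    (2018, ["rna-seq", "r"]),
    (2018, ["rna-seq", "deseq2"]),
    (2018, ["rna-seq", "r"]),
]


def top_tags_by_year(records, n=2):
    """Return {year: [(tag, count), ...]} with the n most frequent tags per year."""
    per_year = defaultdict(Counter)
    for year, tags in records:
        # Lower-case tags so that "RNA-Seq" and "rna-seq" are merged.
        per_year[year].update(tag.lower() for tag in tags)
    return {year: counts.most_common(n) for year, counts in sorted(per_year.items())}


top_tags = top_tags_by_year(questions, n=2)
for year, tags in top_tags.items():
    print(year, tags)
```

Each year's list of `(tag, count)` pairs can be passed to whatever plotting or word cloud tool is used for the frames.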

ADD REPLY
0

Python has certainly gained popularity over Perl, but R dominates the tool ecosystem pyramid!

ADD REPLY
2

Not necessarily: we can conclude that people are most puzzled about R ;-)

ADD REPLY
0

Yeah. HaHa HaHa HaHa ;)

ADD REPLY
1

Since the most frequent tag is RNA-Seq and the programming language is R, my guess is that a lot of people are confused about how to run DESeq/edgeR :)

ADD REPLY
3
5.8 years ago

Here is some traffic data for the last five years:

  • 11 million users
  • 57 million pageviews

PS. The total number of posts (including comments/answers) per year would also be an interesting plot to make.

enter image description here

ADD COMMENT
0

11 million...? That is a lot! The population of the Republic of Ireland is ~4 million.

ADD REPLY
0

Turns out if the traffic were a "country", we'd be the 83rd most populous country, right between Greece and Bolivia.

ADD REPLY
0

Definitely the most popular general bioinformatics website on Earth!

ADD REPLY
2
5.8 years ago

The data is interesting, but does it answer the question "Which programming language is most frequently used in Bioinformatics"? Other possible factors:

  • R might be popular on this forum and other languages on other forums: for example, there are 628 co-tagged R and bioinformatics questions on Stack Overflow compared to 822 co-tagged Python and bioinformatics questions
  • Users may not tag the language the tool is written in: for example, bowtie2 is written mostly in C++, but it seems people don't use the language tag when asking a question about it
  • Method developers may not ask in the context of bioinformatics: developers of bioinformatics tools might not use Biostars or even tag bioinformatics in their questions; they may phrase their questions as algorithmic/programming problems and ask on Stack Overflow or elsewhere
  • People using R might need more help: R is perhaps more accessible to people coming from a non-programming background, so they may ask more questions. To quote Mick Watson from Twitter: "There are no [Stack Overflow] questions on Perl because every Perl programmer is 50+ and knows what they're doing"
ADD COMMENT
0

Devil is in the details, it seems. If only the peer review process and university ranking systems teased out the respective biases as you have done here.

ADD REPLY
2
5.8 years ago
JC 13k

Perl enthusiast here.

I know R and Python, but I'm always more productive with Perl. In general, for Python and R I need to Google (Stack Overflow, Biostars, Reddit, ...) how to do some things, but with Perl I rarely look for help.

Perl is more natural for me, and since text processing is the main task I generally need a script for, Perl is the best fit.

ADD COMMENT
0

I enjoy Perl too. I already follow you on GitHub and hope to share some Perl scripts for bioinformatics pipelines, @JC.

ADD REPLY
1
5.8 years ago

I made a chart with the total posts (question + answer + comment) for each year:

enter image description here

ADD COMMENT
0

Actually, the title should be "New Posts per Year".

ADD REPLY
1
16 months ago

Does anyone know the stats for 2022/2023? I'm very interested in the evolution of bash/shell Q&A topics.

ADD COMMENT
