Hi,
I have been planning to do this small data science project & finally got some time to actually do it.
I scraped Biostars data (question title, day asked & associated tags) from when the site went live until today (2009 - 23/02/2019) and asked some questions. Here is a brief overview:
1. Which programming language is most frequently used in Bioinformatics?
Based on the tags used with the questions, I looked at R, Python & Perl. The result is obvious, and the major contribution goes to the Bioconductor project.
2. Number of questions per year
This should, in a way, reflect the number of new researchers getting into bioinformatics / bioinformatics becoming an essential component of life science research.
3. Frequency of tags
This was already discussed here a few days back (I'm just posting it as the data is very new, but the results are mostly similar to those in that thread).
I will write a blog post with some additional analysis & share the code :)
P.S: Mods, if this thread doesn't fit into the category Blog, feel free to change / suggest an appropriate category.
Nice! I feel like the first graph might also be interesting to see in relative terms, by looking at whether the fraction of each programming language changes over the years.
Thanks Wouter. Do you mean the fraction of change compared to the previous year, or something else? This is a type of analysis I thought of but haven't done yet.
I'm not sure what I mean, but somehow you should take into account the popularity of Biostars. You could state that Perl always has about the same absolute number of questions, although relatively it lost massively.
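A minimal sketch of that relative view, assuming the scraped data is available as (year, tags) pairs. The function name, the record layout, and the toy records are all hypothetical and are not the code or data behind the plots in this thread. It divides the number of questions carrying each language tag by the total number of questions in that year, so growth of the site itself is factored out.

```python
from collections import Counter, defaultdict

# Language tags to track (lower-cased); an assumption about how the tags are spelled.
LANGUAGES = ("r", "python", "perl")


def language_share_by_year(questions):
    """Return {year: {language: fraction of that year's questions with the tag}}.

    `questions` is an iterable of (year, tags) pairs, a hypothetical layout
    for the scraped records.
    """
    totals = Counter()                   # questions asked per year
    lang_counts = defaultdict(Counter)   # per-year counts of each language tag
    for year, tags in questions:
        totals[year] += 1
        tagset = {t.lower() for t in tags}
        for lang in LANGUAGES:
            if lang in tagset:
                lang_counts[year][lang] += 1
    return {
        year: {lang: lang_counts[year][lang] / totals[year] for lang in LANGUAGES}
        for year in sorted(totals)
    }


# Made-up toy records, only to show the shape of the input.
toy = [
    (2010, ["perl", "blast"]),
    (2010, ["R", "microarray"]),
    (2018, ["R", "rna-seq"]),
    (2018, ["python", "biopython"]),
    (2018, ["perl"]),
    (2018, ["rna-seq", "alignment"]),
]
shares = language_share_by_year(toy)
```

In the toy records Perl has one question in each year, so its absolute count is flat, while its share of all questions halves because the yearly total doubles.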
Nice venu.
It would be good to see the distribution of the Tools flag with years or with applications.
Thanks Vijay. It may make sense to get an overview of Tools per year, to check how frequently new tools are being developed. But what do you mean by "with application"? Because each tool is flagged with more than one tag, it's extremely complicated to assign a one-word application to each tool programmatically.
By "application" I mean RNA-seq, WGS, whole transcriptome, etc. But I agree that it would be chaotic, as you mentioned.
Strange how the number of questions became 'saturated' from 2016. What could be the reason for that? Has the field, and everything it encompasses, already matured?
I'd guess many basic questions were asked over the previous years and answered very well, so with a simple Google search the first hits land on Biostars threads. Also, many tool developers are documenting their tools with clean examples and responding to user queries. It might be saturated in that sense, but the applications of the field are widespread and growing?
This progress might be one of the reasons but it's just my opinion.
It would likely require more in-depth analysis, but the number of truly unique topics/questions will show the opposite trend to the plot in #2. As @venu said, a large majority of questions likely have some pointers/answer(s) that already exist on Biostars or elsewhere.
Nice! This is "Evolution of Biostars" rather than history :-)
If you have the data parsed out, can you perhaps make animated/interactive gifs for the word cloud that walk through the top 100/50 terms for each year?
Aw, your title makes more sense.
Yes, I will try to make per-year frequency of tags (a good idea to see troubled topics per year :p).
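A rough sketch of the per-year tag frequency, again assuming the data is a list of (year, tags) pairs. The function name, the layout, and the toy records are hypothetical. Each year's list could then feed one frame of an animated word cloud.

```python
from collections import Counter, defaultdict


def top_tags_by_year(questions, n=50):
    """Return {year: [(tag, count), ...]} with the n most frequent tags per year.

    `questions` is an iterable of (year, tags) pairs (a hypothetical layout).
    Tags are lower-cased and counted at most once per question.
    """
    counts = defaultdict(Counter)
    for year, tags in questions:
        counts[year].update({t.lower() for t in tags})
    return {year: counts[year].most_common(n) for year in sorted(counts)}


# Made-up toy records, only to show the shape of the input.
toy = [
    (2018, ["RNA-Seq", "R", "DESeq2"]),
    (2018, ["rna-seq", "R"]),
    (2018, ["rna-seq"]),
    (2012, ["blast", "perl"]),
    (2012, ["blast"]),
]
top = top_tags_by_year(toy, n=2)
```

Ties between equally frequent tags come out in no particular order here, so a real run would probably want an explicit tie-break (for example, alphabetical) before plotting.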
Python has certainly gained popularity over PERL but R dominates the tool ecosystem pyramid!
Not necessarily: we can conclude that people are most puzzled about R ;-)
Yeah. HaHa HaHa HaHa ;)
Since the most frequent tag is RNA-Seq and the programming language is R, my guess is that a lot of people are confused about how to run DESeq/EdgeR :)