I was asked to give an introductory lecture about bioinformatics in cancer research, and I wanted to spend a slide or two comparing the pros and cons of coding (e.g. Perl, Python, R, Java, UNIX...) vs. using GUI tools (e.g. Galaxy, cBioPortal, etc.).
The reason is that most of the students are enrolled in a genetics program with little or no bfx knowledge, and they are "scared" of learning how to code, learning statistics, etc. So I wanted to explain that 1) bfx is not all coding, but there are a number of analyses that can also be done without writing lines and lines of code, and 2) although coding has a steeper learning curve, it is extremely powerful.
There are a few posts online that touch on the subject, but I wanted to hear what the thoughts were here and, besides the obvious reasons, what you think the most important messages to convey to grad students are.
Just to note that some applications do need a GUI, such as phylogenetic tree viewing/editing, alignment viewers (IGV), genome browsers, assembly viewers (consed and Bandage), network visualization, etc. For these, a well-thought-out and well-implemented GUI is essential.
Yes, here a GUI is essential. However, these tools still require upstream work that could suffer because of the limitations of other GUI/pre-canned analysis tools.
Well, CLI is essential; however, CLI tools still require upstream wet-lab work to generate data. It is not necessary for everyone to know everything. Occasionally, when you work in specific areas, even a GUI alone can be OK. That is how CLC etc. have survived for years.
I like Istvan's analogy!
Another limitation of GUIs (I'm thinking Galaxy) is that you're stuck with the older versions of software tools that are integrated into the interface. However, there are a number of fairly routine workflows (differential gene expression, ChIP-Seq peak calling) where the limited GUI vocabulary may suffice.
The problem with routine workflows is that they can only solve "routine" problems - and it is almost impossible to tell beforehand whether a problem belongs to a new class or is the same old one.
But some problems ARE routine, and it's not impossible to anticipate the outcome. For example, if the goal is to identify a list of mouse genes whose expression levels change the most in response to drug treatment, then a Galaxy workflow of Bowtie/Tophat/Cufflinks/CuffDiff would be adequate. Sure, there are more sensitive/sophisticated/powerful/flexible tools for the job, and this pipeline is likely to miss some candidates, but that may be okay for the user.
Well, years ago we had the paper that proved that RPKM is inconsistent across samples; see:
http://blog.nextgenetics.net/?e=51
Three years later, people still use RPKM because that's what Cuffdiff implements. Now obviously every routine analysis using Cuffdiff will be wrong, because the units themselves are badly defined. That is before considering the actual biology or the many confounding factors. The units themselves are incorrect; how absurd is that? The question is: how wrong are they? It all depends on the diversity of transcripts: if there are many new transcripts, the values are fatally wrong. If there are no new transcripts, RPKM will work. So now the validity of the routine analysis depends solely on the number of transcripts that are expressed in only one condition.
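For illustration, the units problem can be sketched in a few lines of Python. Everything in this example is made up: the gene names, lengths, and read counts are hypothetical toy values, not data from any real experiment. TPM is included only as a point of comparison; that is an assumption of this sketch, since the thread itself only discusses RPKM. The two samples differ in a single way: geneC is expressed only in the second one.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {g: c * 1e9 / (lengths[g] * total_reads) for g, c in counts.items()}


def tpm(counts, lengths):
    """Transcripts per million: length-normalise first, then rescale to 1e6."""
    rates = {g: c / lengths[g] for g, c in counts.items()}
    rate_sum = sum(rates.values())
    return {g: r * 1e6 / rate_sum for g, r in rates.items()}


# Hypothetical transcript lengths (bp) and read counts for two samples.
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 500}
sample_1 = {"geneA": 100, "geneB": 200, "geneC": 0}    # geneC not expressed
sample_2 = {"geneA": 100, "geneB": 200, "geneC": 300}  # geneC only here

rpkm_1, rpkm_2 = rpkm(sample_1, lengths), rpkm(sample_2, lengths)
tpm_1, tpm_2 = tpm(sample_1, lengths), tpm(sample_2, lengths)

# Compare the per-sample totals of each unit.
print("RPKM totals:", sum(rpkm_1.values()), sum(rpkm_2.values()))
print("TPM totals: ", sum(tpm_1.values()), sum(tpm_2.values()))
```

In this toy case the RPKM values sum to a different total in each sample, so the same RPKM value corresponds to a different share of the transcript pool depending on the sample. The TPM values sum to one million in both samples by construction.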
Istvan, I'm aware of the limitations of RPKM (which is one of the reasons I don't use the Tuxedo package). I also agree wholeheartedly that CLI is preferable to GUIs for the many reasons cited. But the example I gave still holds. The mouse transcriptome is well-studied, so it's highly unlikely that drug treatment will produce a host of novel transcripts. Despite its flaws, RPKM would identify some subset of the most differentially expressed genes. If that's the user's only objective, I don't see the problem.
Just to clarify - it is not about novel transcripts - the problems arise when there are transcripts or isoforms that can be found in one sample but not the other.
I don't disagree that pipelines "work" - it is just never clear how well they do and when they cross from "kind of right" to "no, that's obviously not right". The more automated and "routine" a process, the less likely one is to investigate it (but this is true regardless of the approach, command line or GUI).
I meant 'novel' in the sense of 'unique to one sample', which is the condition that you describe.
And I strongly agree that the user needs to understand the tool, be it CLI or GUI. Caveat emptor.
You are never stuck in Galaxy. It's open source, and it's pretty easy to update tools or ask the large community to update them. Actually, for a few tools we have wrappers before the paper comes out, because more and more people are talking to the Galaxy community and contributing to it during the publishing process.
I guess this is just a matter of time and priorities. If someone spends the time compiling a new version of tool X and integrating it into their own Makefile rather than into Galaxy, it will take longer for all of us :)
Sorry, I should have specified the public Galaxy site at PSU. Given that the OP was addressing a class with little/no CLI expertise, I assumed that updating the tools would be beyond their skill level. Plus, I'm fairly certain that you can't update Galaxy using only the GUI...
Note that this post is in no way a criticism of Galaxy. I think it's a very useful suite of tools, it lowers the activation barrier for learning bioinformatics, and the automatic tracking of workflows is a strong selling point.
You can update Galaxy tool versions using only the GUI (and Björn provides a Docker container with Galaxy that makes the setup vastly simpler) :) Granted, you only get what's in the Tool Shed, but that's sufficient 99% of the time.
Thanks for the clarification, Devon. I should have been more precise. By updating, I was referring to Björn's comment about writing wrappers for the latest versions of tools. I consider the Tool Shed part of Galaxy proper, so of course it's possible to use the GUI to access versions contained there.