Hi everyone,
How can a bioinformatician ensure the reliability, quality, and reproducibility of his/her work and analyses?
Thank you
Write a GNU makefile and publish it with a manifest of all the software used and their versions, along with all the inputs used to do the analysis. Use open-source software where the source code is freely available and there is a history kept of all changes to source in an open repository.
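As a minimal sketch of what that can look like (the reference, read files, and tool versions below are placeholders, not a recommendation):

```
# Makefile -- reruns the whole analysis from the raw inputs.
# (Recipe lines must be indented with tabs.)
# Software manifest -- placeholder versions for illustration:
#   bwa 0.7.17, samtools 1.9

REF   = ref/genome.fa
READS = data/sample.fastq.gz

.PHONY: all manifest
all: results/sample.sorted.bam manifest

results/sample.sorted.bam: $(REF) $(READS)
	mkdir -p results
	bwa mem $(REF) $(READS) | samtools sort -o $@
	samtools index $@

# Record the versions actually used, next to the results
# (add one line per tool in your pipeline).
manifest:
	samtools --version | head -n 1 > results/MANIFEST.txt
```

Anyone with the inputs can then rerun the entire analysis with a single `make`.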
While fancier options have been suggested elsewhere in this thread, they may only be usable by fellow bioinformaticians. If you want to make your work generally accessible (to almost anyone), then meticulous documentation may be sufficient. No detail should be considered too small or insignificant to include. Have a couple of people go through the worksheet to see if they can reasonably understand and follow your documentation and reasoning.
This seems timely: https://www.biorxiv.org/content/early/2017/10/10/200683
BTW, you'll need to open the PDF in a proper PDF viewer; the one built into Firefox won't render it nicely.
And certainly, don't forget to
Something that comes to mind (in a bioinformatics context) is that you could be flexible in your choice of programming language. If the rest of the team uses Java, you learn to use Java as well.
Except if it's Perl, then you make everyone change to Python.
Teamwork could also mean that you understand how collaborative programming with git works - using branches, pull requests, code review, ...
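A rough sketch of that kind of workflow on the command line (the branch and file names here are made up for illustration):

```
# create a feature branch, commit your work, and push it for review
git checkout -b add-coverage-plot
git add plot_coverage.R
git commit -m "Add per-sample coverage plot"
git push -u origin add-coverage-plot
# then open a pull request on your git host (GitHub, GitLab, ...)
# so a colleague can review the change before it is merged
```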
If you asked me that, I'd tell people I know how to pick the right tool for the right job and combine them to create a solution. If Python fits somewhere, I'd use it. If R does something really well, I'd use that. If Excel does the task better than other tools, I will not hesitate to use that either.
At any point in time, I am working with multiple senior scientists in my lab on various projects, as well as heading my own projects. <add specific examples here>. I balance priorities and get everyone's projects moving forward at a good pace.
When people ask you that, you always give specific examples. Start off with generic statements, but drill down to specifics and give details based on how people respond.
I'd like to mention git-lfs for large files (BAMs, VCFs, ...), which are very typical of bioinformatics pipelines and cannot be handled by plain git (https://git-lfs.github.com/). This applies more to analysis than to software development.
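For example, tracking BAMs with git-lfs looks roughly like this (the file pattern and path are only illustrative):

```
git lfs install                    # one-time setup per machine
git lfs track "*.bam"              # store BAMs via LFS, not plain git
git add .gitattributes results/sample.bam
git commit -m "Track BAM files with git-lfs"
```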
Publishing inputs is probably most important if you're publishing performance test results. When you read claims that a tool is faster or more accurate than existing tools, that can just as easily be a narrow consequence of test inputs that happen to favour that toolkit as of how the tests were performed. A makefile documents the latter, but cannot address the former concern.
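If you do publish inputs, a checksum manifest makes them verifiable (the file paths here are placeholders):

```
sha256sum data/*.fastq.gz ref/genome.fa > INPUTS.sha256
# a reader can later confirm they have byte-identical inputs with:
sha256sum -c INPUTS.sha256
```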
My Scottish friend / colleague Mick Watson gives good advice here: http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html?foxtrotcallback=true
(I'm Irish ... the 'same' as being Scottish).
By the way, in one interview I was asked whether I wanted to be a biologist or a computational biologist. Keeping in mind that I don't know programming, I replied: a biologist who knows many things in bioinformatics. However, I was rejected. In this article I read that a computational biologist does not necessarily need to be a programmer. A really encouraging article, thank you.
I think that your response was good, but maybe the employer wanted a more definitive answer.
There is still a lot of misunderstanding about what a bioinformatician (or computational biologist) does. While this misunderstanding exists, you will always find employers with varying opinions on what you should be doing. As you get more senior, they eventually entrust you with all sorts of things covering statistics and simple data analyses, and you'll be expected to understand the biology too (or pick it up quickly).
I have given a few presentations on bioinformatics in the past, and I have always made the argument that everyone is a bioinformatician on some level: if they analyse any type of biological data, then that's bioinformatics at a fundamental level.
Solid advice, but I guess any other workflow (such as snakemake) would do? Or are there important differences?
In fact, since Snakemake has built-in conda integration, I wonder if it's generally preferable over standard makefiles when it comes to reproducibility.
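For reference, that integration looks roughly like this: a rule points at an environment file, and running with `--use-conda` builds the environment automatically (the tool names and file paths below are illustrative only):

```
# Snakefile (sketch)
rule align:
    input:
        "data/sample.fastq.gz"
    output:
        "results/sample.sorted.bam"
    conda:
        "envs/align.yaml"   # pinned bwa/samtools versions live here
    shell:
        "bwa mem ref/genome.fa {input} | samtools sort -o {output}"
```

Run with `snakemake --use-conda` and the pinned environment is created and used for that rule.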
I'm sure snakemake is a great tool, but GNU make has been around for literally decades, and Python has, in my experience, fragility issues that GNU tools very rarely introduce. Python is good for smaller, one-off analyses, like Perl. But from a reproducibility standpoint, suppose I write a generic script and a major version release breaks backwards compatibility that I have to troubleshoot, or I have to wait days or weeks for a sysadmin to find time to reconstruct the exact combination of scipy, numpy, Python, OS kernel, etc. on our cluster so that they work together without API issues and other errors. Then I'd ask whether I could easily reproduce the exact environment of an analysis down the road, without significant debugging and testing effort on my part or on the part of others. I'm sure people are making inroads so that these won't be issues in 10-20 years, but, respectfully, I'm honestly not sure we're quite there yet.
Sounds reasonable indeed - thanks for your insights!
I guess most, but not all of these issues can be solved by using virtual (conda) environments, no?
For single-workstation environments controlled by a technically proficient end user, I'm sure that is easier to manage. But it is still effort to reproduce that environment and make sure it works. If you have a clustered environment, you need a sysadmin to manage the specific versions of dependencies required to make it all work, and you need a way of deploying analyses to these virtual environments that is easy for others to reproduce.
If the question is about reproducibility, then I think simplicity is attractive and fragility and complexity are things to avoid, as a general philosophy.
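For reference, the conda-environment approach mentioned above usually means shipping a fully pinned environment file (the packages and versions below are only an example):

```
# environment.yml -- example with pinned versions
name: myproject
channels:
  - bioconda
  - conda-forge
dependencies:
  - python=3.6.3
  - numpy=1.13.3
  - samtools=1.9
```

Anyone can then rebuild it with `conda env create -f environment.yml`, though as noted above this still assumes conda itself, the channels, and those package versions remain available.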
When you start using clusters, it becomes easier but more specific. For example, our cluster uses `modulecmd`, as in `module load`, `module list`, etc. For each module there exist multiple versions, with one of them being the default that is loaded when no version is specified with the `module load` command. When I document a script, I explicitly use a version number even when I am using the default version. I also document my script and specify the dependencies in the text, so anyone who has to drill down to that level gets that information without needing to figure out which log files to read through and which tricks to use to deduce it from said log files.

Modules are great and we use them. Adding a specific version number to the `module load` command is a great tip. Using `module purge` is also a good way to "clean the slate" before running a pipeline. Our developers have been bitten by committing code that works locally, because they loaded a particular version of a module into their development environment, and their code then fails in production when the rest of us use it with other versions of that module. Purging modules can help keep this from happening. Another complication is that modules can have dependencies, such that loading a module fails if another module has not been loaded first. This can be specific to the lab's setup of these software packages.

But to some extent modules are self-documenting and can be a great way to run and compare multiple versions of tools.
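Putting those tips together, the top of a pipeline script might look something like this (the module names and versions are placeholders; they depend entirely on your cluster's setup):

```
#!/bin/bash
set -euo pipefail

module purge                 # start from a clean slate
module load gcc/4.9.2        # load explicit versions, never the default
module load samtools/1.9
# record what was actually loaded (module list prints to stderr
# in many Environment Modules implementations)
module list 2> loaded_modules.log
```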
I agree, `module purge` is a (scary-sounding) great way to start with a clean slate. True, modules have dependencies, but it's either trust the HPC folks not to remove modules, or record all the ENV variables and the PATH changes manually. In fact, one of the ways I did this was by creating my own modulefiles (which loaded from the master modulefiles) so I did not have to do a bunch of `module load`s each time.
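As a sketch of that trick (the file path and loaded modules below are hypothetical, and the details depend on your site's Environment Modules flavour), a personal modulefile can bundle the loads:

```
#%Module1.0
## ~/modulefiles/myproject/1.0 -- loads everything this project needs
module load gcc/4.9.2
module load samtools/1.9
module load bcftools/1.9
```

After a one-time `module use ~/modulefiles`, a single `module load myproject/1.0` pulls in the whole set.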
Sorry, how can a bioinformatician demonstrate that he/she is flexible, with a strong teamwork contribution and documentation ability?
Also, this is not your principal question - edit your top-level post if you have multiple questions. You started off with reproducibility (a technical skill) and are now moving on to flexibility and teamwork (behavioural skills). What exactly do you want to know?
For example, while this recent question is not about snakemake, I think it suggests why keeping pipelines simple and as free of version dependencies as possible helps deal with the issue of fragility: Working on an old pipeline, need to gain access to a specific version of plinkseq