I want to ask your opinions on Nextflow and Snakemake. I'm not in a bioinformatics lab, so I've taught myself each piece of software needed for every step of an analysis. I build my pipelines step by step; for instance, I use FastQC for QC, Trim Galore for trimming, and so on.
However, when looking at job opportunities in bioinformatics, they ask for experience with Nextflow or Snakemake. As an experiment, I ran an ATAC-seq analysis both with my own hand-built pipeline and with Nextflow. Still, I'd like to hear your views on the pros and cons of using Nextflow or Snakemake versus writing self-made pipelines to analyse data.
The main advantages of a pipeline tool like Nextflow are well documented and discussed on Nextflow's website.
Personally, I enjoy using Nextflow for a variety of reasons:
Scalability is automated, so no more fiddling with variable names in different shell scripts to parallelise.
It helps with pipeline construction, as you have to plan the workflow out at the start, so I am now better at explaining to colleagues what my pipeline does.
It takes pressure off me when deadlines are tight, as there is less babysitting.
The Nextflow community is great! I've only had one problem I haven't been able to solve through the Nextflow Slack channel or the forums.
Setting up GCP/AWS executors was easy, as the Nextflow team has done most of the heavy lifting for you.
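To make the scalability point concrete, here is a minimal DSL2 sketch (the file glob and process name are made up for illustration): every FASTQ file matching the channel becomes its own task, and Nextflow schedules them in parallel on whatever executor you configure, with no manual loops or job-array bookkeeping.

```nextflow
// Minimal DSL2 sketch: one FASTQC task is created per input file,
// and Nextflow parallelises them automatically on the chosen executor.
process FASTQC {
    input:
    path reads

    output:
    path "*_fastqc.zip"

    script:
    """
    fastqc ${reads}
    """
}

workflow {
    // Each file matching the glob is emitted as an independent channel item
    reads_ch = Channel.fromPath('data/*.fastq.gz')
    FASTQC(reads_ch)
}
```

Switching from a laptop to an HPC scheduler or the cloud is then a matter of changing the executor in the config, not rewriting this logic.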
I'm not in a bioinformatics lab either. In fact, I've been the only bioinformatician, or one of two, in the last 3 labs I've worked in. Using Nextflow has made life a lot easier in all of them.
The only cons for me are to do with setting up GPU jobs, as CUDA gets fiddly in my experience, but this isn't a problem exclusive to Nextflow.
I can't comment on Snakemake, but I suspect it's a similar community.
It is definitely a good idea to pick up either Nextflow or Snakemake. Each has its pros and cons; learn the one you are most comfortable with (or the one your colleagues are using).
I'd add: look at what the jobs you are interested in actually use. Nextflow is more widely used in my field, so learning Snakemake might not be the best investment there.
To the question: you can write whole pipelines in bash/Python etc. (been there, done that), but you still need to take care of the dependencies yourself. Nextflow lets you flip between Conda and Singularity with a couple of lines of configuration, which is not nearly as easy in bash. You also don't need to change your commands depending on where you run the pipeline, which you do if everything lives in bash scripts. Not to mention wanting to run on the cloud.
Nextflow plus Singularity containers is a winning combination on HPC. Conda can also be used easily.
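The "couple of lines" switch mentioned above can live in `nextflow.config` as profiles. A sketch, assuming a hypothetical `environment.yml` and an example container image; the profile names are arbitrary:

```nextflow
// nextflow.config sketch: pick the dependency backend at launch time with
//   nextflow run main.nf -profile conda
// or
//   nextflow run main.nf -profile singularity
profiles {
    conda {
        conda.enabled = true
        process.conda = 'environment.yml'   // hypothetical Conda env file
    }
    singularity {
        singularity.enabled = true
        singularity.autoMounts = true
        // example public image; substitute whatever your processes need
        process.container = 'docker://biocontainers/fastqc:v0.11.9_cv8'
    }
}
```

The pipeline code itself stays untouched; only the `-profile` flag changes between your laptop, the HPC, and the cloud.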
Agree with most comments here. However, GPU via Nextflow has worked superbly for me, with no issues at all. But yes, at the system level GPU is harder to set up and maintain than CPU, even with containers. Non-Nextflow GPU job submission has not been as problem-free in my environment.
If you don't adopt a pipeline framework, you will end up reinventing the wheel, and your implementation will be much worse.
If you've given them an honest go and you still don't like Snakemake or Nextflow, there are at least 120 other pipeline frameworks to choose from, many of which are better suited to machine learning, real-time events, or non-file outputs, or offer stronger typing.
I agree with everything here. Learning a pipeline manager saves a lot of headaches and time when analysing large data sets with multiple tools. I have never used Snakemake, so I am biased towards Nextflow, but those I know who use Snakemake love it, so which one you pick really comes down to personal preference.

For me, building Nextflow pipelines from scratch was a bit of a learning curve, as it is built on Groovy, which I had no background in. However, it didn't take long to learn the basics needed to write Nextflow pipelines.

Another great thing about Nextflow is the documentation and support. There is a huge online community, including a Slack channel and nf-core, an online collection of curated pipelines ready for you to use. So there is a good chance that if you need a Nextflow analysis pipeline, it has either already been written for you, or at the very least there are premade modules you can use to build a custom workflow.
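For instance, launching an nf-core pipeline is a one-liner (a sketch; check the pipeline's own documentation for the current release tag, the required columns of the samplesheet, and the supported genome keys, as these flags reflect nf-core conventions rather than anything guaranteed here):

```shell
# Sketch: running a curated nf-core pipeline instead of writing your own.
# Pin a release with -r for reproducibility; samplesheet.csv is your
# own input sheet in the format the pipeline documents.
nextflow run nf-core/atacseq \
    -profile singularity \
    --input samplesheet.csv \
    --genome GRCh38 \
    --outdir results
```

Swapping `-profile singularity` for `docker` or `conda` changes the dependency backend without touching anything else.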
If you care about scalability, reproducibility, and data provenance (and if you're doing anything remotely serious, you should), use Nextflow or Snakemake. They're not the only tools available, but they are the leading ones aimed squarely at bioinformatics (though both can be used outside it).
I'm a Nextflow user, and the only regret I have is not having heard of it earlier in my career :)
Thanks everyone for your quick replies! I'm definitely giving it another try, as it seems to be the best approach to improve my analyses!