Hello community,
I work in pathogen detection, and we need to find the best solution for storing and sharing data and running workflows. We are on an Azure cloud, but I don't know if it is the best option.
Which is the best cloud solution (GCP, Terra, Azure, AWS, ...)?
I would also like to know whether it is possible to install our bioinformatics tools in a directory and run our own pipelines.
For example, to run a wgMLST analysis I need tools like FastQC, Trimmomatic, SPAdes, chewBBACA and GrapeTree, or to install them with Miniconda3 in different environments. I would like them in a specific directory so my script can reach them. Is something like this possible on a cloud, or do I need to run them with Docker?
Thank you
I can't comment on clouds, but for software and pipeline sharing Docker is great. You build the software in a fixed environment (say Ubuntu) once and are done with it. No dealing with specific environments when you change machines or clouds. Any Docker image can easily be converted to Singularity if Docker is not available on the system. Pair it with Nextflow and you have a powerful combination.
Thank you for reaching out
Do I need to create a Docker image for each tool, or can I simply set up one Ubuntu image that contains all the tools I need?
You can build all the tools you need into a single container, even with conda if you like. See: Building Docker images
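For illustration, a single-image Dockerfile along those lines might look like this (tool list and versions are just examples, not a vetted recipe):

```dockerfile
# Sketch: one image containing several tools, installed with conda
# into the base environment (channels/tools below are assumptions)
FROM continuumio/miniconda3:4.12.0

# bioconda packages depend on conda-forge, so list both channels
RUN conda install -y -c conda-forge -c bioconda \
        fastqc \
        trimmomatic \
        spades \
        chewbbaca \
    && conda clean -afy
```

All tools then end up on the `PATH` of the container's default environment, so a script can call them directly.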
Thank you it is very helpful!
You can also just pull BioContainers, each of which typically contains a single bioinformatics tool. I pull them as Singularity images, store them, and reference and use them (via a trivial one-line container path setting) from a Nextflow pipeline. That works very well, but (as with Docker) you need root access to build a Singularity or Docker container.
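As a sketch of what that one-line setting can look like in a Nextflow configuration (the process name and the `.sif` path are made-up examples):

```groovy
// nextflow.config -- point a process at a stored Singularity image
// (the image path below is hypothetical)
singularity.enabled = true

process {
    withName: 'FASTQC' {
        container = '/path/to/containers/fastqc.sif'
    }
}
```

Nextflow then runs that process inside the referenced image, with no manual `singularity exec` needed.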
I would always build on a local machine, then you can pull or convert to Singularity on any machine without root.
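In practice that workflow can look roughly like this (image and registry names are placeholders):

```shell
# Build the Docker image once, on a machine where you have root
docker build -t mytools:latest .

# Later, on any machine with Singularity but without root,
# pull/convert it from a registry you pushed it to:
singularity pull docker://myuser/mytools:latest
```

The `singularity pull docker://...` step converts the Docker image to a Singularity image file on the fly.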
Yes, that might be a good solution. I think it will be easier to deal with several little containers instead of one big one, since I am a beginner in the field.
I'll try that
Thank you for the tip!
Hi, I tried to build an image which contains all the tools I need, like you told me. But I have an issue: when I install a tool, for instance FastQC, I cannot reach it in my container (I used the find command and it didn't find it). And when I create a specific environment for the tool, the tool works when I run it inside the container, but when I try to run it from my Nextflow script, the script cannot activate the environment.
Could you help me please?
My Dockerfile:

FROM continuumio/miniconda3:4.12.0
RUN conda install -c bioconda fastqc

or, with environment creation:

FROM continuumio/miniconda3:4.12.0
RUN conda create -n fastqc -c bioconda fastqc
nextflow script:
Judging by the Nextflow script you posted alongside your Dockerfile, it seems your custom pipelines are written in Nextflow?
In that case, I wonder why you are reinventing the wheel instead of using the nf-core modules? While I agree that rewriting existing pipelines to use modules and the DSL2 syntax is a daunting task, I think it will pay off soon, figuratively and also in terms of a more efficient cloud deployment. The nf-core modules exclusively use containers, so there is no need to manually ensure that all required software is installed.
I can't speak to the Azure cloud, but for Nextflow on AWS I can recommend this manual by Kelsey Florek.
Because I am a beginner in Nextflow and Docker, I think I will learn better if I create my own script, you know, for practice, to understand how it works. But I agree with you: after all that learning, it will save me time to use pipelines that people more talented than me have created.
Thank you Matthias for your answer and for the manual, it will be very helpful
I agree that just applying ready-made pipelines without understanding what they do is bad practice. This also applies to the wet lab when using kits: it's nice to save some hassle, but one should still understand what the buffers A, B and C actually are and do.
This is why I pointed out the modules and DSL2 syntax to you. You can still write your own pipeline and retain full control over the analysis steps and software used, while much of the hassle of managing and containerizing the tools is abstracted away from you. Even if you never plan to contribute your pipeline to nf-core, you can still use their tooling to create a new pipeline skeleton and to easily use any of almost 600 tools inside your pipeline, and it can even help you create multi-tool containers. On top of that, there is a nice community and excellent support on Slack.
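For a rough idea of that tooling (these are nf-core CLI commands; the exact prompts and options depend on your installed version):

```shell
# Create a new pipeline skeleton from the nf-core template
nf-core create

# Inside the new pipeline repository, pull a ready-made,
# containerized module for a tool you need
nf-core modules install fastqc
```

The installed module already declares its container, so the software-management side is handled for you.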
Good to know! Since I am learning the field on my own, it will be helpful to have a community to ask for help, and good modules to use.
Thank you!
script:
"""
source /opt/conda/etc/profile.d/conda.sh
conda activate fastqc
/opt/conda/envs/fastqc/bin/fastqc -q ${reads} -t ${task.cpus}
"""
conda is activated in these containers by default, so just leave out the environment creation part and install right into the env that is already activated.
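Concretely, that simplification might look like this (a sketch based on the snippets in this thread):

```dockerfile
# Install straight into the already-activated base environment --
# no `conda create` at build time, no `conda activate` at runtime
FROM continuumio/miniconda3:4.12.0
RUN conda install -y -c conda-forge -c bioconda fastqc
```

and the Nextflow script block then shrinks to:

```groovy
script:
"""
fastqc -q ${reads} -t ${task.cpus}
"""
```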
It's working!
Thank you very much!