Hello,
Snakemake suggests the following structure for the project directory:
I have a few questions when using snakemake together with conda.
For a project called project A, I create a directory with the suggested structure in the image above. My questions are:
1. If I have a conda environment called conda_project_A, where should I create this environment? Basically, where should I put the envs and pkgs directories of the environment if I don't want them to be in .conda in my home directory? Should I create them in the project A directory, as a hidden directory called .conda_project_A containing the envs and pkgs directories?
2. In the example, the envs directory only contains yaml files. So, for example, I create the conda_project_A environment with a yaml file located in envs called conda_project_A.yaml. The yaml file contains Python and snakemake as dependencies. Then, for each tool that I want to use, I add a new yaml file, such as fastqc.yaml or bwa.yaml, under envs and install them all in the conda_project_A environment. Is that the right way?
I am just looking for best practices in terms of reproducibility so I'd appreciate any advice. Thanks!
+1, but I remain unconvinced that you want a conda env for each tool or even for each rule. Doesn't that create a mess of tiny environments? And if you want to experiment and run steps outside snakemake, you have to keep switching envs. So far I have been happy with this strategy:
- Create a fresh conda env when you start a new project.
- In a requirements.txt file, add dependencies with their full version number as you go along.
- When you need a new dependency, update that file and run mamba install --file requirements.txt. Occasionally you may need to delete the env and recreate it from requirements.txt.
- Only when you hit a real incompatibility, create a dedicated env.
Other things: use mamba (I have more or less forgotten about conda), and it is better to leave the base env alone.
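For illustration, the single-env strategy above might look like the sketch below. The env name project_A, the channel list, and the pinned versions are placeholders I made up, not recommendations. The two functions are only defined here, so nothing is installed until you call them.

```shell
# Hypothetical pinned dependency list; names and versions are illustrative only.
cat > requirements.txt <<'EOF'
python=3.11.6
snakemake-minimal=7.32.4
samtools=1.18
EOF

# Report any non-comment line that has no explicit version.
unpinned=$(grep -v '^#' requirements.txt | grep -v '=' || true)
if [ -n "$unpinned" ]; then
    echo "Unpinned dependencies: $unpinned"
fi

# Create the project env once (env name project_A is a placeholder).
create_env() {
    mamba create --yes --name project_A \
        --channel conda-forge --channel bioconda \
        --file requirements.txt
}

# Re-run after every edit of requirements.txt.
sync_env() {
    mamba install --yes --name project_A --file requirements.txt
}
```

Call create_env once at the start of the project and sync_env whenever the file changes. If the env drifts too far, delete it and call create_env again.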
How large or restricted the environment(s) need to be is very situational and a matter of taste. It may be possible to use a single environment for the whole workflow if the tools are compatible in their dependencies, e.g. samtools and hisat. If, however, you have many different tools with diverging dependencies (e.g. Python versions), or have to install non-conda packages into them, starting a new environment for each is better. The downside is that this wastes a lot of space and time, which is a general downside of conda environments, so it is important to strike a balance. Relative to the amount of time and space needed for the data and analysis, it is still a negligible factor for me.
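As a sketch of the per-tool variant: the snippet below writes one small env file and one rule that points to it through Snakemake's conda: directive. I am assuming a workflow/Snakefile plus workflow/envs/ layout. The rule name, file paths, and the samtools version are hypothetical.

```shell
# Layout assumed here: workflow/Snakefile plus workflow/envs/*.yaml.
mkdir -p workflow/envs

# One small env file per tool; the version pin is illustrative only.
cat > workflow/envs/samtools.yaml <<'EOF'
channels:
  - conda-forge
  - bioconda
dependencies:
  - samtools=1.18
EOF

# Hypothetical rule; the conda: path is resolved relative to the Snakefile.
cat > workflow/Snakefile <<'EOF'
rule sort_bam:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"
EOF
```

Several rules can name the same yaml file, so one environment can serve many rules, or even the whole workflow, when the tools are compatible.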
Also, the OP asked if software could be shared across environments in a single rule, which I think is not advisable, even though you can find out more about your conda environments inside Snakemake using the $CONDA_... variables. Remember also that the conda environments Snakemake creates are placed inside the workflow working directory and do not follow the naming pattern of the ones in your home directory; they get generated names instead. Trying to access these names in the workflow code, or trying to use software from the conda install in your home directory, will render the workflow non-portable.
I see, this is very helpful. Where do you place the environments? For example, here, where do you create the samtools environment? And snakemake itself, does it have its own environment? And finally, say I am running some Python code in a Snakemake rule: do you then use the Python from another environment, or from an environment of its own? Thank you!
Another question: I see here you've used micromamba. Do you recommend this over conda and mamba?
You should almost always make a completely fresh environment and never try to cross-use programs from different environments (they have random-looking folder names, so you won't succeed with that anyway). You can use the same environment for multiple rules, or even for the whole workflow, if that is possible. You need to have snakemake either in your base environment, in a separate environment you execute your workflow from, or installed otherwise. The environments you create for rules in snakemake do not need to contain snakemake. I am using micromamba because it is the fastest at solving the environment. In snakemake, mamba is now the default and I would leave it at that; micromamba is not supported by it yet. The resulting environments should be identical whichever of them you use.