Motivation
I started this tutorial because I spent too many hours trying to install the dependencies of GATK4 on a remote server without root privileges.
GATK is one of the most common tools for calling variants.
However, documentation of the tool and support for installation on other platforms are limited, and the R and Python dependencies are hard to install.
- If you have access to Docker on your remote server, great, you don't need this. Just follow the instructions for using the container from GATK.
- If you have root privileges on your remote server, great, you don't need this either. Some of the steps here will be overly complicated; just use conda on your remote server.
- If you (like me) have neither Docker access nor root privileges on the remote server, I hope this tutorial solves some of the problems you encounter and saves you many hours of googling.
Principles
You will need to know how to use Docker and its containers before using this tutorial. You will also need to know how to use conda to manage virtual environments.
I will create a Docker container with the GATK dependencies, using conda as the package manager, then export that environment with conda-pack and upload it to the remote server (where you do not have root privileges to install dependencies).
Note: The container released by the Broad Institute, which can be pulled with: docker pull broadinstitute/gatk:latest
sadly cannot be used to install conda-pack and export the environment out of the container.
Step 1
Download the latest version of GATK from GitHub to your working directory. At the time of writing: gatk-4.2.5.0
cd /path/to/dev/dir/
wget https://github.com/broadinstitute/gatk/releases/download/4.2.5.0/gatk-4.2.5.0.zip
# the archive already contains a top-level gatk-4.2.5.0/ directory, so unzip it in place
unzip gatk-4.2.5.0.zip
cd gatk-4.2.5.0
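# Create the requirement files for step 2.
# This is a workaround because `conda env create -f gatkcondaenv.yml` did not work for me directly.
# environment1.yml: a minimal environment with only the pinned python (plus ipython)
printf "name: gatk\nchannels:\n- conda-forge\n- defaults\ndependencies:\n- python=3.6.10\n- ipython\n" > environment1.yml
# requirement.txt: the remaining conda packages. This drops the first three "-" entries
# (channels and the python pin, already covered by environment1.yml) and the last one (the pip section header)
grep "^-" gatkcondaenv.yml | sed 's/- //; s/ .*//; $d; 1,3d' > requirement.txt
# environment2.yml: the pip section (last two lines of gatkcondaenv.yml); the \n in the replacement needs GNU sed
tail -n 2 gatkcondaenv.yml | sed '1s,^,dependencies:\n,' > environment2.yml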
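To see what the commands above produce, here is a minimal sketch that runs the same grep/sed/tail pipeline on a small mock file. The file mock_gatkcondaenv.yml and its pinned packages are made up for illustration; the real gatkcondaenv.yml is much longer, and I am assuming it follows the same layout (two channel entries, the python pin first, and the pip section as the last two lines). GNU sed is assumed.

```shell
# Work in a scratch directory so nothing in the gatk folder is overwritten
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical miniature of gatkcondaenv.yml (illustration only)
cat > mock_gatkcondaenv.yml <<'EOF'
name: gatk
channels:
- conda-forge
- defaults
dependencies:
- conda-forge::python=3.6.10   # pinned python, handled by environment1.yml
- pip=20.0.2                   # trailing comments are stripped
- conda-forge::numpy=1.17.5
- pip:
  - gatkPythonPackageArchive.zip
EOF

# Same pipeline as above, pointed at the mock file
grep "^-" mock_gatkcondaenv.yml | sed 's/- //; s/ .*//; $d; 1,3d' > requirement.txt
tail -n 2 mock_gatkcondaenv.yml | sed '1s,^,dependencies:\n,' > environment2.yml

# Inspect the results
cat requirement.txt
cat environment2.yml
```

On this mock input, requirement.txt holds only the pip and numpy pins, and environment2.yml holds a dependencies: line followed by the two pip lines. If your own requirement.txt still contains a channel name or the python pin, the layout of your gatkcondaenv.yml differs and the `1,3d` range needs adjusting.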
Step 2
I use Docker to build an image based on Ubuntu 18.04 (the same OS as the container provided by GATK) >> install Miniconda3 >> install the other R and Python dependencies of GATK4.
Sadly, the instruction from the Broad Institute for installing the dependencies: conda env create -n gatk -f gatkcondaenv.yml
did not work for me. The conda process was cancelled while "solving environment", and no new environment was created.
So I looked for a workaround. First, create a new environment with python=3.6.10 as its Python. Then install the other dependencies of GATK into that environment with conda install --file requirement.txt
(requirement.txt is converted from the gatkcondaenv.yml file provided by GATK).
Below is the Dockerfile that I used to build my image with docker build -t your_account/gatk:4.2 -f Dockerfile .
The . at the end is your build context (the directory you are building the image from). It must contain the files that the COPY lines refer to, so run the build from the gatk-4.2.5.0 directory of step 1. After that, you should have a Docker image named your_account/gatk:4.2
# docker pull ubuntu:18.04
# docker build -t your_account/gatk:4.2 -f Dockerfile .
FROM ubuntu:18.04
WORKDIR /opt/
COPY gatkcondaenv.yml ./
COPY gatkPythonPackageArchive.zip ./
COPY environment1.yml ./
COPY environment2.yml ./
COPY requirement.txt ./
# system packages
RUN apt-get update && apt-get install -yq curl wget jq vim less nano && \
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh && \
bash Miniconda3-py39_4.10.3-Linux-x86_64.sh -p /miniconda -b && \
rm Miniconda3-py39_4.10.3-Linux-x86_64.sh
# create conda env for gatk first
ENV PATH=/miniconda/bin:${PATH}
RUN conda update -y conda && conda init && \
conda env create -f environment1.yml
# install gatk dependencies; the SHELL line below makes every following RUN execute inside the gatk environment
SHELL ["conda", "run", "-n", "gatk", "/bin/bash", "-c"]
RUN conda install -y -n gatk --file requirement.txt && \
conda env update -n gatk --file environment2.yml && \
conda install -y -n base -c conda-forge conda-pack
Alternatively, you can run the same commands interactively inside a container until the dependencies work, and then commit that container as an image. Below is the abridged output of my docker build with the Dockerfile above.
$ docker build -t your_account/gatk:4.2 -f Dockerfile .
[+] Building 730.5s (15/15) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
...
...
=> => exporting layers 68.2s
=> => writing image sha256:fc50be4ed681277ea2b7927b622b97c6b4e9c6eb9a4d73224183317a4efe3ef9 0.0s
=> => naming to docker.io/your_account/gatk:4.2 0.0s
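If your build stops at one of the COPY steps instead, the usual cause is a file missing from the build context. Here is a minimal sketch of a pre-build check. The function name check_build_context is my own invention, and the file list is simply copied from the COPY lines of the Dockerfile above.

```shell
# check_build_context DIR: report every file that the Dockerfile's COPY lines
# expect but that is missing from DIR. Returns non-zero if anything is missing.
# (check_build_context is a hypothetical helper, not part of docker or gatk.)
check_build_context() {
    context_dir="${1:-.}"
    missing=0
    for f in gatkcondaenv.yml gatkPythonPackageArchive.zip \
             environment1.yml environment2.yml requirement.txt; do
        if [ ! -f "${context_dir}/${f}" ]; then
            echo "missing from build context: ${f}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Usage, from the gatk-4.2.5.0 directory of step 1:
# check_build_context . && docker build -t your_account/gatk:4.2 -f Dockerfile .
```

Chaining the check with && means docker build only starts when all five files are present.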
Step 3
Hopefully, the required dependencies are now installed. You can test that they work by running python -c "import vqsr_cnn"
inside the gatk environment (for example after conda activate gatk in the container). The output will start with Using TensorFlow backend.
After that, run the container with a host directory mounted as a volume, so that you can write the packed environment out of the container. Then upload it to your remote server and unpack it.
dir="/path/to/output/"
docker run --rm -v "${dir}":/mnt/ -it your_account/gatk:4.2
# inside the container, run conda pack
conda pack -n gatk -o /mnt/gatk.tar.gz
# exit the container; gatk.tar.gz is now in your output directory
# upload it to your remote server, for example with rsync
source="${dir}gatk.tar.gz"
dest="username@remote_host_ip:/path/for/env/dir/"
rsync -aP "${source}" "${dest}"
# on your remote server
cd /path/for/env/dir/
mkdir -p gatk
tar -xzf gatk.tar.gz -C gatk
# Activate the environment. This adds `gatk/bin` to your path
source /path/for/env/dir/gatk/bin/activate
# clean up prefixes so that you can run R and python without problems
conda-unpack
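After unpacking, a quick sanity check can confirm that the environment landed where you expect. The sketch below only checks for files. check_unpacked_env is a hypothetical helper name, the path in the usage line is the placeholder from above, and I am assuming that the unpacked archive has bin/activate and bin/python inside the environment directory.

```shell
# check_unpacked_env PREFIX: verify that PREFIX looks like an unpacked conda-pack environment.
# (check_unpacked_env is a hypothetical helper, not part of conda or conda-pack.)
check_unpacked_env() {
    prefix="$1"
    for f in bin/activate bin/python; do
        if [ ! -e "${prefix}/${f}" ]; then
            echo "not found: ${prefix}/${f}" >&2
            return 1
        fi
    done
    echo "environment looks complete: ${prefix}"
}

# Usage on the remote server (placeholder path):
# check_unpacked_env /path/for/env/dir/gatk
```

This does not prove that the packages work. With the environment activated, the import test from the start of this step, python -c "import vqsr_cnn", is still the real check that the Python dependencies run on the server.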
You don't need to be root to run gatk.
Sorry, I am updating the post. This is meant to install the dependencies needed to run some gatk tools like vqsr:
python -c "import vqsr_cnn"
One can use conda. There is a yaml file provided by gatk:
conda env create -f gatkcondaenv.yml
I tried that command too. But in my case, I started with a new container running ubuntu-18.04 >> installed miniconda >> downloaded gatk4 >> installed the dependencies with
conda env create -f gatkcondaenv.yml
The command failed at solving environment:
You could simply use any of the Docker base images that have conda (or mamba) out of the box, such as:
As @dariober says below, once you have that running you could simply install gatk via conda itself. Also, what is wrong with the official gatk container from the Broad? https://hub.docker.com/r/broadinstitute/gatk/
Anyway, thanks for the effort!
Oh, nothing wrong with it. If you have access to docker on your server, you can just use it, and this tutorial will be redundant for you.
I wrote this for the specific case where you do not have access to docker on your remote server.
Admittedly, this is for a very specific situation. But I just wanted to share it somewhere because I spent a lot of time trying to resolve this. Haha
Hello, I'm facing a similar situation: solving the environment fails when I try to create the conda environment with the yml file (docker pull gives me Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock). However, when I tried to follow your protocol and create a new docker image via docker build -t your_account/gatk:4.2 -f Dockerfile .
it gives me the same permission denied error. I wonder if you have encountered similar things...? Thank you!