FastQ Screen is a wonderful QC tool for FASTQ files that one can use to identify the source of contamination in their data. Lately, though, configuring the tool has turned out to be a nightmare, with the addition of databases failing recently. I will try to pen down the few steps I took to successfully configure the fastq_screen.conf file.
- Install fastq_screen using conda/mamba:
conda create -n fastq_screen
conda activate fastq_screen
conda install -c bioconda fastq-screen
which fastq_screen
My fastq_screen lives in miniconda3/envs/fastqscreen/bin/fastq_screen; however, it turns out to be only a symlink when I look inside the bin folder. The example configuration file is in miniconda3/envs/fastqscreen/share/fastq-screen-0.14.0-1/ under the name fastq_screen.conf.example. I make a copy of this file, name it fastq_screen.conf, and start editing it.
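The copy step can be sketched as a small shell function. This is only a sketch: the function name make_fastq_screen_conf is made up for illustration, and it assumes that the example file sits next to the real script that the symlink in bin points to (as in my install) and that GNU readlink -f is available.

```shell
# Hypothetical helper: copy fastq_screen.conf.example to fastq_screen.conf
# in the directory that holds the real fastq_screen script.
make_fastq_screen_conf() {
    exe="$1"   # path to the fastq_screen executable (may be a symlink)
    # Resolve the symlink to find the directory of the real script
    real_dir="$(dirname "$(readlink -f "$exe")")"
    example="$real_dir/fastq_screen.conf.example"
    target="$real_dir/fastq_screen.conf"
    if [ ! -f "$example" ]; then
        echo "No fastq_screen.conf.example found in $real_dir" >&2
        return 1
    fi
    # Never overwrite a config that has already been edited
    if [ ! -e "$target" ]; then
        cp "$example" "$target"
    fi
    echo "$target"
}
```

With the environment activated, it could be called as make_fastq_screen_conf "$(command -v fastq_screen)"; it prints the path of the config file to edit. If you would rather keep the config elsewhere, fastq_screen also accepts a --conf option pointing to it.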
- When I install fastq_screen, bowtie and bowtie2 get installed automatically. You can set the paths to these tools by uncommenting the corresponding lines as follows:
BOWTIE /miniconda3/envs/fastqscreen/bin/bowtie
BOWTIE2 /miniconda3/envs/fastqscreen/bin/bowtie2
BWA /sw/csi/bwa/0.7.17/el7_gnu6.4.0/bin/bwa
Since I am working on a cluster that already has bwa installed, I didn't download it separately. I load the module with module load bwa each time I run fastq_screen.
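To avoid typos in these paths, the lines can be generated from whatever is currently on the PATH. A minimal sketch, where aligner_lines is a hypothetical helper name and the output format simply mirrors the three lines above:

```shell
# Hypothetical helper: print one "<TOOL> <path>" config line for every
# requested aligner that is found on the PATH.
aligner_lines() {
    for tool in "$@"; do
        if tool_path="$(command -v "$tool" 2>/dev/null)"; then
            # Config keys are the upper-case tool names (BOWTIE, BOWTIE2, BWA)
            key="$(printf '%s' "$tool" | tr '[:lower:]' '[:upper:]')"
            printf '%s %s\n' "$key" "$tool_path"
        else
            echo "# $tool not found on PATH" >&2
        fi
    done
}
```

Running aligner_lines bowtie bowtie2 bwa after activating the environment and loading the bwa module prints lines that can be pasted into fastq_screen.conf.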
- The database configuration is the laborious part. I had to download each organism separately and index it. I keep my bwa indexes in a separate directory and point each DATABASE line at the corresponding index:
## Human - sequences available from
## ftp://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/
DATABASE Human /path_to_indexes/GRCh38.primary_assembly.genome.fa
##
## Mouse - sequence available from
## ftp://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/
DATABASE Mouse /path_to_indexes_directory/GRCm39/GRCm39.primary_assembly.genome.fa
##
## Ecoli- sequence available from EMBL accession U00096.2
DATABASE Ecoli /path_to_indexes_directory/Ecoli/Ecoli.ASM160652v1.fasta
##
## PhiX - sequence available from Refseq accession NC_001422.1
DATABASE PhiX /path_to_indexes_directory/PhiX/PhiX.fasta
##
## Adapters - sequence derived from the FastQC contaminants file found at: www.bioinformatics.babraham.ac.uk/projects/fastqc
DATABASE Adapters /path_to_indexes_directory/Adapters/adapters.fasta
##
## Vector - Sequence taken from the UniVec database
## http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html
DATABASE Vectors /path_to_indexes_directory/Vectors/UniVec.fasta
## Pvivax - Sequence taken from PlasmoDB
## https://plasmodb.org/common/downloads/release-56/PvivaxP01/fasta/data/PlasmoDB-56_PvivaxP01_Genome.fasta
DATABASE Pvivax /path_to_indexes_directory/Pvivax/PlasmoDB-56_PvivaxP01_Genome.fasta
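Because a wrong path in a DATABASE line is easy to miss, a quick sanity check before running can help. The sketch below is not part of FastQ Screen; check_conf is a hypothetical helper that assumes whitespace-separated DATABASE lines as shown above, paths without spaces, and indexes built with bwa index (which writes a .bwt file next to the FASTA name).

```shell
# Hypothetical helper: report whether each DATABASE entry in a
# fastq_screen.conf has a matching bwa index (<basename>.bwt).
check_conf() {
    conf="$1"
    status=0
    # Commented lines start with "#", so only active DATABASE lines match
    while read -r name base; do
        [ -n "$name" ] || continue
        if [ -f "$base.bwt" ]; then
            echo "OK $name"
        else
            echo "MISSING $name ($base.bwt)"
            status=1
        fi
    done <<EOF
$(awk '$1 == "DATABASE" { print $2, $3 }' "$conf")
EOF
    return "$status"
}
```

Calling check_conf fastq_screen.conf prints one line per database and returns a non-zero status if any index is missing.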
The best place to download GRCh38 and GRCm39 is GENCODE. Some of the links I got the FASTA files from are:
wget https://plasmodb.org/common/downloads/release-56/PvivaxP01/fasta/data/PlasmoDB-56_PvivaxP01_Genome.fasta
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/GRCm39.primary_assembly.genome.fa.gz
wget http://ftp.ensemblgenomes.org/pub/release-52/bacteria//fasta/bacteria_12_collection/escherichia_coli_gca_001606525/dna/Escherichia_coli_gca_001606525.ASM160652v1.dna.toplevel.fa.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec
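After downloading, the .gz files need to be decompressed (gunzip) and every FASTA indexed with bwa index, as mentioned above. A minimal sketch of the indexing loop, assuming one subdirectory per organism with files ending in .fa or .fasta; index_all and the BWA variable are names I made up for illustration.

```shell
# Aligner executable; override with e.g. BWA=/path/to/bwa if it is not on the PATH
BWA="${BWA:-bwa}"

# Hypothetical helper: run "bwa index" on every FASTA one level below the
# given directory, skipping files that already have a .bwt index.
index_all() {
    index_root="$1"
    for fa in "$index_root"/*/*.fa "$index_root"/*/*.fasta; do
        # Unmatched globs stay literal, so skip anything that is not a file
        if [ ! -f "$fa" ]; then
            continue
        fi
        if [ -f "$fa.bwt" ]; then
            echo "skip $fa (already indexed)"
            continue
        fi
        "$BWA" index "$fa"
    done
}
```

bwa index writes its index files next to the FASTA using the FASTA filename as the prefix, which is why the DATABASE lines point at the .fa/.fasta path itself.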
- Running the analysis in parallel on the cluster using GNU parallel inside a SLURM job:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --mem=500GB
#SBATCH --partition=batch
#SBATCH --cpus-per-task 28
#SBATCH -J fastq_screen
#SBATCH -o fastq_screen.out
#SBATCH -e fastq_screen.err
#SBATCH --time=20:00:00
#SBATCH --mail-user=rohit.XXXXXX@gmai.com
#SBATCH --mail-type=ALL
module load bwa/0.7.17/gnu-6.4.0
## "file" lists the FASTQ filenames, one per line
cat file | parallel -j 8 "fastq_screen --aligner bwa {}"
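One thing worth checking in a script like this is how many threads each of the 8 parallel jobs gets relative to the 28 CPUs requested. fastq_screen has a --threads option (and a THREADS setting in the config file), so the CPUs can be split explicitly. A small sketch; threads_per_job is a hypothetical helper, and whether an even split is the best choice for your data is an assumption.

```shell
# Hypothetical helper: split the allocated CPUs evenly across parallel jobs,
# always returning at least 1 thread per job.
threads_per_job() {
    cpus="$1"
    n_jobs="$2"
    threads=$(( cpus / n_jobs ))   # integer division, remainder is left idle
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}
```

Inside the job this could be used as T=$(threads_per_job "$SLURM_CPUS_PER_TASK" 8) followed by parallel -j 8 "fastq_screen --aligner bwa --threads $T {}" < file.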
A general piece of advice for these sorts of posts is to remove absolute paths and everything specific to your local infrastructure, such as the SLURM submission. New users are easily confused by that, and one has to adapt things to local setups anyway.
Sorry!! Improved as suggested.