Question

Tutorial: Running fastq_screen on your data

2.9 years ago

rohitsatyam102 ▴ 920

FastQ Screen is a wonderful QC tool for identifying the source of contamination in your data. Lately, however, configuring it has turned into a nightmare, because adding the databases has recently been failing. Below are the steps I took to set up the fastq_screen.conf file successfully.

Install fastq_screen using conda/mamba

conda create -n fastq_screen
conda activate fastq_screen
conda install -c bioconda fastq-screen
which fastq_screen

My fastq_screen lives in miniconda3/envs/fastq_screen/bin/fastq_screen. Note that the file in the bin folder is only a symlink.
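To see which file the symlink actually points to, something like the following should work on Linux. This is only a sketch: resolve_tool is a hypothetical helper name, and readlink -f assumes GNU coreutils.

```shell
# Hypothetical helper: print the real file behind an executable found on PATH.
# Assumes GNU coreutils (readlink -f); other systems may need a different command.
resolve_tool() {
    tool_path=$(command -v "$1") || return 1
    readlink -f "$tool_path"
}

# Example (with the conda environment active):
#   resolve_tool fastq_screen
```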

The example configuration file, fastq_screen.conf.example, is in miniconda3/envs/fastq_screen/share/fastq-screen-0.14.0-1/. I make a copy of this file, name it fastq_screen.conf, and start editing it.
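The copy step can be sketched as below. copy_example_conf is a hypothetical helper, and the sketch assumes the copy should sit next to the example file. Searching under share/ avoids hard-coding the version directory, which will differ between installs.

```shell
# Hypothetical helper: find fastq_screen.conf.example under a conda environment
# prefix and copy it, in the same directory, to fastq_screen.conf.
copy_example_conf() {
    env_prefix=$1
    example=$(find "$env_prefix/share" -name 'fastq_screen.conf.example' | head -n 1)
    if [ -z "$example" ]; then
        echo "no fastq_screen.conf.example under $env_prefix/share" >&2
        return 1
    fi
    conf="$(dirname "$example")/fastq_screen.conf"
    # Do not overwrite a configuration that has already been edited.
    if [ -e "$conf" ]; then
        echo "$conf already exists, leaving it untouched" >&2
    else
        cp "$example" "$conf"
    fi
    # Print the path of the configuration file to edit.
    echo "$conf"
}

# Example (CONDA_PREFIX is set by 'conda activate'):
#   copy_example_conf "$CONDA_PREFIX"
```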

Installing fastq_screen through conda also installs bowtie and bowtie2 automatically. You can set the paths to these tools by uncommenting the corresponding lines as follows:

BOWTIE  /miniconda3/envs/fastq_screen/bin/bowtie
BOWTIE2 /miniconda3/envs/fastq_screen/bin/bowtie2
BWA /sw/csi/bwa/0.7.17/el7_gnu6.4.0/bin/bwa

Since I am working on a cluster that already has bwa installed, I didn't install it separately. Instead, I load the module with module load bwa each time I run fastq_screen.

The database configuration is the laborious part. I had to download the genome of each organism separately and index it. I keep my bwa indexes in a separate directory.
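The indexing step could look roughly like the sketch below. index_all_fastas is a hypothetical helper, and it assumes every FASTA file under the index directory ends in .fa or .fasta. Setting BWA=echo gives a dry run that only prints what would be indexed.

```shell
# Sketch: run 'bwa index' on every FASTA file below an index directory.
# BWA may be overridden (e.g. BWA=echo for a dry run); it defaults to 'bwa' on PATH.
index_all_fastas() {
    index_dir=$1
    find "$index_dir" -type f \( -name '*.fa' -o -name '*.fasta' \) | sort |
    while IFS= read -r fasta; do
        # bwa index writes <fasta>.amb, .ann, .bwt, .pac and .sa next to the input,
        # so an existing .bwt file is taken as a sign that indexing was already done.
        if [ -e "${fasta}.bwt" ]; then
            echo "already indexed: $fasta"
        else
            "${BWA:-bwa}" index "$fasta"
        fi
    done
}

# Example:
#   module load bwa
#   index_all_fastas /path/to/index/directory
```

Because bwa writes the index files next to the FASTA, the DATABASE path in fastq_screen.conf can simply be the FASTA path itself.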

## Human - sequences available from
## ftp://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/
DATABASE        Human   /path_to_indexes_directory/GRCh38.primary_assembly.genome.fa
##
## Mouse - sequence available from
## ftp://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/
DATABASE        Mouse   /path_to_indexes_directory/GRCm39/GRCm39.primary_assembly.genome.fa
##
## Ecoli - sequence available from EMBL accession U00096.2
DATABASE        Ecoli   /path_to_indexes_directory/Ecoli/Ecoli.ASM160652v1.fasta
##
## PhiX - sequence available from Refseq accession NC_001422.1
DATABASE        PhiX    /path_to_indexes_directory/PhiX/PhiX.fasta
##
## Adapters - sequence derived from the FastQC contaminants file found at: www.bioinformatics.babraham.ac.uk/projects/fastqc
DATABASE        Adapters        /path_to_indexes_directory/Adapters/adapters.fasta
##
## Vector - Sequence taken from the UniVec database
## http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html
DATABASE        Vectors         /path_to_indexes_directory/Vectors/UniVec.fasta
##
## Pvivax - Sequence taken from PlasmoDB
## https://plasmodb.org/common/downloads/release-56/PvivaxP01/fasta/data/PlasmoDB-56_PvivaxP01_Genome.fasta
DATABASE        Pvivax  /path_to_indexes_directory/Pvivax/PlasmoDB-56_PvivaxP01_Genome.fasta
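Before launching a long run, it may help to check that every DATABASE line above points at an existing index. The sketch below is one way to do that. missing_indexes is a hypothetical helper, and it assumes bwa indexes built directly on the FASTA (so that <path>.bwt exists) and paths without spaces.

```shell
# Sketch: report DATABASE entries in a fastq_screen.conf whose BWA index is missing.
# Commented lines (starting with #) are ignored because their first field
# is not exactly "DATABASE".
missing_indexes() {
    awk '$1 == "DATABASE" { print $2, $3 }' "$1" |
    while read -r name path; do
        [ -e "${path}.bwt" ] || echo "$name: no BWA index at ${path}.bwt"
    done
}

# Example: prints nothing when all indexes are in place.
#   missing_indexes /path/to/fastq_screen.conf
```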

The best place to download GRCh38 and GRCm39 is GENCODE. Some of the links from which I got the FASTA files are:

wget https://plasmodb.org/common/downloads/release-56/PvivaxP01/fasta/data/PlasmoDB-56_PvivaxP01_Genome.fasta
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/GRCm39.primary_assembly.genome.fa.gz
wget http://ftp.ensemblgenomes.org/pub/release-52/bacteria//fasta/bacteria_12_collection/escherichia_coli_gca_001606525/dna/Escherichia_coli_gca_001606525.ASM160652v1.dna.toplevel.fa.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec
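Some of these downloads are gzip-compressed and none of them carries the file name used in the DATABASE lines, so each needs to be decompressed and/or renamed before indexing. A minimal sketch of that step follows. stage_fasta and INDEX_DIR are hypothetical names that do not come from fastq_screen.

```shell
# Hypothetical helper: place a downloaded FASTA under the name used in the
# DATABASE line, decompressing it first when the source ends in .gz.
stage_fasta() {
    src=$1
    dest=$2
    mkdir -p "$(dirname "$dest")"
    case "$src" in
        *.gz) gunzip -c "$src" > "$dest" ;;   # keep the original .gz untouched
        *)    cp "$src" "$dest" ;;
    esac
}

# Examples (INDEX_DIR stands for your own index directory):
#   stage_fasta UniVec "$INDEX_DIR/Vectors/UniVec.fasta"
#   stage_fasta GRCm39.primary_assembly.genome.fa.gz "$INDEX_DIR/GRCm39/GRCm39.primary_assembly.genome.fa"
```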

Running the analysis in parallel on a cluster:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --mem=500GB
#SBATCH --partition=batch
#SBATCH --cpus-per-task=28
#SBATCH -J fastq_screen
#SBATCH -o fastq_screen.out
#SBATCH -e fastq_screen.err
#SBATCH --time=20:00:00
#SBATCH --mail-user=rohit.XXXXXX@gmai.com
#SBATCH --mail-type=ALL

module load bwa/0.7.17/gnu-6.4.0

## "file" lists the FASTQ filenames, one per line
cat file | parallel -j 8 "fastq_screen --aligner bwa {}"
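If the configuration file is not in the default location, or if each job should use several threads, the commands can be generated explicitly and piped to parallel. The sketch below is one possible variant, not the command used in the post. make_commands is a hypothetical helper, the configuration path and thread count are illustrative, and file names are assumed to contain no spaces.

```shell
# Sketch: print one fastq_screen command per FASTQ file listed in a text file.
# Arguments: list of FASTQ files, path to fastq_screen.conf, threads per job.
make_commands() {
    list=$1
    conf=$2
    threads=$3
    while IFS= read -r fq; do
        [ -n "$fq" ] || continue    # skip blank lines
        echo "fastq_screen --aligner bwa --conf $conf --threads $threads $fq"
    done < "$list"
}

# Example: 8 jobs x 3 threads stays within the 28 CPUs requested above.
# GNU parallel runs each input line as a command when no command is given.
#   make_commands file /path/to/fastq_screen.conf 3 | parallel -j 8
```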

Comment

A general piece of advice for posts like this: remove absolute paths and everything specific to your local infrastructure, such as the Slurm submission. New users are easily confused by that, and everyone has to adapt things to their local setup anyway.