Missing reads in fastp
0
0
Entering edit mode
2.3 years ago
amy__ ▴ 190

Hello,

Having run fastp like this:

#!/bin/bash --login
#SBATCH -J AmyHouseman_fastp
#SBATCH -o %x.stdout.%J.%N
#SBATCH -e %x.stderr.%J.%N
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --mail-type=ALL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=xxxxx    # Where to send mail
#SBATCH --array=1-33
#SBATCH --partition=compute
#SBATCH --account=scw1581
#SBATCH --mem-per-cpu=128GB

module purge
module load singularity
module load parallel

# Set bash error trapping to exit on first error.
set -eu

WDPATH=/scratch/$USER/$SLURM_ARRAY_JOB_ID
CONTAINER_FILE=fastp-v0.23.1.sif
MY_CONTAINER_PATH=/scratch/$USER/containers
CONTAINER=$MY_CONTAINER_PATH/$CONTAINER_FILE

EXOME_IDs_FILE=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_input/NG0921_sampleIDs
FORWARD_FASTQ=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_input/RawPEreads/{}1.fq.gz
REVERSE_FASTQ=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_input/RawPEreads/{}2.fq.gz

TRIMMED_FORWARD_OUTPUT=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_output1/TrimmedPEreads/{}1.trimmed.fq.gz
TRIMMED_REVERSE_OUTPUT=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_output1/TrimmedPEreads/{}2.trimmed.fq.gz
HTML_OUTPUT=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_output1/html_output/{}pair.html
JSON_OUTPUT=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_output1/json_output/{}pair.json
FAILED_FASTQ_OUTPUT=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_output1/failed_output/{}failed.fq.gz
UNPAIRED_FORWARD_OUTPUT=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_output1/unpairedreads_output/Unpaired_read1/{}1.unpaired1.fq.gz
UNPAIRED_REVERSE_OUTPUT=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_output1/unpairedreads_output/Unpaired_read2/{}2.unpaired2.fq.gz
ADAPTERS=/scratch/c.c21087028/Polyposis_Exome_Analysis_NG0921/fastp/All_fastp_input/NG0921_adapter.fa

if [ "$SLURM_ARRAY_TASK_ID" == "1" ]
then
  mkdir -p $MY_CONTAINER_PATH
  [ -f "$CONTAINER" ] || ssh cl1 wget -O $CONTAINER https://wotan.cardiff.ac.uk/containers/$CONTAINER_FILE
  mkdir $WDPATH
fi

while [ ! -d $WDPATH ]
do
  sleep 10
done

sed -n "${SLURM_ARRAY_TASK_ID}p" $EXOME_IDs_FILE | parallel -j 1 "singularity run $CONTAINER -i $FORWARD_FASTQ -I $REVERSE_FASTQ -o $TRIMMED_FORWARD_OUTPUT -O $TRIMMED_REVERSE_OUTPUT --html $HTML_OUTPUT --json $JSON_OUTPUT --failed_out $FAILED_FASTQ_OUTPUT --unpaired1 $UNPAIRED_FORWARD_OUTPUT --unpaired2 $UNPAIRED_REVERSE_OUTPUT --adapter_fasta $ADAPTERS"

Sometimes I get output trimmed files that are of not correct, such as:

enter image description here

How can so many reads pass the filtering yet so little be in the filtered ones?

It is bizarre and really annoying!

fastp filtering trimming • 1.5k views
ADD COMMENT
0
Entering edit mode

Could you elaborate a little more on what you find problematic? Upon my cursory reading it seems everything is fine.

With good data, more reads pass the filters and fewer reads will have errors.

ADD REPLY
0
Entering edit mode

I think the confusion stems from following:

Before filtering - 114 M reads (I guess that is total in data set pre-filtering)

After Filtering - 35 M reads (that I would think is either reads that remain after filtering or are removed during filtering (ambiguous)

Above numbers do not add us since the

Filtering result (which I guess is final) shows

reads passed filters - 114M reads (once again)

ADD REPLY
0
Entering edit mode

Hi,

Its more that when I then do

echo $(zcat NG0921001_EKDN210018941-1A_HN2MGDSX2_L2_1.trimmed.fq.gz|wc -l)/4|bc

The result is only 21785824

and for the reverse

echo $(zcat NG0921001_EKDN210018941-1A_HN2MGDSX2_L2_2.trimmed.fq.gz|wc -l)/4|bc

the result is just 21785824


After running trimgalore, I get

echo $(zcat NG0921001_EKDN210018941-1A_HN2MGDSX2_L2_1_val_1.fq.gz|wc -l)/4|bc

57393620

and echo $(zcat NG0921001_EKDN210018941-1A_HN2MGDSX2_L2_2_val_2.fq.gz|wc -l)/4|bc

57393620

Which is more in keeping with the 114M

Strange! I will get the fastqc from trimgalore and see if the stats

ADD REPLY
0
Entering edit mode

If you are starting with 21,785,824 reads how are you getting 57,393,620 after trimming?

ADD REPLY
0
Entering edit mode

Sorry I was not clear.

The original input fastq file (no trimming done) for NG0921001_EKDN210018941-1A_HN2MGDSX2_L2_1.fq.gz is 57402212 reads.

When using the trimming tool fastp it ends up outputting only 21,785,824 reads as the trimmed reads fastq.

Then when I used the original fastq and trim galore, to compare its output to fastp I get 57,393,620 as the output after running trim galore.

So fastp is doing something strange, and outputting a very small number compared to when trim galore does its own trimming

ADD REPLY
0
Entering edit mode

The explanation is simply that the two tools are being instructed to do something else altogether.

ADD REPLY

Login before adding your answer.

Traffic: 1558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6