Hi all,
I'm posting here to ask your help in understanding the implications of using KeepBothReads in Trimmomatic (v 0.39).
I've an RNAseq experiment (150PE) with a quite good coverage (around 60M fragments). I ran Trimmomatic setting KeepBothReads as false (default mode) or true (see below the code).
mode 1
java -jar $TRIMMOMATIC PE \
-threads 40 \
./${s}_R1_001.fastq.gz ./${s}_R2_001.fastq.gz \
./trimmed/T.1.${s}.P1.gz \
./trimmed/T.1.${s}.U1.gz \
./trimmed/T.1.${s}.P2.gz \
./trimmed/T.1.${s}.U2.gz \
ILLUMINACLIP:$ILLUMINACLIP:2:30:10:2:false \
HEADCROP:1 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36
mode 2
java -jar $TRIMMOMATIC PE \
-threads 40 \
./${s}_R1_001.fastq.gz \
./${s}_R2_001.fastq.gz \
./trimmed/T.1.${s}.P1.gz \
./trimmed/T.1.${s}.U1.gz \
./trimmed/T.1.${s}.P2.gz \
./trimmed/T.1.${s}.U2.gz \
ILLUMINACLIP:$ILLUMINACLIP:2:30:10:2:true \
HEADCROP:1\
LEADING:3\
TRAILING:3 \
SLIDINGWINDOW:4:15\
MINLEN:36
Basically, by setting this option as true, I retain both reads in a pair, even if one read fails quality filtering or adapter removal steps. and this is evident by looking at the results (see an example below).
mode 1
Input Read Pairs: 70.957.762
Both Surviving: 40.859.380 (57.58%)
Forward Only Surviving: 29.615.197 (41.74%)
Reverse Only Surviving: 122.332 (0.17%)
Dropped: 360.853 (0.51%)
mode 2
Input Read Pairs: 70.957.762
Both Surviving: 70.138.149 (98.84%)
Forward Only Surviving: 336.428 (0.47%)
Reverse Only Surviving: 192.967 (0.27%)
Dropped: 290.218 (0.41%)
Basically, the quality of reverse reads is lower compared to the forward ones, and in around 40% of the sequenced fragments the reverse mate is lost. The common pseudoaligners (e.g. kallisto) cannot handle a combination of paired and unpaired reads, thus if I want to retain the good forward reads surviving after the trimming, I have to use the second mode (KeepBothReads:true
). Do you think this will represent a problem in the next steps of the analysis, or simply, these bad quality reverse reads will not align to the transcriptome?
Thank you to who will take time to help me!
Best,
Marianna
Can you post the FastQC plots for quality for original R1/R2 data? Have you looked to see what is causing the reads to drop out? You should not remove reads based on Q-score alone if you are aligning to an established reference.
If your reads are failing because there were some cycles with poor quality bases in middle that may represent an issue with the run. Generally such data should not be released by the sequencing entity.
Looking at the FastqC output (see below), the problem is not related to the quality of the reads but to the presence of adapters. But Fw and Rv reads seem similar in terms of quality. Both have a lot of adapters.
forward reads
reverse reads
Please post the plots as inline figures if you can.
If you are sure the issue is with adapters then the reads are likely failing the
minlen
filter after trimming. So that means one read become shorter than 36 bp. While you will get alignments at 36 bp (even with human genome) the chances of reads starting to multi-map go up significantly as reads become shorter. If you have really short inserts (< 35 bp) then you are better off dropping those read pairs.Probably not what you would want to hear but this does not appear to be a good quality library.
Below you can find the plots of the reverse reads, those not surviving.
Looking at the results it seems related to adapters. And probably yes, they do not survive because shorter than 36bp.
The good part of the story is that I have 40M reads surviving in pairs, which is not bad for gene expression.
Great. That is the way to go then.
Yes, it seems the way to go.
Sorry to bother you again. Based on the plots, it seems that the reason of such a high amount of lost reads is the presence of adapters.Do you agree on that? Or do you suspect something different?
Thank you for your time and advise.
Marianna
It is not entirely obvious from the plot. Looks like you have transposases sequence showing up but nothing else. Do you know what exact adapter you were looking for? I assume it was in the file you used. It that was the case then that adapter must be present in your files.
You could also try
fastp
orbbduk.sh
from BBMap suite (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/ ) to get an independent confirmation oftrimmomatic
result.Thank you GenoMax for your hints.
I've run fastp in one sample using the following command (as much as similar to Trimmomatic)
Reads seem ok. About 96% of them passed the filters. I'm confused!
See below the plots.
Assuming
fastp
was able to find the adapter (and it matches what you expect to be there) this result should be ok.