Question

KeepBothReads in Trimmomatic

0

Entering edit mode

3 months ago

mariannapauletto ▴ 90

Hi all,

I'm posting here to ask your help in understanding the implications of using KeepBothReads in Trimmomatic (v 0.39).

I've an RNAseq experiment (150PE) with a quite good coverage (around 60M fragments). I ran Trimmomatic setting KeepBothReads as false (default mode) or true (see below the code).

mode 1

java -jar $TRIMMOMATIC PE \
  -threads 40 \
  ./${s}_R1_001.fastq.gz ./${s}_R2_001.fastq.gz \
  ./trimmed/T.1.${s}.P1.gz \
  ./trimmed/T.1.${s}.U1.gz \
  ./trimmed/T.1.${s}.P2.gz \
  ./trimmed/T.1.${s}.U2.gz \
  ILLUMINACLIP:$ILLUMINACLIP:2:30:10:2:false \
  HEADCROP:1 \
  LEADING:3 \
  TRAILING:3 \
  SLIDINGWINDOW:4:15 \
  MINLEN:36

mode 2

java -jar $TRIMMOMATIC PE \
  -threads 40 \
  ./${s}_R1_001.fastq.gz \
  ./${s}_R2_001.fastq.gz \
  ./trimmed/T.1.${s}.P1.gz \
  ./trimmed/T.1.${s}.U1.gz \
  ./trimmed/T.1.${s}.P2.gz \
  ./trimmed/T.1.${s}.U2.gz \
  ILLUMINACLIP:$ILLUMINACLIP:2:30:10:2:true \
  HEADCROP:1\
  LEADING:3\
  TRAILING:3 \
  SLIDINGWINDOW:4:15\
  MINLEN:36

Basically, by setting this option as true, I retain both reads in a pair, even if one read fails quality filtering or adapter removal steps. and this is evident by looking at the results (see an example below).

mode 1

Input Read Pairs: 70.957.762
Both Surviving: 40.859.380 (57.58%)
Forward Only Surviving: 29.615.197 (41.74%)
Reverse Only Surviving: 122.332 (0.17%) 
Dropped: 360.853 (0.51%)

mode 2

Input Read Pairs: 70.957.762 
Both Surviving: 70.138.149 (98.84%)
Forward Only Surviving: 336.428 (0.47%)
Reverse Only Surviving: 192.967 (0.27%)
Dropped: 290.218 (0.41%)

Basically, the quality of reverse reads is lower compared to the forward ones, and in around 40% of the sequenced fragments the reverse mate is lost. The common pseudoaligners (e.g. kallisto) cannot handle a combination of paired and unpaired reads, thus if I want to retain the good forward reads surviving after the trimming, I have to use the second mode (KeepBothReads:true). Do you think this will represent a problem in the next steps of the analysis, or simply, these bad quality reverse reads will not align to the transcriptome?

Thank you to who will take time to help me!

Best,
Marianna

Trimmomatic PE-reads • 1.3k views

ADD COMMENT • link 11 weeks ago by mariannapauletto ▴ 90

0

Entering edit mode

Can you post the FastQC plots for quality for original R1/R2 data? Have you looked to see what is causing the reads to drop out? You should not remove reads based on Q-score alone if you are aligning to an established reference.

If your reads are failing because there were some cycles with poor quality bases in middle that may represent an issue with the run. Generally such data should not be released by the sequencing entity.

ADD REPLY • link 3 months ago by GenoMax 148k

0

Entering edit mode

Looking at the FastqC output (see below), the problem is not related to the quality of the reads but to the presence of adapters. But Fw and Rv reads seem similar in terms of quality. Both have a lot of adapters.

forward reads

reverse reads

ADD REPLY • link 3 months ago by mariannapauletto ▴ 90

0

Entering edit mode

Please post the plots as inline figures if you can.

If you are sure the issue is with adapters then the reads are likely failing the minlen filter after trimming. So that means one read become shorter than 36 bp. While you will get alignments at 36 bp (even with human genome) the chances of reads starting to multi-map go up significantly as reads become shorter. If you have really short inserts (< 35 bp) then you are better off dropping those read pairs.

Probably not what you would want to hear but this does not appear to be a good quality library.

ADD REPLY • link 3 months ago by GenoMax 148k

0

Entering edit mode

Below you can find the plots of the reverse reads, those not surviving.

Looking at the results it seems related to adapters. And probably yes, they do not survive because shorter than 36bp.

The good part of the story is that I have 40M reads surviving in pairs, which is not bad for gene expression.

enter image description here

ADD REPLY • link 3 months ago by mariannapauletto ▴ 90

0

Entering edit mode

The good part of the story is that I have 40M reads surviving in pairs, which is not bad for gene expression.

Great. That is the way to go then.

ADD REPLY • link 3 months ago by GenoMax 148k

0

Entering edit mode

Yes, it seems the way to go.

Sorry to bother you again. Based on the plots, it seems that the reason of such a high amount of lost reads is the presence of adapters.Do you agree on that? Or do you suspect something different?

Thank you for your time and advise.

Marianna

ADD REPLY • link 3 months ago by mariannapauletto ▴ 90

0

Entering edit mode

Based on the plots, it seems that the reason of such a high amount of lost reads is the presence of adapters

It is not entirely obvious from the plot. Looks like you have transposases sequence showing up but nothing else. Do you know what exact adapter you were looking for? I assume it was in the file you used. It that was the case then that adapter must be present in your files.

You could also try fastp or bbduk.sh from BBMap suite (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/ ) to get an independent confirmation of trimmomatic result.

ADD REPLY • link 3 months ago by GenoMax 148k

0

Entering edit mode

Thank you GenoMax for your hints.

I've run fastp in one sample using the following command (as much as similar to Trimmomatic)

fastp -i Y96_S1_R1_001.fastq.gz -I Y96_S1_R2_001.fastq.gz -o T.Y96.R1.fq.gz -O T.Y96.R2.fq.gz -5 -3 -r --failed_out failed_out.fq.gz --detect_adapter_for_pe \ -f 1 -l 36 --html T.Y96.res

Reads seem ok. About 96% of them passed the filters. I'm confused!

See below the plots.

enter image description here

ADD REPLY • link 3 months ago by mariannapauletto ▴ 90

1

Entering edit mode

Assuming fastp was able to find the adapter (and it matches what you expect to be there) this result should be ok.

ADD REPLY • link 3 months ago by GenoMax 148k

score 1 · Answer 1 · 2024-09-19

Hi Marianna,

I think part of your problem is that you are using a trimmer that does not understand that if adapter sequence occurs at position X on read 1, it will also occur at position X on read 2. What you need to do is use an adapter trimmer that is actually aware of how paired reads work. There is no reason to quality-trim modern Illumina reads because their quality scores are false, unless you recalibrate them. So don't do that. Unless you recalibrate them, which I do recommend, but it's complicated unless you spike in PhiX.

So, here's what you can do:

1) Download BBMap from Sourceforge.
2) bbduk.sh in=reads.fq out=trimmed.fq ref=adapters ktrim=r k=23 mink=11 hdist=2 hdist2=0 tpe tbo

Judging by your base frequency histograms, unless you are using viral input (or some other situation where you expect very high divergence between A and T, and G and C), the data is garbage (otherwise the G and C, and A and T, lines would be matching). The qscore lines don't matter because those are all false since Illumina no longer reports them correctly. Although, you can get accurate ones if you want by recalibration after mapping. For example, "bbmap.sh in=reads.fq out=mapped.sam vslow; calctruequality.sh in=mapped.sam; bbduk.sh in=reads.fq out=recal.fq recalibrate". It's not as good as real qscores from the sequencer like Illumina used to produce 15 years ago, but it's still pretty good. Actually they are empirically more accurate than the qscores Illumina used to produce when they actually tried to produce real qscores. But that does not mean they are better; it's nuanced.

The command I posted above has the flags "tbo tpe". What do those mean? Well, Trimmomatic can't do them. "tbo" means "trimbyoverlap", which means that the reads are merged using BBMerge, and if the insert size is shorter than read length, the residual is trimmed. "tpe" means "trimpairsevenly", because if adapter sequence is detected in R1 at position 17, that means R2 also has adapter at the same position, it just may not have been detected due to mismatches from sequencing error. You need both of them if you want to accurately adapter-trim your reads and not end up with weird situations like you are experiencing where they get trimmed to different lengths and still have adapter sequences.