Question

Why did my data get worse after trimming with trimmomatic?

0

2.0 years ago

san96 ▴ 170

Hi, I am new to RNA-seq data analysis and am currently trimming my FASTQ files with Trimmomatic. After trimming, some features seem to get worse, specifically the sequence length distribution. Am I doing something wrong? Could this cause problems in my subsequent analyses?

I am attaching the command line that I am running. I will appreciate any help given.

java -jar trimmomatic-0.39.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10  SLIDINGWINDOW:4:15 MINLEN:36

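[Figures: plots before (Original1) and after (Trimming1) trimming; sequence length distributions before (original_dist1) and after (trimming_dist1) trimming]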

trimmomatic • 1.9k views

updated 2.0 years ago by GenoMax 148k • written 2.0 years ago by san96 ▴ 170

3

I'm confused; it clearly got better... You trimmed adapters from the ends of the reads, right?

2.0 years ago by benformatics 4.1k

0

I remember the times when base quality was expected to drop as far as 22 at the end of Illumina reads before preprocessing. What an advance in technology!

2.0 years ago by rbioinfo ▴ 40

0

ibq.enriquepola : Please do not delete threads once they have received at least one comment or answer. They provide value to future visitors. You can accept an answer (green check mark) to provide closure to this thread.

2.0 years ago by GenoMax 148k

score 2 · Answer 1 · 2023-01-11

Sequence length distribution can change after trimming, especially if you had extraneous sequence (such as adapters) in your data. That extraneous sequence will be gone after trimming. For example, if all of your untrimmed reads were originally 150 bp long, the trimmed data will show a distribution ranging from (150 - longest length of extraneous sequence) bp all the way up to 150 bp (reads that did not have any extraneous sequence).
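To make the example above concrete, here is a toy Python sketch. It is not Trimmomatic's actual algorithm: the function name `trimmed_length_distribution` and the contamination lengths are made up for illustration, and it assumes every read starts at 150 bp and simply loses its extraneous 3' tail.

```python
from collections import Counter


def trimmed_length_distribution(read_length, extraneous_lengths):
    """Count read lengths after removing the extraneous tail from each read.

    Toy model: each entry in extraneous_lengths is the number of extraneous
    bases (e.g. adapter) at the 3' end of one read; 0 means a clean read.
    """
    return Counter(read_length - extra for extra in extraneous_lengths)


# Hypothetical data: five 150 bp reads, two of them without contamination.
dist = trimmed_length_distribution(150, [0, 0, 12, 40, 40])
for length in sorted(dist):
    print(length, dist[length])
```

Before trimming, every read in this toy data set is 150 bp, so the length plot is a single spike. After trimming, the lengths spread between 110 bp (150 minus the longest contamination, 40) and 150 bp, which is the kind of change seen in the plots.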

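Trimming programs will drop reads once they get below a certain length threshold. For paired-end data, the mate of a dropped read is also removed from the paired output to keep the files in sync; Trimmomatic writes such a surviving mate to the corresponding unpaired output file. The lower boundary of the distribution is therefore usually the program's minimum read length to keep, which may be a default unless you change it. In the command above it is set explicitly by MINLEN:36.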
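The minimum-length step can be sketched in Python as follows. This is a simplified illustration that works on read lengths only, not on real FASTQ records, and the function name `filter_pairs` is hypothetical. It mirrors the four output files in the Trimmomatic command from the question: pairs where both reads survive go to the paired outputs, and a read whose mate was dropped goes to the matching unpaired output.

```python
def filter_pairs(pairs, min_len=36):
    """Sort (forward_len, reverse_len) pairs by which reads pass min_len.

    Toy model of a minimum-length filter on paired-end data:
    both reads pass -> paired output; only one passes -> that read goes
    to its unpaired output; neither passes -> the pair is dropped.
    """
    result = {
        "paired": [],
        "forward_unpaired": [],
        "reverse_unpaired": [],
        "dropped_pairs": 0,
    }
    for fwd, rev in pairs:
        fwd_ok = fwd >= min_len
        rev_ok = rev >= min_len
        if fwd_ok and rev_ok:
            result["paired"].append((fwd, rev))
        elif fwd_ok:
            result["forward_unpaired"].append(fwd)
        elif rev_ok:
            result["reverse_unpaired"].append(rev)
        else:
            result["dropped_pairs"] += 1
    return result


# Hypothetical read-length pairs after adapter and quality trimming.
example = [(150, 150), (150, 20), (30, 140), (10, 10), (36, 36)]
out = filter_pairs(example, min_len=36)
print(out)
```

With a threshold of 36, no surviving read is shorter than 36 bp, so that value becomes the lower edge of the post-trimming length distribution.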