Question

Getting amount of removed reads after Trimmomatic

0

8.1 years ago

Corps_en_Thym • 0

Hi,

I launched Trimmomatic on 300 fastq.gz files to filter low quality reads and now I would like to know the number of sequences that have been removed in each file.

I have fastqc results for each file so I could parse every fastqc html to get this information. I just wondered if there was an easier way to do it.

Thank you,

trimmomatic fastq • 3.3k views

ADD COMMENT • link updated 8.1 years ago by Chris Fields ★ 2.2k • written 8.1 years ago by Corps_en_Thym • 0

score 2 · Answer 1 · 2016-10-21

2

8.1 years ago

Chris Fields ★ 2.2k

This is normally reported in Trimmomatic's console output (the summary goes to standard error), which I normally redirect to a log file. A particularly thorny example (see second to last line):

TrimmomaticPE: Started with arguments: -threads 8 -phred33 N5_AGTTCCGT_L008_R1_001.fastq.gz N5_AGTTCCGT_L008_R2_001.fastq.gz N5_AGTTCCGT_L008_R1.paired.fastq.gz N5_AGTTCCGT_L008_R1.unpaired.fastq.gz N5_AGTTCCGT_L008_R2.paired.fastq.gz N5_AGTTCCGT_L008_R2.unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:15:10 CROP:98 HEADCROP:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:30
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 25510326 Both Surviving: 23679651 (92.82%) Forward Only Surviving: 1373354 (5.38%) Reverse Only Surviving: 332039 (1.30%) Dropped: 125282 (0.49%)
TrimmomaticPE: Completed successfully
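
If the log was saved, the summary line above can be turned into numbers with a small parser. The following is a minimal Python sketch, assuming the paired-end summary line has the same layout as in this example; the function name `parse_pe_summary` is made up for illustration, and the percentage is recomputed from the counts rather than read from the log.

```python
import re

# Matches the paired-end summary line as it appears in the log above.
# The bracketed percentages are skipped and recomputed from the counts.
PE_SUMMARY = re.compile(
    r"Input Read Pairs:\s*(?P<input_pairs>\d+)\s+"
    r"Both Surviving:\s*(?P<both_surviving>\d+)\s*\([^)]*\)\s+"
    r"Forward Only Surviving:\s*(?P<forward_only>\d+)\s*\([^)]*\)\s+"
    r"Reverse Only Surviving:\s*(?P<reverse_only>\d+)\s*\([^)]*\)\s+"
    r"Dropped:\s*(?P<dropped>\d+)"
)


def parse_pe_summary(log_text):
    """Return the read-pair counts from a Trimmomatic PE log, or None if absent."""
    match = PE_SUMMARY.search(log_text)
    if match is None:
        return None
    stats = {name: int(value) for name, value in match.groupdict().items()}
    # Percentage of input pairs for which neither mate survived.
    if stats["input_pairs"]:
        stats["dropped_pct"] = 100.0 * stats["dropped"] / stats["input_pairs"]
    else:
        stats["dropped_pct"] = 0.0
    return stats
```

Calling this on the text of each sample's log file and writing the sample name, `dropped`, and `dropped_pct` on one line per sample gives a simple summary table. Note that "Dropped" counts pairs where both mates were discarded; the forward-only and reverse-only counts are pairs that lost one mate.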

ADD COMMENT • link 8.1 years ago by Chris Fields ★ 2.2k

2

I should also add, MultiQC is a great tool to capture the log reports for multiple samples and collate them into a single report with nice summary graphics.

ADD REPLY • link 8.1 years ago by Chris Fields ★ 2.2k

0

Thank you for the answer,

I forgot to record the standard output, that is why I wondered if there was a way to get the information afterward.

I used MultiQC to get a summary of the FastQC reports, but now I am more interested in a simple text file with the name of each sample and the number/percentage of removed sequences.

ADD REPLY • link 8.1 years ago by Corps_en_Thym • 0
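
When the log was not kept, one fallback is to count the records in the input and output files directly. This is only a sketch, not something proposed in the thread: it assumes standard four-line FASTQ records in gzip-compressed files, and the function names `count_fastq_records` and `removed_reads` are hypothetical.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def removed_reads(raw_path, trimmed_paths):
    """Compare one raw file with the trimmed output file(s) derived from it.

    Returns (raw, kept, removed, removed_pct).
    """
    raw = count_fastq_records(raw_path)
    kept = sum(count_fastq_records(p) for p in trimmed_paths)
    removed = raw - kept
    pct = 100.0 * removed / raw if raw else 0.0
    return raw, kept, removed, pct
```

For paired-end output like the example above, the surviving forward reads are split between the paired and unpaired R1 files, so both would be passed in `trimmed_paths`. Decompressing 300 files takes a while, so this is slower than reading a saved log.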

1

Run FastQC on both the raw and trimmed data (I typically do this to make sure the trimming addressed any identified issues, unless I skip trimming altogether), then run MultiQC on all the data while retaining the parent directory information (there is a switch for this), or run it separately on each set. In either case, MultiQC generates a tab-delimited text file, so you could then pull the raw and trimmed FastQC results into R and derive the number of removed reads from that.
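
The last step could look roughly like the sketch below, written in Python rather than R. It assumes MultiQC was run separately on the raw and trimmed FastQC results and that each tab-delimited file has a `Sample` column and a `Total Sequences` column; check the header of your own file, since both column names are assumptions here. The function names and the `key` argument (a way to map raw and trimmed sample names onto a shared label) are hypothetical.

```python
import csv


def read_totals(tsv_path, sample_col="Sample", count_col="Total Sequences"):
    """Map sample name -> read count from a tab-delimited summary table."""
    totals = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Counts may be written as floats (e.g. "100.0"), so go via float.
            totals[row[sample_col]] = int(float(row[count_col]))
    return totals


def write_removed_table(raw_tsv, trimmed_tsv, out_path, key=lambda name: name):
    """Write sample, raw, trimmed, removed and removed_pct as tab-delimited text.

    `key` converts a sample name into the label shared by the raw and
    trimmed tables; the default leaves names unchanged.
    """
    raw = {key(s): n for s, n in read_totals(raw_tsv).items()}
    trimmed = {key(s): n for s, n in read_totals(trimmed_tsv).items()}
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample", "raw", "trimmed", "removed", "removed_pct"])
        # Only samples present in both tables can be compared.
        for sample in sorted(raw.keys() & trimmed.keys()):
            removed = raw[sample] - trimmed[sample]
            pct = 100.0 * removed / raw[sample] if raw[sample] else 0.0
            writer.writerow(
                [sample, raw[sample], trimmed[sample], removed, f"{pct:.2f}"]
            )
```

Because trimmed files usually carry a different suffix than the raw ones, a `key` function that strips those suffixes is typically needed before the two tables line up.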