I ran Trimmomatic on 300 fastq.gz files to filter low-quality reads, and now I would like to know the number of sequences that were removed from each file.
I have FastQC results for each file, so I could parse every FastQC HTML report to get this information. I just wondered whether there is an easier way to do it.
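This is normally reported in Trimmomatic's console output (as far as I know it is written to standard error rather than standard output), which I normally redirect to a log file. An example with a fairly involved set of trimming steps (see the second-to-last line):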
TrimmomaticPE: Started with arguments: -threads 8 -phred33 N5_AGTTCCGT_L008_R1_001.fastq.gz N5_AGTTCCGT_L008_R2_001.fastq.gz N5_AGTTCCGT_L008_R1.paired.fastq.gz N5_AGTTCCGT_L008_R1.unpaired.fastq.gz N5_AGTTCCGT_L008_R2.paired.fastq.gz N5_AGTTCCGT_L008_R2.unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:15:10 CROP:98 HEADCROP:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:30
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 25510326 Both Surviving: 23679651 (92.82%) Forward Only Surviving: 1373354 (5.38%) Reverse Only Surviving: 332039 (1.30%) Dropped: 125282 (0.49%)
TrimmomaticPE: Completed successfully
I should also add, MultiQC is a great tool to capture the log reports for multiple samples and collate them into a single report with nice summary graphics.
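For anyone who did keep the logs, the summary line shown above can be collected from many files with a short script. The following is only a minimal sketch: it assumes paired-end logs whose summary line has exactly the layout of the example above, and the function names (`parse_trimmomatic_log`, `summarize`) and the idea of one log file per sample are assumptions for illustration, not part of Trimmomatic.

```python
import re

# Matches the paired-end summary line from the example log, e.g.
# "Input Read Pairs: 25510326 Both Surviving: 23679651 (92.82%) ... Dropped: 125282 (0.49%)"
SUMMARY = re.compile(
    r"Input Read Pairs:\s+(?P<total>\d+)\s+"
    r"Both Surviving:\s+(?P<both>\d+)\s+\([^)]*\)\s+"
    r"Forward Only Surviving:\s+(?P<fwd_only>\d+)\s+\([^)]*\)\s+"
    r"Reverse Only Surviving:\s+(?P<rev_only>\d+)\s+\([^)]*\)\s+"
    r"Dropped:\s+(?P<dropped>\d+)"
)


def parse_trimmomatic_log(text):
    """Return read-pair counts from a TrimmomaticPE log, or None if no summary line is found."""
    match = SUMMARY.search(text)
    if match is None:
        return None
    counts = {key: int(value) for key, value in match.groupdict().items()}
    # Pairs that did not survive as an intact pair (one or both mates lost).
    counts["not_both"] = counts["total"] - counts["both"]
    # Percentage of pairs where both mates were dropped.
    total = counts["total"]
    counts["pct_dropped"] = 100.0 * counts["dropped"] / total if total else 0.0
    return counts


def summarize(log_paths):
    """Build tab-delimited report lines (hypothetical layout: one log file per sample)."""
    lines = ["log\tinput_pairs\tboth_surviving\tdropped\tpct_dropped"]
    for path in log_paths:
        with open(path) as handle:
            counts = parse_trimmomatic_log(handle.read())
        if counts is None:
            continue  # no summary line, e.g. the run did not finish
        lines.append(
            f"{path}\t{counts['total']}\t{counts['both']}\t"
            f"{counts['dropped']}\t{counts['pct_dropped']:.2f}"
        )
    return lines
```

Calling `summarize(sorted(glob.glob("logs/*.log")))` (the `logs/` directory is a made-up path) and writing the returned lines to a file would give one row per sample.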
Thank you for the answer.
I forgot to record the standard output, that is why I wondered if there was a way to get the information afterward.
I used MultiQC to have a summary of the fastqc reports. But now I am more interested in a simple text file with the name of the sample and the number/percentage of removed sequences.
Run FastQC on both the raw and trimmed data (I typically do this anyway to make sure the trimming addressed any identified issues, unless I skip trimming altogether). Then run MultiQC on all the data, retaining the parent directory information (there is a switch for this), or run it on the raw and trimmed results separately. In either case, MultiQC generates a tab-delimited text file, so you can pull the raw and trimmed FastQC results into R and derive the number of removed reads from that.
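The subtraction described above could look roughly like this in Python instead of R. This is a sketch under assumptions: it takes two tab-delimited tables (one for raw data, one for trimmed data) with a `Sample` column and a `Total Sequences` column, which is how I understand MultiQC's FastQC table to be laid out; check the header of your own file. It also assumes raw and trimmed files share the same sample name, so names would need to be harmonized first. The function names are made up for illustration.

```python
import csv
import io


def read_totals(handle, name_col="Sample", count_col="Total Sequences"):
    """Map sample name -> read count from a tab-delimited table (column names are assumptions)."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Counts may be written as floats such as "1000.0", so go through float first.
    return {row[name_col]: int(float(row[count_col])) for row in reader}


def removed_reads(raw, trimmed):
    """Yield (sample, raw, trimmed, removed, percent removed) for samples present in both tables."""
    for sample in sorted(raw.keys() & trimmed.keys()):
        before, after = raw[sample], trimmed[sample]
        removed = before - after
        pct = 100.0 * removed / before if before else 0.0
        yield sample, before, after, removed, pct


def format_report(raw, trimmed):
    """Return the comparison as tab-delimited text, one sample per line."""
    out = io.StringIO()
    out.write("sample\traw\ttrimmed\tremoved\tpct_removed\n")
    for sample, before, after, removed, pct in removed_reads(raw, trimmed):
        out.write(f"{sample}\t{before}\t{after}\t{removed}\t{pct:.2f}\n")
    return out.getvalue()
```

For paired-end runs with separate paired and unpaired outputs, the trimmed count for a read file would be the sum of its paired and unpaired files, so those would have to be added together before the comparison.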
It worked, thank you.