I am using cutadapt (v1.14) to trim adapters from a published ribosome profiling dataset (short single-end reads of 51 nt). When I run FastQC on the raw data, I see that the read quality is pretty good at the 3' end, with the entire box plot of quality > 30. However, when I trim the adapter and run FastQC on the processed data, I find that the quality drops at the 3' end. I am unable to understand why there would be a drop in quality after adapter trimming when the original reads were of high quality. I would appreciate it if someone could shed some light on this.
The adapter trimming command is as follows:
cutadapt -a CTGTAGGCACCATCAATATCTCGTATGC -q 20 -m 20 -M 45 -O 6 -o SRR1562913_trimmed.fastq SRR1562913_1.fastq
FastQC on the raw data: [image: Raw dataset]
FastQC on the processed data: [image: Processed data]
But shouldn't the lower quality seen at positions 33-43 in the processed data also be visible at these positions in the raw data? Of course their numbers would be small, so the mean quality stays higher, but even the lower bound of the box plots is > 30 in the raw data.
The lower bound of the box plot (the lower whisker) is not the minimum observed value. The exact definition varies; it may represent the tenth percentile, for example. If the raw adapter-contamination plot shows high values (say, above 50%), the processed box plot may well be showing what were outliers in the raw data.
Thanks, I think this explains it. The lower bound is the 10th percentile according to the FastQC documentation. The bad-quality reads must fall within the lowest 10% at those positions and hence do not show up in the raw data plot.
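A minimal sketch of that effect (the read counts and quality values below are made up for illustration, not taken from this dataset): before trimming, every read reaches position 40, so a small low-quality fraction hides below the 10th-percentile whisker; after trimming, only the longer reads still have a base at position 40, so that fraction can exceed 10% and drag the whisker down.

```python
from statistics import quantiles

# Hypothetical per-base qualities at read position 40.
# Raw data: all 100 reads are 51 nt, so all reach position 40;
# 8% of them are low quality there.
raw = [38] * 92 + [15] * 8

# After adapter trimming, suppose only 20 reads still extend to
# position 40, and the low-quality reads are over-represented there.
trimmed = [38] * 12 + [15] * 8

def tenth_percentile(vals):
    # statistics.quantiles(n=10) returns the 9 decile cut points;
    # the first one is the 10th percentile (FastQC's lower whisker).
    return quantiles(vals, n=10)[0]

print(tenth_percentile(raw))      # whisker stays high: 8% < 10%
print(tenth_percentile(trimmed))  # whisker drops: 40% of remaining reads are bad
```

The absolute number of low-quality bases at position 40 is the same in both cases; only the denominator changed.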
The adapters do not necessarily occur at the very end of your reads, so I think some of them might occur in the 33-43 range, boosting the quality score in that region (prior to trimming) as well.
So this implies that after adapter trimming you always need to do another round of quality trimming? And that the order is important: first adapter, then quality?
Every dataset is different. Even in this case, most of the data is still above Q20, so as long as there is a reference genome available to align against, no quality trimming should be needed.
True, especially if you assume that the aligner will do soft clipping/trimming of the data (which most do, I think).
I was, however, thinking of the case of assembly (which is obviously not the case in the question asked here).
For any de novo work it would be appropriate (perhaps required) to quality trim the data at Q20 (or stricter).
My thought exactly. However, I'm a little nervous about the order of trimming, which apparently can have a (severe) impact on the result. And OK, normally you would probably first get rid of the adapters and then do Q-trimming.
I'm a bit rusty on the cutadapt syntax :/ but isn't the command line given in this post also doing Q-trimming (-q 20)? If so, I'm concerned that other tools that do both adapter removal and Q-trimming combined might also not apply the "correct" order.
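A toy illustration of why the order can matter (this is not cutadapt's actual algorithm; the sequences, qualities, and helper functions are made up): if quality trimming runs first and eats into a low-quality adapter tail, the leftover adapter piece can end up shorter than the minimum overlap and go undetected.

```python
ADAPTER = "CTGTAGGCACCATCAATATCTCGTATGC"
MIN_OVERLAP = 6  # mirrors the -O 6 in the command above

def quality_trim_3prime(seq, quals, cutoff=20):
    # Drop 3'-end bases below the cutoff (simplified; cutadapt uses a
    # BWA-style partial-sum algorithm, not a plain scan).
    while quals and quals[-1] < cutoff:
        seq, quals = seq[:-1], quals[:-1]
    return seq, quals

def adapter_trim(seq, adapter=ADAPTER, min_overlap=MIN_OVERLAP):
    # Exact-match search for a 3' adapter prefix of length >= min_overlap
    # (cutadapt also allows errors and internal matches; omitted here).
    for k in range(len(adapter), min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[: len(seq) - k]
    return seq

insert = "ACGTACGTACGTACGTACGT"                  # 20 nt of real sequence
read = insert + ADAPTER[:10]                     # read ends inside the adapter
quals = [38] * len(insert) + [35] * 5 + [10] * 5  # adapter tail is low quality

# Adapter first: the 10 nt adapter prefix is found and removed cleanly.
print(adapter_trim(read))          # insert only

# Quality first: the low-quality tail is cut, leaving only 5 adapter
# bases -- below MIN_OVERLAP, so the adapter trimmer misses them.
seq_q, _ = quality_trim_3prime(read, quals)
print(adapter_trim(seq_q))         # insert plus 5 leftover adapter bases
```

Whether a real tool shows this failure mode depends on its minimum-overlap setting and on how aggressively the quality trimmer cuts, which is why checking the documented order of operations for each tool is worthwhile.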