Hello,
I'm starting to use cutadapt instead of sickle, and I'm not sure I understand how it works exactly. I tried to use minimum length of 20 bp and minimum quality score of 20, and the results are super different between the three programs.
This is cutadapt output:
cutadapt -q 20 -m 20 -o output_q20_m20.fastq input.fastq
This is cutadapt 1.13 with Python 2.7.9
Command line parameters: -q 20 -m 20 -o output_q20_m20.fastq input.fastq
Trimming 0 adapters with at most 10.0% errors in single-end mode ...
Finished in 12.03 s (3 us/read; 17.89 M reads/minute).
=== Summary ===
Total reads processed: 3,587,045
Reads with adapters: 0 (0.0%)
Reads that were too short: 571,747 (15.9%)
Reads written (passing filters): 3,015,298 (84.1%)
Total basepairs processed: 125,546,575 bp
Quality-trimmed: 21,189,744 bp (16.9%)
Total written (filtered): 98,543,830 bp (78.5%)
This is sickle output:
sickle se --fastq-file input.fastq --qual-type sanger --qual-threshold 20 --length-threshold 20 --output-file output_sickle_q20_m20.fastq
SE input file: input.fastq
Total FastQ records: 3587045
FastQ records kept: 2449731
FastQ records discarded: 1137314
This is BBDuk output:
./bbduk.sh -Xmx1g in=input.fastq out=output_bbduk.fq qtrim=r trimq=20 ml=20 overwrite=true
java -Djava.library.path=/home/rioualen/Desktop/bbmap/jni/ -ea -Xmx1g -Xms1g -cp /home/rioualen/Desktop/bbmap/current/ jgi.BBDukF -Xmx1g in=input.fastq out=output_bbduk.fq qtrim=r trimq=20 ml=20 overwrite=true
Executing jgi.BBDukF [-Xmx1g, in=input.fastq, out=output_bbduk.fq, qtrim=r, trimq=20, ml=20, overwrite=true]
BBDuk version 37.10
Initial:
Memory: max=1029m, free=993m, used=36m
Input is being processed as unpaired
Started output streams: 0.020 seconds.
Processing time: 2.012 seconds.
Input: 3587045 reads 125546575 bases.
QTrimmed: 3156968 reads (88.01%) 68828955 bases (54.82%)
Total Removed: 1592853 reads (44.41%) 68828955 bases (54.82%)
Result: 1994192 reads (55.59%) 56717620 bases (45.18%)
Time: 2.059 seconds.
Reads Processed: 3587k 1742.23k reads/sec
Bases Processed: 125m 60.98m bases/sec
How can there be such a big difference between them, eg 15.9%, 31.7% and 44.4% of reads filtered?
Is cutadapt saying " Trimming 0 adapters with at most 10.0% errors in single-end mode " a bug? Cause I checked the fastq files and it is properly removing bp with a score <20.
Below are links to fastQC results
As an aside, trimming at Q20 is over the top for most applications (ChIPseq, RNAseq, BSseq, variant calling). Trimming around Q5 tends to be more reasonable.
I think you're comparing apples with oranges: cutadapt is supposed to remove adapter sequences from reads, while sickle does quality trimming (scythe does adapter clipping).
Cutadapt does both 5'/3' trimming and adapters removal. Here I just used the options for trimming and specified no adapter. Plus, my sample has very few adapter sequences so it wouldn't make such a difference anyway...
Have you looked at each read to confirm :) On a serious note, I recommend that you try a third option.
bbduk.sh
from BBMap suite. Easy to use. A file with all commercial adapters included in software source. Capable of even removing single basepair contaminants at ends of reads.I tried cutadapt with the adapters removal + same params and it gives a rate of 16.6% of discarded reads. My point is not to analyze this specific sample, but rather understand what's going on here, and if cutadapt is doing what I think it should be doing...
Every program has its own logic and trying to compare the results from two may only be possible if you "looked under the hood" to absolutely make sure that parameters that you think are identical actually are. e.g. perhaps the two programs use different length "seeds" to look for initial adapter matches. Small differences like that can change the final result.
I am familiar with bbduk which uses k-mer matching (and you can specify the length to use). I suggest that you try
bbduk.sh
(no install is needed once you download, only Java) and see how the results compare to two you have already tried.I added BBDuk output cause it's yet again very different, using again minimuum length of 20bp and Q20 for the quality threshold.
I am going to tag Brian Bushnell (author of BBMap suite) so he can provide some additional insight: Brian Bushnell
But from bbduk.sh help:
In first case as soon as average quality in a region drops below that value (you seem to indicate you have Q scores that go down, up and down across the reads, which is odd and seems to indicate that there may have been some issue with this run e.g. air bubble going through the lane) the rest of the read would be trimmed.
With second option the entire read would be considered to have a certain average quality. Can you try the second option?
Also
qtrim=w
will give you windowed quality trimming so that would be more similar to sickle.The quality doesn't seem to be going up and down in general, but I thought it might be the case for a fraction of reads, cause I can't think of another explanation so far... (see screencaps attached)
As for the minavgquality parameter, when I set it to 20 I have even more reads discarded (the quality of the samples is not terrific!)
I edited your original post so the links for the screen caps are now visible. Did you set
minavgquality=20
instead oftrimq=20
since combining those would lead to a different result.Do you have a screencap of "before" any trimming? Can you add that for reference?
Are you using the
adapters.fa
file in theresources
directory with yourbbduk.sh
command?Thanks, I added a screencap before any trimming (like I said, the data is not great but it doesn't matter).
I tried minavgquality=20 instead of trimq=20 but that gets me ~70% of reads trimmed!
I'm doing more tests tomorrow.
"minavgquality=20" does not trim, it simply discards reads with an average quality below 20. I recommend trimming instead. And as Devon points out, 20 is too high for most purposes.
I don't remember what trimming method is used by sickly or cutadapt, but BBDuk uses the phred algorithm, which leaves the longest area with average quality above the limit you specify such that it cannot be extended without adding an area of average quality below the limit you specify.
dear Brian, thanks for your outstanding program. Please allow me to note, that the description of the trimq parameter is a bit obscure on your BBDuk guide page. If you copied the sentence above (or post 17 in your introductory seqanswers thread) to your documentation, you could maybe save a few forum questions or google efforts
Blargh, I stand corrected.
For choosing proper tools to remove adapters from your reads check this article (Table 1).