I'm using Cutadapt on my paired-end FASTQ files to remove primers (the forward primer and the reverse complement of the reverse primer in the forward reads, and the reverse primer and the reverse complement of the forward primer in the reverse reads). This is my command (the {} wildcards are there because Snakemake is my workflow management system):
cutadapt \
{input.fq_f} \
{input.fq_r} \
--front Primer_F=^{params.primer_f} \
-G Primer_R=^{params.primer_r} \
--adapter Rev_com_primer_R={params.revcom_r} \
-A Rev_com_primer_F={params.revcom_f} \
--discard-untrimmed \
--cores {params.nthreads} \
--output {output.fq_trim_f} \
--paired-output {output.fq_trim_r}
I run FastQC on the raw FASTQ files as well as on the trimmed ones, then aggregate the results with MultiQC. What I found is that the "Per tile sequence quality" section goes from green to red after trimming, in all my FASTQ files.
Taking one sample as an example (all samples have the same problem), you can see here the FastQC per tile quality of the raw reads:
And here is the same FASTQ file after trimming primers, now red:
Trying to find an explanation, I looked into the "Per base sequence quality" plots. Here is the one for the raw reads:
And for the trimmed reads:
(I know the quality at the 3' end is poor; I plan to use Sickle as my next step, but first I need to understand why this is happening.)
What I saw is that Cutadapt is in fact removing primers at the 5' end: the first bases shown in the raw plot are gone in the trimmed one. (Since this data comes from fungal ITS, I don't expect the reverse complements of the primers at the 3' end of most reads, but I told Cutadapt about them just in case, since ITS length is variable.)
But if you look at the X-axis, you can see that it ends at the same position as in the raw plot. The boxes look as if they were pushed slightly to the left, and at the far end I no longer see any boxes, yet I still see the blue line representing the mean. If you look at that region, the red line at the bottom indicates that there are bases there. It looks as if bases were "added" to my reads?
The blue mean line lying outside the boxes (I know there are no boxes there, but still) is strange, though I could accept that as a plotting quirk. But the zero-quality bases at the end of my reads after trimming? I don't know why they are there.
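To check this suspicion outside FastQC, here is a minimal sketch (my own; it assumes standard 4-line FASTQ records with Phred+33 quality encoding, and the example records below are invented, not real data) that counts how many reads end in a run of quality-0 bases:

```python
import gzip

def read_fastq(path):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ file (gzipped or not)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        while True:
            record = [fh.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:
                break
            yield tuple(record)

def trailing_q0_run(qual):
    """Length of the run of quality-0 bases ('!' in Phred+33) at the 3' end of a read."""
    run = 0
    for ch in reversed(qual):
        if ch != "!":
            break
        run += 1
    return run

def tally_trailing_q0(records):
    """Number of reads that end in at least one quality-0 base."""
    return sum(1 for _, _, _, qual in records if trailing_q0_run(qual) > 0)

# Invented example records (not real data); for real files, pass read_fastq(path):
example = [
    ("@r1", "ACGTACGT", "+", "IIIIIIII"),  # no trailing Q0 bases
    ("@r2", "ACGTACGT", "+", "IIIII!!!"),  # three trailing Q0 bases
]
print(tally_trailing_q0(example))  # → 1
```

Running `tally_trailing_q0(read_fastq(...))` on the raw and trimmed files would show whether trailing quality-0 bases really appear only after trimming.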
Many thanks in advance
Edit: Following the first comment's suggestion, I double-checked my primer sequences (with the sequencing company and with the paper in which they were described) and their reverse complements (with reverse-complement.com and with a custom function of mine that supports IUPAC codes). All four sequences are right.
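For reference, an IUPAC-aware reverse complement can be sketched like this (a generic illustration, not the exact custom function mentioned above):

```python
# IUPAC complement table: each ambiguity code maps to the code for the
# complements of the bases it represents (e.g. Y = C/T -> R = A/G).
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTUWSMKRYBDHVNacgtuwsmkrybdhvn",
    "TGCAAWSKMYRVHDBNtgcaawskmyrvhdbn",
)

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence, supporting IUPAC ambiguity codes."""
    return seq.translate(IUPAC_COMPLEMENT)[::-1]

# The forward primer from the command above:
print(reverse_complement("GATGAAGAACGYAGYRAA"))  # → TTYRCTRCGTTCTTCATC
```

The output matches the `-A Rev_com_primer_F` sequence in the log below, and the same check works for the reverse primer.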
Also, in case it is useful, I add the Cutadapt log for my example (removing my computer's paths and adding newlines to the command itself for better readability). The 3' adapters were removed approximately 90 times each, so there should be essentially no read-through here:
This is cutadapt 4.6 with Python 3.8.15
Command line parameters:
CON_14_3_ITS_R1.fastq.gz \
CON_14_3_ITS_R2.fastq.gz \
--front Primer_F=^GATGAAGAACGYAGYRAA \
-G Primer_R=^TCCTCCGCTTWTTGWTWTGC \
--adapter Rev_com_primer_R=GCAWAWCAAWAAGCGGAGGA \
-A Rev_com_primer_F=TTYRCTRCGTTCTTCATC \
--discard-untrimmed \
--cores 2 \
--output CON_14_3_ITS_R1_cutadapt.fastq.gz \
--paired-output CON_14_3_ITS_R2_cutadapt.fastq.gz
Processing paired-end reads on 2 cores ...
Finished in 13.370 s (81.387 µs/read; 0.74 M reads/minute).
=== Summary ===
Total read pairs processed: 164,280
Read 1 with adapter: 163,548 (99.6%)
Read 2 with adapter: 163,093 (99.3%)
== Read fate breakdown ==
Pairs discarded as untrimmed: 1,770 (1.1%)
Pairs written (passing filters): 162,510 (98.9%)
Total basepairs processed: 98,403,472 bp
Read 1: 49,191,219 bp
Read 2: 49,212,253 bp
Total written (filtered): 91,174,044 bp (92.7%)
Read 1: 45,737,885 bp
Read 2: 45,436,159 bp
=== First read: Adapter Primer_F ===
Sequence: GATGAAGAACGYAGYRAA; Type: anchored 5'; Length: 18; Trimmed: 163457 times
No. of allowed errors: 1
Overview of removed sequences
length count expect max.err error counts
17 1291 0.0 1 0 1291
18 162074 0.0 1 160312 1762
19 92 0.0 1 0 92
=== First read: Adapter Rev_com_primer_R ===
Sequence: GCAWAWCAAWAAGCGGAGGA; Type: regular 3'; Length: 20; Trimmed: 91 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20 bp: 2
Bases preceding removed adapters:
A: 51.6%
C: 11.0%
G: 20.9%
T: 16.5%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 6 2566.9 0 6
4 4 641.7 0 4
5 1 160.4 0 1
16 1 0.0 1 1
19 4 0.0 1 1 1 2
20 65 0.0 2 56 6 3
22 3 0.0 2 3
23 1 0.0 2 1
25 1 0.0 2 1
27 1 0.0 2 0 0 1
48 1 0.0 2 0 1
55 1 0.0 2 0 0 1
72 1 0.0 2 1
120 1 0.0 2 0 0 1
=== Second read: Adapter Primer_R ===
Sequence: TCCTCCGCTTWTTGWTWTGC; Type: anchored 5'; Length: 20; Trimmed: 163000 times
No. of allowed errors: 2
Overview of removed sequences
length count expect max.err error counts
18 129 0.0 1 0 0 129
19 2560 0.0 1 0 2323 237
20 160099 0.0 2 157267 2492 340
21 212 0.0 2 0 192 20
=== Second read: Adapter Rev_com_primer_F ===
Sequence: TTYRCTRCGTTCTTCATC; Type: regular 3'; Length: 18; Trimmed: 93 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-18 bp: 1
Bases preceding removed adapters:
A: 24.7%
C: 24.7%
G: 28.0%
T: 22.6%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 42 2566.9 0 42
4 27 641.7 0 27
5 7 160.4 0 7
6 2 40.1 0 2
8 1 2.5 0 1
10 1 0.2 1 0 1
17 2 0.0 1 0 2
18 8 0.0 1 3 5
25 1 0.0 1 0 1
55 1 0.0 1 1
79 1 0.0 1 1
More than likely you have library fragments that are shorter than the sequencing length. With 300 cycles you are getting adapter read-through at the 3' end, which leads to poor Q scores.
Normally, if the bad Q scores are due to read-through, they should get cleaned up by proper trimming. So check and make sure that you are using the correct adapter sequences for trimming.
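To illustrate the read-through mechanism: when the fragment is shorter than the read length, the sequencer runs past the insert into the opposite primer, so its reverse complement appears at the 3' end of the read. A toy sketch (the inserts are invented; the primer sequences are the post's, with ambiguity codes resolved arbitrarily):

```python
def simulate_forward_read(primer_f, insert, revcom_primer_r, read_len):
    """R1 as sequenced: forward primer, then insert, then (if the read runs
    past the insert) the reverse complement of the reverse primer."""
    template = primer_f + insert + revcom_primer_r
    return template[:read_len]

# Primer sequences from the post, IUPAC ambiguity codes resolved arbitrarily:
PRIMER_F = "GATGAAGAACGTAGTAAA"
REVCOM_PRIMER_R = "GCATATCAATAAGCGGAGGA"

short_read = simulate_forward_read(PRIMER_F, "ACGT" * 10, REVCOM_PRIMER_R, read_len=100)
long_read = simulate_forward_read(PRIMER_F, "ACGT" * 50, REVCOM_PRIMER_R, read_len=100)

print(REVCOM_PRIMER_R in short_read)  # → True: short fragment, 3' adapter is inside the read
print(REVCOM_PRIMER_R in long_read)   # → False: long fragment, no read-through
```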
Many thanks for your comment. If read-through is happening, shouldn't Cutadapt be taking care of it here? I provided the reverse complements of the primers to the Cutadapt command as 3' adapters (--adapter, -A) for exactly that reason.
Just read the comment update; my problem with that is that even if I were providing the wrong 3' adapter sequence, Cutadapt would not cut at the 3' end, so I should see the same 3' end in raw and trimmed reads. Am I wrong?
Check a random sampling of reads to make sure the adapters are being removed. You may also want to BLAST a few reads to ensure that there is no unexpected contamination.
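One way to do that spot-check (a sketch; the IUPAC-to-regex conversion and the example call are mine, and the file name is only illustrative): convert the anchored primer into a regular expression and count how many of the first N trimmed reads still begin with it:

```python
import gzip
import re
from itertools import islice

# Map IUPAC ambiguity codes to regex character classes.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_regex(primer):
    """Regex matching the primer anchored at the start of a read."""
    return re.compile("^" + "".join(IUPAC_TO_REGEX[b] for b in primer.upper()))

def count_leading_primer(path, primer, n_reads=1000):
    """How many of the first n_reads sequences still begin with the primer."""
    pat = primer_regex(primer)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        seqs = islice(fh, 1, None, 4)  # the sequence is line 2 of each 4-line record
        return sum(bool(pat.match(s)) for s in islice(seqs, n_reads))

# Illustrative call on a trimmed file (should return 0 if trimming worked):
# count_leading_primer("CON_14_3_ITS_R1_cutadapt.fastq.gz", "GATGAAGAACGYAGYRAA")
```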
What kind of data is this?
Something is getting trimmed since the plots are changing. Checking on what remains is the next step.
Fungal ITS2 region (its length is really variable, so read-through can be an issue)
Following your advice, I checked the primer and adapters sequences and edited the question accordingly
Then you have data with poor Q scores at the 3' end. I don't know which workflow you plan to use, but as indicated in the QIIME 2 fungal ITS tutorial ( https://forum.qiime2.org/t/fungal-its-analysis-tutorial/7351 ), subsequent steps (e.g. denoising) may deal with this.
Yes, I plan to use QIIME 2, but I run Cutadapt outside QIIME so it's easier to run MultiQC without importing/unzipping QZAs (and QIIME 2's manifest-file importing changes my filenames, which overwhelms Snakemake). I'll denoise the reads and see what happens. Many thanks for your time!