I'm using Cutadapt on my paired-end FASTQ files to remove primers (the forward primer and the reverse complement of the reverse primer in the forward reads, and the reverse primer and the reverse complement of the forward primer in the reverse reads). This is my command (the {} wildcards are there because Snakemake is my workflow management system):
cutadapt \
{input.fq_f} \
{input.fq_r} \
--front Primer_F=^{params.primer_f} \
-G Primer_R=^{params.primer_r} \
--adapter Rev_com_primer_R={params.revcom_r} \
-A Rev_com_primer_F={params.revcom_f} \
--discard-untrimmed \
--cores {params.nthreads} \
--output {output.fq_trim_f} \
--paired-output {output.fq_trim_r}
I run FastQC on the raw FASTQ files as well as on the trimmed ones, then aggregate the results with MultiQC. What I found is that the "Per tile sequence quality" section goes from green to red after trimming, in all my FASTQ files.
Taking one sample as an example (all samples have the same problem), you can see here the FastQC per tile quality of the raw reads:
And here is the same FASTQ file after trimming primers, now red:
Trying to find an explanation, I looked into the "Per base sequence quality" plots. Here is the one for the raw reads:
And for the trimmed reads:
(I know the quality at the 3' end is poor; I plan to use Sickle as my next step, but first I need to understand why this is happening.)
What I saw is that Cutadapt is in fact removing primers at the 5' end: the first bases shown in the raw plot are gone in the trimmed one. (Since this data comes from fungal ITS, I don't expect the reverse complements of the primers at the 3' end of most reads, but I told Cutadapt about them just in case, since ITS length is variable.)
But if you look at the X-axis, you can see that it ends at the same position as in the raw plot. The boxes look as if they were pushed slightly to the left, and at the far end I no longer see any boxes, yet I still see the blue line representing the mean. If you look at that region, the red line at the bottom indicates that there are bases there. It looks as if bases were "added" to my reads?
The blue mean line lying outside the boxes (I know there are no boxes there, but still) is strange, though I could accept that as a plotting quirk. But the zero-quality bases at the end of my reads after trimming? I don't know why they are there.
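To check this suspicion outside FastQC, here is a minimal sketch (my own; it assumes standard 4-line FASTQ records with Phred+33 quality encoding, and the example records below are invented, not real data) that counts how many reads end in a run of quality-0 bases:

```python
import gzip

def read_fastq(path):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ file (gzipped or not)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        while True:
            record = [fh.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:
                break
            yield tuple(record)

def trailing_q0_run(qual):
    """Length of the run of quality-0 bases ('!' in Phred+33) at the 3' end of a read."""
    run = 0
    for ch in reversed(qual):
        if ch != "!":
            break
        run += 1
    return run

def tally_trailing_q0(records):
    """Number of reads that end in at least one quality-0 base."""
    return sum(1 for _, _, _, qual in records if trailing_q0_run(qual) > 0)

# Invented example records (not real data); for real files, pass read_fastq(path):
example = [
    ("@r1", "ACGTACGT", "+", "IIIIIIII"),  # no trailing Q0 bases
    ("@r2", "ACGTACGT", "+", "IIIII!!!"),  # three trailing Q0 bases
]
print(tally_trailing_q0(example))  # → 1
```

Running `tally_trailing_q0(read_fastq(...))` on the raw and trimmed files would show whether trailing quality-0 bases really appear only after trimming.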
Many thanks in advance
Edit: Following the first comment's suggestion, I double-checked my primer sequences (with the sequencing company and with the paper in which they were described) and their reverse complements (with reverse-complement.com and with a custom function of mine that supports IUPAC codes). All four sequences are right.
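For reference, an IUPAC-aware reverse complement can be sketched like this (a generic illustration, not the exact custom function mentioned above):

```python
# IUPAC complement table: each ambiguity code maps to the code for the
# complements of the bases it represents (e.g. Y = C/T -> R = A/G).
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTUWSMKRYBDHVNacgtuwsmkrybdhvn",
    "TGCAAWSKMYRVHDBNtgcaawskmyrvhdbn",
)

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence, supporting IUPAC ambiguity codes."""
    return seq.translate(IUPAC_COMPLEMENT)[::-1]

# The forward primer from the command above:
print(reverse_complement("GATGAAGAACGYAGYRAA"))  # → TTYRCTRCGTTCTTCATC
```

The output matches the `-A Rev_com_primer_F` sequence in the log below, and the same check works for the reverse primer.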
Also, in case it is useful, I add the Cutadapt log for my example (removing my computer's paths and adding newlines to the command itself for better readability). The 3' adapters were removed approximately 90 times each, so there should be essentially no read-through here:
This is cutadapt 4.6 with Python 3.8.15
Command line parameters:
CON_14_3_ITS_R1.fastq.gz \
CON_14_3_ITS_R2.fastq.gz \
--front Primer_F=^GATGAAGAACGYAGYRAA \
-G Primer_R=^TCCTCCGCTTWTTGWTWTGC \
--adapter Rev_com_primer_R=GCAWAWCAAWAAGCGGAGGA \
-A Rev_com_primer_F=TTYRCTRCGTTCTTCATC \
--discard-untrimmed \
--cores 2 \
--output CON_14_3_ITS_R1_cutadapt.fastq.gz \
--paired-output CON_14_3_ITS_R2_cutadapt.fastq.gz
Processing paired-end reads on 2 cores ...
Finished in 13.370 s (81.387 µs/read; 0.74 M reads/minute).
=== Summary ===
Total read pairs processed: 164,280
Read 1 with adapter: 163,548 (99.6%)
Read 2 with adapter: 163,093 (99.3%)
== Read fate breakdown ==
Pairs discarded as untrimmed: 1,770 (1.1%)
Pairs written (passing filters): 162,510 (98.9%)
Total basepairs processed: 98,403,472 bp
Read 1: 49,191,219 bp
Read 2: 49,212,253 bp
Total written (filtered): 91,174,044 bp (92.7%)
Read 1: 45,737,885 bp
Read 2: 45,436,159 bp
=== First read: Adapter Primer_F ===
Sequence: GATGAAGAACGYAGYRAA; Type: anchored 5'; Length: 18; Trimmed: 163457 times
No. of allowed errors: 1
Overview of removed sequences
length count expect max.err error counts
17 1291 0.0 1 0 1291
18 162074 0.0 1 160312 1762
19 92 0.0 1 0 92
=== First read: Adapter Rev_com_primer_R ===
Sequence: GCAWAWCAAWAAGCGGAGGA; Type: regular 3'; Length: 20; Trimmed: 91 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20 bp: 2
Bases preceding removed adapters:
A: 51.6%
C: 11.0%
G: 20.9%
T: 16.5%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 6 2566.9 0 6
4 4 641.7 0 4
5 1 160.4 0 1
16 1 0.0 1 1
19 4 0.0 1 1 1 2
20 65 0.0 2 56 6 3
22 3 0.0 2 3
23 1 0.0 2 1
25 1 0.0 2 1
27 1 0.0 2 0 0 1
48 1 0.0 2 0 1
55 1 0.0 2 0 0 1
72 1 0.0 2 1
120 1 0.0 2 0 0 1
=== Second read: Adapter Primer_R ===
Sequence: TCCTCCGCTTWTTGWTWTGC; Type: anchored 5'; Length: 20; Trimmed: 163000 times
No. of allowed errors: 2
Overview of removed sequences
length count expect max.err error counts
18 129 0.0 1 0 0 129
19 2560 0.0 1 0 2323 237
20 160099 0.0 2 157267 2492 340
21 212 0.0 2 0 192 20
=== Second read: Adapter Rev_com_primer_F ===
Sequence: TTYRCTRCGTTCTTCATC; Type: regular 3'; Length: 18; Trimmed: 93 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-18 bp: 1
Bases preceding removed adapters:
A: 24.7%
C: 24.7%
G: 28.0%
T: 22.6%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 42 2566.9 0 42
4 27 641.7 0 27
5 7 160.4 0 7
6 2 40.1 0 2
8 1 2.5 0 1
10 1 0.2 1 0 1
17 2 0.0 1 0 2
18 8 0.0 1 3 5
25 1 0.0 1 0 1
55 1 0.0 1 1
79 1 0.0 1 1
More than likely you have library fragments that are shorter than the sequencing length. With 300 cycles you are getting adapter read-through at the 3' end, which leads to poor Q scores.
Normally, if the bad Q scores are due to read-through, they should get cleaned up by proper trimming. So check and make sure that you are using the correct adapter sequences for trimming.
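To illustrate the read-through mechanism: when the fragment is shorter than the read length, the sequencer runs past the insert into the opposite primer, so its reverse complement appears at the 3' end of the read. A toy sketch (the inserts are invented; the primer sequences are the post's, with ambiguity codes resolved arbitrarily):

```python
def simulate_forward_read(primer_f, insert, revcom_primer_r, read_len):
    """R1 as sequenced: forward primer, then insert, then (if the read runs
    past the insert) the reverse complement of the reverse primer."""
    template = primer_f + insert + revcom_primer_r
    return template[:read_len]

# Primer sequences from the post, IUPAC ambiguity codes resolved arbitrarily:
PRIMER_F = "GATGAAGAACGTAGTAAA"
REVCOM_PRIMER_R = "GCATATCAATAAGCGGAGGA"

short_read = simulate_forward_read(PRIMER_F, "ACGT" * 10, REVCOM_PRIMER_R, read_len=100)
long_read = simulate_forward_read(PRIMER_F, "ACGT" * 50, REVCOM_PRIMER_R, read_len=100)

print(REVCOM_PRIMER_R in short_read)  # → True: short fragment, 3' adapter is inside the read
print(REVCOM_PRIMER_R in long_read)   # → False: long fragment, no read-through
```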
Many thanks for your comment. If read-through is happening, shouldn't Cutadapt be taking care of it here? I provided the reverse complements of the primers to the Cutadapt command as 3' adapters (--adapter, -A) for exactly that reason.
Just read the comment update; my problem with that is that even if I were providing the wrong 3' adapter sequence, Cutadapt would not cut at the 3' end, so I should see the same 3' end in raw and trimmed reads. Am I wrong?
Check a random sampling of reads to make sure the adapters are being removed. You may also want to BLAST a few reads to ensure that there is no unexpected contamination.
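One way to do that spot-check (a sketch; the IUPAC-to-regex conversion and the example call are mine, and the file name is only illustrative): convert the anchored primer into a regular expression and count how many of the first N trimmed reads still begin with it:

```python
import gzip
import re
from itertools import islice

# Map IUPAC ambiguity codes to regex character classes.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_regex(primer):
    """Regex matching the primer anchored at the start of a read."""
    return re.compile("^" + "".join(IUPAC_TO_REGEX[b] for b in primer.upper()))

def count_leading_primer(path, primer, n_reads=1000):
    """How many of the first n_reads sequences still begin with the primer."""
    pat = primer_regex(primer)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        seqs = islice(fh, 1, None, 4)  # the sequence is line 2 of each 4-line record
        return sum(bool(pat.match(s)) for s in islice(seqs, n_reads))

# Illustrative call on a trimmed file (should return 0 if trimming worked):
# count_leading_primer("CON_14_3_ITS_R1_cutadapt.fastq.gz", "GATGAAGAACGYAGYRAA")
```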
What kind of data is this?
Something is getting trimmed since the plots are changing. Checking on what remains is the next step.
Fungal ITS2 region (its length is really variable, so read-through can be an issue)
Following your advice, I checked the primer and adapters sequences and edited the question accordingly
Then you have data with poor Q scores at the 3' end. I don't know which workflow you plan to use, but as indicated in the QIIME 2 fungal ITS tutorial ( https://forum.qiime2.org/t/fungal-its-analysis-tutorial/7351 ), subsequent steps (e.g. denoising) may deal with this.
Yes, I plan to use QIIME 2, but I run Cutadapt outside QIIME so it's easier to run MultiQC without importing/unzipping QZAs (and QIIME 2's manifest-file importing changes my filenames, which overwhelms Snakemake). I'll denoise the reads and see what happens. Many thanks for your time!