Question

Cutadapt not trimming adapters in PE mode

0

Entering edit mode

8.0 years ago

tlorin ▴ 370

Hello all,

Please apologies if this is a trivial question regarding cutadapt but it is the first time I am running this program.

I have 2 PE Illumina Hiseq4000 files (100nt) with many many reads. I take a subset of 100,000 sequences for left and right reads. The sequencing platform told me the forward adapter is AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC and the reverse is AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT.

If I do a grep AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC subset.R1.fastq I get 7 results (subset of 100,000 sequences):

AGGGTTGCATTGGCCCTCTGGCTCTTGAAGTTCAAAAAGCCAGTCTGAATGATCCTCTTCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCG TCGGTATGGTGATGCATTTCGTGTTGACACTCTGGGTGGTGATGGCTTTCTCCAGTTCGTCCAGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACG CTGGGCTCAACCTCCAGGGTGATGGTCTTACCAGTGAGTGTCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGTATGCCGT GGTGTCACTTGGCTCCACCTCCAGGGTGATGGTCTTACCAGTGAGGGTCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGT AGGGCCTGGAGCCTGGGGCCTGGGGTCCGAGGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACACCTCGTATGCCGGCCTCTGCTTGAA GTTACAGTGTGTGTGTGTTACAGTGTGTGTGTGTTACTGTGTGTGTGTGTGTTACAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACA GGTGGTTGCTGTTGTCCCAGATCTGAGTGTTGTGCCACAGCACCATCTTCAGCTTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACAC

If I do run grep AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT subset.R2.fastq I get 4 results:

CCTCGGACCCCAGGCCCCAGGCTCCAGGCCCTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAA AGGTGATGTGGACGTGTCAGTTCCAACAGTTGAGGATGACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAA AAATGCCATCCTTTGGGGTTAAAGGTCCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAA CAAGGAGGGCATTCCCCCTGACCAGCAGAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAC

I want to trim these reads so I run cutadapt (according to the documentation and to this post).

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o subset.R1.trim.fastq -p subset.R2.trim.fastq subset.R1.fastq subset.R2.fastq

I would expect cutadapt to trim at least 7 sequences in the left reads file and 4 in the right reads file (and even more including that the adapter sequence does not need to be full in the read).

Here is cutadapt output:

Trimming 2 adapters with at most 10.0% errors in paired-end legacy mode ...
Finished in 4.89 s (49 us/read; 1.23 M reads/minute).

=== Summary ===

Total read pairs processed:            100,000
  Read 1 with adapter:                   3,363 (3.4%)
  Read 2 with adapter:                       0 (0.0%)
Pairs written (passing filters):       100,000 (100.0%)

Total basepairs processed:    20,000,000 bp
  Read 1:    10,000,000 bp
  Read 2:    10,000,000 bp
Total written (filtered):     19,987,225 bp (99.9%)
  Read 1:     9,987,225 bp
  Read 2:    10,000,000 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 34; Trimmed: 2425 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3

Bases preceding removed adapters:
  A: 21.4%
  C: 38.6%
  G: 23.8%
  T: 16.2%
  none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err error counts
3   1739    1562.5  0   1739
4   415 390.6   0   415
5   142 97.7    0   142
6   34  24.4    0   34
7   4   6.1 0   4
8   9   1.5 0   9
9   10  0.4 0   7 3
10  6   0.1 1   3 3
11  4   0.0 1   4
12  5   0.0 1   5
13  7   0.0 1   7
14  5   0.0 1   5
15  1   0.0 1   1
16  2   0.0 1   2
17  6   0.0 1   6
18  1   0.0 1   1
19  1   0.0 1   1
20  3   0.0 2   2 1
21  2   0.0 2   1 0 1
22  2   0.0 2   2
23  2   0.0 2   2
24  2   0.0 2   2
25  1   0.0 2   1
27  1   0.0 2   1
28  1   0.0 2   1
29  3   0.0 2   3
30  3   0.0 3   1 2
32  4   0.0 3   3 0 0 1
35  1   0.0 3   1
39  1   0.0 3   1
43  1   0.0 3   1
44  2   0.0 3   1 1
45  1   0.0 3   0 1
49  1   0.0 3   1
53  1   0.0 3   0 0 1
56  1   0.0 3   1
68  1   0.0 3   1

=== First read: Adapter 2 ===

Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT; Type: regular 5'; Length: 58; Trimmed: 938 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-58 bp: 5

Overview of removed sequences
length  count   expect  max.err error counts
3   686 1562.5  0   686
4   210 390.6   0   210
5   15  97.7    0   15
6   13  24.4    0   13
7   3   6.1 0   3
8   1   1.5 0   1
9   1   0.4 0   0 1
10  6   0.1 1   0 6
11  3   0.0 1   0 3

So I understand that at least 2425 reads should be trimmed in the left read file and 938 in the right read file. This seems plausible. However, if I count reads in the output files I get 100,000 reads for both output files, so it seems that no read has been trimmed!

Any idea of what is occurring? Is the command line correct?

Many thanks for your answers!

cutadapt RNA-Seq • 5.3k views

ADD COMMENT • link updated 8.0 years ago by h.mon 35k • written 8.0 years ago by tlorin ▴ 370

1

Entering edit mode

Trimming does not mean removing. A trimmed read will still be present in the output, it will just be shorter. That said, it's weird that read 2 never got trimmed. For normal paired libraries, read 1 and read 2 will have adapter sequence at the same position, so in this case the adapter trimming is not working correctly. I suggest you try this approach instead.

ADD REPLY • link 8.0 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks @Brian! It's much clearer. So if I understand correctly, I should run bbduk.sh in1=subset.R1.fastq in2=subset.R2.fastq out1=subset.R1.trim.fastq out2=subset.R2.trim.fastq literal=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT ktrim=r k=23 mink=11 hdist=1 tpe tbo.

This gives the following output.

BBDuk version 37.31
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=12875m, free=12674m, used=201m

Added 5667 kmers; time:     0.071 seconds.
Memory: max=12875m, free=12203m, used=672m

Input is being processed as paired
Started output streams: 0.019 seconds.
Processing time:        0.857 seconds.

Input:                      200000 reads        20000000 bases.
KTrimmed:                   212 reads (0.11%)   6266 bases (0.03%)
Trimmed by overlap:         242 reads (0.12%)   2058 bases (0.01%)
Total Removed:              8 reads (0.00%)     8324 bases (0.04%)
Result:                     199992 reads (100.00%)  19991676 bases (99.96%)

Time:               0.959 seconds.
Reads Processed:        200k    208.64k reads/sec
Bases Processed:      20000k    20.86m bases/sec

Read distribution is as follows:

awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'  subset.R1.trim.fastq | sort -k1 -n -r | head
100 99773
99 20
98 17
97 12
96 13
94 11
89 11
95 7
93 6
92 9

ADD REPLY • link 8.0 years ago by tlorin ▴ 370

1

Entering edit mode

It's an old post, but I think the correct options for R1 and R2 are -a and -A ... not -a and -g ... both would be applied to R1... capitalised options -A or -G are applied to R2.

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o subset.R1.trim.fastq -p subset.R2.trim.fastq subset.R1.fastq subset.R2.fastq

ADD REPLY • link 5.8 years ago by Oliver Slay ▴ 60

score 1 · Answer 1 · 2017-06-02

Check the range of read length before and after cutadapt: trimming means the adapter sequences will be removed from within the reads, leaving the remaining bit of the read shorter.

I would be concerned about the asymmetrical trimming between R1 and R2, you should try Trimmomatic or BBDuk, as they will trim both reads symmetrically even if the adapter is recognized only in one read.

edit: rephrased trimming explanation, after reading a clearer one just above.