Hello all,
Please apologies if this is a trivial question regarding cutadapt but it is the first time I am running this program.
I have 2 PE Illumina Hiseq4000 files (100nt) with many many reads. I take a subset of 100,000 sequences for left and right reads. The sequencing platform told me the forward adapter is AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
and the reverse is AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
.
If I do a grep AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC subset.R1.fastq
I get 7 results (subset of 100,000 sequences):
AGGGTTGCATTGGCCCTCTGGCTCTTGAAGTTCAAAAAGCCAGTCTGAATGATCCTCTTCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCG TCGGTATGGTGATGCATTTCGTGTTGACACTCTGGGTGGTGATGGCTTTCTCCAGTTCGTCCAGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACG CTGGGCTCAACCTCCAGGGTGATGGTCTTACCAGTGAGTGTCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGTATGCCGT GGTGTCACTTGGCTCCACCTCCAGGGTGATGGTCTTACCAGTGAGGGTCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGT AGGGCCTGGAGCCTGGGGCCTGGGGTCCGAGGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACACCTCGTATGCCGGCCTCTGCTTGAA GTTACAGTGTGTGTGTGTTACAGTGTGTGTGTGTTACTGTGTGTGTGTGTGTTACAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACA GGTGGTTGCTGTTGTCCCAGATCTGAGTGTTGTGCCACAGCACCATCTTCAGCTTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACAC
If I do run grep AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT subset.R2.fastq
I get 4 results:
CCTCGGACCCCAGGCCCCAGGCTCCAGGCCCTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAA AGGTGATGTGGACGTGTCAGTTCCAACAGTTGAGGATGACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAA AAATGCCATCCTTTGGGGTTAAAGGTCCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAA CAAGGAGGGCATTCCCCCTGACCAGCAGAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAC
I want to trim these reads so I run cutadapt (according to the documentation and to this post).
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o subset.R1.trim.fastq -p subset.R2.trim.fastq subset.R1.fastq subset.R2.fastq
I would expect cutadapt to trim at least 7 sequences in the left reads file and 4 in the right reads file (and even more including that the adapter sequence does not need to be full in the read).
Here is cutadapt output:
Trimming 2 adapters with at most 10.0% errors in paired-end legacy mode ...
Finished in 4.89 s (49 us/read; 1.23 M reads/minute).
=== Summary ===
Total read pairs processed: 100,000
Read 1 with adapter: 3,363 (3.4%)
Read 2 with adapter: 0 (0.0%)
Pairs written (passing filters): 100,000 (100.0%)
Total basepairs processed: 20,000,000 bp
Read 1: 10,000,000 bp
Read 2: 10,000,000 bp
Total written (filtered): 19,987,225 bp (99.9%)
Read 1: 9,987,225 bp
Read 2: 10,000,000 bp
=== First read: Adapter 1 ===
Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 34; Trimmed: 2425 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3
Bases preceding removed adapters:
A: 21.4%
C: 38.6%
G: 23.8%
T: 16.2%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 1739 1562.5 0 1739
4 415 390.6 0 415
5 142 97.7 0 142
6 34 24.4 0 34
7 4 6.1 0 4
8 9 1.5 0 9
9 10 0.4 0 7 3
10 6 0.1 1 3 3
11 4 0.0 1 4
12 5 0.0 1 5
13 7 0.0 1 7
14 5 0.0 1 5
15 1 0.0 1 1
16 2 0.0 1 2
17 6 0.0 1 6
18 1 0.0 1 1
19 1 0.0 1 1
20 3 0.0 2 2 1
21 2 0.0 2 1 0 1
22 2 0.0 2 2
23 2 0.0 2 2
24 2 0.0 2 2
25 1 0.0 2 1
27 1 0.0 2 1
28 1 0.0 2 1
29 3 0.0 2 3
30 3 0.0 3 1 2
32 4 0.0 3 3 0 0 1
35 1 0.0 3 1
39 1 0.0 3 1
43 1 0.0 3 1
44 2 0.0 3 1 1
45 1 0.0 3 0 1
49 1 0.0 3 1
53 1 0.0 3 0 0 1
56 1 0.0 3 1
68 1 0.0 3 1
=== First read: Adapter 2 ===
Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT; Type: regular 5'; Length: 58; Trimmed: 938 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-58 bp: 5
Overview of removed sequences
length count expect max.err error counts
3 686 1562.5 0 686
4 210 390.6 0 210
5 15 97.7 0 15
6 13 24.4 0 13
7 3 6.1 0 3
8 1 1.5 0 1
9 1 0.4 0 0 1
10 6 0.1 1 0 6
11 3 0.0 1 0 3
So I understand that at least 2425 reads should be trimmed in the left read file and 938 in the right read file. This seems plausible. However, if I count reads in the output files I get 100,000 reads for both output files, so it seems that no read has been trimmed!
Any idea of what is occurring? Is the command line correct?
Many thanks for your answers!
Trimming does not mean removing. A trimmed read will still be present in the output, it will just be shorter. That said, it's weird that read 2 never got trimmed. For normal paired libraries, read 1 and read 2 will have adapter sequence at the same position, so in this case the adapter trimming is not working correctly. I suggest you try this approach instead.
Thanks @Brian! It's much clearer. So if I understand correctly, I should run
bbduk.sh in1=subset.R1.fastq in2=subset.R2.fastq out1=subset.R1.trim.fastq out2=subset.R2.trim.fastq literal=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT ktrim=r k=23 mink=11 hdist=1 tpe tbo
.This gives the following output.
Read distribution is as follows:
It's an old post, but I think the correct options for R1 and R2 are -a and -A ... not -a and -g ... both would be applied to R1... capitalised options -A or -G are applied to R2.