I have the pipeline with the following trimming and mapping parameter:
trim_galore --stringency 5 --dont_gzip --trim1 --length 30 -q 0 -a CTGTCTCTTATACACATCT -o $TRIMGALORE_DIR $READ1 $READ2
bowtie $Bowtie_Index_File -1 $TRIMGALORE_R1 -2 $TRIMGALORE_R2 --threads 5 -m 1 -v 2 -S -I 0 -X 2000
Raw paired-end mapping (without adapter trimming) gives this result
**********************************************
Stats for BAM file(s):
**********************************************
Total reads: 169555726
Mapped reads: 97870984 (57.722%)
Forward strand: 120620234 (71.139%)
Reverse strand: 48935492 (28.861%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 169555726 (100%)
'Proper-pairs': 97870984 (57.722%)
Both pairs mapped: 97870984 (57.722%)
Read 1: 84777863
Read 2: 84777863
Singletons: 0 (0%)
Average insert size (absolute value): 125.862
Median insert size (absolute value): 99
But after running it with Trim_Galore I get this:
**********************************************
Stats for BAM file(s):
**********************************************
Total reads: 168972000
Mapped reads: 314228 (0.185965%)
Forward strand: 168814886 (99.907%)
Reverse strand: 157114 (0.0929823%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 168972000 (100%)
'Proper-pairs': 314228 (0.185965%)
Both pairs mapped: 314228 (0.185965%)
Read 1: 84486000
Read 2: 84486000
Singletons: 0 (0%)
Average insert size (absolute value): 571.141
Median insert size (absolute value): 297
So the mapping rates drop radically from 57% to 0.18%.
I find it strange because the trim galore only use trim away very little sequence. Why was it and what's the right parameter should I use for both Bowtie and Cutadapt/trim_galore?
SUMMARISING RUN PARAMETERS
==========================
Input filename: /home/ubuntu/R1_001.fastq
Trimming mode: single-end
Trim Galore version: 0.4.1
Cutadapt version: 1.12
Quality Phred score cutoff: 0
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATACACATCT' ()
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 5 bp
Minimum required sequence length before a sequence gets removed: 30 bp
All sequences will be trimmed by 1 bp on their 3' end to avoid problems with invalid paired-end alignments with Bowtie 1
Writing final adapter and quality trimmed output to BB_89_S1_R1_001_trimmed.fq
>>> Now performing quality (cutoff 0) and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATACACATCT' from file /home/ubuntu/R1_001.fastq <<<
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
50000000 sequences processed
60000000 sequences processed
70000000 sequences processed
80000000 sequences processed
This is cutadapt 1.12 with Python 2.7.13
Command line parameters: -f fastq -e 0.1 -q 0 -O 5 -a CTGTCTCTTATACACATCT /home/ubuntu/R1_001.fastq
Trimming 1 adapter with at most 10.0% errors in single-end mode ...
Finished in 804.75 s (9 us/read; 6.32 M reads/minute).
=== Summary ===
Total reads processed: 84,777,863
Reads with adapters: 424,374 (0.5%)
Reads written (passing filters): 84,777,863 (100.0%)
Total basepairs processed: 3,052,003,068 bp
Quality-trimmed: 0 bp (0.0%)
Total written (filtered): 3,048,935,811 bp (99.9%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATACACATCT; Type: regular 3'; Length: 19; Trimmed: 424374 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1
Bases preceding removed adapters:
A: 13.7%
C: 36.7%
G: 25.3%
T: 24.2%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
5 132511 82790.9 0 132511
6 72231 20697.7 0 72231
7 50748 5174.4 0 50748
8 43106 1293.6 0 43106
9 53661 323.4 0 50630 3031
10 50660 80.9 1 44674 5986
11 8616 20.2 1 6320 2296
12 2894 5.1 1 1601 1293
13 2928 1.3 1 2706 222
14 995 0.3 1 906 89
15 2681 0.1 1 2570 111
16 620 0.0 1 598 22
17 1622 0.0 1 1578 44
18 263 0.0 1 255 8
19 460 0.0 1 446 14
20 86 0.0 1 84 2
21 107 0.0 1 103 4
22 49 0.0 1 42 7
23 28 0.0 1 23 5
24 11 0.0 1 10 1
25 8 0.0 1 7 1
26 7 0.0 1 6 1
27 23 0.0 1 22 1
28 8 0.0 1 8
29 1 0.0 1 1
33 2 0.0 1 1 1
35 1 0.0 1 0 1
36 47 0.0 1 45 2
RUN STATISTICS FOR INPUT FILE: /home/ubuntu/R1_001.fastq
=============================================
84777863 sequences processed in total
Sequences removed because they became shorter than the length cutoff of 30 bp: 291863 (0.3%)
Writing report to 'trim_galore_dir/R2_001.fastq_trimming_report.txt'
SUMMARISING RUN PARAMETERS
==========================
Input filename: /home/ubuntu/R2_001.fastq
Trimming mode: single-end
Trim Galore version: 0.4.1
Cutadapt version: 1.12
Quality Phred score cutoff: 0
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATACACATCT' ()
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 5 bp
Minimum required sequence length before a sequence gets removed: 30 bp
All sequences will be trimmed by 1 bp on their 3' end to avoid problems with invalid paired-end alignments with Bowtie 1
Writing final adapter and quality trimmed output to BB_89_S1_R2_001_trimmed.fq
>>> Now performing quality (cutoff 0) and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATACACATCT' from file /home/ubuntu/R2_001.fastq <<<
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
50000000 sequences processed
60000000 sequences processed
70000000 sequences processed
80000000 sequences processed
This is cutadapt 1.12 with Python 2.7.13
Command line parameters: -f fastq -e 0.1 -q 0 -O 5 -a CTGTCTCTTATACACATCT /home/ubuntu/R2_001.fastq
Trimming 1 adapter with at most 10.0% errors in single-end mode ...
Finished in 814.43 s (10 us/read; 6.25 M reads/minute).
=== Summary ===
Total reads processed: 84,777,863
Reads with adapters: 423,690 (0.5%)
Reads written (passing filters): 84,777,863 (100.0%)
Total basepairs processed: 3,052,003,068 bp
Quality-trimmed: 0 bp (0.0%)
Total written (filtered): 3,048,945,359 bp (99.9%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATACACATCT; Type: regular 3'; Length: 19; Trimmed: 423690 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1
Bases preceding removed adapters:
A: 13.7%
C: 36.1%
G: 25.4%
T: 24.8%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
5 132783 82790.9 0 132783
6 72585 20697.7 0 72585
7 50354 5174.4 0 50354
8 43191 1293.6 0 43191
9 53096 323.4 0 49979 3117
10 50491 80.9 1 44467 6024
11 8598 20.2 1 6158 2440
12 2915 5.1 1 1566 1349
13 2880 1.3 1 2625 255
14 959 0.3 1 865 94
15 2597 0.1 1 2493 104
16 601 0.0 1 570 31
17 1573 0.0 1 1520 53
18 256 0.0 1 240 16
19 447 0.0 1 430 17
20 83 0.0 1 77 6
21 101 0.0 1 87 14
22 46 0.0 1 40 6
23 28 0.0 1 26 2
24 12 0.0 1 10 2
25 8 0.0 1 7 1
26 7 0.0 1 6 1
27 23 0.0 1 20 3
28 8 0.0 1 7 1
29 1 0.0 1 1
33 2 0.0 1 1 1
36 45 0.0 1 44 1
Have you checked a random selection of reads by blasting them at NCBI to make sure they align to the right genome?
Initially 100% of your reads that mapped, mapped as proper pairs. That's very odd. How are you doing the mapping? It appears that you are excluding unpaired reads from being considered mapped, and that the order got messed up during trimming so that the reads are no longer paired.
@BrianBushnell do you mean mapping before trimming?
bowtie $Bowtie_Index_File -1 $READ1 -2 $READ2 --threads 5 -m 1 -v 2 -S -I 0 -X 2000
that's READ before trimming.