Question

Bowtie2 Alignment with low percentage of properly paired reads (ATAC-seq)

1

Entering edit mode

2.2 years ago

zau saa ▴ 150

Hello! Recently I've downloaded ATAC-seq fastq files from ENCODE for practice. However, after I aligned reads to hg38, samtools flagstat reported that the percentage of properly paired reads is only about 48%. However, the QC information of the corresponding bam file in ENCODE shows that the percentage of properly paired reads is about 90%. I wonder where I made a mistake. Here is my process:

# Firstly, I trim reads by trimmomatic (ver. 0.39)
java -jar $trim/trimmomatic-0.39.jar PE -threads 10 R1.fastq.gz R2.fastq.gz \ 
R1_trim.fastq.gz R1_trim_unpaired.fastq.gz R2_trim.fastq.gz R2_trim_unpaired.fastq.gz \
ILLUMINACLIP:$trim/adapters/NexteraPE-PE.fa:2:30:10:5:True LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30 HEADCROP:13 

# Secondly, I use bowtie2 to align reads to hg38
bowtie2 --very-sensitive -p 8 -X 2000 -x $index -1 R1_trim.fastq.gz -2 R2_trim.fastq.gz | \
samtools view -Sb | samtools sort -l 4 -o $align/sort.bam

# Thirdly, I use samtools flagstat for statistics
samtools flagstat sort.bam

Results:

82500300 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
77243188 + 0 mapped (93.63% : N/A) *
82500300 + 0 paired in sequencing
41250150 + 0 read1
41250150 + 0 read2
39107164 + 0 properly paired (47.40% : N/A) *
77047062 + 0 with itself and mate mapped
196126 + 0 singletons (0.24% : N/A)
3938892 + 0 with mate mapped to a different chr
766863 + 0 with mate mapped to a different chr (mapQ>=5)

trimmomatic Bowtie2 ENCODE ATAC-seq • 2.3k views

ADD COMMENT • link 2.2 years ago by zau saa ▴ 150

0

Entering edit mode

Using --very-sensitive in bowtie2 could be the reason? Did they also use that in their pipeline? Because otherwise, the default is --sensitive.

ADD REPLY • link 2.2 years ago by Marco Pannone ▴ 810

0

Entering edit mode

I think it's not the reason. The pipeline I refer to use this parameter.

ADD REPLY • link 2.2 years ago by zau saa ▴ 150

0

Entering edit mode

I would remove mitochondrial reads first and then check how stats are. In older protocols mt contamination in ATAC-seq was quite a thing, sometimes accounting for > 80% of all reads, but mapping stats of the mt genome are largely irrelevant.

ADD REPLY • link 2.2 years ago by ATpoint 86k

0

Entering edit mode

I remove mitochondrial reads by removeChrom and PCR duplicated reads by picard . however, properly paired rate remains low.

# remove PCR duplicated reads by picard
picard MarkDuplicates -I sort.bam -O sort_rmdup.bam -M dups.txt --REMOVE_DUPLICATES

# remove mitochondrial reads by removeChrom.py 
samtools view -h sort_rmdup.bam | python3 removeChrom.py - - chrM | \
samtools view -Sb --threads 5 > sort_rmdup_chrM.bam

result:

63913762 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
52104631 + 0 mapped (81.52% : N/A)*
63913762 + 0 paired in sequencing
31944520 + 0 read1
31969242 + 0 read2
29772366 + 0 properly paired (46.58% : N/A)*
51816674 + 0 with itself and mate mapped
287957 + 0 singletons (0.45% : N/A)
1713866 + 0 with mate mapped to a different chr
293512 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLY • link 2.2 years ago by zau saa ▴ 150

0

Entering edit mode

Then I don't not know, sorry. I quickly checked some of our data that I could immediately access (mouse) and they usually ranged in the ~95% mapped and 80-90% properly-paired range with the --very-sensitive flag. Mostly hhematopoietic and bone marrow cells.

Likely you did not do any mistake. It is always tricky to use public data. What cell type is that?

ADD REPLY • link 2.2 years ago by ATpoint 86k

1

Entering edit mode

I finally solved the problem! Following a tutorial with a 95% properly paired rate(mouse), I used trim_galore to trim adapters. However, differing from the process I aforementioned, I don't cut bases in front of the reads. Then, the properly paired rate of the alignment increases to 89%, which is similar to the bam given by ENCODE. Since the performance of trim_galore and trimmomatic is comparable, the key to the low properly paired rate should be that I cut 13 bases in front of the reads. I can't understand why the hardtrim process leads to a low properly paired rate, though.

ADD REPLY • link 2.2 years ago by zau saa ▴ 150

0

Entering edit mode

It's naive B primary cell. https://www.encodeproject.org/experiments/ENCSR685OFR/ It's strange that the report of its unflitered bam shows that the percentage of properly paired reads passing QC is 89.3%