Bowtie2 Alignment with low percentage of properly paired reads (ATAC-seq)
0
1
Entering edit mode
2.2 years ago
zau saa ▴ 150

Hello! Recently I've downloaded ATAC-seq fastq files from ENCODE for practice. However, after I aligned reads to hg38, samtools flagstat reported that the percentage of properly paired reads is only about 48%. However, the QC information of the corresponding bam file in ENCODE shows that the percentage of properly paired reads is about 90%. I wonder where I made a mistake. Here is my process:

# Firstly, I trim reads by trimmomatic (ver. 0.39)
java -jar $trim/trimmomatic-0.39.jar PE -threads 10 R1.fastq.gz R2.fastq.gz \ 
R1_trim.fastq.gz R1_trim_unpaired.fastq.gz R2_trim.fastq.gz R2_trim_unpaired.fastq.gz \
ILLUMINACLIP:$trim/adapters/NexteraPE-PE.fa:2:30:10:5:True LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30 HEADCROP:13 

# Secondly, I use bowtie2 to align reads to hg38
bowtie2 --very-sensitive -p 8 -X 2000 -x $index -1 R1_trim.fastq.gz -2 R2_trim.fastq.gz | \
samtools view -Sb | samtools sort -l 4 -o $align/sort.bam

# Thirdly, I use samtools flagstat for statistics
samtools flagstat sort.bam

Results:

82500300 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
77243188 + 0 mapped (93.63% : N/A) *
82500300 + 0 paired in sequencing
41250150 + 0 read1
41250150 + 0 read2
39107164 + 0 properly paired (47.40% : N/A) *
77047062 + 0 with itself and mate mapped
196126 + 0 singletons (0.24% : N/A)
3938892 + 0 with mate mapped to a different chr
766863 + 0 with mate mapped to a different chr (mapQ>=5)
trimmomatic Bowtie2 ENCODE ATAC-seq • 2.3k views
ADD COMMENT
0
Entering edit mode

Using --very-sensitive in bowtie2 could be the reason? Did they also use that in their pipeline? Because otherwise, the default is --sensitive.

ADD REPLY
0
Entering edit mode

I think it's not the reason. The pipeline I refer to use this parameter.

ADD REPLY
0
Entering edit mode

I would remove mitochondrial reads first and then check how stats are. In older protocols mt contamination in ATAC-seq was quite a thing, sometimes accounting for > 80% of all reads, but mapping stats of the mt genome are largely irrelevant.

ADD REPLY
0
Entering edit mode

I remove mitochondrial reads by removeChrom and PCR duplicated reads by picard . however, properly paired rate remains low.

# remove PCR duplicated reads by picard
picard MarkDuplicates -I sort.bam -O sort_rmdup.bam -M dups.txt --REMOVE_DUPLICATES

# remove mitochondrial reads by removeChrom.py 
samtools view -h sort_rmdup.bam | python3 removeChrom.py - - chrM | \
samtools view -Sb --threads 5 > sort_rmdup_chrM.bam

result:

63913762 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
52104631 + 0 mapped (81.52% : N/A)*
63913762 + 0 paired in sequencing
31944520 + 0 read1
31969242 + 0 read2
29772366 + 0 properly paired (46.58% : N/A)*
51816674 + 0 with itself and mate mapped
287957 + 0 singletons (0.45% : N/A)
1713866 + 0 with mate mapped to a different chr
293512 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLY
0
Entering edit mode

Then I don't not know, sorry. I quickly checked some of our data that I could immediately access (mouse) and they usually ranged in the ~95% mapped and 80-90% properly-paired range with the --very-sensitive flag. Mostly hhematopoietic and bone marrow cells.

Likely you did not do any mistake. It is always tricky to use public data. What cell type is that?

ADD REPLY
1
Entering edit mode

I finally solved the problem! Following a tutorial with a 95% properly paired rate(mouse), I used trim_galore to trim adapters. However, differing from the process I aforementioned, I don't cut bases in front of the reads. Then, the properly paired rate of the alignment increases to 89%, which is similar to the bam given by ENCODE. Since the performance of trim_galore and trimmomatic is comparable, the key to the low properly paired rate should be that I cut 13 bases in front of the reads. I can't understand why the hardtrim process leads to a low properly paired rate, though.

ADD REPLY
0
Entering edit mode

It's naive B primary cell. https://www.encodeproject.org/experiments/ENCSR685OFR/ It's strange that the report of its unflitered bam shows that the percentage of properly paired reads passing QC is 89.3%

ADD REPLY
0
Entering edit mode

Anyway, thanks for your reply!

ADD REPLY

Login before adding your answer.

Traffic: 1822 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6