Entering edit mode
2.2 years ago
zau saa
▴
150
Hello! Recently I've downloaded ATAC-seq fastq files from ENCODE for practice. However, after I aligned reads to hg38, samtools flagstat reported that the percentage of properly paired reads is only about 48%. However, the QC information of the corresponding bam file in ENCODE shows that the percentage of properly paired reads is about 90%. I wonder where I made a mistake. Here is my process:
# Firstly, I trim reads by trimmomatic (ver. 0.39)
java -jar $trim/trimmomatic-0.39.jar PE -threads 10 R1.fastq.gz R2.fastq.gz \
R1_trim.fastq.gz R1_trim_unpaired.fastq.gz R2_trim.fastq.gz R2_trim_unpaired.fastq.gz \
ILLUMINACLIP:$trim/adapters/NexteraPE-PE.fa:2:30:10:5:True LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30 HEADCROP:13
# Secondly, I use bowtie2 to align reads to hg38
bowtie2 --very-sensitive -p 8 -X 2000 -x $index -1 R1_trim.fastq.gz -2 R2_trim.fastq.gz | \
samtools view -Sb | samtools sort -l 4 -o $align/sort.bam
# Thirdly, I use samtools flagstat for statistics
samtools flagstat sort.bam
Results:
82500300 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
77243188 + 0 mapped (93.63% : N/A) *
82500300 + 0 paired in sequencing
41250150 + 0 read1
41250150 + 0 read2
39107164 + 0 properly paired (47.40% : N/A) *
77047062 + 0 with itself and mate mapped
196126 + 0 singletons (0.24% : N/A)
3938892 + 0 with mate mapped to a different chr
766863 + 0 with mate mapped to a different chr (mapQ>=5)
Using
--very-sensitive
in bowtie2 could be the reason? Did they also use that in their pipeline? Because otherwise, the default is--sensitive
.I think it's not the reason. The pipeline I refer to use this parameter.
I would remove mitochondrial reads first and then check how stats are. In older protocols mt contamination in ATAC-seq was quite a thing, sometimes accounting for > 80% of all reads, but mapping stats of the mt genome are largely irrelevant.
I remove mitochondrial reads by removeChrom and PCR duplicated reads by picard . however, properly paired rate remains low.
result:
Then I don't not know, sorry. I quickly checked some of our data that I could immediately access (mouse) and they usually ranged in the ~95% mapped and 80-90% properly-paired range with the
--very-sensitive
flag. Mostly hhematopoietic and bone marrow cells.Likely you did not do any mistake. It is always tricky to use public data. What cell type is that?
I finally solved the problem! Following a tutorial with a 95% properly paired rate(mouse), I used trim_galore to trim adapters. However, differing from the process I aforementioned, I don't cut bases in front of the reads. Then, the properly paired rate of the alignment increases to 89%, which is similar to the bam given by ENCODE. Since the performance of trim_galore and trimmomatic is comparable, the key to the low properly paired rate should be that I cut 13 bases in front of the reads. I can't understand why the hardtrim process leads to a low properly paired rate, though.
It's naive B primary cell. https://www.encodeproject.org/experiments/ENCSR685OFR/ It's strange that the report of its unflitered bam shows that the percentage of properly paired reads passing QC is 89.3%
Anyway, thanks for your reply!