HI, I am working on some ATAC-seq data. We have performed paired-end sequencing with read length of 150 bp on a total of 24 samples (8 conditions in triplicates) To begin with, I will just briefly describe some of the main analysis steps.
I have trimmed the read to remove adapter contaminants and also filter out reads of low quality, using BBmap (version 38.58):
bbduk.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz out1=sample_R1_clean.fastq.gz out2=sample_R2_clean.fastq.gz ref=/home/bbmap/resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
bbmap/bbduk.sh in1=sample_R1_clean.fastq.gz in2=sample_R2_clean.fastq.gz out1=sample_R1_clean2.fastq.gz out2=sample_R2_clean2.fastq.gz qtrim=l trimq=30
Next, I mapped the reads to the human reference genome GRCh38 using Bowtie2 (version 2.3.4.1). For this, pre-built bowtie index files excluding ALT contigs were downloaded from the bowtie2 page at SourceForge. Code used:
bowtie2 -q -p 8 --very-sensitive-local -x GRCh38_noalt_as -1 sample_R1_clean2.fastq.gz -2 sample_R2_clean2.fastq.gz -S sample.sam
The sam files was further converted to bam files, sorted and indexed using samtools (version 1.9)
Performing a number of post-alignment quality control steps I could conclude that the samples contained between 35%-78% duplicates and between 27%-70% mitochondrial reads. Filtering these out together with blacklisted regions at peak calling with Genrich gave about 70k-100k peaks which on further inspection made sense.
However, now to one of my main concerns and the reason I am writing this post. For the bam files before any further filtering, when plotting fragment size distribution (using the R packages ATACseqQC and ggplot2) the graph shows a huge spike at exactly 150 bp (see image below)
First I was wondering whether this spike was due to duplicates, the mitochondrial reads or blcklisted regions. However, even after all of these have been removed the huge spike is still present, now with an even larger relative size compared to the rest of the profile (see image below)
I am hoping that someone could help me understand what is going on here and what may generate this enormous spike in the fragment size distribution. The fact that it occur at 150 bp is interesting, because that is the exact length of the sequencing reads, but I do not know whether that connection is important here.
Also, on a side note, I have tested several tools for calculating and plotting the fragment size distribution, and they gives a little different profiles, though the big spike at 150 bp is always there. However, the "fragSizeDist" function in ATACseqQC also gives a large frequency of fragments with size 0 bp (nearly as many as for the huge spike at 150 bp), which I remove before plotting.
Great catch!