Hi,
I'm analyzing human cancer cell line ATAC-seq data (PE 75bp Illumina) and have encountered a weird insert size distribution. For ATAC-seq PE data, insert size distribution is a good quality control of how well the experiment worked (This is how it would ideally look https://dbrg77.wordpress.com/2017/02/10/atac-seq-insert-size-plotting/). I have used several mapping strategies and always come to a similar distribution plot.
Sequencing is good quality with 50-60 million reads per sample with min. 3 replicates and mapping >80% for all samples. > 95% of forward and reverse reads are paired.
Analysis
java -jar $tool_path/trimmomatic-0.36.jar PE -phred33 -threads 20 $forward $reverse $forward_1_paired.fastq $forward_unpaired.fastq $reverse_paired.fastq $reverse_unpaired.fastq ILLUMINACLIP:$tool_path/adapters/TruSeq3-PE.fa:2:30:10 LEADING:10 TRAILING:10 MINLEN:50
bowtie2 --threads 20 --very-sensitive -x $bowtie2index -1 $forward_paired -2 $reverse_paired -S $filename_aligned.sam
Plotting
samtools view ATAC_f2q30_sorted.bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > fragment_length_count.txt
Alternative way of plotting
java -Xmx20g -jar picard.jar CollectInsertSizeMetrics I=$file O=insert_size_metrics.txt H=insert_size_histogram.pdf
Insert size distribution *In the plot it should be written insert length, not fragment length.
Does anyone have an idea what went wrong here? Could this data be still useful to analysis? I was thinking to extract PE reads with insert size between 0-300nt and proceed only with these reads for downstream analysis?
Please share your code for adapter trimming, mapping and generating this figure. Then we'll see. Still, this looks suspicious and I've never seen it in our ATAC-seq data, be it mouse or human.
I've added the information in the description.
The only odd thing I see is that you trimmed for the TruSeq instead of the Nextera adapter. Sequence to trim on both ends is
CTGTCTCTTATACACATCT
. This makes a difference when calculating insert sizes, but typically the nucleosomal pattern should still be visible. How did the gel or Bioanalyzer picture look prior to sequencing (hope you did that quality control)?Hi , There is few question to ask ; -is your sequencing is good ? ( high coverage of you insertion ? ) -is your model organism got a "complete" genome ref ? ( correct assembly? high concentration of TE/ high genome dynamic ? )
I've added the information in the description.
So you are with data with hi heterogeneity/ high rearrangement because of cancer cells , did you check with for example IGV the coverage of one insertion ?
I do not think that the cell-of-origin is the reason here. We've done ATAC-seq in different cancer cell lines, and the pattern was always clear and distinct. ATAC-seq still looks at the genome/regulome, and single rearrangements IMHO do not influence the overall picture.
Ok i taught it could explain the differences ...
Was the plot from your custom code or the data from Picard? What does the output of
bamPEFragmentSize
from deepTools look like? What does the output ofsamtools idxstats
look like? Are you sure this is actually your sample and not someone else's (it happens, though rarely)?