ATAC-seq data and deepTools: Small detail with impact on output
0
1
Entering edit mode
4.8 years ago
gable_works ▴ 50

I use deepTools regularly in my pipelines. For ATAC-seq data I noticed a potential downside to the bamCoverage function. My sequence read length is 2x42 and alignment paired end. With the assumption that deepTools bamCoverage would align read fragments according to mated pairs, I found a small detail that may conflict with ATAC-seq datasets. Read length distribution follows a sub-, mono-, di-nucleosomal pattern (below). Insert Size Distribution

The problem occurs in that deepTools default setting for paired-end is to align read mates < 4X read length which would cut my library in half and these reads will be considered single end. I think this results in a "choppy" feature on the genome browser and ultimately is incorrect (below). Moving forward I will use the -e option to explicitly inform the command to extend all reads. Below is a genome browser track showing the differences in this option.

extendReads

Here is some information about the commands:

  • Insert Size Distribution: samtools view [].bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > out.txt
  • default-setting: bamCoverage -b [].bam -o [].bw --normalizeUsing RPGC --effectiveGenomeSize 2913022398 -bs 10 -p max
  • extendReads: bamCoverage -b [].bam -o [].bw --normalizeUsing RPGC --effectiveGenomeSize 2913022398 -bs 10 -e -p max
bigwig read-length deeptools ATAC-seq • 5.5k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
1
Entering edit mode

Please confirm that images are posted correctly.

ADD REPLY
1
Entering edit mode

Yep, fine now, thank you.

ADD REPLY
1
Entering edit mode

With paired end data using the -e option is pretty much always recommended. An exception to this might actually be ATAC-seq, where only the ends of fragments are guaranteed to represent accessible regions.

I also have no idea what "align read mates > 4X read length which would cut my library in half and these reads will be considered single end" means.

ADD REPLY
0
Entering edit mode

I do understand that -e is recommended.

"Paired-end: Reads with mates are always extended to match the fragment size defined by the two read mates. Unmated reads, mate reads that map too far apart (>4x fragment length) or even map to different chromosomes are treated like single-end reads. "

I'm saying that most of my ATAC data falls into the "mate reads that map too far apart (>4x fragment length)". I'm assuming this means read length and not fragment length because why would a fragment in an alignment file map to something 4 times its own length. If my understanding is incorrect please help me explain the difference in the genome browser plots I have posted. Why is it that -e changes the output?

ADD REPLY
1
Entering edit mode

-e creates coverage over the entire fragment length, not just the read length and that is why you get a "less detailed" view of the region. Without -e you see in higher resolution where the actual cutting sites are located while with -e the picture is more smooth but less detailed. In fact for ATAC-seq I never use -e for exactly that reason since I am interested in the cutting site itself. If you ever want to plot footprints you will probably need to go without -e and use -bs 1 plus limit the read length to 1 (shifted by 4/5 depending on strand, please see original ATAC-seq paper for details) to get maximum resolution.

ADD REPLY
1
Entering edit mode

Realistically the shifting isn't needed and the 6-7 bases on each end of a fragment represent accessible bases. The original literature on Tn5 suggests that this can probably be expanded to the first 15 or so bases, since the transposase needs about that much free space to be active.

ADD REPLY
1
Entering edit mode

True, for anything beyond footprinting analysis is probably does not matter at all, I just mentioned for matters of completeness. Not sure if you even need it for FP, not really my expertise.

ADD REPLY
0
Entering edit mode

Thank you for bringing this up. Certainly Tn5 occupies space on the genome which we want to somehow incorporate into our analysis. But how much space? My reference highlights a 25/30 mer sequence insertional bias for the Tn5 on the bacterial genome Mobile DNA. The original publication that came out a year later shows insertional bias around that same length Nat Methods (Figure S4). Below I can show you this insertional bias present in my own data which is nearly identical to the original publication and another publication highlighting sequence bias and alternative Tn5 mutants that result in less bias BMC Biotechnology.FastQC

So my interpretation is that if you want to incorporate cut sites into analysis the optimal setting is a 30bp fragment. From the same lab as the original article, analysis of TCGA ATAC data was done with 150bp cut sites Science 2018 ( Supplemental Materials , Page 6, bottom paragraph). I have incorporated this analysis into a pipeline I developed in the lab, started a collaboration with 2 labs on results generated and we are wrapping up a publication now. I do not wish to portray this analysis strategy as an ineffective tool to generate hypotheses as I have done this dozens of times with ATAC data.

ADD REPLY
0
Entering edit mode

I would recommend that if you are only interested in cut sites from ATAC data, excluding the -e option is an incomplete solution. I'm assuming that the interpretation of a cut site is actually the end of the fragments that are aligned in the paired end file. I can reason that to what degree (how many bps) we need to consider Tn5 occupancy may have different opinions. Nonetheless, a very "relaxed" occupancy published by the original lab (see Science 2018 above) uses MACS2 with --shift -75 --extsize 150 --nomodel to analyze. I'm sure you know but visualization tracks can be easily created with -B. But why is -e in with bamCoverage incomplete? Please help me figure this out because I've tested it. Below shows how signal is vastly different even when analyzing the modal distributions separately. I have included 40-140bp and 140-250 bp as these are the first two modes, clearly depicted in the original post (above). And thank you for the recommendation on binSize, I have updated that in the command. Fragment Size Separation

Commands: --minFragmentLength 40 --maxFragmentLength 110 -bs 1 -e --minFragmentLength 140 --maxFragmentLength 250 -bs 1 -e

Does anyone have an explanation as to why -e changes the output? If paired-end reads are always aligned by fragment, unless the fragment is 4X the average fragment length of the library, why is this output different? To me it looks like the range of 40-110 bp shows no differences in bigwig profile but there are remarkable differences in the 140-250 bp fragment bigwig. And if one is to analyze cut sites alone, you can see my point that excluding -e is incomplete because it makes no differences in the 40-110 bp fragments. So doesn't that mean that you are excluding cut sites of the subnucloesomal fraction of reads?

ADD REPLY
1
Entering edit mode

With -e specified the entire fragment is used rather than just the ends. This is NEVER the case unless you explicitly specify -e (resulting in the more smoothed signals you're seeing). If you want the cut sites then don't use -e and use the --Offset option.

ADD REPLY
1
Entering edit mode

Only fragments more than 4x the median length of your file will get treated as singletons, so almost none of your data falls into this category.

ADD REPLY
0
Entering edit mode

I see now that in the later text this is stated. But this still begs the question as to why the outputs are different. If you say that nearly all of my library will be analyzed as paired end fragments then why does -e make a difference? And do you have any alternatives to analyze cut sites? If you say an exception may be ATAC data then bamCoverage doesn't have options to create visualizations tracks of all the cut sites of ATAC-seq reads.

ADD REPLY
0
Entering edit mode

Also I should have directed my > to <

I have updated the post*

ADD REPLY

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6