I use deepTools regularly in my pipelines. For ATAC-seq data I noticed a potential downside to the bamCoverage
function. My sequence read length is 2x42 and alignment paired end. With the assumption that deepTools bamCoverage would align read fragments according to mated pairs, I found a small detail that may conflict with ATAC-seq datasets. Read length distribution follows a sub-, mono-, di-nucleosomal pattern (below).
The problem occurs in that deepTools default setting for paired-end is to align read mates < 4X read length which would cut my library in half and these reads will be considered single end. I think this results in a "choppy" feature on the genome browser and ultimately is incorrect (below). Moving forward I will use the -e
option to explicitly inform the command to extend all reads. Below is a genome browser track showing the differences in this option.
Here is some information about the commands:
- Insert Size Distribution:
samtools view [].bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > out.txt
- default-setting:
bamCoverage -b [].bam -o [].bw --normalizeUsing RPGC --effectiveGenomeSize 2913022398 -bs 10 -p max
- extendReads:
bamCoverage -b [].bam -o [].bw --normalizeUsing RPGC --effectiveGenomeSize 2913022398 -bs 10 -e -p max
How to add images to a Biostars post
Please confirm that images are posted correctly.
Yep, fine now, thank you.
With paired end data using the
-e
option is pretty much always recommended. An exception to this might actually be ATAC-seq, where only the ends of fragments are guaranteed to represent accessible regions.I also have no idea what "align read mates > 4X read length which would cut my library in half and these reads will be considered single end" means.
I do understand that
-e
is recommended."Paired-end: Reads with mates are always extended to match the fragment size defined by the two read mates. Unmated reads, mate reads that map too far apart (>4x fragment length) or even map to different chromosomes are treated like single-end reads. "
I'm saying that most of my ATAC data falls into the "mate reads that map too far apart (>4x fragment length)". I'm assuming this means read length and not fragment length because why would a fragment in an alignment file map to something 4 times its own length. If my understanding is incorrect please help me explain the difference in the genome browser plots I have posted. Why is it that
-e
changes the output?-e
creates coverage over the entire fragment length, not just the read length and that is why you get a "less detailed" view of the region. Without-e
you see in higher resolution where the actual cutting sites are located while with-e
the picture is more smooth but less detailed. In fact for ATAC-seq I never use-e
for exactly that reason since I am interested in the cutting site itself. If you ever want to plot footprints you will probably need to go without-e
and use-bs 1
plus limit the read length to 1 (shifted by 4/5 depending on strand, please see original ATAC-seq paper for details) to get maximum resolution.Realistically the shifting isn't needed and the 6-7 bases on each end of a fragment represent accessible bases. The original literature on Tn5 suggests that this can probably be expanded to the first 15 or so bases, since the transposase needs about that much free space to be active.
True, for anything beyond footprinting analysis is probably does not matter at all, I just mentioned for matters of completeness. Not sure if you even need it for FP, not really my expertise.
Thank you for bringing this up. Certainly Tn5 occupies space on the genome which we want to somehow incorporate into our analysis. But how much space? My reference highlights a 25/30 mer sequence insertional bias for the Tn5 on the bacterial genome Mobile DNA. The original publication that came out a year later shows insertional bias around that same length Nat Methods (Figure S4). Below I can show you this insertional bias present in my own data which is nearly identical to the original publication and another publication highlighting sequence bias and alternative Tn5 mutants that result in less bias BMC Biotechnology.
So my interpretation is that if you want to incorporate cut sites into analysis the optimal setting is a 30bp fragment. From the same lab as the original article, analysis of TCGA ATAC data was done with 150bp cut sites Science 2018 ( Supplemental Materials , Page 6, bottom paragraph). I have incorporated this analysis into a pipeline I developed in the lab, started a collaboration with 2 labs on results generated and we are wrapping up a publication now. I do not wish to portray this analysis strategy as an ineffective tool to generate hypotheses as I have done this dozens of times with ATAC data.
I would recommend that if you are only interested in cut sites from ATAC data, excluding the
-e
option is an incomplete solution. I'm assuming that the interpretation of a cut site is actually the end of the fragments that are aligned in the paired end file. I can reason that to what degree (how many bps) we need to consider Tn5 occupancy may have different opinions. Nonetheless, a very "relaxed" occupancy published by the original lab (see Science 2018 above) uses MACS2 with--shift -75 --extsize 150 --nomodel
to analyze. I'm sure you know but visualization tracks can be easily created with-B
. But why is-e
in with bamCoverage incomplete? Please help me figure this out because I've tested it. Below shows how signal is vastly different even when analyzing the modal distributions separately. I have included 40-140bp and 140-250 bp as these are the first two modes, clearly depicted in the original post (above). And thank you for the recommendation on binSize, I have updated that in the command.Commands:
--minFragmentLength 40 --maxFragmentLength 110 -bs 1 -e
--minFragmentLength 140 --maxFragmentLength 250 -bs 1 -e
Does anyone have an explanation as to why
-e
changes the output? If paired-end reads are always aligned by fragment, unless the fragment is 4X the average fragment length of the library, why is this output different? To me it looks like the range of 40-110 bp shows no differences in bigwig profile but there are remarkable differences in the 140-250 bp fragment bigwig. And if one is to analyze cut sites alone, you can see my point that excluding-e
is incomplete because it makes no differences in the 40-110 bp fragments. So doesn't that mean that you are excluding cut sites of the subnucloesomal fraction of reads?With
-e
specified the entire fragment is used rather than just the ends. This is NEVER the case unless you explicitly specify-e
(resulting in the more smoothed signals you're seeing). If you want the cut sites then don't use-e
and use the--Offset
option.Only fragments more than 4x the median length of your file will get treated as singletons, so almost none of your data falls into this category.
I see now that in the later text this is stated. But this still begs the question as to why the outputs are different. If you say that nearly all of my library will be analyzed as paired end fragments then why does
-e
make a difference? And do you have any alternatives to analyze cut sites? If you say an exception may be ATAC data thenbamCoverage
doesn't have options to create visualizations tracks of all the cut sites of ATAC-seq reads.Also I should have directed my > to <
I have updated the post*