Hello!
I have recently started using deepTools for my ATAC-seq data analysis and would like some feedback on converting BAM files to bigwig coverage tracks. I am interested in looking at TSS metaprofiles using either all fragments or only fragments of certain sizes.
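Downstream of the tracks below, I am planning to build the TSS metaprofiles roughly like this (just a sketch; tss.bed and out.bw are placeholders for my TSS coordinates and whichever bigwig I am looking at):
computeMatrix reference-point --referencePoint TSS -S out.bw -R tss.bed --upstream 1000 --downstream 1000 --binSize 10 --outFileName tss_matrix.gz
plotProfile --matrixFile tss_matrix.gz --outFileName tss_profile.png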
For an all-inclusive coverage:
bamCoverage --bam in.bam --outFileName out.bw --binSize 1 --normalizeUsing RPGC --effectiveGenomeSize x
Am I correct in normalizing to 1x coverage? From what I have seen in the literature, this is what most people do with ATAC-seq data. I am also uncertain about the effective genome size, which depends on the read length. I have trimmed the Tn5 adapter sequence prior to alignment, so I now have varying read lengths.
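In case it matters, this is how I was thinking of estimating the effective genome size myself, since the precomputed values in the deepTools documentation mostly assume a fixed read length (a sketch; genome.fa is a placeholder, and I am assuming UCSC faCount and khmer's unique-kmers.py are available):
faCount genome.fa | awk 'END {print $2 - $7}'   # non-N bases as an upper bound (last line of faCount is the total row)
unique-kmers.py -k 50 genome.fa   # unique k-mers at roughly the median read length, e.g. k=50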
For nucleosome-free regions:
bamCoverage --bam in.bam --outFileName out.bw --binSize 1 --normalizeUsing RPGC --effectiveGenomeSize x --skipNonCoveredRegions --extendReads --maxFragmentLength 120
I have seen that some people also use --minFragmentLength 40, but it doesn't seem to have much of an effect when I look at the metaprofiles or in the genome browser. What is your experience? Also, what are your thoughts on skipping non-covered regions and on extending reads to full fragments? The latter in particular confuses me. Does the extension mean that the entire fragment is counted? If I don't specify this option, is only the 5' read end counted? I don't quite understand which part of the read the tool actually counts. Shouldn't each read be counted only once rather than along its whole length?
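For what it's worth, this is roughly how I check the fragment-length distribution before picking such cutoffs (a sketch; in.bam is a placeholder):
bamPEFragmentSize --bamfiles in.bam --histogram fragment_sizes.png --numberOfProcessors 8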
For mononucleosome fragments:
bamCoverage --bam in.bam --outFileName out.bw --binSize 1 --normalizeUsing RPGC --effectiveGenomeSize x --MNase
In this specific case, people seem not to skip non-covered regions; why is that? In my understanding, the above command creates a track in which only the central 3 bp of fragments between 130 and 200 bp are counted. This should reveal the positions of the nucleosomes, correct? What about focusing on Tn5 cut-sites instead? Would I need to use --Offset 1? How exactly?
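This is the kind of command chain I had in mind for a cut-site track, but I'm not sure it is right (a sketch; file names are placeholders, and the +4/-5 shift via alignmentSieve --ATACshift is just my assumption of the usual ATAC-seq convention):
alignmentSieve --ATACshift --bam in.bam --outFile shifted.bam --numberOfProcessors 8
samtools sort -o shifted.sorted.bam shifted.bam
samtools index shifted.sorted.bam
bamCoverage --bam shifted.sorted.bam --outFileName cutsites.bw --binSize 1 --normalizeUsing RPGC --effectiveGenomeSize x --Offset 1   # count only the first base of each read, i.e. the cut site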
These are quite a lot of questions, and it's hard for me to judge what is commonly used for the specific purposes mentioned above. Thanks to anyone who has taken the time to read this; I appreciate any input and help!
Thanks a lot for your help! I will try this normalization approach then. I've also seen people downsample their BAM files to the read count of the smallest sample. What is your take on that?
Speaking of normalization and varying library sizes, I have a more general question: when pooling different sequencing libraries with different adapters, it frequently happens that one sample gets many more reads than the others. Why is that? Is this simply biological variability among replicates, or does some technical effect from the sequencing platform contribute?
That is effectively the same as per-million (1x) normalization and suffers from the same problems mentioned above.
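For illustration (not something I would recommend over the built-in normalization), downsampling usually just means keeping a fixed fraction of reads per library, which is the same global depth scaling the normalization already applies. For example with samtools, assuming hypothetical depths of 20 M reads in the smallest library and 50 M in this one:
samtools view -b -s 42.4 in.bam > downsampled.bam   # 42 is the random seed, .4 = 20/50 is the fraction kept
samtools index downsampled.bam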
That happens when the libraries are not loaded onto the sequencer at equal molar concentrations. Proper quantification of each library before making the final sequencing pool is necessary to avoid it. It has nothing to do with biological effects; it is a purely technical shortcoming.
Ah, good to know. Thanks a lot!