Hi all,
I recently ran some ATAC-seq samples and was expecting at least 50 million paired reads for each. Most of them have enough reads; in fact, they are reaching 100 million (21 samples in 1 lane on a NovaSeq X Plus using a 10B flow cell, 20% PhiX). However, some have significantly fewer reads (50 to 70 million). I am not sure if I should resequence them to top up the reads to around 100 million.
My question is, when deciding which ones I need to resequence to top up for more data, should I: 1) just rerun anything with less than 100 million paired reads? 2) rerun anything significantly lower than 100 million paired reads (e.g. <90 million reads)? 3) do the alignment and look at the quality first (TSS enrichment value, % duplication, % mitochondrial reads) before deciding anything?
And if I don't have to resequence it, would normalisation to the library size suffice for a reliable differential analysis?
Sorry if it's a very routine practice. I am very new to ATAC-Seq.
In my opinion and experience there is no point sequencing ATAC-seq that deep. We typically go for 30M reads. At 50-100M you mainly start picking up duplicate reads. I would even consider subsampling the large datasets to the depth of the smallest one and doing the analysis on that. Where did you get the recommendation to go that deep? I would definitely not resequence anything here.
Hi ATpoint, thanks for the reply. Going up to 100 million was mainly a miscommunication with the sequencing company. I was under the impression that we would get 50 million reads per sample. Plus there was a flaw in our experimental design: we only had 24 indices, so I had to split the samples into four lanes. We were aiming for 50 million paired reads. ENCODE recommends 50 million non-duplicate, non-mitochondrial reads for paired-end. This paper here suggests 50 million reads for open chromatin differential analysis and 200 million reads for TF footprinting.
I will check the duplication rate and downsample the reads if necessary. Thanks for the suggestion.
I see. Yeah, ENCODE (as usual) is not good at communication. They recommend 25mio reads, but 50mio paired-end reads; the accepted lingo (to me at least) is that one does not count R1 and R2 separately, so what they actually mean is 25mio read pairs.
These recommendations of 200mio reads are completely out of this world to me, nobody does that. Plus, de novo footprinting is something you rarely see in practice, and in most cases it does not reveal anything that a normal motif analysis would not.
Anyway, my recommendation to you is to keep the data as-is, align, remove duplicates and then see how much is left per sample. If it's grossly different and maybe even PCA indicates that depth is a driver of variation, then subsample to the sample with lowest depth. Definitely don't do more sequencing, it's a waste of money, at least I have yet to see an ATAC-seq analysis that needs north of 100mio reads.
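To make the "subsample to the sample with lowest depth" step concrete, here is a minimal Python sketch of the arithmetic. The sample names and read-pair counts are made up for illustration, and the function name is hypothetical; the counts are meant to be whatever is left per sample after alignment and duplicate removal.

```python
def subsample_fractions(pairs_after_dedup):
    """Return, per sample, the fraction of read pairs to keep so that
    every sample ends up at the depth of the shallowest sample."""
    if not pairs_after_dedup:
        raise ValueError("no samples given")
    target = min(pairs_after_dedup.values())
    if target <= 0:
        raise ValueError("every sample needs a positive read-pair count")
    return {name: target / n for name, n in pairs_after_dedup.items()}


# Hypothetical post-deduplication read-pair counts (not real data).
counts = {
    "sample_A": 48_000_000,
    "sample_B": 30_000_000,
    "sample_C": 60_000_000,
}

fractions = subsample_fractions(counts)
for name, frac in sorted(fractions.items()):
    print(f"{name}: keep {frac:.3f} of read pairs")
```

The resulting fraction per sample is what you would hand to whichever random subsampling tool you use. The shallowest sample gets a fraction of 1.0 and is left untouched.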
Thanks for the clarification. ENCODE was fine. It is just the sequencing company that we had a communication problem with. We were aiming for 50 million read pairs and ended up getting 100 million read pairs. I might have to do dual indexing in the future. The protocol was in place when I first joined the group so I just assumed it would work. I probably should have read through it more thoroughly.
Thanks for your help!
ATpoint, sorry to tag you in after almost a month, but I have a question about downsampling. I am using nf-core and can put in a module to downsample the data, but I am not sure at which step to do it, or whether I should pick what to remove. Should I run FastQC first and then downsample, or the other way round? And I assume it would be better to sample the reads randomly instead of removing only duplicated reads, because the latter might introduce bias and skew the cohort? Thank you very much for all the help you have provided, and I appreciate any help in the future.
If you must have 100M reads then you will need to resequence. You can do that by repooling only those libraries that need more data.
This "issue" came up earlier today in the context of RNA-seq and I will link my answer as a guide: Paired-end sequencing with inconsistent quality/number of reads
I don't specifically know about ATAC-seq and it may be fine to analyze the data as is. But if you want to, you can down-sample reads from the other samples that have a lot more (do it randomly using a program like reformat.sh from the BBMap suite) so that you end up with equivalent read amounts. They don't have to be identical.
Thanks GenoMax, I think that would be better than resequencing them to match the library size. I will have a look into BBMap.
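As a rough illustration of what "down-sample randomly" means for paired-end data, here is a small Python sketch. It is not how reformat.sh or any nf-core module is implemented, and the function names are hypothetical. The point it shows is that each read pair is kept or dropped as a unit, with a fixed seed, so R1 and R2 stay in sync and the selection does not depend on whether a read is a duplicate. It assumes plain four-line FASTQ records with R1 and R2 in the same order.

```python
import io
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-tuples (header, sequence, plus, quality)."""
    while True:
        record = tuple(line.rstrip("\n") for line in islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_handle, r2_handle, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    One random draw is made per pair, so both mates are always kept
    or dropped together. Assumes both files list pairs in the same order.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        if rng.random() < fraction:
            yield rec1, rec2


def make_fastq(mate, n):
    """Build a toy FASTQ string with n identical reads (illustration only)."""
    return "".join(f"@read{i}/{mate}\nACGT\n+\nIIII\n" for i in range(n))


kept = list(
    subsample_pairs(
        io.StringIO(make_fastq(1, 1000)),
        io.StringIO(make_fastq(2, 1000)),
        fraction=0.5,
        seed=42,
    )
)
print(len(kept), "of 1000 pairs kept")
```

Because every pair is kept with the same probability, the final count is only approximately `fraction` times the input, which matches the advice that the depths do not have to be identical. For real data a dedicated tool will be far faster than this sketch.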