The reason for that is to get rid of the Nextera/transposase adapter sequences. In the ATAC-seq experimental workflow, you use a transposase that integrates Illumina adapters into regions of open chromatin while simultaneously fragmenting those regions. The result is a pool of fragments from these open regions, flanked by adapters that can be targeted by suitable PCR primers to make the regions ready for sequencing. The workflow creates fragments of different lengths; see Figures 1 and 2 of the original paper.
If you then sequence your samples with, let's say, 75 bp NextSeq cycles, but your actual insert (the sequence between the adapters) is only 60 bp long, then you also read 15 bp of adapter, which will lead to false alignment results.
=> Therefore, you must get rid of any adapter content in your fastq files prior to aligning. The actual Nextera adapter sequence (if using the standard Illumina Nextera sample prep kit) is CTGTCTCTTATA on both strands. You can use tools such as Cutadapt (Trim Galore) or Skewer.
I personally use Skewer with these options to trim the Nextera transposase sequences:
skewer -x CTGTCTCTTATA -y CTGTCTCTTATA -m pe -q 20
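A complete paired-end invocation would look something like this (the fastq names are just placeholders for your own files; skewer should handle gzipped input directly):

skewer -x CTGTCTCTTATA -y CTGTCTCTTATA -m pe -q 20 sample_R1.fastq.gz sample_R2.fastq.gz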
I think they used 30 bp because 30 is probably the minimal fragment length that the assay created in their hands, so it is the read length that is guaranteed not to contain adapter contamination. Still, I would use adapter-trimming software instead of fixed trimming sizes (which is what pretty much all publications have done so far).
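(For completeness: if you did want to reproduce a fixed 30 bp crop, here is an untested sketch with Cutadapt, run per mate to keep the paired-end semantics simple; the filenames are placeholders:

cutadapt -l 30 -o trimmed_R1.fastq.gz sample_R1.fastq.gz
cutadapt -l 30 -o trimmed_R2.fastq.gz sample_R2.fastq.gz

But again, proper adapter trimming is the better option.)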
I don't think I've seen any other ATAC-seq papers where they do that, so it doesn't seem to be common practice. I'd be curious to know why ENCODE does it.
Yeah, it seems a little weird to me as well, but ENCODE is supposed to be the gold standard, so I'm a little confused. Our ATAC project has resulted in a low alignment rate (56%), so we were going to try their pipeline, but I can't rationalize changing something that drastic without a legitimate reason.
ENCODE is more like a tin standard...
Have you tried to see what is in the 44% of reads that are not aligning (and why)?
Yes, we have tried that a bit, but with little to show for it. To give some background on where we've gotten so far:
Any other suggestions? Thanks!
I've had good luck using STAR in cases like this. It has the nice property of reporting why it doesn't align a good chunk of the reads, so you can get an idea of whether you need to allow more soft-clipping. I think this is preferable, since you end up tossing less of the sequencing.
Thanks for your reply, we tried that but didn't notice anything too important. We checked the quality scores of all samples and they are good. Here is the STAR output; do you see anything else that would suggest a problem? https://postimg.org/image/plsm95551/ I've only recently started using STAR, so I'm still trying to fully understand the output.
Right off the bat, I noticed reads don't align because of being 'too short', and after consulting Alex, the developer, it seemed it might be a quality score issue, but that wasn't the case. Here is my exchange with Alex:
The "too short" thing can be remedied by setting
--outFilterMatchNminOverLread
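For ATAC-style (unspliced) data, a STAR invocation with that filter relaxed might look something like this. This is a sketch under my assumptions, not a vetted recipe: the paths are placeholders, 0.4 is only an example value (the default is 0.66), and you may want to relax --outFilterScoreMinOverLread in tandem:

STAR --runThreadN 8 \
     --genomeDir star_index \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --alignIntronMax 1 \
     --outFilterMatchNminOverLread 0.4 \
     --outFilterScoreMinOverLread 0.4 \
     --outSAMtype BAM SortedByCoordinate

--alignIntronMax 1 essentially turns off spliced alignment, which is what you want for genomic DNA.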
So, my default with bowtie2 was the sensitive setting; I then changed it to sensitive-local mapping, which relaxes how reads are allowed to map. From the looks of it, --outFilterMatchNminOverLread is similar to this. While I got an increased alignment percentage, when I reran MACS2 I ended up with significantly fewer peaks across all samples, leading me to believe that relaxing the parameter introduced more noise.
I think you responded to my inquiry, @DevonRyan, when I originally asked about the bowtie parameter. Do you think it would be worth rerunning STAR to try it?
Further, my worry going forward with STAR, beyond the alignment stats, is that STAR is designed for RNA-seq, so I'm assuming some of its design choices are going to make it suboptimal in theory for ATAC-seq. An example being the varying fragment lengths that ATAC libraries create.
If you are running out of options I suggest trying BBMap. You may be surprised at what it can do. It is easy to use and fast to boot. Brian Bushnell (author) actively participates here and would be a good resource.
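A minimal run looks something like this (a sketch; the reference and fastq names are placeholders, and BBMap builds its index on the fly from ref=):

bbmap.sh ref=canine_genome.fa in=sample_R1.fastq.gz in2=sample_R2.fastq.gz out=mapped.sam outu=unmapped.fastq

The outu= stream conveniently captures the unmapped reads, which would also help with the diagnosis discussed above.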
Interesting, thank you, I will look into it. Our worry at the moment isn't necessarily fixing the problem so much as understanding why the alignment rate is low. The thing is that, even with the ~50% alignment rate, we are still getting good peaks, the majority of which map within or near a TSS as you'd expect, and the data actually looks good.
Our main problem right now is deciding what is worth doing and spending time researching, versus doing more analysis with the data we have.
Once again, I have to thank you guys for actively responding; I appreciate all the questions you've answered for me.
In order to diagnose that 50% alignment rate properly you do need to look at the pile of reads that is not aligning.
You still have not said anything about what the 50% of reads that are not aligning actually contain (unless I missed that in this already long thread). Have you collected those reads and BLASTed a few to see what comes up? You want to be sure that they are not weird chimeras, etc.
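If it helps, collecting them is quick with samtools (a sketch; the bam name is a placeholder, and for paired data you may prefer the -1/-2 outputs of samtools fastq):

samtools fastq -f 4 aligned.bam > unaligned.fastq    # -f 4 keeps only unmapped reads
head -n 400 unaligned.fastq | seqtk seq -A - > unaligned_subset.fa    # ~100 reads as fasta for blastn

Then blast unaligned_subset.fa and see what comes back.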
Oh yeah, sorry, forgot to add that. I did try that: of the first thirty sequences that didn't align, BLASTed against the canine genome, 12/30 hits mapped to one single area of the genome.
I see, so perhaps these reads are being excluded because they are multi-mapping? That would be a logical guess.
Yeah, that's our guess. When I reduced the stringency of the mapping parameters in bowtie2 by doing a local alignment, I got an increased alignment percentage, but when I reran MACS2 I ended up with significantly fewer peaks across all samples, leading me to believe that relaxing the parameters introduced more noise.
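For reference, the local-mode run was along these lines (a sketch; the index and file names are placeholders, and -X 2000 is just the common ATAC-seq convention for the maximum fragment length):

bowtie2 --sensitive-local -x canine_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -X 2000 -S aligned.sam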
STAR should work better, the varying fragment size is a non-issue.
Are you sure that's the case? The STAR developer seemed to say otherwise.
That's not him saying otherwise, just that 3 years ago he didn't personally have much experience with it.
We've seen a few ATAC-Seq libraries with small inserts (~40bp), so (just a guess) maybe ENCODE used fixed-length trimming to accommodate such libraries and avoid adapter readthrough.
Usually when people do this, it's because their underlying read quality is absolute crap. If your library prep didn't wreck your sequencing quality, then just ignore everything ENCODE did and do something that makes more sense.
Looks like @datascientist28 already tried that so ENCODE appears to be plan B.
I would try to also ask on the official ENCODE help mailing list: https://mailman.stanford.edu/mailman/listinfo/encode-help
They were very helpful the one time I posted there.
I know of another study that used reads of 25bp, but not ATAC-seq.
By using shorter reads coupled with the stringency for 'unique mapping', you're merely increasing the likelihood that your read mappings are indeed 'perfectly aligned'. As it's more difficult to uniquely align shorter reads, any reads that do make it through should genuinely uniquely map.
Using shorter reads, coupled with the subsequent QC requirement for 'uniqueness', thus helps to reduce alignment ambiguity and increase the precision of peak identification, but at the expense of read depth, which, given the output of modern sequencers, is rarely an issue.
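In practice, that 'uniqueness' requirement is usually approximated with a MAPQ cutoff after alignment, e.g. (a sketch; the threshold of 30 is a common convention rather than a universal rule, and MAPQ semantics differ between aligners):

samtools view -b -q 30 aligned.bam > unique.bam    # keep only confidently, near-uniquely mapped reads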