Comparing ATAC-seq/ChIP-seq peaks
5.8 years ago
sckinta ▴ 730

We want to do a meta-analysis of previously analyzed ATAC-seq data from different cell types. These ATAC-seq libraries were sequenced in different batches (some libraries are 100 bp PE while others are 50 bp PE), and the deduplicated library sizes also vary a lot.

cell type   biological replicate   dedup read number   read length
cell1       rep1                   3,031,071           50 bp
cell1       rep2                   4,964,107           50 bp
cell2       rep1                   596,001             100 bp
cell2       rep2                   1,031,791           100 bp
cell3       rep1                   1,303,817           50 bp
cell3       rep2                   1,030,124           50 bp
cell4       rep1                   979,684             100 bp
cell4       rep2                   720,355             100 bp
cell5       rep1                   563,108             100 bp
cell5       rep2                   537,466             100 bp

Common sense says that more dedup reads produce more peaks (assuming ATAC-seq has a fairly consistent noise background across libraries). I want to down-sample the dedup reads before peak calling (MACS2) so that library sizes are consistent from sample to sample. However, since read length also varies among samples, making the dedup read numbers consistent is not enough: at the same read number, longer reads give higher genome coverage than shorter reads. So, when I do the downsampling, should I take read length into consideration?

For example, if I sub-sample the 100 bp libraries down to 550K dedup reads, should I sub-sample the 50 bp libraries to 1,100K reads instead?
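A minimal sketch of that length-aware logic, assuming the goal is to equalize the estimated coverage (read_length * read_number / genome_size) rather than the raw read count; the ~3.1 Gb genome size and the library names are assumptions for illustration:

```python
# Length-aware downsampling targets: equalize estimated coverage
# (read_length * read_number / genome_size) instead of raw read counts.
# GENOME_SIZE and the library entries below are assumptions for illustration.

GENOME_SIZE = 3.1e9

libraries = {
    # name: (dedup reads, read length in bp)
    "cell2_rep1_100bp": (596_001, 100),
    "cell1_rep1_50bp": (3_031_071, 50),
}

# Anchor on the hypothetical target for the 100 bp libraries: 550K reads.
target_coverage = 550_000 * 100 / GENOMEE_SIZE if False else 550_000 * 100 / GENOME_SIZE

for name, (reads, length) in libraries.items():
    target_reads = target_coverage * GENOME_SIZE / length
    fraction = min(1.0, target_reads / reads)
    print(f"{name}: keep {target_reads:,.0f} reads (fraction {fraction:.3f})")
# -> the 50 bp library would indeed be kept at ~1,100,000 reads (2x the 100 bp target)
```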

I am also open to any better solutions for this problem.

ATAC-seq ChIP-Seq • 3.0k views
5.8 years ago
ATpoint 85k

I do not think that peak calling is dramatically affected by read length. MACS2 has a BAMPE option to pile up the fragment defined by the two read mates, and for this the read length does not matter. The only bias one might introduce is a mapping bias, as longer reads map more uniquely than shorter reads. I cannot quantify how much this affects the alignment, but I would guess the effect is rather minor; 50 bp is a common read length. Keep in mind that the nucleosome-free regions in ATAC-seq (the first peak when plotting fragment length frequencies) typically produce fragments of about 80 bp, so longer reads should get trimmed anyway, further reducing the difference between the 50 bp reads and the 100 bp (effectively ~80 bp) reads. Downsampling probably makes sense, as it roughly normalizes the libraries in terms of depth.
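Not from the thread, but for concreteness, a rough sketch of what a BAMPE-mode MACS2 call could look like (wrapped in Python for consistency with the other snippets here); the BAM path, sample name and output directory are placeholders, and -g hs assumes human data:

```python
# Rough sketch of a BAMPE-mode MACS2 call, wrapped in subprocess for illustration.
# The BAM path, sample name and output directory are placeholders; -g hs assumes human data.
import subprocess

cmd = [
    "macs2", "callpeak",
    "-t", "cell1_downsampled.bam",  # hypothetical deduplicated, properly-paired BAM
    "-f", "BAMPE",                  # pile up whole fragments defined by the read pairs
    "-g", "hs",                     # human effective genome size preset
    "-n", "cell1",
    "--outdir", "macs2_cell1",
    "--keep-dup", "all",            # duplicates were already removed upstream
]
subprocess.run(cmd, check=True)
```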

Still, if this is human data, note that the read counts are very low. One typically aims for something around 25 million paired-end reads after all filtering. You'll probably still get the most significant peaks with these low read counts, but those libraries look quite undersequenced.


I agree with you that the libraries are undersequenced. I am therefore thinking of merging the reads from all biological replicates and then calling peaks; for some cell types I have more replicates than shown here. I am still worried about read length, because I have seen read-length bias in quantitative ATAC-seq analysis before: among five biological replicates from two cell types, PCA showed the samples clustering by library read length rather than by cell type. I am not sure whether the same holds for peak calling. If the number of called peaks is correlated with sequence coverage, which can be estimated as (read_length * read_number) / genome_size, why do you think peak calling will not be dramatically affected by read length?
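For what it's worth, plugging two rows of the table above into that coverage estimate (assuming a human genome of roughly 3.1 Gb) looks like this:

```python
# Quick check of the coverage estimate, using two rows of the table above.
# The ~3.1 Gb human genome size is an assumption.
GENOME_SIZE = 3.1e9

samples = {
    "cell1_rep1": (3_031_071, 50),   # dedup reads, read length
    "cell2_rep1": (596_001, 100),
}
for name, (reads, length) in samples.items():
    print(f"{name}: ~{reads * length / GENOME_SIZE:.3f}x estimated coverage")
# the ~5x gap in read counts shrinks to ~2.5x in estimated coverage once read length is included
```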


Regarding the BAMPE/BEDPE option (which is very smart!): I currently have tagAlign files (basically BED files) in which every two lines represent one read pair. I can convert that to BEDPE by taking the chromosome name and the leftmost and rightmost positions of the read pair as the fragment coordinates. Is that valid?
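A minimal sketch of that conversion, assuming the two mates of each pair really do sit on consecutive lines of the tagAlign file and share a chromosome; file names are placeholders:

```python
# Sketch of the tagAlign -> fragment conversion described above. Assumes the two
# mates of each pair sit on consecutive lines and share a chromosome; file names
# are placeholders. Output is one interval per fragment (chrom, leftmost start,
# rightmost end), which is what MACS2's -f BEDPE expects, as I understand it.

with open("sample.tagAlign") as tag, open("sample.fragments.bed", "w") as out:
    while True:
        r1 = tag.readline()
        r2 = tag.readline()
        if not r2:
            break  # end of file (or an unpaired trailing line)
        c1, s1, e1 = r1.split()[:3]
        c2, s2, e2 = r2.split()[:3]
        if c1 != c2:
            continue  # skip the rare inter-chromosomal pair, if any
        out.write(f"{c1}\t{min(int(s1), int(s2))}\t{max(int(e1), int(e2))}\n")
```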

5.8 years ago

What if you define the peaks with multiple samples, and then quantify each sample separately?

There may be other possibilities, but these are the options I can think of:

1) Pool all reads between samples

2) Pool all reads for control samples

3) Take union and/or intersection of separate sample peaks

If you use Counts-Per-Million (or Fragments-Per-Kilobase-per-Million), there may still be some difference (for ~1M versus ~0.5M reads), but you could check the sample clustering. For example, cell2 has quite different de-duplicated read counts between its replicates, so maybe you can check how well those replicates cluster. If you still have issues, perhaps filter peaks based upon something like a maximal CPM/FPKM threshold?
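As a toy illustration of that CPM filtering idea (the peak names, counts and 10 CPM cutoff below are made up, not taken from this thread):

```python
# Toy illustration of CPM filtering: peak names, counts and the 10 CPM cutoff
# are made up for the example, not taken from this thread.

raw_counts = {
    "peak_1": {"cell2_rep1": 25, "cell2_rep2": 40},
    "peak_2": {"cell2_rep1": 3,  "cell2_rep2": 5},
}
library_sizes = {"cell2_rep1": 596_001, "cell2_rep2": 1_031_791}

def cpm(count, libsize):
    return count / libsize * 1e6

# Keep peaks reaching at least 10 CPM in at least one sample.
kept = [
    peak for peak, counts in raw_counts.items()
    if any(cpm(c, library_sizes[s]) >= 10 for s, c in counts.items())
]
print(kept)  # -> ['peak_1']
```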

I am not entirely sure what the effect of PE versus SE samples is on peak calling, although (if the starting total reads were similar) I would have expected more reads to remain with PE (at least if you are using Picard and both mates are used to identify duplicates). However, that isn't what I see in your example.

Maybe you already did this, but I think at least some ATAC-seq libraries can benefit from using --local with the Bowtie2 alignment (if the alignment rates are starting out low, this may help in terms of having more reads for the analysis).


Thank you for the reply. Yes, to do the comparison I have to merge peaks anyway. For option 1), since the cell types are very different from each other, pooling reads from all samples may blur the distinction between noise and true signal across samples, so I chose to call peaks per cell type by pooling the reads from its biological replicates. However, the problem is that the read numbers differ a lot, and in general more reads mean more peaks, so cell-specific peaks will be biased by sequencing depth. I chose to down-sample the libraries to make the read numbers consistent, but I am not sure whether I should also account for the read length difference between cell types. The estimated sequence coverage is (read_length * read_number) / genome_size; if the peak number is correlated with sequence coverage, then the different read lengths across samples will be a problem.
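If it helps, one way to carry out that down-sampling on a deduplicated BAM is samtools view -s, which takes a seed.fraction argument; the file names, target read count and seed below are placeholders:

```python
# One way to do the actual down-sampling on a deduplicated BAM, using
# `samtools view -s` (seed.fraction syntax). File names, the target read
# count and the seed are placeholders.
import subprocess

def downsample_bam(in_bam, out_bam, total_reads, target_reads, seed=42):
    """Randomly subsample in_bam so that roughly target_reads reads remain."""
    fraction = target_reads / total_reads
    if fraction >= 1:
        raise ValueError("target exceeds available reads; nothing to downsample")
    # -s 42.3629 means: seed 42, keep ~36.29% of the read pairs
    subprocess.run(
        ["samtools", "view", "-b", "-s", f"{seed + fraction:.4f}", "-o", out_bam, in_bam],
        check=True,
    )

# e.g. bring the 50 bp cell1_rep1 library down to the coverage-matched ~1.1M reads
downsample_bam("cell1_rep1.dedup.bam", "cell1_rep1.ds.bam", 3_031_071, 1_100_000)
```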

ADD REPLY
0
Entering edit mode

The read length and pairing (single-end versus paired-end) probably have some effect on the alignment rates (and duplicate read rates, for pairing).

I guess looking for peaks / regions with at least 10 CPM in one sample (or some higher threshold) is like down-sampling (although you wouldn't have to re-normalize/quantify each time you added a new sample with lower counts). However, if your quantification uses counts (like htseq-count) rather than coverage, the quantification wouldn't be directly affected as much (although you do need to take all the steps of the analysis into consideration when interpreting the results, and there could be an indirect effect).

In other words, I think there may be ways to at least partially improve the coverage issue that you are describing.

