We want to do a meta-analysis on previously analyzed ATAC-seq data from different cell types. It has been noticed that those ATAC-seq libraries are sequenced at different batches (some libraries are 100bp PE while some are 50 bp PE). The library sizes of dedup read are varied a lot too.
cell type biological replicates dedup read number read length
cell1 rep1 3,031,071 50bp
cell1 rep2 4,964,107 50bp
cell2 rep1 596,001 100bp
cell2 rep2 1,031,791 100bp
cell3 rep1 1,303,817 50bp
cell3 rep2 1,030,124 50bp
cell4 rep1 979,684 100bp
cell4 rep2 720,355 100bp
cell5 rep1 563,108 100bp
cell5 rep2 537,466 100bp
Based on the common sense, more dedup reads generate more peaks (assuming ATAC-seq has very consistent noise background across libraries). I want to down-sample dedup reads for peak calling (macs2) to make library size consistent from sample to sample. However, since we have read length variations among samples, making dedup read number consistent is not enough, since longer reads will have higher genome coverage than shorter reads when read number are the same. Thus, I am wondering when I do donwsampling, should I take read length into consideration?
For example, if I want to finally sub-sample 100bp library dedup read count as 550K reads, should I sub-sample 50bp library dedup read count to 1100K reads instead?
I am also open to any better solutions for this problem.
I agree with you that libraries are undersequenced. Thus I am thinking to merge reads from all biological replicates together then call peaks. For some cells I have large replicate sizes (which are not shown here). I am worried about read length, since I have experienced read-length-bias on quantitative analysis on ATAC-seq data before --- among five biological replicates on two cell types, PCA analysis reveals samples clusters based on library read length but not cell types. I am not sure whether this will be true for peak calling. If peak calling number is correlated with sequence coverage which is estimated as (read_length * read_number)/genome_size, why do you think peak calling will not be dramatically affected by read length?
In terms of BAMPE/BEDPE option (which is very smart!), I currently have tagAlign file (basically a bed file). Every two lines present a pair of read mapping. I can convert that to BEDPE by defining the chromosome name, most left and right position of pair reads as fragment coordinate. Is that valid?