Hi, I have RNA-seq and ATAC-seq data for the same samples, and I would like to cross-validate the two datasets. I got DEGs from RNA-seq and DARs (differentially accessible regions) from ATAC-seq. I annotated the DARs using Homer, so I have the nearest gene for each one. Would it be reasonable to correlate the change in accessibility of each DAR with the change in expression of its nearest gene?
x-axis: log2FC of DARs (ATAC-seq)
y-axis: log2FC of the DAR's nearest gene (RNA-seq)
My concern is: if a DAR is far from its nearest gene, would it still be appropriate to correlate the two fold changes?
Or any other suggestions for ways of cross-validating RNA-seq and ATAC-seq? Thanks!
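For concreteness, here is a minimal sketch of the comparison described above: pair each DAR's log2FC with the RNA-seq log2FC of its nearest gene, then compute a Spearman correlation, optionally keeping only DARs within some distance of the TSS. Everything in it is an assumption made for illustration: the gene names and numbers are made up, the function names are hypothetical, and the 2 kb cutoff is arbitrary rather than a recommended threshold. It uses only the standard library.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; fails if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlate_by_distance(dars, rna_lfc, max_dist):
    """Correlate ATAC and RNA log2FC for DARs within max_dist bp of the TSS.

    dars: list of (nearest_gene, atac_log2fc, distance_to_tss) tuples.
    rna_lfc: dict mapping gene name to RNA-seq log2FC.
    Returns (rho, n_pairs); rho is None when fewer than 3 pairs remain.
    """
    pairs = [
        (atac, rna_lfc[gene])
        for gene, atac, dist in dars
        if gene in rna_lfc and abs(dist) <= max_dist
    ]
    if len(pairs) < 3:
        return None, len(pairs)
    x, y = zip(*pairs)
    return spearman(list(x), list(y)), len(pairs)


# Made-up toy data, for illustration only.
toy_dars = [
    ("GeneA", 1.5, -200),
    ("GeneB", -0.8, 1500),
    ("GeneC", 2.1, 800),
    ("GeneD", 0.4, 45000),
    ("GeneE", -1.2, -120000),
    ("GeneF", 0.9, 300),
]
toy_rna = {"GeneA": 1.1, "GeneB": -0.5, "GeneC": 1.9,
           "GeneD": -0.7, "GeneE": 0.3, "GeneF": 0.2}

for cutoff in (2_000, 1_000_000):
    rho, n = correlate_by_distance(toy_dars, toy_rna, cutoff)
    print(f"max distance {cutoff} bp: n={n}, Spearman rho={rho}")
```

Comparing the correlation at several distance cutoffs is one way to see how much the result depends on the nearest-gene assignment. A gene with several DARs contributes several points, so the pairs are not independent.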
Annotating peaks (any peak: open chromatin, ChIP/protein) to a gene is arbitrary; to my knowledge, there is no reliable method for it that notably outperforms the others. Lack of solid ground truth is a big factor in this. Assays like capture Hi-C tell us that one peak can contact many promoters, sometimes over hundreds of kilobases, while on the other hand TAD boundaries can prevent contact even when a gene-peak pair is just a few kilobases apart.
What people do in my experience (unfortunately) is to try something, see whether it supports their hypothesis, and if not keep trying until it does. Confirmation bias at its finest.
The in silico peak-gene association problem is imo one of the big unsolved challenges in biology.
And it will only be solved when we properly understand the rules that determine the 3D structure of the genome and the activity of enhancers.
In the meantime, I guess the way to validate is to mutate the DAR and see if it affects the expression of the DEG.