Hello there!
I have been working with single-cell ATAC-seq data and wanted to identify the closest genes related to peaks found to be differentially accessed across cell types. I have been analyzing the data with Signac package which is a great package for pre-processing and data analysis of scATAC (https://github.com/timoast/signac). To associate peaks to genes there is the ClosestFeature function which identifies the closest gene to the peak using GenomicRanges functions (I think is FindOverlap) based on the genomic annotation to provide. I have provided both only protein-coding genes and also the full gene biotypes from EnsemblDb.HSapiens.v86 package which obviously gave me different associations. The first option results in peaks being associated to genes also quite far away (even >1Mbp), whereas using the full annotation I get a lot of peaks next to gene bodies of genes which however don't tell me much..My question is what is the best practice in this case, if there is one: should i map peaks to all gene biotypes or protein-coding genes are the most likely to be associated with this signal? Are there a specific combination of gene biotypes you would go for? Perhaps should i map these peaks also to known genomic regulatory regions? Where should i look for that?
Thanks a lot!
In my work I typically limit these assignments on protein-coding genes. The thing is that if you include all genes you get a lot of biotypes such as pseudogenes and different kinds of non-coding RNA genes that are difficult or impossible to interpret since proper annotations (towards function) are lacking. Therefore I typically don't bother with it. It might be a good idea to limit the distance between distal regions and genes, maybe 50 or 100kb. You can also limit assignments to those within the same TAD or subTAD if these kind of HiC data do exist for your cell type. Another filter would be to use only differential peaks and also only differentially-expressed genes to have at least some very basic evidence that two elements might be functionally connected in the genome. Still, this is very crude and will always produce (probably a lot) of false assignments, but might be worth doing to get a first glance towards which genes might be regulated by these peaks.
All good suggestions, really appreciate!