I found that genomic bases in candidate cis-regulatory elements (cCREs) that overlap coding sequence (CDS) show lower conservation than CDS bases with no cCRE overlap (2.839 vs. 2.978, based on phyloP100way scores). The difference is statistically significant. I'm confident in my methodology, and I've thoroughly checked my code for errors. However, this result seems counterintuitive: regions with overlapping functions (acting as both enhancers and CDS) might be expected to show higher conservation than CDS-only regions.
I also ran this on a per-gene basis and found that in 60% of genes the CDS-only score was higher than the overlap score.
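To make the per-gene comparison concrete, here is a minimal sketch of how such a percentage could be computed. The input layout (a dict mapping each gene to its overlap and CDS-only phyloP scores), the function name, and the choice to skip genes that lack either class are all assumptions for illustration, not necessarily what I did in my own code.

```python
from statistics import mean

def fraction_cds_only_higher(per_gene_scores):
    """Return (fraction of genes whose CDS-only mean exceeds the overlap mean,
    number of genes compared).

    per_gene_scores: hypothetical dict of
        gene -> (overlap_scores, cds_only_scores), each a list of floats.
    Genes missing either class are skipped, since no comparison is possible.
    """
    compared = 0
    higher = 0
    for gene, (overlap_scores, cds_only_scores) in per_gene_scores.items():
        if not overlap_scores or not cds_only_scores:
            continue  # gene has no cCRE overlap, or is entirely covered
        compared += 1
        if mean(cds_only_scores) > mean(overlap_scores):
            higher += 1
    fraction = higher / compared if compared else float("nan")
    return fraction, compared

# Toy data, made up purely to show the call
toy = {
    "GENE_A": ([1.0, 2.0], [3.0]),
    "GENE_B": ([4.0], [2.0, 3.0]),
    "GENE_C": ([], [1.0]),  # skipped: no overlap bases
}
fraction, n_genes = fraction_cds_only_higher(toy)
```

Reporting the number of genes actually compared alongside the fraction makes it clear how many genes were dropped for having only one class of bases.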
For reference, I'm using ENCODE cCREs and GENCODE CDS regions (filtered for MANE Select transcripts).
Additionally, I analyzed ClinVar synonymous variants and found that 50.1% overlap with cCREs. I had expected cCRE-CDS overlap regions to be depleted of synonymous variants.
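As a rough illustration of the variant overlap count, the sketch below computes the fraction of variant positions that fall inside a set of intervals. It assumes a single chromosome, 0-based half-open intervals that are already sorted and non-overlapping, and a hypothetical function name; real ClinVar and cCRE files would need parsing and per-chromosome handling first.

```python
from bisect import bisect_right

def fraction_in_intervals(positions, intervals):
    """Fraction of positions lying inside any interval.

    positions: list of 0-based variant positions on one chromosome.
    intervals: sorted, non-overlapping half-open (start, end) tuples.
    """
    starts = [start for start, _ in intervals]
    hits = 0
    for pos in positions:
        # Index of the last interval whose start is <= pos, or -1 if none
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits / len(positions) if positions else float("nan")

# Toy coordinates, made up purely to show the call
toy_ccres = [(150, 200), (300, 320)]
toy_variants = [150, 199, 200, 310, 50]
observed = fraction_in_intervals(toy_variants, toy_ccres)
```

In this toy case positions 150, 199 and 310 fall inside an interval, while 200 (the excluded end of a half-open interval) and 50 do not.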
Could there be a logical explanation for these findings, or might there be confounding variables affecting the results? Is there another analysis anyone would recommend to explore this further?
Hello, thanks for your response. Re: my methodology: all I did was take the cCRE regions from ENCODE and intersect them with the CDS (coding) regions from GENCODE. I end up with "overlap" regions and non-overlap regions. The regions where there is CDS but no cCRE are more conserved, which I think is strange.
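For concreteness, the overlap step described above could be sketched as follows. This is a simplified illustration rather than my actual pipeline: it assumes one chromosome, 0-based half-open (start, end) coordinates as in BED files, and hypothetical function names. Both inputs are merged first so that overlapping cCREs or CDS exons do not cause a base to be counted twice.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def split_cds(cds, ccre):
    """Split CDS intervals into (overlap, cds_only) segments.

    overlap: CDS bases that also lie in a cCRE.
    cds_only: CDS bases outside every cCRE.
    """
    ccre = merge(ccre)
    overlap, cds_only = [], []
    for c_start, c_end in merge(cds):
        pos = c_start  # first CDS base not yet assigned to a class
        for r_start, r_end in ccre:
            if r_end <= pos:
                continue  # cCRE ends before the unassigned part
            if r_start >= c_end:
                break  # cCRE starts after this CDS interval
            if r_start > pos:
                cds_only.append((pos, r_start))
            ov_end = min(c_end, r_end)
            overlap.append((max(pos, r_start), ov_end))
            pos = ov_end
        if pos < c_end:
            cds_only.append((pos, c_end))
    return overlap, cds_only

# Toy coordinates, made up purely to show the call
toy_cds = [(100, 200), (300, 400)]
toy_ccre = [(150, 320), (390, 500)]
overlap, cds_only = split_cds(toy_cds, toy_ccre)
# overlap  -> [(150, 200), (300, 320), (390, 400)]
# cds_only -> [(100, 150), (320, 390)]
```

A useful sanity check is that the overlap and CDS-only segments together cover exactly as many bases as the merged CDS; in the toy case that is 80 + 120 = 200.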