2.0 years ago
ning
CpGs are symmetrical, in that a CG sequence on the forward strand is hybridized to a GC on the reverse strand, and the dinucleotides on both strands are CpG dinucleotides that can be methylated. Conversely, a CpG can be GC on the forward strand and CG on the reverse strand.
FORWARD -> 5'--CG--3'  [OR]  5'--GC--3' <- FORWARD
REVERSE -> 3'--GC--5'        3'--CG--5' <- REVERSE
The assignment of "forward" and "reverse" strandedness is more or less arbitrary.
Given the above, why does it seem like the Illumina 850k (aka EPIC) array only profiles methylation from CpGs which are CG in the forward strand, while ignoring CpGs which are GC in the forward strand? I would also love to hear if my premises are wrong.
suppressPackageStartupMessages({
  library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
  library(tidyverse)
})
data(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
IlluminaHumanMethylationEPICanno.ilm10b4.hg19 %>%
  getAnnotation() %>%
  as_tibble() %>%
  count(forward_seq = str_extract(Forward_Sequence, "\\[[ATCG]{2}\\]"))
# Results:
# # A tibble: 3 × 2
#   forward_seq      n
#   <chr>        <int>
# 1 [CA]          2922
# 2 [CG]        862927
# 3 [CT]            10
I thought the information on the strand was contained in the strand variable. Could it just be that the Forward sequence variable contains the genomic 5'->3' sequence for that CG location (so the sequence is always the '+' strand)?
As far as I know, CpG sites are those in which a C is followed by a G when reading in the 5'->3' direction, so a 5'->3' GC (+) / 3'->5' CG (-) site, such as the second option in your schema, is not a CpG site.
Papyrus I agree with your interpretation of the Forward sequence and strand variables. But why should CpG sites exclude 5' -> 3' GC (+) sites when biologically they are expected to behave just like 5' -> 3' CG (+) sites, since the designation of the forward strand is arbitrary?
Hmm, I'm no expert, but I don't think 5'CpG3' and 3'CpG5' are sterically equivalent. These are different molecules, so, like any other genomic sequence motif recognized by enzymes, they may behave differently (there's a nice figure on Wikipedia comparing CpG and GpC sites).
The designation of which strand is the "forward" strand is indeed arbitrary, but what is not arbitrary is that one end of the strand is 5', and the other is 3', and the CpG site is defined by reading in that direction, not by reading on the forward or reverse strand. 5'->3' directionality has biological/functional meaning such as in DNA replication or transcription.
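For anyone who, like me, finds it easier to check this with strings: here is a minimal base-R sketch of the point above. The helper names revcomp and count_motif and the toy sequence are made up for illustration; they are not part of any annotation package. The idea is that reading the opposite strand 5'->3' means taking the reverse complement, and both CG and GC are their own reverse complement, so a GC on one strand never turns into a CG on the other.

```r
# Reverse-complement a DNA string that is written 5'->3' (base R only).
# revcomp() and count_motif() are hypothetical helpers for illustration.
revcomp <- function(seq) {
  bases <- rev(strsplit(seq, "")[[1]])                 # reverse base order
  paste(chartr("ACGT", "TGCA", bases), collapse = "")  # complement each base
}

# Count non-overlapping occurrences of a fixed motif in a sequence.
count_motif <- function(seq, motif) {
  lengths(regmatches(seq, gregexpr(motif, seq, fixed = TRUE)))
}

revcomp("CG")  # "CG": a CpG is still a CpG on the opposite strand
revcomp("GC")  # "GC": a GpC stays a GpC, it never becomes a CpG

# Toy sequence (made up) containing one CG and one GC.
fwd <- "TACGAGCT"
rev_strand <- revcomp(fwd)       # opposite strand, read 5'->3'
count_motif(fwd, "CG")           # CpGs on the forward strand
count_motif(rev_strand, "CG")    # CpGs on the reverse strand
```

Both strands report the same number of CG dinucleotides when each is read 5'->3', which is consistent with the array annotation listing almost only [CG] in the forward sequence.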
Papyrus You're right, thank you! I had completely forgotten about the biochemistry of DNA and was only thinking in strings. As you have written, the forward strand is 5' -> 3', so a GC on the forward strand is a GpC site, not a CpG site.
Glad to help! I also had to recall those concepts :)
I cross-posted this to Bioinformatics StackExchange: https://bioinformatics.stackexchange.com/q/20043/6520