Hi everyone, I come from a computer science background, so my knowledge of genomics is limited. I’m planning to develop an algorithm that can be applied to methylation data, but my questions might be quite basic. I would appreciate your help in understanding this correctly.
Recently, I looked into methylation datasets and noticed that sites are typically labeled with the prefix 'cg' followed by a string of digits, like 'cg04303809'. I understand that 'cg' refers to methylation at CpG sites in DNA sequences, but I’m unsure what the digits that follow mean. Do they represent positions in the reference genome? I’m also curious whether there is any sequential relationship between these methylation sites. For example, does the methylation status at one CpG site affect the methylation status at another CpG site elsewhere? If such relationships exist, could you please point me to any references or resources that discuss them? Thank you very much.
Thank you for your reply. If the number is an index for the CpG sites within a sequence, are the values contiguous, and do they follow an order? For example, given cg02494853, cg01707559, and cg04016144, is cg01707559 actually located between cg02494853 and cg04016144? Also, I am using a dataset from Kaggle; according to its description, the data is part of a challenge. Here is the link: https://www.kaggle.com/datasets/marquis03/age-assessment-and-disease-risk-prediction?select=ai4bio_trainset
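To make the ordering question concrete, here is a minimal sketch of the check I have in mind. It assumes I had an annotation table mapping each probe ID to a chromosome and position; I don't have one yet, so `toy_annotation` and its coordinates are invented placeholders, and the function names are my own.

```python
# Hypothetical annotation: probe ID -> (chromosome, position).
# The coordinates are invented placeholders, NOT real genomic positions.
toy_annotation = {
    "cg02494853": ("chr1", 1200),
    "cg01707559": ("chr1", 9000),
    "cg04016144": ("chr1", 5000),
}


def order_by_id(annotation):
    """Sort probe IDs by the integer formed by the digits after 'cg'."""
    return sorted(annotation, key=lambda probe_id: int(probe_id[2:]))


def order_by_position(annotation):
    """Sort probe IDs by (chromosome, position).

    Chromosome names are compared as plain strings here, which is only
    adequate for this single-chromosome toy example.
    """
    return sorted(annotation, key=lambda probe_id: annotation[probe_id])


def id_order_matches_position(annotation):
    """Return True if sorting by ID gives the same order as sorting by position."""
    return order_by_id(annotation) == order_by_position(annotation)


if __name__ == "__main__":
    print("By ID:      ", order_by_id(toy_annotation))
    print("By position:", order_by_position(toy_annotation))
    print("Same order? ", id_order_matches_position(toy_annotation))
```

With a real annotation table in place of the toy dictionary, the same comparison would show whether the digits carry any positional order.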