I am using R package FDb.InfiniumMethylation.hg19 try to get the related gene information to some cpg sites. But I found a problem. For example, I have 3 cpg sites: cg10631684, cg13611636, cg15448829. I try to find the nearest gene to them (or which gene they are in). So I used following R code:
library(FDb.InfiniumMethylation.hg19)
hm450=get450k()
cpg = c("cg10631684", "cg13611636","cg15448829")
probes=hm450[cpg]
getNearestGene(probes)
I can get position informtion from object probes
cg10631684 starts at 113795897, ends at 113795898 (chr12).
cg13611636 starts at 40753851, ends at 40753852 (chr8).
cg15448829 starts at 119634634, ends at 119634635 (chr8).
For getNearestGene(probes), I can have
queryHits subjectHits distance nearestGeneSymbol
cg10631684 1 6463 472 PLBD2
cg13611636 11 21341 0 ZMAT4
cg15448829 10 16004 0 SAMD12-AS1
So cg10631684 is not in any gene, the nearest gene to it is PLBD2. cg13611636 is in the gene ZMAT4. ,cg15448829 is in the gene SAMD12-AS1.
Then, I did something different. I downloaded all the gene information from biomart(http://grch37.ensembl.org/biomart/martview/) create a bed file. I also created a bed file for these 3 cpg sites. I used command intersectBed try to see if these two cpg sites overlapping with any genes. I found:
cg10631684 is in gene SLC8B1. cg13611636 is in the gene ZMAT4. cg15448829 is not in any gene.
So only 1 cpg site has the same result. Other 2 are not consistent. I checked SLC8B1 starts at 113797298 ends at 113736564 (chr12). It seems cg10631684 should be in gene SLC8B1. For gene SAMD12-AS1, I cannot find it in biomart.
So my questions here are:
- Is my understanding of function getNearestGene wrong? How come it considers cg10631684 is not in gene SLC8B1?
- The biomart doesn't contain all the gene? (I am using hg19, so I used GRCH37)
Thank you.
A couple things:
bedtools closest
function a try and see what results you obtain with this. Bedtools intersect doesn't seem to be the appropriate method here.