I have some gene expression data from a publicly available array dataset generated on the SurePrint G3 Human GE 8x60K Microarray platform. I am trying to annotate the Agilent probe IDs with Entrez IDs using biomaRt in R. However, it appears that several Agilent IDs are missing from biomaRt.
As I am not familiar with Agilent's technology, I am not sure whether this is due to the design of the array (e.g. custom probes), to missing data in biomaRt, or of course to an error in my code.
The link provides the table of Agilent probe IDs, along with the other gene identifiers used by the expression dataset.
## This is just loading in the table from the link above.
probe = read.delim("GPL15931-probe_annotation.txt", comment.char = '#')
library(biomaRt)
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
## Should return all the agilent probe ids in biomaRt
agilent = getBM(attributes=c( 'efg_agilent_sureprint_g3_ge_8x60k' ),values="*", mart= ensembl)
The first thing that strikes me is that for a 60K array there are only 31K probe IDs returned. Am I missing something here with either the technology or the code?
If I check which probes match between the two datasets, the difference is 11K probes. All the probes in biomaRt match, but 11K probes that are in the GEO dataset are missing from biomaRt.
table(probe$ID %in% agilent$efg_agilent_sureprint_g3_ge_8x60k)
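To see which probes are missing rather than just how many, it may help to pull out the unmatched IDs and group them. The sketch below uses toy data frames in place of the real `probe` and `agilent` objects; the probe IDs are made up, and grouping by ID prefix is only an assumed, rough way of spotting whether the missing probes belong to a particular probe family.

```r
# Toy stand-ins for the GEO table and the biomaRt result (IDs are invented).
probe <- data.frame(ID = c("A_23_P1", "A_23_P2", "A_33_P3", "A_19_P4"),
                    stringsAsFactors = FALSE)
agilent <- data.frame(
  efg_agilent_sureprint_g3_ge_8x60k = c("A_23_P1", "A_23_P2"),
  stringsAsFactors = FALSE
)

# Logical vector: is each GEO probe ID present in the biomaRt result?
matched <- probe$ID %in% agilent$efg_agilent_sureprint_g3_ge_8x60k

# The IDs that biomaRt did not return
missing_ids <- probe$ID[!matched]

# Rough grouping of the missing IDs by their leading "A_<number>" prefix
prefix <- sub("^(A_[0-9]+)_.*$", "\\1", missing_ids)
print(table(prefix))
```

On the real data, `head(missing_ids)` can then be compared against the other columns of the GEO table to check whether those probes carry any gene annotation at all.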
Are there any Agilent Bioconductor packages that might have more complete IDs and gene identifiers? Any other thoughts on how to work around this problem?
Although the question above still stands, I did find a workaround in case someone is having a similar experience. The workaround involved using the genomic coordinates of the probes and looking for overlaps with all genes (with Entrez IDs).
After splitting the genomic coordinates from the GEO data table into chr, start and stop, I can use the GenomicRanges package. There was also an issue with reverse-strand gene coordinates, which required me to reverse them with a simple loop.
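A minimal sketch of this workaround is given below. It assumes the coordinates are stored as strings of the form `chr:start-end` and that reverse-strand entries are written with start greater than end; both are assumptions about the table rather than something confirmed here. The probe names, gene table and Entrez IDs are invented for illustration. A vectorised `pmin`/`pmax` is used instead of a loop to put the coordinates in ascending order, and the overlap step only runs if GenomicRanges is installed.

```r
# Hypothetical coordinate strings; the second one is "reversed" (start > end).
loc <- c("chr1:100-160", "chr2:560-500", "chrX:1000-1060")
probe_ids <- c("probe_a", "probe_b", "probe_c")

# Split "chr:start-end" into its three parts
parts <- strsplit(loc, "[:-]")
chr  <- vapply(parts, `[`, character(1), 1)
pos1 <- as.integer(vapply(parts, `[`, character(1), 2))
pos2 <- as.integer(vapply(parts, `[`, character(1), 3))

# Order the coordinates so start <= end, and keep the implied strand
coords <- data.frame(
  id     = probe_ids,
  chr    = chr,
  start  = pmin(pos1, pos2),
  end    = pmax(pos1, pos2),
  strand = ifelse(pos1 <= pos2, "+", "-"),
  stringsAsFactors = FALSE
)

# Invented gene table with Entrez IDs and coordinates
genes <- data.frame(
  entrez = c("1001", "1002"),
  chr    = c("chr1", "chr2"),
  start  = c(50L, 400L),
  end    = c(200L, 520L),
  stringsAsFactors = FALSE
)

# Overlap probes with genes (strand is ignored here for simplicity)
if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  probe_gr <- GenomicRanges::GRanges(coords$chr,
                                     IRanges::IRanges(coords$start, coords$end))
  gene_gr  <- GenomicRanges::GRanges(genes$chr,
                                     IRanges::IRanges(genes$start, genes$end))
  hits <- suppressWarnings(GenomicRanges::findOverlaps(probe_gr, gene_gr))
  annotated <- data.frame(
    id     = coords$id[S4Vectors::queryHits(hits)],
    entrez = genes$entrez[S4Vectors::subjectHits(hits)],
    stringsAsFactors = FALSE
  )
  print(annotated)
}
```

Probes that overlap more than one gene appear on several rows of `annotated`, and probes with no overlapping gene are dropped, so some decision is still needed on how to handle both cases.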