When we download and use microarray data, we go through several pre-processing steps, one of which is mapping the microarray probes to the genes we need. Microarrays come in various platforms, each with its own set of probe IDs. We use the following code to convert these probe IDs to Ensembl gene IDs.
For example, take the dataset GSE46239. Its platform is the Affymetrix Human Genome U133 Plus 2.0 Array [HG-U133_Plus_2], and the probe IDs can be converted to Ensembl gene IDs with this code:
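library(GEOquery)  # getGEO
library(Biobase)   # pData, exprs
library(oligo)     # read.celfiles, rma
library(biomaRt)   # useMart, useDataset, getBM
# GSE46239 sample annotation
m_46239 <- getGEO(filename = 'GSE46239/GSE46239_series_matrix.txt.gz')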
m_46239_anno <- pData(m_46239)
m_46239_anno <- m_46239_anno[,c('geo_accession', 'dx:ch1')]
rownames(m_46239_anno) <- NULL
colnames(m_46239_anno) <- c('Sample', 'State')
m_46239_anno$State <- ifelse(m_46239_anno$State == 'healthy', 'Normal', 'Disease')
m_46239_anno$Sample <- paste0(m_46239_anno$Sample, '.CEL.gz')
# GSE46239 expression # Affymetrix GeneChip Human Genome U133 Plus 2.0 Array [HG-U133_Plus_2]
m_46239_cel <- read.celfiles(paste0('GSE46239/', m_46239_anno$Sample), pkgname = 'pd.hg.u133.plus.2')
m_46239_rma <- oligo::rma(m_46239_cel)
m_46239_exp <- exprs(m_46239_rma)
ensembl <- useMart('ENSEMBL_MART_ENSEMBL')
ensembl_hs <- useDataset(dataset = 'hsapiens_gene_ensembl', mart = ensembl)
u133v2 <- getBM(attributes = c('affy_hg_u133_plus_2', 'ensembl_gene_id'),
                filters = 'affy_hg_u133_plus_2',
                values = rownames(m_46239_exp),
                mart = ensembl_hs)
The result, u133v2, confirms that the probe IDs (HG-U133_Plus_2) have been matched to Ensembl gene IDs. However, the mapping is not one-to-one: as the output below shows, a single probe ID can map to several Ensembl gene IDs. This raises the question of how to merge the mapping with the expression data.
head(u133v2)
  affy_hg_u133_plus_2 ensembl_gene_id
1        1553551_s_at ENSG00000210082
2        1553551_s_at ENSG00000209082
3        1553551_s_at ENSG00000198888
4        1553551_s_at ENSG00000210100
5        1553551_s_at ENSG00000210112
6        1553538_s_at ENSG00000198763
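To make the problem concrete, here is a minimal base-R sketch of how the duplication could be counted and of one possible way to collapse probes to genes. It rebuilds only the six rows printed above rather than calling getBM(). The expression matrix `toy_exp` and its values are hypothetical and made up purely for illustration. Dropping probes that map to more than one gene and averaging the remaining probes per gene is just one convention I am considering, not necessarily the right one.

```r
# Copy of the six rows printed by head(u133v2); the real table comes from getBM().
u133v2 <- data.frame(
  affy_hg_u133_plus_2 = c(rep("1553551_s_at", 5), "1553538_s_at"),
  ensembl_gene_id = c("ENSG00000210082", "ENSG00000209082", "ENSG00000198888",
                      "ENSG00000210100", "ENSG00000210112", "ENSG00000198763"),
  stringsAsFactors = FALSE
)

# Number of distinct Ensembl genes per probe (values > 1 are multi-mapping probes).
genes_per_probe <- tapply(u133v2$ensembl_gene_id, u133v2$affy_hg_u133_plus_2,
                          function(x) length(unique(x)))

# Assumption: probes that map to more than one gene are ambiguous and are dropped.
multi_probes <- names(genes_per_probe)[genes_per_probe > 1]
one_to_one <- u133v2[!u133v2$affy_hg_u133_plus_2 %in% multi_probes, ]

# Hypothetical toy expression matrix (probes x samples); the values are made up.
toy_exp <- matrix(c(5, 7,
                    6, 8),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("1553551_s_at", "1553538_s_at"),
                                  c("S1", "S2")))

# Assumption: when several probes map to the same gene, average them.
idx <- match(one_to_one$affy_hg_u133_plus_2, rownames(toy_exp))
gene_sum <- rowsum(toy_exp[idx, , drop = FALSE],
                   group = one_to_one$ensembl_gene_id)
gene_n <- as.vector(table(one_to_one$ensembl_gene_id)[rownames(gene_sum)])
gene_exp <- gene_sum / gene_n  # genes x samples
```

With the real data, `toy_exp` would be replaced by m_46239_exp. Other conventions exist, such as keeping the probe with the highest mean or variance per gene, and I am not sure which is preferable here.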
I would appreciate any suggestions on how to merge or link them. Thank you.