Issue with converting Affy IDs to gene IDs/symbols
1
0
Entering edit mode
4 months ago
rbronste ▴ 420

Hi I am trying to convert Affy IDs in the following way:

library(GEOquery)
gset <- getGEO("GSE180395", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL19983", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

head(rownames(gset))

require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
mart=mart,
attributes=c(
"affy_hugene_2_1_st_v1",
"ensembl_gene_id",
"gene_biotype",
"external_gene_name"),
filters = "affy_hugene_2_1_st_v1",
values = rownames(exprs(gset))[1:50], uniqueRows=TRUE)

head(annotLookup, 20)

I get the following empty table:

head(annotLookup, 20)
[1] affy_hugene_2_1_st_v1 ensembl_gene_id       gene_biotype          external_gene_name   
<0 rows> (or 0-length row.names)

Any ideas? Thanks!

biomart affymetrix GEOquery microarray • 543 views
ADD COMMENT
1
Entering edit mode

As noted below, the IDs here are not actually Affymetrix IDs, but instead are IDs associated with Entrez Genes based on the BrainArray annotation. To get the Entrez Gene ID without playing regex games, you can use fData(gset)$ENTREZ_GENE_ID. See https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL19983 for some details.

ADD REPLY
2
Entering edit mode
4 months ago

When you run head(rownames(gset)) what do you actually get?

EDIT:

I was able to replicate the same issue and having played around with it a little bit I can't really say why it's doing this but I can't get it to work. A workaround that does work is just to strip the Entrez gene ID from the array probe name since that's how they're named.

library(biomaRt)
library(GEOquery)

gset <- getGEO("GSE180395", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL19983", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

example.vals <- gsub("_.+", "", rownames(exprs(gset))[1:50])

annotLookup <- getBM(mart = mart,
                     attributes = c("entrezgene_id",
                                    "ensembl_gene_id",
                                    "gene_biotype",
                                    "external_gene_name"),
                     filters = "entrezgene_id",
                     values = example.vals)
ADD COMMENT
0
Entering edit mode
> head(rownames(gset)) 
  [1] "100009613_at" "100009676_at" "10000_at"     "10001_at"     "10002_at"     "100033413_at"
ADD REPLY
1
Entering edit mode

Weird, anyway I've edited my reply to provide a fix. If you do figure the issue out more elegantly, let me know.

ADD REPLY
0
Entering edit mode

Great thanks for the help!

If I can ask one other quick question on this, what would be simplest way to modify that initial ExpressionSet to replace the probe IDs with the external_gene_name column?

Thanks again.

ADD REPLY
1
Entering edit mode

This is hacky. I'd welcome suggestions from other people, but here's something you could try.

## as Sean Davis said above you can use the fData()
entrez.ids <- fData(gset)$ENTREZ_GENE_ID

annotLookup <- getBM(mart = mart,
                     attributes = c("entrezgene_id",
                                    "ensembl_gene_id",
                                    "gene_biotype",
                                    "external_gene_name"),
                     filters = "entrezgene_id",
                     values = entrez.ids, uniqueRows = TRUE)

annotLookup <- annotLookup[which(!annotLookup$external_gene_name == ""), ]
annotLookup <- annotLookup[annotLookup$gene_biotype=="protein_coding", ]

annotLookup <- group_by(annotLookup, entrezgene_id)
annotLookup <- summarise(annotLookup, 
                         ensembl_gene_id = paste(unique(ensembl_gene_id), collapse = "; "),
                         external_gene_name = paste(unique(external_gene_name), collapse = "; "))

annotLookup <- as.data.frame(annotLookup)

## you have 31 entries where an entrez ID maps multiple genes
grep(";", annotLookup$external_gene_name, value = TRUE)
## you could deal with them some other way, but I'd be tempted to just drop them
annotLookup <- annotLookup[grep(";", annotLookup$external_gene_name, invert = TRUE), ]

## you have 22 entries where gene maps multiple entrez ID 
dup.genes <- annotLookup$external_gene_name[which(duplicated(annotLookup$external_gene_name))]
annotLookup[annotLookup$external_gene_name %in% dup.genes, ]

## let's line the objects up first
rownames(gset) <- entrez.ids
common.set <- intersect(rownames(gset), annotLookup$entrezgene_id)
gset <- gset[common.set, ]

rownames(annotLookup) <- annotLookup$entrezgene_id
annotLookup <- annotLookup[common.set, ]

all(rownames(annotLookup) == rownames(gset))

## you still have duplicates
which(duplicated(annotLookup$external_gene_name))

exprs.mat <- exprs(gset)

## use an aggregate function - I use median here, but you can collapse the duplicates some other way, or just ignoring them
med.mat <- apply(exprs.mat, 2, \(x) tapply(x, annotLookup$external_gene_name, median))

## make a new expression set that is labelled as genes
symbol.gset <- ExpressionSet(med.mat)

colnames(symbol.gset)==colnames(gset)
pData(symbol.gset) <- pData(gset)
ADD REPLY

Login before adding your answer.

Traffic: 2518 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6