Question

Issue with converting Affy IDs to gene IDs/symbols

0

Entering edit mode

8 weeks ago

rbronste ▴ 420

Hi I am trying to convert Affy IDs in the following way:

library(GEOquery)
gset <- getGEO("GSE180395", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL19983", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

head(rownames(gset))

require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
mart=mart,
attributes=c(
"affy_hugene_2_1_st_v1",
"ensembl_gene_id",
"gene_biotype",
"external_gene_name"),
filters = "affy_hugene_2_1_st_v1",
values = rownames(exprs(gset))[1:50], uniqueRows=TRUE)

head(annotLookup, 20)

I get the following empty table:

head(annotLookup, 20)
[1] affy_hugene_2_1_st_v1 ensembl_gene_id       gene_biotype          external_gene_name   
<0 rows> (or 0-length row.names)

Any ideas? Thanks!

biomart affymetrix GEOquery microarray • 471 views

ADD COMMENT • link updated 8 weeks ago by yura.grabovska ▴ 360 • written 8 weeks ago by rbronste ▴ 420

1

Entering edit mode

As noted below, the IDs here are not actually Affymetrix IDs, but instead are IDs associated with Entrez Genes based on the BrainArray annotation. To get the Entrez Gene ID without playing regex games, you can use fData(gset)$ENTREZ_GENE_ID. See https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL19983 for some details.

ADD REPLY • link 8 weeks ago by Sean Davis 27k

score 2 · Answer 1 · 2024-07-08

2

Entering edit mode

8 weeks ago

yura.grabovska ▴ 360

When you run head(rownames(gset)) what do you actually get?

EDIT:

I was able to replicate the same issue and having played around with it a little bit I can't really say why it's doing this but I can't get it to work. A workaround that does work is just to strip the Entrez gene ID from the array probe name since that's how they're named.

library(biomaRt)
library(GEOquery)

gset <- getGEO("GSE180395", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL19983", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

example.vals <- gsub("_.+", "", rownames(exprs(gset))[1:50])

annotLookup <- getBM(mart = mart,
                     attributes = c("entrezgene_id",
                                    "ensembl_gene_id",
                                    "gene_biotype",
                                    "external_gene_name"),
                     filters = "entrezgene_id",
                     values = example.vals)

ADD COMMENT • link 8 weeks ago by yura.grabovska ▴ 360

0

Entering edit mode

> head(rownames(gset)) 
  [1] "100009613_at" "100009676_at" "10000_at"     "10001_at"     "10002_at"     "100033413_at"

ADD REPLY • link 8 weeks ago by rbronste ▴ 420

1

Entering edit mode

Weird, anyway I've edited my reply to provide a fix. If you do figure the issue out more elegantly, let me know.

ADD REPLY • link 8 weeks ago by yura.grabovska ▴ 360

0

Entering edit mode

Great thanks for the help!

If I can ask one other quick question on this, what would be simplest way to modify that initial ExpressionSet to replace the probe IDs with the external_gene_name column?

Thanks again.

ADD REPLY • link 8 weeks ago by rbronste ▴ 420

1

Entering edit mode

This is hacky. I'd welcome suggestions from other people, but here's something you could try.

## as Sean Davis said above you can use the fData()
entrez.ids <- fData(gset)$ENTREZ_GENE_ID

annotLookup <- getBM(mart = mart,
                     attributes = c("entrezgene_id",
                                    "ensembl_gene_id",
                                    "gene_biotype",
                                    "external_gene_name"),
                     filters = "entrezgene_id",
                     values = entrez.ids, uniqueRows = TRUE)

annotLookup <- annotLookup[which(!annotLookup$external_gene_name == ""), ]
annotLookup <- annotLookup[annotLookup$gene_biotype=="protein_coding", ]

annotLookup <- group_by(annotLookup, entrezgene_id)
annotLookup <- summarise(annotLookup, 
                         ensembl_gene_id = paste(unique(ensembl_gene_id), collapse = "; "),
                         external_gene_name = paste(unique(external_gene_name), collapse = "; "))

annotLookup <- as.data.frame(annotLookup)

## you have 31 entries where an entrez ID maps multiple genes
grep(";", annotLookup$external_gene_name, value = TRUE)
## you could deal with them some other way, but I'd be tempted to just drop them
annotLookup <- annotLookup[grep(";", annotLookup$external_gene_name, invert = TRUE), ]

## you have 22 entries where gene maps multiple entrez ID 
dup.genes <- annotLookup$external_gene_name[which(duplicated(annotLookup$external_gene_name))]
annotLookup[annotLookup$external_gene_name %in% dup.genes, ]

## let's line the objects up first
rownames(gset) <- entrez.ids
common.set <- intersect(rownames(gset), annotLookup$entrezgene_id)
gset <- gset[common.set, ]

rownames(annotLookup) <- annotLookup$entrezgene_id
annotLookup <- annotLookup[common.set, ]

all(rownames(annotLookup) == rownames(gset))

## you still have duplicates
which(duplicated(annotLookup$external_gene_name))

exprs.mat <- exprs(gset)

## use an aggregate function - I use median here, but you can collapse the duplicates some other way, or just ignoring them
med.mat <- apply(exprs.mat, 2, \(x) tapply(x, annotLookup$external_gene_name, median))

## make a new expression set that is labelled as genes
symbol.gset <- ExpressionSet(med.mat)

colnames(symbol.gset)==colnames(gset)
pData(symbol.gset) <- pData(gset)

ADD REPLY • link 8 weeks ago by yura.grabovska ▴ 360