I am quite unsure on how to annotate the probesets from a GEO dataset(in this case GSE14333) to gene symbols, in R. The Affymetrix Human Genome U133Plus 2.0 arrays, was used for this dataset.
Would appreciate any help in this regard.
Many thanks!
I am quite unsure on how to annotate the probesets from a GEO dataset(in this case GSE14333) to gene symbols, in R. The Affymetrix Human Genome U133Plus 2.0 arrays, was used for this dataset.
Would appreciate any help in this regard.
Many thanks!
You can use biomaRt. Here is a previous reproducible example using another dataset of the same array type, GSE12056:
First download the dataset from GEO:
require(GEOquery)
require(Biobase)
gset <- getGEO("GSE12056", GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
We now have the ExpressionSet object:
dim(exprs(gset))
[1] 54674 20
rownames(exprs(gset))[1:50]
[1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at"
[6] "1294_at" "1316_at" "1320_at" "1405_i_at" "1431_at"
[11] "1438_at" "1487_at" "1494_f_at" "1552256_a_at" "1552257_a_at"
[16] "1552258_at" "1552261_at" "1552263_at" "1552264_a_at" "1552266_at"
[21] "1552269_at" "1552271_at" "1552272_a_at" "1552274_at" "1552275_s_at"
[26] "1552276_a_at" "1552277_a_at" "1552278_a_at" "1552279_a_at" "1552280_at"
[31] "1552281_at" "1552283_s_at" "1552286_at" "1552287_s_at" "1552288_at"
[36] "1552289_a_at" "1552291_at" "1552293_at" "1552295_a_at" "1552296_at"
[41] "1552299_at" "1552301_a_at" "1552302_at" "1552303_a_at" "1552304_at"
[46] "1552306_at" "1552307_a_at" "1552309_a_at" "1552310_at" "1552311_a_at"
Now annotate these first 50 and create a 'lookup' table of annotation that can be used to rename your Affy IDs to gene names (takes a long time to look up all IDs):
require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
mart=mart,
attributes=c(
"affy_hg_u133_plus_2",
"ensembl_gene_id",
"gene_biotype",
"external_gene_name"),
filter = "affy_hg_u133_plus_2",
values = rownames(exprs(gset))[1:50], uniqueRows=TRUE)
head(annotLookup, 20)
affy_hg_u133_plus_2 ensembl_gene_id gene_biotype external_gene_name
1294_at ENSG00000283726 miRNA MIR5193
1316_at ENSG00000126351 protein_coding THRA
1552310_at ENSG00000169609 protein_coding C15orf40
1552286_at ENSG00000250565 protein_coding ATP6V1E2
1552291_at ENSG00000163964 protein_coding PIGX
1294_at ENSG00000182179 protein_coding UBA7
1552296_at ENSG00000142959 protein_coding BEST4
1438_at ENSG00000182580 protein_coding EPHB3
1552287_s_at ENSG00000223959 transcribed_pseudogene AFG3L1P
1007_s_at ENSG00000234078 protein_coding DDR1
1552280_at ENSG00000145850 protein_coding TIMD4
1552304_at ENSG00000139133 protein_coding ALG10
1552306_at ENSG00000139133 protein_coding ALG10
1320_at ENSG00000070778 protein_coding PTPN21
1552256_a_at ENSG00000073060 protein_coding SCARB1
1007_s_at ENSG00000215522 protein_coding DDR1
1552303_a_at ENSG00000184988 protein_coding TMEM106A
1552302_at ENSG00000184988 protein_coding TMEM106A
1552309_a_at ENSG00000162614 protein_coding NEXN
1552274_at ENSG00000168297 protein_coding PXK
If you want the RefSeq 'NM' and 'NR' identifiers, then add "refseq_mrna" and "refseq_ncrna" to attributes
Kevin
Hi Kevin, how do I map the probe id's to the gene symbols so that they are in the same order. I just noticed that the annotLookup table probe id's are in a different order when compared to the original dataset.
> rownames(exprs(gset))[1:10]
[1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" "1294_at" "1316_at" "1320_at" "1405_i_at"
[10] "1431_at"
> annotLookup <- getBM(mart=mart, attributes=c("affy_hg_u133_plus_2", "external_gene_name"), filter="affy_hg_u133_plus_2", values <- rownames(exprs(gset))[1:10] , uniqueRows=TRUE)
> head(annotLookup)
affy_hg_u133_plus_2 external_gene_name
1 1294_at MIR5193
2 1316_at THRA
3 1294_at UBA7
4 1007_s_at DDR1
5 1320_at PTPN21
6 117_at HSPA6
Thanks again!
Yes, thanks for bringing this up. The table that is generated is just a 'lookup' table, which you can then use to modify your data's annotation, as follows:
We have:
head(rownames(gset))
[1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" "1294_at"
head(annotLookup)
affy_hg_u133_plus_2 ensembl_gene_id gene_biotype external_gene_name
1 1294_at ENSG00000283726 miRNA MIR5193
2 1552347_at ENSG00000205758 protein_coding CRYZL1
3 1552620_at ENSG00000184148 protein_coding SPRR4
4 1552340_at ENSG00000170374 protein_coding SP7
5 1316_at ENSG00000126351 protein_coding THRA
6 1552516_a_at ENSG00000163349 protein_coding HIPK1
There are many ways to match across fields in objects, but I always go to match()
:
indicesLookup <- match(rownames(gset), annotLookup$affy_hg_u133_plus_2)
With the matching indices, we use these to extract the matching gene names:
head(annotLookup[indicesLookup, "external_gene_name"])
[1] "DDR1" "RFC2" "HSPA6" "PAX8" "GUCA1A" "MIR5193"
Always double check:
dftmp <- data.frame(rownames(gset), annotLookup[indicesLookup, c("affy_hg_u133_plus_2", "external_gene_name")])
head(dftmp, 20)
rownames.exprs.gset.. affy_hg_u133_plus_2 external_gene_name
117 1007_s_at 1007_s_at DDR1
363 1053_at 1053_at RFC2
262 117_at 117_at HSPA6
529 121_at 121_at PAX8
304 1255_g_at 1255_g_at GUCA1A
1 1294_at 1294_at MIR5193
5 1316_at 1316_at THRA
177 1320_at 1320_at PTPN21
300 1405_i_at 1405_i_at CCL5
269 1431_at 1431_at CYP2E1
99 1438_at 1438_at EPHB3
359 1487_at 1487_at ESRRA
473 1494_f_at 1494_f_at CYP2A6
180 1552256_a_at 1552256_a_at SCARB1
372 1552257_a_at 1552257_a_at TTLL12
515 1552258_at 1552258_at MIR4435-2HG
416 1552261_at 1552261_at WFDC2
383 1552263_at 1552263_at MAPK1
382 1552264_a_at 1552264_a_at MAPK1
255 1552266_at 1552266_at ADAM32
table(dftmp[,1] == dftmp[,2])
TRUE
41644
...seems to match (watch for the probe names matching on each row).
Now replace the rownames. Note that there will be some probes that don't match to any genes - these will be NA. For example, all Affy probes with a 'AFFX' prefix are control probes, and should be removed after normalisation.
Also, some probes will map to the same gene; so, gene names will not be unique. This will normally create an error if you try to set non-unique names as rownames to a data-frame. So, you can 'cheat' and make rownames unique for now by adding an extension:
rownames(gset) <- paste(annotLookup[indicesLookup, "external_gene_name"], c(1:length(indicesLookup)), sep="_")
head(rownames(gset),20)
[1] "DDR1_1" "RFC2_2" "HSPA6_3" "PAX8_4"
[5] "GUCA1A_5" "MIR5193_6" "THRA_7" "PTPN21_8"
[9] "CCL5_9" "CYP2E1_10" "EPHB3_11" "ESRRA_12"
[13] "CYP2A6_13" "SCARB1_14" "TTLL12_15" "MIR4435-2HG_16"
[17] "WFDC2_17" "MAPK1_18" "MAPK1_19" "ADAM32_20"
exprs(gset)[1:5,1:5]
GSM304303 GSM304304 GSM304479 GSM304480 GSM304481
DDR1_1 8222.3 6354.8 8361.6 10947.8 8385.2
RFC2_2 10033.3 8773.8 11313.5 7423.7 10522.6
HSPA6_3 221.4 323.5 459.3 67.4 458.5
PAX8_4 2832.2 3333.2 2952.5 3450.2 3194.3
GUCA1A_5 297.1 222.6 20.1 30.5 140.5
You can easily remove this extension for plots, etc, with:
gsub("_[0-9]*$", "", rownames(gset))
Note - rownames(exprs(gset))
and rownames(gset)
are interchangeable, but, for modifying these, you have to use rownames(gset)
Yes, definitely U133, however, I have multiple of them and they were not done at the same time, I'm under the impression it is an annotation version issue! I tracked down one of the probes that were not annotated, after some digging the Ensembl id given to it was available in GRCh37 (see ), how comes that the probset remains hanging without an updated id in the GRCh38? what does it really mean that a probset can found only in a former version?
My code is very much like yours:
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(mart=mart, attributes=c( "affy_hg_u133_plus_2", "ensembl_gene_id", "hgnc_symbol", "gene_biotype", "external_gene_name"), filter = "affy_hg_u133_plus_2", values = gxp.affy$Probes, uniqueRows=TRUE)
thanks again
ENSG00000261911 was a patch gene on GRCh37. That means there was a repair added to cover over an error in the reference genome GRCh37, and genes were annotated on the repaired sequence as well as on the reference. These had independent stable identifiers.
When GRCh38 was introduced, the patch was merged into the reference, replacing the original reference assembly. Where there was a gene on both the patch and the underlying assembly, the genes were merged.
Can you trace back a few steps in my code to infer a few other probes that are not matching. For example, I know the U133 design fairly well and there are many thousands of probes that relate to control probes and that target 'hypothetical' transcripts that are not even listed in Ensembl. Many of the transcripts are even genuinely un-named, apart from a seemingly randomly-assign unique identifier.
The summary of the array design on THeremoFisher's website says:
Array Profile
All probe sets represented on the GeneChip Human Genome U133 Set are identically replicated on the GeneChip Human Genome U133 Plus 2.0 Array. The sequences from which these probe sets were derived were selected from GenBank™, dbEST, and RefSeq. The sequence clusters were created from the UniGene database (Build 133, April 20, 2001) and then refined by analysis and comparison with a number of other publicly available databases, including the Washington University EST trace repository and the University of California, Santa Cruz Golden-Path human genome database (April 2001 release).
In addition, there are 9,921 probe sets representing approximately 6,500 genes. These gene sequences were selected from GenBank, dbEST, and RefSeq. Sequence clusters were created from the UniGene database (Build 159, January 25, 2003) and refined by analysis and comparison with a number of other publicly available databases, including the Washington University EST trace repository and the NCBI human genome assembly (Build 31).
[source: https://www.thermofisher.com/order/catalog/product/900466#/900466]
So, it's a grand mix of things..
If you want the 'ultimate' annotation source, then I would download the annotation file from Affymetrix:
(for example, the file HG-U133_Plus_2 Annotations, CSV format, Release 36 (36 MB, 7/12/16))
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Dear all, I have followed everything you explained here, I appreciate for such a very well explanation. However, I have a problem with Affymetrix probes. The Affymetrix Human Genome U133Plus 2.0 arrays, was used for GSE17536 dataset and I realized that some of protein coding genes are lost due to the lack of Affymetrix id. Those genes which will be used in the downstream analysis are of our interest and I don't know how to extract those genes without Affymetrix ids. Below there is an example
I would appreciate if you could advise me any strategy to reach out those genes I am interested in.
Many thanks in advance.
Please start a new thread, with your question.