NB: it is critical to realise that biomaRt will not return your data in the same order in which it was submitted. You will have to manually match the order of the returned data to your input data.
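As a minimal sketch of that re-ordering step, the snippet below uses base R's match() on a small mock data frame that stands in for a getBM() result. The object names inputIDs and ordered are made up for illustration, and the mock rows are only there to show the idea:

```r
# Hypothetical input IDs, in the order they were submitted
inputIDs <- c("1007_s_at", "1053_at", "1294_at", "1316_at")

# Mock of a getBM() result: rows come back in a different order,
# "1053_at" has no hit, and "1294_at" maps to two genes
annotLookup <- data.frame(
  affy_hg_u133_plus_2 = c("1294_at", "1316_at", "1294_at", "1007_s_at"),
  external_gene_name  = c("MIR5193", "THRA", "UBA7", "DDR1"),
  stringsAsFactors = FALSE)

# match() gives, for each input ID, the row of its first hit (NA if none)
idx <- match(inputIDs, annotLookup$affy_hg_u133_plus_2)

# Build a table that follows the input order exactly
ordered <- data.frame(
  probe = inputIDs,
  gene  = annotLookup$external_gene_name[idx],
  stringsAsFactors = FALSE)
```

Be aware that match() keeps only the first hit per probe, so probes that map to several genes lose their extra rows, and probes with no hit end up as NA.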
-----------------------------
Hello,
Here is a reproducible example using the dataset that you're using:
First download the dataset from GEO:
require(GEOquery)
require(Biobase)
gset <- getGEO("GSE12056", GSEMatrix = TRUE, getGPL = FALSE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
We now have the ExpressionSet object:
dim(exprs(gset))
[1] 54674 20
rownames(exprs(gset))[1:50]
[1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at"
[6] "1294_at" "1316_at" "1320_at" "1405_i_at" "1431_at"
[11] "1438_at" "1487_at" "1494_f_at" "1552256_a_at" "1552257_a_at"
[16] "1552258_at" "1552261_at" "1552263_at" "1552264_a_at" "1552266_at"
[21] "1552269_at" "1552271_at" "1552272_a_at" "1552274_at" "1552275_s_at"
[26] "1552276_a_at" "1552277_a_at" "1552278_a_at" "1552279_a_at" "1552280_at"
[31] "1552281_at" "1552283_s_at" "1552286_at" "1552287_s_at" "1552288_at"
[36] "1552289_a_at" "1552291_at" "1552293_at" "1552295_a_at" "1552296_at"
[41] "1552299_at" "1552301_a_at" "1552302_at" "1552303_a_at" "1552304_at"
[46] "1552306_at" "1552307_a_at" "1552309_a_at" "1552310_at" "1552311_a_at"
Now annotate these first 50 and create a 'lookup' table of annotation that can be used to rename your Affy IDs to gene names (takes a long time to look up all IDs):
require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    "affy_hg_u133_plus_2",
    "ensembl_gene_id",
    "gene_biotype",
    "external_gene_name"),
  filters = "affy_hg_u133_plus_2",
  values = rownames(exprs(gset))[1:50],
  uniqueRows = TRUE)
head(annotLookup, 20)
affy_hg_u133_plus_2  ensembl_gene_id  gene_biotype            external_gene_name
1294_at              ENSG00000283726  miRNA                   MIR5193
1316_at              ENSG00000126351  protein_coding          THRA
1552310_at           ENSG00000169609  protein_coding          C15orf40
1552286_at           ENSG00000250565  protein_coding          ATP6V1E2
1552291_at           ENSG00000163964  protein_coding          PIGX
1294_at              ENSG00000182179  protein_coding          UBA7
1552296_at           ENSG00000142959  protein_coding          BEST4
1438_at              ENSG00000182580  protein_coding          EPHB3
1552287_s_at         ENSG00000223959  transcribed_pseudogene  AFG3L1P
1007_s_at            ENSG00000234078  protein_coding          DDR1
1552280_at           ENSG00000145850  protein_coding          TIMD4
1552304_at           ENSG00000139133  protein_coding          ALG10
1552306_at           ENSG00000139133  protein_coding          ALG10
1320_at              ENSG00000070778  protein_coding          PTPN21
1552256_a_at         ENSG00000073060  protein_coding          SCARB1
1007_s_at            ENSG00000215522  protein_coding          DDR1
1552303_a_at         ENSG00000184988  protein_coding          TMEM106A
1552302_at           ENSG00000184988  protein_coding          TMEM106A
1552309_a_at         ENSG00000162614  protein_coding          NEXN
1552274_at           ENSG00000168297  protein_coding          PXK
If you want the RefSeq 'NM' and 'NR' identifiers, then add "refseq_mrna" and "refseq_ncrna" to attributes.
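Note that the lookup table can hold more than one row per probe (in the output above, 1294_at maps to both MIR5193 and UBA7). If you need exactly one row per Affy ID before renaming, one possible approach is to collapse the gene names with aggregate(). This is only a sketch on a few rows copied from the output above; the name collapsed and the ";" separator are my own choices:

```r
# A few rows copied from the annotLookup output above
annotLookup <- data.frame(
  affy_hg_u133_plus_2 = c("1294_at", "1316_at", "1294_at",
                          "1552304_at", "1552306_at"),
  external_gene_name  = c("MIR5193", "THRA", "UBA7", "ALG10", "ALG10"),
  stringsAsFactors = FALSE)

# One row per probe; distinct gene names for a probe are joined with ";"
collapsed <- aggregate(
  external_gene_name ~ affy_hg_u133_plus_2,
  data = annotLookup,
  FUN = function(x) paste(unique(x), collapse = ";"))
```

Whether to join the names, keep only the first one, or drop multi-mapping probes altogether depends on your downstream analysis.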
Kevin
Actually, the problem comes from a difference between databases: BioMart is based on Ensembl, whereas RefSeq is based on NCBI, so some differences in coordinates are to be expected. Thank you, Kevin and others, for your valuable suggestions.
You might be using a different version of the genome on Biomart. Give the full code and format it properly, please.
One alternate suggestion is to use hgu133plus2.db. It has location information for each probe @OP
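A rough sketch of what that could look like, assuming the Bioconductor package hgu133plus2.db is installed (the code skips the lookup otherwise). The three probe IDs are taken from the example further up, the SYMBOL and ENSEMBL columns are just my pick, and columns() will show which fields your installed version actually offers:

```r
# Sketch only: runs the lookup if hgu133plus2.db is installed, otherwise skips it
probes <- c("1007_s_at", "1053_at", "117_at")
res <- NULL

if (requireNamespace("hgu133plus2.db", quietly = TRUE)) {
  db <- hgu133plus2.db::hgu133plus2.db

  # List every field that this annotation package can return
  print(AnnotationDbi::columns(db))

  # Map probe IDs to gene symbols and Ensembl gene IDs
  res <- AnnotationDbi::select(
    db,
    keys    = probes,
    columns = c("SYMBOL", "ENSEMBL"),
    keytype = "PROBEID")
}
```

As with biomaRt, a probe can map to several genes, so the result may have more rows than the input.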
Hello, I am using Kevin's code for printing the gene, chromosome, etc., but I am having trouble making the script work. I am getting this error:
It appears after running this command
Does anyone know how to fix it?
Thanks in advance :))
Actually, I solved my own problem ^_^
If it was solved, please share what the problem was.