Affymetrix Human Genome U133 Plus 2.0 Array - probe annotation with biomaRt
1
2
Entering edit mode
6.4 years ago

Hello ,

I am using biomaRt package for getting geneid, chromosome_name ..etc

getBM(attributes = c('affy_hg_u133_plus_2','hgnc_symbol','chromosome_name','start_position','end_position','band'),
      filters = 'affy_hg_u133_plus_2',
      values = (affyids),
      mart=ensembl)

but the start_position and end_position is different when i am comparing with refGene.txt.gz from NCBI dataset, i need full information about gene, which chromosome it is residing, start end position etc, from where i get correct information of hg19 version, please help me to find the exact information.

R biomaRt • 8.9k views
ADD COMMENT
1
Entering edit mode

Actually the problem is with difference in databases, bioMart is based on Ensemble and refSeq is based on NCBI so it definitely shows some differences in co-ordinates, thank you Kevin and others for your valuable suggestions

ADD REPLY
0
Entering edit mode

You might be using a different version of the genome on Biomart. Give the full code and format it properly, please.

ADD REPLY
0
Entering edit mode
<h6>#######this is my code dataset is GSE12056##########3</h6>
source("http://www.bioconductor.org/biocLite.R")
biocLite()
biocLite('affy')
library('limma')
library(affy)
require(GEOquery)
library(Biobase)
library("biomaRt")
library("readxl")
library("WriteXLS")
ensembl=useMart("ensembl")
ensembl=useDataset("hsapiens_gene_ensembl",mart=ensembl) 
MyBioinfData<-ReadAffy()
eset.rma<-rma(MyBioinfData)
hist(MyBioinfData)
hist(exprs(eset.rma))
boxplot(exprs(eset.rma))
plotMA(exprs(eset.rma))
rma_data=exprs(eset.rma)
affyids=rma_data[1]
affyids=c(affyids)
ensembl=useMart("ensembl")
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
x=getBM(attributes = c('affy_hg_u133_plus_2','hgnc_symbol','chromosome_name','start_position','end_position','band'),filters = 'affy_hg_u133_plus_2',values = (affyids),mart=ensembl)
ADD REPLY
0
Entering edit mode

One alternate suggestion is to use hgu133plus2.db. It has location information for each probe @OP

ADD REPLY
0
Entering edit mode

Hello, I am using Kevin's code for printing the gene, chromosome, etc. but I am having troubles making the script work. I am getting this error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘exprs’ for signature ‘"factor"’
  

It appears after running this command

annotLookup <- biomaRt::getBM(mart=mart, attributes=c("affy_hg_u133_plus_2", "ensembl_gene_id", "gene_biotype", "external_gene_name","chromosome_name","start_position", "end_position","strand"), filter="affy_hg_u133_plus_2", values=rownames(exprs(gset))[1:50], uniqueRows=TRUE)

Does anyone knows how to fix it?

Thanks in advance :))

ADD REPLY
0
Entering edit mode

Actually, I solved my own problem ^_^

ADD REPLY
0
Entering edit mode

If it was solved, please share what was the problem.

ADD REPLY
9
Entering edit mode
6.4 years ago

NB - it is critical to realise that biomaRt will not return your data in the same order as that in which it was submitted. You will have to manually match the order of the returned data to your input data.

-----------------------------

Hello,

Here is a reproducible example using the dataset that you're using:

First download the dataset from GEO:

require(GEOquery)
require(Biobase)
gset <- getGEO("GSE12056", GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

We now have the ExpressionSet object:

dim(exprs(gset))
[1] 54674    20

rownames(exprs(gset))[1:50]
 [1] "1007_s_at"    "1053_at"      "117_at"       "121_at"       "1255_g_at"   
 [6] "1294_at"      "1316_at"      "1320_at"      "1405_i_at"    "1431_at"     
[11] "1438_at"      "1487_at"      "1494_f_at"    "1552256_a_at" "1552257_a_at"
[16] "1552258_at"   "1552261_at"   "1552263_at"   "1552264_a_at" "1552266_at"  
[21] "1552269_at"   "1552271_at"   "1552272_a_at" "1552274_at"   "1552275_s_at"
[26] "1552276_a_at" "1552277_a_at" "1552278_a_at" "1552279_a_at" "1552280_at"  
[31] "1552281_at"   "1552283_s_at" "1552286_at"   "1552287_s_at" "1552288_at"  
[36] "1552289_a_at" "1552291_at"   "1552293_at"   "1552295_a_at" "1552296_at"  
[41] "1552299_at"   "1552301_a_at" "1552302_at"   "1552303_a_at" "1552304_at"  
[46] "1552306_at"   "1552307_a_at" "1552309_a_at" "1552310_at"   "1552311_a_at"

Now annotate these first 50 and create a 'lookup' table of annotation that can be used to rename your Affy IDs to gene names (takes a long time to look up all IDs):

require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    "affy_hg_u133_plus_2",
    "ensembl_gene_id",
    "gene_biotype",
    "external_gene_name"),
  filter = "affy_hg_u133_plus_2",
  values = rownames(exprs(gset))[1:50],
  uniqueRows=TRUE)

head(annotLookup, 20)
affy_hg_u133_plus_2 ensembl_gene_id gene_biotype              external_gene_name
1294_at             ENSG00000283726 miRNA                     MIR5193
1316_at             ENSG00000126351 protein_coding            THRA
1552310_at          ENSG00000169609 protein_coding            C15orf40
1552286_at          ENSG00000250565 protein_coding            ATP6V1E2
1552291_at          ENSG00000163964 protein_coding            PIGX
1294_at             ENSG00000182179 protein_coding            UBA7
1552296_at          ENSG00000142959 protein_coding            BEST4
1438_at             ENSG00000182580 protein_coding            EPHB3
1552287_s_at        ENSG00000223959 transcribed_pseudogene    AFG3L1P
1007_s_at           ENSG00000234078 protein_coding            DDR1
1552280_at          ENSG00000145850 protein_coding            TIMD4
1552304_at          ENSG00000139133 protein_coding            ALG10
1552306_at          ENSG00000139133 protein_coding            ALG10
1320_at             ENSG00000070778 protein_coding            PTPN21
1552256_a_at        ENSG00000073060 protein_coding            SCARB1
1007_s_at           ENSG00000215522 protein_coding            DDR1
1552303_a_at        ENSG00000184988 protein_coding            TMEM106A
1552302_at          ENSG00000184988 protein_coding            TMEM106A
1552309_a_at        ENSG00000162614 protein_coding            NEXN
1552274_at          ENSG00000168297 protein_coding            PXK

If you want the RefSeq 'NM' and 'NR' identifiers, then add "refseq_mrna" and "refseq_ncrna" to attributes

Kevin

ADD COMMENT
1
Entering edit mode

Thanks for this, saved me a lot of time :)

ADD REPLY
0
Entering edit mode

i am stuck with the coordinates actually, i.e. start end position of a gene , which is not match with the hg19 refseq data set details. most of them is differed by 10 to 20 nucleotide difference, each accession gives this difference in no, from that how can i choose one, even if i am taking bioMart output as my input still the bioMart output gives some duplicate infn, and some line are blank, and chromosome no section gives some PATCH ID

1552538_a_at    KIF6             6     39297766 39693181    p21.2
1494_f_at   CYP2A7  19  41381344    41388657    q13.2
1552876_at  LINC00334   21  46654255    46678645    q22.3
1552877_s_at    LINC00334   21  46654255    46678645    q22.3
1552846_s_at    RAB42P1 14  90358132    90358788    q32.11
1552703_s_at    CASP1   11  104896170   104972158   q22.3
1007_s_at   DDR1    HSCHR6_MHC_DBB  30839458    30858654    p21.33
117_at  HSPA7   1   161576081   161578007   q23.3
1552813_at  KLF14   HG1308_PATCH    130394889   130397715   q32.3
1552814_a_at    KLF14   HG1308_PATCH    130394889   130397715   q32.3
1552478_a_at        1   209960324   209961038   q32.2
1553567_s_at    MT-ATP8 MT  8366    8572
1553567_s_at    MT-ATP6 MT  8527    9207
ADD REPLY
1
Entering edit mode

Thanks. THis works for GRCh37 / hg19 co-ordinates:

require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL", host="http://grch37.ensembl.org")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(mart=mart, attributes=c("affy_hg_u133_plus_2", "ensembl_gene_id", "gene_biotype", "external_gene_name","chromosome_name","start_position", "end_position","strand"), filter="affy_hg_u133_plus_2", values=rownames(exprs(gset))[1:50], uniqueRows=TRUE)
ADD REPLY
0
Entering edit mode

This was very useful to me too, but I have a question about the codes:

gset <- getGEO("GSE12056", GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1

I didn't understand why is added a second dataset "GPL570" ?

Thank you

ADD REPLY
1
Entering edit mode

Hey Morris, GPL570 is actually a reference to the array type - the GEO has both codes for array types and also individual studies.

The two lines that you have pasted are purely to extract the expression values properly from the downloaded dataset.

ADD REPLY

Login before adding your answer.

Traffic: 2629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6