remove multiple gene symbols in the affymetrix platform
1
0
Entering edit mode
4.5 years ago
j_jamal96 ▴ 20

Dear all,

What function should be used in the affymetrix platform to remove duplicate genes (values related to the probsets of that gene)? Mean, medium, IQR, .... ???

Annotation must be performed before analysis? Or can the annotation be done after the analysis ?( in this situation are the duplicates removed?)

Thank you

microarray affymetrix gene symbols annotation • 2.4k views
ADD COMMENT
1
Entering edit mode
4.5 years ago

Hey j_jamal96,

Annotation must be performed before analysis? Or can the annotation be done after the analysis ?( in this situation are the duplicates removed?)

Usually after, but there is no fixed rule.

What function should be used in the affymetrix platform to remove duplicate genes (values related to the probsets of that gene)? Mean, medium, IQR, .... ???

It will help if you confirm the array version that you are using. Some level of summarisation can be performed during normalisation - see my answer, here: C: Human Exon array probeset to gene-level expression

Another option is to use limma::avereps() to summarise expression over probes / genes. If you use this, use it on the normalised and log [base 2] transformed data

Kevin

ADD COMMENT
0
Entering edit mode

I cannot thank you enough for helping me.

The data platform is HG-U133_Plus_2 And I used the following code:

data.raw <- ReadAffy( filenames = list_cel, compress = F)     # Load raw data
ndata.raw <- rma(data.raw)

library(annotate)
library(hgu133plus2.db)
ID <- featureNames(ndata.raw))


gene.symbols <- getSYMBOL(ID, "hgu133plus2")
fData(ndata.raw) <- data.frame(Symbol=gene.symbols)
ndata.raw <- ndata.raw[HasSymbol, ]

eset <- Biobase::exprs(ndata.raw) ; dim(eset)

eset <- as.data.frame(eset)

eset$ID <- row.names(eset)

gname <- as.data.frame(gene.symbols)

gname$ID <- rownames(gname)

mydata <- merge(eset, gname, by = "ID")
A <- limma::avereps(mydata[  ,2:ncol(mydata)], mydata$gene.symbols)
ADD REPLY
0
Entering edit mode

Hi, that seems fine. You probably don't need to specify Biobase::, but no harm either. Also, there are lots of different ways to map between probe IDs and gene symbols, as I elaborate at the end of my answer here: https://support.bioconductor.org/p/130727/#130733

I don't see anything of concern with the way that you have done it, though. Next will be to proceed to differential expression.

ADD REPLY
0
Entering edit mode

thank you very much for your help Kevin

ADD REPLY
0
Entering edit mode

Hi Kevin,

How can I use limma:avereps for following codes?

#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)

# load series and platform data from GEO

gset <- getGEO("GSE60371", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL13264", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
gsms <- paste0("XXXX1111111111111111111110000000000000000000000000",
    "0000000000000000000000000000000")
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }

# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
      (qx[6]-qx[1] > 50 && qx[2] > 0) ||
      (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }

# set up the data and proceed with analysis
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","miRNA_ID","SPOT_ID"))
write.table(tT, file=stdout(), row.names=F, sep="\t")


################################################################
#   Boxplot for selected GEO samples
library(Biobase)
library(GEOquery)

# load series and platform data from GEO

gset <- getGEO("GSE60371", GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL13264", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# group names for all samples in a series
gsms <- paste0("XXXX1111111111111111111110000000000000000000000000",
    "0000000000000000000000000000000")
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }
sml <- paste("G", sml, sep="")  # set group names

# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# order samples by group
ex <- exprs(gset)[ , order(sml)]
sml <- sml[order(sml)]
fl <- as.factor(sml)
labels <- c("PROSTATE_CANCER","CONTROL")

# set parameters and draw the plot
palette(c("#dfeaf4","#f4dfdf", "#AABBCC"))
dev.new(width=4+dim(gset)[[2]]/5, height=6)
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE60371", '/', annotation(gset), " selected samples", sep ='')
boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl)
legend("topleft", labels, fill=palette(), bty="n")
ADD REPLY

Login before adding your answer.

Traffic: 2604 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6