Hello,
I'm working on a mouse microarray dataset (GPL8321). I annotated it with the annotateEset function from affycoretools and then ran the limma pipeline for differential expression. However, looking at the gene names of the DEGs, I noticed that some genes appear more than once, each time with different expression values. Looking further, I also noticed that on this platform a large number of probe IDs map to the same Ensembl ID:
eset <- rma(celdata)
eset <- annotateEset(eset, mouse430a2.db, columns = c("PROBEID", "ENTREZID", "SYMBOL", "GENENAME", "ENSEMBL"))
table(duplicated(fData(eset)$ENSEMBL))
FALSE TRUE
13113 9577
My question is: is it best practice to remove the duplicated Ensembl IDs before the differential expression analysis? Wouldn't this high number of duplicates interfere with the statistical analysis and the p-value computation?
Should this instead be handled by averaging the probes that map to the same Ensembl ID? And how can I do that on a Large ExpressionSet object (eset)?
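To make the averaging idea concrete, here is a minimal sketch of what I have in mind, written in base R on a toy matrix so it runs on its own. The matrix, the probe names and the Ensembl IDs are made up for illustration; on the real data I assume `expr` would be `exprs(eset)` and `ensembl` would be `fData(eset)$ENSEMBL`. Probes without an Ensembl ID are dropped here, which is my assumption and not necessarily the recommended handling.

```r
# Toy stand-in for exprs(eset): 5 probes x 3 samples (values are made up).
expr <- matrix(
  c( 1,  2,  3,
     3,  4,  5,
    10, 10, 10,
     7,  8,  9,
     0,  0,  0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), paste0("sample", 1:3))
)

# Toy stand-in for fData(eset)$ENSEMBL (hypothetical IDs); NA = unannotated probe.
ensembl <- c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B", NA, "ENSMUSG_B")

# Drop probes that have no Ensembl ID (an assumption, see above).
keep <- !is.na(ensembl)
expr_kept <- expr[keep, , drop = FALSE]
ids <- ensembl[keep]

# Sum the probes of each gene, then divide by the number of probes per gene
# to get one mean expression row per Ensembl ID.
sums <- rowsum(expr_kept, group = ids)
n_probes <- as.vector(table(ids)[rownames(sums)])
gene_means <- sums / n_probes

print(gene_means)
```

As far as I can tell, limma's `avereps()` performs the same kind of averaging on an expression matrix given an ID vector, so that may be the more idiomatic route. What I'm unsure about is whether collapsing probes like this is statistically preferable to keeping them all and reporting results per probe.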