Dear all,
Not being a bioinformatician, I have minimal knowledge of using R to analyse microarray datasets. I was able to annotate the series matrix from GEO (GSE59045) using GEOquery, but I noticed that there are multiple expression values for the same gene. Can somebody help me with an R script to collapse the information so that I get only one expression value for each gene? From other posts I learned that the best option is to select the highest value (maximum) for each gene. I will lose information regarding isoforms, but that is not necessary for me at the moment.
Example
ProbeID        GSM1424930   GSM1424931   GSM1424932   GSM1424933   Gene Symbol  Ensembl
11715100_at    3.681680528  3.615040247  3.681680528  3.725140832  HIST1H3G     ENSG00000273983
11715101_s_at  5.804431414  5.982370634  5.982370634  6.219341531  HIST1H3G     ENSG00000273983
11715102_x_at  3.779383579  3.760277943  3.608772565  3.816631661  HIST1H3G     ENSG00000273983
11715103_x_at  7.194430009  8.058842933  7.606162382  7.5365415    TNFAIP8L1    ENSG00000185361
1715104_s_at   5.286305369  5.338718503  5.499863475  5.616707797  OTOP2        ENSG00000183034
Thanks.
Also, you might want to ensure there are no empty-string or NA values in the Ensembl column before collapsing.
Many thanks for your answer. No problem, I will filter the matrix for NA values in the Ensembl column. Could you help me with an example of the procedure, or suggest a tutorial to follow in R?
The featureData entry of an ExpressionSet is a (type of) data.frame, so you can find NAs and empty strings as you would in a normal data.frame. I've added some code that does this to the function above.
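For anyone who wants to try the two steps in isolation, here is a minimal base-R sketch on the toy rows from the question. It is only an illustration under a few assumptions: the object names (`expr`, `annot`, `collapsed`) and the column names `symbol` and `ensembl` are made up for this example, and "maximum" is interpreted as "keep the probe with the highest mean expression across samples", which is one common reading and not necessarily the one used elsewhere in this thread. With a real ExpressionSet `eset`, the matrix would come from `Biobase::exprs(eset)` and the annotation from `Biobase::fData(eset)`, whose column names will differ.

```r
# Toy expression matrix copied from the example in the question.
expr <- matrix(
  c(3.681680528, 3.615040247, 3.681680528, 3.725140832,
    5.804431414, 5.982370634, 5.982370634, 6.219341531,
    3.779383579, 3.760277943, 3.608772565, 3.816631661,
    7.194430009, 8.058842933, 7.606162382, 7.5365415,
    5.286305369, 5.338718503, 5.499863475, 5.616707797),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("11715100_at", "11715101_s_at", "11715102_x_at",
      "11715103_x_at", "1715104_s_at"),
    c("GSM1424930", "GSM1424931", "GSM1424932", "GSM1424933")))

# Probe annotation as a plain data.frame (hypothetical column names).
annot <- data.frame(
  symbol  = c("HIST1H3G", "HIST1H3G", "HIST1H3G", "TNFAIP8L1", "OTOP2"),
  ensembl = c("ENSG00000273983", "ENSG00000273983", "ENSG00000273983",
              "ENSG00000185361", "ENSG00000183034"),
  row.names = rownames(expr), stringsAsFactors = FALSE)

# Step 1: drop probes whose Ensembl ID is NA or an empty string.
keep  <- !is.na(annot$ensembl) & annot$ensembl != ""
expr  <- expr[keep, , drop = FALSE]
annot <- annot[keep, , drop = FALSE]

# Step 2: for each gene, pick the probe with the highest mean expression.
probe_mean <- rowMeans(expr)
best_probe <- tapply(names(probe_mean), annot$ensembl,
                     function(p) p[which.max(probe_mean[p])])

# One row per Ensembl gene, labelled by gene instead of by probe.
collapsed <- expr[as.vector(best_probe), , drop = FALSE]
rownames(collapsed) <- names(best_probe)
```

Picking a single probe per gene keeps a real measured profile for each gene. If you would rather take the highest value per sample across all probes of a gene, the same grouping by `annot$ensembl` applies, but the result may mix values from different probes within one row.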