Hello,
I am working with the GSE140829 dataset, which provides microarray expression data in two .txt formats:
Raw data
Normalized data
Number of samples: 587
I am trying to learn how to do my own QC and pre-processing to get from the raw data to the normalized data, and to see whether I get similar results.
So far I have the metadata in one variable (info.mod) and the expression data in another (data.mod).
The expression data has 2348 sample columns in total (2348/4 = 587), which, if I understood correctly, are 4 channels for each sample of the microarray. How do I proceed from here?
I have also noticed that some gene symbols (SYMBOL) appear two or three times (maybe replicates). What do I have to do to get the expression data for each sample with one row per SYMBOL?
Here is what I have written in R:
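#Metadata
library(GEOquery)
library(Biobase)
library(dplyr)
# With GSEMatrix = TRUE, getGEO() returns a list of ExpressionSets; take the first one.
GSE140829 <- getGEO('GSE140829', GSEMatrix = TRUE)[[1]]
varLabels(GSE140829)
info <- pData(GSE140829)
info <- info[,c(1,2,41:42)]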
# Reduce the title to the bare sample ID so that it matches the IDs in the expression data.
info.mod <-
  info %>%
  mutate(title = gsub("Whole blood,", " ", title)) %>%
  mutate(title = gsub(".*,", " ", title)) %>%
  mutate(title = gsub("\\[|\\]", " ", title)) %>%
  mutate(title = gsub("ad_mci", " ", title)) %>%
  mutate(title = trimws(title)) %>%  # remove the spaces left behind by the substitutions
  as.data.frame()
#Expression data
data <- read.delim('~/GSE140829_raw_data.txt', header = TRUE)
data[1:10,1:50]
# Keep annotation column 2 plus the 2348 per-sample columns (14 to 2361).
data.mod <- data[,c(2,14:2361)]
And here I am stuck.
My goal is to rename the samples so they use the same IDs as info.mod, but first I need to know what to do with the duplicate probes and the 4 columns per sample.
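To make the question concrete, here is a minimal sketch, on toy data, of how I imagine separating the 4 columns per sample. The column suffixes (AVG_Signal, BEAD_STDERR, Avg_NBEADS, Detection.Pval) and the sample names S1 and S2 are assumptions for illustration only; I have not confirmed that the GSE140829 file uses them.

```r
# Toy stand-in for the raw file: one annotation column plus 4 columns per sample.
# All names and values here are hypothetical; check names(data) for the real ones.
toy <- data.frame(
  SYMBOL = c("A1BG", "A1BG", "TP53"),
  S1.AVG_Signal = c(100, 120, 300),
  S1.BEAD_STDERR = c(5, 6, 9),
  S1.Avg_NBEADS = c(20, 22, 25),
  S1.Detection.Pval = c(0.001, 0.2, 0),
  S2.AVG_Signal = c(110, 90, 280),
  S2.BEAD_STDERR = c(4, 7, 8),
  S2.Avg_NBEADS = c(21, 19, 30),
  S2.Detection.Pval = c(0.002, 0.3, 0),
  check.names = FALSE
)

# Keep only the intensity column of each sample.
signal_cols <- grep("AVG_Signal$", names(toy), value = TRUE)
expr <- as.matrix(toy[, signal_cols])
# Strip the suffix so the column names are plain sample IDs.
colnames(expr) <- sub("\\.AVG_Signal$", "", signal_cols)

# Detection p-values in the same sample order, e.g. for filtering probes later.
pval_cols <- grep("Detection\\.Pval$", names(toy), value = TRUE)
detp <- as.matrix(toy[, pval_cols])
colnames(detp) <- sub("\\.Detection\\.Pval$", "", pval_cols)
```

If the real file follows a pattern like this, expr would have one column per sample (587 here), and its column names could then be matched against info.mod$title.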
The final objective is to run a WGCNA analysis.
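Since WGCNA expects samples in rows and genes in columns, with one column per gene, I assume the duplicated symbols have to be collapsed before that step. Below is a rough sketch of one possible approach (averaging all probes that share a symbol) in base R. The toy values are hypothetical, the matrix is assumed to be already normalized, and averaging is only one option; I do not know whether it is the recommended one.

```r
# Toy expression matrix: rows = probes, columns = samples (hypothetical values,
# assumed to be already normalized).
expr <- matrix(
  c(100, 120, 300,   # sample S1
    110,  90, 280),  # sample S2
  nrow = 3,
  dimnames = list(NULL, c("S1", "S2"))
)
symbol <- c("A1BG", "A1BG", "TP53")  # gene symbol of each probe (row)

# Sum the probes that share a symbol, then divide by the probe count
# to obtain the per-symbol mean for every sample.
sums <- rowsum(expr, group = symbol)
n_probes <- as.vector(table(symbol)[rownames(sums)])
expr_gene <- sums / n_probes

# Transpose to the layout WGCNA expects: samples in rows, genes in columns.
datExpr <- t(expr_gene)
```

With this layout, rownames(datExpr) are the sample IDs, so they can be compared directly with the IDs in info.mod.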
Thank you for your help,