As someone without any biology background, through self-studying , I am trying to understand and implement the pipeline of data analysis of raw Illumina bead array data.
Currently, I am trying to recreate the analysis and results of the following paper: Patel, H., Hodges, A. K., Curtis, C., Lee, S. H., Troakes, C., Dobson, R. J. B., & Newhouse, S. J. (2019). Transcriptomic analysis of probable asymptomatic and symptomatic alzheimer brains. Brain, behavior, and immunity, 80, 644–656. The data stored at GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118553.
The paper describes initial data processing: Raw gene expression data was exported from Illumina’s GenomeStudio (version 2011.1) into RStudio (version 0.99.467) for data processing. Using R (version 3.2.2), raw data was Maximum Likelihood Estimation (MLE) background corrected using R package “MBCB” (version 1.18.0) (Allen et al., 2010), log2 transformed, and underwent Robust Spline Normalisation (RSN) using R package “lumi” (version 2.16.0) .
My initial implementation is a follows:
library(lumi)
library(MBCB)
raw_data<-read.table('GSE118553_non-normalized_data.txt', sep='', header=TRUE) #import non-normalized set
expression_columns <- grep("(?<!Detection_Pval)$", colnames(raw_data), perl = TRUE) #remove columns with detection p-val
expression_matrix <- raw_data[, expression_columns]
df <- read.table('GPL10558_HumanHT-12_V4_0_R2_15002873_B.txt', skip = 8,nrow=92, header = TRUE,fill=TRUE, sep = "\t", stringsAsFactors = FALSE, quote = "\"", check.names = FALSE) #import data where the control samples are indicated
transposed_control_vec<-as.vector(df$Probe_Id)
control_group<-data %>% dplyr::filter(ID_REF %in% transposed_control_vec)%>% dplyr::select(-ID_REF) #select only control Illumina samples and remove ID_REF column
expression_matrix <- expression_matrix %>% dplyr::filter(!ID_REF %in% transposed_control_vec) %>% dplyr::select(-ID_REF) #select non-control samples and remove ID_REF
corrected_np<-mbcb.correct (expression_matrix,
control_group,
mleBool = TRUE)
df_np<-as.data.frame(corrected_np$NP)
df_np=log2(df_np)
rsn_tr<-as.data.frame(rsn(as.matrix(df_np)))
However, the resulting values are not identical to the set with already processed ones available in GEO.
One hesitation is that rsn()
function requires rows as sample and genes as columns, but in this case it is not clear for me how the genes are coded.
The column names of the control_group
start with GSM,
Column names of the expression_matrix
are (XXXXXXXXXXX_LETTER
, e.g. 9345344009_A
) - these are sample IDs, as given in the description, for example:
GSM3332689 Temporal_Cortex AD (9371242065_I).
ID_REF
are Illumina probe IDs in the format ILMN_XXXXXXX
.
Would be very grateful for suggestions and highlights of mistakes!