Hey, no, those programs / packages will not recognise this data because the authors decided to put all feature data and expression data in the same file. You would have to read it into R manually, and then do your own final filtering.
I have done it for you this time so that you can learn the general process, bearing in mind that this array (Illumina) can prove very frustrating. Note that the data in the 'processed' file is already normalised and that, below, we are merely preparing the data for our own downstream analyses.
# read in the data
mat <- read.csv('mRNA_microarray_processed.txt', sep = '\t',
  header = TRUE, row.names = 1, stringsAsFactors = FALSE)
# remove QC probes
table(mat$QC)
mat <- mat[mat$QC == '',]
mat <- mat[,-1]
# set rownames:
# format will be: 'IlluminaProbeID_GeneSymbol'
rownames(mat) <- paste0(mat$Probe_Id, '_', mat$Symbol)
# extract out detection p-value data
pvalues <- mat[,grep('p$', colnames(mat))]
# remove detection p-values, standard deviation, and nbeads from main data
mat <- mat[,-grep('p$|sd$|nbeads$', colnames(mat))]
# only keep probes with detection p-value < 0.05 in at least 3 of the 6 samples
keep <- (rowSums(pvalues < 0.05) >= 3)
table(keep)
# keep
# FALSE  TRUE
# 25653 21569
mat <- mat[keep,]
pvalues <- pvalues[keep,]
# divide up the feature data and the expression data
# log [base2] transform the expression data
featuredata <- mat[,7:ncol(mat)]
mat <- data.matrix(log2(mat[,1:6]))
# verify that data is normalised
par(mfrow = c(1,2))
hist(mat)
boxplot(mat)
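As for what goes into lmFit(): it is the expression matrix (probes in rows, samples in columns), i.e., the final 'mat' above. Here is a minimal sketch on simulated data, assuming limma is installed. The 'toy' matrix and the control / treated grouping are made up for illustration; take the real grouping from the sample annotation of your study.

```r
library(limma)

set.seed(1)

# Toy stand-in for 'mat': 100 probes x 6 samples of log2 expression (simulated)
toy <- matrix(rnorm(600, mean = 8, sd = 1), nrow = 100,
  dimnames = list(paste0('probe', 1:100), paste0('S', 1:6)))

# ASSUMPTION: hypothetical grouping, 3 controls and 3 treated samples
group <- factor(c('control', 'control', 'control',
  'treated', 'treated', 'treated'))

# design matrix with one column per group
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# fit the linear model, then test treated versus control
fit <- lmFit(toy, design)
cont <- makeContrasts(treated - control, levels = design)
fit2 <- eBayes(contrasts.fit(fit, cont))

# full results table, one row per probe
tt <- topTable(fit2, number = Inf)
head(tt)
```

With the real data, you would pass 'mat' in place of 'toy', and the column order of 'mat' has to match the order of 'group'.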
Kevin
Yes, background correction is required; however, I need to point out that processing the Illumina HT expression array data from the public domain can be fraught with problems. I checked the data available for your study and it contains:
- raw signal intensities for all samples combined into the same file
- processed signal intensities for all samples combined into the same file
Information about processing Illumina arrays is contained in the limma manual. However, public domain data is rarely in a format that makes following the manual an easy task. For example, look at this use-case:
If you do not feel comfortable processing the data from the raw stage, then you could just use the processed data. You will have other questions, though.
Thanks a lot, Kevin ...
What is the difference between raw and processed data? In the processed data, has the data been normalized and background-corrected?
According to the information in the ArrayExpress records, the 'processed' data is already normalised by quantile normalisation, and the authors appear to have eliminated probes with high detection p-values (i.e., low quality probes), which serves as a pseudo background correction, I suppose. I see no information about background correction, specifically, which would normally be performed as part of the neqc method. The authors also removed outlier samples via the Median Absolute Deviation (MAD) method.
So, you could make a start with the processed data. The detection p-value columns will likely still be in that data; so, please remove them and then check the data distribution via boxplot() and hist().
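To make the term 'quantile normalisation' a bit more concrete, here is a rough base R sketch on a made-up matrix. It is not necessarily what the authors ran: the function name quantile_normalise() is my own, and ties are broken by order of appearance, whereas dedicated implementations (e.g., in limma) treat ties more carefully.

```r
# Toy intensity matrix: 5 probes x 3 samples (made-up values)
x <- matrix(c(5, 2, 3, 4, 1,
              4, 1, 4, 2, 8,
              3, 4, 6, 8, 9),
  nrow = 5,
  dimnames = list(paste0('probe', 1:5), paste0('S', 1:3)))

# hypothetical helper: force every sample onto the same distribution
quantile_normalise <- function(x) {
  # rank of each value within its own sample (ties broken by position)
  ranks <- apply(x, 2, rank, ties.method = 'first')
  # target distribution: mean of the sorted values across samples
  target <- rowMeans(apply(x, 2, sort))
  # replace each value by the target value at its rank
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

qn <- quantile_normalise(x)

# after normalisation, every sample has the same sorted values
apply(qn, 2, sort)
```

This is why the boxplots of quantile-normalised data line up almost perfectly across samples, which is what you would be looking for when checking the processed file.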
Kevin, thank you for your suggestions. I tried to import the processed data into R with a few different functions, but unfortunately each one returned an error:
Error in gregexpr("\t", dataLine1)[[1]] : subscript out of bounds
Error in strsplit(info[nMetaDataLines + 2], sep)[[1]] : subscript out of bounds
Reading file mRNA_microarray_processed.txt ... ...
Error in readGenericHeader(fname, columns = expr, sep = sep) : Specified column headings not found in file
Unpacking data files
ArrayExpress: Reading pheno data from SDRF
Error in which(sapply(seq_len(nrow(pData(ph))), function(i) all(pData(ph)[i, : argument to 'which' is not logical
In addition: Warning message:
In readPhenoData(sdrf, path) : ArrayExpress: Cannot find 'Array Data File' column in SDRF. Object might not be created correctly.
Unpacking data files
But no data is stored in it.
Would you please help me find out how I can import the data? And which part of this data should be passed to the lmFit function? I ask because I did not see any examples of these commands.
I have already analyzed this dataset starting from the raw data; if you need additional explanation or the code I used, I can send it to you via email.
Thank you in advance