Analysis of Illumina HumanHT-12_V4
4.6 years ago
j_jamal96 ▴ 20

Hi, Dears… I want to analyze the data related to the Illumina HumanHT-12_V4 platform. https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4901/

I am wondering whether it is necessary to background-correct this data. Is background correction an essential step in microarray analysis? In some articles, it has not been done…

Thanks in advance


Yes, background correction is required; however, I need to point out that processing the Illumina HT expression array data from the public domain can be fraught with problems. I checked the data available for your study and it contains:

  1. a single table of raw signal intensities for all samples combined into the same file.
  2. a single table of processed signal intensities for all samples combined into the same file.

Information about processing Illumina arrays is contained in the limma manual. However, public-domain data is rarely in a format that makes following the manual an easy task. For example, look at this use-case:

If you do not feel comfortable processing the data from the raw stage, then you could just use the processed data. You will have other questions, though.


Thanks a lot, Kevin ...

What is the difference between raw and processed data? In the processed data, has the data already been normalized and background-corrected?


According to the information in the ArrayExpress records, the 'processed' data is already normalised by quantile normalisation, and the authors appear to have eliminated probes with high detection p-values (i.e., low quality probes), which serves as a pseudo background correction, I suppose. I see no information about background correction, specifically, which would normally be performed as part of the neqc method. The authors also removed outlier samples via the Median Absolute Deviation (MAD) method.

So, you could make a start with the processed data... the detection p-value columns will likely still be in that data; so, please remove them and then check the data distribution via boxplot() and hist()
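For readers unfamiliar with the quantile normalisation mentioned above, here is a minimal base-R sketch on a made-up toy matrix. The function name `quantile_normalise` and all values are illustrative assumptions, not the authors' actual code; in practice one would more likely use something like `limma::normalizeBetweenArrays(x, method = "quantile")`.

```r
# Toy intensity matrix: 5 probes (rows) x 3 samples (columns); values are made up
x <- matrix(c(5, 2, 3, 4, 1,
              4, 1, 4.5, 2, 3,
              3, 4, 6, 8, 7),
            nrow = 5,
            dimnames = list(paste0("probe", 1:5), paste0("sample", 1:3)))

# Hypothetical helper: a bare-bones quantile normalisation (assumes no ties)
quantile_normalise <- function(m) {
  # rank of each value within its own sample
  ranks <- apply(m, 2, rank, ties.method = "first")
  # reference distribution: mean of the sorted values across samples
  reference <- unname(rowMeans(apply(m, 2, sort)))
  # replace each value with the reference value at its rank
  normalised <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalised) <- dimnames(m)
  normalised
}

xn <- quantile_normalise(x)

# after normalisation, every sample has the same sorted values
apply(xn, 2, sort)
```

This is why boxplots of quantile-normalised data look identical across samples: each sample is forced onto the same distribution, while the within-sample ordering of probes is preserved.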


Kevin, thank you for your suggestions. I tried to import the processed data into R with a few different functions, but unfortunately each one gave an error.

> file <- 'mRNA_microarray_processed.txt'
> file.lumi <- lumiR(file)

Error in gregexpr("\t", dataLine1)[[1]] : subscript out of bounds

> file.lumi <- lumiR("mRNA_microarray_processed.txt", sep="\t")

Error in strsplit(info[nMetaDataLines + 2], sep)[[1]] : subscript out of bounds

> x <- read.ilmn("mRNA_microarray_processed.txt")

Reading file mRNA_microarray_processed.txt ... ...
Error in readGenericHeader(fname, columns = expr, sep = sep) : Specified column headings not found in file

rawset <- ArrayExpress("E-MTAB-4901")

Unpacking data files
ArrayExpress: Reading pheno data from SDRF
Error in which(sapply(seq_len(nrow(pData(ph))), function(i) all(pData(ph)[i, : argument to 'which' is not logical
In addition: Warning message:
In readPhenoData(sdrf, path) : ArrayExpress: Cannot find 'Array Data File' column in SDRF. Object might not be created correctly.

Data <- getAE("E-MTAB-4901", type = "full")
Data1 <- getAE("E-MTAB-4901", type = "processed")

Unpacking data files

However, no data is stored in the resulting objects.

Would you please help me work out how to import the data? Also, which part of this data should be passed as the object to the lmFit function? I have not seen any examples for this kind of file.

fit <- lmFit(Object, design)

I have already analyzed this dataset starting from the raw data. If you need additional explanation or the code I used, I can send it to you via email…

Thank you in advance


Hey, no, those programs / packages will not recognise this data because the authors decided to put all feature data and expression data in the same file. You would have to read it into R manually, and then do your own final filtering.

I have done it for you this time so that you can learn the general process, given that this array (Illumina) can prove very frustrating. Note that the data in the 'processed' file is already normalised; below, we are merely preparing it for our own downstream analyses.

# read in the data
  mat <- read.csv('mRNA_microarray_processed.txt', sep = '\t',
    header = TRUE, row.names = 1, stringsAsFactors = FALSE)

# remove QC probes
  table(mat$QC)
  mat <- mat[mat$QC == '',]
  mat <- mat[,-1]

# set rownames:
# format will be: 'IlluminaProbeID_GeneSymbol'
  rownames(mat) <- paste0(mat$Probe_Id, '_', mat$Symbol)

# extract out detection p-value data
  pvalues <- mat[,grep('p$', colnames(mat))]

# remove detection p-values, standard deviation, and nbeads from main data
  mat <- mat[,-grep('p$|sd$|nbeads$', colnames(mat))]

# only keep probes with a detection p-value < 0.05 in at least 3 of the 6 samples
  keep <- (rowSums(pvalues < 0.05) >= 3)
  table(keep)
  # keep
  # FALSE  TRUE 
  # 25653 21569

  mat <- mat[keep,]
  pvalues <- pvalues[keep,]

# divide up the feature data and the expression data
# log [base2] transform the expression data
  featuredata <- mat[,7:ncol(mat)]
  mat <- data.matrix(log2(mat[,1:6]))

# verify that data is normalised
  par(mfrow = c(1,2))
  hist(mat)
  boxplot(mat)
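On the earlier question of what to pass to lmFit: a filtered, log2-scale numeric matrix like `mat` can be used directly. The sketch below uses a simulated stand-in matrix and a hypothetical 3-versus-3 grouping (`control` / `case`); the real group labels must come from the study's sample annotation, and the column order of `mat` must match the order of `group`.

```r
library(limma)

set.seed(1)
# Simulated stand-in for the filtered log2 expression matrix (100 probes x 6 samples)
mat <- matrix(rnorm(600, mean = 8, sd = 1), nrow = 100,
              dimnames = list(paste0("probe", 1:100), paste0("sample", 1:6)))

# Hypothetical grouping: replace with the real sample annotation
group <- factor(c("control", "control", "control", "case", "case", "case"),
                levels = c("control", "case"))
design <- model.matrix(~ group)

# fit a linear model per probe, then moderate the statistics
fit <- lmFit(mat, design)
fit <- eBayes(fit)

# full results table for the case-vs-control coefficient
res <- topTable(fit, coef = "groupcase", number = Inf)
head(res)
```

With simulated noise, no probe should be expected to pass a sensible adjusted p-value threshold; the point is only to show which object goes where.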


Kevin


Thank you very much for your kind answer.

First, I wish you all the best, and I apologize for the inconvenience. I would like to ask another question, and your answer will be very valuable for my analyses: does the platform type affect the number of significant genes? Or, to put it another way: when we analyze various platforms individually, do we have to use different thresholds for each, or should the same threshold be used?


The processing for other array platforms is very different. For example, Agilent and Affymetrix arrays do not use detection p-values. To give you an idea, I have 10 years of experience working with genotyping and expression arrays, and I only now understand all (I think?) of the nuances associated with each array.

Another problem is that the authors of the study whose data you are using have processed the data in a non-standard way. Generally, though, for all Illumina HT arrays, you can regard the processing as similar. See my other linked thread again for a more general processing pipeline for Illumina HT arrays.
