Entering edit mode
5.5 years ago
liux.bio
▴
360
Hi, biostars.
I am analyzing an affymetrix microarray dataset. After searching, I found following blogs about filtering probes:
- Question: filtering probes in affymetrix data
- Question: Filtering before differential expression analysis of microarrays - New paper out
- Question: Control probe sets in Affymetrix ST microarrays
Following the blogs, I determined the filtering standard
- For each sample, compute 95th percentile of the negative control genes (normgene -> intro)
- For a gene in a sample, if the gene's value is greater than the 95th percentile of the negative control genes in this sampe, we assume this gene is detected
- if a gene is detected in more than half of all the samples, select this gene, otherwise, filter it.
After filtering, I obtain 6543 probes (the number of all probes is 53617). I want to know if the filtering standard is reasonable. Please give me some suggestions.
Here is my code:
library(oligo)
# load data
celFiles <- list.celfiles("./rawData/GSE83452/celFiles/", full.names = TRUE)
rawData <- read.celfiles(celFiles)
# normalize
processedData <- rma(rawData)
# filter genes with low expression
librarypd.hugene.2.0.st)
con <- dbpd.hugene.2.0.st)
# negative-control probes (normgene -> intro)
NCprobes <- dbGetQuery(con, "select meta_fsetid from core_mps inner join featureSet on
core_mps.fsetid=featureSet.fsetid where featureSet.type='10';")
# the 95th percentile of the negative control genes
NCquantileValue <- apply(exprs(processedData)[as.character(NCprobes[,1]),], 2, quantile,
probs=0.95)
# function to filter genes
filterGeneDetected <- function(sampleValues, quantileValues, detectedRatio) {
# the length of sampleValues and that of quantileValues should be the same
# In a sample, if a gene's is up quantileValue, we say this gene is detected
# If a gene is detected in more than detectedRatio * allSampels, the gene is selected
samplesLength <- length(sampleValues)
detectedNum <- 0
for (index in c(1:samplesLength)) {
if (sampleValues[index] > quantileValues[index]) {
detectedNum = detectedNum + 1
}
}
ifSelected <- FALSE
if (detectedNum/samplesLength >= detectedRatio) {
ifSelected = TRUE
}
return (ifSelected)
}
detectedProbes <- apply(exprs(processedData), 1,
filterGeneDetected,
quantileValues = NCquantileValue,
detectedRatio = 0.5)
processedDataFilter <- processedData[detectedProbes, ]
Many thanks for your kindness.
That strategy for defining 'detection' seems reasonable to me, in this dataset that does not have a Affymetrix MAS Absent/Present call. And the number of surviving genes seems reasonable given your criteria. One comment about the criteria of insisting on detection in >= 50% of the samples. You may want to consider applying that filter separately within each experimental group of interest. For example, in GSE83542, say you wanted to compare NASH vs. non-NASH at baseline (n=104 vs. 44). If you had a gene that was detected in all non-NASH, but none of the NASH samples, that might be a very interesting gene, but would be filtered out by your final criterion since it was detected in only 44 of the samples (less than 50% of all samples in the dataset). If you applied your 50% detection threshold within each of the two comparison groups, and retained genes detected at 50% within any experimental group, that gene would survive your filter.