Ask for suggestions of filtering probes in affymetrix data
0
0
Entering edit mode
5.5 years ago
liux.bio ▴ 360

Hi, biostars.

I am analyzing an affymetrix microarray dataset. After searching, I found following blogs about filtering probes:

Following the blogs, I determined the filtering standard

  • For each sample, compute 95th percentile of the negative control genes (normgene -> intro)
  • For a gene in a sample, if the gene's value is greater than the 95th percentile of the negative control genes in this sampe, we assume this gene is detected
  • if a gene is detected in more than half of all the samples, select this gene, otherwise, filter it.

After filtering, I obtain 6543 probes (the number of all probes is 53617). I want to know if the filtering standard is reasonable. Please give me some suggestions.

Here is my code:

library(oligo)
# load data 
celFiles <- list.celfiles("./rawData/GSE83452/celFiles/", full.names = TRUE)
rawData <- read.celfiles(celFiles)
# normalize 
processedData <- rma(rawData)

# filter genes with low expression
librarypd.hugene.2.0.st)
con <- dbpd.hugene.2.0.st)
# negative-control probes (normgene -> intro)
NCprobes <- dbGetQuery(con,  "select meta_fsetid from core_mps inner join featureSet on 
 core_mps.fsetid=featureSet.fsetid where featureSet.type='10';")
#  the 95th percentile of the negative control genes 
NCquantileValue <- apply(exprs(processedData)[as.character(NCprobes[,1]),], 2, quantile, 
 probs=0.95)

# function to filter genes
filterGeneDetected <- function(sampleValues, quantileValues, detectedRatio) {
    # the length of sampleValues and that of quantileValues should be the same
    # In a sample, if a gene's is up quantileValue, we say this gene is detected
    # If a gene is detected in more than detectedRatio * allSampels, the gene is selected
    samplesLength <- length(sampleValues)
    detectedNum <- 0
    for (index in c(1:samplesLength)) {
       if (sampleValues[index] > quantileValues[index]) {
           detectedNum =  detectedNum + 1
        }  
        }
    ifSelected <- FALSE
    if (detectedNum/samplesLength >= detectedRatio) {
        ifSelected = TRUE
     }
     return (ifSelected)
   } 

detectedProbes <- apply(exprs(processedData), 1, 
                    filterGeneDetected, 
                    quantileValues = NCquantileValue,
                    detectedRatio = 0.5)
processedDataFilter <- processedData[detectedProbes, ]

Many thanks for your kindness.

microarray Affymetrix • 2.0k views
ADD COMMENT
0
Entering edit mode

That strategy for defining 'detection' seems reasonable to me, in this dataset that does not have a Affymetrix MAS Absent/Present call. And the number of surviving genes seems reasonable given your criteria. One comment about the criteria of insisting on detection in >= 50% of the samples. You may want to consider applying that filter separately within each experimental group of interest. For example, in GSE83542, say you wanted to compare NASH vs. non-NASH at baseline (n=104 vs. 44). If you had a gene that was detected in all non-NASH, but none of the NASH samples, that might be a very interesting gene, but would be filtered out by your final criterion since it was detected in only 44 of the samples (less than 50% of all samples in the dataset). If you applied your 50% detection threshold within each of the two comparison groups, and retained genes detected at 50% within any experimental group, that gene would survive your filter.

ADD REPLY

Login before adding your answer.

Traffic: 4332 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6