Hi,
I have a filtering function: First Filter out probes that are showing little expression above noise - log2(100) is arbitrary number we decide equivilant to 6.7/7 expression in 5% of a the sample set
filter1<- pOverA(0.05, log2(100))
Filter by interquartile range- gets rid of non variable/uninteresting probes- variable over 1.5, x=data
filter2<- function(x)(IQR(x)>1.5)
combine into a single function
ff<- filterfun(filter1,filter2)
I call it like so :
gse14814_probes <- genefilter(gse14814.gcrma,ff)
where gse14814.gcrma is an expression set created using ReadAffy()
with CEL files and normalised with gcrma()
.
The function works for the above expression set, however when I use an expression set that was downloaded as a GEO series matrix file using :
gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)
The same function does not work, as nothing is filterd. No warnings or anything come up, just nothing is filtered.
Any ideas on what my problem is ? Thanks in advance
R
The above doesnt work unfortunately. I had considered applying a new function to loop through each gene in the matrix and test the conditions, however not entirely sure how to do this, Im quite new to this.
however something like this ? j = 0 for(i in numberofgenes){ overthreshold <- expessionmatrix[,currentgene] > log2(100) if IQR(expessionmatrix[,currentgene]) > 1.5 { if(sum(overthreshold) > 5) { retaingenes[i ] <- currentgene j = j +1 } } }
I dont no how to get 1: number of genes from the matrix, 2: current gene form the matrix,
I'm guessing you are applying your ff to the untransformed gse4573 ExpressionSet. Note that the expression values are NOT log-transformed, so the constants in your filter function will not filter anything. I edited the code above to include the log transformation step and you'll note that things work just fine.
Thank you so much, thank you for explaining why my function was not working, it really helps me learn . Thank you !
You are quite welcome!
Sorry Sean just abnother Q if you have time. I applied the filter function to 5 data sets, 3 are plus1 arrays- 22,000 probes, 2 are plus 2 arrays- 55,000 probes, On one of the datasets for the plus 2 arrays, 16,000 probes were returned. where as the other plus 2 only returned 4,000.The plus 1 arrays returned around 2000 probes. So should the plus two arrays not return around 4,000, as the are just over double the plus 1 arrays? I think 16,000 prbes returned under both of my conditions is a lot, in comparison to the other datasets? what do you think ?
There is no "right" answer to the question of how much to filter, at least that I know of. As to your question, using the same filter functions on two different datasets does not necessarily lead to proportional filtering due to differences between labs, normalization conditions, and experimental setup. You'll need to determine for yourself what the best filtering strategy is for each dataset.