Question

Problems Filtering Probes From Series Matrix File With Bioconductor

0

Entering edit mode

11.8 years ago

moranr ▴ 290

Hi,

I have a filtering function: First Filter out probes that are showing little expression above noise - log2(100) is arbitrary number we decide equivilant to 6.7/7 expression in 5% of a the sample set

filter1<- pOverA(0.05, log2(100))

Filter by interquartile range- gets rid of non variable/uninteresting probes- variable over 1.5, x=data

filter2<- function(x)(IQR(x)>1.5)

combine into a single function

ff<- filterfun(filter1,filter2)

I call it like so :

gse14814_probes <- genefilter(gse14814.gcrma,ff)

where gse14814.gcrma is an expression set created using ReadAffy() with CEL files and normalised with gcrma().

The function works for the above expression set, however when I use an expression set that was downloaded as a GEO series matrix file using :

gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)

The same function does not work, as nothing is filterd. No warnings or anything come up, just nothing is filtered.

Any ideas on what my problem is ? Thanks in advance

R

bioconductor r microarray filtering • 4.0k views

ADD COMMENT • link updated 11.8 years ago by Sean Davis 27k • written 11.8 years ago by moranr ▴ 290

score 2 · Answer 1 · 2013-02-20

2

Entering edit mode

11.8 years ago

Sean Davis 27k

getGEO() returns a list of ExpressionSets. So, this should probably do the trick:

gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)[[1]]
gse4573_probes <- genefilter(gse4573,ff)
# no filtering occurs because the constants used in the original
# filter function do not make sense for linear-scale data 
sum(gse4573_probes)
[1] 22282

Log-transform the data and things work as expected. Note that the code has not changed, but the data must be massaged to work with the code as it was written.

exprs(gse4573)=log2(exprs(gse4573))
gse4573_probes <- genefilter(gse4573,ff)
# Only difference is that the data were log-transformed first.
sum(gse4573_probes)
[1] 2507

So, 2507 probesets survive the cutoff. To get the associated filtered ExpressionSet:

gsefiltered = gse4573[gse4573_probes,]

ADD COMMENT • link 11.8 years ago by Sean Davis 27k

0

Entering edit mode

The above doesnt work unfortunately. I had considered applying a new function to loop through each gene in the matrix and test the conditions, however not entirely sure how to do this, Im quite new to this.

however something like this ? j = 0 for(i in numberofgenes){ overthreshold <- expessionmatrix[,currentgene] > log2(100) if IQR(expessionmatrix[,currentgene]) > 1.5 { if(sum(overthreshold) > 5) { retaingenes[i ] <- currentgene j = j +1 } } }

I dont no how to get 1: number of genes from the matrix, 2: current gene form the matrix,

ADD REPLY • link 11.8 years ago by moranr ▴ 290

1

Entering edit mode

I'm guessing you are applying your ff to the untransformed gse4573 ExpressionSet. Note that the expression values are NOT log-transformed, so the constants in your filter function will not filter anything. I edited the code above to include the log transformation step and you'll note that things work just fine.

ADD REPLY • link 11.8 years ago by Sean Davis 27k

0

Entering edit mode

Thank you so much, thank you for explaining why my function was not working, it really helps me learn . Thank you !

ADD REPLY • link 11.8 years ago by moranr ▴ 290

0

Entering edit mode

You are quite welcome!

ADD REPLY • link 11.8 years ago by Sean Davis 27k

0

Entering edit mode

Sorry Sean just abnother Q if you have time. I applied the filter function to 5 data sets, 3 are plus1 arrays- 22,000 probes, 2 are plus 2 arrays- 55,000 probes, On one of the datasets for the plus 2 arrays, 16,000 probes were returned. where as the other plus 2 only returned 4,000.The plus 1 arrays returned around 2000 probes. So should the plus two arrays not return around 4,000, as the are just over double the plus 1 arrays? I think 16,000 prbes returned under both of my conditions is a lot, in comparison to the other datasets? what do you think ?

ADD REPLY • link 11.8 years ago by moranr ▴ 290

1

Entering edit mode

There is no "right" answer to the question of how much to filter, at least that I know of. As to your question, using the same filter functions on two different datasets does not necessarily lead to proportional filtering due to differences between labs, normalization conditions, and experimental setup. You'll need to determine for yourself what the best filtering strategy is for each dataset.

ADD REPLY • link 11.8 years ago by Sean Davis 27k

Istvan Albert · Answer 2 · 2013-02-19

1

Entering edit mode

11.8 years ago

Ying W ★ 4.3k

I haven't used getGEO() before but I'm guessing gse4573 and gse14814.gcrma contain different things. genefilter() wants a matrix or eSet as the first argument.

Could you post some info such as the genefilter() command you run with gse4573 and also what gse4573 contains? (ie. class(gse4573), names(gse4573), head(gse4573)) as well as sessionInfo() output

ADD COMMENT • link updated 11.8 years ago by Istvan Albert 102k • written 11.8 years ago by Ying W ★ 4.3k

0

Entering edit mode

I use the exact same genefilter() commmand as above only I put in the gse4573 rather than gse14814.gcrma. I am working from my own laptop today which uses the R64 terminal/environment, where i normally use terminal. I got a warning back this time: Error in apply(expr, 1, flist) : dim(X) must have a positive length, any ide what this means ?

ADD REPLY • link 11.8 years ago by moranr ▴ 290