Question

Is it normal for most data to be lost when filtering data?

0

Entering edit mode

3.3 years ago

ssko ▴ 20

Hi everyone!

I have metagenome dataset. I was trying to find differential expressions of my samples (I have only two samples). Therefore, I used edgeR package in my count data. In filter the data step, before filtering, my dataframe's dimension was [3005,2]. It remained [71,2] after the filtering step. Most of the data have been lost, and one of the samples has only 1715 zero values, the other sample has only 1776 zero values. I used this code: keep <- rowSums(cpm(y)>100) >= 2.

Is it normal for dimension to go down from 3005 to 71? If anybody has an idea about filtering data, I am open to all suggestions. Thanks all!

edgeR R filter • 1.2k views

ADD COMMENT • link updated 3.3 years ago by Gordon Smyth ★ 8.1k • written 3.3 years ago by ssko ▴ 20

3

Entering edit mode

For edgeR the default min.count if you use their filterByExpr function is 10. Your cutoff of 100 is likely high for you data. It would be a good idea to make a histogram of counts to check.

ADD REPLY • link 3.3 years ago by rpolicastro 13k

2

Entering edit mode

Thanks for replying but OP's cutoff isn't directly comparable to min.count. The edgeR threshold of 10 is for counts whereas OP is applying a cutoff to the counts-per-million. The edgeR threshold is required only for some samples whereas OP is requiring the cutoff to be satisfied for every sample. If the sequencing depth is 10 million reads per sample (say) then OP's cpm cutoff corresponds to a count of 1,000 for each sample and at least 2,000 for the row sum. No wonder they lose most of their data.

ADD REPLY • link 3.3 years ago by Gordon Smyth ★ 8.1k

1

Entering edit mode

Yep, I agree that a cutoff of 100 is kinda of high. With the default (removing the genes that have 10 or less counts) you already see a big decrease in the size of the data. In addition, genes near 100 counts are already being trascribed.

ADD REPLY • link 3.3 years ago by vitor ▴ 130

score 3 · Answer 1 · 2022-02-08

No it isn't at all normal to remove most data. The code you're using is a bit crazy and is bound to remove most of the data. A cpm cutoff around 1 instead of 100 would be more usual. But why not follow the edgeR User's Guide and use

keep <- filterByExpr(y)

To be honest, you hardly need to filter at all. Since you don't have any replication, and hence can't estimate the dispersion, you really only need to remove rows that are all zero.