Hi everyone!
I have a metagenome dataset.
I was trying to find differential expression between my samples (I have only two samples), so I used the edgeR package on my count data. In the filtering step, my data frame's dimension went from [3005, 2] before filtering to [71, 2] after. Most of the data has been lost; one of the samples has 1715 zero values and the other has 1776.
I used this code: keep <- rowSums(cpm(y) > 100) >= 2
Is it normal for the dimension to drop from 3005 to 71? If anybody has suggestions about filtering, I am open to all of them. Thanks all!
For edgeR, the default min.count in their filterByExpr function is 10. Your cutoff of 100 is likely too high for your data. It would be a good idea to make a histogram of counts to check.

Thanks for replying, but OP's cutoff isn't directly comparable to min.count. The edgeR threshold of 10 is for counts, whereas OP is applying a cutoff to the counts-per-million. The edgeR threshold is required only for some samples, whereas OP is requiring the cutoff to be satisfied for every sample. If the sequencing depth is 10 million reads per sample (say), then OP's cpm cutoff corresponds to a count of 1,000 for each sample and at least 2,000 for the row sum. No wonder they lose most of their data.

Yep, I agree that a cutoff of 100 is kind of high. Even with the default (removing the genes that have 10 or fewer counts) you already see a big decrease in the size of the data. In addition, genes near 100 counts are already being transcribed.
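To make the comparison concrete, here is a minimal sketch of the two filters side by side. The toy count matrix is simulated (it is not OP's data), but the filtering calls are the ones discussed above: OP's strict CPM > 100 in both samples versus edgeR's recommended filterByExpr with its default min.count of 10.

```r
library(edgeR)

# Toy stand-in for OP's 3005 x 2 metagenome count table (simulated, illustrative only)
set.seed(1)
counts <- matrix(rnbinom(3005 * 2, mu = 10, size = 0.5), ncol = 2,
                 dimnames = list(NULL, c("sample1", "sample2")))
y <- DGEList(counts = counts)

# OP's filter: CPM > 100 required in BOTH samples (very strict)
keep_strict <- rowSums(cpm(y) > 100) >= 2

# edgeR's recommended filter: filterByExpr (default min.count = 10, on counts)
keep_default <- filterByExpr(y)

# Compare how many rows survive each filter
summary(keep_strict)
summary(keep_default)
```

On real data you would expect keep_default to retain far more rows than keep_strict, since filterByExpr works on raw counts with a modest threshold rather than demanding CPM > 100 in every sample.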