Hi everyone,
I have been working with expression data (both microarray and RNA-seq) for about 4 years, but this question still confuses me when I preprocess data. 1) My opinion is that we should at least filter out low-expressed genes first. The reason: the aim of quantile normalization/log2 transformation is to give the data a better-behaved distribution, but if we apply quantile/log2 first and filter out low-expressed genes afterwards, we may break that distribution.
2) For log2 transformation and quantile normalization, I really don't know which one goes first.
Thank you in advance for your time and valuable suggestions.
If you remove low-expressed genes first (across the samples/cohort) and then log-transform, i.e. log2(FPKM + 1), the results should be fine.
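A minimal sketch of that order of operations in Python/NumPy, in case it helps. The function name `filter_then_log2` and both thresholds (`min_fpkm=1.0`, `min_fraction=0.5`) are assumptions made up for illustration, not recommended cutoffs; the toy matrix is invented too.

```python
import numpy as np

def filter_then_log2(fpkm, min_fpkm=1.0, min_fraction=0.5):
    """Filter low-expressed genes first, then apply log2(FPKM + 1).

    fpkm: 2-D array, rows = genes, columns = samples.
    A gene is kept if its FPKM is >= min_fpkm in at least
    min_fraction of the samples (both thresholds are hypothetical).
    """
    fpkm = np.asarray(fpkm, dtype=float)
    # Fraction of samples in which each gene passes the expression cutoff
    fraction_expressed = (fpkm >= min_fpkm).mean(axis=1)
    kept = fraction_expressed >= min_fraction
    # The log transform is applied only to the genes that survived filtering
    return np.log2(fpkm[kept] + 1.0), kept

# Toy FPKM matrix: 4 genes x 4 samples (made-up values)
fpkm = np.array([
    [0.0, 0.2, 0.1, 0.0],        # never expressed, dropped
    [3.0, 7.0, 15.0, 1.0],       # expressed in all samples, kept
    [0.5, 2.0, 0.0, 0.3],        # expressed in 1 of 4 samples, dropped
    [31.0, 63.0, 127.0, 255.0],  # expressed in all samples, kept
])

log_fpkm, kept = filter_then_log2(fpkm)
print(kept)
print(log_fpkm)
```

How strict the filter should be depends on the cohort size and the downstream analysis, so treat the cutoffs as placeholders.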
So is this question about RNA-seq or microarray?
Hi WouterDeCoster, I want to keep it general, covering both RNA-seq and microarray.
RNA-seq and microarray are both transcriptomics, but that's where the similarities end. Microarray data are continuous intensities, while RNA-seq data are discrete counts (sampled from a negative binomial distribution, i.e. an overdispersed Poisson distribution).
I'll leave microarray analysis for someone else, but for RNA-seq the most accepted approach is to use tools like DESeq2 and edgeR, which model the data assuming this negative binomial distribution. So you don't want to preprocess the data here, because the software expects raw, unmanipulated counts in order to work optimally.
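To make "overdispersed Poisson" concrete, here is a small NumPy simulation comparing Poisson and negative binomial counts with the same mean. This is only an illustration of the distributions, not what DESeq2 or edgeR do internally; the mean of 100 and the dispersion of 0.2 are arbitrary assumptions, and the parameterisation variance = mean + dispersion * mean^2 is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)

mean = 100.0        # assumed expected count for one gene
dispersion = 0.2    # assumed dispersion; variance = mean + dispersion * mean**2

# Convert (mean, dispersion) to NumPy's (n, p) parameterisation
n = 1.0 / dispersion
p = n / (n + mean)

nb_counts = rng.negative_binomial(n, p, size=100_000)
pois_counts = rng.poisson(mean, size=100_000)

# Both have roughly the same mean, but the negative binomial counts
# have a much larger variance than the Poisson counts
print("Poisson  mean/var:", pois_counts.mean(), pois_counts.var())
print("NegBinom mean/var:", nb_counts.mean(), nb_counts.var())
```

With these settings the theoretical Poisson variance equals the mean (100), while the negative binomial variance is 100 + 0.2 * 100^2 = 2100.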
Thank you. Exactly, for raw read counts from RNA-seq data, I usually use DESeq2 and edgeR for DEG analysis. But sometimes I only have RPKM/FPKM data, and that's where I run into trouble.