I have an RNA-seq expression matrix in FPKM (raw counts are not available). I want to combine this dataset with other datasets I have, for building a machine learning model downstream. Is the following approach valid? First, filter the matrix, keeping only genes with FPKM >= 1 in at least 10 samples:
RNA_FPKM <- RNA_FPKM[apply(RNA_FPKM[, -1], 1, function(x) sum(x >= 1) >= 10), ]
Then add 0.1 to the filtered FPKM matrix and take the log2:
# column 1 is skipped here, as in the filter above (assumed to hold gene IDs)
new_expression <- log2(RNA_FPKM[, -1] + 0.1)
The aim is to filter out the lowly expressed genes. Is this a valid approach? I don't have the raw counts, so this is the best I can do. Forgive me if this is totally wrong or idiotic, but I am totally new to this field, so your help will be much appreciated.
What you're describing isn't converting the FPKM values to raw reads, but filtering genes based on FPKM levels. If that's the end goal then I see no harm in your analysis. I do have a couple of questions:
Why are you filtering for expression in 10 samples? Would it be more appropriate to require expression in all of your samples? Do you have 20 samples and you're requiring expression in half of them? Can these samples be divided into meaningful groups - e.g. require expression in all treated or all untreated samples?
Is 1 FPKM a reasonable cutoff empirically, or is it arbitrary? I tend to look at
plot(density(log2(fpkm)))
as this will often yield a bimodal distribution, then determine my expression cutoff based on the input dataset. If this is degraded FFPE data then 1 FPKM might not be stringent enough.
Thanks Shawn. Actually, if I filter by FPKM > 1 in all the samples, I will lose 90% of my genes, but I don't mind. So if, after filtering, I convert the FPKM values to log2 scale for the purpose of building a machine learning model, would this be scientifically valid or not?
1. These RNA-seq values are meant to be combined with microarray data.
2. The combined data will be divided into training data and testing data.
3. I will use the normalizeBetweenArrays function on each of the training data and testing data.
4. I did so, and the data appear normalized and the classifier appears to work well on the test data using this approach.
However, I know that it is not valid to use FPKM RNA-seq values in such an analysis, so how can I fix this? Would adding 0.1 to the FPKM values and taking the log2 help?
I'd be a little concerned about mixing RNA-seq and microarray expression data - yes they'll both be on a log scale but they'll still be on quite different scales (microarray signal intensity is generally on a 6-12 range, while the log2(FPKM) would be ~1-13). I'd put a little more thought into combining these different measurements, maybe you can take a Z-score of the RNA-seq and a Z-score of the microarray data? I'm not sure how to advise in that case.
But regarding using log2(FPKM+0.1) as a measurement of gene expression, that sounds perfectly reasonable.
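As a rough illustration of the Z-score idea mentioned above, here is a minimal sketch in base R. It assumes both datasets are already on a log2 scale and stored as gene-by-sample matrices with gene IDs as row names. The matrices (`rnaseq_log2`, `array_log2`) and the helper `zscore_rows` are hypothetical names with simulated values, not anything from the original data. Whether per-platform Z-scoring is appropriate depends on the two datasets having a comparable sample composition, so treat this as one option rather than a recommendation.

```r
# Toy log2 expression matrices (genes x samples); values are simulated, not real data.
set.seed(1)
genes <- paste0("gene", 1:5)
rnaseq_log2 <- matrix(rnorm(5 * 6, mean = 5, sd = 3), nrow = 5,
                      dimnames = list(genes, paste0("rna", 1:6)))
array_log2  <- matrix(rnorm(5 * 4, mean = 9, sd = 1.5), nrow = 5,
                      dimnames = list(genes, paste0("arr", 1:4)))

# Z-score each gene (row) within its own platform.
# scale() standardizes columns, so transpose before and after.
# Note: a gene with zero variance across samples would give NaN here.
zscore_rows <- function(m) t(scale(t(m)))

rnaseq_z <- zscore_rows(rnaseq_log2)
array_z  <- zscore_rows(array_log2)

# Combine the two platforms on the genes they share.
shared   <- intersect(rownames(rnaseq_z), rownames(array_z))
combined <- cbind(rnaseq_z[shared, ], array_z[shared, ])
```

After this step every gene has mean 0 and standard deviation 1 within each platform, so the two blocks of `combined` sit on a comparable scale. For a train/test setup, it would probably be safer to estimate the per-gene means and standard deviations on the training samples only and reuse them for the test samples.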
I know it would be problematic but I have to do it. Thank you, your input was very helpful.