Hi,
I am using the PanCan Atlas TCGA data for a project, which consists of about 11,000 bulk RNAseq cancer samples. This data set has some missing values (NAs), and I wanted to try to impute missing values.
Which method is the best for missing data imputation for bulk-RNAseq data? Most of the recent literature only talks about missing value imputation methods for scRNAseq data but not bulk RNAseq data.
I've come across KNN-based missing value imputation and SEQimpute/ROBimpute for bulk RNAseq data. The latter is included in the R package 'rrcovNA', and the robust version apparently performs better than KNN when there are outliers in the data set. These two approaches are from relatively old publications and I could not find them widely applied in the literature. Are there any more recent and more popular imputation methods for bulk RNAseq data? It would be great if there is a method that also performs some permutations during the imputation.
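To make the KNN idea concrete, here is a minimal pure-Python sketch of gene-wise nearest-neighbour imputation. It is only an illustration of the general technique, not the algorithm of any particular package; the function name `knn_impute`, the choice of `k`, and the toy matrix are all made up for this example.

```python
import math

def knn_impute(matrix, k=2):
    """Fill NaN entries in each row with the mean of that column
    taken from the k most similar rows (hypothetical helper)."""
    imputed = [row[:] for row in matrix]  # work on a copy
    for i, row in enumerate(matrix):
        missing = [j for j, v in enumerate(row) if math.isnan(v)]
        if not missing:
            continue
        # Distance to every other row, using only columns observed in both.
        candidates = []
        for m, other in enumerate(matrix):
            if m == i:
                continue
            shared = [j for j in range(len(row))
                      if not math.isnan(row[j]) and not math.isnan(other[j])]
            if not shared:
                continue
            dist = math.sqrt(
                sum((row[j] - other[j]) ** 2 for j in shared) / len(shared)
            )
            candidates.append((dist, m))
        candidates.sort()
        for j in missing:
            # Nearest rows that actually have a value in column j.
            donors = [matrix[m][j] for _, m in candidates
                      if not math.isnan(matrix[m][j])][:k]
            if donors:
                imputed[i][j] = sum(donors) / len(donors)
    return imputed

nan = float("nan")
# Toy log-scale expression matrix: rows are genes, columns are samples.
expr = [
    [5.0, 5.2, nan, 5.1],  # gene with one missing value
    [5.1, 5.0, 4.9, 5.2],  # similar profile
    [4.9, 5.3, 5.1, 5.0],  # similar profile
    [0.1, 0.0, 0.2, 0.1],  # very different profile
]
filled = knn_impute(expr, k=2)
```

In this toy case the missing value is filled from the two similar genes and the dissimilar gene is ignored. On a matrix the size of PanCan you would want a vectorised implementation rather than these nested loops.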
Thanks in advance!
Hi,
I don't have a full answer since I just started researching this myself, but I noticed that the DESeq2 package, which is very widely cited, reports log fold changes for genes which have 0s in one condition. It is using a Bayesian approach. Maybe it's possible to integrate this into a workflow outside of DESeq? https://support.bioconductor.org/p/64014/
Also I found this article (seemingly by a high school student:)) which tested a bunch of imputation methods prior to clustering microarray data. Some of these methods are available in R packages, though I agree I'm not sure how widely cited they are for RNA-Seq data: https://arxiv.org/pdf/1809.05969.pdf
I'm very interested to hear what you decide to do.
It is uncommon for RNA-seq data to contain missing values in the form of "NAs". I would rather try to find out why they are there before attempting any imputation. I've actually never seen a dataset with NAs, for what it's worth. Maybe send an email to their data support help desk; it might be a technical issue related to data processing. In fact, if you simply downloaded the raw fastq files, aligned and quantified them, there would be no NAs. At most you would get zeros where a gene/exon has no counts. So this must be something related to their processing, assuming the values really are "NA" in the expression matrix and did not come up during your own data manipulation. Hard to tell without code.
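A quick way to check where the NAs sit is to count them per gene and per sample, and to count zeros separately. The sketch below uses only the standard library; the function name `missing_summary`, the gene/sample labels, and the toy matrix are hypothetical stand-ins for the real expression matrix.

```python
import math

def missing_summary(matrix, gene_ids, sample_ids):
    """Count NaNs per gene (row) and per sample (column), plus total zeros."""
    na_per_gene = {
        g: sum(math.isnan(v) for v in row)
        for g, row in zip(gene_ids, matrix)
    }
    na_per_sample = {
        s: sum(math.isnan(row[j]) for row in matrix)
        for j, s in enumerate(sample_ids)
    }
    # NaN never compares equal to 0, so zeros and NAs are counted separately.
    zeros = sum(v == 0 for row in matrix for v in row)
    return na_per_gene, na_per_sample, zeros

nan = float("nan")
genes = ["geneA", "geneB", "geneC"]
samples = ["s1", "s2", "s3"]
expr = [
    [12.0, 0.0, 7.0],
    [nan, nan, nan],   # NA in every sample: points to a processing issue
    [3.0, nan, 0.0],
]
na_gene, na_sample, n_zero = missing_summary(expr, genes, samples)
```

If the NAs are concentrated in a fixed set of genes across all samples, that suggests they come from the processing pipeline rather than from the samples themselves.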
Maybe I misinterpreted the question by the original poster, but I was quite interested to hear people's thoughts on how to deal with the case where you have 0s (not NAs) in one experimental condition when you want to do downstream analysis like clustering on the dataset. This is only likely to happen with lowly expressed genes, but you could be throwing out interesting data if you have a very strong response in one condition. I have a feeling that many researchers discard rows with any zeros, but there must be intelligent ways of dealing with that problem.
Bulk RNA-seq from the Broad Institute's TCGA firehose pipeline does have some missing values -- that's just because of the way they processed the data.
Normally, there shouldn't be missing values in bulk RNA-seq and there's no need for imputation. I prefer getting TCGA data from https://xenabrowser.net/
In answer to the previous commenter's inquiry, if you have 0s, that's fine -- work with it normally. Transform your data (e.g. log2(x+1) or some other variance stabilizing transform) and then do whatever downstream analysis you want. You might see some funky results with low count genes because log2(x+1) can be very different from log2(x) when x is small, but differential gene expression programs can deal with these cases (e.g. if you have all 0's in one condition but all 1's in another condition, you'll have a large p-value for differential gene expression -- but if you have all 0's in one condition but values >1000 in the other condition, that gene will still be considered differentially expressed). This is why it's important to look at not only the effect size (e.g. log fold change) but also the statistical significance.
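To put numbers on the low-count caveat, here is a small sketch of the log2(x+1) transform and the fold change it implies. It only illustrates effect size, not statistical significance; the function names and the pseudocount parameter are made up for this example.

```python
import math

def log2p1(x):
    """log2(x + 1) transform, defined for zero counts."""
    return math.log2(x + 1)

def log2_fold_change(a, b, pseudocount=1.0):
    """Log2 fold change of b over a after adding a pseudocount to both."""
    return math.log2((b + pseudocount) / (a + pseudocount))

# For small x the pseudocount dominates: log2(1) = 0 but log2(1 + 1) = 1.
gap_small = log2p1(1) - math.log2(1)
# For large x it barely matters.
gap_large = log2p1(1000) - math.log2(1000)

# All 0s vs all 1s gives a modest fold change; 0s vs 1000s gives a large one.
lfc_low = log2_fold_change(0, 1)
lfc_high = log2_fold_change(0, 1000)
```

Here `lfc_low` is exactly 1 while `lfc_high` is just under 10, and the pseudocount shifts the transformed value by a full unit at x = 1 but by less than 0.002 at x = 1000.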