Hi all,
I am trying to compare a few different RNAseq datasets with very different library sizes - often varying by orders of magnitude. I want to visualise the data using some sort of dimension reduction (e.g. PCA, UMAP etc.).
Usually I would use either vst or rlog from DESeq2 before running dimension reduction algorithms. However, the vst documentation states that it is sensitive to widely varying library size, and my dataset is much too large to run rlog in a realistic timeframe.
Are there any alternative transformations that can achieve variance stabilisation on a large dataset with widely varying library sizes? Is limma-voom appropriate, or are there others?
Many thanks
The problem is that you will likely have many technical dropouts if library sizes are too low. It is important to determine whether the variation in depth is more or less equally distributed across conditions, or whether a certain condition is specifically confounded by low depth. The latter would be a systematic problem. If it is not, you can still try the default normalization of e.g. DESeq2 or edgeR and then inspect the data with MA-plots (e.g. using the averaged counts per condition, since an MA-plot compares only two samples). The majority of genes should still center roughly at y = 0 if the procedure went well. It would probably make sense to use only genes with higher average expression for downstream analysis such as clustering, to avoid lowly expressed genes, which probably suffer most from dropouts due to low depth. But as said, first of all I'd check whether a certain condition is particularly confounded and whether standard normalization works, using e.g. the aforementioned MA-plots.
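To make the MA-plot check concrete, here is a minimal sketch in plain Python. It is not the DESeq2 or edgeR API: `size_factors` is a simplified re-implementation of the median-of-ratios idea, and the function names, the toy counts, the pseudocount and the `min_a` cutoff are all assumptions chosen for illustration. It computes the M (log fold change) and A (average log expression) values from condition-averaged normalized counts, reports the median M, and keeps only genes above an expression cutoff.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (simplified; rows = genes, columns = samples)."""
    n_samples = len(counts[0])
    # Only genes with non-zero counts in every sample have a defined log geometric mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def ma_values(counts, factors, group_a, group_b, pseudocount=1.0):
    """Return one (A, M) pair per gene from normalized counts averaged per condition."""
    result = []
    for row in counts:
        norm = [c / f for c, f in zip(row, factors)]
        mean_a = sum(norm[j] for j in group_a) / len(group_a)
        mean_b = sum(norm[j] for j in group_b) / len(group_b)
        log_a = math.log2(mean_a + pseudocount)
        log_b = math.log2(mean_b + pseudocount)
        result.append(((log_a + log_b) / 2, log_b - log_a))
    return result

# Toy data (made up): samples 0-1 are deep, samples 2-3 have ~10x lower depth.
counts = [
    [1000, 1000, 100, 100],
    [2000, 2000, 200, 200],
    [4000, 4000, 400, 400],
    [10000, 10000, 1000, 1000],
    [500, 500, 200, 200],   # genuinely higher in the low-depth condition
    [200, 200, 20, 20],     # lowly expressed gene
]
factors = size_factors(counts)
ma = ma_values(counts, factors, group_a=[0, 1], group_b=[2, 3])

# If normalization worked, the bulk of genes should sit near M = 0.
median_m = median(m for _, m in ma)

# Keep only genes with higher average expression (min_a is an arbitrary cutoff).
min_a = 7.0
kept = [i for i, (a, _) in enumerate(ma) if a >= min_a]
print(f"median M = {median_m:.3f}, genes kept = {len(kept)} of {len(ma)}")
```

On real data you would plot M against A rather than just look at the median, and a median M clearly away from 0 (or a trend along A) would suggest the depth difference has not been normalized away.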
Thanks for this. It's definitely condition-specific - I am trying to see how 'similar' some low-depth data from our lab is to published data for specific tissues. I'm interested both in published higher-depth cancer studies and in much lower-depth single-cell studies. So there is very wide variation in depth, and it is condition-specific. I'm mostly trying to do a fairly broad comparative analysis, e.g. do our samples cluster with tissue A or tissue B, rather than trying to interrogate specific genes.
I think you are right that more stringent gene filtering for those with higher mean expression is key. Thanks again.
I hope you are not trying to squeeze data from different studies, together with your own data, into the same normalization and dimensionality reduction. This will most likely cluster strongly by study, since RNA-seq is (in my experience) always confounded by study due to the choice of library and sequencing kit, on top of the biological variation. You will probably see very distinct clustering by study and will not be able to say whether your samples cluster closer to other data points because of the underlying biology.
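One rough way to check for this is to compare how far apart samples are when they share a study versus when they share a tissue. The sketch below uses plain Python with made-up log-expression profiles. The helper name `mean_distance_within`, the Euclidean distance and the toy values are assumptions for illustration only, and it diagnoses the problem rather than correcting any batch effect.

```python
import math
from itertools import combinations

def mean_distance_within(profiles, labels):
    """Mean Euclidean distance over all sample pairs that share a label."""
    dists = [
        math.dist(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
        if labels[i] == labels[j]
    ]
    return sum(dists) / len(dists)

# Toy log-scale expression profiles (made up): one row per sample, one column per gene.
profiles = [
    [5.0, 1.0, 3.0],   # study 1, tissue A
    [1.0, 5.0, 3.0],   # study 1, tissue B
    [9.0, 5.0, -1.0],  # study 2, tissue A (same biology plus a study-wide shift)
    [5.0, 9.0, -1.0],  # study 2, tissue B
]
study = ["s1", "s1", "s2", "s2"]
tissue = ["A", "B", "A", "B"]

mean_same_study = mean_distance_within(profiles, study)
mean_same_tissue = mean_distance_within(profiles, tissue)

# If samples from the same study are closer than samples from the same tissue,
# the study effect dominates and clustering will mostly reflect the study.
study_dominates = mean_same_study < mean_same_tissue
print(f"same study: {mean_same_study:.2f}, same tissue: {mean_same_tissue:.2f}")
```

In this toy example the study-wide shift is larger than the tissue difference, so samples sit closer to their study mates than to the matching tissue from the other study, which is the pattern described above.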