In the vignette for scRNAseq data integration using Seurat (https://satijalab.org/seurat/articles/integration_introduction.html), the 2000 or 3000 most variable features are chosen for integrating multiple samples, resulting in a combined data object.
The combined data object contains only those 2000 or 3000 most variable features, which were chosen at the beginning of the workflow. All other features are lost in the integrated data. Is there an option to integrate the data so that all features are kept in the integrated data object?
I think you are missing the concept here. As igor says, the integration runs on the reduced dimensions, most commonly PCA, not on a per-gene basis. For this you only want to include the variable genes, as all other genes do not carry any information. The integrated values are only useful for clustering and visualization, which, again, are commonly performed at the reduced-dimension level. For something like differential gene expression one would not use the integrated values, as the integration procedure creates dependencies between the data points and might change the magnitude and direction of differences. There are many threads on this already, e.g. issues on the Seurat GitHub or section 13.8 of OSCA:
At this point, it is also tempting to use the corrected expression values for gene-based analyses like DE-based marker gene detection. This is not generally recommended as an arbitrary correction algorithm is not obliged to preserve the magnitude (or even direction) of differences in per-gene expression when attempting to align multiple batches. For example, cosine normalization in fastMNN() shrinks the magnitude of the expression values so that the computed log-fold changes have no obvious interpretation. Of greater concern is the possibility that the correction introduces artificial agreement across batches.
(...)
In summary, for integration you should use the most variable genes, which is why this is the default in most workflows.
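To make the quoted point about cosine normalization concrete, here is a minimal sketch in plain Python. It is not the fastMNN() implementation; it only scales each cell's vector to unit length, and the expression values and the helper name `cosine_normalize` are made up for illustration.

```python
import math

def cosine_normalize(cell):
    """Scale one cell's expression vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in cell))
    return [v / norm for v in cell]

# Toy log-expression values for three genes in two cells (invented numbers).
cell_a = [2.0, 1.0, 4.0]
cell_b = [4.0, 1.0, 9.0]

# Difference for gene 0 on the original log scale (a log-fold change).
raw_diff = cell_b[0] - cell_a[0]

# The same comparison after each cell has been scaled to unit length.
norm_a = cosine_normalize(cell_a)
norm_b = cosine_normalize(cell_b)
norm_diff = norm_b[0] - norm_a[0]

print("raw difference:       ", raw_diff)
print("normalized difference:", norm_diff)
```

In this toy case the difference for gene 0 shrinks by well over an order of magnitude, and its sign flips, because the scaling factor depends on all the other genes in each cell. That is why such corrected values are hard to interpret as per-gene fold changes.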
Thank you very much for the clarification and for providing some links for further reading. Your explanation is very helpful. As you indicated, the Seurat authors recommend, in the discussion section, using the uncorrected data and defining the batch variable as a blocking factor for differential expression analysis.
I was wondering, however, how you would approach analyses other than DE testing, where you cannot define a blocking factor. What if you want to compare ssGSEA/GSEA signature enrichment scores, or use scRNAseq data to model the expression of a feature? Do we need to wait for better computational methods that allow proper batch correction of scRNAseq data? I've read that ComBat and other methods do not provide satisfactory results even for DE analysis (https://www.sciencedirect.com/science/article/pii/S200103701930409X).
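For readers unfamiliar with blocking, the idea mentioned above can be sketched on uncorrected values as follows. This is a simplified illustration in plain Python, not what Seurat or any DE package does internally: it compares conditions within each batch and then averages the per-batch differences. The data and the helper names `group_mean` and `blocked_difference` are hypothetical.

```python
from statistics import mean

# (batch, condition, log-expression) for a single gene; invented numbers.
# Batch b2 has a baseline shift of about +5, and the design is unbalanced.
cells = [
    ("b1", "ctrl", 1.0), ("b1", "ctrl", 1.2), ("b1", "ctrl", 0.8), ("b1", "treat", 2.0),
    ("b2", "ctrl", 6.0), ("b2", "treat", 7.1), ("b2", "treat", 6.9), ("b2", "treat", 7.0),
]

def group_mean(rows, condition):
    """Mean expression of all cells in `rows` with the given condition."""
    return mean(value for _, cond, value in rows if cond == condition)

def blocked_difference(rows):
    """Average of the within-batch treat-minus-ctrl differences."""
    batches = sorted({batch for batch, _, _ in rows})
    per_batch = []
    for b in batches:
        in_batch = [r for r in rows if r[0] == b]
        per_batch.append(group_mean(in_batch, "treat") - group_mean(in_batch, "ctrl"))
    return mean(per_batch)

# Pooling all cells mixes the batch shift into the condition effect.
naive = group_mean(cells, "treat") - group_mean(cells, "ctrl")
blocked = blocked_difference(cells)

print("naive difference:  ", naive)
print("blocked difference:", blocked)
```

Here the pooled comparison is inflated by the batch shift, while the within-batch comparison recovers the difference that is present in both batches. Real DE tools achieve the same effect by including batch as a covariate in the model.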
You can use as few or as many features as you want. Keep in mind that the integration happens based on the reduced dimensions, so you are "losing" features either way.
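As a rough illustration of "as few or as many features as you want": feature selection is a ranking with a cut-off, so the number of features is just a parameter. The sketch below ranks genes by plain variance, which is a simplification and not Seurat's selection method; the gene names, values, and the function `top_variable_genes` are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(expr, n_features):
    """Return the n_features genes with the highest variance across cells.

    expr maps a gene name to its list of expression values, one per cell.
    """
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:n_features]

# Toy expression table: three genes measured in four cells (invented numbers).
expr = {
    "GeneA": [0.0, 5.0, 0.0, 5.0],   # strongly variable
    "GeneB": [2.0, 2.1, 1.9, 2.0],   # almost constant
    "GeneC": [1.0, 3.0, 1.0, 3.0],   # moderately variable
}

print(top_variable_genes(expr, 2))          # cut-off at two features
print(top_variable_genes(expr, len(expr)))  # "all features": the ranking is kept, nothing is dropped
```

Raising the cut-off to the total number of genes keeps everything, but near-constant genes such as GeneB contribute little to the reduced dimensions on which the integration is based.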
Thanks for your response. If I use SCT transformation, it seems like I cannot specify all features, as an error comes up during the workflow (see below). I think this might be a transformation-specific issue. Any advice?