Hi,
I have a question I have been thinking about for a while and would appreciate your expert advice. When doing single-cell RNA sequencing analysis for, let's say, 4 control and 4 cancer samples, would it be more useful to do the preprocessing (filtering, normalization, etc.) and then cluster each sample individually to identify the cell types? Then, once the cell types are identified in each individual sample, merge the samples together with their cell type IDs, integrate, and do differential gene expression analysis between cell types? I ask this in contrast to preprocessing all the samples together and then integrating them to cluster and identify cell types across all 8 samples combined. My understanding is that by analyzing each sample individually, without integration, you preserve the signal needed to identify cell types after clustering. You can then combine the cell types from all the samples for downstream analysis. In contrast, when you preprocess and integrate all 8 samples and then cluster the cells jointly, you lose some actual biological signal. Any advice and thoughts would be appreciated.
You should integrate and call clusters on the joint embedding. Most good integration algorithms (like scVI) should be robust to cell types that are present only in, e.g., your 4 cancer samples. You also avoid differences in clustering that are driven solely by the noise present in each sample, which is not informative.
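One way to check whether a cell type found on the joint embedding is specific to one condition is to tabulate the joint cluster labels against the condition of each cell. Below is a minimal sketch in plain Python, not part of any integration library. The function names `cluster_composition` and `condition_specific_clusters`, the 0.95 threshold, and the toy labels are all assumptions made for illustration; in practice the labels would come from clustering on the integrated embedding.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_labels, conditions):
    """Return {cluster: {condition: fraction of that cluster's cells}}."""
    if len(cluster_labels) != len(conditions):
        raise ValueError("need exactly one condition per cell")
    counts = defaultdict(Counter)
    for cluster, condition in zip(cluster_labels, conditions):
        counts[cluster][condition] += 1
    composition = {}
    for cluster, per_condition in counts.items():
        total = sum(per_condition.values())
        composition[cluster] = {c: n / total for c, n in per_condition.items()}
    return composition


def condition_specific_clusters(composition, threshold=0.95):
    """Clusters in which a single condition contributes at least `threshold` of the cells."""
    return sorted(
        cluster
        for cluster, fractions in composition.items()
        if max(fractions.values()) >= threshold
    )


# Hypothetical toy data: one joint cluster label and one condition per cell.
clusters = ["T cell", "T cell", "T cell", "T cell",
            "B cell", "B cell",
            "tumor", "tumor", "tumor"]
conditions = ["control", "cancer", "control", "cancer",
              "control", "cancer",
              "cancer", "cancer", "cancer"]

composition = cluster_composition(clusters, conditions)
specific = condition_specific_clusters(composition)
print(specific)  # prints ['tumor']
```

A cluster flagged this way is a candidate for a condition-specific population, but it is worth confirming that its cells come from several samples and not just one, since a single-sample cluster may reflect sample noise.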
Thank you for your guidance. Just to make sure I understand correctly: you are saying to merge all 8 samples together, preprocess them together (filter low-count cells, normalize, etc.), and then integrate that merged object (with the 8 samples combined) using Harmony or scVI, and that this should preserve the cell type signal in each sample? The reason I ask is that integration works by finding cells with similar expression between conditions, using those as anchors, and then applying a correction factor to "integrate" the other cells. If you have a control and a condition sample that do not differ much biologically, would you not lose the biological signal that may help explain the difference between control and condition? I appreciate your initial response and any further guidance.