Question

Principal component analysis (PCA) of single-cell samples that come from different libraries

5.2 years ago

yueli7 ▴ 250

Hello,

I have 17 single cell samples which come from different libraries.

They comprise 8 healthy, 3 mild and 6 severe patient samples.

First, I want to run a principal component analysis (PCA) on these 17 samples to see whether the healthy samples cluster together.

Which package can I use?

Thanks in advance for any great help!

Best,

Yue

RNA-Seq • 3.5k views

ADD COMMENT • link updated 5.2 years ago by ATpoint 89k • written 5.2 years ago by yueli7 ▴ 250

score 1 · Answer 1 · 2020-05-24

5.2 years ago

asmitabioinfo2203 ▴ 20

You can use Seurat for this, but you should first merge all samples and then remove batch effects using Seurat's built-in CCA-based integration, after which you can apply the standard step-by-step pipeline. DESeq2 can also be helpful for this.

score 1 · Answer 2 · 2020-05-24

5.2 years ago

piyushjo ▴ 710

I would suggest using the FastMNN wrapper, which is also available for Seurat (through the SeuratWrappers package), to integrate the different samples. CCA, in my and others' experience, performs forceful integration. It is useful only if you know that all the libraries are biological/technical replicates.

score 1 · Answer 3 · 2020-05-24

First off, I am rather new to the scRNA-seq field myself, so feel free to question what I say. I comment on the situations I have encountered so far, sorrynotsorry for the wall of text that follows:

It all depends on what you eventually want to do. If you want to get an idea of whether batch effects are present and how strongly the samples separate by status (healthy, mild, severe), then you definitely should not use any integration procedure, but rather process every sample independently but identically. This would typically involve normalization, feature selection, merging of the selected features (= highly variable genes per sample) and then a (multibatch) PCA, maybe followed by a 2D visualization approach such as t-SNE or UMAP. Code suggestions for all this can be found, for example, in the scran package and the Bioconductor single-cell workflow. Seurat surely offers the same, but I am not a Seurat user so I cannot comment. Doing so will give you a visual idea of how your samples cluster. This should give an impression of whether dominant batch effects are present and/or whether the clustering is dominated by the status (healthy, mild, severe). Based on this you then need to decide how to continue, which depends on the question you want to answer.
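The steps described above (identical per-sample normalization, per-sample feature selection, merging of the selected features, then PCA without integration) can be sketched in a few lines. This is only a toy illustration in plain Python on simulated counts, not the scran, batchelor or Seurat API: every function name, the simulated data and the choice of power iteration for the PCA are assumptions made for the example. For real data you would use the packages mentioned above.

```python
import math
import random

def simulate_sample(rng, shifted_genes=(), n_cells=20, n_genes=20):
    """Hypothetical toy counts: two cell types per sample, optional shifted genes."""
    cells = []
    for i in range(n_cells):
        markers = range(0, 5) if i % 2 == 0 else range(5, 10)
        row = []
        for g in range(n_genes):
            mu = 20.0 if g in markers else 5.0
            if g in shifted_genes:
                mu += 10.0
            row.append(max(0, round(rng.gauss(mu, math.sqrt(mu)))))
        cells.append(row)
    return cells

def log_normalize(counts, scale=1e4):
    """Library-size normalize each cell (row), then apply log1p."""
    out = []
    for cell in counts:
        total = sum(cell) or 1
        out.append([math.log1p(c / total * scale) for c in cell])
    return out

def top_variable_genes(matrix, n_top):
    """Indices of the n_top genes with the highest variance within one sample."""
    n_cells, n_genes = len(matrix), len(matrix[0])
    variances = []
    for g in range(n_genes):
        col = [row[g] for row in matrix]
        m = sum(col) / n_cells
        variances.append(sum((x - m) ** 2 for x in col) / (n_cells - 1))
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return set(ranked[:n_top])

def pca(matrix, n_components=2, n_iter=100, seed=0):
    """Top components by power iteration, orthogonalized against earlier ones."""
    rng = random.Random(seed)
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    X = [[row[j] - means[j] for j in range(p)] for row in matrix]
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _ in range(n_iter):
            xv = [sum(r[j] * v[j] for j in range(p)) for r in X]
            w = [sum(X[i][j] * xv[i] for i in range(n)) for j in range(p)]
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w)) or 1.0
            v = [a / norm for a in w]
        components.append(v)
    scores = [[sum(r[j] * c[j] for j in range(p)) for c in components] for r in X]
    return scores, components

def merged_pca(samples, n_top=5):
    """Process samples identically, merge per-sample features, run one PCA."""
    normalized = {name: log_normalize(c) for name, c in samples.items()}
    features = sorted(set().union(
        *(top_variable_genes(m, n_top) for m in normalized.values())))
    labels, merged = [], []
    for name, m in normalized.items():
        for row in m:
            merged.append([row[g] for g in features])
            labels.append(name)
    scores, _ = pca(merged)
    return labels, scores, features

rng = random.Random(42)
samples = {
    "healthy_1": simulate_sample(rng),
    "healthy_2": simulate_sample(rng),
    "severe_1": simulate_sample(rng, shifted_genes=(10, 11, 12)),
}
labels, scores, features = merged_pca(samples)
```

Colouring the rows of `scores` by the sample names in `labels` (or by status) in a scatter plot then gives the visual impression discussed above. No batch correction is applied at any point, which is the point of this first look.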

I would say integration via FastMNN, CCA or any of the other available methods is not always beneficial and should be considered with care. From what I understand it is most useful for creating a unified clustering landscape in which all cells from all batches are embedded. This (again from what I understand) requires that the overall composition of the batches is rather similar in order to get robust results. It would probably (please correct me if wrong) be problematic if you have populations that are unique to one batch, as these populations might end up being forced into clusters that are not formed exclusively by them. Biological cluster heterogeneity might (in part) be lost when using these methods. Again, please correct me if this is wrong. Integration might be desirable if the batches are very similar in terms of composition but dominated by unwanted batch effects, such as samples from different days of library prep, different platforms or other less obvious effects. Still, I think integration becomes less desirable the more unique the clusters in each condition (in this case healthy, mild, severe) are.

That said, if you observe strong clustering by condition, and this is reproducible among replicates, so that all healthy, all mild and all severe samples each cluster rather closely together, then an integration in which each sample is considered a unique batch (which is the default of e.g. fastMNN) might not be optimal. Rather, it could be desirable to correct only for obvious batches such as the day of library prep. Again, it all depends on the question you want to answer.
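To make the point about the choice of batch variable concrete, here is a small sketch. It deliberately swaps in a much cruder correction than MNN or CCA, namely subtracting the per-group mean of a single simulated gene, so it only illustrates the logic and not how fastMNN behaves. The design (condition crossed with library-prep day), the effect sizes and all names are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(1)

# Hypothetical design: condition and library-prep day are crossed, not confounded.
samples = [
    ("healthy_d1", "healthy", "day1"),
    ("healthy_d2", "healthy", "day2"),
    ("severe_d1", "severe", "day1"),
    ("severe_d2", "severe", "day2"),
]
CONDITION_EFFECT = {"healthy": 0.0, "severe": 3.0}  # assumed biological signal
DAY_EFFECT = {"day1": 0.0, "day2": 2.0}             # assumed technical batch effect

cells = []
for name, condition, day in samples:
    for _ in range(50):
        value = (5.0 + CONDITION_EFFECT[condition] + DAY_EFFECT[day]
                 + rng.gauss(0, 0.1))
        cells.append({"sample": name, "condition": condition,
                      "day": day, "expr": value})

def center_by(cells, key):
    """Crude stand-in for batch correction: subtract each group's mean."""
    group_means = {}
    for group in {c[key] for c in cells}:
        group_means[group] = mean(c["expr"] for c in cells if c[key] == group)
    return [c["expr"] - group_means[c[key]] for c in cells]

def gap(values, cells, key, a, b):
    """Difference in mean corrected expression between groups a and b."""
    mean_a = mean(v for v, c in zip(values, cells) if c[key] == a)
    mean_b = mean(v for v, c in zip(values, cells) if c[key] == b)
    return mean_a - mean_b

per_sample = center_by(cells, "sample")  # every sample treated as its own batch
per_day = center_by(cells, "day")        # only the prep day treated as batch

condition_gap_per_sample = gap(per_sample, cells, "condition", "severe", "healthy")
condition_gap_per_day = gap(per_day, cells, "condition", "severe", "healthy")
day_gap_per_day = gap(per_day, cells, "day", "day2", "day1")

print("severe - healthy, corrected per sample:", round(condition_gap_per_sample, 3))
print("severe - healthy, corrected per day:   ", round(condition_gap_per_day, 3))
print("day2 - day1, corrected per day:        ", round(day_gap_per_day, 3))
```

In this toy setup, correcting per sample removes the condition difference together with the day effect, whereas correcting per prep day removes only the day effect. This works here because condition and day are crossed; if all severe samples had been prepared on one day, the two effects could not be separated this way.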

Still, to come back to the actual question: as said and linked above, code for PCA and t-SNE/UMAP can be found in various packages such as scran and in the Bioconductor single-cell workflow. Be sure to QC and check the samples individually before merging or integrating them, as clustering differences might (will) be lost upon (blind) integration.