principal component analysis (PCA) of single cell samples which come from different libraries
4.5 years ago
yueli7 ▴ 250

Hello,

I have 17 single cell samples which come from different libraries.

They comprise 8 healthy, 3 mild, and 6 severe patient samples.

First, I want to do the principal component analysis (PCA) of these 17 samples to see whether the healthy samples can cluster together.

Which package can I use?

Thanks in advance for any great help!

Best,

Yue

RNA-Seq • 2.7k views
4.5 years ago

You can use Seurat for this, but you should first merge all samples and then remove batch effects using Seurat's built-in CCA correction. After that you can follow the usual step-by-step pipeline. DESeq2 can also be helpful for this.

4.5 years ago
piyushjo ▴ 710

I would suggest using the fastMNN wrapper, which can also be used with Seurat (RunFastMNN() in the SeuratWrappers package), to integrate the different samples. CCA, in my and others' experience, performs forceful integration. It is useful only if you know that all the libraries are biological/technical replicates.

4.5 years ago
ATpoint 85k

First off, I am rather new to the scRNA-seq field myself, so feel free to question what I say. I am commenting on the situations I have encountered so far, sorrynotsorry for the wall of text that follows:

It all depends on what you eventually want to do. If you want to get an idea of whether batch effects are present and how strongly the samples separate by status (healthy, mild, severe), then you definitely should not use any integration procedure and should instead process every sample independently but identically. This would typically involve normalization, feature selection, merging of the selected features (= highly variable genes per sample), and then a (multibatch) PCA, maybe followed by a 2D visualization approach such as t-SNE or UMAP. Code suggestions for all this can be found, for example, in the scran package and the Bioconductor single-cell workflow. Seurat surely offers the same, but I am not a Seurat user so I cannot comment. Doing so will give you a visual idea of how your samples cluster, and an impression of whether dominant batch effects are present and/or whether the clustering is instead dominated by status (healthy, mild, severe). Based on this you will probably need to decide how to continue, depending on the question you want to answer.
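The steps described above (normalize each sample, pick highly variable genes per sample, merge the selections, run a PCA without any integration) can be sketched roughly as follows. This is only a minimal numpy illustration on made-up count matrices, not the scran/Bioconductor workflow itself: the sample names, the helper functions `log_normalize` and `top_variable_genes`, and the cutoff of 50 genes are all assumptions for the example, and it uses a plain PCA rather than a true multibatch PCA that weights batches equally.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normalize(counts, scale=1e4):
    """Library-size normalize each cell (row), then log1p-transform."""
    lib_size = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / lib_size * scale)

def top_variable_genes(logexpr, n_top):
    """Indices of the n_top genes with the highest variance across cells."""
    return np.argsort(logexpr.var(axis=0))[::-1][:n_top]

# Hypothetical toy data: 3 samples, each a (cells x genes) count matrix.
n_genes = 200
samples = {
    name: rng.poisson(lam=rng.gamma(2.0, 1.0, size=n_genes), size=(50, n_genes))
    for name in ["healthy_1", "mild_1", "severe_1"]
}

# Process every sample independently but identically.
normalized = {name: log_normalize(c) for name, c in samples.items()}
hvgs = [top_variable_genes(x, n_top=50) for x in normalized.values()]

# Merge the per-sample feature selections (union of highly variable genes).
features = np.unique(np.concatenate(hvgs))

# Stack cells from all samples WITHOUT any batch correction.
merged = np.vstack([x[:, features] for x in normalized.values()])
labels = np.repeat(list(normalized), [x.shape[0] for x in normalized.values()])

# PCA via SVD of the gene-centered matrix; keep the first 10 components.
centered = merged - merged.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :10] * s[:10]
explained = s**2 / np.sum(s**2)

# Where does each sample sit on PC1/PC2?
for name in normalized:
    print(name, pcs[labels == name, :2].mean(axis=0).round(2))
```

Plotting `pcs[:, 0]` against `pcs[:, 1]` colored by `labels` (or by status) is then what gives the visual impression of whether samples group by condition or by batch.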

I would say integration via fastMNN, CCA or any of the other available methods is not always beneficial and should be considered with care. From what I understand, it is most useful for creating a unified clustering landscape in which all cells from all batches are embedded. This (again, from what I understand) requires that the overall composition of the batches is rather similar in order to get robust results. It would probably (please correct me if I am wrong) be problematic if you have populations that are unique to one batch, as these populations might end up being forced into clusters that are not formed exclusively by them. Biological cluster heterogeneity might (in part) be lost when using these methods. Again, please correct me if this is wrong. Integration might be desirable if the batches are very similar in terms of composition but dominated by unwanted batch effects, such as samples from different days of library prep, different platforms, or other less obvious effects. Still, I think integration becomes less desirable the more unique the clusters in each condition (in this case healthy, mild, severe) are.

That said, if you observe strong clustering by condition, and this is reproducible among replicates, so that all healthy, all mild and all severe samples each cluster rather close together, then an integration in which each sample is considered a unique batch (which is the default of e.g. fastMNN) might not be optimal. Rather, it could be desirable to correct only for obvious batches such as the day of library prep. Again, it all depends on the question you want to answer.

Still, to come back to the actual question: as said and linked above, code for PCA and t-SNE/UMAP can be found in various packages such as scran and in the Bioconductor single-cell workflow. Be sure to QC and check the samples individually before merging or integrating them, as clustering differences might (will) be lost upon (blind) integration.


It is correct that any integration method will forcefully correct the differences across populations, as its intended function is to correct the "batch" effect; here the "batch" effect comes from disease.

In the present scenario, where the samples differ by condition (healthy vs. patient), you might end up with the patient samples overlapping with the healthy samples. This happens because the genes chosen as integration features (highly variable genes) still show a similar pattern, which drives the overlap. However, if the cell types have changed substantially, some algorithms (in my experience fastMNN) cluster them separately. I have seen this when integrating tumor samples with and without microglia/endothelial populations. While tumor cells from different patients integrated (even though they are different, they are still similar in some respects), the microglia cells (a unique population absent in some samples) clustered apart.
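One way to see why a population that exists in only one batch can stay separate under an MNN-style method is to look at the pair-finding step on its own. The sketch below is a toy illustration, not the fastMNN implementation (which works in PCA space and goes on to compute correction vectors): `mutual_nearest_pairs`, the two synthetic batches, and `k=5` are assumptions made up for the example.

```python
import numpy as np

def mutual_nearest_pairs(a, b, k=5):
    """Return (i, j) pairs where a[i] and b[j] are in each other's k nearest neighbours."""
    # Pairwise Euclidean distances between every cell in a and every cell in b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nn_of_a = np.argsort(dist, axis=1)[:, :k]    # for each a-cell: its k closest b-cells
    nn_of_b = np.argsort(dist, axis=0)[:k, :].T  # for each b-cell: its k closest a-cells
    return [
        (i, int(j))
        for i in range(a.shape[0])
        for j in nn_of_a[i]
        if i in nn_of_b[j]
    ]

rng = np.random.default_rng(1)

# Batch A: a shared population plus a population found only in batch A.
shared_a = rng.normal(0.0, 1.0, size=(40, 2))
unique_a = rng.normal(50.0, 1.0, size=(20, 2))
batch_a = np.vstack([shared_a, unique_a])

# Batch B: only the shared population, with a small batch shift.
batch_b = rng.normal(0.0, 1.0, size=(40, 2)) + 0.5

pairs = mutual_nearest_pairs(batch_a, batch_b, k=5)
paired_a = {i for i, _ in pairs}

print("shared A cells with a partner in B:", sum(i < 40 for i in paired_a))
print("unique A cells with a partner in B:", sum(i >= 40 for i in paired_a))
```

In this toy setup the cells of the batch-specific population find no mutual partner, because every cell in batch B has closer neighbours among the shared cells. That is consistent with the microglia example above, although real data are rarely this cleanly separated.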

Now, if you are interested in understanding what is happening to a particular population that clustered together but comes from different sources (healthy vs. patient), you can extract that population and do DGE by the source factor. There you will be able to see the difference.

The correction algorithm is for visualization only. It serves the purpose of identifying similar populations across samples that could still have transcriptional differences owing to the type of sequencing method used or to some biological variable (here it is disease, but it could also be differentiation). That doesn't mean that you are getting wrong "answers".


Now, if you are interested in understanding what is happening to a particular population that clustered together but comes from different sources (healthy vs. patient), you can extract that population and do DGE by the source factor. There you will be able to see the difference.

Yes, it is often not emphasized enough that the integrated values are only used for clustering, and that all other analyses, e.g. DE, are still performed on the unintegrated values.
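To make the point concrete, here is a rough sketch of comparing sources within one cluster using the uncorrected values. The cluster labels would come from the integrated analysis, but the expression values fed into the comparison are the unintegrated log-normalized ones. Everything here is hypothetical toy data, `log_fold_changes` is a made-up helper, and it only computes an effect size; a real DGE analysis needs a proper statistical test that accounts for replicate samples.

```python
import numpy as np

def log_fold_changes(logexpr, cluster, source, target_cluster, group_a, group_b):
    """Per-gene difference in mean log-expression (group_a - group_b) within one cluster."""
    in_cluster = cluster == target_cluster
    mean_a = logexpr[in_cluster & (source == group_a)].mean(axis=0)
    mean_b = logexpr[in_cluster & (source == group_b)].mean(axis=0)
    return mean_a - mean_b

rng = np.random.default_rng(2)
n_cells, n_genes = 120, 30

# Stand-in for UNINTEGRATED log-normalized expression (cells x genes).
logexpr = rng.normal(1.0, 0.2, size=(n_cells, n_genes))

# Cluster labels (as obtained after integration) and the source of each cell.
cluster = np.repeat(np.array([0, 1]), 60)
source = np.tile(np.array(["healthy", "patient"]), 60)

# Spike in a disease effect on gene 3, in cluster 0 only.
logexpr[(cluster == 0) & (source == "patient"), 3] += 2.0

# Extract cluster 0 and compare patient vs. healthy on the uncorrected values.
lfc = log_fold_changes(logexpr, cluster, source, 0, "patient", "healthy")
top = int(np.argmax(np.abs(lfc)))
print("gene with the largest shift in cluster 0:", top)
```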
