Question

How many PCs should be considered for downstream analyses?

4

4.8 years ago

bioinforesearchquestions ▴ 370

Hi All,

I have two groups WT and KO.

As per the Jackstraw plot, ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line).

How should I interpret the JackStraw plot? How come even PCs with a p-value of 1 are above the dashed line?
PC 5 has a p-value of 1. Should I use only the PCs with p-value < 0.05 (PCs 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15) for the downstream analyses?

As per the Elbow plot, it looks like the standard deviation bottoms out around PC 34 and stays constant from there on.
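A rough sketch of how the elbow could be read off numerically rather than by eye. The helper name `suggest_n_pcs` and its three cutoffs are assumptions made for illustration, not anything defined by Seurat; treat the result as a number to sanity-check against the plot, not as the answer.

```r
# Heuristic elbow finder (illustrative; the cutoffs are arbitrary assumptions).
# stdev: per-PC standard deviations, in PC order.
suggest_n_pcs <- function(stdev, cum_cutoff = 90, pct_cutoff = 5, drop_cutoff = 0.1) {
  stopifnot(is.numeric(stdev), length(stdev) >= 2)

  # Each PC's share of the summed standard deviations, in percent
  pct <- stdev / sum(stdev) * 100
  cum_pct <- cumsum(pct)

  # Criterion 1: first PC where the cumulative share exceeds cum_cutoff
  # and the PC's own share is below pct_cutoff (NA if no PC qualifies)
  first_flat <- which(cum_pct > cum_cutoff & pct < pct_cutoff)[1]

  # Criterion 2: the PC right after the last drop larger than drop_cutoff
  drops <- -diff(pct)
  big <- which(drops > drop_cutoff)
  last_drop <- if (length(big) > 0) max(big) + 1 else NA

  candidates <- c(first_flat, last_drop)
  candidates <- candidates[!is.na(candidates)]
  if (length(candidates) == 0) return(length(stdev))  # no elbow found: keep all PCs
  min(candidates)
}

# With a Seurat object (not run here), the input would come from the PCA reduction:
# suggest_n_pcs(Stdev(cond_integrated, reduction = "pca"))
```

The function only looks at where the curve flattens; it knows nothing about whether the later PCs carry biology, so it tends to suggest a lower bound.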

So how many PCs should I use for downstream steps such as FindNeighbors, FindClusters and RunUMAP?

cond_integrated <- FindNeighbors(object = cond_integrated, dims = ?)
cond_integrated <- FindClusters(object = cond_integrated)
cond_integrated <- RunUMAP(cond_integrated, reduction = "pca", dims = ?)

Each time I change the number of dimensions, I get a different UMAP and different clusters.

merged_cond <- merge(x = WT_seurat_obj, y = KO_seurat_obj, add.cell.ids = c("WT", "KO"))

# filtered merged_cond based on mitochondrial content, etc.
filtered_cond_seurat

# split seurat object by condition from filtered_cond_seurat

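for (i in seq_along(split_cond)) {
  split_cond[[i]] <- NormalizeData(split_cond[[i]])
  # s_genes / g2m_genes: placeholder cell-cycle gene vectors (not shown in the original post)
  split_cond[[i]] <- CellCycleScoring(split_cond[[i]], s.features = s_genes, g2m.features = g2m_genes)
  split_cond[[i]] <- SCTransform(split_cond[[i]])
}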
Obtained integ_features from SelectIntegrationFeatures using split_cond seurat object
Obtained anchor features using PrepSCTIntegration
Obtained integ_anchors using FindIntegrationAnchors and SCT normalization method
Obtained cond_integrated seurat object using IntegrateData
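The four steps described above correspond roughly to the following Seurat (v3-style) calls. This is a sketch, not the poster's actual code: the wrapper name `integrate_sct` and the `nfeatures = 3000` value are assumptions.

```r
# Sketch of SCT-based integration of a list of Seurat objects.
# split_cond: list of objects that have already been through SCTransform.
# nfeatures = 3000 is an assumed value, not taken from the post.
integrate_sct <- function(split_cond, nfeatures = 3000) {
  # Pick features that are variable across the objects in the list
  integ_features <- Seurat::SelectIntegrationFeatures(
    object.list = split_cond, nfeatures = nfeatures
  )
  # Prepare the SCT assays so that anchors can be found on these features
  split_cond <- Seurat::PrepSCTIntegration(
    object.list = split_cond, anchor.features = integ_features
  )
  # Find anchors between the objects, using SCT normalization
  integ_anchors <- Seurat::FindIntegrationAnchors(
    object.list = split_cond,
    normalization.method = "SCT",
    anchor.features = integ_features
  )
  # Integrate and return a single object
  Seurat::IntegrateData(anchorset = integ_anchors, normalization.method = "SCT")
}

# Usage (not run here):
# cond_integrated <- integrate_sct(split_cond)
```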

cond_integrated <- RunPCA(object = cond_integrated)

DimHeatmap(cond_integrated, dims = 1:15, cells = 500, balanced = TRUE)

cond_integrated <- JackStraw(cond_integrated, num.replicate = 100, dims=50)

cond_integrated <- ScoreJackStraw(cond_integrated, dims = 1:50)

JackStrawPlot(cond_integrated, dims = 1:50)

ElbowPlot(object = cond_integrated, ndims = 50)

PC-heatmaps

Jackstraw-Plot

Elbowplot

scRNAseq PCA UMAP Clustering RNAseq • 12k views

ADD COMMENT • link updated 4.2 years ago by Eugene A ▴ 190 • written 4.8 years ago by bioinforesearchquestions ▴ 370

0

I'd like to revive this thread. I have a situation where a lower number of PCs seems to give me more "biologically relevant" results. Does that justify using a lower number of PCs?

My setup is as follows: several time points of a cell differentiation protocol, each prepared as a separate library. (I know this is far from ideal. It was done this way because the wet-lab protocol is complicated, and it should not prevent me from analysing each time point individually first and then trying to connect the time points using the biological prior knowledge obtained.)

I'm running UMAP dimensionality reduction on a subset of my data to see the overall structure. I've noticed that a low number of PCs (5) gives better time point-to-time point clustering than a higher number of PCs (15). That is probably due to the batch effect being amplified with higher numbers of PCs. Would it be meaningful in that case to use a low number of PCs? And perhaps additionally cluster each individual time point with a higher number of PCs later on?

Any advice would be appreciated.

Best, Eugene

Images: Elbow plot, 5 PCs, 15 PCs

ADD REPLY • link 4.2 years ago by Eugene A ▴ 190

0

Please do not add new questions as an answer to existing threads. If you feel your question is unique then create a new post for it.

ADD REPLY • link 4.2 years ago by GenoMax 148k

0

I'm running UMAP dimensionality reduction on a subset of my data to see the overall structure. I've noticed that a low number of PCs (5) gives better time point-to-time point clustering than a higher number of PCs (15).

Not sure what you are looking at to come up with this assessment, but the eyeball test says that clustering is much better with 15 PCs. Not only are clusters better separated globally, but red and blue are better separated locally, as are cyan and magenta groups.

ADD REPLY • link 4.2 years ago by Mensur Dlakic ★ 28k

0

The point here is that I have some prior knowledge of what these cells are. Since these populations lie along a differentiation trajectory, it is safe to assume that consecutive days should be closer to one another than more distant time points. That is exactly what I see with 5 PCs. With 15, on the other hand, all time points are just scattered across the UMAP components.

ADD REPLY • link 4.2 years ago by Eugene A ▴ 190

1

And my point is that they are not supposed to be arranged in any kind of trajectory that mimics their differentiation pattern. They are supposed to be well separated, which they are with 15 PCs. You are expecting too much from dimensionality reduction if you think that it is going to recapitulate the differentiation pattern.

There aren't 6 expected clusters with 5 PCs - there are 4 at most. If you didn't know their colors ahead of time, there is no way you'd be able to come up with a correct number of clusters. On the other hand, 15 PCs is much more informative regarding real clusters, though I wouldn't necessarily guess 6 either if all dots were uniformly colored.

ADD REPLY • link 4.2 years ago by Mensur Dlakic ★ 28k

score 7 · Accepted Answer · 2020-02-20

When using SCTransform, this matters somewhat less, as it tends to be more robust and handle noise better. As such, you can provide a lot of PCs without introducing undue variation. I generally start with 30, but have gone up to 50 and noticed little difference. The authors generally recommend using more PCs than in the standard workflow, for reasons outlined in the SCTransform vignette:

Why can we choose more PCs when using sctransform?

In the standard Seurat workflow we focus on 10 PCs for this dataset, though we highlight that the results are similar with higher settings for this parameter. Interestingly, we’ve found that when using sctransform, we often benefit by pushing this parameter even higher. We believe this is because the sctransform workflow performs more effective normalization, strongly removing technical effects from the data.

Even after standard log-normalization, variation in sequencing depth is still a confounding factor (see Figure 1), and this effect can subtly influence higher PCs. In sctransform, this effect is substantially mitigated (see Figure 3). This means that higher PCs are more likely to represent subtle, but biologically relevant, sources of heterogeneity – so including them may improve downstream analysis.

In addition, sctransform returns 3,000 variable features by default, instead of 2,000. The rationale is similar, the additional variable features are less likely to be driven by technical differences across cells, and instead may represent more subtle biological fluctuations. In general, we find that results produced with sctransform are less dependent on these parameters (indeed, we achieve nearly identical results when using all genes in the transcriptome, though this does reduce computational efficiency).

In short, use more than you think you need and try not to overthink it.
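One way to check the "little difference" claim on your own data is to cluster with two settings and compare the labels. A minimal sketch, assuming the default PCA with 50 components: `cluster_labels` and `adjusted_rand_index` are hypothetical helper names, and the adjusted Rand index is just one possible agreement score (1 means identical partitions, values near 0 mean chance-level agreement).

```r
# Cluster a Seurat object using the first n_pcs PCs and return the labels.
# resolution = 0.8 is the FindClusters default; Seurat is only needed when called.
cluster_labels <- function(obj, n_pcs, resolution = 0.8) {
  obj <- Seurat::FindNeighbors(obj, reduction = "pca", dims = 1:n_pcs, verbose = FALSE)
  obj <- Seurat::FindClusters(obj, resolution = resolution, verbose = FALSE)
  Seurat::Idents(obj)
}

# Adjusted Rand index between two label vectors over the same cells (base R only).
adjusted_rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)                           # contingency table of the two labelings
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))                # pairs grouped together in both
  sum_a <- sum(choose(rowSums(tab), 2))        # pairs grouped together in a
  sum_b <- sum(choose(colSums(tab), 2))        # pairs grouped together in b
  expected <- sum_a * sum_b / choose(n, 2)     # expected agreement under chance
  max_index <- (sum_a + sum_b) / 2
  if (max_index == expected) return(1)         # both partitions are trivial
  (sum_ij - expected) / (max_index - expected)
}

# Usage with the object from the question (not run here):
# labels_30 <- cluster_labels(cond_integrated, 30)
# labels_50 <- cluster_labels(cond_integrated, 50)
# adjusted_rand_index(labels_30, labels_50)
# cond_integrated <- RunUMAP(cond_integrated, reduction = "pca", dims = 1:30)
```

If the score stays close to 1 between 30 and 50 PCs, the exact choice is unlikely to matter much; a low score suggests looking at which clusters split or merge before settling on a value.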