What is a suitable metric to compute a cell-to-cell distance matrix?
2.9 years ago
Gabriel ▴ 170

In Seurat and scran I have noticed that they use SNN or KNN graphs to find the nearest neighbours of cells for clustering, data integration, etc. (e.g. scran's makeSNNGraph).

I have seen mentions of Euclidean, Jaccard, and rank-based weights (applied to the ranks within the nearest neighbourhood), but how are those ranks themselves calculated, and which distance metrics are better suited to single-cell data?

I have seen it calculated from the reduced dimensions, by simply applying some distance metric to the PCA scores. For example:

mat <- reducedDim(sce, "PCA")                 # PCA scores of a SingleCellExperiment
distance <- dist(mat, method = "euclidean")   # cell-to-cell Euclidean distances
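To make the SNN idea concrete, here is a small base-R sketch of how a shared-nearest-neighbour weight can be computed from such a distance matrix. This only mimics the idea behind scran/bluster's makeSNNGraph (with Jaccard-style weighting), not its actual implementation; the data and `k` are made up:

```r
# Toy sketch of a shared-nearest-neighbour (SNN) weight, computed by hand.
set.seed(42)

# 20 cells in a 10-dimensional "PC" space: two well-separated groups
pcs <- rbind(matrix(rnorm(100, mean = 0), nrow = 10),
             matrix(rnorm(100, mean = 5), nrow = 10))

k <- 5
d <- as.matrix(dist(pcs, method = "euclidean"))

# k nearest neighbours of each cell (position 1 is the cell itself, so skip it)
knn <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))

# SNN edge weight between two cells: Jaccard index of their neighbour sets
snn_jaccard <- function(i, j) {
  shared <- length(intersect(knn[i, ], knn[j, ]))
  shared / length(union(knn[i, ], knn[j, ]))
}

snn_jaccard(1, 2)    # same group: neighbour sets tend to overlap
snn_jaccard(1, 15)   # different groups: no shared neighbours here
```

Cells whose neighbour sets overlap get a strong edge; cells with no shared neighbours get weight 0, which is what makes the SNN graph robust to the exact choice of distance.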
seurat KNN scran SNN adjacency
2.9 years ago

I don't think there's a right answer. A distance measure implies a certain notion of similarity between the cells, and which notion of similarity is relevant can vary with the context.

In addition, some measures have properties that make them more or less suitable to certain contexts. For example, many distance measures suffer from the concentration phenomenon, by which, in noisy high-dimensional spaces, the distance tends towards a constant, with differences in observed values being essentially random. This can render nearest neighbours in such spaces meaningless, and it is one of the reasons for applying dimensionality-reduction methods. However, some dimensionality-reduction methods produce an outcome in which distances don't preserve the same notion of similarity as in the original space.

Choosing a distance measure is also subject to technical considerations, such as whether the data is discrete, normalized, skewed, or has outliers. To judge whether a distance measure is suitable you need external information: do cells already known to be similar (for whatever notion of similarity you are interested in) have a short distance? And how would you evaluate the outcome if two different measures give two different clusterings? On the other hand, when there is a strong structure in the data, multiple, if not most, approaches should find it.
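The concentration phenomenon is easy to see with simulated data: as dimensionality grows, the relative contrast between the nearest and farthest point shrinks. A small base-R illustration (the point counts and dimensions here are arbitrary):

```r
# Relative contrast (d_max - d_min) / d_min between one query point and the
# rest. This shrinks as dimensionality grows: the "concentration" effect.
set.seed(1)

relative_contrast <- function(n_points, n_dims) {
  x <- matrix(runif(n_points * n_dims), nrow = n_points)
  d <- as.matrix(dist(x))[1, -1]   # distances from point 1 to all others
  (max(d) - min(d)) / min(d)
}

relative_contrast(100, 2)      # low dimension: large spread of distances
relative_contrast(100, 1000)   # high dimension: distances nearly identical
```

With nearly identical distances, the identity of the "nearest" neighbour is essentially decided by noise, which is the problem described above.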


For documentation: I tried several configurations and found one that seems to work well.

Pseudocode:

raw counts ->
logNormCounts (log-normalize with scran) ->
select variable features shared between datasets (similar to Seurat's anchor finding) among the top 3000-5000 genes ->
PCA using prcomp (possibly removing batch-associated PCs?) ->
Spearman correlation between cells on the first 50-100 PCs
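The pipeline above can be sketched end-to-end in base R on simulated counts. Note that the normalisation here is a simple library-size log-scaling standing in for scran's logNormCounts, and the feature-selection step just keeps the most variable genes rather than doing any cross-dataset anchoring:

```r
set.seed(7)

# Simulated counts: 200 genes x 60 cells, with a block of differential
# genes separating cells 1-30 from cells 31-60
counts <- matrix(rpois(200 * 60, lambda = 5), nrow = 200)
counts[1:40, 31:60] <- counts[1:40, 31:60] + rpois(40 * 30, lambda = 15)

# 1. Library-size log-normalisation (stand-in for scran's logNormCounts)
logn <- log1p(t(t(counts) / colSums(counts)) * 1e4)

# 2. Keep the most variable genes (stand-in for shared-feature selection)
hvg <- order(apply(logn, 1, var), decreasing = TRUE)[1:100]

# 3. PCA with prcomp (cells as rows, genes as columns)
pca <- prcomp(t(logn[hvg, ]), rank. = 20)

# 4. Spearman correlation between cells on the PCs, turned into a distance
rho  <- cor(t(pca$x), method = "spearman")
dmat <- as.dist(1 - rho)
```

The resulting `dmat` can be fed to `hclust`, or thresholded into a graph. On this simulation, cells within the same group end up closer to each other than to cells of the other group.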


Pearson correlation gives weight to outlier PCs, which could be a problem if a PC is due to technical noise. I also considered pooling cells for more robust correlations; however, pooling itself requires a distance metric, and PCA already acts as a noise-reduction method.
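That outlier sensitivity is easy to demonstrate with made-up PC scores: two cells that disagree on most PCs but share one extreme value (e.g. a technical-noise component) get a high Pearson correlation, while Spearman still reflects the disagreement:

```r
# Two cells' scores on 10 hypothetical PCs: they disagree on PCs 1-9 but
# share one extreme value on PC 10.
a <- c(1:9, 100)
b <- c(9:1, 100)

cor(a, b)                        # Pearson: strongly positive, dominated by the outlier
cor(a, b, method = "spearman")   # Spearman: negative, reflecting the disagreement
```

Because Spearman works on ranks, the outlier PC contributes no more than any other PC, which is the property exploited in the pipeline above.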

