UMAP and "equal" objects
3.5 years ago

I want to plot a very large dataset. UMAP works quite well with this type of data (not single-cell expression, but similar). However, I have a couple of clusters of exactly identical objects: within each cluster, the distance between every pair of objects is 0. UMAP draws each of these huge clusters as an "outlier" dot, even though the objects are not that dissimilar to the rest of the data.

I can replace each of these clusters with a single representative, but is there an alternative way to visualize them with UMAP so that they are not plotted as a dot very far from the other dots?
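For reference, the "one representative per cluster" idea can be sketched as below. This is only an illustration in plain Python: the helper names `collapse_duplicates` and `expand_embedding` are made up for this example, and the UMAP call in the comments assumes the `umap-learn` package is installed. For a genuinely large matrix, a vectorized route (for example NumPy's `np.unique` with `axis=0` and `return_inverse=True`) would likely be faster.

```python
def collapse_duplicates(rows):
    """Return (unique_rows, inverse, counts) for exactly identical rows.

    inverse[i] is the index in unique_rows of the representative of rows[i];
    counts[j] is how many original rows collapsed onto unique_rows[j].
    """
    index_of = {}      # row contents (as a tuple) -> position in unique_rows
    unique_rows = []
    inverse = []
    counts = []
    for row in rows:
        key = tuple(row)
        if key not in index_of:
            index_of[key] = len(unique_rows)
            unique_rows.append(list(row))
            counts.append(0)
        j = index_of[key]
        inverse.append(j)
        counts[j] += 1
    return unique_rows, inverse, counts


def expand_embedding(unique_embedding, inverse):
    """Give every original row the coordinates of its representative."""
    return [unique_embedding[j] for j in inverse]


# Toy data: rows 0, 2 and 3 are identical.
rows = [[0.0, 1.0], [2.0, 3.0], [0.0, 1.0], [0.0, 1.0]]
unique_rows, inverse, counts = collapse_duplicates(rows)

# Hypothetical next step on real data, assuming umap-learn is installed:
#   import umap
#   unique_embedding = umap.UMAP().fit_transform(unique_rows)
#   embedding = expand_embedding(unique_embedding, inverse)
# `counts` could then set the marker size, so a collapsed cluster still
# looks as large as it really is.
```

Whether embedding only the representatives also stops UMAP from pushing them away from the rest of the data is something to check on the actual dataset; the sketch only shows the bookkeeping.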

visualization • 1.6k views
3.5 years ago

You probably need to play with the parameters. Check these papers to get an idea of where you could focus your efforts:

3.5 years ago
Mensur Dlakic ★ 28k

It depends on your definition of a large dataset. I have used openTSNE with 20-30 CPUs on a 100,000 x 136 dataset, and it does the embedding in ~25 minutes. Even though this implementation of t-SNE is not as fast as UMAP, it is fast enough that using t-SNE should not be a problem even on datasets with a million data points, as long as the number of features is not in the thousands.

I am curious how you define the distance between your vectors to be 0. UMAP is not supposed to separate (near-)identical data points at all, no matter what parameters are used.


Thanks! The problem with the nearly identical samples is not that they are not separated, but that they are placed very far away from the general distribution, even though conceptually they are not very far from it. I am fine with them being placed as one dot, as long as it sits within the general distribution of the data. They are not actual outliers, but since UMAP looks for local similarities, it seems to "push" these huge clusters of identical objects as far away as possible.


Just for information, openTSNE is from (some of) the same people as the papers I linked to above.

3.5 years ago
James • 0

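You might try using the linear-correlation distance instead of the Euclidean distance: the correlation distance effectively mean-centers each vector and scales it to unit length before comparing, so only the shape of the profile matters, not its magnitude.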
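As a rough illustration of what this distance measures, here is a minimal plain-Python sketch of the usual definition, 1 minus the Pearson correlation. The function name `correlation_distance` is made up for this example. In `umap-learn`, the corresponding option should be `metric="correlation"` passed to `umap.UMAP`, but check the documentation of the version you have installed.

```python
import math


def correlation_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Mean-center each vector, then compare directions (cosine similarity).
    cx = [v - mean_x for v in x]
    cy = [v - mean_y for v in y]
    norm_x = math.sqrt(sum(v * v for v in cx))
    norm_y = math.sqrt(sum(v * v for v in cy))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    dot = sum(a * b for a, b in zip(cx, cy))
    return 1.0 - dot / (norm_x * norm_y)


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]   # same shape as a, ten times the magnitude
c = [3.0, 2.0, 1.0]      # reversed trend

d_same_shape = correlation_distance(a, b)   # close to 0
d_opposite = correlation_distance(a, c)     # close to 2
```

Note that vectors that are identical stay at distance 0 under this metric as well, so it changes how the duplicate clusters relate to the rest of the data rather than separating the duplicates themselves.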
