Perplexity value in the t-SNE algorithm
4
0
7.0 years ago
einatshu • 0

Hi! I have a matrix of samples and their gene expression values, and I want to apply the t-SNE algorithm (in R) to that data, but I'm not sure which perplexity value I should use. I read that this value is very important for t-SNE, and I want to get the best performance on my data.

Is there any way to dynamically determine the right perplexity value for a given dataset? Thanks

gene RNA-Seq R t-sne • 10k views
2
7.0 years ago

From the algorithm's author (see FAQ section):

How should I set the perplexity in t-SNE?
The performance of t-SNE is fairly robust under different settings of the perplexity. The most appropriate value depends on the density of your data. Loosely speaking, one could say that a larger / denser dataset requires a larger perplexity. Typical values for the perplexity range between 5 and 50.

What is perplexity anyway?
Perplexity is a measure for information that is defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with k sides is equal to k. In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors k that is employed in many manifold learners.
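
To make the definition above concrete, here is a minimal Python sketch (the thread itself is about R; Python is used here only for illustration) that computes perplexity as 2 raised to the Shannon entropy in bits. The function name `perplexity` and the loaded-die probabilities are made up for this example.

```python
import math


def perplexity(probs):
    """Return 2**H(probs), where H is the Shannon entropy in bits."""
    # Zero-probability outcomes contribute nothing to the entropy.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy


# A fair six-sided die: every outcome is equally likely.
fair_die = [1 / 6] * 6
# A hypothetical loaded die: one face comes up half of the time.
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

print(perplexity(fair_die))    # close to 6, the number of sides
print(perplexity(loaded_die))  # below 6: fewer "effective" outcomes
```

The fair die comes out at (numerically) 6, matching the "k sides gives perplexity k" statement, while the loaded die behaves like a die with fewer effective sides. In t-SNE the same quantity is computed over each point's neighbor probabilities, which is why it can be read as an effective number of neighbors.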

There's also this paper on automatic selection of t-SNE perplexity.

2
7.0 years ago

Though not super in-depth, this post is a good primer: How to Use t-SNE Effectively

1
4.3 years ago
Renesh ★ 2.2k

In most implementations, the default value of the perplexity parameter in t-SNE is 30. The usual range for perplexity is 10 to 100, but you can change this value based on the size of your dataset. In the context of scRNA-seq, setting the perplexity to 1% of the sample size (number of cells) can help preserve the global geometry; in this article, you can see how to use the perplexity parameter for a very large dataset (say, millions of cells). Other parameters, such as the number of iterations, the learning rate, and the early exaggeration factor, also affect the visualization and should be tuned for larger datasets. I have covered this in more detail, with an example, here: https://reneshbedre.github.io/blog/tsne.html
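
As a rough illustration of the 1% rule of thumb, the Python sketch below picks the larger of the default (30) and 1% of the sample count. Both the helper name `suggest_perplexity` and the way the two numbers are combined are assumptions made for this example, not something taken from a library. The upper cap of (n - 1) / 3 is also an assumption: it mirrors the input check in the R package Rtsne, and other implementations may enforce a different limit.

```python
def suggest_perplexity(n_samples, default=30.0, fraction=0.01):
    """Hypothetical helper: a starting perplexity for n_samples points.

    Takes the larger of `default` and `fraction * n_samples`, then caps
    the result so that 3 * perplexity does not exceed n_samples - 1
    (assumed limit, modeled on the check in the Rtsne package).
    """
    if n_samples < 5:
        raise ValueError("too few samples for a meaningful perplexity")
    upper = (n_samples - 1) / 3
    candidate = max(default, fraction * n_samples)
    return min(candidate, upper)


for n in (50, 500, 10_000, 1_000_000):
    print(n, suggest_perplexity(n))
```

For small datasets the cap dominates, for mid-sized ones the default is returned, and only beyond a few thousand cells does the 1% term take over. Treat the result as a starting point to inspect visually, not as an optimum.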

0
7.0 years ago

In practice, I think you should just try a few values and see which one(s) "make sense". It's a hand-waving suggestion, of course, but after all, t-SNE is a visualization method, so there isn't really a correct or optimal way of doing it; there is no statistical hypothesis or test involved. It depends on what you want to show... (The paper linked by Jean-Karim looks interesting, though I haven't read it yet.)
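
If you go the "try a few values" route, one simple way to choose them is to space the candidates geometrically, so that small perplexities are sampled more finely than large ones. The Python sketch below only builds the list; the helper name `perplexity_grid` is hypothetical, and the default bounds of 5 and 50 are just the typical range quoted from the author's FAQ in another answer. Each value would then be passed to the perplexity argument of whichever t-SNE implementation you use, and the resulting plots compared by eye.

```python
def perplexity_grid(low=5.0, high=50.0, steps=4):
    """Hypothetical helper: `steps` geometrically spaced perplexities."""
    if steps < 2 or not 0 < low < high:
        raise ValueError("need steps >= 2 and 0 < low < high")
    # Constant ratio between consecutive values (log-spaced grid).
    ratio = (high / low) ** (1 / (steps - 1))
    return [round(low * ratio ** i, 1) for i in range(steps)]


print(perplexity_grid())
print(perplexity_grid(low=10, high=100, steps=3))
```

Running t-SNE once per candidate with a fixed random seed keeps the comparison fair, since the layout also changes from run to run for reasons unrelated to perplexity.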

