I am trying to generate a Uniform Manifold Approximation and Projection (UMAP) for about 300,000 observations of 36 variables. So far I have been using the umap package in R, but this seems prohibitively slow (for exploratory analyses and parameter optimisation). Can someone recommend a faster alternative?
Thank you!
There are streaming single-GPU implementations of t-SNE like https://github.com/ZJULearning/AtSNE, but even those, running on recent NVIDIA chips like the Titan RTX, take several days on an input matrix with 3 million rows in 50 dimensions.
Any advice you can offer on how you run the reduction of a dataset with 3 million rows in a few hours would be hugely valuable. How many dimensions do your points have?
Hey Douglas, thank you for inquiring. There was nothing overtly special about it... I just installed umap in R on an HPC at King's College London and ran it overnight via a PBS script specifying a considerable amount of RAM and 2 cores. I don't know the system architecture used, but I guess it was CentOS. I think that the biggest problem facing some people may be patience...?
I need to explore the other UMAP implementations mentioned in this thread on Biostars, though. It is becoming a limiting step for 'Desktop' analysis workflows that need to use UMAP.
Any advice you can offer on how you run the reduction of a dataset with 3 million rows in a few hours would be hugely valuable.
Since you are asking for any advice, here is one: consider using UMAP under Python. I used to run UMAP very frequently on a dataset that is approximately 300 x 2 million. With 8 CPUs, that takes 3-4 days. Your dataset will presumably take less time since the overall matrix size is roughly 4 times smaller. I am not saying it will run in a few hours as you'd like, but that is not very realistic for a dataset with millions of data points no matter what embedding method you are using, and no matter your computational power.
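In case it helps, the basic Python call is short. A minimal sketch, assuming the umap-learn package (the data matrix here is a random placeholder standing in for your 300,000 x 36 input):

    # Minimal umap-learn usage (pip install umap-learn).
    import numpy as np
    import umap

    X = np.random.rand(300_000, 36).astype(np.float32)  # placeholder for your data

    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
    embedding = reducer.fit_transform(X)  # (n_samples, 2) array, ready to plot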
Another thing to try, if speed is your main concern, is LargeVis. That will run on the same 300 x 2 million dataset in 5-6 hours (Linux executable with 8 CPUs). To my eye the UMAP plots are cleaner and more aesthetically pleasing, but LargeVis will be ahead by an order of magnitude in terms of speed. The algorithm is described here.
By the way, what do you mean exactly by prohibitively slow? Many hours? Days?
Good question. Prohibitive only for exploratory analyses (hours). I'd like to see how the output changes with adjustments to parameters like the number of neighbors.
I suspected that our definitions of prohibitively slow may be different. A common set of parameters to test is described here. If you go with 6 n_neighbors values (5, 10, 20, 50, 100, 200) and 4 min_dist values (0.25, 0.5, 0.8, 0.99), that should be doable in under a week.
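A sketch of that grid as a Python loop, assuming umap-learn (the parameter values are the ones listed above; the output file names are illustrative):

    import numpy as np
    import umap

    X = np.random.rand(300_000, 36).astype(np.float32)  # placeholder for your data

    for n_neighbors in (5, 10, 20, 50, 100, 200):
        for min_dist in (0.25, 0.5, 0.8, 0.99):
            emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist).fit_transform(X)
            # keep each embedding on disk for later side-by-side plotting
            np.save(f"umap_nn{n_neighbors}_md{min_dist}.npy", emb)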
Honestly, I don't see a problem unless you are on some kind of a deadline. Even if you want to do 50-100 trials that each take a few hours, what is a couple of weeks to find out what works? Besides, UMAP doesn't do any clustering; it only provides a low-dimensional visualization from which clusters are often derived, automatically or manually. That means most embeddings will show cluster structure if there is one, so hunting for the set of parameters that gives the absolute best visualization may not be necessary.
I will repeat: t-SNE will do a decent job on this dataset as well, and it really has only one parameter, perplexity, that may need tuning. I'd be surprised if you need to do more than 3 different perplexity values (say, 20, 30 and 50), and that will also run in no more than 3-4 days for a dataset of this size.
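A sketch of that perplexity sweep, using scikit-learn's t-SNE as one possible implementation (faster options such as openTSNE exist; values are the ones suggested above):

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(300_000, 36).astype(np.float32)  # placeholder for your data

    for perplexity in (20, 30, 50):
        emb = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(X)
        np.save(f"tsne_perp{perplexity}.npy", emb)  # save each run for comparison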
Thank you for the great suggestions! Any idea why running UMAP with n_neighbors = 200 on df (where df = a 30,000-row x 36-column dataframe) would produce an error? Lower n_neighbors values do not yield this error.
You're running out of memory. Lower n_neighbors values require less memory. I personally feel that 200 is much more than necessary; I don't think I've ever seen anyone use more than 50, as you start to lose a lot of finer-grained local structure beyond that point. This is particularly the case if you expect some of your cell types of interest to be quite closely related.
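If you can use Python's umap-learn, two knobs ease memory pressure: a lower n_neighbors, and the low_memory flag (present in umap-learn 0.4 and later; check your installed version). A hedged sketch:

    import numpy as np
    import umap

    X = np.random.rand(30_000, 36).astype(np.float32)  # placeholder for the 30,000 x 36 dataframe

    # low_memory trades some speed for a smaller peak memory footprint;
    # it exists in umap-learn >= 0.4, so verify your version first.
    reducer = umap.UMAP(n_neighbors=50, low_memory=True)
    emb = reducer.fit_transform(X)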
You can also subset your data for exploratory analysis. I would argue that after 50,000 points, you won't really be able to see any differences anyway. There are only so many dots you can put on a plot before they simply start overlapping the previous dots.
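A sketch of that idea in Python, assuming umap-learn (subset size and parameters are illustrative):

    import numpy as np
    import umap

    X = np.random.rand(300_000, 36).astype(np.float32)  # placeholder for your data

    rng = np.random.default_rng(seed=0)
    idx = rng.choice(X.shape[0], size=50_000, replace=False)  # random 50k-row subset
    emb_small = umap.UMAP(n_neighbors=30).fit_transform(X[idx])
    # once the parameters look good, rerun on the full X with the chosen settings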
This is a valid approach ONLY if the subset is representative of the whole dataset. That is, if it is drawn from the same data distribution. Drawing the subset randomly may or may not fulfill this criterion - that's the nature of random sampling. It comes down to how important it is to do this right vs. time constraints.
To clarify, the suggestion was to experiment with different parameters using a small subset. Then apply the optimal settings to the full dataset. That will answer the question of how representative the subset was.
I personally would be curious to see an example using real biological data with over 100k samples where a 10% random subset generates a substantially different UMAP representation.
This does not sound right. I use Python UMAP very often on a dataset with 1 million data points and about 300 variables. It takes 1-2 hours on a modern multi-core PC.
If this is something that is dataset-dependent and UMAP is truly slow with your setup, I would suggest that you try it in Python before doing anything else. Sure, reducing the number of variables by PCA or by other means will speed things up, but it may take away important signal if PCA recovers less than 100% of the variance. Also, t-SNE should work fine on a dataset of this size, though it will take half a day or so.
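If you do try PCA first, it is worth checking how much variance survives the reduction. A sketch with scikit-learn (the component count is illustrative):

    import numpy as np
    import umap
    from sklearn.decomposition import PCA

    X = np.random.rand(300_000, 36).astype(np.float32)  # placeholder for your data

    pca = PCA(n_components=20).fit(X)
    print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
    emb = umap.UMAP().fit_transform(pca.transform(X))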
Hello all, UMAP is slow for millions of data points or more because it is limited by NN-Descent (single-threaded). uwot is faster for the nearest-neighbor search, but its embedding step is still single-threaded. Check out annembed, which Jean and I developed; we parallelize everything, and it can run 11 million data points in just 40 minutes on 24 threads. https://github.com/jean-pierreBoth/annembed
I don't think you will get much traction posting this in a 4-year-old thread. You may want to create a separate post and highlight the strengths of your program.
I've used umap in R for datasets of >3 million cells. It takes a few hours on a cluster environment.
@KevinBlighe could you give a few notes on how your R setup distributes the compute load over several nodes / cores?
To the best of my knowledge, UMAP still only runs on a single host: https://github.com/lmcinnes/umap/issues/37
If you have a CUDA-capable GPU available, you can try gpumap.
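I have not verified the interface myself, but if gpumap mirrors umap-learn as a drop-in replacement, usage would look roughly like this (the class and argument names are assumptions; check the gpumap README):

    # Unverified sketch: assumes gpumap exposes a drop-in replacement for
    # the umap-learn interface. Confirm the actual class and argument
    # names in the gpumap documentation before relying on this.
    import numpy as np
    import gpumap

    X = np.random.rand(300_000, 36).astype(np.float32)  # placeholder for your data
    emb = gpumap.GPUMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)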