As you point out, and contrary to what many people believe, PCA and t-SNE are dimensionality reduction techniques rather than clustering techniques. The fact that we can often "see" clusters in 2D or 3D representations produced by PCA and t-SNE means that there is internal structure in the data, but it does not automatically translate into a clustering. In that sense, both are primarily visualization tools.
If your intent is to rigorously cluster data, especially based on distances, it should be done either on the original data or on data from which non-informative features have been eliminated. Sometimes it helps to discretize the data before clustering, for example by minimum description length (MDL) binning; that technique also has the side effect of removing non-informative features. We always have to keep in mind that dimensionality reduction removes both information and noise from the original dataset.
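To make the "cluster on the original, noise-filtered data" idea concrete, here is a minimal sketch of my own (not from any cited source): it drops near-constant features with a simple variance threshold (a crude stand-in for MDL binning, which would need a dedicated package) and then runs K-means on the scaled original features. The dataset, threshold and k are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic data: 5 informative features plus 20 near-constant "noise" features
X_inf, y_true = make_blobs(n_samples=1000, n_features=5, centers=3, random_state=0)
X_noise = 0.01 * np.random.RandomState(0).randn(1000, 20)
X = np.hstack([X_inf, X_noise])

# Drop features whose variance is negligible, then scale and cluster directly
X_filtered = VarianceThreshold(threshold=0.05).fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_filtered)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(X.shape, "->", X_filtered.shape)  # the near-constant columns are gone
```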
All that said, people do cluster on both PCA and t-SNE embeddings of data quite effectively. Since PCA is a linear transformation, one can use distance-based clustering afterward, but as already pointed out you may need more than the first 2 components (and sometimes many components) to capture all of the meaningful variance in the data. t-SNE is non-linear and therefore does not preserve distances, but in my experience it does preserve density in most cases when the perplexity parameter is chosen at least somewhat sensibly.

In the example you gave above, with which I disagree, the argument is that a simple 2-cluster dataset cannot be resolved when one chooses a perplexity of 20 or 40. Roughly speaking, perplexity is the minimum number of neighbors each data point is expected to have, so for a dataset of 1000 points as in the example you cited, the upper bound on the number of clusters t-SNE will consider is roughly 50 (p=20), 25 (p=40) and 13 (p=80). The closer our guess is to the actual number of clusters (2), the more "clusterable" the embedding will be. Obviously, 50 potential clusters at p=20 is quite far off from the actual 2, while 13 potential clusters at p=80 is much closer. The picture would be even better (and easier to cluster) at p=100, but the original poster did not try that. To me, choosing a correct perplexity is similar to choosing K in K-means clustering: if your choice is far off from the actual number, you may end up with wrong clusters. We would get a wrong K-means solution in the example you referenced above if we chose K=10.
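For illustration, a small sketch of both routes (my own example, not the code from the cited post): keep as many PCs as needed to explain most of the variance before distance-based clustering, and compute t-SNE embeddings at a few perplexity values. The 90% variance target, the perplexity values and the synthetic data are assumptions for demonstration only.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=1000, n_features=50, centers=2, random_state=0)
X = StandardScaler().fit_transform(X)

# Keep as many PCs as needed to explain ~90% of the variance (often more than 2),
# then run a distance-based clusterer on the PCA scores
pca = PCA(n_components=0.90, svd_solver="full")
scores = pca.fit_transform(X)
print("PCs kept:", pca.n_components_)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)

# t-SNE at increasing perplexity: roughly n_samples / perplexity potential
# groupings are considered, so larger perplexity means fewer, larger neighborhoods
for perplexity in (20, 40, 80):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    # emb can be inspected visually or passed to a density-based clusterer
```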
From a practical point of view, if you ever decide to cluster based on t-SNE embeddings, the clustering should be density-based rather than distance-based (a rough sketch of that recipe follows the example below). As to whether it works, it has been shown many times by many people to work, though one needs to exercise caution, and having data with little or no noise helps as well. I will demonstrate this with a clustering done on the t-SNE embedding of tetranucleotide frequencies (# features = 256) from a metagenomic dataset, that is, mixed DNA from many species where we don't know the exact origin of each sequence. This is what the clusters look like.
To independently test whether this clustering is reliable, we spiked in DNA from 3 known species; this is where they fall in that plot.
Hopefully it is clear that all 3 known genomes resolve well within the existing structure. That doesn't mean that every cluster has been assigned with 100% accuracy, but we know from other experiments that many of them have.
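Here is the promised sketch of the general recipe, not the actual metagenomic pipeline: embed with t-SNE, then cluster the 2D embedding with a density-based method. DBSCAN stands in for whatever density-based clusterer one prefers, and the data, perplexity, eps and min_samples values are all illustrative assumptions that would need tuning on real data.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# Stand-in for a high-dimensional frequency table (e.g. 256 tetranucleotide features)
X, _ = make_blobs(n_samples=2000, n_features=256, centers=8, random_state=0)
X = StandardScaler().fit_transform(X)

# 2-D t-SNE embedding at a perplexity chosen roughly for the expected cluster count
emb = TSNE(n_components=2, perplexity=80, random_state=0).fit_transform(X)

# Density-based clustering on the embedding; points in sparse regions become noise (-1)
labels = DBSCAN(eps=2.0, min_samples=10).fit_predict(emb)
print("clusters found:", len(set(labels) - {-1}))
```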
In short: there is stronger mathematical justification for clustering from a PCA embedding than from a t-SNE embedding, especially if one can reliably find the number of PCs to use (this is not automatic). Still, one can get just as good or better clustering from a t-SNE embedding if one can find a good approximation for the perplexity (this is not automatic either).
Thank you so much for taking the time to write out such a detailed response. I'll definitely be clustering on PCs first from here on out.