Battle of the dimensionality counter-curses
Evaluating performance of common and uncommon dimensionality reduction tools
UPDATE 5/08/2020: Since we’re on the subject of dimensionality reduction for single-cell data, I figured I’d add in the recent dbMAP (diffusion-based Manifold Approximation and Projection), yet another dimensionality reduction method for single-cell data, this time based on diffusion maps and UMAP.
UPDATE 6/23/2020: A new technique out of DeepMind: The Geometric Manifold Component Estimator (GeoManCEr). While it’s been optimized for 3D meshes, it seems like this could be big in future research on dimensionality reduction for reinforcement learning (or other tasks related to 3D objects).
UPDATE 12/26/2020: You know what’s surprisingly uncommon in high-dimensional visualizations? Parallel coordinate plots. To counter this apparent deficiency, I’ve added a new section with a few comparisons done with Facebook’s HiPlot.
With the recent release of PHATE, I felt it was a good time to revisit a comparison of the available dimensionality reduction techniques. There are plenty of techniques for dimensionality reduction, ranging from ANCOVA for getting rid of covariate variables to using entire neural networks to create latent-space representations. With such diversity of techniques in mind, here’s a refresher on some of the basics:
What is Dimensionality Reduction?
For a given problem in statistics or machine learning, we might have a lot of features. This usually means that not only is fitting a model to the data time- and compute-intensive, but we’re also less certain that we have a good solution after going through all that. This is what we mean when we talk about The Curse of Dimensionality. To undo this “curse”, we look for the best ways to simplify the data down to its most important features. The trade-off is that any “counter-curse” we use will lose some information. While this may seem like a pretty significant piece of fine print (which it is), dimensionality reduction is still super useful for data compression, data classification, visualization in lower dimensions, and even noise reduction.
When discussing dimensionality reduction, there are two main camps: projection and manifold learning. Projection focuses on taking all the high-dimensional features and projecting them onto a lower-dimensional subspace where some aspect of the data’s shape is preserved (e.g., some distance measure between all the points). Manifold learning is similar to projection, except that it relies on the manifold hypothesis: the assumption that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.

Picture a bunch of acrylic paint dots dried onto a sheet of glass lying flat on a table. If you wanted to describe the position of one of these dots to someone else trying to recreate the piece, you could probably get away with giving just two coordinates for each dot. Now suppose the glass was held at a 45-degree angle, and was slowly rotating. The dots could be described with 3 spatial coordinates at time $t$ (1 temporal coordinate), and if the dots were different colors you could even add RGB descriptors. What we have is a piece where each dot is described by 7 different numbers (7-dimensional feature vectors), but each one still corresponds to a point in the original 2-dimensional space (even with all the rotation going on). You could even create instructions that recreate this 7-dimensional description from just the 2D coordinates. This is an overly simplistic example (our feature vectors might have hundreds of dimensions each, and we’d be trying to compress them to 1 or 2 or 3 dimensions), but it illustrates the concept. Projection, meanwhile, would not assume the existence of such an underlying lower-dimensional ground truth, but would instead try to find the strongest correlations between the coordinates (or any of a variety of linear functions) that result in the least information being lost.
One can think of projection as being linear dimensionality reduction, and manifold learning as being nonlinear dimensionality reduction.
What are the differences between the different techniques?
We have several dimensionality reduction tools at our disposal [1]. For the projection (i.e., linear) dimensionality reduction techniques, we have:

- Random Projection: The Johnson-Lindenstrauss lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. With this technique, we do a controlled sampling of projection matrices so that we preserve (within a predetermined tolerance) the pairwise distances between any two samples of the dataset.
  - Gaussian Random Projection: We draw the components of the random matrix from the distribution $N(0, \frac{1}{n_{\text{components}}})$.
  - Sparse Random Projection: When sampling, we use sparse matrices instead of dense ones, for faster computation and a smaller memory footprint.
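As a concrete sketch of both variants, using scikit-learn’s `random_projection` module (here `eps` is the distance-distortion tolerance, and the target dimension is derived from it automatically):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(100, 10000)  # 100 samples in a 10,000-dimensional space

# The tolerance eps comes straight from the Johnson-Lindenstrauss lemma;
# both projectors pick the same target dimension for the same eps
gaussian = GaussianRandomProjection(eps=0.5, random_state=0)
X_gauss = gaussian.fit_transform(X)

sparse = SparseRandomProjection(eps=0.5, random_state=0)
X_sparse = sparse.fit_transform(X)

print(X_gauss.shape, X_sparse.shape)  # far fewer than 10,000 columns each
```

Tightening `eps` preserves distances more faithfully, but at the cost of a higher target dimension.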

- Principal Component Analysis (PCA): We use PCA to decompose multivariate datasets into successive orthogonal components that explain a maximum amount of the variance (rather than focusing on sampling within an error tolerance). If you’re reading this post, you’ve probably already run into this technique before.
  - Incremental PCA: Most PCA implementations only support batch processing, which requires that all your data fit in memory. Incremental PCA allows for component estimates that closely match the true ones, while only storing estimates of the component and noise variances.
  - Kernel PCA: Kernel PCA extends traditional PCA into nonlinear dimensionality reduction territory by adding in kernel-based transformations.
  - Sparse PCA: One issue with PCA is that its components have nonzero coefficients when expressed as linear combinations of the original variables. Sparse PCA changes this by restricting the outputs to sparse vectors. While this makes the important features easier for humans to pick out, it can struggle when there are very few instances in the data.
  - Mini-Batch Sparse PCA: This is to Sparse PCA what Incremental PCA is to regular PCA. That is, for data that’s too large to fit entirely in memory, this improves speed and reduces memory burden (at the cost of overall accuracy).
  - Truncated SVD: A variant of singular value decomposition (SVD) that only computes the $k$ largest singular values.
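All of these share scikit-learn’s fit/transform interface, so swapping between them is mostly a one-line change (a minimal sketch on random data):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(500, 50)

# Plain PCA: keep the 3 orthogonal components explaining the most variance
X_pca = PCA(n_components=3).fit_transform(X)

# Incremental PCA: same idea, but the data is consumed in small batches
X_ipca = IncrementalPCA(n_components=3, batch_size=100).fit_transform(X)

# Truncated SVD: computes only the k largest singular values (sparse-friendly)
X_svd = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

print(X_pca.shape, X_ipca.shape, X_svd.shape)
```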

- Dictionary Learning: This is a matrix factorization technique that amounts to finding a (usually overcomplete) dictionary that will perform well at sparsely encoding the fitted data. Multiple versions of this exist for either computing the dictionaries from scratch or building on precomputed dictionaries.
  - Mini-Batch Dictionary Learning: Like other mini-batch approaches, this is a faster but less accurate version of the general dictionary-learning algorithm.
- Factor Analysis: Like the other matrix decomposition techniques, we assume our data ($X$) can be decomposed into component latent variables (e.g., $X = WH + M + E$, with $H$, $M$, and $E$ being the latent vector, offset vector, and error vector, respectively). The main advantage of Factor Analysis over PCA is that it can model the variance in every direction of the input space independently (heteroscedastic noise).
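That heteroscedastic-noise advantage is easy to see in scikit-learn: Factor Analysis keeps a separate noise-variance estimate per input feature. A minimal sketch, with synthetic data where each feature gets its own noise level:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.RandomState(0)
# A latent 2-D signal lifted to 10 dimensions, with a different noise scale per feature
W = rng.randn(2, 10)
X = rng.randn(300, 2) @ W + rng.randn(300, 10) * rng.uniform(0.1, 2.0, size=10)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

# Unlike PCA (which assumes one shared noise level), Factor Analysis
# estimates a noise variance for each of the 10 input features
print(fa.noise_variance_.shape)  # (10,)
```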
Most of these come from scikit-learn. If we really wanted, we could also include Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) in this mix. This exploration doesn’t do that for now, since they require the classification labels ahead of time (which makes them less useful for unsupervised learning). All of these projection-based methods do little more than generate embeddings based on some tracked statistical feature of the data. As mentioned before, most do not assume any underlying shape to the data other than the shape they’re presented with. This is where manifold learning comes in.
For manifold-learning-based (i.e., nonlinear) dimensionality reduction, we have:
- Isometric Mapping (Isomap): One of the earliest manifold learning algorithms, Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points (yes, that’s still kind of similar to the previously discussed projection algorithms, but it’s definitely more like Kernel PCA than standard PCA).
- Multidimensional Scaling (MDS): Still kind of similar to the projection methods, MDS seeks out low-dimensional representations in which the low-dimensional distances behave similarly to the high-dimensional distances. This is done by generating a similarity matrix of the data points, and making sure the matrix looks the same when the data points are transformed into their lower-dimensional forms.
- Locally Linear Embedding (LLE): Imagine a series of local PCA analyses that all get compared to each other to find the best nonlinear embedding. That’s not exactly how LLE finds lower-dimensional manifolds that preserve local neighborhood distances, but it is pretty close. There are multiple variants (such as Hessian Eigenmapping (HLLE)), most of which are defined by how these local embeddings are compared at a global level.
- Spectral Embedding: This approach finds low-dimensional representations of the data using a spectral decomposition of the graph Laplacian (also known as Laplacian Eigenmaps).
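These four also live in scikit-learn, under `sklearn.manifold`. A minimal sketch on the classic swiss-roll dataset (a 2D sheet curled up in 3D, which is exactly the kind of low-dimensional manifold these methods are built to unroll):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, MDS, LocallyLinearEmbedding, SpectralEmbedding

# 500 points sampled from a 2-D sheet rolled up in 3-D space
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Each method tries to flatten the roll back into 2 dimensions
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                               random_state=0).fit_transform(X)
X_mds = MDS(n_components=2, n_init=1, random_state=0).fit_transform(X)
X_spec = SpectralEmbedding(n_components=2, random_state=0).fit_transform(X)

print(X_iso.shape, X_lle.shape, X_mds.shape, X_spec.shape)
```

The `n_neighbors` parameter controls how “local” the notion of neighborhood is, and changing it can noticeably change the embedding.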

- t-distributed Stochastic Neighbor Embedding (t-SNE): Given how often it’s used in biological research (even after the introduction of UMAP), it seems like this would need little introduction. t-SNE works by creating two probability distributions: one over the pairwise distances between the higher-dimensional datapoints, and one over the pairwise distances between the newly-generated lower-dimensional points. It then updates the positions of the lower-dimensional points to minimize the KL divergence between the two distributions. While it can produce much clearer clusters than techniques like PCA, it often requires a lot of hyperparameter tuning, and similar runs might get very dissimilar results.
  - Multicore t-SNE: This is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten. It also has the benefit of working much faster than scikit-learn’s TSNE implementation.
- Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP): By assuming that the data is uniformly distributed on a locally connected Riemannian manifold, and that the Riemannian metric is locally constant, this algorithm can model the underlying manifold with a fuzzy topological structure. UMAP has several advantages over t-SNE. While it is still a stochastic algorithm, it is usually much more stable. It is also much faster than t-SNE (and is easier to parallelize).
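A minimal sketch of the t-SNE workflow on scikit-learn’s bundled digits dataset (with the equivalent call for the third-party `umap-learn` package shown as comments, since it may not be installed):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 1,797 8x8 digit images (64 features each)

# Perplexity is the main knob to tune; small changes can reshape the clusters,
# and two runs with different seeds can look quite different
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X[:300])
print(X_tsne.shape)  # (300, 2)

# If umap-learn is installed, the call has the same shape:
# import umap
# X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
```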
The list of manifold-learning techniques progresses from still maintaining the shape of the high-dimensional data, to cleverly and iteratively testing low-dimensional manifold shapes. At this point, we reach the limitations of the manifold hypothesis. While these underlying manifolds definitely appear in many cases, such techniques might struggle in domains with noisier data, like the biological sciences. This is where the new manifold-learning technique PHATE gets added in (UPDATE: along with dbMAP and GeoManCEr). This new technique combines manifold learning with diffusion-based denoising. In other words, it tries to model both the underlying manifold as well as how exactly the data diffused away from this ideal manifold.
- Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE): PHATE provides a denoised, two- or three-dimensional visualization of the complete branching trajectory structure in high-dimensional data (first developed for cell data). It uses heat-diffusion processes, which naturally denoise the data, to compute cell-cell affinities. Then, PHATE creates a diffusion-potential geometry from the free-energy potentials of these processes. This geometry captures high-dimensional trajectory structures, while enabling a natural embedding of the intrinsic data geometry.
- diffusion-based Manifold Approximation and Projection (dbMAP): dbMAP has two parts to it: 1) an adaptive anisotropic reproduction of the initial input diffusion structure, followed by 2) an accelerated UMAP or graph layout. dbMAP is very similar to PHATE, save that it strives to include other types of diffusion, and other types of manifolds than those generated by UMAP.
- Geometric Manifold Component Estimator (GeoManCEr): Like dbMAP and PHATE, GeoManCEr combines manifold learning with diffusion-based denoising. It works by estimating the subspaces that are invariant under random-walk diffusion, giving an approximation to the de Rham decomposition. While PHATE is geared towards modelling biological processes, GeoManCEr was built to determine the underlying geometric shapes shown in images, with said shapes appearing under many different lighting conditions and angles.
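PHATE ships as the `phate` package and follows the familiar scikit-learn fit/transform convention. A minimal sketch of the call pattern, guarded with a try/except in case the package isn’t installed in your environment:

```python
import numpy as np

X = np.random.RandomState(0).rand(200, 50)  # random stand-in for real cell data

# phate.PHATE mirrors the scikit-learn estimator interface
try:
    import phate
    Y = phate.PHATE(n_components=2, random_state=0).fit_transform(X)
except ImportError:
    Y = None  # phate not installed in this environment

print(None if Y is None else Y.shape)
```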
With all these options, it may be difficult to choose just one (and this is made harder by the fact that tools like PHATE combine previous techniques). While entire books have been written on information-theoretic perspectives and context-specific use cases (some of which may only persist out of habit or technical debt at this point), I’d like to showcase one quantitative and two qualitative benchmarks…
Quantitative Test: Computational Performance
Different dimensionality reduction techniques can have quite different computational complexity. Beyond the algorithm itself, there is also the question of how exactly it is implemented. These two factors can play a significant role in how long it actually takes to run a given dimensionality reduction. Furthermore, the nature of the data you are trying to reduce can also matter; mostly, this involves the dimensionality of the original data.
For comparing performance on arbitrarily-sized slices of the MNIST dataset, the results were as follows:
This was all done in a Google Colab environment. Out of all the possible hardware I could have benchmarked on, this seemed like it would be both familiar and accessible to the largest number of people. If you want to try the benchmark for yourself, you can try the notebook here:
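The core of such a timing comparison boils down to a simple harness. A minimal sketch (random data standing in for an MNIST slice, and a deliberately small subset of the methods; absolute timings will vary by machine):

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE

rng = np.random.RandomState(0)
X = rng.rand(300, 100)  # small stand-in for a slice of MNIST

# Time fit_transform for each reducer on the same data
timings = {}
for name, reducer in [("PCA", PCA(n_components=2)),
                      ("Isomap", Isomap(n_components=2)),
                      ("t-SNE", TSNE(n_components=2, random_state=0))]:
    start = time.perf_counter()
    reducer.fit_transform(X)
    timings[name] = time.perf_counter() - start

for name, secs in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {secs:.3f}s")
```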
Qualitative Test #1: Evaluations on Known 3D Priors
While the algorithms above are often used for transformation into lower-dimensional spaces, they are more often than not applied to spaces where we have a very poor intuitive understanding of the data. It’s often tempting to run these algorithms without any kind of prior for how the data should be separated (other than having distinct clusters for each category or percentile of the data). With that in mind, I thought it would be interesting to compare these mappings to data with known AND intuitive mappings in 3D space.
These plots are extremely varied, but there are still two important takeaways:
- Your choices of parameters matter. Faster algorithms are useful for finding out which parameter choices suit your needs best. This is important because you’re definitely going to need multiple scatterplots.
- Things like cluster size, the distances between clusters, and even the number of clusters may not mean as much as you think. Depending on which parameters you choose, noise can definitely be mistaken for signal.
The hard truth is that the ideal lowerdimensional manifold may still have more dimensions than can fit on a 3D scatterplot. This last takeaway brings us to our next qualitative test.
Qualitative Test #2: Parallel Coordinate Plots for More-Than-3-Dimensional Visualizations
One of the issues with visualizing these mappings with 3D scatterplots is that we’re only seeing a maximum of 3 dimensions of the n-dimensional manifold. True, there are other clever techniques for adding a couple more dimensions to the embeddings, but they still rely on the quite limiting scatterplot idea.
One way around this is to use Parallel coordinates. From Wikipedia:
Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. To show a set of points in an n-dimensional space, a backdrop is drawn consisting of n parallel lines, typically vertical and equally spaced. A point in n-dimensional space is represented as a polyline with vertices on the parallel axes; the position of the vertex on the ith axis corresponds to the ith coordinate of the point. This visualization is closely related to time series visualization, except that it is applied to data where the axes do not correspond to points in time, and therefore do not have a natural order. Therefore, different axis arrangements may be of interest.
For example, here’s what we get if we render the integers from 1 to 50, along with the results of a bunch of simple functions.
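A plot along those lines can be produced with pandas’ built-in parallel-coordinates helper (a minimal sketch; the `bucket` grouping column and output filename are just illustrative choices, and the HiPlot lines are shown as comments since it’s a separate, notebook-oriented package):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs outside a notebook
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# The integers 1..50 with a few simple functions of each, one axis per column
n = np.arange(1, 51)
df = pd.DataFrame({"n": n, "n^2": n ** 2,
                   "sqrt(n)": np.sqrt(n), "log(n)": np.log(n)})
# A class column to color the polylines by
df["bucket"] = pd.cut(df["n"], bins=3, labels=["low", "mid", "high"]).astype(str)

parallel_coordinates(df, "bucket", colormap="viridis")
plt.savefig("parallel_coordinates.png")

# HiPlot offers an interactive version of the same idea (inside a notebook):
# import hiplot as hip
# hip.Experiment.from_dataframe(df).display()
```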
For n-dimensional data, this is a lot easier than trying to add a w-axis or new colors to an already packed 3-dimensional scatterplot. If we arrange these parallel axes by the order of the projection component, we can go beyond 3D scatterplots for all of these tools.
References
- D. Haziza, J. Rapin, and G. Synnaeve, “HiPlot, interactive high-dimensionality plots,” https://github.com/facebookresearch/hiplot, 2020.
- Sharp, Nicholas. “Polyscope.” (2019).
- Sainburg, Tim, Marvin Thielk, and Timothy Q. Gentner. “Latent space visualization, characterization, and generation of diverse vocal communication signals.” bioRxiv (2019): 870311.
- Moon, K. R., van Dijk, D., Wang, Z., Chen, W., Hirn, M. J., Coifman, R. R., … & Krishnaswamy, S. (2017). PHATE: a dimensionality reduction method for visualizing trajectory structures in high-dimensional biological data. bioRxiv, 120378.
- Gigante, S., Charles, A. S., Krishnaswamy, S., & Mishne, G. (2019). Visualizing the PHATE of Neural Networks. In Advances in Neural Information Processing Systems (pp. 1842-1853).
- Sidarta-Oliveira, D., & Velloso, L. Comprehensive Visualization of High-Dimensional Single-Cell Data With Diffusion-Based Manifold Approximation and Projection (dbMAP). CELL-REPORTS-D-20-01731.
- McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv e-prints, 1802.03426.
- Pfau, D., Higgins, I., Botev, A., & Racanière, S. (2020). Disentangling by Subspace Diffusion. arXiv preprint arXiv:2006.12982.
Cited as:
@article{mcateer2019dimcc,
title = "Battle of the dimensionality counter-curses",
author = "McAteer, Matthew",
journal = "matthewmcateer.me",
year = "2019",
url = "https://matthewmcateer.me/blog/dimensionalitycountercurses/"
}
If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me]
and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.
See you in the next post 😄