Battle of the dimensionality counter-curses

Evaluating performance of common and uncommon dimensionality reduction tools

UPDATE 5/08/2020: Since we’re on the subject of dimensionality reduction for single-cell data, figured I’d add in the recent dbMAP (diffusion-based Manifold Approximation and Projection), yet another dimensionality reduction method for single-cell data, this time based on diffusion maps and UMAP.

UPDATE 6/23/2020: A new technique out of DeepMind: The Geometric Manifold Component Estimator (GeoManCEr). While it’s been optimized for 3D meshes, it seems like this could be big in future research on dimensionality reduction for reinforcement learning (or other tasks related to 3D objects).

UPDATE 12/26/2020: You know what’s surprisingly uncommon in high-dimensional visualizations? Parallel Coordinate Plots. To counter this apparent deficiency, I’ve added a new section with a few comparisons done with Facebook’s HiPlot.

With the recent release of PHATE, I felt it was a good time to revisit a comparison of the available dimensionality reduction techniques. There are plenty of techniques for dimensionality reduction, ranging from ANCOVA for getting rid of covariate variables to using entire neural networks to make latent space representations. With such diversity of techniques in mind, here’s a refresher on some of the basics:

What is Dimensionality Reduction?

For a given problem in statistics or machine learning, we might have a lot of features. This usually means that not only is fitting a model to the data time- and compute-intensive, but we’re also less certain that we have a good solution after going through all that. This is what we mean when we talk about The Curse of Dimensionality. To undo this “curse”, we look for the best ways to simplify the data while keeping the most important features. The tradeoff is that any “counter-curse” we use will lose some information. While this may seem like a pretty significant piece of fine-print (which it is), dimensionality reduction is still super useful for data compression, data classification, visualization in lower dimensions, and even noise reduction.

When discussing dimensionality reduction, there are two main camps: projection & manifold learning. Projection focuses on taking all high-dimensional features and projecting them onto a lower-dimensional subspace where some aspect of the data’s shape is preserved (e.g., some distance measure between all the points). Manifold learning is similar to projection, except that it relies on the manifold hypothesis. This is the assumption that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. Picture a bunch of acrylic paint dots dried onto a sheet of glass lying flat on a table. If you wanted to describe the position of one of these dots to someone else trying to recreate the piece, you could probably get away with giving just two coordinates for each dot. Now suppose the glass was held at a 45-degree angle, and was slowly rotating. The dots could be described with 3 spatial coordinates at time t (1 temporal coordinate), and if the dots were different colors you could even add RGB descriptors. What we have is a piece where each dot is described by 7 different numbers (7-dimensional feature vectors), but each one still corresponds to a point in the original 2-dimensional space (even with all the rotation going on). You could even create instructions that recreate this 7-dimensional description from just the 2D coordinates. This is an overly simplistic example (our feature vectors might have hundreds of dimensions each, and we’d be trying to compress them to 1, 2, or 3 dimensions), but it illustrates the concept. Projection, meanwhile, would not assume the existence of such an underlying lower-dimensional ground truth, but would try to find the strongest correlations between the coordinates (or any of a variety of linear functions of them) that result in the least information being lost.

One can think of projection as being linear dimensionality reduction, and manifold learning as being non-linear dimensionality reduction.
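
To make the paint-dot analogy a bit more concrete, here’s a minimal NumPy/scikit-learn sketch (with purely linear “lifts” and made-up coefficients, purely for illustration) of data whose feature count is much higher than its intrinsic dimensionality:

```python
import numpy as np
from sklearn.decomposition import PCA

# Start with points that are intrinsically 2D (the paint dots on the glass).
rng = np.random.default_rng(0)
dots_2d = rng.uniform(size=(500, 2))

# Lift each dot into 5 features that are all functions of the original 2
# coordinates -- a stand-in for the extra spatial/temporal/color descriptors.
lifted = np.column_stack([
    dots_2d[:, 0],
    dots_2d[:, 1],
    dots_2d[:, 0] + dots_2d[:, 1],
    dots_2d[:, 0] - dots_2d[:, 1],
    2.0 * dots_2d[:, 0] + 0.5 * dots_2d[:, 1],
])

# PCA on the lifted data shows ~2 components explaining essentially all of the
# variance: 5 features, but only 2 intrinsic dimensions.
explained = PCA().fit(lifted).explained_variance_ratio_
print(explained.round(4))
```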

What are the differences between the different techniques?

We have several dimensionality reduction tools at our disposal [1]. For the projection (i.e., linear) dimensionality reduction techniques, we have the following (with a short code sketch after the list):

  • Random Projection - The Johnson-Lindenstrauss lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. With this technique, we do a controlled sampling of projection matrices so that we preserve (with a predetermined level of tolerance) the pairwise distances between any two samples of the dataset.

    • Gaussian Random Projection - We draw the components of the random matrix from the distribution $N(0, \frac{1}{n_{\text{components}}})$.
    • Sparse Random Projection - When sampling, we use sparse matrices instead of dense matrices for the sake of faster computation time and less memory.
  • Principal component analysis (PCA) - We use PCA to decompose multivariate datasets into successive orthogonal components that explain a maximum amount of the variance (rather than focusing on sampling with an error tolerance). If you’re reading this post you’ve probably already run into this technique before.

    • Incremental PCA - Most PCA implementations only support batch processing, which requires that all your data fit in memory. Incremental PCA allows for component estimates that closely match the true ones, while only storing estimates of component and noise variances.
    • Kernel PCA - Kernel PCA extends traditional PCA into non-linear dimensionality reduction territory, by adding in kernel-based transformations.
    • Sparse PCA - One issue with PCA is that its components have non-zero coefficients when expressed as linear combinations of the original variables. Sparse PCA changes this by restricting the outputs to sparse vectors. While this makes the important features easier to highlight for humans, it can struggle when there are very few instances in the data.
    • Mini-Batch Sparse PCA - This is to Sparse PCA what Incremental PCA is to regular PCA. That is, for data that’s too large to fit entirely in memory, this improves speed and reduces memory burden (at the cost of overall accuracy).
  • Truncated SVD - A variant of singular value decomposition (SVD) that only computes the $k$ largest singular values.
  • Dictionary Learning - This is a matrix factorization technique that amounts to finding a (usually overcomplete) dictionary that will perform well at sparsely encoding the fitted data. Multiple versions of this exist for either computing the dictionaries from scratch, or building on pre-computed dictionaries.

    • Mini-Batch Dictionary Learning - Like other mini-batch approaches, this is a faster but less accurate version of the general dictionary learning algorithm.
  • Factor Analysis - Like the other matrix decomposition techniques, we assume our data ($X$) can be decomposed into component latent variables (e.g., $X = WH + M + E$, with $H$, $M$, and $E$ being the pre-specified latent vector, offset vector, and error vector respectively). The main advantage of Factor Analysis (over PCA) is that it can model the variance in every direction of the input space independently (heteroscedastic noise).
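
Since nearly all of these follow scikit-learn’s estimator API, a minimal sketch of running a few of them side by side might look like the following (using the small built-in digits dataset rather than full MNIST, and leaving everything except n_components at its defaults):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD, FactorAnalysis
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# 8x8 digit images flattened into 64-dimensional feature vectors.
X, _ = load_digits(return_X_y=True)

projectors = {
    "PCA": PCA(n_components=2),
    "TruncatedSVD": TruncatedSVD(n_components=2),
    "FactorAnalysis": FactorAnalysis(n_components=2),
    "GaussianRandomProjection": GaussianRandomProjection(n_components=2, random_state=0),
    "SparseRandomProjection": SparseRandomProjection(n_components=2, random_state=0),
}

# Every one of these exposes the same fit_transform interface.
for name, projector in projectors.items():
    embedding = projector.fit_transform(X)
    print(f"{name}: {X.shape} -> {embedding.shape}")
```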

Most of these come from Scikit-learn. If we really wanted, we could also include Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) in this mix. This exploration doesn’t do that for now, since they require the classification labels ahead of time (which makes them less useful for unsupervised learning). All of these projection-based methods do little more than generate embeddings based on some tracked statistical feature of the data. As mentioned before, most do not assume any underlying shape to the data other than the shape they’re presented with. This is where manifold learning comes in.

For manifold-learning-based (i.e., non-linear) dimensionality reduction, we have the following (with a short code sketch after the list):

  • Isometric Mapping (Isomap) - One of the earliest manifold learning algorithms, Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points (yes, that’s still kind of similar to the previously discussed projection algorithms, but it’s definitely more like Kernel PCA than standard PCA).
  • Multi-dimensional Scaling (MDS) - Still kind of similar to the projection methods, MDS seeks out low-dimensional representations in which the low-dimensional distances behave similarly to the high-dimensional distances. This is done by generating a similarity matrix of the data points, and making sure the matrix looks the same when the data points are transformed into their lower-dimensional forms.
  • Locally Linear Embedding (LLE) - Imagine a series of local PCA analyses that all get compared to each other to find the best non-linear embedding. That’s not exactly how LLE finds lower-dimensional manifolds that preserve local neighborhood distances, but it is pretty close. There are multiple variants (such as Hessian Eigenmapping (HLLE)), most of which are defined by how these local embeddings are compared at a global level.
  • Spectral Embedding - This approach finds low-dimensional representations of the data using a spectral decomposition of the graph Laplacian (the approach behind Laplacian Eigenmaps).
  • t-distributed Stochastic Neighbor Embedding (t-SNE) - Given how often it’s used in biological research (even after the introduction of UMAP), it seems like this would need little introduction. t-SNE works by creating two probability distributions: one over the higher-dimensional datapoint pairwise distances, and one over the newly-generated lower-dimensional point pairwise distances. It then updates the positions of the lower-dimensional points to minimize the KL divergence between the two distributions. While it can produce much clearer clusters than techniques like PCA, it often requires a lot of hyperparameter tuning, and seemingly similar runs can give very dissimilar results.

    • Multicore t-SNE - This is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten. It also has the benefit of working much faster than scikit-learn’s TSNE implementation.
  • Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) - By assuming that data is uniformly distributed on a locally-connected Riemannian manifold, and that the Riemannian metric is locally constant, this algorithm can model the underlying manifold with a fuzzy topological structure. UMAP has several advantages over t-SNE. While it is still a stochastic algorithm, it is usually much more stable. It is also much faster than t-SNE (and is easier to parallelize).
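
As with the projection methods, most of these share the same fit_transform interface, so a minimal sketch of trying a few of them (assuming the separate umap-learn package is installed for UMAP) could look like this:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, Isomap, LocallyLinearEmbedding, SpectralEmbedding

X, _ = load_digits(return_X_y=True)

# The scikit-learn manifold learners, all reduced to 2 dimensions.
embedders = {
    "Isomap": Isomap(n_components=2),
    "LLE": LocallyLinearEmbedding(n_components=2),
    "SpectralEmbedding": SpectralEmbedding(n_components=2),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0),
}

for name, embedder in embedders.items():
    embedding = embedder.fit_transform(X)
    print(f"{name}: {embedding.shape}")

# UMAP lives in the separate umap-learn package, but follows the same pattern.
import umap
umap_embedding = umap.UMAP(n_components=2, n_neighbors=15, random_state=0).fit_transform(X)
print(f"UMAP: {umap_embedding.shape}")
```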

The list of manifold-learning techniques progresses from still maintaining the shape of the high-dimensional data, to cleverly and iteratively testing low-dimensional manifold shapes. At this point, we reach the limitations of the manifold hypothesis. While these underlying manifolds definitely appear in many cases, they might struggle in domains with noisier data, like the biological sciences. This is where the new manifold-learning technique PHATE gets added in (UPDATE: added in dbMAP and GeoManCEr). This new technique combines manifold learning with diffusion-based denoising. In other words, it tries to model both the underlying manifold as well as how exactly the data diffused from this ideal manifold (a minimal PHATE usage sketch follows the list below).

  • Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE) - PHATE provides a denoised, two- or three-dimensional visualization of the complete branching trajectory structure in high-dimensional data (it was first developed for cell data). It uses heat-diffusion processes, which naturally denoise the data, to compute cell-cell affinities. Then, PHATE creates a diffusion-potential geometry from the free-energy potentials of these processes. This geometry captures high-dimensional trajectory structures, while enabling a natural embedding of the intrinsic data geometry.
  • diffusion-based Manifold Approximation and Projection (dbMAP) - dbMAP has two parts to it: 1) an adaptive anisotropic reproduction of the initial input diffusion structure, followed by 2) an accelerated UMAP or graph layout. dbMAP is very similar to PHATE, save that it strives to include other types of diffusion and other types of manifolds than those generated by UMAP.
  • Geometric Manifold Component Estimator (GeoManCEr) - Like dbMAP and PHATE, GeoManCEr combines manifold learning with diffusion-based denoising. It works by estimating the subspaces that are invariant under random walk diffusion, giving an approximation to the de Rham decomposition. While PHATE is geared towards modelling biological processes, GeoManCEr was built to determine the underlying geometric shapes shown in images, with said shapes presented under many different lighting conditions and angles.
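
PHATE ships as a Python package with a scikit-learn-style interface, so a minimal, hedged sketch of calling it (assuming the phate package is installed and sticking to its defaults) might look like this:

```python
import phate
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# PHATE follows the familiar fit_transform pattern; the defaults handle the
# diffusion / potential-distance steps described above internally.
phate_operator = phate.PHATE(n_components=2, random_state=0)
phate_embedding = phate_operator.fit_transform(X)
print(phate_embedding.shape)  # (n_samples, 2)
```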

With all these options it may be difficult to choose just one (and this is made harder by the fact that tools like PHATE combine previous techniques). While entire books have been written on information-theoretic perspectives and context-specific use cases (some of which may only persist out of habit or technical debt at this point), I’d like to showcase one quantitative benchmark and a couple of qualitative ones…

Quantitative Test: Computational Performance

Different dimension reduction techniques can have quite different computational complexity. Beyond the algorithm itself there is also the question of how exactly it is implemented. These two factors can play a significant role in how long it actually takes to run a given dimension reduction. Furthermore, the nature of the data you are trying to reduce can also matter; mostly this involves the dimensionality of the original data.
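
As a rough illustration of the kind of measurement involved (a minimal sketch, not the exact code from the benchmark notebook), one can time fit_transform on increasingly large slices of MNIST, assuming scikit-learn and an internet connection for fetch_openml:

```python
import time
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Pull down MNIST once (70,000 samples x 784 features).
mnist_X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

for n_samples in (1_000, 2_500, 5_000):
    X = mnist_X[:n_samples]
    for name, reducer in [("PCA", PCA(n_components=2)),
                          ("t-SNE", TSNE(n_components=2, random_state=0))]:
        start = time.perf_counter()
        reducer.fit_transform(X)
        elapsed = time.perf_counter() - start
        print(f"{name} on {n_samples} samples: {elapsed:.2f}s")
```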

For comparing performance on arbitrarily-sized slices of the MNIST dataset, the results were as follows:

Our projection-based algorithms

Our manifold learning algorithms

Pooling all but the obvious slowpokes

Just the top 3 fastest now

This was all done in a Google Colab environment. Out of all the possible hardware I could have benchmarked on, this seemed like it would be both familiar and accessible to the most people. If you want to try the benchmark for yourself, you can try the notebook here:

Open In Colab

Qualitative Test #1: Evaluations on Known 3D Priors

While the algorithms above are often used for transformation into lower-dimensional spaces, they are more often than not applied to spaces for which we have a very poor intuitive understanding. It’s often tempting to run these algorithms without any kind of prior for how the data should be separated (other than having distinct clusters for each category or percentile of the data). With that in mind, I thought it would be interesting to compare these mappings on data with known AND intuitive mappings in 3D space.
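
A minimal sketch of that idea, using scikit-learn’s swiss roll and S-curve generators as stand-ins for the known 3D priors: generate 3D data whose intrinsic structure we already know, then see how each tool treats it.

```python
from sklearn.datasets import make_swiss_roll, make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 3D point clouds with a known, intuitive 2D structure "hiding" inside them.
X_roll, position_roll = make_swiss_roll(n_samples=2000, random_state=0)
X_curve, position_curve = make_s_curve(n_samples=2000, random_state=0)

for name, X, position in [("swiss roll", X_roll, position_roll),
                          ("s-curve", X_curve, position_curve)]:
    for reducer in (PCA(n_components=2), TSNE(n_components=2, random_state=0)):
        embedding = reducer.fit_transform(X)
        # `position` encodes where each point sits along the known manifold, so
        # plotting the embedding colored by it shows whether that structure
        # survived the reduction.
        print(f"{type(reducer).__name__} on the {name}: {embedding.shape}")
```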

These plots are extremely varied, but there are still two important takeaways:

  1. Your choices of parameters matter. Faster algorithms are useful for finding out which parameter choices suit your needs best. This is important because you’re definitely going to need multiple scatterplots (a minimal parameter-sweep sketch follows this list).
  2. Things like cluster size, the distances between clusters, and even the number of clusters may not mean as much as you think. Depending on which parameters you choose, noise can definitely be mistaken for signal.
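
As a quick illustration of the first takeaway, a minimal parameter-sweep sketch (here over t-SNE’s perplexity; the same idea applies to UMAP’s n_neighbors and friends) might look like this:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# The same data and the same algorithm, swept over one hyperparameter. The
# resulting scatterplots can look strikingly different, which is exactly why a
# single embedding shouldn't be over-interpreted.
for perplexity in (5, 30, 50, 100):
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=0).fit_transform(X)
    print(f"perplexity={perplexity}: {embedding.shape}")
```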

The hard truth is that the ideal lower-dimensional manifold may still have more dimensions than can fit on a 3D scatterplot. This last takeaway brings us to our next qualitative test.

Qualitative Test #2: Parallel Coordinate Plots for More-Than-3-Dimensional Visualizations

One of the issues with visualizing these mappings with 3D scatterplots is that we’re only seeing a maximum of 3 dimensions of the n-dimensional manifold. True, there are other clever techniques for adding a couple more dimensions to the embeddings, but they still rely on the quite limiting scatterplot idea.

One way around this is to use Parallel coordinates. From Wikipedia:

Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. To show a set of points in an n-dimensional space, a backdrop is drawn consisting of n parallel lines, typically vertical and equally spaced. A point in n-dimensional space is represented as a polyline with vertices on the parallel axes; the position of the vertex on the i-th axis corresponds to the i-th coordinate of the point. This visualization is closely related to time series visualization, except that it is applied to data where the axes do not correspond to points in time, and therefore do not have a natural order. Therefore, different axis arrangements may be of interest.

For example, here’s what we get if we render the integers from 1 to 50, along with the results of a bunch of simple functions of them.

This is what to look for to tell whether your data is correlated, anti-correlated, uncorrelated, or even just unevenly clustered.

For n-dimensional data, this is a lot easier than trying to add a w-axis or new colors to an already packed 3-dimensional scatterplot. If we arrange these parallel axes in the order of the projection components, we can go beyond 3D scatterplots for all of these tools.
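
For instance, a minimal sketch with HiPlot (assuming it is installed via pip install hiplot, and using a 6-component PCA embedding purely as an example) might look like this:

```python
import hiplot as hip
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

# Reduce to more components than a scatterplot could show: one record per
# sample, one parallel axis per component (plus the class label).
embedding = PCA(n_components=6).fit_transform(X)
records = [
    {"label": int(label), **{f"PC{i + 1}": float(v) for i, v in enumerate(row)}}
    for row, label in zip(embedding, y)
]

experiment = hip.Experiment.from_iterable(records)
experiment.display()  # renders the parallel coordinate plot in a notebook/Colab
```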

References


Cited as:

@article{mcateer2019dimcc,
    title = "Battle of the dimensionality counter-curses",
    author = "McAteer, Matthew",
    journal = "matthewmcateer.me",
    year = "2019",
    url = "https://matthewmcateer.me/blog/dimensionality-counter-curses/"
}

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄
