Clothing Automata

Responding to self-classifying MNIST, and going even further

If you haven’t already seen the latest Distill publication from Ettore Randazzo, Alexander Mordvintsev, Eyvind Niklasson, Michael Levin, and Sam Greydanus, definitely give it a look. I give a quick summary of the main takeaways here, but you should still look at the original article (even if it’s just to play with the animations). At its highest level, the research sets out to answer the following question:

“Can cellular automata (CAs) use local message passing to achieve global agreement on what digit they compose?”

The research: Getting MNIST images to self-identify

If you’ve played around with the Distill animation, you’ve probably seen the pixels change color. There’s clearly message passing going on. The pixels do a pretty good job on most of the classification, even if there are some weak spots (such as confusing 2s and 8s, or just defaulting to 0). This paper isn’t about state-of-the-art MNIST classification. Classifying digits is often treated as a solved problem (classifying the handwriting author, not so much). I highly recommend you check out the previous Neural CA paper in their series, as this is an extension of that previous work.

The main takeaway of the previous work from these authors was that CAs can self-organize and resist perturbations when the rules generating the automata are differentiable. How do these CAs work? A given image is rasterized, and each cell/pixel represents a node in a graph, connected to all its immediate neighbors (and the same pattern goes for all THOSE neighbors). If we have an MNIST image, each cell may only store data about whether it’s on or off (or alive or dead). The dead/off/blank cells do not pass messages around. The alive cells, on the other hand, need to communicate their alive status to each other, and by extension what image class they compose. Only by receiving messages from neighboring cells, storing a state, and sending out messages to neighbors can the cells agree on what image class they are. Whatever image class a cell assigns the highest probability is the class that cell selects (this is probably sounding familiar to anyone who knows how a softmax layer works, so you can probably see how this thing would be trained).
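To make the "highest probability wins" idea concrete, here is a minimal sketch of how a single cell could turn its class scores into a label. The function names and the example logits are hypothetical and not taken from the original article; the only assumption is that each alive cell holds one score per class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cell_label(logits):
    """Return the index of the class this cell currently favours."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical 10-class scores held by one alive cell (made up for illustration).
example_logits = [0.1, 0.0, 2.3, 0.2, 0.0, 0.4, 0.0, 0.1, 1.9, 0.0]
label = cell_label(example_logits)  # index of the largest score
```

Each alive cell would run this independently, so the "global" answer is just whatever the cells end up agreeing on.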

What would these rules even look like? Suppose you’re in the bend of a 5. One cell could register that its neighbor is alive, but that neighbor could convey that its own neighbor is dead. The original cell receives a message that its neighbor’s neighbor is dead, which matches the pattern of an edge. You can imagine these nth-degree neighbors as serving a similar purpose to nodes further back in a neural network. The result is that we have state vectors being passed around as the messages between nodes, with each cell integrating the messages it receives as a linear combination of its neighbors’ states and its own state. Seeing all this, you can probably imagine that it would be easy to create a separate update rule for each node, but the challenge is that this update rule (for integrating a cell’s own state and neighbors’ states) needs to be the same for all cells in the graph.
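Here is a toy sketch of one shared update rule applied to every alive cell. It is a simplification under stated assumptions: each cell holds a single scalar instead of a state vector, the rule is a fixed linear combination rather than a learned network, and `step`, `w_self`, and `w_neigh` are hypothetical names.

```python
def step(states, alive, w_self, w_neigh):
    """Apply the same linear update to every alive cell.

    states: 2D list of floats (one scalar per cell, for brevity)
    alive:  2D list of bools; dead cells neither send nor receive
    """
    h, w = len(states), len(states[0])
    new = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not alive[y][x]:
                continue
            total = 0.0
            # Sum the states of the (up to 8) alive immediate neighbors.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and alive[ny][nx]:
                        total += states[ny][nx]
            new[y][x] = w_self * states[y][x] + w_neigh * total
    return new

# A horizontal stroke three cells long; only the left end starts with a signal.
alive = [[False] * 3, [True] * 3, [False] * 3]
states = [[0.0] * 3, [1.0, 0.0, 0.0], [0.0] * 3]

after_one = step(states, alive, w_self=0.5, w_neigh=0.25)
after_two = step(after_one, alive, w_self=0.5, w_neigh=0.25)
```

After one step the right-hand cell has heard nothing; after two steps the signal has been relayed through the middle cell. That is the "neighbor’s neighbor" effect, with identical weights used everywhere.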

The biological motivation for this is that the cells in the body, when differentiating, are able to follow rules that help them determine whether they should be a neuron or skin cell or muscle cell, etc. None of these cells have a global positioning system (i.e., they don’t know where exactly they are in the body), but they do have information on their immediate surroundings. This was alluded to in the previous work as well, though the message passing in planaria is probably far less convoluted than the one demonstrated in the CA demo.

That’s a good motivation, and fortunately this process also heavily resembles a 3x3 convolution operation, meaning we can learn these rules with convolutional neural networks joined by residual connections. The architectural differences from the last paper are twofold. First, the automata in the last paper were 3-channel RGB values, but here we’re only interested in the state. Second, the positions of the dead and alive cells are static. This model is only interested in message passing. While the architecture seems familiar, this isn’t something that can be trained just once. The cells need to have a notion of being continuously alive, continuously prepared for some modification to the cell. This is achieved by randomly initializing the cell states, and then training the cells to predict the MNIST digit they’re within. This can be achieved pretty quickly, but then at 200 steps of the automaton the overall digit is switched (what the authors call the ‘mutation’ step). Rather than all having random initializations, the cells at this stage for the most part retain the digit class from the previous digit, and need to learn to switch tracks.

Maybe 95% of the cells will learn to switch tracks with little problem, but towards the end many cells will still be active. Over time the number of cells responding with the correct image identity will start to drop. When measured with ‘average total agreement’ (i.e., how many cells agree with each other on which digit they’re part of), the cells begin to agree rapidly, but then curiously begin to decohere. The authors hypothesize that this might be due to the cross entropy loss used in training. Usually in ML we use something like cross entropy loss for classification without a second thought. The softmax operator serves as a normalization step, turning the raw outputs into the class distribution that’s judged by the cross entropy loss. It’s this combo that causes problems.
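As a rough illustration of an agreement score, the sketch below measures the fraction of alive cells that vote for the most common label. This follows the informal gloss above and is an assumption; the original article may define its agreement metric differently, and `majority_agreement` is a hypothetical name.

```python
from collections import Counter

def majority_agreement(labels, alive):
    """Fraction of alive cells whose label matches the most common label."""
    votes = [
        labels[y][x]
        for y in range(len(labels))
        for x in range(len(labels[0]))
        if alive[y][x]
    ]
    if not votes:
        return 0.0
    _, count = Counter(votes).most_common(1)[0]
    return count / len(votes)

# Hypothetical 2x3 patch: four alive cells, three of which say "5".
labels = [[5, 5, 0], [5, 3, 0]]
alive = [[True, True, False], [True, True, False]]
score = majority_agreement(labels, alive)
```

Tracking this score at every automaton step is one way to see the "agree rapidly, then decohere" pattern as a curve.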

If you look at the softmax, the $e^x$ components all mean that the entries for the other classes can never truly be zero. They can get very close, but never AT zero. The loss will never be zero, and there will never be perfect logits, but the gradients will always push the network in the direction of increasing the likelihood of the most favored logit, and decreasing all others. If we do this in a neural network just once, with a finite runtime, this is usually not a problem. The problem comes about when we run the network for infinite time. The magnitudes of the logits keep growing. If you want to change the class outputs, you need to change the logits by ever more dramatic amounts. With L2 loss, this shouldn’t happen. You don’t output logits; you output probabilities and compare the L2 distance. While this makes the previous situation harder to reach, it doesn’t completely eliminate it. For fixing this problem on the loss side, an alternative to softmax might be needed, such as SparseMax or sum-normalization.
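The sketch below is a small numerical illustration of that argument, not the authors’ training code. It applies plain gradient descent directly to three logits (a made-up setup with a learning rate of 1) and records how far the favored logit pulls away from the others. It relies on the standard identity that the gradient of softmax cross entropy with respect to the logits is `softmax(logits) - one_hot(target)`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of softmax cross entropy w.r.t. the logits."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def l2_grad(outputs, target):
    """Gradient of the squared L2 distance to a one-hot target w.r.t. the outputs."""
    return [2.0 * (o - (1.0 if i == target else 0.0)) for i, o in enumerate(outputs)]

logits = [0.0, 0.0, 0.0]
gaps = []  # logit gap between the target class and another class
for step in range(1000):
    grad = cross_entropy_grad(logits, target=0)
    logits = [z - g for z, g in zip(logits, grad)]
    if (step + 1) % 250 == 0:
        gaps.append(logits[0] - logits[1])

still_positive = min(softmax(logits)) > 0.0          # other classes never hit zero
l2_at_target = l2_grad([1.0, 0.0, 0.0], target=0)    # L2 gradient vanishes at the target
```

The recorded gaps keep increasing because the cross entropy gradient is never exactly zero, whereas the L2 gradient is exactly zero once the output equals the one-hot target.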

Fortunately, the authors found a quicker fix than combing through softmax alternatives. The authors demonstrated what happens when noise is consistently introduced during the training process. This does a much better job than the L2 loss of preventing the network accuracy from dropping. It seems keeping the cells on their toes actually increases the network agreement over time.

One interesting robustness property of these networks is their resistance to out-of-distribution classes. If one draws a digit that resembles those in the MNIST dataset, the cells usually converge on the correct class (or at least they converge on a class). If you add a few lines to turn that digit into a letter, or connect multiple digits with lines, the cells will continue disagreeing with each other and will struggle to settle on one global class.

Not only do the authors visualize the internal cell states (which class), but the message passing itself can be visualized through the latent variables. Over time, as these messages pass throughout the contours of the shape, you can see how the class labels change in response. This gives more credence to the authors’ model of the network as cells passing around messages of their neighbors’ states.

Self-assembling Clothing

If any of you have seen movies like Black Panther, you’ve probably seen the character with clothes that seem to assemble themselves. Big questions about material science notwithstanding, it’s also a mystery how such components are able to keep track of themselves. This question inspired me to try out the neural cellular automata with the Fashion MNIST dataset.

Replace MNIST with Fashion MNIST

The original self-classifying MNIST team pointed out that MNIST digits are not rotationally invariant, and that this means agents must be aware of their orientation with respect to the grid. Therefore, while they do not know where they are, they do know where up, down, left and right are.

The biological analogy here is a situation where the remodeling structures exist in the context of a larger body and a set of morphogen gradients or tissue polarity that indicate directional information with respect to the three major body axes. Given these preliminaries, we introduce the self-classifying MNIST task.

There are some immediate performance differences here. For one, even in the best of cases, the total agreement and total accuracy usually fall short of the average cases for the self-identifying MNIST. Strangely enough, once these metrics hit a plateau, there seems to be a similar amount of drift compared to the MNIST case.

Simpler message passing?

Automata are fascinating not for the shapes they ultimately produce, but for the simplicity of the rules that can produce those shapes. As far as machine learning research goes, this model was pretty parameter efficient. But can we go even further?

One thing we can do is replace full precision weights with quantized weights. Even with weight quantization, we can still set up a network that’s over-parameterized enough to safely get around the bias-variance tradeoff problem.
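For readers unfamiliar with weight quantization, here is a minimal sketch of two common schemes: uniform symmetric quantization to a given bit width, and sign-based binarization. This is a generic illustration and not necessarily the scheme used in this post's experiments; `quantize` and `binarize` are hypothetical helper names.

```python
def quantize(weights, bits):
    """Snap each weight to the nearest of a symmetric, evenly spaced set of levels."""
    if bits < 2:
        raise ValueError("use binarize() for 1-bit weights")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0.0 for _ in weights]
    levels = 2 ** (bits - 1) - 1      # e.g. 127 positive levels for 8 bits
    scale = max_abs / levels          # spacing between adjacent levels
    return [round(w / scale) * scale for w in weights]

def binarize(weights):
    """Keep only the sign of each weight (+1.0 or -1.0)."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

# Made-up weights for illustration.
weights = [0.7, -1.0, 0.2]
two_bit = quantize(weights, bits=2)   # levels are -1.0, 0.0, 1.0
one_bit = binarize(weights)
```

In practice the quantized weights would replace the full precision ones inside the shared update rule, so every cell still applies the same (now much cheaper) rule.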

What does this all mean?

Summary of the ML findings

We’ve demonstrated a few key things about neural cellular automata:

  1. Agents can learn awareness within larger structures, and learn classification rules (i.e., this was the entire point of the Distill work, I’m just repeating it here). The authors point out that previous work on two-dimensional cellular automata has already been combined with reservoir computing, Boltzmann machines, evolutionary algorithms, and ensembles, but that this is the first kind that uses the kind of fully differentiable end-to-end pipeline that Jax fangirls would flip out over.
  2. Said agents are surprisingly robust, and can even be applied to shapes with less distinguishable silhouettes (such as Fashion MNIST).
  3. In terms of rule complexity, we can go even simpler than the ~8000 floating point parameter model. We can go all the way to simple binary decisions.

As for what this means beyond fun web page interactives, quite a lot actually.

Control Theory

Automata that can learn their patterns from within themselves are obviously incredibly important for swarm robotics. True, if expanded beyond 2D grids this could be useful in inferring the rules underlying bird or flying insect swarms. That being said, this will be immediately useful in agents that don’t need to spare processing power for things like breathing or eating.


Biological Control


If any of you have seen Stephen Wolfram’s work on Cellular automata, including his recent work on automata for graph representations, you might be aware of the importance of automata in some newer ‘theories of everything’ attempts. Upon reading this work, and hearing about other recent works on estimating physical laws from observations, one might get excited that we might have a way to learn about underlying automata that guide the universe.

There is just one problem with this. As Stephen Wolfram pointed out in his recent post on such hypergraph automata, such automata may not only be computationally irreducible (i.e., you’re not going to find a decent heuristic that replaces the nitty-gritty rules), but their “emergent” properties may only become visible after something on the order of $10^{500}$ iterations of the rule. Such a problem would not only be beyond the limits of computing technology we can achieve as a civilization, it may be beyond the computing capabilities of our universe.

A few physicists have pointed out that it may be possible to use a black hole as a data storage or computing device, if a practical mechanism for extracting the contained information can be found. In The Singularity is Near, Ray Kurzweil cites the calculations of Seth Lloyd that a universal-scale computer is capable of $10^{90}$ operations per second. The mass of the visible universe can be estimated at $3 \times 10^{52}$ kilograms. If all matter in the universe were turned into a black hole, it would have a lifetime of $2.8 \times 10^{139}$ seconds before evaporating due to Hawking radiation. During that lifetime such a universal-scale black hole computer would perform $2.8 \times 10^{229}$ operations. This is still many orders of magnitude below the number of computations needed to fully extrapolate the state of the universe from Stephen Wolfram’s “hypergraph automata”.
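The arithmetic behind that comparison can be checked in a few lines. The sketch works in base-10 logarithms because $10^{500}$ overflows a floating point number; the inputs are the figures quoted above, and the variable names are just labels for this example.

```python
import math

# Figures quoted in the text, expressed as base-10 logarithms.
ops_per_second_log10 = 90                    # 10^90 operations per second
lifetime_log10 = math.log10(2.8) + 139       # 2.8 x 10^139 seconds
required_log10 = 500                         # ~10^500 rule iterations

# Multiplying quantities means adding their logarithms.
total_ops_log10 = ops_per_second_log10 + lifetime_log10

# Orders of magnitude by which the black hole computer falls short.
shortfall_log10 = required_log10 - total_ops_log10
```

The total comes out at roughly $10^{229.4}$ operations, i.e. $2.8 \times 10^{229}$, which leaves a gap of about 270 orders of magnitude to $10^{500}$.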


Cited as:

    title = "Clothing Automata",
    author = "McAteer, Matthew",
    journal = "",
    year = "2020",
    url = ""

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I would be very happy to correct them right away!

See you in the next post 😄
