# Clothing Automata

## Reproducing the 'Self-Classifying MNIST', except cutting the network size by 2/3

| UPDATED

UPDATE 11/04/2020: Saw new research out of los alamos regading using CAs in general for fundamental neural network research, and I decided I just had to make mention of it.

If you haven’t already seen the latest distil pub from Ettore Randazzo, Alexander Mordvintsev, Eyvind Niklasson, Michael Levin, and Sam Greydanus, definitely give it a look. I give a quick summary here about the main takeaways, but you should still look at the original article (even if it’s just to play with the animations). At its highest level, the research sets out to answer the following question:

“Can cellular automatas (CAs) use local message passing to achieve global agreement on what digit they compose?”

### The reasearch: Getting MNIST images to self-identify

If you’ve played around with the Distil animation, you’ve probably seen the pixels change color. There’s clearly message passing going on. The pixels are doing a pretty good job for most of the classification, even if there are some weak spots (such as confusing 2s and 8s, or just defaulting to 0). This paper isn’t about state-of-the-art MNIST classification. Classifying digits is often treated as a solved problem (classifying the handwriting author, not so much). I highly recommend you check out the previous Neural CA paper in their series, as this is an extension of that previous work.

The main takeaway of the previous work from these authors was CAs self-organizing and resisting perturbations through differentiable rules for generating the automata. How do these CAs work? A given Image is rasterized, and each cell/pixel represents a node in a graph, connected to all its immediate neighbors (and same pattern goes for all THOSE neighbors). If we have an MNIST image, each cell may only store data about whether it’s on or off (or alive or dead). The dead/off/blank cells do not pass messages around. The alive cells, on the other hand, need to communicate their alive status to each other, and by extension what image class they compose. Only by receiving messages from neighboring cells, storing a state, and sending out messages to neighbors, can the cells agree on what image class they are. Whatever image class is assigned the highest probability by a cell is the image class selected (this is probably sounding familiar to anyone that knows how a softmax layer works, so you can probably see how this thing would be trained).

What would these rules even look like? Suppose you’re in the bend of a `5`

. One cell could register that it’s neighbor is alive, but that neighbor could convey that it’s neighbor is dead. This original cell receives a message that it’s neighbor’s neighbor is dead, which matches the pattern of an edge. You can imagine these n’th degree neighbors as serving a similar purpose as nodes further back in a neural network. The result is that we have these state vectors being passed around as the message between nodes, with each cell integrating the messages it receives from it’s neighbors as a linear combination of it’s neighbors’ states and it’s own state. Seeing all this, you can probably imagine that it would be easy to create a bunch of these update rules for each node, but the challenge is that this update rule (for integrating a cell’s own state and neighbors’ states) needs to be the same for all cells in the graph.

The biological motivation for this is that the cells in the body, when differentiating, are able to follow rules that help them determine whether they should be a neuron or skin cell or muscle cell etc.. None of these cells have a global positioning system (i.e., they don’t know where exactly they are in the body), but they do have information on their immediate surroundings. This was alluded to in the previous work as well, though the message passing in planaria is probably far less convoluted than the one demonstrated in the original neural CA demo.

Original “Growing Neural Cellular Automata” Notebook. Unlike the MNIST work, the perception function is a bunch of hard-coded sobel layers rather than a tuneable neural network.

That’s a good motivation, and fortunately this process also heavily resembles a 3x3 convolution operation, meaning we can learn these rules with convolutional neural networks connected with residual connections. The architecture differences from the last paper is twofold. First, the automata in the last paper were 3-channel RGB values, but here we’re only interested in the state. Second, the positions of the dead and alive cells are static. This model is only interested in message passing. While the architecture seems familiar, this isn’t something that can be trained just once. The cells need to have a notion of being continuously alive, being continuously prepared that there might be some modification to the cell. This is achieved by randomly initializing the cell states, and then training the cells to predict the MNIST they’re within. This can be achieved pretty quickly, but then at 200 steps of the automata the overall digit is switched (what the authors call the ‘mutation’ step). Rather than all having random initializations, the cells at this stage for the most part retain the digit class from the previous digit, and need to learn to switch tracks.

Maybe 95% of the cells will learn to switch track with little problem, but towards the end many cells will still be active. Over time the number of cells responding with the correct image identity will start to drop. When measured with ‘average total agreement’ (i.e., how many cells agree with each other on which digit they’re part of), the cells begin to agree rapidly, but then curiously begin to decohere. The authors hypothesize that this might be due to the cross entropy loss used in training. Usually in ML we use something like cross entropy loss for classification without a second thought. The softmax operator serves a normalization step of changing the class distribution that’s judged by the cross entropy loss. It’s this combo that causes problems.

If you look at the softmax, the $e^x$ components all mean that these entries for the other classes can never truly be zero. They can get very close, but never AT zero. The loss will never be zero, and there will never be perfect logits, but the gradients will always push the network in the direction of increasing the likelihood of the most favored logit, and decreasing all others. If we do this in a neural network just once, with a finite runtime, this is usually not a problem. The problem comes about when we run the network for infinite time. These numerical values for the losses explode. If you want to change the class outputs you need to change the loss values by more dramatic values. With L2 loss, this shouldn’t happen. You don’t output logits you output probabilities and compare the L2 distance. While this makes the previous situation harder, it doesn’t completely eliminate it. For fixing this problem on the loss side, an alternative to softmax might be needed, such as SparseMax or sum-normalization.

Fortunately, the authors found a quicker fix than combing through softmax alternatives. The authors demonstrated what happens when noise is consistently introduced during the training process. This does a much better job than the L2 loss of preventing the network accuracy from dropping. It seems keeping the cells on their toes actually increases the network agreement over time.

One interesting robustness property of these networks is their resistance to out-of-distribution classes. If one draws a digit within the mnist dataset, the cells usually converge on the correct class (or at least they converge on a class). If you add a few lines to turn that digit into a letter, or connect multiple digits with lines, the cells will continue disagreeing with each other and will struggle to settle on one global class.

Not only do they visualize the internal cell states (which class), but this message passing can be visualized by the latent variables. Over time, as these messages pass throughout the contours of the shape, you can see how the class labels change in response. This gives more credence to the authors’ model of the network as cells passing around messages of their neighbors’ states.

### Self-assembling Clothing

If any of you have seen movies like Black Panther, you’ve probably seen the character with clothes that seem to assemble themselves. Big questions about material science notwithstanding, it’s also a mystery how such components are able to keep track of themselves. This question inspired me to try out the neural Cellular Automatas with the Fashion MNIST dataset.

Replacing MNIST with Fashion MNIST may seem relatively straightforward at first, but it’s not that simple. The original self-classifying MNIST team pointed out that MNIST digits are not rotationally invariant, and that this means agents must be aware of their orientation with respect to the grid. Therefore, while they do not know where they are, they do know where up, down, left and right are. The same doesn’t always go for items in the Fashion MNIST dataset. A sweater is still going to be a sweater, even when it is turned upside down. There’s also the issue that many of the items in the Fashion MNIST dataset aren’t always neatly translated to their silloutettes. For example, some stripes on a shirt may end up being confused with holes in that shirt (then again, if we have some kind of futuristic clothing use-case in mind, that could potentially be spun as a feature rather than a bug).

The biological analogy here is a situation where the remodeling structures exist in the context of a larger body and a set of morphogen gradients or tissue polarity that indicate directional information with respect to the three major body axes. Given these preliminaries, we introduce the self-classifying MNIST task.

There are some immediate performance differences here. For one, even in the best of cases, the total agreement and total accuracy usually fall short of the average cases for the self-identifying MNIST. Strangely enough, once these metrics hit a plateau, there seems to be a similar amount of drift compared to the MNIST

### Simpler message passing

Automata are fascinating not for the shapes they ultimately produce, but for the simplicity of the rules that can complete those rules. As far as machine learning research goes, this model was pretty parameter efficient. But, can we go even further?

One thing we can do is replace full precision weights with quantized weights. Even with weight quantization, we can still set up a network that’s over-parameterized enough to safely get aroun the bias-variance tradeoff problem.

If we go ahead and replace the 2 sub-networks in the MNIST Neural CA with quantized layers, the results are remarkable in two ways.

- So, with our perceiving model, despite having the same number of parameters (about 18.1 K), we’ve gone from using 70.7 KB to only 2.5 KB. With our dmodel, we’ve gone from using around 31 KB to using only 990 bytes (yes, you read that right, bytes without any kind of
*kilo-*,*mega-*,*giga-*, or*tera-*prefix). While this memory compression is great and all, how well does this work? - While using the same training parameters, our log loss goes from being in the 2 to 3 range, to being in the 9 to 10 range. Yes, this is remarkable, but not in the good way. Since this is log loss, this means our initial network is orders of magniute worse (it is in fact little better than random in it’s pixel selection)

Only replacing the sub-network with binarized layers isn’t enough. Fortunately, since we compressed our model size by about 25x by changing the weight precision, we can do a lot to make the performance similar to the original distil.pub article method while also coming far below the memory requirements.

There are a few different variants we can try for our network, due to the fact that network kernels and inputs can be quantized independently. For example, a network which contains layers where only the kernels are quantized is referred to as a Binary Weight Network (BWN). When only the inputs are binarized, the network is referred to as a Binary Activation Network (BAN). When a network contains layers in which both the inputs as well as the kernels are binarized the network is referred to as a Binary Neural Network (BNN).

For every possible combination of weights and quantizers, the model is being trained as a mutating network with L2 loss and additive noise (since this was the best version of the original `float32`

model).

Perceive params |
DModel params |
Weight precision |
Memory |
Input Quantizer |
Kernel Quantizer |
Log10(Loss) |
---|---|---|---|---|---|---|

18,080 | 7,920 | `float32` |
70.62 KiB + 30.94 KiB | NoOp/ `None` |
NoOp/ `None` |
1.151 |

18,080 | 7,920 | `bin` |
2.51 KiB + 990 B | SteSign/ `ste_sign` |
SteSign/ `ste_sign` |
9.532 |

18,080 | 8,336 | `bin` |
2.51 KiB + 990 B | NoOp/ `None` |
SteSign/ `ste_sign` |
7.463 |

bigger | bigger | `bin` |
bigger | NoOp/ `None` |
ApproxSign/ `approx_sign` |
7.488 |

bigger | bigger | `bin` |
bigger | NoOp/ `None` |
SteHeaviside/ `ste_heaviside` |
1.685 (unstable) |

bigger | bigger | `bin` |
bigger | NoOp/ `None` |
SwishSign/ `swish_sign` |
7.582 |

bigger | bigger | `bin` |
bigger | NoOp/ `None` |
MagnitudeAwareSign/ `magnitude_aware_sign` |
1.682 |

bigger | bigger | `bin` |
bigger | NoOp/ `None` |
SteTern/ `ste_tern` |
8.528 |

bigger | bigger | `bin` |
bigger | NoOp/ `None` |
DoReFa/ `dorefa_quantizer` |
1.668 |

### What does this all mean?

#### Summary of the ML findings

We’ve demonstrated a few key things about neural cellular automata

- agents can learn awareness within larger structures, and learn classification rules (i.e., this was the entire point of the Distil work, I’m just repeating it here). The authors point out that previous work on two-dimensional cellular automata has already been combined with reservoir computing, boltzmann machines, evolutionary algorithms, and ensembles, but that this is the first kind that uses the kind of fully differentiable end-to-end pipeline that Jax fangirls would flip out over.
- Said agents are suprisingly robust, and can even be applied to shapes with less distinguishable sillouettes (such as Fashion MNIST)
- In terms of rule complexity, we can go even simpler than the ~100 KiB floating point parameter model. We can go all the way to simple binary decisions.
- Whether this is simpler from an algorithmic information theory perspective is…complex. If we define parameters as rules, then our binarized approach is indeed more complex in terms of the size of the program it takes to write it out. That being said, the reduced memory means we can see a massive improvement in runtime.

As for what this means beyond fun web page interactives, quite a lot actually.

#### Control Theory

Automata that can learn their patterns from within themselves is obviously incredibly important for swarm robotics. True, if exanded beyond 2D grids this could be useful in interpolating the rules underlying bird or flying insect swarms. That being said, this will be immediately useful in agents that don’t need to spare processing power for things like breathing or eating. It’s also worth distinguishing this from simple particle swarm optimization, which involves all agents converging on a single point.

#### Biological Control

Part of the biological motivation for the original neural cellular automata paper was complex behaviors such as tissue regeneration, which are guided by a suprisingly small set of intra- and extracellular signalling. Neural Cellular automata may be another tool that we can add to ground-up living system design, much like reinforcement learning was recently applied to make “xenobots” out of living cells.

#### Machine Learning fundamentals

According to the lottery ticket hypothesis, the bigger the neural network, the more likely some of its weights are initialized to values that are well suited to learning to perform the task at hand. But just how big does it need to be? Researchers investigated the impact of initial parameter values on models of various sizes. What’s new: Jacob M. Springer at Swarthmore College and Garrett T. Kenyon at Los Alamos National Laboratory used the Game of Life to explore how slight changes in a network’s initial weights affect its ability to learn. To learn consistently, they found, networks need more parameters than are theoretically necessary. Key insight: Devised by mathematician John Horton Conway in 1970, the Game of Life starts with a pattern of black (dead) or white (living) squares on a grid. It changes the color of individual squares according to simple rules that reflect the ideas of reproduction and overpopulation as illustrated above in an animation by Emanuele Ascani. Because the outcome is deterministic, a network that learns its rules can predict its progress with 100 percent accuracy. This makes it an ideal environment for testing the lottery ticket hypothesis. How it works: Each step in the game applies the rules to the current grid pattern to produce a new pattern. The authors limited the grid to eight by eight squares and built networks to predict how the pattern would evolve. The authors generated training data by setting an initial state (randomly assigning a value to each square based on a random proportion of squares expected to be 1) and running the game for n steps. They built minimal convolutional neural networks using the smallest number of parameters theoretically capable of predicting the grid’s state n steps into the future (up to 5). They also built oversized networks, scaling up the number of filters in each layer by a factor of m (up to 24). For a variety of combinations of n and m, they trained 64 networks on 1 million examples generated on the fly. In this way, they found the probability that each combination would master the task. Results: The authors chose the models that learned to solve the game and tested their sensitivity to changes in their initial weights. When they flipped the sign of a single weight, about 20 percent of the models that had learned to predict the grid’s pattern one step into the future failed to learn a consistent solution. Only four to six flips were necessary to boost the failure rate above 50 percent. They also tested the oversized models’ probability of finding a solution. Only 4.7 percent of the minimal one-step models solved the problem, compared to 60 percent of networks that were three times bigger. Why it matters: The authors’ results support the lottery ticket hypothesis. Future machine learning engineers may need to build ever larger networks — or find a way to rig the lottery. We’re thinking: When it comes to accuracy, the old maxim holds: The bigger, the better.

In short, the researchers demonstrated the usefulness of automata and versions of the “Game of Life” as reinforcement learning environments. Still, if this is the AlphaGo or MuZero of automata, many applications of automata would require a similar advance on the part of cooperative RL instead of just competitive RL (e.g., An automata equivalent of Salesforce’s AI economist)

#### Physics

If any of you have seen Stephen Wolfram’s work on Cellular automata, including his recent work on automata for graph representations, you might be aware of the importance of automata in some newer ‘theories of everything’ attempts. Upon reading this work, and hearing about other recent works on estimating physical laws from observations, one might get excited that we might have a way to learn about underlying automata that guide the universe.

There is just one problem with this. As Stephen Wolfram pointed out in his recent post on such hypergraph automata, such automata may be not only be computaionally irreducible (i.e., you’re not going to find a descent hueristic that replaces the nitty-gritty rules), but its “emergent” properties may only become visible after something on the order of $10^{500}$ iterations of the rule. Such a problem would not only be beyond the limits of computing technology we can achieve as a civilization, it may be beyond the computing capabilities of our universe. From wikipedia:

A few physicists have pointed out tha it may be possible to use a black hole as a data storage or computing device, if a practical mechanism for extraction of contained information can be found. In The Singularity is Near, Ray Kurzweil cites the calculations of Seth Lloyd that a universal-scale computer is capable of $10^{90}$ operations per second. The mass of the visible universe can be estimated at $3 \times 10^{52}$ kilograms. If all matter in the universe was turned into a black hole, it would have a lifetime of $2.8 \times 10^{139}$ seconds before evaporating due to Hawking radiation. During that lifetime such a universal-scale black hole computer would perform $2.8 \times 10^{229}$ operations. This is still many orders of magnitude below the number of computations needed to fully extrapolate the state of the universe from Stephen Wolfram’s “hypergraph automata”.

So…yeah. This MNIST CA work is probably as far from discovering the underlying universal automata rule as Prolog was from building DeepMind’s AlphaZero (if not many times further than that).

#### Next stages of CA research

If we continue with the analogy of Prolog (which was superceded by higher-order logic languages), it may be that the secret to designing more complex and more stable automata (like the kind that could be used in ground-up biological systems design) is coming up with higher-order logical languages for automata, and making those differentiable.

### References

- Randazzo, E., Mordvintsev, A., Niklasson, E., Levin, M., & Greydanus, S. (2020). Self-classifying MNIST Digits. Distill, 5(8), e00027-002.
- Mordvintsev, A., Randazzo, E., Niklasson, E., & Levin, M. (2020). Growing neural cellular automata. Distill, 5(2), e23.
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.
- Laha, A., Chemmengath, S. A., Agrawal, P., Khapra, M., Sankaranarayanan, K., & Ramaswamy, H. G. (2018). On controllable sparse alternatives to softmax. In Advances in Neural Information Processing Systems (pp. 6422-6432).
- Kriegman, S., Blackiston, D., Levin, M., & Bongard, J. (2020). A scalable pipeline for designing reconfigurable organisms. Proceedings of the National Academy of Sciences, 117(4), 1853-1859.
- Wolfram, S. (2020). A Project to Find the Fundamental Theory of Physics. Wolfram Media.
- Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., … & Lillicrap, T. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Cited as:

```
@article{mcateer2020automata,
title = "Clothing Automata",
author = "McAteer, Matthew",
journal = "matthewmcateer.me",
year = "2020",
url = "https://matthewmcateer.me/blog/clothing-automata/"
}
```

*If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatily, you can follow me on Twitter and reach out to me there.*

See you in the next post 😄