Style Transfer with Binarized Neural Networks

Neural Style Transfer with binary 0s and 1s instead of floating points in your ImageNet model

Co-Authored by Vinay Prabhu

This post came from our presentation at the ACAI 2020 Workshop. You can see the full presentation in the workshop recording here.


Neural Style Transfer (NST) is a technique for rendering a content image in the style of a separate style image. There are a variety of approaches, ranging from model-based to image-based, parametric to non-parametric, and still image to video. For the sake of this research, we’re starting with image-based, parametric, still-image NST.

Who wants to use this? Everyone and their mothers at this point. You’re probably already familiar with NST being used in various photo and video apps like Instagram.

or in this case, Adobe’s new photo-editing app

It’s increasingly used in post-production for movies and TV shows.

Do you hear that? That’s the sound of somebody’s GPU letting out its death cry.

And projects like Google Stadia are even trying to bring live style transfer to videogames.

Press F to pay respects…to your WiFi bill

And despite all these advanced applications, it’s also one of the early techniques learned by people starting out in computer vision today.

Upon seeing all these examples one may wonder what the performance of NST looks like in all these use-cases. For most of these applications, the answer will probably be some variant of “Not good enough”.

Credit to Zach Weiner of Saturday Morning Breakfast Cereal

Let’s look at a way of improving the runtime and memory load.

Compressing Neural Networks

One of the most often proposed solutions to neural network performance issues is compression (i.e., getting the number of parameters or memory size as low as possible, or making the network shallower). One way of easing memory constraints is reducing the precision of the network parameters. How much less precision can we get away with?

TPUs speed up NNs by converting float32s into 16-bit floats. This is pretty good, but can we go further? 8 bit? 4 bit? Even better. We can go all the way to binary $+1$ or $-1$ (oh yeah! 🔥). This is the main idea behind how binarized neural networks (BNNs), a subtype of quantized neural network, work.

Speaking of how BNNs work, here’s what makes them unique compared to neural networks with floating point parameters. All activations and latent full-precision weights go through a quantization function $q(x)$ in the forward pass:

$$q(x) = \begin{cases} -1 & x < 0 \\ 1 & x \geq 0 \end{cases}$$
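As a minimal sketch of this forward-pass quantizer (plain NumPy rather than any particular BNN library; the name `binarize` is just illustrative), the piecewise definition above could be written as:

```python
import numpy as np

def binarize(x):
    """Sign-style quantizer q(x): -1 where x < 0, +1 where x >= 0."""
    x = np.asarray(x, dtype=np.float32)
    return np.where(x < 0, -1.0, 1.0).astype(np.float32)

# Hypothetical latent full-precision weights
latent_weights = np.array([-0.7, 0.0, 0.2, 1.5])
print(binarize(latent_weights).tolist())  # [-1.0, 1.0, 1.0, 1.0]
```

Note that zero maps to $+1$, matching the $x \geq 0$ branch of the definition.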

“But…”, you may be thinking, “…how would the network learn with this quantization interfering with the gradients?” And you’d be absolutely right. If we leave the network as is, learning just grinds to a halt because the gradient of this function is zero almost everywhere. For this reason, the gradient is instead estimated using the Straight-Through Estimator (STE):

$$\frac{\partial q(x)}{\partial x} = \begin{cases} 1 & \left|x\right| \leq 1 \\ 0 & \left|x\right| > 1 \end{cases}$$

On the backward pass, the binarization is essentially replaced by a clipped identity.
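To make the forward/backward asymmetry concrete, here is a small NumPy sketch of the STE under the definitions above. The function names are hypothetical, and a real framework would register this as a custom gradient rather than calling it by hand:

```python
import numpy as np

def ste_sign_forward(x):
    """Forward pass: hard binarization to -1 / +1."""
    return np.where(x < 0, -1.0, 1.0)

def ste_sign_backward(x, upstream_grad):
    """Backward pass: let the gradient through where |x| <= 1, zero it elsewhere."""
    mask = (np.abs(x) <= 1.0).astype(upstream_grad.dtype)
    return upstream_grad * mask

x = np.array([-2.0, -1.0, -0.3, 0.0, 0.8, 1.0, 1.7])
grad_out = np.ones_like(x)              # pretend gradient arriving from the next layer
grad_in = ste_sign_backward(x, grad_out)
print(grad_in.tolist())  # [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```

Only the two inputs with magnitude above 1 have their gradients zeroed, which is exactly the derivative of clipping $x$ to $[-1, 1]$.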

And that’s all you really need to know. Different variants of this setup exist for different levels of quantization (e.g., 4-bit, 8-bit, etc.), and the same approach can be applied to nearly any learnable parameters of the network. It can even be applied to make quantized convolutional layers.

Probably a lot simpler than you were expecting

This all sounds pretty incredible, but how simple is this to implement in practice? BNNs are still relatively new, but there already exist libraries and packages that can interface with frameworks like TF.Keras and PyTorch. Take a simple example of a binarized CNN CIFAR-10 classifier. The implementation isn’t too different from how you’d normally set it up. Some of the major differences include the pixels being normalized to $[-1, 1]$ instead of $[0, 1]$, and the fact that you replace 2D convolutional and dense layers with their quantized equivalents. The result? You can get 98% accuracy on CIFAR-10 (which is pretty easy), but also do so with a model that’s 30x smaller than the full float32 equivalent (that’s a lot more impressive 😲)
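As a small illustration of the input-scaling difference mentioned above, here is one way the $[-1, 1]$ normalization might look for 8-bit images. This is a NumPy sketch with a made-up helper name, not the exact preprocessing used in any particular library:

```python
import numpy as np

def to_signed_unit_range(images_uint8):
    """Map uint8 pixels in [0, 255] to float32 values in [-1, 1]."""
    images = np.asarray(images_uint8, dtype=np.float32)
    return images / 127.5 - 1.0

# Hypothetical batch of pixel values
batch = np.array([[0, 64, 192, 255]], dtype=np.uint8)
scaled = to_signed_unit_range(batch)
print(scaled.min(), scaled.max())  # -1.0 1.0
```

Centering the inputs around zero means that a sign-based quantizer sees both negative and positive values, instead of only non-negative ones.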

Pros and Cons of BNNs

There are no free lunches. Like with any machine learning system, there are a variety of pros and cons to using binarized neural networks.

One of the most obvious advantages of only using 1-bit latent-variable weights is that BNNs have super-low memory requirements compared to their float32 equivalents. Some of the more recent models like QuickNet and QuickNetXL can be stored in as little as 3.18 MB and 6.22 MB, respectively. Even some of the larger BNNs like DoReFaNet and XNOR-Net can be stored in under 23 MB. Compare this to some of the pre-trained ImageNet models in Keras. VGG models are typically larger than 500 MB. ResNetV2 models can take up more than 100 MB. With such low memory requirements, many of these binary neural network architectures can be run very quickly even on non-GPU devices. QuickNet, one of the more accurate classification networks with 10.5 million parameters, can run classifications on a Pixel 1 Android device in only 18.4 ms (and can run faster if using more than 1 thread). Compare this with classical float32 ImageNet models that can have latencies of >300 ms in a similar environment. On top of all of this, packages like Larq make it very easy to build binarized/quantized neural networks (either using full pre-trained models or just layers) in tf.Keras.
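For a rough sense of where these savings come from, here is some back-of-the-envelope arithmetic using the 10.5 million parameter count quoted above. `model_size_mb` is a hypothetical helper that counts parameter storage only and ignores file-format overhead:

```python
def model_size_mb(num_params, bits_per_param):
    """Storage for the parameters alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

quicknet_params = 10.5e6  # parameter count quoted in the post

print(model_size_mb(quicknet_params, 32))  # 42.0   (every parameter as float32)
print(model_size_mb(quicknet_params, 1))   # 1.3125 (every parameter as 1 bit)
```

The quoted 3.18 MB sits between these two bounds. Presumably (this is an assumption, not something stated in the post) that is because not every parameter in the network is stored at 1 bit.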

Of course, there are reasons why these BNNs aren’t suddenly replacing all other convolutional neural networks. The list of state-of-the-art architectures is improving very quickly, but even the most advanced ones still fall short of many of the other networks on the ImageNet leaderboards. Also, while it is possible to deploy Larq models to Android devices and Raspberry Pi models, the setup process isn’t quite as straightforward as it is for most other TensorFlow deployment pipelines. All of these drawbacks stem from the fact that, while quantized neural networks are advancing quickly on both a practical and theoretical front, this is still a very young subfield.

Setting up BNN-NST experiments

In principle, the same ideas behind NST should still apply to non-VGG models. When it comes to the choice of the layers in each part of the loss function, the style layers are the parts of the network that recognize the higher-level shapes and patterns (without being too far removed from the input image itself). By contrast, the content layers are for making sure that whatever image is passing through the network still results in at least almost the same image classification.
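The post does not spell out the loss terms themselves, so the following is only a sketch of the common Gram-matrix formulation from Gatys et al., which we assume is close to what is being optimized here. The function names and the normalization by the number of spatial positions are illustrative choices:

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of an (H, W, C) feature map."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    return flat.T @ flat / (h * w)

def style_loss(generated_feats, style_feats):
    """Mean squared difference between the Gram matrices of one style layer."""
    diff = gram_matrix(generated_feats) - gram_matrix(style_feats)
    return float(np.mean(diff ** 2))

def content_loss(generated_feats, content_feats):
    """Mean squared difference between the raw activations of one content layer."""
    return float(np.mean((generated_feats - content_feats) ** 2))

# Hypothetical activations: shuffling spatial positions keeps the "style" statistics
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 3))
shuffled = feats.reshape(16, 3)[rng.permutation(16)].reshape(4, 4, 3)
print(style_loss(shuffled, feats), content_loss(shuffled, feats))
```

Because the Gram matrix sums over spatial positions, the style term is essentially unchanged by shuffling where features occur, while the content term is not. That is the intuition behind treating the two sets of layers differently.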

For the sake of our experiments, we wanted to get the best style transfer WITHOUT upsampling. When style transfer networks learn how to produce high-resolution images despite the network having a default image input size of $224\text{px} \times 224\text{px}$, it’s usually because some additional sub-network or autoencoder is added on, allowing the system to learn upsampling as well. For low-memory devices, adding on so much to the existing networks may be cheating at best or impractical at worst.

| Network | VGG-19 | QuickNet |
|---|---|---|
| Year | 2015 | 2020 |
| ImageNet Top-5 Accuracy | 90.1 % | 81.0 % |
| ImageNet Top-1 Accuracy | 71.3 % | 58.6 % |
| Parameters | 138.4 million | 10.5 million |
| Memory | 549 MB | 3.21 MB |

Lessons from using BNNs for NST

NOTE: This research is still ongoing. In fact there was a giant optimization script running in the background of the ACAI 2020 conference. We’re still presenting some of these early results because the results were just too informative / useful / cool-looking not to show off.

Lesson #1: It may take a bit to find a groove

Producing the first intelligible results took much longer than anticipated. A few networks (mainly the smaller ones) were deemed too imprecise for this task. Even with the QuickNet models, searching for stable style and content loss weights took a lot of optimization steps.

…this is roughly what our first 10 attempts at this looked like.

Lesson #2: Don’t forget total variation loss

If you want to avoid noisy images with wild contrasts (no matter how big your image is), it pays to add total variation (TV) loss. The total variation is the sum of the absolute differences for neighboring pixel-values in the input images. This measures how much noise is in the images. This can be used as a loss-function during optimization so as to suppress noise in images. Luckily, TF >2.2.0 now has a built-in function for total variation loss. If you have a batch of images, then you should calculate the scalar loss-value as the sum: loss = tf.reduce_sum(tf.image.total_variation(images)). This implements the anisotropic 2-D version of the formula described here.
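For readers who want to see what that sum of absolute neighbor differences looks like, here is a NumPy sketch of the anisotropic formula for a batch shaped (batch, height, width, channels). It is meant to mirror the description above rather than reproduce TensorFlow's implementation, and `tv_weight` is just an illustrative value:

```python
import numpy as np

def total_variation(images):
    """Anisotropic total variation, one value per image in the batch."""
    images = np.asarray(images, dtype=np.float32)
    dh = np.abs(images[:, 1:, :, :] - images[:, :-1, :, :])  # vertical neighbors
    dw = np.abs(images[:, :, 1:, :] - images[:, :, :-1, :])  # horizontal neighbors
    return dh.sum(axis=(1, 2, 3)) + dw.sum(axis=(1, 2, 3))

rng = np.random.default_rng(0)
smooth = np.full((1, 8, 8, 3), 0.5)
noisy = smooth + rng.normal(scale=0.2, size=smooth.shape)

tv_weight = 3e-3  # illustrative weight, to be tuned per experiment
print(tv_weight * total_variation(smooth).sum())  # 0.0 for a perfectly flat image
print(tv_weight * total_variation(noisy).sum())   # larger for the noisy image
```

A flat image has zero total variation, so adding this term to the loss pushes the optimizer away from pixel-level noise.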

It’s usually easy to get away with not using TV loss in normal float32 NST, but it becomes critical with binarized networks.

Notice how the color scheme starts to resemble that of the style image more when TV loss is introduced

Lesson #2.5: …but don’t give TOO much weight

This mistake was quick to spot: in one experiment, a minus sign was omitted when setting the TV loss weight, making it $3 \times 10^{3}$ instead of $3 \times 10^{-3}$ (i.e., a million times higher than intended).

The result was a monster…

This is less “The Great Wave” and more “magic eye”

Lesson #3: Just using style weight is very informative

Iterating through the quantized convolution layers of QuickNet one at a time, all without adding any weight to the content loss, provides a lot of insight into which features activate each layer the most. This way, we can see what kinds of information/features are being prioritized by the different layers.

Our final research output will be something like an activation atlas, but this is still a great first step in any NST project

Lesson #4: Testing different content layers …less dramatic

Like before, we can also iterate the content layer choices throughout the network architecture of QuickNet. In theory this should provide interesting details on what the content layers consider important.

But it’s not that simple.

Unlike Style Layer Choice, combining content layers doesn’t make much of a difference because optimizing content loss from one content layer optimizes for all subsequent layers. It’s better to just pick the content layers with cross-validation with the appropriate style-layer combos.

Lesson #5: Not so simple as using Inception fixes

In Inception networks, these blocky artifacts are often a result of max-pooling layers being used in a network instead of average pooling.

The blocky artifacts, before the attempt at fixing via average pooling

When that fix is applied to QuickNet, the effect is…nothing 😐

Post-average-pooling fix. This was underwhelming

This suggests that these blocky artifacts are the result of other processes, like the addition layers in QuickNet.
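To make the pooling distinction in this lesson concrete, here is a small NumPy sketch of how gradients flow back to the input through 2×2 max pooling versus average pooling. It is not taken from the experiments, and `pool_input_gradients` is a hypothetical helper. Max pooling routes each window's gradient to a single pixel, which is one commonly cited reason for blocky patterns when optimizing an image, while average pooling spreads it evenly:

```python
import numpy as np

def pool_input_gradients(x, mode):
    """Gradient of sum(pool2x2(x)) w.r.t. x for an (H, W) map with even H and W."""
    x = np.asarray(x, dtype=np.float64)
    h, w = x.shape
    # Gather each non-overlapping 2x2 window into a row of 4 values
    windows = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    if mode == "max":
        grads = np.zeros_like(windows)
        grads[np.arange(len(windows)), windows.argmax(axis=1)] = 1.0
    else:  # average pooling
        grads = np.full_like(windows, 0.25)
    # Scatter the per-window gradients back to the (H, W) layout
    return grads.reshape(h // 2, w // 2, 2, 2).transpose(0, 2, 1, 3).reshape(h, w)

x = np.arange(16.0).reshape(4, 4)
print(pool_input_gradients(x, "max"))  # one nonzero entry per 2x2 window
print(pool_input_gradients(x, "avg"))  # 0.25 everywhere
```

Since swapping the pooling type made no visible difference for QuickNet, this mechanism alone does not seem to explain the artifacts seen here.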

Lesson #6: Tradeoffs in BNN Architecture choice

All the results thus far were demonstrated on QuickNet, but we also looked at how these experiments fared on networks like QuickNet-Large, QuickNet-XL, DoReFa-Net, and XNOR-Net. DoReFa-Net and XNOR-Net do better at avoiding checkerboard artifacts, but struggle to make stylistic changes. Conversely, QuickNet-Large and QuickNet-XL give more dramatic results more easily, but checkerboard artifacts are just as (if not more) pronounced.

Comparison of results among BNN architectures

Lesson #7: More to be done, but BNN-NST has use-cases

Even without getting results that superficially resemble VGG-19 or Inception quality, we can still produce style transfer that works for two use-cases:

  1. Art Attacks: This adversarial attack method can be done quickly and effectively with BNNs

Normal, unmodified content image

It’s confusing the image for a freshwater turtle, but it’s still not that bad

Okay, this is pretty bad. What resembles a smudge on the lens is making this network think it’s at the post office

  2. Lo-Fi Art: Let’s be honest, a lot of this stuff would still make a good album cover.

No idea if a band or album title like this even exists, but MAN this would make a great cover

What should you take away from all this?

In terms of progress, this problem of BNN-based NST went from “hopeless” to “minimum practical results”. While the results may not be a perfect replication of the Gatys et al., 2016 quality, they’re still useful for many applications, both practical and subjective.

We intentionally showed off a variety of negative results to offset publication biases that only favor dramatically positive results. It’s important to remind people, especially those of you who are newer to AI research, of how messy research can be.

We also provide this research as an example of how AI research should not just be focused on adding more compute to problems, but also figuring out what can be done for a given amount of compute.

We want to give an enormous thanks to Joshua D. Eisenberg, Ph.D., and all the organizers of ACAI 2020 for their efforts in putting this event together.

Cited as:

    title = "ACAI 2020 Workshop: Style Transfer with Binarized Neural Networks",
    author = "Matthew McAteer and Vinay Prabhu",
    journal = "",
    year = "2020",
    url = ""

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I would be very happy to correct them right away!

See you in the next post 😄
