# Structural Similarity in Neural Style Transfer

## Bringing back an ancient computer vision technique to effortlessly improve style transfer results

Co-Authored by Vinay Prabhu

This post came from our presentation at the ACAI 2020 Workshop; the full presentation is available in the workshop recording.

This is a quick overview of a neat trick we discovered for modifying neural style transfer outputs, with little-to-no change in hyperparameters (just a loss function addition). For context, when we refer to Neural Style Transfer (NST), we’re referring to the process of harnessing Convolutional Neural Networks (CNNs) to project a real-world content photograph into different style spaces.

Jing et al. (2019) created a very detailed taxonomy of these techniques, and within that this falls under the category of Non-Photorealistic Image-Optimisation-Based Online Neural Methods. Breaking that down even further within the full space of Example-based techniques, this means that:

1. We don’t care about weak-sauce re-coloring and retexturing
   • Yea 👌: Neural Style Transfer
   • Nay 👎: Image analogy techniques like retexturing and recoloring
2. We’re looking at methods that optimize images for frozen models, not the other way around
   • Yea 👌: Image-optimization-based online neural methods
   • Nay 👎: Model-optimization-based offline neural methods
3. We’re ignoring methods that don’t use summary statistics of images
   • Yea 👌: Parametric neural methods with summary statistics
   • Nay 👎: Non-parametric neural methods with MRFs
4. We’re not after photorealism
   • Yea 👌: Style transfer regardless of realism
   • Nay 👎: Optimizing for photorealism
5. And we’re not looking at videos (for now)
   • Yea 👌: Still-image style transfer
   • Nay 👎: Video style transfer

This is how we arrive at Non-Photorealistic Image-Optimisation-Based Online Neural Methods for Neural Style Transfer.

### Style-transfer as a Loss Function

Gatys et al. (2016) describe NST as follows: given an artwork image $\vec a$ and a photograph $\vec p$, synthesize a style-transferred synthetic image $\vec x$ by minimizing the loss function

$L_\text{total}\left( {\vec p,\vec a,\vec x} \right) = \alpha {L_{\text{content}}}\left( {\vec p,\vec x} \right) + \beta L_\text{style}\left( {\vec a,\vec x} \right)$,

with $\alpha$ and $\beta$ representing the weights of the content loss $L_\text{content}$ and style loss $L_\text{style}$.
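As a minimal sketch of these two terms (in NumPy, with layer indexing and the usual normalization constants folded into the weights for brevity), the content loss is a feature-map squared error and the style loss is a Gram-matrix squared error:

```python
import numpy as np

def gram_matrix(features):
    # features: (positions, channels) feature map flattened over spatial dims;
    # the Gram matrix captures which channels co-activate (the "style" statistics)
    return features.T @ features

def content_loss(F, P):
    # squared error between the synthesized image's and the photo's
    # feature maps at the content layer
    return 0.5 * np.sum((F - P) ** 2)

def style_loss(F_x, F_a):
    # squared error between Gram matrices of the synthesized image and the artwork
    return np.sum((gram_matrix(F_x) - gram_matrix(F_a)) ** 2)

def total_loss(F, P, style_feats_x, style_feats_a, weights, alpha, beta):
    # weighted combination of the content term and the per-layer style terms
    style = sum(w * style_loss(fx, fa)
                for w, fx, fa in zip(weights, style_feats_x, style_feats_a))
    return alpha * content_loss(F, P) + beta * style
```

In an actual run, `F`, `P`, and the style feature maps come from a frozen CNN's activations, and the loss is minimized with respect to the pixels of $\vec x$.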

This approach to image-based optimization for style transfer has produced impressive results, but compared to human style transfer, it still leaves a lot to be desired. NST using convolutional layers can pick up on lower-level features and incorporate them into an image, but it often lacks higher-level context-awareness. It may seem strange to try to benchmark something as subjective as style transfer, but we do have examples of ideal end products.

This human-level attention to detail and imagination may still be some ways off, but in this sub-area of NST we can improve our process by carefully selecting the image statistics we pay attention to.

This brings us to our improvement on the loss function. The main contribution is a multiplicative term that captures the structural similarity between the Gramian feature images produced at each style-extraction layer:

$L_\text{total}^\text{(ss)}\left( {\vec p,\vec a,\vec x} \right) = \alpha {L_{\text{content}}}\left( {\vec p,\vec x} \right) + \beta L_\text{style}^\text{(SSIM)}\left( {\vec a,\vec x} \right) \\ \, = \alpha \frac{1}{2}\sum\limits_{i,j} {{{\left( {F_{ij}^{{l^{\text{content}}}} - P_{ij}^{{l^{\text{content}}}}} \right)}^2}} \\ + \beta \sum\limits_{l \in {L^\text{style}}} { \color{Purple}\left( {{w_l} \color{Black}\left( {\sum\limits_{i,j} {{{\left( {G_{ij}^l - A_{ij}^l} \right)}^2}} } \right) \color{Purple}\times \overbrace {\frac{{\left( {1 - \xi \left( {\tilde G_{ij}^l,\tilde A_{ij}^l} \right)} \right)}}{2}}^\text{SSIM - component}} \right)}$

This new addition may seem to come out of nowhere, but the function $\xi \left( {\tilde G_{ij}^l,\tilde A_{ij}^l} \right)$ actually represents the standard structural similarity (SSIM) index from classical image processing. SSIM combines measures of luminance, contrast, and structure to measure the similarity between two images (Wang et al. (2004)). As will be seen in the results section, this has a subtle but tangible effect, with finer artistic strokes being rendered onto the style-transferred images.

Given two images $\color{Red}X$ and $\color{Blue}Y$, the SSIM() function works as follows:

$\text{SSIM}({\color{Red}X},{\color{Blue}Y})={\frac {(2{\color{Red} \mu _{x}}{\color{Blue} \mu _{y}}+{\color{DarkGreen} c_{1}})(2{\color{Magenta} \sigma _{xy}}+{\color{DarkGreen} c_{2}})}{({\color{Red} \mu _{x}^{2}}+{\color{Blue} \mu _{y}^{2}}+{\color{DarkGreen} c_{1}})({\color{Red} \sigma _{x}^{2}}+{\color{Blue} \sigma _{y}^{2}}+{\color{DarkGreen} c_{2}})}}$

where ${\color{Red} \mu _{x}}$ / ${\color{Blue} \mu _{y}}$ is the pixel-mean of image $\color{Red}X$ / image $\color{Blue}Y$ respectively, ${\color{Red} \sigma _{x}^{2}}$ / ${\color{Blue} \sigma _{y}^{2}}$ is the pixel-variance of image $\color{Red}X$ / image $\color{Blue}Y$ respectively, and ${\color{Magenta} \sigma _{xy}}$ is the empirical covariance of images $\color{Red}X$ and $\color{Blue}Y$. The constants ${\color{DarkGreen} c_{1}} = (k_1L)^2$ and ${\color{DarkGreen} c_{2}} = (k_2L)^2$ are independent of the inputs; they stabilize the division when the denominators are close to zero. $L$ is the dynamic range of the pixel values ($2^{\text{\# of bits per pixel}} - 1$), while $k_1 = 0.01$ and $k_2 = 0.03$ by default.
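To make this concrete, the SSIM formula above can be sketched in plain NumPy, using a single window over the whole array (a simplification; our experiments use the standard Gaussian-windowed version that applies this per patch and averages). It then plugs into the $(1 - \xi)/2$ modulation term of the modified style loss; here we read the tilde notation as Gram matrices rescaled to $[0, 1]$, which is our assumption:

```python
import numpy as np

def ssim(x, y, max_val=1.0, k1=0.01, k2=0.03):
    # Single-window SSIM: means, variances, and covariance are taken over
    # the whole array rather than per 11x11 Gaussian-weighted patch.
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_modulation(g_tilde, a_tilde):
    # The multiplicative (1 - SSIM)/2 factor from the modified style loss;
    # inputs are Gram matrices rescaled to [0, 1] (our reading of the tilde).
    return (1.0 - ssim(g_tilde, a_tilde)) / 2.0
```

Note that when the two Gram matrices are structurally identical, SSIM is 1 and the modulation vanishes, so structurally matched layers contribute less style penalty; structurally dissimilar layers get their Gram error amplified by up to a factor of 1.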

“Those are some nice equations…but how did this actually work out?”, you might be saying. We’re getting to that.

As motivated in Gatys et al. (2016), the texture or the ‘style layers’ are typically the initial layers in a CNN and the ‘content layer(s)’ are picked from the deeper layers closer to the softmax output. One can think of these two components as follows:

• Style Layers: Parts of the network that recognize the higher-level shapes and patterns (without being too far removed from the input image itself).

• Content Layers: For making sure that whatever image passes through the network still yields (almost) the same image classification.

For the standard VGG-19 architecture (see Simonyan and Zisserman (2014)) chosen here, we set the content layers in $L_\text{content}$ to be [ 'block5_conv2' ] and the style layers in $L_\text{style}$ to be [ 'block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1' ]. For all the results shown below, our SSIM hyper-parameters were as follows:
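A sketch of pulling those layer activations out of a frozen VGG-19, assuming TensorFlow's Keras API (we pass `weights=None` here to avoid the ImageNet weight download; a real run would use `weights='imagenet'`):

```python
import tensorflow as tf

CONTENT_LAYERS = ['block5_conv2']
STYLE_LAYERS = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']

def build_feature_extractor(layer_names):
    # Frozen VGG-19 that returns the activations of the requested layers.
    # weights=None gives random weights for this illustration only;
    # use weights='imagenet' for actual style transfer.
    vgg = tf.keras.applications.VGG19(include_top=False, weights=None)
    vgg.trainable = False
    outputs = [vgg.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)

extractor = build_feature_extractor(STYLE_LAYERS + CONTENT_LAYERS)
```

Calling `extractor` on a batch of images then returns one activation tensor per listed layer, from which the Gram matrices and content features are computed.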

• $\text{max val} = 1$
• $\text{filter size} = 11$
• $\text{filter } \sigma = 1.5$
• $\text{content weight } (\alpha) = 10^3$
• $\text{style weight } (\beta) = 10^{-2}$
• And last but not least, for the Adam optimizer (Kingma and Ba (2014)):
  • $\text{learning rate} = 5$
  • $\beta_1 = 0.99$
  • $\beta_2 = 10^{-1}$
  • $N_\text{iter} = 10^3$ iterations
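For reference, the Adam update with these (admittedly unusual, e.g. $\beta_2 = 10^{-1}$) hyper-parameters can be sketched in NumPy; in the style-transfer loop it is applied directly to the pixels of the synthesized image:

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=5.0, beta1=0.99, beta2=0.1, eps=1e-8):
    """One Adam update (Kingma and Ba, 2014) with the post's hyper-parameters.

    x: current image pixels, grad: dL/dx at x,
    (m, v): running first/second moment estimates, t: 1-based step counter.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum) update
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment update
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

Running $10^3$ of these steps, with the gradient of the total loss recomputed each time, produces the final stylized image.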

For content images we used an image of a sea turtle, a photo of the Tuebingen Neckarfront, and a photo of a Persian cat. For style images, we used Japanese ukiyo-e artist Hokusai’s Great Wave off Kanagawa, Pillars of Creation (a Hubble Space Telescope photo of interstellar gas and dust in the Eagle Nebula, in the constellation Serpens), Van Gogh’s Starry Night, and Vassily Kandinsky’s Composition VII.

The stylistic changes are subtle, but still noticeable.

“It looks more like the artist actually drew the image, rather than the style picture being cut up and rearranged” — Matthew’s housemate Sid

## Takeaways

Beyond the initial takeaway of a useful style-transfer technique, we wanted to illustrate that there’s plenty of room for augmenting the basic NST loss function described in Gatys et al. (2016), or any loss function for that matter. We also think this is a fantastic example of how new areas of the style space can still be unlocked by incorporating classical computer vision techniques (SSIM is from 2004, after all).

Above all, we also want to stress that there is no one singularly correct NST method, and that we are in no way advocating any claims of betterment or superiority of the SSIM weighting.

In order to ensure reproducibility and promote further experimentation, we have duly open-sourced the implementation here: https://github.com/vinayprabhu/NST_experiments

We want to give an enormous thanks to Joshua D. Eisenberg, Ph.D., and all the organizers of ACAI 2020 for their efforts in putting this event together.

Cited as:

```bibtex
@article{mcateer2020ssim,
  title   = "ACAI 2020 Workshop: Structural Similarity in Neural Style Transfer",
  author  = "Vinay Prabhu and Matthew McAteer",
  journal = "matthewmcateer.me",
  year    = "2020",
  url     = "https://matthewmcateer.me/posts/ssim-aware-style-transfer/"
}
```

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I would be very happy to correct them right away!

See you in the next post 😄
