Structural Similarity in Neural Style Transfer

Bringing back an ancient computer vision technique to effortlessly improve style transfer results

Co-Authored by Vinay Prabhu

This post came from our presentation at the ACAI 2020 Workshop. You can see the full presentation in the workshop recording here.

This is a quick overview of a neat trick we discovered for modifying neural style transfer outputs, with little-to-no change in hyperparameters (just an addition to the loss function). For context, when we refer to Neural Style Transfer (NST), we mean the process of harnessing Convolutional Neural Networks (CNNs) to project a real-world content photograph into different style spaces.

Jing et al. (2019) created a very detailed taxonomy of these techniques, and within it, our work falls under the category of Non-Photorealistic Image-Optimisation-Based Online Neural Methods. Breaking that down within the full space of example-based techniques, this means that:

  1. We don’t care about weak-sauce re-coloring and retexturing

    • Yea 👌: Neural Style Transfer
    • Nay 👎: Image Analogy Techniques like retexturing and recoloring
  2. We’re looking at methods that optimize images for frozen models, not the other way around

    • Yea 👌: Image-optimisation-based online neural methods
    • Nay 👎: Model-optimization based offline neural methods
  3. Ignoring methods that don’t use summary statistics of images

    • Yea 👌: Parametric neural methods with summary statistics
    • Nay 👎: Non-parametric neural methods with MRFs
  4. We’re not after photorealism

    • Yea 👌: Style transfer regardless of realism
    • Nay 👎: Optimizing for photorealism
  5. And we’re not looking at videos (for now)

    • Yea 👌: Still image style transfer
    • Nay 👎: Video Style transfer

This is how we arrive at Non-Photorealistic Image-Optimisation-Based Online Neural Methods for Neural Style Transfer.

Visual intuition of where our work stands in the taxonomy

Style-transfer as a Loss Function

Gatys et al. (2016) describe NST as follows: given an artwork image $\vec a$ and a photograph $\vec p$, synthesize a style-transferred synthetic image $\vec x$ by minimizing the loss function

$$L_\text{total}\left( \vec p, \vec a, \vec x \right) = \alpha L_\text{content}\left( \vec p, \vec x \right) + \beta L_\text{style}\left( \vec a, \vec x \right),$$

with $\alpha$ and $\beta$ representing the weights of the content loss $L_\text{content}$ and style loss $L_\text{style}$.
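For intuition, here is a minimal NumPy sketch of this loss. This is illustrative only: the real implementation operates on CNN feature maps inside an autodiff framework, and all function names here are our own.

```python
import numpy as np

def gram_matrix(F):
    # F: (positions, channels) feature activations from one layer.
    # The Gram matrix captures channel-to-channel correlations ("style").
    return F.T @ F

def content_loss(F, P):
    # Squared error between generated-image (F) and photograph (P) features.
    return 0.5 * np.sum((F - P) ** 2)

def style_loss(grams_x, grams_a, weights):
    # Weighted sum of squared Gram-matrix differences across style layers.
    return sum(w * np.sum((Gx - Ga) ** 2)
               for w, Gx, Ga in zip(weights, grams_x, grams_a))

def total_loss(F, P, grams_x, grams_a, weights, alpha, beta):
    # L_total = alpha * L_content + beta * L_style
    return alpha * content_loss(F, P) + beta * style_loss(grams_x, grams_a, weights)
```

In the full method, the style loss is summed over several layers' Gram matrices while the content loss compares a single deeper layer's raw activations.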

This approach to image-based optimization for style transfer has produced impressive results, but compared to human style transfer, there is still much to be desired. NST using convolutional layers can pick up on lower-level features and incorporate them into an image, but it often lacks higher-level context-awareness. It may seem strange to try to make a benchmark for something as subjective as style transfer, but we do have examples of ideal end products.

When human artists make content in new styles, they have the ability to recognize which style parts are relevant to which structures in the content.

This human-level attention to detail and imagination may still be some ways off, but in this sub-area of NST, we can improve our process by carefully selecting the image statistics we pay attention to.

This brings us to our improvement on the loss function. The main contribution is a multiplicative term that captures structural similarity between the Gramian feature images produced at a style-extraction layer:

$$
\begin{aligned}
L_\text{total}^\text{(ss)}\left( \vec p, \vec a, \vec x \right) &= \alpha L_\text{content}\left( \vec p, \vec x \right) + \beta L_\text{style}^\text{(SSIM)}\left( \vec a, \vec x \right) \\
&= \alpha \frac{1}{2} \sum_{i,j} \left( F_{ij}^{l^\text{content}} - P_{ij}^{l^\text{content}} \right)^2 + \beta \sum_{l \in L^\text{style}} w_l \left( \sum_{i,j} \left( G_{ij}^l - A_{ij}^l \right)^2 \right) \times \overbrace{\frac{1 - \xi\left( \tilde G_{ij}^l, \tilde A_{ij}^l \right)}{2}}^\text{SSIM component}
\end{aligned}
$$

This new addition may seem to come out of nowhere, but the function $\xi\left( \tilde G_{ij}^l, \tilde A_{ij}^l \right)$ actually represents the standard structural similarity (SSIM) index from classical image processing. SSIM combines measures of luminance, contrast, and structure to measure the similarity between two images (Wang et al. (2004)). As will be seen in the results section, this has a subtle but tangible effect: finer artistic strokes get rendered onto the style-transferred images.
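To make the multiplicative term concrete, here is a hedged sketch of one layer's SSIM-weighted style term. The naming is ours, and `ssim_fn` stands in for whatever SSIM implementation you plug in, applied to the (normalized) Gram matrices.

```python
import numpy as np

def ssim_weighted_style_term(G, A, w, ssim_fn):
    # One layer's contribution to the modified style loss:
    # the usual squared Gram difference, scaled by (1 - SSIM) / 2.
    # G, A: Gram matrices of the synthesized and artwork images at this layer.
    sq_diff = np.sum((G - A) ** 2)
    ssim_factor = (1.0 - ssim_fn(G, A)) / 2.0
    return w * sq_diff * ssim_factor
```

When the two Gram "images" are structurally identical, SSIM is 1 and the factor vanishes; when they are maximally dissimilar, the factor approaches 1 and the layer is penalized at full strength.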

Example of SSIM showing the differences between these two pictures of Einstein (original and edited)

Given two images $X$ and $Y$, the SSIM function works as follows:

$$\text{SSIM}(X, Y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x$ and $\mu_y$ are the pixel-means of images $X$ and $Y$ respectively, $\sigma_x^2$ and $\sigma_y^2$ are the corresponding pixel-variances, and $\sigma_{xy}$ is the empirical covariance of the two images. The constants $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are unrelated to the inputs; they keep the function from breaking when the denominators get close to zero. $L$ is the dynamic range of the pixel values ($2^{\text{bits per pixel}} - 1$), while $k_1 = 0.01$ and $k_2 = 0.03$ by default.
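As a sanity check, the formula can be computed directly with NumPy. Note that this is a single-window, global-statistics version for illustration only; the standard index is computed over a sliding 11×11 Gaussian-weighted window and averaged across the image.

```python
import numpy as np

def ssim_global(x, y, L=1.0, k1=0.01, k2=0.03):
    # Global-statistics SSIM: one window spanning the whole image.
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den
```

An image compared against itself scores exactly 1; an inverted copy scores lower, since its covariance with the original is negative.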

“Those are some nice equations…but how did this actually work out?”, you might be saying. We’re getting to that.

As motivated in Gatys et al. (2016), the texture or 'style' layers are typically the initial layers in a CNN, while the 'content' layer(s) are picked from the deeper layers closer to the softmax output. One can think of these two components as follows:

  • Style Layers: Parts of the network that recognize the higher-level shapes and patterns (without being too far removed from the input image itself).

What we're using the style layers to look out for

  • Content Layers: For making sure that whatever image passes through the network still results in (almost) the same image classification.

What we're tasking the content layers with

For the standard VGG-19 architecture (see Simonyan and Zisserman (2014)) chosen in this paper, we set the content layers in $L_\text{content}$ to be [ 'block5_conv2' ] and the style layers in $L_\text{style}$ to be [ 'block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1' ]. For all the results shown below, our SSIM hyper-parameters were as follows:

  • $\text{max val} = 1$
  • $\text{filter size} = 11$
  • $\text{filter } \sigma = 1.5$
  • $\text{content weight } (\alpha) = 10^3$
  • $\text{style weight } (\beta) = 10^{-2}$
  • And last but not least, for the Adam optimizer (Kingma and Ba (2014)):

    • $\text{learning rate} = 5$
    • $\beta_1 = 0.99$
    • $\beta_2 = 1\mathrm{e}{-1}$
    • $N_\text{iter} = 10^3$ iterations
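Put together, the layer choices and hyper-parameters above can be captured in a small config. The variable names here are ours; the layer strings follow tf.keras's VGG19 naming, and actually building the feature extractor would require TensorFlow (sketched in the trailing comment).

```python
# Layer selections for VGG-19 (tf.keras naming convention).
content_layers = ['block5_conv2']
style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']

# One common choice: weight each style layer equally so the w_l sum to 1.
style_layer_weights = {name: 1.0 / len(style_layers) for name in style_layers}

config = {
    'ssim': {'max_val': 1.0, 'filter_size': 11, 'filter_sigma': 1.5},
    'content_weight': 1e3,   # alpha
    'style_weight': 1e-2,    # beta
    'adam': {'learning_rate': 5.0, 'beta_1': 0.99, 'beta_2': 1e-1},
    'n_iterations': 1000,
}

# Building the extractor itself would look roughly like (requires TensorFlow):
# vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
# outputs = [vgg.get_layer(n).output for n in style_layers + content_layers]
# extractor = tf.keras.Model(vgg.input, outputs)
```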

For content images, we used an image of a sea turtle, a photo of the Tuebingen Neckarfront, and a photo of a Persian cat. For style images, we used Japanese ukiyo-e artist Hokusai’s Great Wave off Kanagawa, Pillars of Creation (taken by the Hubble Space Telescope, showing interstellar gas and dust in the Eagle Nebula, in the constellation Serpens), Van Gogh’s Starry Night, and Vassily Kandinsky’s Composition-7.

Content: Sea Turtle, Style: Great Wave off Kanagawa

Content: Sea Turtle, Style: Pillars of Creation

Content: Tuebingen Neckarfront , Style: Pillars of Creation

Content: Tuebingen Neckarfront, Style: Van Gogh’s Starry Night

Content: Tuebingen Neckarfront, Style: Vassily Kandinsky’s Composition-7

Content: Persian Cat, Style: Vassily Kandinsky’s Composition-7

Content: Persian Cat, Style: Van Gogh’s Starry Night

Content: Persian Cat, Style: Pillars of Creation

Content: Persian Cat, Style: Great Wave off Kanagawa

The stylistic changes are subtle, but still noticeable.

“It looks more like the artist actually drew the image, rather than the style picture being cut up and rearranged” — Matthew’s housemate Sid

Takeaways

Beyond the initial takeaway of a useful style-transfer technique, we wanted to illustrate that there’s plenty of room for augmenting the basic NST loss function described in Gatys et al. (2016), or any loss function for that matter. We also think this is a fantastic example of how there are still new areas of the style space to be unlocked by incorporating classical computer vision techniques (SSIM is from 2004, after all).

Above all, we also want to stress that there is no one singularly correct NST method, and that we are in no way advocating any claims of betterment or superiority of the SSIM weighting.

In order to ensure reproducibility and promote further experimentation, we have duly open-sourced the implementation here: https://github.com/vinayprabhu/NST_experiments


We want to give an enormous thanks to Joshua D. Eisenberg, Ph.D., and all the organizers of ACAI 2020 for their efforts in putting this event together.


Cited as:

@article{mcateer2020ssim,
    title = "ACAI 2020 Workshop: Structural Similarity in Neural Style Transfer",
    author = "Vinay Prabhu and Matthew McAteer",
    journal = "matthewmcateer.me",
    year = "2020",
    url = "https://matthewmcateer.me/posts/ssim-aware-style-transfer/"
}

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I would be very happy to correct them right away!

See you in the next post 😄
