Turning my Coworker's Chihuahua into a Bear
Demo of Few-Shot Unsupervised Image-to-Image Translation
I read a recent paper that combined two of my favorite topics: conditional generative models, and dogs.
Sophisticated generative models have been in use for many years now. Ever since the creation of generative adversarial networks (GANs), synthesizing complex outputs (such as images) from probability distributions has gotten more and more sophisticated.
These results are pretty impressive, but generative models are far from a solved problem. It’s important to draw one distinction within the space of generative models: conditional versus unconditional models.
Unconditional models take samples from a probability distribution. The GANs that generate faces like the ones above are essentially complex versions of this. Unconditional GANs for tasks like face generation are sophisticated enough for sites like https://thispersondoesnotexist.com/ to exist.
Conditional models take this a step further. They sample from a probability distribution that depends on additional inputs, not just a noise generator. A conditional GAN needs to respond to complex inputs beyond just tuning simple parameters. For example, a conditional GAN may be trained to take in an audio signal and transform it by denoising.
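To make the distinction concrete, here is a toy numpy sketch (not a real GAN — just stand-in linear maps I made up for illustration): an unconditional generator maps noise alone to an output, while a conditional generator also consumes a conditioning input that steers the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained generator network: a fixed linear map
# from a 3-dim noise vector to a 4-dim "image".
W = np.ones((4, 3))

def unconditional_generate(z):
    """Unconditional: output depends only on the noise vector z."""
    return W @ z

def conditional_generate(z, condition):
    """Conditional: output depends on the noise AND a conditioning input."""
    # Adding the condition is the crudest possible form of conditioning;
    # real conditional GANs mix the condition into the network itself.
    return W @ z + condition

z = rng.normal(size=3)
noise_only = unconditional_generate(z)
conditioned = conditional_generate(z, condition=np.arange(4.0))
```

Same noise in both calls, but the conditioned output is shifted by exactly the conditioning signal — the point being that the condition, not just the noise, shapes what comes out.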
Another hot field of study in conditional generative models is image translation. Image translation describes the process of taking an arbitrary picture and translating it into an analog of the original. A good example: given an input image of a standing lion, we could request an output image representing the exact same lion lying down. Most humans would have no trouble visualizing this (and some can even draw such a scene with high fidelity unassisted).
There are plenty of other applications that this makes possible. For example, we can take a photograph of a daytime scene and translate it to nighttime. We can go from Google Maps to satellite images complete with terrain, from video games to reality, and much more.
Most of these methods have an annoying limitation: they require a ton of training data. In other words, these neural-network-based models need to see at least a few thousand images in all of the various classes before they are capable of meaningfully translating between them. Humans seem to generally be able to do this kind of mental image translation with far fewer examples. While some cognitive scientists would argue that humans also rely on a ton of training data (by the age of 16, we’ve experienced about 504,911,232 seconds of high-resolution, high-fidelity training data), most machine learning developers and researchers don’t have the luxury of giving that to every single model.
The dream of everyone working in few-shot classification is an algorithm that can look at very few images, obtain representative knowledge from them, and adequately generalize that knowledge.
NVIDIA recently came out with a very interesting new paper, “Few-Shot Unsupervised Image-to-Image Translation” (you can find the full code on their GitHub). As a demonstration, they show a Golden Retriever alongside a bunch of other dog breeds, each specified by a single example image. They can use this to turn that Golden Retriever into a pug, or any other dog breed you can think of. What’s important to point out here is that the model never saw these target classes during training; the only examples it has are the ones the researchers just gave it during the test. It can do this translation with previously unseen object classes.
This work contains a generative adversarial network that assumes the training set we give it contains images of different animals (which the discriminator is trained to classify); what it does during training is practice translating between these different animals (using the generator). It also contains a class encoder that learns a low-dimensional latent space for each of these classes. In layman’s terms, each class is represented by the minimum number of features that can still convey the essential qualities of an individual dog breed.
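The overall flow can be caricatured in a few lines of numpy. This is a toy sketch of the idea only — the real FUNIT networks are deep convolutional models, and every function below is a made-up stand-in: extract the "content" from the source image, squeeze each few-shot target example down to a small class code, average those codes, and let a decoder re-style the content with the averaged code.

```python
import numpy as np

rng = np.random.default_rng(1)

def content_encode(image):
    # Toy "content encoder": here it just keeps the raw pixels.
    return image

def class_encode(target_examples):
    # Toy "class encoder": reduce each example to a 2-dim code (mean, std),
    # then average the codes across the few-shot target images.
    codes = [np.array([ex.mean(), ex.std()]) for ex in target_examples]
    return np.mean(codes, axis=0)

def decode(content, class_code):
    # Toy "decoder": normalize the content, then shift/scale it with the
    # class code (loosely in the spirit of AdaIN-style conditioning).
    mean, std = class_code
    normalized = (content - content.mean()) / (content.std() + 1e-8)
    return normalized * std + mean

source = np.linspace(0.0, 1.0, 16).reshape(4, 4)       # "content" image
targets = [rng.normal(0.8, 0.1, size=(4, 4)),          # two few-shot
           rng.normal(0.6, 0.1, size=(4, 4))]          # class examples
translated = decode(content_encode(source), class_encode(targets))
```

The output keeps the source's spatial structure but adopts the target class's statistics — a cartoon of "same dog, different breed."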
No algorithm is perfect, and there are definitely some limitations. If we give the GAN a target image from a class outside the context of anything it has seen before (e.g., if it has been trained on images of animal classes and it’s shown a pizza) it will struggle to translate it, or at least produce some entertainingly broken images.
The authors have kindly created a Web demo of this code, which you can try out for yourself (though you may want to make sure that your web browser isn’t blocking the scripts needed to actually do the translation).
And now, for the actual test. My coworker volunteered a few pictures of her pet Chihuahua, “Monkey”, to test this out on.
This was what came out when the image was cropped to just Monkey’s head and fed through the FUNIT demo:
I got many tempting alternatives, though I think everyone at the office prefers Monkey as is.
Non-traditional head positions are also a sticking point for the AI, though it is possible that this may be solved only a few years (or maybe even months) from now.
I may use this as a template for future machine learning paper discussions, demos, and tutorials: finding some kind of dog-related application. If that’s something you’d like to see, leave a comment below. (Or if you’re impatient and just want to see more dogs, go ahead and check out my Instagram @thedoggeningcometh, which I mainly use as a dog archive instead of actually posting pictures of myself.)