During my AI class back in college (taught by Prof. Vatsa), we were given an interesting exercise which required us to come up with a face morphing technique. The intended use case was one where a user was tasked with describing a face (maybe of a wanted felon?). In such cases, starting from scratch can be incredibly tough. However, morphing different facial attributes (extracted from different face images) as per the user's feedback can make the process faster.
Having stumbled across Facenet around that time, I was curious whether I could make use of it to develop something similar. Although Facenet doesn't really allow us to construct an image from scratch, it does allow us to change the facial attributes of an input image. It uses a Variational Autoencoder to learn an attribute vector for input images; by making changes to this vector, we can change the facial attributes of the image as required.
TL;DR In this post, we come up with a framework where a user can gradually improve upon a face image (with respect to its attributes). We use a GA to improve upon the attribute vector which generates the image. At each iteration, we pick the attribute vectors (chromosomes) which generated the top-k images (closest to the intended target). In this exercise, we pick a target image up front and compare the intermediate images against it using a simple MSE. In a real-life scenario, a user would simply select the top-k images which fit the criterion. Of course, we could expose this vector directly to the user and ask them to make changes as they deem fit. However, it is likely easier for a user to pick the top-k images closest to their target than to change attributes directly.
The python notebook (as an HTML) is available here. It's an extremely large file, sorry about that! I no longer have access to the server with the code, so I cannot regenerate it with fewer images.
Variational Autoencoders and Facenet
A Variational Autoencoder (VAE) is a variation (haha) of the autoencoder which tries to learn the distribution of the training input samples rather than just a dense representation. We try to map the input onto a distribution over the latent vectors, rather than onto a fixed vector. A simple way to do this is to couple the standard autoencoder reconstruction loss with the deviation of the learnt distribution from an expected (pre-decided) distribution. This deviation can be computed using something like the KL divergence.
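As a rough illustration, the textbook VAE objective with a standard normal prior could be sketched like this (a minimal NumPy sketch, not the exact Facenet loss):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: how well the decoder rebuilds the input
    recon = np.mean((x - x_recon) ** 2)
    # Closed-form KL divergence between the learnt Gaussian
    # N(mu, exp(log_var)) and the standard normal prior N(0, I)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl
```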
The Facenet VAE employs a much fancier approach to compute the loss, which allows for extremely granular control over the features of the generated image. The learnt latent vector allows modifying the facial attributes of an image (and hence is appropriately named the attribute vector).
I will skip explaining VAEs in detail in this post to save space. This blog by Lilian Weng does a great job at explaining autoencoders and their variants, including the VAE. I'd really recommend having a look if this piques your interest.
Learning with Genetic Algorithm
The genetic algorithm is a fascinating evolutionary algorithm inspired by natural selection. It is an iterative process, each iteration comprising a population of individuals (called a generation). Each individual is represented by a vector of genes called a chromosome. The goal is to search for the chromosomes which are the best according to some criterion.
Each generation gives rise to a set of chromosomes in the next iteration of the GA. This is done via two techniques, known as crossover and mutation. A crossover takes place between two parents and generates a chromosome (child) with genes taken from both the parents. This combination can be done randomly, or by selecting certain contiguous segments of the parents' chromosomes. Mutation generates a child by taking a chromosome (parent) and randomly changing one of its genes.
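For concreteness, here's what the two operators could look like on real-valued chromosomes (a rough sketch; the uniform random crossover and the Gaussian single-gene mutation are illustrative choices, not necessarily what the notebook used):

```python
import numpy as np

def crossover(parent_a, parent_b):
    # Each gene of the child is picked at random from one of the two parents
    mask = np.random.rand(parent_a.shape[0]) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate(chromosome, scale=0.1):
    # Randomly change one gene by adding Gaussian noise to it
    child = chromosome.copy()
    idx = np.random.randint(child.shape[0])
    child[idx] += np.random.normal(0.0, scale)
    return child
```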
Survival of the fittest
We generally restrict the population size at each generation; not doing so would make the population grow exponentially, making it computationally infeasible to simulate. This is done by using a fitness function and selecting the individuals with the highest fitness scores. In the beginning, we randomly generate a set of chromosomes. At every generation, we expect to have a set of chromosomes with a higher (or at least equal) fitness score compared to the chromosomes in the previous generation. This process allows us to gradually approach a target chromosome.
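A minimal sketch of the initialization and selection steps (with `fitness_fn` being a hypothetical function mapping a chromosome to a score):

```python
import numpy as np

def init_population(size, n_genes):
    # First generation: randomly initialized chromosomes
    return [np.random.normal(0.0, 1.0, n_genes) for _ in range(size)]

def select_fittest(population, fitness_fn, k=10):
    # Rank chromosomes by fitness and keep only the k fittest survivors
    return sorted(population, key=fitness_fn, reverse=True)[:k]
```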
The description of the GA above is very brief; I would recommend this blog by Burak Kanber for a detailed write-up.
Using GA with the Facenet VAE
In our GA formulation, we treat the latent vector used to generate images in the Facenet autoencoder as the chromosome of an individual in the population. We make use of the pre-trained Facenet decoder, passing the chromosome as input to generate an image. Starting from a population of randomly initialized chromosomes (attribute vectors), we would expect to gradually improve upon the generated image until it matches the target image.
At each step, we generate a set of images and ask the user to select the top-k best (this can be thought of as the fitness function). Using this set, we generate the next generation of chromosomes (attribute vectors). It is important to note that the fitness function is not applied directly to the individuals in a population (since there is no target latent vector, only a target image).
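Putting it together, one step of the search could look roughly like this, reusing the `crossover` and `mutate` helpers from the earlier sketch (`decode` stands in for the pre-trained Facenet decoder, and `select_top_k` for either the user's choice or a fitness-based ranking; both names are hypothetical):

```python
import numpy as np

def evolve(population, decode, select_top_k, pop_size=100):
    # Decode every chromosome (attribute vector) into a face image
    images = [decode(z) for z in population]
    # The user (or a fitness function) picks the indices of the k best
    # images; their chromosomes become the parents of the next generation
    parents = [population[i] for i in select_top_k(images)]
    next_gen = list(parents)
    while len(next_gen) < pop_size:
        # Breed the rest of the generation via crossover + mutation
        i, j = np.random.choice(len(parents), size=2, replace=False)
        next_gen.append(mutate(crossover(parents[i], parents[j])))
    return next_gen
```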
For our experiment, in the absence of a user, we chose a target image and then, for each chromosome in a population, calculated the fitness as follows. The chromosome is first decoded into an image, and the fitness is then calculated as the inverse of the MSE between the generated image and the target image. Let g(z) be the generator function (the Facenet decoder) which generates the image I after taking z as an input. If the target image is T, we calculate the fitness as the inverse of MSE(I, T), i.e. the inverse of MSE(g(z), T).
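In code, this fitness could be computed roughly as follows (`decode` again stands in for the Facenet decoder g; the epsilon is a small practical guard against division by zero that I'd add, not something from the original formulation):

```python
import numpy as np

def fitness(z, target, decode):
    # I = g(z): decode the chromosome into an image
    image = decode(z)
    mse = np.mean((image - target) ** 2)
    # Inverse of the MSE; the epsilon guards against division by zero
    return 1.0 / (mse + 1e-8)
```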
Below are the images with the highest fitness score at different generations. We can see how we slowly move towards a latent vector which generates the required smile.
In another scenario, a user could choose the top few images that they think are ideal, and these would form the next generation of individuals. This way, the user can gradually shape how the attributes in the image look.