Chinese engineers at the Institute for Intelligent Computing, Alibaba Group, have developed an AI app called Emote Portrait Live that can animate a still photo of a face and synchronize it to an audio track.
The technology behind this relies on the generative capabilities of diffusion models (mathematical models used to describe how things spread or diffuse over time), which can directly synthesize character head videos from a provided image and any audio clip. This process bypasses the need for complex pre-processing or intermediate representations, thus simplifying the creation of talking head videos.
The challenge lies in capturing the nuances and diversity of human facial movements during video synthesis. Traditional methods simplify this by imposing constraints on the final video output, such as using 3D models to limit facial keypoints or extracting head movement sequences from base videos to guide overall motion. However, these constraints may limit the naturalness and richness of the resulting facial expressions.
Not without challenges
The research team’s objective is to develop a talking head framework that can capture a wide range of realistic facial expressions, including subtle micro-expressions, and allow for natural head movements.
However, the integration of audio with diffusion models presents its own challenges due to the ambiguous relationship between audio and facial expressions. This can result in instability in the videos produced by the model, including facial distortions or jittering between video frames. To overcome this, the researchers included stable control mechanisms in their model, specifically a speed controller and a face region controller, to improve stability during the generation process.
Despite the potential of this technology, there are certain drawbacks. The process is more time-consuming than methods that don’t use diffusion models. Additionally, since there are no explicit control signals to guide the character’s motion, the model may unintentionally generate other body parts, like hands, resulting in artifacts in the video.
The group has published a paper on its work on the arXiv preprint server, and this website is home to a number of other videos showcasing the possibilities of Emote Portrait Live, including clips of Joaquin Phoenix (as The Joker), Leonardo DiCaprio, and Audrey Hepburn.
You can watch the Mona Lisa recite Rosalind’s monologue from Shakespeare’s As You Like It, Act 3, Scene 2, below.