Hear ‘Mona Lisa’ recite a famous Shakespeare monologue: Chinese engineers get a still picture to sing and talk using an AI framework called Emote Portrait Alive

March 7, 2024

Chinese engineers at the Institute for Intelligent Computing, Alibaba Group, have developed an AI framework called EMO (Emote Portrait Alive) that can animate a still photo of a face and synchronize it to an audio track.

The technology relies on the generative capabilities of diffusion models (generative models trained to reverse a gradual noising process, so that new images or video frames can be produced starting from random noise), which can directly synthesize character head videos from a provided image and any audio clip. This bypasses the need for complex pre-processing or intermediate representations, simplifying the creation of talking head videos.
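
For readers curious what "diffusion" means in practice, here is a minimal sketch of the noising half of a generic diffusion model, the part a network is later trained to undo. It is not the Alibaba team's code: the function names, the four-pixel toy "image", and the linear noise schedule are all illustrative assumptions, and the audio conditioning described in the article is left out entirely.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variance added at each step, rising linearly (an assumed, textbook-style schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha(betas):
    """alpha_bar[t] is the product of (1 - beta) up to step t: the share of the original signal left."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to step t: sqrt(alpha_bar) * original + sqrt(1 - alpha_bar) * Gaussian noise."""
    keep = math.sqrt(alpha_bars[t])
    spread = math.sqrt(1.0 - alpha_bars[t])
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]

rng = random.Random(0)                  # fixed seed so runs are repeatable
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha(betas)
image = [0.5, -0.25, 1.0, 0.0]          # toy stand-in for real pixel values
slightly_noisy = add_noise(image, 10, alpha_bars, rng)
almost_pure_noise = add_noise(image, 999, alpha_bars, rng)
```

Early in the schedule the picture is barely disturbed, while by the last step almost none of the original signal remains. A generative model learns to run this process in reverse, which is how it can start from noise and end with a plausible video frame.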

The challenge lies in capturing the nuances and diversity of human facial movements during video synthesis. Traditional methods simplify this by imposing constraints on the final video output, such as using 3D models to limit facial keypoints or extracting head movement sequences from base videos to guide overall motion. However, these constraints may limit the naturalness and richness of the resulting facial expressions.

Not without challenges

The research team’s objective is to develop a talking head framework that can capture a wide range of realistic facial expressions, including subtle micro-expressions, and allow for natural head movements.

However, the integration of audio with diffusion models presents its own challenges due to the ambiguous relationship between audio and facial expressions. This can result in instability in the videos produced by the model, including facial distortions or jittering between video frames. To overcome this, the researchers included stable control mechanisms in their model, specifically a speed controller and a face region controller, to improve stability during the generation process.

Despite the potential of this technology, there are certain drawbacks. The process is more time-consuming than methods that don’t use diffusion models. Additionally, since there are no explicit control signals to guide the character’s motion, the model may unintentionally generate other body parts, like hands, resulting in artifacts in the video.

The group has published a paper on its work on the arXiv preprint server, and the project website hosts a number of other videos showcasing the possibilities of Emote Portrait Alive, including clips of Joaquin Phoenix (as the Joker), Leonardo DiCaprio, and Audrey Hepburn.

You can watch the Mona Lisa recite Rosalind’s monologue from Shakespeare’s As You Like It, Act 3, Scene 2, below.
