EmoSpaceTime: Decoupling Emotion and Content through Contrastive Learning for Expressive 3D Speech Animation

Nov 21, 2024 · P. Witzig, S. Solenthaler, M. Gross, Dr. Rafael Wampfler
Abstract
We present EmoSpaceTime, a method for generating expressive 3D speech animation by explicitly decoupling emotion and semantic content through contrastive learning. Existing speech animation approaches entangle emotional style with phonetic content in their learned representations, limiting the ability to control expressive output independently of the spoken words. EmoSpaceTime learns a factorized latent space where emotion and content are disentangled, enabling fine-grained control over emotional expressivity at inference time. A contrastive training objective ensures that representations from the same emotional register are pulled together while those from different emotions are pushed apart, independent of semantic content. We demonstrate that EmoSpaceTime generates animations that are simultaneously emotionally consistent and semantically coherent, with user studies validating the quality and controllability of the expressive output.
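The abstract describes a contrastive objective that pulls together representations sharing an emotional register and pushes apart those from different emotions, regardless of spoken content. As an illustration only, the sketch below implements a generic supervised contrastive loss over emotion labels in numpy; the function name, temperature value, and batch construction are assumptions for the example, not the paper's actual formulation.

```python
import numpy as np

def emotion_contrastive_loss(embeddings, emotion_labels, temperature=0.1):
    """Illustrative supervised contrastive loss over emotion labels.

    Clips that share an emotion label are treated as positives regardless
    of semantic content; all other clips in the batch act as negatives.
    (Hypothetical sketch -- not the loss from the paper.)
    """
    # L2-normalize so pairwise similarity is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                    # scaled similarity matrix

    labels = np.asarray(emotion_labels)
    mask = labels[:, None] == labels[None, :]      # positives share a label
    np.fill_diagonal(mask, False)                  # exclude self-pairs

    # Row-wise log-softmax with the diagonal (self-similarity) excluded.
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    np.fill_diagonal(exp, 0.0)
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))

    # Average log-probability of positives, for anchors with >= 1 positive.
    pos_counts = mask.sum(axis=1)
    valid = pos_counts > 0
    loss = -(log_prob * mask).sum(axis=1)[valid] / pos_counts[valid]
    return loss.mean()
```

With a batch where same-emotion clips already cluster in embedding space, the loss is low; shuffling the emotion labels so that positives point at dissimilar clips drives it up, which is the behavior the training objective exploits.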
Type
Publication
In Proceedings of the 17th ACM SIGGRAPH Conference on Motion, Interaction and Games (MIG ’24), Arlington, USA
Authors
Dr. Rafael Wampfler
Senior Researcher & Lecturer

I am a Senior Researcher & Lecturer at the Computer Graphics Laboratory of ETH Zurich, and a Research Consultant at Disney Research. I lead the Digital Character AI projects at CGL. My research interests include conversational digital characters, affective computing, human-computer interaction, and applied machine learning.

My vision is to create intelligent digital humans that can naturally communicate, understand, and support people across domains such as education and mental health. My research focuses on multimodal artificial intelligence for interactive digital humans, developing models that combine large language models, affective computing, and data-driven animation to create embodied conversational agents endowed with autonomous agency, consistent values, and beliefs.

My work bridges machine learning, human–computer interaction, and computer graphics to enable AI systems such as Digital Einstein and interactive patient avatars for psychotherapy training and health education.