<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Multimodal AI | Dr. Rafael Wampfler</title><link>https://rafael-wampfler.github.io/tags/multimodal-ai/</link><atom:link href="https://rafael-wampfler.github.io/tags/multimodal-ai/index.xml" rel="self" type="application/rss+xml"/><description>Multimodal AI</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 10 Aug 2025 00:00:00 +0000</lastBuildDate><image><url>https://rafael-wampfler.github.io/media/icon_hu_d100f07c298b9e73.png</url><title>Multimodal AI</title><link>https://rafael-wampfler.github.io/tags/multimodal-ai/</link></image><item><title>Digital Einstein</title><link>https://rafael-wampfler.github.io/projects/digital-einstein/</link><pubDate>Tue, 01 Jun 2021 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/projects/digital-einstein/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;Digital Einstein is a flagship embodied conversational agent that brings the historical figure of Albert Einstein to life through real-time multimodal AI interaction. The system combines speech recognition and synthesis, facial animation, gesture control, and a cognitively grounded language understanding pipeline to deliver immersive, personality-consistent conversations.&lt;/p&gt;
&lt;p&gt;Digital Einstein serves three interconnected roles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Research platform&lt;/strong&gt; — a testbed for studying human–agent interaction in constrained embodied settings, yielding insights on affective computing, personality modeling, dialog act classification, and conversational AI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Education platform&lt;/strong&gt; — a live demonstration of conversational AI and multimodal deep learning deployed in university events, science outreach, and public engagement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Public engagement tool&lt;/strong&gt; — reaching thousands of visitors globally at scientific conferences, tech summits, museums, and public events, generating sustained international recognition for ETH Zurich.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;How can AI systems convincingly portray a well-known historical personality — someone whose knowledge, values, and speaking style are culturally established — in real-time dialogue with arbitrary members of the public? This challenge crystallizes core problems in interactive AI: maintaining factual and characterological consistency, adapting dynamically to unpredictable user inputs, and delivering a compelling embodied experience at scale.&lt;/p&gt;
&lt;p&gt;Digital Einstein was conceived as both a scientific challenge and a communication vehicle: making abstract advances in AI tangible for general audiences while simultaneously driving rigorous research on the underlying problems.&lt;/p&gt;
&lt;h2 id="approach"&gt;Approach&lt;/h2&gt;
&lt;p&gt;The system is built on a full-pipeline architecture described in the SIGGRAPH 2025 paper &lt;em&gt;&amp;ldquo;A Platform for Interactive AI Character Experiences&amp;rdquo;&lt;/em&gt;. Key components include the following (a simplified orchestration sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Perception layer&lt;/strong&gt;: Real-time speech recognition via Microsoft Azure Speech Services and multimodal input processing through a webcam-based vision pipeline, including face detection, user characterization, head pose estimation, and re-identification.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cognitive reasoning layer&lt;/strong&gt;: Knowledge-grounded dialogue management with integrated response generation, powered by GPT-4.1 mini, featuring dynamic personality infusion that adapts outputs to user-selectable archetypes: Digital Einstein, Rude Bulldozer, Drama Volcano, Zen Master, and Hashtag Prophet.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Animation synthesis&lt;/strong&gt;: Data-driven facial animation synchronized with speech output using NVIDIA Audio2Face, blended with emotion-conditioned expressions, and complemented by a curated library of motion-captured body animations categorized by avatar state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embodiment&lt;/strong&gt;: A stylized Albert Einstein avatar rendered in Unity on a 65-inch display, integrated into a themed early-20th-century physical environment with spatial audio, a hidden microphone, and physical personality sliders built from potentiometers and an Arduino.&lt;/li&gt;
&lt;/ul&gt;
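&lt;p&gt;To make the layered design concrete, the minimal Python sketch below wires perception, cognitive reasoning, and animation into a single interaction loop. All class and function names are hypothetical placeholders for illustration only; the deployed system relies on Microsoft Azure Speech Services, GPT-4.1 mini, NVIDIA Audio2Face, and Unity rather than the stubs shown here.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Hypothetical orchestration loop for an embodied conversational agent.
# All components are illustrative stubs, not the actual Digital Einstein code.

import time
from dataclasses import dataclass


@dataclass
class UserInput:
    transcript: str       # from the speech recognizer
    face_present: bool    # from the webcam vision pipeline
    head_pose: tuple      # (yaw, pitch, roll) estimate


class PerceptionLayer:
    """Wraps speech recognition and the webcam-based vision pipeline."""

    def listen(self):
        # A production system would call a cloud ASR service plus face
        # detection and head-pose models; here we return a fixed stub.
        return UserInput(transcript="Why is the sky blue?",
                         face_present=True, head_pose=(0.0, 0.0, 0.0))


class CognitiveLayer:
    """Knowledge-grounded response generation with personality infusion."""

    def __init__(self, archetype="Digital Einstein"):
        self.archetype = archetype

    def respond(self, user_input):
        # In practice this conditions an LLM prompt on the selected
        # archetype and the dialogue history.
        return f"[{self.archetype}] Light scatters more strongly at short wavelengths."


class AnimationLayer:
    """Synchronizes facial and body animation with the spoken reply."""

    def speak(self, text):
        # Placeholder for TTS plus audio-driven facial animation.
        print("Avatar says:", text)


def interaction_loop(turns=1):
    perception = PerceptionLayer()
    cognition = CognitiveLayer(archetype="Digital Einstein")
    animation = AnimationLayer()
    for _ in range(turns):
        user_input = perception.listen()
        if user_input.face_present:
            animation.speak(cognition.respond(user_input))
        time.sleep(0.1)  # pacing of the real-time loop


if __name__ == "__main__":
    interaction_loop()
&lt;/code&gt;&lt;/pre&gt;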
&lt;p&gt;The SIGGRAPH Asia 2024 demonstration paper &lt;em&gt;&amp;ldquo;Immersive Conversations with Digital Einstein: Linking a Physical System and AI&amp;rdquo;&lt;/em&gt; details the physical installation setup, including the integration of an animatronic head with the real-time AI pipeline at the Tokyo venue.&lt;/p&gt;
&lt;h2 id="key-results"&gt;Key Results&lt;/h2&gt;
&lt;p&gt;Digital Einstein has been demonstrated at over 20 major events worldwide, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SIGGRAPH Asia 2024&lt;/strong&gt; (Tokyo, Japan) — Emerging Technologies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SIGGRAPH 2025&lt;/strong&gt; (Vancouver, Canada)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GITEX Global 2024 &amp;amp; 2025&lt;/strong&gt; (Dubai, UAE) — Swiss Pavilion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;World Economic Forum 2024 &amp;amp; 2026&lt;/strong&gt; (Davos, Switzerland)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Berlin Science Week 2025&lt;/strong&gt; (Berlin, Germany)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Swiss Re Resilience Summit 2024&lt;/strong&gt; (Rüschlikon, Switzerland)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft Initiative to Advance AI Diffusion in Switzerland 2025&lt;/strong&gt; (Berne, Switzerland)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;After the Algorithm Festival 2026&lt;/strong&gt; (Zurich, Switzerland)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The project has generated sustained international media coverage and public interest, positioning ETH Zurich as a world leader in embodied conversational AI.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://rafael-wampfler.github.io/projects/digital-einstein/gitex.jpg"
alt="Swiss Ambassador to the UAE, Arthur Mattli, interacting with Digital Einstein at GITEX Global in Dubai."&gt;&lt;figcaption&gt;
&lt;p&gt;Swiss Ambassador to the UAE, Arthur Mattli, interacting with Digital Einstein at GITEX Global in Dubai.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="learn-more"&gt;Learn More&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="primary-publications"&gt;Primary Publications&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, C. Yang, D. Elste, N. Kovačević, P. Witzig and M. Gross (2025). &lt;em&gt;A Platform for Interactive AI Character Experiences&lt;/em&gt;. Proceedings of the SIGGRAPH Conference Papers &amp;lsquo;25 (Vancouver, Canada, August 10–14, 2025), pp. 1–11.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, N. Kovačević, P. Witzig, C. Yang, M. Gross (2024). &lt;em&gt;Immersive Conversations with Digital Einstein: Linking a Physical System and AI&lt;/em&gt;. In SIGGRAPH Asia 2024 Emerging Technologies (SA &amp;lsquo;24) (Tokyo, Japan, December 3–6, 2024).&lt;/p&gt;</description></item><item><title>Affective Computing &amp; Emotion Recognition</title><link>https://rafael-wampfler.github.io/projects/affective-computing/</link><pubDate>Tue, 01 Jan 2019 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/projects/affective-computing/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;This research thread develops deep learning architectures for predicting human emotional and cognitive states from rich, naturalistic data streams. Unlike laboratory-controlled setups, our systems operate &amp;ldquo;in-the-wild&amp;rdquo; — on real devices, in real environments, with real users — addressing the full complexity of affective computing at scale.&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Affective computing — the capacity of machines to detect, interpret, and respond to human emotions — is a foundational capability for human-centric AI. Yet most academic benchmarks rely on controlled, acted datasets that poorly predict real-world performance. Building systems that genuinely work in naturalistic settings requires confronting three fundamental challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Domain adaptation&lt;/strong&gt;: Affective signals vary enormously across individuals and contexts; models must transfer gracefully.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uncertainty estimation&lt;/strong&gt;: Emotion recognition inherently involves ambiguity and subjectivity; systems must quantify and communicate their confidence.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Continuous affective sensing must operate on resource-constrained mobile and edge devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="approach"&gt;Approach&lt;/h2&gt;
&lt;h3 id="multimodal-fusion"&gt;Multimodal Fusion&lt;/h3&gt;
&lt;p&gt;Our work leverages a broad set of input modalities, combining them through transformer-based and convolutional architectures (a simplified fusion sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Smartphone touch and sensor data&lt;/strong&gt;: Stylus pressure, touch dynamics, accelerometer, and gyroscope signals during naturalistic task completion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biometric data&lt;/strong&gt;: Heart rate, skin conductance, and other physiological signals from wearables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Egocentric vision&lt;/strong&gt;: First-person video from wearable cameras, capturing the user&amp;rsquo;s visual environment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Typing behavior&lt;/strong&gt;: Smartphone keyboard dynamics as a passive indicator of affective and personality state&lt;/li&gt;
&lt;/ul&gt;
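&lt;p&gt;As a rough illustration of this kind of fusion, the PyTorch sketch below encodes two modalities separately and concatenates their embeddings before classification. The encoders, feature dimensions, and class count are illustrative assumptions and do not reproduce the published architectures.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Minimal late-fusion sketch: two modality-specific encoders whose embeddings
# are concatenated and classified. Dimensions and layers are assumptions only.

import torch
import torch.nn as nn


class SensorEncoder(nn.Module):
    """Encodes a window of smartphone touch and motion-sensor features."""

    def __init__(self, in_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, x):
        return self.net(x)


class PhysioEncoder(nn.Module):
    """Encodes physiological features such as heart rate and skin conductance."""

    def __init__(self, in_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())

    def forward(self, x):
        return self.net(x)


class LateFusionClassifier(nn.Module):
    """Concatenates modality embeddings and predicts an affective class."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.sensor = SensorEncoder()
        self.physio = PhysioEncoder()
        self.head = nn.Linear(64 + 64, num_classes)

    def forward(self, sensor_x, physio_x):
        fused = torch.cat([self.sensor(sensor_x), self.physio(physio_x)], dim=1)
        return self.head(fused)


if __name__ == "__main__":
    model = LateFusionClassifier()
    logits = model(torch.randn(4, 32), torch.randn(4, 8))
    print(logits.shape)  # torch.Size([4, 3]) for a batch of four windows
&lt;/code&gt;&lt;/pre&gt;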
&lt;h3 id="semi-supervised-and-self-supervised-learning"&gt;Semi-Supervised and Self-Supervised Learning&lt;/h3&gt;
&lt;p&gt;Given the difficulty and cost of obtaining large labeled affective datasets in natural settings, we exploit semi-supervised learning strategies that leverage abundant unlabeled data. This improves generalization without requiring exhaustive annotation.&lt;/p&gt;
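&lt;p&gt;One common family of such strategies is self-training with pseudo-labels, sketched below with a generic scikit-learn classifier. The snippet illustrates the general idea only and is not the specific semi-supervised method used in the publications listed further down.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Generic self-training sketch: a model trained on a small labeled set assigns
# pseudo-labels to confident unlabeled samples, which are then added to the
# training set. Illustrative only; not the published method.

import numpy as np
from sklearn.linear_model import LogisticRegression


def self_training(x_labeled, y_labeled, x_unlabeled, confidence=0.9, rounds=3):
    model = LogisticRegression(max_iter=1000)
    x_train, y_train = x_labeled.copy(), y_labeled.copy()
    for _ in range(rounds):
        model.fit(x_train, y_train)
        if len(x_unlabeled) == 0:
            break
        proba = model.predict_proba(x_unlabeled)
        # Keep only unlabeled samples the model is confident about.
        mask = np.greater_equal(proba.max(axis=1), confidence)
        if not mask.any():
            break
        pseudo_y = model.classes_[proba[mask].argmax(axis=1)]
        x_train = np.vstack([x_train, x_unlabeled[mask]])
        y_train = np.concatenate([y_train, pseudo_y])
        x_unlabeled = x_unlabeled[~mask]
    model.fit(x_train, y_train)
    return model


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_lab = rng.normal(size=(40, 5))
    y_lab = np.greater(x_lab[:, 0], 0.0).astype(int)  # toy labels
    x_unlab = rng.normal(size=(200, 5))
    clf = self_training(x_lab, y_lab, x_unlab)
    print(clf.score(x_lab, y_lab))
&lt;/code&gt;&lt;/pre&gt;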
&lt;h3 id="egoemotion-neurips-2025"&gt;egoEMOTION (NeurIPS 2025)&lt;/h3&gt;
&lt;p&gt;The most recent and ambitious contribution is &lt;em&gt;egoEMOTION&lt;/em&gt;, presented at NeurIPS 2025 (Datasets and Benchmarks track). This work combines &lt;strong&gt;egocentric vision&lt;/strong&gt; and &lt;strong&gt;physiological signals&lt;/strong&gt; into a unified multimodal architecture, both advancing fusion strategies and providing a new, reproducible benchmark dataset. egoEMOTION addresses the challenge of predicting emotion and personality from the wearer&amp;rsquo;s own perspective, a naturalistic setting of growing relevance as wearable cameras become ubiquitous.&lt;/p&gt;
&lt;h3 id="personality-recognition-from-typing"&gt;Personality Recognition from Typing&lt;/h3&gt;
&lt;p&gt;Beyond momentary emotions, we have also developed systems for personality trait recognition from passive smartphone typing dynamics. This work (IEEE Transactions on Affective Computing, 2023) demonstrates that stable personality traits leave measurable signatures in everyday smartphone interactions, enabling passive, continuous personality inference.&lt;/p&gt;
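&lt;p&gt;Typing-based inference typically starts from simple timing statistics such as hold (dwell) times and flight times between key presses. The short sketch below computes a few such features from timestamped key events; the event format and feature set are simplified assumptions, not the exact feature pipeline of the published system.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Illustrative keystroke-dynamics features: hold time per key press and flight
# time between consecutive presses. Simplified assumptions, not the published
# feature pipeline.

import statistics


def typing_features(events):
    """events: list of (key_id, press_time_s, release_time_s) tuples."""
    hold_times = [release - press for _, press, release in events]
    flight_times = [events[i + 1][1] - events[i][2]
                    for i in range(len(events) - 1)]
    return {
        "mean_hold": statistics.mean(hold_times),
        "std_hold": statistics.pstdev(hold_times),
        "mean_flight": statistics.mean(flight_times),
        "std_flight": statistics.pstdev(flight_times),
        "typing_rate": len(events) / (events[-1][2] - events[0][1]),
    }


if __name__ == "__main__":
    demo = [("h", 0.00, 0.08), ("e", 0.15, 0.22), ("y", 0.30, 0.39)]
    print(typing_features(demo))
&lt;/code&gt;&lt;/pre&gt;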
&lt;h2 id="key-results"&gt;Key Results&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Demonstrated state-of-the-art in-the-wild affective state prediction from smartphone sensors across multiple CHI publications&lt;/li&gt;
&lt;li&gt;Published a new egocentric multimodal emotion and personality benchmark (NeurIPS 2025)&lt;/li&gt;
&lt;li&gt;Showed that semi-supervised learning on abundant unlabeled data substantially closes the performance gap to fully supervised models trained on labeled data&lt;/li&gt;
&lt;li&gt;Developed personality trait recognition from typing dynamics achieving strong classification performance on real-world data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="publications"&gt;Publications&lt;/h2&gt;
&lt;p&gt;M. Jammot, B. Braun, P. Streli, &lt;strong&gt;R. Wampfler&lt;/strong&gt; and C. Holz (2025). &lt;em&gt;egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks&lt;/em&gt;. In Conference on Neural Information Processing Systems 2025 (Datasets and Benchmarks, NeurIPS), pp. 1–12.&lt;/p&gt;
&lt;p&gt;N. Kovačević, C. Holz, M. Gross and &lt;strong&gt;R. Wampfler&lt;/strong&gt; (2024). &lt;em&gt;On Multimodal Emotion Recognition for Human-Chatbot Interaction in the Wild&lt;/em&gt;. In Proceedings of the 26th International Conference on Multimodal Interaction (ICMI &amp;lsquo;24), San Jose, Costa Rica, November 4–8, 2024.&lt;/p&gt;
&lt;p&gt;N. Kovačević, C. Holz, T. Günther, M. Gross and &lt;strong&gt;R. Wampfler&lt;/strong&gt; (2023). &lt;em&gt;Personality Trait Recognition Based on Smartphone Typing Characteristics in the Wild&lt;/em&gt;. IEEE Transactions on Affective Computing, pp. 1–11, 2023.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, S. Klingler, B. Solenthaler, V. R. Schinazi, M. Gross and C. Holz (2022). &lt;em&gt;Affective State Prediction from Smartphone Touch and Sensor Data in the Wild&lt;/em&gt;. Proceedings of the Conference on Human Factors in Computing Systems (CHI), New Orleans, USA, April 30–May 5, 2022, pp. 1–14.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, S. Klingler, B. Solenthaler, V. R. Schinazi and M. Gross (2020). &lt;em&gt;Affective State Prediction Based on Semi-Supervised Learning from Smartphone Touch Data&lt;/em&gt;. Proceedings of the Conference on Human Factors in Computing Systems (CHI), Virtual, April 25–30, 2020, pp. 1–13.&lt;/p&gt;
&lt;p&gt;N. Kovačević, &lt;strong&gt;R. Wampfler&lt;/strong&gt;, B. Solenthaler, M. Gross and T. Günther (2020). &lt;em&gt;Glyph-Based Visualization of Affective States&lt;/em&gt;. Eurographics/IEEE VGTC Symposium on Visualization (EuroVis), Virtual, May 25–29, 2020, pp. 121–125.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, S. Klingler, B. Solenthaler, V. R. Schinazi and M. Gross (2019). &lt;em&gt;Affective State Prediction in a Mobile Setting using Wearable Biometric Sensors and Stylus&lt;/em&gt;. Proceedings of the International Conference on Educational Data Mining (EDM), Montréal, Canada, July 2–5, 2019, pp. 224–233.&lt;/p&gt;</description></item><item><title>Dialog Act Classification</title><link>https://rafael-wampfler.github.io/projects/dialog-act/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/projects/dialog-act/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;For a conversational agent to respond appropriately, it must understand not just &lt;em&gt;what&lt;/em&gt; a user says, but &lt;em&gt;why&lt;/em&gt; they said it — the communicative intent behind their utterance. Dialog Act (DA) classification is the task of categorizing utterances by their function in conversation (e.g., question, assertion, greeting, request, clarification). This project develops multimodal dialog act classifiers tailored for interactions with digital characters.&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Standard dialog act classification systems are trained on text transcriptions alone. In real-world interactions with embodied agents, however, users communicate through a rich combination of speech prosody, gaze, gesture, and lexical content. A question delivered with rising intonation carries a different meaning than the same words spoken flatly; a greeting accompanied by eye contact differs from one delivered distractedly.&lt;/p&gt;
&lt;p&gt;For digital characters that must respond naturally in real time, dialog act classification must therefore be multimodal — integrating acoustic, linguistic, and, where available, visual signals — and must operate with low latency to support interactive response times.&lt;/p&gt;
&lt;h2 id="approach"&gt;Approach&lt;/h2&gt;
&lt;p&gt;Our multimodal dialog act classifier integrates the following signal types (a simplified fusion sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Lexical features&lt;/strong&gt;: Encoded via transformer-based text encoders fine-tuned on dialog corpora&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Acoustic features&lt;/strong&gt;: Prosodic signals including pitch, energy, and speech rate, extracted from the raw audio signal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Temporal context&lt;/strong&gt;: Conversation history modeling to resolve ambiguous acts through discourse-level context&lt;/li&gt;
&lt;/ul&gt;
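&lt;p&gt;A minimal sketch of this kind of fusion is shown below: a text embedding is combined with a small vector of prosodic features before classification. The encoder choice, feature dimensions, and label set are illustrative assumptions rather than the published model.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Minimal multimodal dialog act classifier sketch: a sentence embedding is
# fused with prosodic features (pitch, energy, speech rate). Dimensions,
# encoders, and the label set are illustrative assumptions only.

import torch
import torch.nn as nn

DIALOG_ACTS = ["question", "assertion", "greeting", "request", "clarification"]


class DialogActClassifier(nn.Module):
    def __init__(self, text_dim=384, prosody_dim=3, hidden=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.prosody_proj = nn.Linear(prosody_dim, hidden)
        self.head = nn.Sequential(nn.ReLU(),
                                  nn.Linear(hidden + hidden, len(DIALOG_ACTS)))

    def forward(self, text_emb, prosody):
        fused = torch.cat([self.text_proj(text_emb),
                           self.prosody_proj(prosody)], dim=1)
        return self.head(fused)


if __name__ == "__main__":
    # In practice text_emb would come from a transformer text encoder and the
    # prosody vector from an audio front end; random tensors stand in here.
    model = DialogActClassifier()
    logits = model(torch.randn(2, 384), torch.randn(2, 3))
    print(logits.argmax(dim=1))  # predicted dialog act indices
&lt;/code&gt;&lt;/pre&gt;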
&lt;p&gt;The system is evaluated on naturalistic conversations with digital characters — a challenging setting because users frequently use fragmented, spontaneous speech rather than complete, grammatical sentences. The classifier is optimized for both accuracy and latency, enabling real-time use within the Digital Einstein pipeline.&lt;/p&gt;
&lt;h2 id="key-results"&gt;Key Results&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Demonstrated that multimodal integration (text + acoustic features) significantly outperforms text-only baselines for dialog act classification in digital character conversations&lt;/li&gt;
&lt;li&gt;Achieved real-time classification latency compatible with interactive agent deployment&lt;/li&gt;
&lt;li&gt;Provided insights into which dialog acts are most frequently misclassified in human-agent interaction, informing future system design&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="publication"&gt;Publication&lt;/h2&gt;
&lt;p&gt;P. Witzig, R. Constantin, N. Kovačević and &lt;strong&gt;R. Wampfler&lt;/strong&gt; (2024). &lt;em&gt;Multimodal Dialog Act Classification for Conversations With Digital Characters&lt;/em&gt;. Proceedings of the 6th International Conference on Conversational User Interfaces (CUI), Luxembourg, Luxembourg, July 08–10, 2024, pp. 1–14.&lt;/p&gt;</description></item><item><title>A Platform for Interactive AI Character Experiences</title><link>https://rafael-wampfler.github.io/publications/platform-interactive-ai-2025/</link><pubDate>Sun, 10 Aug 2025 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/publications/platform-interactive-ai-2025/</guid><description/></item><item><title>On Multimodal Emotion Recognition for Human-Chatbot Interaction in the Wild</title><link>https://rafael-wampfler.github.io/publications/multimodal-emotion-recognition-2024/</link><pubDate>Mon, 04 Nov 2024 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/publications/multimodal-emotion-recognition-2024/</guid><description/></item><item><title>Multimodal Dialog Act Classification for Conversations With Digital Characters</title><link>https://rafael-wampfler.github.io/publications/dialog-act-classification-2024/</link><pubDate>Mon, 08 Jul 2024 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/publications/dialog-act-classification-2024/</guid><description/></item></channel></rss>