<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Multimodal AI | Dr. Rafael Wampfler</title><link>https://rafael-wampfler.github.io/tags/multimodal-ai/</link><atom:link href="https://rafael-wampfler.github.io/tags/multimodal-ai/index.xml" rel="self" type="application/rss+xml"/><description>Multimodal AI</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 10 Aug 2025 00:00:00 +0000</lastBuildDate><image><url>https://rafael-wampfler.github.io/media/icon_hu_d100f07c298b9e73.png</url><title>Multimodal AI</title><link>https://rafael-wampfler.github.io/tags/multimodal-ai/</link></image><item><title>Digital Einstein</title><link>https://rafael-wampfler.github.io/projects/digital-einstein/</link><pubDate>Tue, 01 Jun 2021 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/projects/digital-einstein/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;Digital Einstein is a flagship embodied conversational agent that brings the historical figure of Albert Einstein to life through real-time multimodal AI interaction. The system combines speech recognition and synthesis, facial animation, gesture control, and a cognitively grounded language understanding pipeline to deliver immersive, personality-consistent conversations.&lt;/p&gt;
&lt;p&gt;Digital Einstein serves three interconnected roles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Research platform&lt;/strong&gt; — a testbed for studying human–agent interaction in constrained embodied settings, yielding insights on affective computing, personality modeling, dialog act classification, and conversational AI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Education platform&lt;/strong&gt; — a live demonstration of conversational AI and multimodal deep learning deployed in university events, science outreach, and public engagement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Public engagement tool&lt;/strong&gt; — reaching thousands of visitors globally at scientific conferences, tech summits, museums, and public events, generating sustained international recognition for ETH Zurich.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;How can AI systems convincingly portray a well-known historical personality — someone whose knowledge, values, and speaking style are culturally established — in real-time dialogue with arbitrary members of the public? This challenge crystallizes core problems in interactive AI: maintaining factual and characterological consistency, adapting dynamically to unpredictable user inputs, and delivering a compelling embodied experience at scale.&lt;/p&gt;
&lt;p&gt;Digital Einstein was conceived as both a scientific challenge and a communication vehicle: making abstract advances in AI tangible for general audiences while simultaneously driving rigorous research on the underlying problems.&lt;/p&gt;
&lt;h2 id="approach"&gt;Approach&lt;/h2&gt;
&lt;p&gt;The system is built on a full-pipeline architecture described in the SIGGRAPH 2025 paper &lt;em&gt;&amp;ldquo;A Platform for Interactive AI Character Experiences&amp;rdquo;&lt;/em&gt;. Key components include the following (a simplified orchestration sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Perception layer&lt;/strong&gt;: Real-time speech recognition via Microsoft Azure Speech Services and multimodal input processing through a webcam-based vision pipeline, including face detection, user characterization, head pose estimation, and re-identification.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cognitive reasoning layer&lt;/strong&gt;: Knowledge-grounded dialogue management with integrated response generation, powered by GPT-4.1 mini, featuring dynamic personality infusion that adapts outputs to user-selectable archetypes: Digital Einstein, Rude Bulldozer, Drama Volcano, Zen Master, and Hashtag Prophet.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Animation synthesis&lt;/strong&gt;: Data-driven facial animation synchronized with speech output using NVIDIA Audio2Face, blended with emotion-conditioned expressions, and complemented by a curated library of motion-captured body animations categorized by avatar state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embodiment&lt;/strong&gt;: A stylized Albert Einstein avatar rendered in Unity on a 65-inch display, integrated into a themed early-20th-century physical environment with spatial audio, a hidden microphone, and physical personality sliders built from potentiometers and an Arduino.&lt;/li&gt;
&lt;/ul&gt;
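&lt;p&gt;To make the layered design concrete, the minimal Python sketch below wires perception, cognitive reasoning, and animation into a single interaction loop. All class and function names are hypothetical placeholders for illustration only; the deployed system relies on Microsoft Azure Speech Services, GPT-4.1 mini, NVIDIA Audio2Face, and Unity rather than the stubs shown here.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Hypothetical orchestration loop for an embodied conversational agent.
# All components are illustrative stubs, not the actual Digital Einstein code.

import time
from dataclasses import dataclass


@dataclass
class UserInput:
    transcript: str       # from the speech recognizer
    face_present: bool    # from the webcam vision pipeline
    head_pose: tuple      # (yaw, pitch, roll) estimate


class PerceptionLayer:
    """Wraps speech recognition and the webcam-based vision pipeline."""

    def listen(self):
        # A production system would call a cloud ASR service plus face
        # detection and head-pose models; here we return a fixed stub.
        return UserInput(transcript="Why is the sky blue?",
                         face_present=True, head_pose=(0.0, 0.0, 0.0))


class CognitiveLayer:
    """Knowledge-grounded response generation with personality infusion."""

    def __init__(self, archetype="Digital Einstein"):
        self.archetype = archetype

    def respond(self, user_input):
        # In practice this conditions an LLM prompt on the selected
        # archetype and the dialogue history.
        return f"[{self.archetype}] Light scatters more strongly at short wavelengths."


class AnimationLayer:
    """Synchronizes facial and body animation with the spoken reply."""

    def speak(self, text):
        # Placeholder for TTS plus audio-driven facial animation.
        print("Avatar says:", text)


def interaction_loop(turns=1):
    perception = PerceptionLayer()
    cognition = CognitiveLayer(archetype="Digital Einstein")
    animation = AnimationLayer()
    for _ in range(turns):
        user_input = perception.listen()
        if user_input.face_present:
            animation.speak(cognition.respond(user_input))
        time.sleep(0.1)  # pacing of the real-time loop


if __name__ == "__main__":
    interaction_loop()
&lt;/code&gt;&lt;/pre&gt;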
&lt;p&gt;The SIGGRAPH Asia 2024 demonstration paper &lt;em&gt;&amp;ldquo;Immersive Conversations with Digital Einstein: Linking a Physical System and AI&amp;rdquo;&lt;/em&gt; details the physical installation setup, including the integration of an animatronic head with the real-time AI pipeline at the Tokyo venue.&lt;/p&gt;
&lt;h2 id="key-results"&gt;Key Results&lt;/h2&gt;
&lt;p&gt;Digital Einstein has been demonstrated at over 20 major events worldwide, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SIGGRAPH Asia 2024&lt;/strong&gt; (Tokyo, Japan) — Emerging Technologies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SIGGRAPH 2025&lt;/strong&gt; (Vancouver, Canada)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GITEX Global 2024 &amp;amp; 2025&lt;/strong&gt; (Dubai, UAE) — Swiss Pavilion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;World Economic Forum 2024 &amp;amp; 2026&lt;/strong&gt; (Davos, Switzerland)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Berlin Science Week 2025&lt;/strong&gt; (Berlin, Germany)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Swiss Re Resilience Summit 2024&lt;/strong&gt; (Rüschlikon, Switzerland)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft Initiative to Advance AI Diffusion in Switzerland 2025&lt;/strong&gt; (Berne, Switzerland)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;After the Algorithm Festival 2026&lt;/strong&gt; (Zurich, Switzerland)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The project has generated sustained international media coverage and public interest, positioning ETH Zurich as a world leader in embodied conversational AI.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://rafael-wampfler.github.io/projects/digital-einstein/gitex.jpg"
alt="Swiss Ambassador to the UAE, Arthur Mattli, interacting with Digital Einstein at GITEX Global in Dubai."&gt;&lt;figcaption&gt;
&lt;p&gt;Swiss Ambassador to the UAE, Arthur Mattli, interacting with Digital Einstein at GITEX Global in Dubai.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="learn-more"&gt;Learn More&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="primary-publications"&gt;Primary Publications&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, C. Yang, D. Elste, N. Kovačević, P. Witzig and M. Gross (2025). &lt;em&gt;A Platform for Interactive AI Character Experiences&lt;/em&gt;. Proceedings of the SIGGRAPH Conference Papers &amp;lsquo;25 (Vancouver, Canada, August 10–14, 2025), pp. 1–11.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, N. Kovačević, P. Witzig, C. Yang, M. Gross (2024). &lt;em&gt;Immersive Conversations with Digital Einstein: Linking a Physical System and AI&lt;/em&gt;. In SIGGRAPH Asia 2024 Emerging Technologies (SA &amp;lsquo;24) (Tokyo, Japan, December 3–6, 2024).&lt;/p&gt;</description></item><item><title>Affective Computing &amp; Emotion Recognition</title><link>https://rafael-wampfler.github.io/projects/affective-computing/</link><pubDate>Tue, 01 Jan 2019 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/projects/affective-computing/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;This research thread develops deep learning architectures for predicting human emotional and cognitive states from rich, naturalistic data streams. Unlike laboratory-controlled setups, our systems operate &amp;ldquo;in-the-wild&amp;rdquo; — on real devices, in real environments, with real users — addressing the full complexity of affective computing at scale.&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Affective computing — the capacity of machines to detect, interpret, and respond to human emotions — is a foundational capability for human-centric AI. Yet most academic benchmarks rely on controlled, acted datasets that poorly predict real-world performance. Building systems that genuinely work in naturalistic settings requires confronting three fundamental challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Domain adaptation&lt;/strong&gt;: Affective signals vary enormously across individuals and contexts; models must transfer gracefully.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uncertainty estimation&lt;/strong&gt;: Emotion recognition inherently involves ambiguity and subjectivity; systems must quantify and communicate their confidence.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Continuous affective sensing must operate on resource-constrained mobile and edge devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="approach"&gt;Approach&lt;/h2&gt;
&lt;h3 id="multimodal-fusion"&gt;Multimodal Fusion&lt;/h3&gt;
&lt;p&gt;Our work leverages a broad set of input modalities, combining them through transformer-based and convolutional architectures (a simplified fusion sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Smartphone touch and sensor data&lt;/strong&gt;: Stylus pressure, touch dynamics, accelerometer, and gyroscope signals during naturalistic task completion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biometric data&lt;/strong&gt;: Heart rate, skin conductance, and other physiological signals from wearables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Egocentric vision&lt;/strong&gt;: First-person video from wearable cameras, capturing the user&amp;rsquo;s visual environment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Typing behavior&lt;/strong&gt;: Smartphone keyboard dynamics as a passive indicator of affective and personality state&lt;/li&gt;
&lt;/ul&gt;
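&lt;p&gt;As a rough illustration of this kind of fusion, the PyTorch sketch below encodes two modalities separately and concatenates their embeddings before classification. The encoders, feature dimensions, and class count are illustrative assumptions and do not reproduce the published architectures.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Minimal late-fusion sketch: two modality-specific encoders whose embeddings
# are concatenated and classified. Dimensions and layers are assumptions only.

import torch
import torch.nn as nn


class SensorEncoder(nn.Module):
    """Encodes a window of smartphone touch and motion-sensor features."""

    def __init__(self, in_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, x):
        return self.net(x)


class PhysioEncoder(nn.Module):
    """Encodes physiological features such as heart rate and skin conductance."""

    def __init__(self, in_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())

    def forward(self, x):
        return self.net(x)


class LateFusionClassifier(nn.Module):
    """Concatenates modality embeddings and predicts an affective class."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.sensor = SensorEncoder()
        self.physio = PhysioEncoder()
        self.head = nn.Linear(64 + 64, num_classes)

    def forward(self, sensor_x, physio_x):
        fused = torch.cat([self.sensor(sensor_x), self.physio(physio_x)], dim=1)
        return self.head(fused)


if __name__ == "__main__":
    model = LateFusionClassifier()
    logits = model(torch.randn(4, 32), torch.randn(4, 8))
    print(logits.shape)  # torch.Size([4, 3]) for a batch of four windows
&lt;/code&gt;&lt;/pre&gt;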
&lt;h3 id="semi-supervised-and-self-supervised-learning"&gt;Semi-Supervised and Self-Supervised Learning&lt;/h3&gt;
&lt;p&gt;Given the difficulty and cost of obtaining large labeled affective datasets in natural settings, we exploit semi-supervised learning strategies that leverage abundant unlabeled data. This improves generalization without requiring exhaustive annotation.&lt;/p&gt;
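&lt;p&gt;One common family of such strategies is self-training with pseudo-labels, sketched below with a generic scikit-learn classifier. The snippet illustrates the general idea only and is not the specific semi-supervised method used in the publications listed further down.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Generic self-training sketch: a model trained on a small labeled set assigns
# pseudo-labels to confident unlabeled samples, which are then added to the
# training set. Illustrative only; not the published method.

import numpy as np
from sklearn.linear_model import LogisticRegression


def self_training(x_labeled, y_labeled, x_unlabeled, confidence=0.9, rounds=3):
    model = LogisticRegression(max_iter=1000)
    x_train, y_train = x_labeled.copy(), y_labeled.copy()
    for _ in range(rounds):
        model.fit(x_train, y_train)
        if len(x_unlabeled) == 0:
            break
        proba = model.predict_proba(x_unlabeled)
        # Keep only unlabeled samples the model is confident about.
        mask = np.greater_equal(proba.max(axis=1), confidence)
        if not mask.any():
            break
        pseudo_y = model.classes_[proba[mask].argmax(axis=1)]
        x_train = np.vstack([x_train, x_unlabeled[mask]])
        y_train = np.concatenate([y_train, pseudo_y])
        x_unlabeled = x_unlabeled[~mask]
    model.fit(x_train, y_train)
    return model


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_lab = rng.normal(size=(40, 5))
    y_lab = np.greater(x_lab[:, 0], 0.0).astype(int)  # toy labels
    x_unlab = rng.normal(size=(200, 5))
    clf = self_training(x_lab, y_lab, x_unlab)
    print(clf.score(x_lab, y_lab))
&lt;/code&gt;&lt;/pre&gt;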
&lt;h3 id="egoemotion-neurips-2025"&gt;egoEMOTION (NeurIPS 2025)&lt;/h3&gt;
&lt;p&gt;The most recent and ambitious contribution is &lt;em&gt;egoEMOTION&lt;/em&gt;, presented at NeurIPS 2025 (Datasets and Benchmarks track). This work combines &lt;strong&gt;egocentric vision&lt;/strong&gt; and &lt;strong&gt;physiological signals&lt;/strong&gt; into a unified multimodal architecture, both advancing fusion strategies and providing a new, reproducible benchmark dataset. egoEMOTION addresses the challenge of predicting emotion and personality from the wearer&amp;rsquo;s own perspective, a naturalistic setting of growing relevance as wearable cameras become ubiquitous.&lt;/p&gt;
&lt;h3 id="personality-recognition-from-typing"&gt;Personality Recognition from Typing&lt;/h3&gt;
&lt;p&gt;Beyond momentary emotions, we have also developed systems for personality trait recognition from passive smartphone typing dynamics. This work (IEEE Transactions on Affective Computing, 2023) demonstrates that stable personality traits leave measurable signatures in everyday smartphone interactions, enabling passive, continuous personality inference.&lt;/p&gt;
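&lt;p&gt;Typing-based inference typically starts from simple timing statistics such as hold (dwell) times and flight times between key presses. The short sketch below computes a few such features from timestamped key events; the event format and feature set are simplified assumptions, not the exact feature pipeline of the published system.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Illustrative keystroke-dynamics features: hold time per key press and flight
# time between consecutive presses. Simplified assumptions, not the published
# feature pipeline.

import statistics


def typing_features(events):
    """events: list of (key_id, press_time_s, release_time_s) tuples."""
    hold_times = [release - press for _, press, release in events]
    flight_times = [events[i + 1][1] - events[i][2]
                    for i in range(len(events) - 1)]
    return {
        "mean_hold": statistics.mean(hold_times),
        "std_hold": statistics.pstdev(hold_times),
        "mean_flight": statistics.mean(flight_times),
        "std_flight": statistics.pstdev(flight_times),
        "typing_rate": len(events) / (events[-1][2] - events[0][1]),
    }


if __name__ == "__main__":
    demo = [("h", 0.00, 0.08), ("e", 0.15, 0.22), ("y", 0.30, 0.39)]
    print(typing_features(demo))
&lt;/code&gt;&lt;/pre&gt;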
&lt;h2 id="key-results"&gt;Key Results&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Demonstrated state-of-the-art in-the-wild affective state prediction from smartphone sensors across multiple CHI publications&lt;/li&gt;
&lt;li&gt;Published a new egocentric multimodal emotion and personality benchmark (NeurIPS 2025)&lt;/li&gt;
&lt;li&gt;Showed that semi-supervised learning on abundant unlabeled data substantially closes the performance gap to fully supervised models trained on labeled data&lt;/li&gt;
&lt;li&gt;Developed personality trait recognition from typing dynamics achieving strong classification performance on real-world data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="publications"&gt;Publications&lt;/h2&gt;
&lt;p&gt;M. Jammot, B. Braun, P. Streli, &lt;strong&gt;R. Wampfler&lt;/strong&gt; and C. Holz (2025). &lt;em&gt;egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks&lt;/em&gt;. In Conference on Neural Information Processing Systems 2025 (Datasets and Benchmarks, NeurIPS), pp. 1–12.&lt;/p&gt;
&lt;p&gt;N. Kovačević, C. Holz, M. Gross and &lt;strong&gt;R. Wampfler&lt;/strong&gt; (2024). &lt;em&gt;On Multimodal Emotion Recognition for Human-Chatbot Interaction in the Wild&lt;/em&gt;. In Proceedings of the 26th International Conference on Multimodal Interaction (ICMI &amp;lsquo;24), San Jose, Costa Rica, November 4–8, 2024.&lt;/p&gt;
&lt;p&gt;N. Kovačević, C. Holz, T. Günther, M. Gross and &lt;strong&gt;R. Wampfler&lt;/strong&gt; (2023). &lt;em&gt;Personality Trait Recognition Based on Smartphone Typing Characteristics in the Wild&lt;/em&gt;. IEEE Transactions on Affective Computing, pp. 1–11, 2023.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, S. Klingler, B. Solenthaler, V. R. Schinazi, M. Gross and C. Holz (2022). &lt;em&gt;Affective State Prediction from Smartphone Touch and Sensor Data in the Wild&lt;/em&gt;. Proceedings of the Conference on Human Factors in Computing Systems (CHI), New Orleans, USA, April 30–May 5, 2022, pp. 1–14.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, S. Klingler, B. Solenthaler, V. R. Schinazi and M. Gross (2020). &lt;em&gt;Affective State Prediction Based on Semi-Supervised Learning from Smartphone Touch Data&lt;/em&gt;. Proceedings of the Conference on Human Factors in Computing Systems (CHI), Virtual, April 25–30, 2020, pp. 1–13.&lt;/p&gt;
&lt;p&gt;N. Kovačević, &lt;strong&gt;R. Wampfler&lt;/strong&gt;, B. Solenthaler, M. Gross and T. Günther (2020). &lt;em&gt;Glyph-Based Visualization of Affective States&lt;/em&gt;. Eurographics/IEEE VGTC Symposium on Visualization (EuroVis), Virtual, May 25–29, 2020, pp. 121–125.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R. Wampfler&lt;/strong&gt;, S. Klingler, B. Solenthaler, V. R. Schinazi and M. Gross (2019). &lt;em&gt;Affective State Prediction in a Mobile Setting using Wearable Biometric Sensors and Stylus&lt;/em&gt;. Proceedings of the International Conference on Educational Data Mining (EDM), Montréal, Canada, July 2–5, 2019, pp. 224–233.&lt;/p&gt;</description></item><item><title>Dialog Act Classification</title><link>https://rafael-wampfler.github.io/projects/dialog-act/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/projects/dialog-act/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;For a conversational agent to respond appropriately, it must understand not just &lt;em&gt;what&lt;/em&gt; a user says, but &lt;em&gt;why&lt;/em&gt; they said it — the communicative intent behind their utterance. Dialog Act (DA) classification is the task of categorizing utterances by their function in conversation (e.g., question, assertion, greeting, request, clarification). This project develops multimodal dialog act classifiers tailored for interactions with digital characters.&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Standard dialog act classification systems are trained on text transcriptions alone. In real-world interactions with embodied agents, however, users communicate through a rich combination of speech prosody, gaze, gesture, and lexical content. A question delivered with rising intonation carries a different meaning than the same words spoken flatly; a greeting accompanied by eye contact differs from one delivered distractedly.&lt;/p&gt;
&lt;p&gt;For digital characters that must respond naturally in real time, dialog act classification must therefore be multimodal — integrating acoustic, linguistic, and, where available, visual signals — and must operate with low latency to support interactive response times.&lt;/p&gt;
&lt;h2 id="approach"&gt;Approach&lt;/h2&gt;
&lt;p&gt;Our multimodal dialog act classifier integrates the following signal types (a simplified fusion sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Lexical features&lt;/strong&gt;: Encoded via transformer-based text encoders fine-tuned on dialog corpora&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Acoustic features&lt;/strong&gt;: Prosodic signals including pitch, energy, and speech rate, extracted from the raw audio signal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Temporal context&lt;/strong&gt;: Conversation history modeling to resolve ambiguous acts through discourse-level context&lt;/li&gt;
&lt;/ul&gt;
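&lt;p&gt;A minimal sketch of this kind of fusion is shown below: a text embedding is combined with a small vector of prosodic features before classification. The encoder choice, feature dimensions, and label set are illustrative assumptions rather than the published model.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Minimal multimodal dialog act classifier sketch: a sentence embedding is
# fused with prosodic features (pitch, energy, speech rate). Dimensions,
# encoders, and the label set are illustrative assumptions only.

import torch
import torch.nn as nn

DIALOG_ACTS = ["question", "assertion", "greeting", "request", "clarification"]


class DialogActClassifier(nn.Module):
    def __init__(self, text_dim=384, prosody_dim=3, hidden=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.prosody_proj = nn.Linear(prosody_dim, hidden)
        self.head = nn.Sequential(nn.ReLU(),
                                  nn.Linear(hidden + hidden, len(DIALOG_ACTS)))

    def forward(self, text_emb, prosody):
        fused = torch.cat([self.text_proj(text_emb),
                           self.prosody_proj(prosody)], dim=1)
        return self.head(fused)


if __name__ == "__main__":
    # In practice text_emb would come from a transformer text encoder and the
    # prosody vector from an audio front end; random tensors stand in here.
    model = DialogActClassifier()
    logits = model(torch.randn(2, 384), torch.randn(2, 3))
    print(logits.argmax(dim=1))  # predicted dialog act indices
&lt;/code&gt;&lt;/pre&gt;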
&lt;p&gt;The system is evaluated on naturalistic conversations with digital characters — a challenging setting because users frequently use fragmented, spontaneous speech rather than complete, grammatical sentences. The classifier is optimized for both accuracy and latency, enabling real-time use within the Digital Einstein pipeline.&lt;/p&gt;
&lt;h2 id="key-results"&gt;Key Results&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Demonstrated that multimodal integration (text + acoustic features) significantly outperforms text-only baselines for dialog act classification in digital character conversations&lt;/li&gt;
&lt;li&gt;Achieved real-time classification latency compatible with interactive agent deployment&lt;/li&gt;
&lt;li&gt;Provided insights into which dialog acts are most frequently misclassified in human-agent interaction, informing future system design&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="publication"&gt;Publication&lt;/h2&gt;
&lt;p&gt;P. Witzig, R. Constantin, N. Kovačević and &lt;strong&gt;R. Wampfler&lt;/strong&gt; (2024). &lt;em&gt;Multimodal Dialog Act Classification for Conversations With Digital Characters&lt;/em&gt;. Proceedings of the 6th International Conference on Conversational User Interfaces (CUI), Luxembourg, Luxembourg, July 08–10, 2024, pp. 1–14.&lt;/p&gt;</description></item><item><title>A Platform for Interactive AI Character Experiences</title><link>https://rafael-wampfler.github.io/publications/platform-interactive-ai-2025/</link><pubDate>Sun, 10 Aug 2025 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/publications/platform-interactive-ai-2025/</guid><description/></item><item><title>On Multimodal Emotion Recognition for Human-Chatbot Interaction in the Wild</title><link>https://rafael-wampfler.github.io/publications/multimodal-emotion-recognition-2024/</link><pubDate>Mon, 04 Nov 2024 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/publications/multimodal-emotion-recognition-2024/</guid><description/></item><item><title>Multimodal Dialog Act Classification for Conversations With Digital Characters</title><link>https://rafael-wampfler.github.io/publications/dialog-act-classification-2024/</link><pubDate>Mon, 08 Jul 2024 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/publications/dialog-act-classification-2024/</guid><description/></item></channel></rss>