<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Computer Graphics | Dr. Rafael Wampfler</title><link>https://rafael-wampfler.github.io/tags/computer-graphics/</link><atom:link href="https://rafael-wampfler.github.io/tags/computer-graphics/index.xml" rel="self" type="application/rss+xml"/><description>Computer Graphics</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 01 Jan 2024 00:00:00 +0000</lastBuildDate><image><url>https://rafael-wampfler.github.io/media/icon_hu_d100f07c298b9e73.png</url><title>Computer Graphics</title><link>https://rafael-wampfler.github.io/tags/computer-graphics/</link></image><item><title>Facial Animation Synthesis</title><link>https://rafael-wampfler.github.io/projects/facial-animation/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://rafael-wampfler.github.io/projects/facial-animation/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;Realistic, expressive facial animation is a critical component of embodied conversational agents. For AI characters to communicate naturally, their facial movements must be synchronized with speech, emotionally consistent, and computationally efficient enough for real-time use. This project develops deep learning architectures that achieve all three goals.&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;The challenge of generating high-quality 3D facial animation from text or speech involves several competing requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Expressiveness&lt;/strong&gt;: Facial motion should convey the speaker&amp;rsquo;s emotional state convincingly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Synchronization&lt;/strong&gt;: Lip movements must match phoneme timing precisely to avoid the uncanny valley&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Emotional expressivity should be decoupled from semantic content, allowing the same phrase to be delivered in multiple emotional registers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;: Systems deployed in interactive agents must run in real time on consumer hardware&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prior approaches either sacrifice expressiveness for speed or require enormous amounts of training data and computation. Our work addresses efficiency and expressiveness simultaneously by rethinking how deep learning encodes linguistic and acoustic structure.&lt;/p&gt;
&lt;h2 id="approach"&gt;Approach&lt;/h2&gt;
&lt;h3 id="phonemenet-mig-2025"&gt;PhonemeNet (MIG 2025)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;PhonemeNet&lt;/strong&gt; introduces a transformer pipeline built around the phoneme-level structure of speech. Rather than treating the speech signal as a raw audio waveform or frame-level features, PhonemeNet operates at the level of phonemes — the fundamental units of speech that determine lip shape. This problem-specific inductive bias yields both improved accuracy and computational efficiency compared to architectures that ignore linguistic structure.&lt;/p&gt;
&lt;p&gt;PhonemeNet takes text input, extracts phoneme sequences, and generates corresponding 3D facial blendshape sequences that are synchronized with speech audio. The pipeline achieves real-time performance on standard hardware, making it suitable for deployment in interactive embodied agents.&lt;/p&gt;
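&lt;p&gt;The text-to-phonemes-to-blendshapes flow can be sketched roughly as follows. This is a minimal illustrative sketch, not the published architecture: the toy phoneme inventory, the &lt;code&gt;g2p&lt;/code&gt; lookup, the single attention layer, and all dimensions are assumptions made for demonstration (the real model is a full transformer trained on captured facial data):&lt;/p&gt;

```python
# Minimal sketch of a phoneme-driven animation pipeline (illustrative only).
# The phoneme inventory, g2p lookup, and dimensions are assumptions;
# PhonemeNet itself is a full transformer and is not reproduced here.
import numpy as np

PHONEMES = ["AA", "B", "IY", "M", "P", "S"]    # toy inventory (assumption)
N_BLENDSHAPES = 4                              # e.g. jaw_open, lips_pucker, ...

rng = np.random.default_rng(0)
embed = rng.normal(size=(len(PHONEMES), 16))          # phoneme embeddings
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
W_out = rng.normal(size=(16, N_BLENDSHAPES))          # head -> blendshape weights

def g2p(text):
    """Toy grapheme-to-phoneme step (real systems use a pronunciation lexicon)."""
    lookup = {"m": "M", "a": "AA", "p": "P", "s": "S", "b": "B", "i": "IY"}
    return [lookup[c] for c in text.lower() if c in lookup]

def self_attention(x):
    """Single-head scaled dot-product attention over the phoneme sequence."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

def animate(text):
    """Text -> phonemes -> contextual features -> per-phoneme blendshape weights."""
    ids = [PHONEMES.index(p) for p in g2p(text)]
    features = self_attention(embed[ids])
    return 1 / (1 + np.exp(-(features @ W_out)))   # weights in [0, 1]

frames = animate("map")
print(frames.shape)   # one blendshape weight vector per phoneme
```

&lt;p&gt;In a deployed system, each per-phoneme output would be resampled against the audio's phoneme timings to produce per-frame blendshape curves.&lt;/p&gt;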
&lt;p&gt;PhonemeNet received the &lt;strong&gt;Best Paper Honorable Mention&lt;/strong&gt; at the 18th ACM SIGGRAPH Conference on Motion, Interaction, and Games (MIG 2025).&lt;/p&gt;
&lt;h3 id="emospacetime-mig-2024"&gt;EmoSpaceTime (MIG 2024)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;EmoSpaceTime&lt;/strong&gt; addresses the problem of emotionally expressive 3D speech animation through a contrastive learning strategy. The core insight is that facial animation should be factorized into two independent components: &lt;strong&gt;emotion&lt;/strong&gt; (how the speaker feels) and &lt;strong&gt;content&lt;/strong&gt; (what the speaker is saying). By learning to decouple these in a shared embedding space, EmoSpaceTime enables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transfer of emotional style between speakers&lt;/li&gt;
&lt;li&gt;Consistent emotional expressivity across different sentences&lt;/li&gt;
&lt;li&gt;Fine-grained control over emotional intensity at inference time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The resulting animations are both emotionally coherent — the emotion is consistent throughout an utterance — and semantically coherent — lip synchronization is accurate regardless of the emotional style applied.&lt;/p&gt;
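&lt;p&gt;The decoupling idea can be sketched with a toy contrastive objective. Everything below is an assumption for illustration — the split of an embedding into emotion and content halves, the InfoNCE-style loss, and all dimensions — and does not reproduce the EmoSpaceTime model itself:&lt;/p&gt;

```python
# Illustrative sketch of contrastive emotion/content decoupling (not the
# EmoSpaceTime model): each clip embedding is split into an "emotion" half
# and a "content" half, and an InfoNCE-style loss pulls together the emotion
# halves of clips that share an emotion label.
import numpy as np

def split(z):
    """Factorize a clip embedding into (emotion, content) halves."""
    half = z.shape[-1] // 2
    return z[..., :half], z[..., half:]

def contrastive_loss(emotion_vecs, labels, temperature=0.1):
    """Same-emotion clips are positives; all other clips are negatives."""
    z = emotion_vecs / np.linalg.norm(emotion_vecs, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)               # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]     # positive-pair mask
    np.fill_diagonal(pos, False)
    return -logp[pos].mean()                     # mean -log p(positive)

rng = np.random.default_rng(1)
clips = rng.normal(size=(6, 8))                  # 6 clips, 8-dim embeddings
emotion, content = split(clips)
labels = np.array([0, 0, 1, 1, 2, 2])            # two clips per emotion
print(round(float(contrastive_loss(emotion, labels)), 3))
```

&lt;p&gt;Training against such an objective on the emotion half — while a reconstruction or lip-sync loss constrains the content half — is what allows the same phrase to be re-rendered in different emotional registers at inference time.&lt;/p&gt;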
&lt;h2 id="key-results"&gt;Key Results&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;PhonemeNet achieves real-time text-driven facial animation with best-in-class lip synchronization accuracy — Best Paper Honorable Mention at MIG 2025&lt;/li&gt;
&lt;li&gt;EmoSpaceTime demonstrates that contrastive decoupling of emotion and content significantly improves expressive quality while maintaining temporal coherence&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="publications"&gt;Publications&lt;/h2&gt;
&lt;p&gt;P. Witzig, B. Solenthaler, M. Gross and &lt;strong&gt;R. Wampfler&lt;/strong&gt; (2025). &lt;em&gt;PhonemeNet: A Transformer Pipeline for Text-Driven Facial Animation&lt;/em&gt;. Proceedings of the 18th ACM SIGGRAPH Conference on Motion, Interaction and Games (MIG &amp;rsquo;25), Zurich, Switzerland, December 3–5, 2025, pp. 1–11. &lt;strong&gt;Best Paper Honorable Mention.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;P. Witzig, B. Solenthaler, M. Gross and &lt;strong&gt;R. Wampfler&lt;/strong&gt; (2024). &lt;em&gt;EmoSpaceTime: Decoupling Emotion and Content through Contrastive Learning for Expressive 3D Speech Animation&lt;/em&gt;. Proceedings of the 17th ACM SIGGRAPH Conference on Motion, Interaction and Games (MIG &amp;rsquo;24), Arlington, USA, November 21–23, 2024.&lt;/p&gt;</description></item></channel></rss>