Dialog Act Classification

Overview
For a conversational agent to respond appropriately, it must understand not just what a user says, but why they said it — the communicative intent behind their utterance. Dialog Act (DA) classification is the task of categorizing utterances by their function in conversation (e.g., question, assertion, greeting, request, clarification). This project develops multimodal dialog act classifiers tailored for interactions with digital characters.
Motivation
Standard dialog act classification systems are trained on text transcripts alone. In real-world interactions with embodied agents, however, users communicate through a rich combination of speech prosody, gaze, gesture, and lexical content. A question delivered with rising intonation carries a different meaning from the same words spoken flatly; a greeting accompanied by eye contact differs from one delivered distractedly.
For digital characters that must respond naturally in real time, dialog act classification must therefore be multimodal, integrating acoustic, linguistic, and, where available, visual signals. It must also run with low enough latency to support interactive response times.
Approach
Our multimodal dialog act classifier integrates:
- Lexical features: Encoded via transformer-based text encoders fine-tuned on dialog corpora
- Acoustic features: Prosodic signals including pitch, energy, and speech rate, extracted from the raw audio signal
- Temporal context: Conversation history modeling to resolve ambiguous acts through discourse-level context
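As a rough illustration of the acoustic features named in the list, the sketch below computes frame energy, a pitch estimate, and a speech rate in plain Python. It is not the project's extraction code: the autocorrelation pitch estimator, the 75 to 400 Hz search range, the function names, and the synthetic 200 Hz test tone are all assumptions made for this example, and speech rate is assumed to come from a word count and duration supplied by a speech recognizer.

```python
import math

def rms_energy(frame):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate F0 in Hz from the autocorrelation peak between fmin and fmax.

    The search range is an assumed, typical range for speech.
    Returns 0.0 if no positive autocorrelation peak is found.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

def speech_rate(num_words, duration_s):
    """Words per second; word count and duration are assumed to come from ASR."""
    return num_words / duration_s

# Synthetic stand-in for a voiced frame: a 200 Hz sine sampled at 16 kHz.
sample_rate = 16000
frame = [0.5 * math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(1024)]

pitch = estimate_pitch(frame, sample_rate)
energy = rms_energy(frame)
print(f"pitch ~ {pitch:.1f} Hz, energy ~ {energy:.3f}")
```

In practice a dedicated pitch tracker would likely be more robust on noisy, spontaneous speech than this bare autocorrelation peak.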
The system is evaluated on naturalistic conversations with digital characters — a challenging setting because users frequently use fragmented, spontaneous speech rather than complete, grammatical sentences. The classifier is optimized for both accuracy and latency, enabling real-time use within the Digital Einstein pipeline.
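One simple way to combine the lexical, acoustic, and context features described in this section is early fusion: concatenate the feature groups and score them with a single linear layer. The sketch below shows only that idea. The source does not state the fusion architecture actually used, so the label set, feature dimensions, hand-set weights, and all function names here are hypothetical.

```python
import math

# Hypothetical label set, taken from the examples given in the overview.
ACTS = ["question", "assertion", "greeting", "request", "clarification"]

def one_hot(act):
    """Encode the previous dialog act as a one-hot vector (context feature)."""
    return [1.0 if act == a else 0.0 for a in ACTS]

def fuse(text_emb, acoustic, prev_act):
    """Early fusion: concatenate text, acoustic, and context features."""
    return list(text_emb) + list(acoustic) + one_hot(prev_act)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, biases):
    """Linear layer plus softmax; returns the top label and all probabilities."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(ACTS)), key=probs.__getitem__)
    return ACTS[best], probs

# Toy inputs: a 2-d stand-in for a text embedding, and three acoustic values
# (pitch slope, energy, speech rate). All numbers are invented.
features = fuse([0.1, -0.2], [0.8, 0.3, 0.5], "greeting")

# Hand-set weights for illustration: a rising pitch slope (feature index 2)
# pushes toward "question" and away from "assertion".
weights = [[0.0] * len(features) for _ in ACTS]
weights[0][2] = 2.0
weights[1][2] = -2.0
biases = [0.0] * len(ACTS)

label, probs = classify(features, weights, biases)
print(label)
```

A trained system would learn the weights from labeled dialog data and would typically use a far larger text embedding; the point here is only how the three feature groups end up in one input vector.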
Key Results
- Demonstrated that multimodal integration (text + acoustic features) significantly outperforms text-only baselines for dialog act classification in digital character conversations
- Achieved real-time classification latency compatible with interactive agent deployment
- Provided insights into which dialog acts are most frequently misclassified in human-agent interaction, informing future system design
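An analysis of which dialog acts are most often misclassified can be approximated by counting per-class errors and the most frequent confusion pairs. The sketch below is a minimal example of that bookkeeping, not the paper's evaluation code; the function names and the toy gold and predicted labels are invented for illustration and do not reflect any reported results.

```python
from collections import Counter

def error_rates(gold, predicted):
    """Fraction of utterances of each gold act that were misclassified."""
    totals = Counter(gold)
    errors = Counter(g for g, p in zip(gold, predicted) if g != p)
    return {act: errors[act] / totals[act] for act in totals}

def top_confusions(gold, predicted, k=3):
    """Most frequent (gold, predicted) pairs among the errors."""
    pairs = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return pairs.most_common(k)

# Invented toy labels, for illustration only.
gold = ["question", "question", "assertion", "request", "request", "greeting"]
pred = ["question", "assertion", "assertion", "question", "question", "greeting"]

rates = error_rates(gold, pred)
confusions = top_confusions(gold, pred)
print(rates)
print(confusions)
```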
Publication
P. Witzig, R. Constantin, N. Kovačević and R. Wampfler (2024). Multimodal Dialog Act Classification for Conversations With Digital Characters. Proceedings of the 6th International Conference on Conversational User Interfaces (CUI), Luxembourg, Luxembourg, July 08–10, 2024, pp. 1–14.

I am a Senior Researcher & Lecturer at the Computer Graphics Laboratory of ETH Zurich, and a Research Consultant at Disney Research. I am leading the Digital Character AI projects at CGL. My research interests include conversational digital characters, affective computing, human-computer interaction, and applied machine learning.
My vision is to create intelligent digital humans that can naturally communicate, understand, and support people across domains such as education and mental health. My research focuses on multimodal artificial intelligence for interactive digital humans, developing models that combine large language models, affective computing, and data-driven animation to create embodied conversational agents endowed with autonomous agency, consistent values, and beliefs.
My work bridges machine learning, human–computer interaction, and computer graphics to enable AI systems such as Digital Einstein and interactive patient avatars for psychotherapy training and health education.