EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotion

News

[2025.03] 🔥 We have released EMOVA's code (training & inference), datasets (training & evaluation), and checkpoints (EMOVA-3B/7B/72B)!

[2025.02] 🔥 Our EMOVA paper has been accepted by CVPR 2025!

Interactive Demo

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for the speech processing, while speech-language models still suffer from limited or even without vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech capabilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice surprisingly that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.

End-to-end Omni-modal Architecture

The vision encoder extracts continuous visual features, which are projected into the text embedding space as visual tokens, while the input speech is encoded and quantized into discrete units. Given omni-modal inputs, EMOVA can generate both textual and speech responses with vivid emotional controls.

State-of-the-art Omni-modal Capabilties

1) EMOVA outperforms GPT-4o/4V and Gemini Pro 1.5 on 11 of 14 vision-language benchmarks, providing a powerful open-sourced alternative. 2) EMOVA achieves state-of-the-art performance on on Librispeech, surpassing its speech and omni-modal counterparts significantly.

Vivid Emotional Control

User	Transcript	Emova	Transcript
	Hi, what's your name?		(Female, Happy) Hello, my name is EMOVA, nice to meed to you.
	Hey, I'm so sad these days.		(Female, Sad) I'm really sorry to hear that you are feeling sad, it's important to take care of yourself during tough times. Let's talk about what's been going on and see if we can find some way to help you feel better.
	Imagine you are a parent who is angry at their teenage daughter for coming home past curfew.		(Male, Angry) I am absolutely furious with you! Coming home past curfew is completely unacceptable! Do you have any idea how dangerous it is to be out late? This is not a joke! You need to understand the consequences of your actions. I am very disappointed in you!
	为即将结婚的朋友送上祝福。		(Female, Happy) 祝福你们，愿你们的爱情如晨曦般温暖，永不熄灭，愿你们的婚姻如同繁花似锦，绚烂夺目，愿你们的生活如同诗篇，甜蜜无比，愿你们的未来充满欢笑与喜悦，永远幸福美满！

Omni-modal Emotional Spoken Dialogues

Omni-modal Dialogue with Structural Data Understanding

Advanced Vision-language Abilities

BibTeX

@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}

EMOVA:

Empowering Language Models to See, Hear and Speak

with Vivid Emotions

CVPR 2025