EMOVA:

Empowering Language Models to See, Hear and Speak

with Vivid Emotions

Kai Chen1*, Yunhao Gou1,6*, Runhui Huang2*, Zhili Liu1,3*, Daxin Tan3*, Jing Xu4, Chunwei Wang3, Yi Zhu3, Yihan Zeng3, Kuo Yang3, Dingdong Wang4, Kun Xiang5, Haoyuan Li5, Haoli Bai3, Jianhua Han3, Xiaohui Li3, Weike Jin3, Nian Xie3, Yu Zhang6, James T. Kwok1, Hengshuang Zhao2, Xiaodan Liang5, Dit-Yan Yeung1, Xiao Chen3, Zhenguo Li3, Wei Zhang3, Qun Liu3, Jun Yao3, Lanqing Hong3†, Lu Hou3†, Hang Xu3†,
1Hong Kong University of Science and Technology, 2The University of Hong Kong, 3Huawei Noah's Ark Lab, 4The Chinese University of Hong Kong, 5Sun Yat-sen University, 6Southern University of Science and Technology
*Equal contribution. †Corresponding authors.

Interactive Demo

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited or even absent vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment further enhances both vision-language and speech abilities compared with the corresponding bi-modally aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style control (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks, while supporting omni-modal spoken dialogue with vivid emotions.

End-to-end Omni-modal Architecture


The vision encoder extracts continuous visual features, which are projected into the text embedding space as visual tokens, while the input speech is encoded and quantized into discrete units. Given omni-modal inputs, EMOVA can generate both textual and speech responses with vivid emotional controls.
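To make this data flow concrete, below is a minimal PyTorch sketch of how continuous visual features and discrete speech units might be merged into a single token sequence for the language model. The class names, dimensions, and vocabulary layout are illustrative assumptions, not EMOVA's released implementation.

import torch
import torch.nn as nn

# Toy sizes; a real model uses a much larger LLM embedding width and vocabulary.
D_VIS, D_LLM = 1024, 512
TEXT_VOCAB, SPEECH_UNITS = 32000, 4096   # text tokens plus discrete speech units

class VisionProjector(nn.Module):
    """Maps continuous visual features into the LLM embedding space ("visual tokens")."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_VIS, D_LLM)

    def forward(self, feats):                  # feats: (B, N_patches, D_VIS)
        return self.proj(feats)                # (B, N_patches, D_LLM)

class OmniEmbedder(nn.Module):
    """One embedding table covering both text tokens and discrete speech units,
    so quantized speech is handled like an extended text vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + SPEECH_UNITS, D_LLM)

    def forward(self, token_ids):              # (B, T) ids of text tokens or speech units
        return self.embed(token_ids)

# Toy forward pass: concatenate visual tokens, speech-unit embeddings, and text
# embeddings into one sequence for the LLM backbone (stubbed out here), which
# then autoregressively emits text tokens and/or speech units as the response.
vision_proj, embedder = VisionProjector(), OmniEmbedder()
visual_tokens = vision_proj(torch.randn(1, 256, D_VIS))
speech_embeds = embedder(torch.randint(TEXT_VOCAB, TEXT_VOCAB + SPEECH_UNITS, (1, 80)))
text_embeds = embedder(torch.randint(0, TEXT_VOCAB, (1, 16)))
llm_inputs = torch.cat([visual_tokens, speech_embeds, text_embeds], dim=1)
print(llm_inputs.shape)                        # torch.Size([1, 352, 512])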

State-of-the-art Vision-language & Speech Abilities


1) EMOVA surpasses GPT-4V and Gemini Pro 1.5 on 10 of 14 vision-language benchmarks, while reaching over 95% of GPT-4o's performance on nearly all of them. 2) EMOVA achieves state-of-the-art performance on ASR, surpassing its speech-only counterparts by a significant margin.

Vivid Emotional Control

Example transcripts (each user turn is followed by EMOVA's response and its speech style):

User: Hi, what's your name?
EMOVA (Female, Happy): Hello, my name is EMOVA, nice to meet you.

User: Hey, I'm so sad these days.
EMOVA (Female, Sad): I'm really sorry to hear that you are feeling sad. It's important to take care of yourself during tough times. Let's talk about what's been going on and see if we can find some way to help you feel better.

User: Imagine you are a parent who is angry at their teenage daughter for coming home past curfew.
EMOVA (Male, Angry): I am absolutely furious with you! Coming home past curfew is completely unacceptable! Do you have any idea how dangerous it is to be out late? This is not a joke! You need to understand the consequences of your actions. I am very disappointed in you!

User (in Chinese): Send your blessings to a friend who is about to get married.
EMOVA (Female, Happy, in Chinese): Best wishes to you both: may your love be as warm as the morning light and never fade, may your marriage be as splendid as flowers in full bloom, may your life be as sweet as poetry, and may your future be filled with laughter and joy, happy and fulfilled forever!
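The style labels above (e.g., Female/Happy, Male/Angry) hint at how a lightweight style module can condition speech synthesis. Below is a hedged PyTorch sketch of one such design, in which discrete style embeddings steer a unit-to-acoustics decoder; the StyleModule and UnitDecoder names, label sets, and shapes are assumptions for illustration, not EMOVA's actual interfaces.

import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # illustrative label sets
GENDERS = ["female", "male"]
D_STYLE, D_UNIT, N_UNITS, N_MELS = 128, 256, 4096, 80

class StyleModule(nn.Module):
    """Maps discrete style labels to a single style embedding that steers the decoder."""
    def __init__(self):
        super().__init__()
        self.emotion = nn.Embedding(len(EMOTIONS), D_STYLE)
        self.gender = nn.Embedding(len(GENDERS), D_STYLE)

    def forward(self, emotion_id, gender_id):
        return self.emotion(emotion_id) + self.gender(gender_id)   # (B, D_STYLE)

class UnitDecoder(nn.Module):
    """Turns discrete speech units plus a style embedding into acoustic frames;
    a toy GRU stands in for a real unit-to-speech vocoder."""
    def __init__(self):
        super().__init__()
        self.unit_embed = nn.Embedding(N_UNITS, D_UNIT)
        self.style_proj = nn.Linear(D_STYLE, D_UNIT)
        self.rnn = nn.GRU(D_UNIT, D_UNIT, batch_first=True)
        self.to_frames = nn.Linear(D_UNIT, N_MELS)

    def forward(self, units, style):                               # units: (B, T)
        x = self.unit_embed(units) + self.style_proj(style).unsqueeze(1)
        hidden, _ = self.rnn(x)
        return self.to_frames(hidden)                              # (B, T, N_MELS)

# Render the same unit sequence in two different styles, e.g. "Female, Happy"
# versus "Male, Angry", by swapping only the style embedding.
style, decoder = StyleModule(), UnitDecoder()
units = torch.randint(0, N_UNITS, (1, 120))
happy = decoder(units, style(torch.tensor([1]), torch.tensor([0])))
angry = decoder(units, style(torch.tensor([3]), torch.tensor([1])))
print(happy.shape, angry.shape)                # both torch.Size([1, 120, 80])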

Omni-modal Emotional Spoken Dialogues

Omni-modal Dialogue with Structural Data Understanding

Advanced Vision-language Abilities

BibTeX

@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}