EMOVA:

Empowering Language Models to See, Hear and Speak

with Vivid Emotions

Kai Chen1*, Yunhao Gou1,6*, Runhui Huang2*, Zhili Liu1,3*, Daxin Tan3*, Jing Xu4, Chunwei Wang3, Yi Zhu3, Yihan Zeng3, Kuo Yang3, Dingdong Wang4, Kun Xiang5, Haoyuan Li5, Haoli Bai3, Jianhua Han3, Xiaohui Li3, Weike Jin3, Nian Xie3, Yu Zhang6, James T. Kwok1, Hengshuang Zhao2, Xiaodan Liang5, Dit-Yan Yeung1, Xiao Chen3, Zhenguo Li3, Wei Zhang3, Qun Liu3, Jun Yao3, Lanqing Hong3†, Lu Hou3†, Hang Xu3†,
1Hong Kong University of Science and Technology, 2The University of Hong Kong, 3Huawei Noah's Ark Lab, 4The Chinese University of Hong Kong, 5Sun Yat-sen University, 6Southern University of Science and Technology
*Equal contribution. †Corresponding authors.

Interactive Demo

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited or even absent vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment further enhances both vision-language and speech abilities compared with the corresponding bi-modally aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style control (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks, while supporting omni-modal spoken dialogue with vivid emotions.

End-to-end Omni-modal Architecture


The vision encoder extracts continuous visual features, which are projected into the text embedding space as visual tokens, while the input speech is encoded and quantized into discrete units. Given omni-modal inputs, EMOVA can generate both textual and speech responses with vivid emotional controls.
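To make this data flow concrete, below is a minimal PyTorch sketch of how continuous visual features and discrete speech units might be merged into a single token sequence for the language model. The class names, dimensions, and vocabulary layout are illustrative assumptions, not EMOVA's released implementation.

import torch
import torch.nn as nn

# Toy sizes; a real model uses a much larger LLM embedding width and vocabulary.
D_VIS, D_LLM = 1024, 512
TEXT_VOCAB, SPEECH_UNITS = 32000, 4096   # text tokens plus discrete speech units

class VisionProjector(nn.Module):
    """Maps continuous visual features into the LLM embedding space ("visual tokens")."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_VIS, D_LLM)

    def forward(self, feats):                  # feats: (B, N_patches, D_VIS)
        return self.proj(feats)                # (B, N_patches, D_LLM)

class OmniEmbedder(nn.Module):
    """One embedding table covering both text tokens and discrete speech units,
    so quantized speech is handled like an extended text vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + SPEECH_UNITS, D_LLM)

    def forward(self, token_ids):              # (B, T) ids of text tokens or speech units
        return self.embed(token_ids)

# Toy forward pass: concatenate visual tokens, speech-unit embeddings, and text
# embeddings into one sequence for the LLM backbone (stubbed out here), which
# then autoregressively emits text tokens and/or speech units as the response.
vision_proj, embedder = VisionProjector(), OmniEmbedder()
visual_tokens = vision_proj(torch.randn(1, 256, D_VIS))
speech_embeds = embedder(torch.randint(TEXT_VOCAB, TEXT_VOCAB + SPEECH_UNITS, (1, 80)))
text_embeds = embedder(torch.randint(0, TEXT_VOCAB, (1, 16)))
llm_inputs = torch.cat([visual_tokens, speech_embeds, text_embeds], dim=1)
print(llm_inputs.shape)                        # torch.Size([1, 352, 512])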

State-of-the-art Vision-language & Speech Abilities


1) EMOVA surpasses GPT-4V and Gemini Pro 1.5 on 10 of 14 vision-language benchmarks, while reaching over 95% of GPT-4o's performance on nearly all of them. 2) EMOVA achieves state-of-the-art performance on ASR, surpassing its speech-only counterparts by a significant margin.

Vivid Emotional Control

Example transcripts (each user turn is followed by EMOVA's response and its speech style):

User: Hi, what's your name?
EMOVA (Female, Happy): Hello, my name is EMOVA, nice to meet you.

User: Hey, I'm so sad these days.
EMOVA (Female, Sad): I'm really sorry to hear that you are feeling sad. It's important to take care of yourself during tough times. Let's talk about what's been going on and see if we can find some way to help you feel better.

User: Imagine you are a parent who is angry at their teenage daughter for coming home past curfew.
EMOVA (Male, Angry): I am absolutely furious with you! Coming home past curfew is completely unacceptable! Do you have any idea how dangerous it is to be out late? This is not a joke! You need to understand the consequences of your actions. I am very disappointed in you!

User (in Chinese): Send your blessings to a friend who is about to get married.
EMOVA (Female, Happy, in Chinese): Best wishes to you both: may your love be as warm as the morning light and never fade, may your marriage be as splendid as flowers in full bloom, may your life be as sweet as poetry, and may your future be filled with laughter and joy, happy and fulfilled forever!
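The style labels above (e.g., Female/Happy, Male/Angry) hint at how a lightweight style module can condition speech synthesis. Below is a hedged PyTorch sketch of one such design, in which discrete style embeddings steer a unit-to-acoustics decoder; the StyleModule and UnitDecoder names, label sets, and shapes are assumptions for illustration, not EMOVA's actual interfaces.

import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # illustrative label sets
GENDERS = ["female", "male"]
D_STYLE, D_UNIT, N_UNITS, N_MELS = 128, 256, 4096, 80

class StyleModule(nn.Module):
    """Maps discrete style labels to a single style embedding that steers the decoder."""
    def __init__(self):
        super().__init__()
        self.emotion = nn.Embedding(len(EMOTIONS), D_STYLE)
        self.gender = nn.Embedding(len(GENDERS), D_STYLE)

    def forward(self, emotion_id, gender_id):
        return self.emotion(emotion_id) + self.gender(gender_id)   # (B, D_STYLE)

class UnitDecoder(nn.Module):
    """Turns discrete speech units plus a style embedding into acoustic frames;
    a toy GRU stands in for a real unit-to-speech vocoder."""
    def __init__(self):
        super().__init__()
        self.unit_embed = nn.Embedding(N_UNITS, D_UNIT)
        self.style_proj = nn.Linear(D_STYLE, D_UNIT)
        self.rnn = nn.GRU(D_UNIT, D_UNIT, batch_first=True)
        self.to_frames = nn.Linear(D_UNIT, N_MELS)

    def forward(self, units, style):                               # units: (B, T)
        x = self.unit_embed(units) + self.style_proj(style).unsqueeze(1)
        hidden, _ = self.rnn(x)
        return self.to_frames(hidden)                              # (B, T, N_MELS)

# Render the same unit sequence in two different styles, e.g. "Female, Happy"
# versus "Male, Angry", by swapping only the style embedding.
style, decoder = StyleModule(), UnitDecoder()
units = torch.randint(0, N_UNITS, (1, 120))
happy = decoder(units, style(torch.tensor([1]), torch.tensor([0])))
angry = decoder(units, style(torch.tensor([3]), torch.tensor([1])))
print(happy.shape, angry.shape)                # both torch.Size([1, 120, 80])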

Omni-modal Emotional Spoken Dialogues

Omni-modal Dialogue with Structural Data Understanding

Advanced Vision-language Abilities

BibTeX

@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}