GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for the speech processing, while speech-language models still suffer from limited or even without vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech capabilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice surprisingly that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.
The vision encoder extracts continuous visual features, which are projected into the text embedding space as visual tokens, while the input speech is encoded and quantized into discrete units. Given omni-modal inputs, EMOVA can generate both textual and speech responses with vivid emotional controls.
1) EMOVA surpasses GPT-4V and Gemini Pro 1.5 on 10 of 14 vision-language benchmarks, while reaching over 95% of GPT-4o performance on nearly all benchmarks. 2) EMOVA achieves state-of-the-art performance on the ASR task, surpassing its speech counterparts by a significant margin.
User | Transcript | Emova | Transcript |
---|---|---|---|
Imagine you are a parent who is angry at their teenage daughter for coming home past curfew. | I am absolutely furious with you! Coming home past curfew is completely unacceptable! Do you have any idea how dangerous it is to be out late? This is not a joke! You need to understand the consequences of your actions. I am very disappointed in you! | ||
为即将结婚的朋友送上祝福。 | 祝福你们,愿你们的爱情如晨曦般温暖,永不熄灭,愿你们的婚姻如同繁花似锦,绚烂夺目,愿你们的生活如同诗篇,甜蜜无比,愿你们的未来充满欢笑与喜悦,永远幸福美满! | ||
Hi, what's your name? | Hello, my name is EMOVA, nice to meed to you. | ||
Hey, I'm so sad these days. | I'm really sorry to hear that you are feeling sad, it's important to take care of yourself during tough times. Let's talk about what's been going on and see if we can find some way to help you feel better. |
@article{chen2024emova,
title={EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions},
author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and Wang, Dingdong and Xiang, Kun and Li, Haoyuan and Bai, Haoli and Han, Jianhua and Li, Xiaohui and Jin, Weike and Xie, Nian and Zhang, Yu and Kwok, James T. and Zhao, Hengshuang and Liang, Xiaodan and Yeung, Dit-Yan and Chen, Xiao and Li, Zhenguo and Zhang, Wei and Liu, Qun and Hong, and Hou, Lu and Xu, Hang},
journal={arXiv preprint arXiv:2409.18042},
year={2024}
}