1School of Data Science, 2Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China
3Alibaba Group, Singapore
Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a diffusion-based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. During inference, the TTS model not only generates emotional speech but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.
You can visit the project page of this paper: GitHub Repository.
The first section presents a demo to evaluate our model's reproducibility. The hierarchical ED of our proposed models and the emotion intensities of MsEmoTTS are obtained from the reference audio. We compare the following five sets of samples:
ID | 0013_001077 | 0019_001076 | 0011_000374 | 0016_000740 | 0018_000729 | 0020_000393 | 0012_001079 | 0018_000379 | 0016_000721 | 0020_000732 | 0016_000381 | 0020_000392 | 0018_001438 | 0012_000379 | 0011_001074 | 0014_000021 | 0013_000377 | 0013_000729 | 0014_000391 | 0011_001429 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Monster made a deep bow. | Let's make the noise a snake. | The football teams give a tea party. | In which fox loses a tail and its elder sister finds one. | Rat came and replied on the leaves. | Story twenty nine a boy and a monkey. | Rat came and replied on the leaves. | Rat came and replied on the leaves. | As rich-as Peter's son in law! | Who is been repeating all that hard stuff to you? | All smile were real and the happier the more sincere . | She may mind ye of her. | Hold up my chin, slow and solid. | Rat came and replied on the leaves. | The football teams give a tea party. | As rich-as Peter's son in law! | Monster made a deep bow. | Rat came and replied on the leaves. | I chose the right way. | Rat came and replied on the leaves. |
Reference Speech | ||||||||||||||||||||
MsEmoTTS (Baseline) | ||||||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||||||
Proposed w/ SER | ||||||||||||||||||||
Proposed w/ EPR |
* please scroll horizontally to explore additional columns in the table.
In this section, we demonstrate the control of utterance-level emotion intensities. In each table, we set the utterance-level intensity of one emotion to the values in the first row while keeping the intensities of the other emotions constant. MsEmoTTS is omitted because it does not support this type of control.
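The utterance-level control described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary layout of the hierarchical ED, the `EMOTIONS` list, the `[0, 1]` intensity range, and the helper `set_utterance_intensity` are all assumptions made for illustration. The sketch overrides one emotion's utterance-level intensity and leaves every other value untouched.

```python
from copy import deepcopy

# Hypothetical emotion inventory; the page does not name the emotion classes.
EMOTIONS = ["angry", "happy", "sad", "surprise"]


def set_utterance_intensity(ed, emotion, value):
    """Return a copy of `ed` whose utterance-level intensity for `emotion` is `value`."""
    if not 0.0 <= value <= 1.0:
        # Assumption: intensities are normalized to [0, 1].
        raise ValueError("intensities are assumed to lie in [0, 1]")
    edited = deepcopy(ed)  # leave the reference ED untouched
    edited["utterance"][EMOTIONS.index(emotion)] = value
    return edited


# Toy hierarchical ED standing in for one extracted from reference audio.
reference_ed = {
    "utterance": [0.1, 0.7, 0.1, 0.2],
    "word": [[0.1, 0.6, 0.1, 0.2], [0.0, 0.8, 0.1, 0.1]],
    "phoneme": [[0.1, 0.6, 0.1, 0.2], [0.1, 0.5, 0.2, 0.2], [0.0, 0.8, 0.1, 0.1]],
}

# Sweep one emotion over several illustrative target values.
sweep = [set_utterance_intensity(reference_ed, "happy", v) for v in (0.0, 0.5, 1.0)]
for ed in sweep:
    print(ed["utterance"])
```

Each edited ED would then condition the TTS model at inference; only the selected emotion changes across the sweep, which mirrors the "other intensities constant" setup of the tables.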
ID | 0014_000732 | | | | 0015_000726 | | | | 0013_000731 | | | | 0018_000037 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. |
Ground Truth | ||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||
Proposed w/ SER | ||||||||||||||||
Proposed w/ EPR |
ID | 0014_000732 | | | | 0015_000726 | | | | 0013_000731 | | | | 0018_000037 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. |
Ground Truth | ||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||
Proposed w/ SER | ||||||||||||||||
Proposed w/ EPR |
ID | 0014_000732 | | | | 0015_000726 | | | | 0013_000731 | | | | 0018_000037 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. |
Ground Truth | ||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||
Proposed w/ SER | ||||||||||||||||
Proposed w/ EPR |
ID | 0014_000732 | | | | 0015_000726 | | | | 0013_000731 | | | | 0018_000037 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. |
Ground Truth | ||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||
Proposed w/ SER | ||||||||||||||||
Proposed w/ EPR |
In this section, we demonstrate the control of emotion intensities in a specific segment of speech, focusing on three words marked with underscores. We set the word-level emotion intensities of these words, and the corresponding phoneme-level intensities, to the values in the first row, while keeping the other emotion intensities constant. This showcases our model's ability to adjust emotion at a fine granularity within specified speech segments. We also present speech samples that contain only the modified parts ('___OnlyModified___').
The MsEmoTTS baseline can only change the intensity of the ground-truth emotion, so only one emotion control is available for it.
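The segment-level control described above can be sketched in the same way. This is an illustrative sketch under stated assumptions, not the authors' code: the ED layout, the `EMOTIONS` list, the `word_to_phonemes` alignment, and the helper `set_word_intensity` are hypothetical. The sketch sets one emotion's intensity for the selected words and for the phonemes aligned to them, and leaves the rest of the hierarchy unchanged.

```python
from copy import deepcopy

# Hypothetical emotion inventory; the page does not name the emotion classes.
EMOTIONS = ["angry", "happy", "sad", "surprise"]


def set_word_intensity(ed, word_to_phonemes, word_indices, emotion, value):
    """Override one emotion's intensity for selected words and their phonemes."""
    e = EMOTIONS.index(emotion)
    edited = deepcopy(ed)  # keep the original ED intact
    for w in word_indices:
        edited["word"][w][e] = value
        # Propagate the same value to every phoneme aligned to this word.
        for p in word_to_phonemes[w]:
            edited["phoneme"][p][e] = value
    return edited


# Toy ED for a three-word utterance with four phonemes.
ed = {
    "utterance": [0.1, 0.2, 0.6, 0.1],
    "word": [[0.1, 0.2, 0.6, 0.1] for _ in range(3)],
    "phoneme": [[0.1, 0.2, 0.6, 0.1] for _ in range(4)],
}
# Assumed alignment: word index -> indices of its phonemes.
word_to_phonemes = [[0], [1, 2], [3]]

edited = set_word_intensity(ed, word_to_phonemes, [1], "sad", 1.0)
print(edited["word"])
print(edited["phoneme"])
```

In this toy case only the middle word and its two phonemes receive the new "sad" intensity; the utterance-level entry and the other words keep their reference values.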
ID | 0014_000732 | | | | 0015_000726 | | | | 0013_000731 | | | | 0018_000037 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. |
Ground Truth | ||||||||||||||||
MsEmoTTS (Baseline) | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
Proposed w/ SER | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
Proposed w/ EPR | ||||||||||||||||
___OnlyModified___ |
ID | 0014_000732 | | | | 0015_000726 | | | | 0013_000731 | | | | 0018_000037 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. |
Ground Truth | ||||||||||||||||
MsEmoTTS (Baseline) | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
Proposed w/ SER | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
Proposed w/ EPR | ||||||||||||||||
___OnlyModified___ |
ID | 0014_000732 | | | | 0015_000726 | | | | 0013_000731 | | | | 0018_000037 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. |
Ground Truth | ||||||||||||||||
MsEmoTTS (Baseline) | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
Proposed w/ SER | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
Proposed w/ EPR | ||||||||||||||||
___OnlyModified___ |
ID | 0014_000732 | | | | 0015_000726 | | | | 0013_000731 | | | | 0018_000037 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Who is been repeating all that hard stuff to you? | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | Let's make the noise a snake. | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | All smile were real and the happier,the more sincere . | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. | I think it'll encourage me. |
Ground Truth | ||||||||||||||||
MsEmoTTS (Baseline) | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
SVM-based HED (Baseline) | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
Proposed w/ SER | ||||||||||||||||
___OnlyModified___ | ||||||||||||||||
Proposed w/ EPR | ||||||||||||||||
___OnlyModified___ |