Hierarchical Control of Emotion Rendering in Speech Synthesis


Sho Inoue1,2, Kun Zhou3, Shuai Wang2†, Haizhou Li1,2

1School of Data Science, 2Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China
3Alibaba Group, Singapore


[Figure: Overall diagram of the pipeline]

Abstract

Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a diffusion-based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. During inference, the TTS model not only generates emotional speech but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.

The project page for this paper is available here: GitHub Repository.

Section 1: Emotion Expressiveness: Reproducibility via Ground-Truth Emotion Information (Emotion Transfer)

This section presents a demo that evaluates how well our model reproduces the emotion of a reference utterance. The hierarchical ED of our proposed models and the emotion intensities of MsEmoTTS are obtained from the reference audio. We compare the following five samples:

  • Reference Speech: The reference audio
  • MsEmoTTS (Baseline)
  • SVM-based HED (Baseline): Hierarchical ED is obtained by SVM-based relative functions
  • Proposed w/ SER: Hierarchical ED is obtained by the proposed SER-based relative functions
  • Proposed w/ EPR: Hierarchical ED is obtained by the proposed EPR-based relative functions
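
To make the idea of a hierarchical ED obtained from reference audio more concrete, here is a minimal sketch. It assumes that phoneme and word boundaries are already known as frame ranges and that some scoring function maps a segment to per-emotion intensities. The names `extract_hierarchical_ed` and `toy_scorer` are hypothetical, and `toy_scorer` is only a placeholder; it is not the SVM-, SER-, or EPR-based relative function used in the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[int, int]                      # [start, end) frame indices
Intensities = Dict[str, float]                 # emotion name -> intensity
Scorer = Callable[[Sequence[float]], Intensities]


def extract_hierarchical_ed(
    frames: Sequence[float],
    phoneme_segments: List[Segment],
    word_segments: List[Segment],
    scorer: Scorer,
) -> Dict[str, List[Intensities]]:
    """Apply the same scorer to phoneme, word, and utterance segments."""

    def score(segments: List[Segment]) -> List[Intensities]:
        return [scorer(frames[start:end]) for start, end in segments]

    return {
        "phoneme": score(phoneme_segments),
        "word": score(word_segments),
        "utterance": score([(0, len(frames))]),  # the whole utterance is one segment
    }


def toy_scorer(segment: Sequence[float]) -> Intensities:
    """Placeholder scorer (an assumption): clipped mean of a 1-D feature."""
    if not segment:
        raise ValueError("segment must contain at least one frame")
    level = min(max(sum(segment) / len(segment), 0.0), 1.0)
    return {"Angry": level, "Happy": 1.0 - level}


# Made-up per-frame feature values and boundaries, for illustration only.
frames = [0.0, 0.5, 1.0, 1.0]
ed = extract_hierarchical_ed(
    frames,
    phoneme_segments=[(0, 1), (1, 2), (2, 4)],
    word_segments=[(0, 2), (2, 4)],
    scorer=toy_scorer,
)
```

In this sketch each level holds one intensity dictionary per segment, so the structure can be passed to a TTS model as conditioning or edited before synthesis.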

ID           Emotion   Text
0013_001077  Sad       Monster made a deep bow.
0019_001076  Sad       Let's make the noise a snake.
0011_000374  Angry     The football teams give a tea party.
0016_000740  Happy     In which fox loses a tail and its elder sister finds one.
0018_000729  Happy     Rat came and replied on the leaves.
0020_000393  Angry     Story twenty nine a boy and a monkey.
0012_001079  Sad       Rat came and replied on the leaves.
0018_000379  Angry     Rat came and replied on the leaves.
0016_000721  Happy     As rich-as Peter's son in law!
0020_000732  Happy     Who is been repeating all that hard stuff to you?
0016_000381  Angry     All smile were real and the happier the more sincere.
0020_000392  Angry     She may mind ye of her.
0018_001438  Surprise  Hold up my chin, slow and solid.
0012_000379  Angry     Rat came and replied on the leaves.
0011_001074  Sad       The football teams give a tea party.
0014_000021  Neutral   As rich-as Peter's son in law!
0013_000377  Angry     Monster made a deep bow.
0013_000729  Happy     Rat came and replied on the leaves.
0014_000391  Angry     I chose the right way.
0011_001429  Surprise  Rat came and replied on the leaves.

Audio rows: Reference Speech; MsEmoTTS (Baseline); SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR


Section 2: Utterance-level Emotion Intensity Control

In this section, we demonstrate control of utterance-level emotion intensity. We set the intensity of the target emotion to the values shown in the table header (0.0, 0.4, 0.6, and 1.0) while keeping the other emotion intensities constant. MsEmoTTS is omitted because it does not support this type of control.

  • Ground Truth: The ground-truth audio
  • SVM-based HED (Baseline): Hierarchical ED is obtained by SVM-based relative functions
  • Proposed w/ SER: Hierarchical ED is obtained by the proposed SER-based relative functions
  • Proposed w/ EPR: Hierarchical ED is obtained by the proposed EPR-based relative functions
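
The control described above can be sketched in a few lines of Python. This is an illustration only: it assumes the utterance-level ED can be represented as a mapping from emotion name to an intensity in [0, 1], and the function name `set_utterance_intensity` and the numbers in `base_ed` are hypothetical rather than taken from the paper's implementation.

```python
from typing import Dict


def set_utterance_intensity(
    ed: Dict[str, float], emotion: str, value: float
) -> Dict[str, float]:
    """Return a copy of an utterance-level ED with one emotion overridden."""
    if emotion not in ed:
        raise KeyError(f"unknown emotion: {emotion}")
    if not 0.0 <= value <= 1.0:
        raise ValueError("intensity is assumed to lie in [0, 1]")
    controlled = dict(ed)        # copy so the original ED is left untouched
    controlled[emotion] = value  # every other emotion keeps its intensity
    return controlled


# Hypothetical utterance-level ED; the numbers are made up for illustration.
base_ed = {"Angry": 0.2, "Happy": 0.1, "Sad": 0.5, "Surprise": 0.3}

# Sweep the same intensity values that appear in the demo tables.
sweep = [set_utterance_intensity(base_ed, "Angry", v) for v in (0.0, 0.4, 0.6, 1.0)]
```

Each element of `sweep` differs from `base_ed` only in the "Angry" entry, which mirrors how one emotion is varied per table while the others stay fixed.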




Utterance-level Control: Emotion: Angry


Intensity values per ID: 0.0, 0.4, 0.6, 1.0

ID           Text
0014_000732  Who is been repeating all that hard stuff to you?
0015_000726  Let's make the noise a snake.
0013_000731  All smile were real and the happier, the more sincere.
0018_000037  I think it'll encourage me.

Audio rows: Ground Truth; SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR





Utterance-level Control: Emotion: Happy


Intensity values per ID: 0.0, 0.4, 0.6, 1.0

ID           Text
0014_000732  Who is been repeating all that hard stuff to you?
0015_000726  Let's make the noise a snake.
0013_000731  All smile were real and the happier, the more sincere.
0018_000037  I think it'll encourage me.

Audio rows: Ground Truth; SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR





Utterance-level Control: Emotion: Sad


Intensity values per ID: 0.0, 0.4, 0.6, 1.0

ID           Text
0014_000732  Who is been repeating all that hard stuff to you?
0015_000726  Let's make the noise a snake.
0013_000731  All smile were real and the happier, the more sincere.
0018_000037  I think it'll encourage me.

Audio rows: Ground Truth; SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR





Utterance-level Control: Emotion: Surprise


Intensity values per ID: 0.0, 0.4, 0.6, 1.0

ID           Text
0014_000732  Who is been repeating all that hard stuff to you?
0015_000726  Let's make the noise a snake.
0013_000731  All smile were real and the happier, the more sincere.
0018_000037  I think it'll encourage me.

Audio rows: Ground Truth; SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR


Section 3: Word-level Emotion Intensity Control

In this section, we demonstrate control of emotion intensity within a specific segment of speech, primarily focusing on three words marked with underscores. We set the word-level emotion intensities of these words, and the corresponding phoneme-level intensities, to the values shown in the table header (0.0, 0.4, 0.6, and 1.0) while keeping the other emotion intensities constant. This showcases our model's ability to adjust emotion at a fine granularity within specified speech segments. We also present speech samples that contain only the modified parts ('___OnlyModified___').
The MsEmoTTS baseline can only change the intensity of the ground-truth emotion, so only one emotion control is available for it.

  • Ground Truth: The ground-truth audio
  • MsEmoTTS (Baseline): Only one emotion is available.
  • SVM-based HED (Baseline): Hierarchical ED is obtained by SVM-based relative functions
  • Proposed w/ SER: Hierarchical ED is obtained by the proposed SER-based relative functions
  • Proposed w/ EPR: Hierarchical ED is obtained by the proposed EPR-based relative functions
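
As a rough illustration of the word-level control described above, the sketch below overrides one emotion for selected words and for the phonemes that belong to them, leaving everything else unchanged. The `HierarchicalED` container, the `word_to_phonemes` mapping, and the function `set_word_intensity` are assumptions made for this example; they are not the data structures of the released code.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Dict, List, Sequence

Intensities = Dict[str, float]  # emotion name -> intensity, assumed in [0, 1]


@dataclass
class HierarchicalED:
    utterance: Intensities
    words: List[Intensities]           # one entry per word
    phonemes: List[Intensities]        # one entry per phoneme
    word_to_phonemes: List[List[int]]  # phoneme indices belonging to each word


def set_word_intensity(
    ed: HierarchicalED, word_indices: Sequence[int], emotion: str, value: float
) -> HierarchicalED:
    """Override one emotion for the chosen words and their phonemes."""
    out = deepcopy(ed)  # the input ED is not modified
    for w in word_indices:
        out.words[w][emotion] = value
        for p in out.word_to_phonemes[w]:
            out.phonemes[p][emotion] = value
    return out


def flat(v: float) -> Intensities:
    """Helper that fills every emotion with the same made-up intensity."""
    return {"Angry": v, "Happy": v, "Sad": v, "Surprise": v}


# Toy utterance with 3 words and 5 phonemes; all values are illustrative.
ed = HierarchicalED(
    utterance=flat(0.3),
    words=[flat(0.3) for _ in range(3)],
    phonemes=[flat(0.3) for _ in range(5)],
    word_to_phonemes=[[0, 1], [2], [3, 4]],
)
edited = set_word_intensity(ed, word_indices=[1, 2], emotion="Happy", value=1.0)
```

Only the "Happy" entries of words 1 and 2 and of phonemes 2 to 4 change in `edited`; the utterance-level entry and all other emotions are kept as they were.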




Word-level Control: Emotion: Angry


Intensity values per ID: 0.0, 0.4, 0.6, 1.0

ID           Text
0014_000732  Who is been repeating all that hard stuff to you?
0015_000726  Let's make the noise a snake.
0013_000731  All smile were real and the happier, the more sincere.
0018_000037  I think it'll encourage me.

Audio rows: Ground Truth; MsEmoTTS (Baseline); SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR
Each synthesized row is followed by an '___OnlyModified___' row.





Word-level Control: Emotion: Happy


Intensity values per ID: 0.0, 0.4, 0.6, 1.0

ID           Text
0014_000732  Who is been repeating all that hard stuff to you?
0015_000726  Let's make the noise a snake.
0013_000731  All smile were real and the happier, the more sincere.
0018_000037  I think it'll encourage me.

Audio rows: Ground Truth; MsEmoTTS (Baseline); SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR
Each synthesized row is followed by an '___OnlyModified___' row.





Word-level Control: Emotion: Sad


Intensity values per ID: 0.0, 0.4, 0.6, 1.0

ID           Text
0014_000732  Who is been repeating all that hard stuff to you?
0015_000726  Let's make the noise a snake.
0013_000731  All smile were real and the happier, the more sincere.
0018_000037  I think it'll encourage me.

Audio rows: Ground Truth; MsEmoTTS (Baseline); SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR
Each synthesized row is followed by an '___OnlyModified___' row.





Word-level Control: Emotion: Surprise


Intensity values per ID: 0.0, 0.4, 0.6, 1.0

ID           Text
0014_000732  Who is been repeating all that hard stuff to you?
0015_000726  Let's make the noise a snake.
0013_000731  All smile were real and the happier, the more sincere.
0018_000037  I think it'll encourage me.

Audio rows: Ground Truth; MsEmoTTS (Baseline); SVM-based HED (Baseline); Proposed w/ SER; Proposed w/ EPR
Each synthesized row is followed by an '___OnlyModified___' row.
