Audio Samples


This demo accompanies our paper, 'Fine-Grained Quantitative Emotion Editing for Speech Generation', in which we introduce a novel emotion representation called Hierarchical Emotion Distribution (Hierarchical ED). This representation facilitates modeling of emotion intensity at varying granularities. The audio files below showcase examples of our model's capabilities. More audio samples are available in the sections that follow.


Original Audio Edited Audio

Increase 'Happy' intensity of the first two words

Increase 'Sad' intensity of the first two words

Increase 'Surprise' intensity of the last two words

Increase 'Angry' intensity of the last two words

Utterance-level Control: 'Angry'



Section 1: Emotion Expressiveness: Reproducibility via Ground-Truth Emotion Information


The first section presents a demo to evaluate our model's reproducibility. Both the Hierarchical ED of our model and the emotion intensities of MsEmoTTS are obtained from the reference audio.


Target Audio MsEmoTTS Hierarchical ED (ours)


Section 2: Maximizing Emotion Intensities of Two Long Words


In this section, we demonstrate control of emotion intensities in a specific segment of speech, focusing on the two words with the longest phoneme sequences. We raise the word-level emotion intensities of these words, and the corresponding phoneme-level intensities, to 1.0, while keeping the other emotion intensities unchanged. This showcases our model's ability to adjust emotion at a fine granularity within specified speech segments. The modified words are enclosed in curly brackets. The MsEmoTTS baseline can only change the intensity of the ground-truth emotion, so only one emotion control is available for it.
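The editing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our model's actual code: the dictionary layout, the helper names `make_word` and `maximize_longest_words`, the placeholder base intensity of 0.2, and the hand-written ARPAbet-style phoneme lists are all hypothetical. The sketch picks the two words with the most phonemes, sets the chosen emotion to 1.0 at the word level and on every phoneme of those words, and leaves every other value untouched.

```python
import copy

EMOTIONS = ("Angry", "Happy", "Sad", "Surprise")


def make_word(text, phonemes, base=0.2):
    """Build a hypothetical word record with placeholder intensities."""
    return {
        "text": text,
        "phonemes": list(phonemes),
        # One intensity per emotion at the word level.
        "word_intensity": {e: base for e in EMOTIONS},
        # One intensity per emotion for each phoneme of the word.
        "phoneme_intensity": [{e: base for e in EMOTIONS} for _ in phonemes],
    }


def maximize_longest_words(words, emotion, k=2, value=1.0):
    """Return an edited copy of `words` and the indices of the edited words.

    The k words with the longest phoneme sequences get `emotion` set to
    `value` at the word level and on each of their phonemes. Ties are
    broken in favour of the earlier word (an assumption of this sketch).
    """
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion}")
    edited = copy.deepcopy(words)  # keep the original intensities intact
    ranked = sorted(range(len(edited)),
                    key=lambda i: (-len(edited[i]["phonemes"]), i))
    targets = sorted(ranked[:k])
    for i in targets:
        edited[i]["word_intensity"][emotion] = value
        for phoneme in edited[i]["phoneme_intensity"]:
            phoneme[emotion] = value
    return edited, targets


# Illustrative phoneme lists, written by hand for this example.
utterance = [
    make_word("All", ["AO", "L"]),
    make_word("the", ["DH", "AH"]),
    make_word("girls", ["G", "ER", "L", "Z"]),
    make_word("traipsing", ["T", "R", "EY", "P", "S", "IH", "NG"]),
    make_word("off", ["AO", "F"]),
    make_word("to", ["T", "UW"]),
    make_word("Sheffield", ["SH", "EH", "F", "IY", "L", "D"]),
    make_word("every", ["EH", "V", "R", "IY"]),
    make_word("day!", ["D", "EY"]),
]

edited, targets = maximize_longest_words(utterance, "Happy")
marked = " ".join("{" + w["text"] + "}" if i in targets else w["text"]
                  for i, w in enumerate(edited))
print(marked)
```

With these illustrative phoneme lists, the two selected words are the ones marked in the first transcript of the tables below.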


Model: MsEmoTTS (baseline)


Original Angry Happy Sad Surprise

All the girls {traipsing} off to {Sheffield} every day!

---

---

---

And as to {smaller-sized} rooms than I had been used {to,} I really could not give it a thought.

---

---

---

"This man is {almost} too {gallant} to be in love," thought Emma.

---

---

---

This {violence} is all most {repulsive:"} and so, no doubt, she felt it."

---

---

---

He paused {before} he {answered:} 'Neither, I hope.

---

---

---

it is {disposed} to impart all that the brain {conceives;}

---

---

---

Clifford felt his father was a {hopeless} {anachronism.}

---

---

---


Model: Hierarchical ED (ours)


Original Angry Happy Sad Surprise

All the girls {traipsing} off to {Sheffield} every day!

And as to {smaller-sized} rooms than I had been used {to,} I really could not give it a thought.

"This man is {almost} too {gallant} to be in love," thought Emma.

This {violence} is all most {repulsive:"} and so, no doubt, she felt it."

He paused {before} he {answered:} 'Neither, I hope.

it is {disposed} to impart all that the brain {conceives;}

Clifford felt his father was a {hopeless} {anachronism.}



Section 3: Maximizing Utterance-level Emotion Intensities (Hierarchical ED)


In this section, we demonstrate control of utterance-level emotion intensities. We set the intensity of the target emotion to 1.0 while holding the intensities of the other emotions at 0.0. We present only our model's demo, since MsEmoTTS does not support this type of control.
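As a minimal sketch of this setting, the utterance-level intensities can be viewed as a one-hot vector over the four emotions. The function name `one_hot_utterance_intensity` and the dictionary representation are hypothetical and chosen only for illustration. How the word-level and phoneme-level intensities are handled in this setting is not covered by the sketch.

```python
EMOTIONS = ("Angry", "Happy", "Sad", "Surprise")


def one_hot_utterance_intensity(target):
    """Return utterance-level intensities with `target` at 1.0, others at 0.0."""
    if target not in EMOTIONS:
        raise ValueError(f"unknown emotion: {target}")
    return {e: (1.0 if e == target else 0.0) for e in EMOTIONS}


# One setting per column of the table below.
settings = {e: one_hot_utterance_intensity(e) for e in EMOTIONS}
for emotion, intensities in settings.items():
    print(emotion, intensities)
```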


Angry Happy Sad Surprise

All the girls traipsing off to Sheffield every day!

And as to smaller-sized rooms than I had been used to, I really could not give it a thought.

"This man is almost too gallant to be in love," thought Emma.

This violence is all most repulsive:" and so, no doubt, she felt it."

He paused before he answered: 'Neither, I hope.

it is disposed to impart all that the brain conceives;

Clifford felt his father was a hopeless anachronism.



Appendix: Emotion Expressiveness: Reproducibility via Predicted Emotion Information


The appendix presents a demo of our model's reproducibility without using the reference audio. Both the Hierarchical ED and the emotion intensities of MsEmoTTS are predicted from the linguistic embedding generated by the linguistic encoder, and are then used to reproduce the audio.


Target Audio MsEmoTTS Hierarchical ED (ours)