Audio Samples


This demo accompanies our paper, 'Hierarchical Emotion Prediction and Control in Text-to-speech Synthesis', in which we introduce a novel emotion representation called Hierarchical Emotion Distribution (Hierarchical ED). This representation supports modeling emotion intensity at several levels of granularity. The following audio files showcase examples of our model's capabilities. More audio samples are available in the sections below.


Dataset: Blizzard Challenge 2013


Original Audio Edited Audio

Increase 'Happy' intensity of the first two words (Blizzard)

Increase 'Sad' intensity of the last two words (Blizzard)

Gradually Increase 'Surprise' intensity throughout a speech (Blizzard)

Gradually Decrease 'Angry' intensity throughout a speech (Blizzard)


Dataset: Emotional Speech Database (ESD)


Original Audio Edited Audio

Increase 'Happy' intensity of the first two words (ESD)

Increase 'Sad' intensity of the last two words (ESD)

Gradually Increase 'Surprise' intensity throughout a speech (ESD)

Gradually Decrease 'Angry' intensity throughout a speech (ESD)



Section 1: Emotion Expressiveness: Reproducibility


This first section presents a demo for evaluating our model's reproducibility. The Hierarchical ED is predicted from the linguistic embedding produced by the linguistic encoder and is then used to reproduce the audio. FastSpeech2 serves as the baseline. The column labels use the following terms: (1) ED Predictor: integration of the hierarchical ED predictor; (2) BERT: replacement of the encoder with a BERT-based encoder. 'w/o' marks a variant of the proposed model without that component.


Target Audio | FastSpeech2 | Proposed w/o ED Predictor | Proposed w/o BERT | Proposed


Section 2: Maximizing Emotion Intensities of Two Long Words


In this section, we demonstrate control of emotion intensity within a specific segment of speech, focusing on the two words with the longest phoneme sequences. We raise the word-level emotion intensities of these words, together with the corresponding phoneme-level intensities, to 1.0 while keeping all other emotion intensities unchanged. This showcases our model's ability to adjust emotion at a fine granularity within specified speech segments. The modified words are enclosed in curly brackets.
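The edit described in this section can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the code used in the paper: the data layout (one dictionary of intensities per word and per phoneme), the `EMOTIONS` list, and the function name `maximize_two_longest_words` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: one intensity per emotion, per word and per phoneme.
EMOTIONS = ["Angry", "Happy", "Sad", "Surprise"]


def maximize_two_longest_words(
    word_ed: List[Dict[str, float]],
    phoneme_ed: List[List[Dict[str, float]]],
    emotion: str,
) -> List[int]:
    """Set `emotion` to 1.0 for the two words with the most phonemes.

    word_ed[i] holds the intensities of word i; phoneme_ed[i][j] holds those
    of phoneme j in word i. Both are edited in place, and every other value
    is left untouched. Returns the modified word indices in utterance order.
    """
    # Rank word indices by phoneme count, longest first.
    # sorted() is stable, so ties keep the earlier word.
    ranked = sorted(range(len(phoneme_ed)), key=lambda i: -len(phoneme_ed[i]))
    targets = sorted(ranked[:2])
    for i in targets:
        word_ed[i][emotion] = 1.0
        for phoneme in phoneme_ed[i]:
            phoneme[emotion] = 1.0
    return targets


# Toy utterance with four words of 2, 5, 1 and 4 phonemes (made-up values).
phoneme_counts = [2, 5, 1, 4]
word_ed = [{e: 0.2 for e in EMOTIONS} for _ in phoneme_counts]
phoneme_ed = [[{e: 0.2 for e in EMOTIONS} for _ in range(n)] for n in phoneme_counts]

print(maximize_two_longest_words(word_ed, phoneme_ed, "Happy"))  # [1, 3]
```

Utterance-level intensities are not part of this sketch, which matches the description above: only the selected emotion of the two selected words and their phonemes changes.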


Dataset: Blizzard Challenge 2013


Original Angry Happy Sad Surprise

and what the {bourgeois} is, isn't quite {defined.}

"This man is {almost} too {gallant} to be in love," thought Emma.

All the girls {traipsing} off to {Sheffield} every day!

{Holding} my hand in both his own, he {chafed} it;

Harriet Smith {refuse} Robert {Martin?}


Dataset: Emotional Speech Database (ESD)


Original Angry Happy Sad Surprise

A {divine} wrath made her blue eyes {awful.}

How I hate this {foul} {pool!}

Who is been {repeating} all that hard {stuff} to you?

She had said, so that one could {keep} up a {conversation!}

{Let's} make the noise a {snake.}



Section 3: Gradually Increasing Emotion Intensity Throughout a Speech


In this section, we demonstrate our model's capability to gradually INCREASE the intensity of one selected emotion throughout a speech. We keep the utterance-level intensities unchanged, along with the word- and phoneme-level intensities of the non-selected emotions.
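One way to picture a gradual increase is a ramp over the word positions of the utterance. The sketch below is an assumption-laden illustration rather than the paper's procedure: the linear schedule, the 0.0 to 1.0 endpoints, the per-word dictionary layout, and the names `linear_ramp` and `apply_increasing_ramp` are all hypothetical.

```python
from typing import Dict, List


def linear_ramp(num_words: int, start: float, end: float) -> List[float]:
    """Return one value per word, moving linearly from `start` to `end`."""
    if num_words <= 0:
        return []
    if num_words == 1:
        return [end]
    step = (end - start) / (num_words - 1)
    return [start + step * i for i in range(num_words)]


def apply_increasing_ramp(
    word_ed: List[Dict[str, float]],
    phoneme_ed: List[List[Dict[str, float]]],
    emotion: str,
    start: float = 0.0,  # assumed lower bound of the intensity range
    end: float = 1.0,
) -> None:
    """Overwrite `emotion` with a rising ramp, in place.

    Each phoneme takes the value of the word it belongs to (an assumption).
    Other emotions are left untouched.
    """
    for i, value in enumerate(linear_ramp(len(word_ed), start, end)):
        word_ed[i][emotion] = value
        for phoneme in phoneme_ed[i]:
            phoneme[emotion] = value


# Toy utterance: five words with two phonemes each (made-up values).
word_ed = [{"Surprise": 0.3, "Sad": 0.1} for _ in range(5)]
phoneme_ed = [[{"Surprise": 0.3, "Sad": 0.1} for _ in range(2)] for _ in range(5)]
apply_increasing_ramp(word_ed, phoneme_ed, "Surprise")
print([w["Surprise"] for w in word_ed])  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Utterance-level intensities are deliberately absent from the sketch, since the description above keeps them fixed.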


Dataset: Blizzard Challenge 2013


Angry Happy Sad Surprise

Dataset: Emotional Speech Database (ESD)


Angry Happy Sad Surprise


Section 4: Gradually Decreasing Emotion Intensity Throughout a Speech


In this section, we demonstrate our model's capability to gradually DECREASE the intensity of one selected emotion throughout a speech. We keep the utterance-level intensities unchanged, along with the word- and phoneme-level intensities of the non-selected emotions.
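A gradual decrease can be pictured as a falling schedule over word positions. The short sketch below is illustrative only and rests on our own assumptions: a linear fall from 1.0 to 0.0, a per-word dictionary layout, phonemes inheriting their word's value, and the hypothetical function name `apply_decreasing_ramp`.

```python
from typing import Dict, List


def apply_decreasing_ramp(
    word_ed: List[Dict[str, float]],
    phoneme_ed: List[List[Dict[str, float]]],
    emotion: str,
    high: float = 1.0,
    low: float = 0.0,  # assumed lower bound of the intensity range
) -> None:
    """Overwrite `emotion` with a linearly falling value per word, in place."""
    n = len(word_ed)
    for i in range(n):
        # The first word gets `high`, the last gets `low`; a single word gets `high`.
        value = high if n == 1 else high - (high - low) * i / (n - 1)
        word_ed[i][emotion] = value
        for phoneme in phoneme_ed[i]:
            phoneme[emotion] = value


# Toy utterance: three words with one phoneme each (made-up values).
word_ed = [{"Angry": 0.6, "Happy": 0.2} for _ in range(3)]
phoneme_ed = [[{"Angry": 0.6, "Happy": 0.2}] for _ in range(3)]
apply_decreasing_ramp(word_ed, phoneme_ed, "Angry")
print([w["Angry"] for w in word_ed])  # [1.0, 0.5, 0.0]
print([w["Happy"] for w in word_ed])  # [0.2, 0.2, 0.2]
```

The second printout shows that the non-selected emotion is unchanged, mirroring the constraint stated above.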


Dataset: Blizzard Challenge 2013


Angry Happy Sad Surprise

Dataset: Emotional Speech Database (ESD)


Angry Happy Sad Surprise