Audio Samples

This demo accompanies our paper, 'Hierarchical Emotion Prediction and Control in Text-to-speech Synthesis', in which we introduce a novel emotion representation called Hierarchical Emotion Distribution (Hierarchical ED). This representation facilitates modeling of emotion intensity at varying granularities. The following audio files showcase examples of our model's capabilities. More audio samples are available in the sections below.

Dataset: Blizzard Challenge 2013

		Original Audio	Edited Audio
Increase 'Happy' intensity of the first two words (Blizzard)
Increase 'Sad' intensity of the last two words (Blizzard)
Gradually increase 'Surprise' intensity throughout an utterance (Blizzard)
Gradually decrease 'Angry' intensity throughout an utterance (Blizzard)

Dataset: Emotional Speech Database (ESD)

		Original Audio	Edited Audio
Increase 'Happy' intensity of the first two words (ESD)
Increase 'Sad' intensity of the last two words (ESD)
Gradually increase 'Surprise' intensity throughout an utterance (ESD)
Gradually decrease 'Angry' intensity throughout an utterance (ESD)

Section 1: Emotion Expressiveness: Reproducibility

The first section evaluates how well our model reproduces the emotion expressiveness of the target audio. The Hierarchical ED is predicted from the linguistic embedding generated by the linguistic encoder and is then used to reproduce the audio. We selected FastSpeech2 as the baseline. The column labels use the following terms: (1) ED Predictor: integration of the hierarchical ED predictor; (2) BERT: replacement with a BERT-based encoder. 'w/o' marks a variant of the proposed model with that component removed.

Target Audio	FastSpeech2	Proposed w/o ED Predictor	Proposed w/o BERT	Proposed

Section 2: Maximizing Emotion Intensities of Two Long Words

In this section, we demonstrate the control of emotion intensities in a specific segment of speech, focusing on the two words with the longest phoneme sequences. We raise the word-level emotion intensities of these words, and the corresponding phoneme-level intensities, to 1.0, while keeping all other emotion intensities constant. This showcases our model's ability to adjust emotion at a fine granularity within specified speech segments. The modified words are shown in curly brackets.
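The edit described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the dictionary layout (`word_ed`, `phoneme_ed`), the helper `make_word`, and the function `maximize_two_longest_words` are hypothetical names introduced here, and the base intensity of 0.2 is arbitrary toy data.

```python
from copy import deepcopy

# Emotion labels taken from the table columns; the lowercase keys are an assumption.
EMOTIONS = ("angry", "happy", "sad", "surprise")


def make_word(text, n_phonemes, base=0.2):
    """Build a toy word entry with a flat intensity `base` for every emotion."""
    return {
        "text": text,
        "word_ed": {e: base for e in EMOTIONS},
        "phoneme_ed": [{e: base for e in EMOTIONS} for _ in range(n_phonemes)],
    }


def maximize_two_longest_words(words, emotion, value=1.0):
    """Return an edited copy of `words` and the indices of the edited words.

    The two words with the most phonemes get `emotion` set to `value` at the
    word level and at every one of their phonemes. Everything else is kept.
    """
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion}")
    edited = deepcopy(words)  # leave the caller's data untouched
    # Stable sort: ties in phoneme count go to the earlier word.
    by_length = sorted(range(len(edited)),
                       key=lambda i: -len(edited[i]["phoneme_ed"]))
    targets = sorted(by_length[:2])
    for i in targets:
        edited[i]["word_ed"][emotion] = value
        for phoneme in edited[i]["phoneme_ed"]:
            phoneme[emotion] = value
    return edited, targets


# Toy utterance; the phoneme counts are made up for illustration.
words = [make_word("keep", 3), make_word("up", 2),
         make_word("a", 1), make_word("conversation", 10)]
edited, targets = maximize_two_longest_words(words, "happy")
print([edited[i]["text"] for i in targets])
```

Returning a copy rather than editing in place keeps the original intensities available for comparison with the edited ones.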

Dataset: Blizzard Challenge 2013

				Original	Angry	Happy	Sad	Surprise
and what the {bourgeois} is, isn't quite {defined.}
"This man is {almost} too {gallant} to be in love," thought Emma.
All the girls {traipsing} off to {Sheffield} every day!
{Holding} my hand in both his own, he {chafed} it;
Harriet Smith {refuse} Robert {Martin?}

Dataset: Emotional Speech Database (ESD)

				Original	Angry	Happy	Sad	Surprise
A {divine} wrath made her blue eyes {awful.}
How I hate this {foul} {pool!}
Who is been {repeating} all that hard {stuff} to you?
She had said, so that one could {keep} up a {conversation!}
{Let's} make the noise a {snake.}

Section 3: Gradually Increasing Emotion Intensity throughout an Utterance

In this section, we exhibit our model's capability to gradually INCREASE the intensity of one selected emotion throughout an utterance. We keep the utterance-level intensities unchanged, as well as the word- and phoneme-level intensities of the non-selected emotions.
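One simple way to picture such a gradual change is a linear ramp over the word positions. The sketch below assumes a linear schedule and a list of per-word intensity dictionaries; the page does not state the schedule actually used, and `linear_ramp` and `apply_ramp` are hypothetical names introduced for illustration.

```python
def linear_ramp(n, start, end):
    """Return n values spaced evenly from start to end (inclusive)."""
    if n <= 0:
        return []
    if n == 1:
        return [end]
    step = (end - start) / (n - 1)
    return [start + step * i for i in range(n)]


def apply_ramp(word_ed, emotion, start=0.0, end=1.0):
    """Return a copy of the per-word intensities with `emotion` ramped.

    Intensities of all other emotions are copied over unchanged.
    """
    values = linear_ramp(len(word_ed), start, end)
    return [{**word, emotion: value} for word, value in zip(word_ed, values)]


# Toy word-level intensities for a five-word utterance (made-up numbers).
word_ed = [{"angry": 0.5, "surprise": 0.5} for _ in range(5)]
rising = apply_ramp(word_ed, "surprise")
print([w["surprise"] for w in rising])  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Phoneme-level intensities could be handled the same way by ramping over phoneme positions instead of word positions. Swapping `start` and `end` gives a falling ramp, as in Section 4.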

Dataset: Blizzard Challenge 2013

Angry	Happy	Sad	Surprise

Dataset: Emotional Speech Database (ESD)

Angry	Happy	Sad	Surprise

Section 4: Gradually Decreasing Emotion Intensity throughout an Utterance

In this section, we exhibit our model's capability to gradually DECREASE the intensity of one selected emotion throughout an utterance. We keep the utterance-level intensities unchanged, as well as the word- and phoneme-level intensities of the non-selected emotions.