This demo accompanies our paper, 'Hierarchical Emotion Prediction and Control in Text-to-speech Synthesis', in which we introduce a novel emotion representation called Hierarchical Emotion Distribution (Hierarchical ED). This representation makes it possible to model emotion intensity at several levels of granularity. The following audio files showcase examples of our model's capabilities. You can listen to more audio samples in the sections that follow.
Dataset: Blizzard Challenge 2013
Original Audio
Edited Audio
Increase 'Happy' intensity of the first two words (Blizzard)
Increase 'Sad' intensity of the last two words (Blizzard)
Gradually Increase 'Surprise' intensity throughout a speech (Blizzard)
Gradually Decrease 'Angry' intensity throughout a speech (Blizzard)
Dataset: Emotional Speech Database (ESD)
Original Audio
Edited Audio
Increase 'Happy' intensity of the first two words (ESD)
Increase 'Sad' intensity of the last two words (ESD)
Gradually Increase 'Surprise' intensity throughout a speech (ESD)
Gradually Decrease 'Angry' intensity throughout a speech (ESD)
The first section presents a demo that evaluates how well our model reproduces the target audio. The Hierarchical ED is predicted from the linguistic embedding generated by the linguistic encoder and is then used to reproduce the audio. We selected FastSpeech2 as the baseline. The system names below use the following labels: (1) ED Predictor: integration of the hierarchical ED predictor; (2) BERT: replacement with a BERT-based encoder.
Target Audio
FastSpeech2
Proposed w/o ED Predictor
Proposed w/o BERT
Proposed
Section 2: Maximizing Emotion Intensities of Two Long Words
In this section, we demonstrate the control of emotion intensities in a specific segment of speech, focusing on the two words with the longest phoneme sequences. We raise the word-level emotion intensities of these words, and the corresponding phoneme-level intensities, to 1.0, while keeping the other emotion intensities constant. This showcases our model's ability to adjust emotion at a fine granularity within specified speech segments. The modified words are marked with curly brackets.
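The editing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data layout (one intensity dictionary per word and per phoneme), the function names, the starting value of 0.2, the tie-breaking rule, and the hand-written ARPAbet-style phoneme lists are all assumptions made for the example.

```python
from copy import deepcopy

EMOTIONS = ("Angry", "Happy", "Sad", "Surprise")


def two_longest_words(phonemes_per_word):
    """Indices of the two words with the most phonemes, in sentence order."""
    # Sort indices by descending phoneme count. Ties go to the earlier word
    # because sorted() is stable (a tie-breaking rule assumed for this sketch).
    ranked = sorted(range(len(phonemes_per_word)),
                    key=lambda i: -len(phonemes_per_word[i]))
    return sorted(ranked[:2])


def maximize_emotion(phonemes_per_word, word_ed, phone_ed, emotion):
    """Set `emotion` to 1.0 on the two longest words and on their phonemes.

    All other intensities are left as they were. Inputs are not mutated.
    """
    targets = two_longest_words(phonemes_per_word)
    new_word_ed, new_phone_ed = deepcopy(word_ed), deepcopy(phone_ed)
    for i in targets:
        new_word_ed[i][emotion] = 1.0
        for phone_intensity in new_phone_ed[i]:
            phone_intensity[emotion] = 1.0
    return targets, new_word_ed, new_phone_ed


# Toy input: phoneme lists are approximate and written by hand for illustration.
words = ["A", "divine", "wrath", "made", "her", "blue", "eyes", "awful"]
phonemes = [["AH"], ["D", "IH", "V", "AY", "N"], ["R", "AE", "TH"],
            ["M", "EY", "D"], ["HH", "ER"], ["B", "L", "UW"],
            ["AY", "Z"], ["AO", "F", "AH", "L"]]
# Hypothetical starting intensities: 0.2 for every emotion at every level.
word_ed = [{e: 0.2 for e in EMOTIONS} for _ in words]
phone_ed = [[{e: 0.2 for e in EMOTIONS} for _ in ph] for ph in phonemes]

targets, new_word_ed, new_phone_ed = maximize_emotion(
    phonemes, word_ed, phone_ed, "Happy")
print([words[i] for i in targets])  # ['divine', 'awful']
```

Returning edited copies rather than modifying the inputs keeps the original intensities available, which is convenient when the same utterance is rendered once per emotion.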
Dataset: Blizzard Challenge 2013
Original
Angry
Happy
Sad
Surprise
and what the {bourgeois} is, isn't quite {defined.}
"This man is {almost} too {gallant} to be in love," thought Emma.
All the girls {traipsing} off to {Sheffield} every day!
{Holding} my hand in both his own, he {chafed} it;
Harriet Smith {refuse} Robert {Martin?}
Dataset: Emotional Speech Database (ESD)
Original
Angry
Happy
Sad
Surprise
A {divine} wrath made her blue eyes {awful.}
How I hate this {foul} {pool!}
Who is been {repeating} all that hard {stuff} to you?
She had said, so that one could {keep} up a {conversation!}
{Let's} make the noise a {snake.}
Section 3: Gradually Increasing Emotion Intensity throughout a speech
In this section, we exhibit our model's capability to gradually INCREASE the intensity of one selected emotion over the course of an utterance. We keep the utterance-level intensities and the word- and phoneme-level intensities of the non-selected emotions unchanged.
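One way to picture a gradual increase is a ramp over the word positions. The sketch below assumes a linear ramp from 0.0 to 1.0 and assumes that each phoneme takes the value of its word; the page does not state the actual schedule, so both choices, along with the function names and data layout, are hypothetical.

```python
def linear_ramp(n, start, end):
    """Return n evenly spaced values from start to end, both inclusive."""
    if n == 1:
        return [end]
    step = (end - start) / (n - 1)
    return [start + step * i for i in range(n)]


def apply_word_ramp(word_ed, phone_ed, emotion, start=0.0, end=1.0):
    """Overwrite `emotion` with a ramp across words; other emotions are kept.

    word_ed:  one {emotion: intensity} dict per word
    phone_ed: for each word, one {emotion: intensity} dict per phoneme
    """
    ramp = linear_ramp(len(word_ed), start, end)
    # Shallow-copy each dict so the caller's data is not modified.
    new_word_ed = [dict(w) for w in word_ed]
    new_phone_ed = [[dict(p) for p in phones] for phones in phone_ed]
    for i, value in enumerate(ramp):
        new_word_ed[i][emotion] = value
        for phone_intensity in new_phone_ed[i]:
            # Assumption: phonemes inherit the intensity of their word.
            phone_intensity[emotion] = value
    return new_word_ed, new_phone_ed


# Toy utterance of five words with hypothetical phoneme counts.
phone_counts = [2, 1, 3, 2, 2]
word_ed = [{"Surprise": 0.3, "Sad": 0.1} for _ in phone_counts]
phone_ed = [[{"Surprise": 0.3, "Sad": 0.1} for _ in range(c)]
            for c in phone_counts]

ramped_word_ed, ramped_phone_ed = apply_word_ramp(word_ed, phone_ed, "Surprise")
print([w["Surprise"] for w in ramped_word_ed])  # [0.0, 0.25, 0.5, 0.75, 1.0]
```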
Dataset: Blizzard Challenge 2013
Angry
Happy
Sad
Surprise
Dataset: Emotional Speech Database (ESD)
Angry
Happy
Sad
Surprise
Section 4: Gradually Decreasing Emotion Intensity throughout a speech
In this section, we exhibit our model's capability to gradually DECREASE the intensity of one selected emotion over the course of an utterance. We keep the utterance-level intensities and the word- and phoneme-level intensities of the non-selected emotions unchanged.
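A decreasing schedule can also be defined at the phoneme level, which gives a smoother descent inside long words. The sketch below is an assumption-laden illustration rather than the method used for these samples: it ramps linearly from 1.0 to 0.0 over the flat phoneme sequence and sets each word-level intensity to the mean of its phonemes. The function name and the averaging rule are hypothetical.

```python
def decreasing_phoneme_ramp(phone_counts, start=1.0, end=0.0):
    """Build word- and phoneme-level intensities that fall from start to end.

    phone_counts: number of phonemes in each word (each assumed to be >= 1).
    Returns (word_levels, phone_levels) for the single selected emotion.
    """
    total = sum(phone_counts)
    if total > 1:
        flat = [start + (end - start) * k / (total - 1) for k in range(total)]
    else:
        flat = [start] * total
    word_levels, phone_levels = [], []
    position = 0
    for count in phone_counts:
        segment = flat[position:position + count]
        phone_levels.append(segment)
        # Assumption: the word-level intensity is the mean of its phonemes.
        word_levels.append(sum(segment) / count)
        position += count
    return word_levels, phone_levels


# Three words with 2, 1 and 2 phonemes (hypothetical counts).
word_levels, phone_levels = decreasing_phoneme_ramp([2, 1, 2])
print(phone_levels)  # [[1.0, 0.75], [0.5], [0.25, 0.0]]
print(word_levels)   # [0.875, 0.5, 0.125]
```

These values would replace only the selected emotion's entries; the intensities of the other emotions stay as they are.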