Audio samples from "Predicting Expressive Speaking Style From Text in End-to-End Speech Synthesis"


Authors: Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

Abstract: Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as "virtual" speaking style labels within Tacotron. TP-GST learns to predict stylistic renderings from text alone, requiring neither explicit labels during training, nor auxiliary inputs for inference. We show that, when trained on an expressive speech dataset, our system can render text with more pitch and energy variation than two state-of-the-art baseline models. We further demonstrate that TP-GSTs can synthesize speech with background noise removed, and corroborate these analyses with positive results on human-rated listener preference audiobook tasks. Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style.

Text: And I have found both freedom of loneliness and the safety from being understood, for those who understand us enslave something in us.

The following samples compare synthesis from a baseline Tacotron vs a system synthesizing with a text-predicted style based on GST combination weights in inference mode ("TPCW-GST"). Note how the text-prediction system often leads to clearer, more expressive speech.

These samples were generated using a WaveNet vocoder to illustrate the maximum spectral quality possible in both systems.

Text: And without a backward glance at Harry, Filch ran flat-footed from the office, Mrs. Norris streaking alongside him. Peeves was the school poltergeist, a grinning, airborne menace who lived to cause havoc and distress.

Text: "I expect they've let it rot to give it a stronger flavor," said Hermione knowledgeably, pinching her nose and leaning closer to look at the putrid haggis. "Can we move? I feel sick," said Ron.

Text: There had been a flying motorcycle in it. He had a funny feeling he'd had the same dream before. His aunt was back outside the door. "Are you up yet?" she demanded.

Text: Uncle Vernon now came in, smiling jovially as he shut the door. "Tea, Marge?" he said. "And what will Ripper take?"

Text: "Have you - did you read -?" he sputtered. "No," Harry lied quickly. Filch's knobbly hands were twisting together.

Text: "This is boring," Dudley moaned. He shuffled away. Harry moved in front of the tank and looked intently at the snake.

The following samples compare synthesis from a baseline Tacotron vs a system using a text-predicted style in inference mode ("TPSE-GST"). This system predicts a style embedding directly; anecdotally, it results in even clearer, more expressive speech than a TPCW-GST system.
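
The difference between the two text-prediction pathways can be illustrated with a toy sketch. It is not the paper's implementation: the token values, the dimensions, the single softmax over tokens, and the function names `tpcw_style_embedding` and `tpse_style_embedding` are all assumptions made for illustration, and the text-side predictor that would produce the logits or the embedding is left out entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tpcw_style_embedding(predicted_logits, style_tokens):
    """TPCW-style path (assumed form): logits predicted from text are
    normalized into combination weights, and the style embedding is the
    weighted sum of the learned style tokens."""
    weights = softmax(predicted_logits)
    dim = len(style_tokens[0])
    return [
        sum(w * token[d] for w, token in zip(weights, style_tokens))
        for d in range(dim)
    ]


def tpse_style_embedding(predicted_embedding):
    """TPSE-style path (assumed form): the vector predicted from text is
    used as the style embedding directly, with no reference to the tokens."""
    return list(predicted_embedding)


# Hypothetical toy values: three 2-dimensional style tokens.
style_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Equal logits give equal weights, so the result is the mean of the tokens.
tpcw = tpcw_style_embedding([0.0, 0.0, 0.0], style_tokens)

# A directly predicted embedding is passed through unchanged.
tpse = tpse_style_embedding([0.25, -0.5])
```

In this sketch the TPCW output is constrained to a weighted mix of the tokens, while the TPSE output can be any vector of the right size.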

These samples were generated using a WaveNet vocoder to illustrate the maximum spectral quality possible in both systems.

Text: "Thirty-six," he said, looking up at his mother and father. "That's two less than last year." "Darling, you haven't counted Auntie Marge's present, see, it's here under this big one from Mommy and Daddy."

Text: Harry sat up and gasped; the glass front of the boa constrictor's tank had vanished. The great snake was uncoiling itself rapidly, slithering out onto the floor. People throughout the reptile house screamed and started running for the exits.

Text: But nobody heard much more. Sir Patrick and the rest of the Headless Hunt had just started a game of Head Hockey and the crowd were turning to watch. Nearly Headless Nick tried vainly to recapture his audience, but gave up as Sir Patrick's head went sailing past him to loud cheers.

Text: "Harry, what was that all about?" said Ron, wiping sweat off his face. "I couldn't hear anything-- ." But Hermione gave a sudden gasp, pointing down the corridor. "Look!"

Text: Go with Errol. Ron'll look after you. I'll write him a note, explaining. And don't look at me like that" - Hedwig's large amber eyes were reproachful - "it's not my fault. It's the only way I'll be allowed to visit Hogsmeade with Ron and Hermione."

Text: "Do something about your hair!" Aunt Petunia snapped as he reached the hall. Harry couldn't see the point of trying to make his hair lie flat. Aunt Marge loved criticizing him, so the untidier he looked, the happier she would be.

These samples refer to Section 4.1.4 of our paper, "Automatic Denoising". About 10% of the recordings used to train the models that synthesized the output below contain some high-frequency background noise. The samples below show that while the baseline Tacotron model reproduces this noise, the TP-GST model removes it without using any supervision.

These samples were generated using a WaveNet vocoder to illustrate the maximum spectral quality possible in both systems.

Text: When he was dressed he went down the hall into the kitchen. The table was almost hidden beneath all Dudley's birthday presents. It looked as though Dudley had gotten the new computer he wanted, not to mention the second television and the racing bike.

Text: "Thirty-six," he said, looking up at his mother and father. "That's two less than last year." "Darling, you haven't counted Auntie Marge's present, see, it's here under this big one from Mommy and Daddy."

Text: With a quick glance at the door to check that Filch wasn't on his way back, Harry picked up the envelope and read: kwikspell A Correspondence Course in Beginners' Magic. Intrigued, Harry flicked the envelope open and pulled out the sheaf of parchment inside.

Text: Harry was at the point of telling Ron and Hermione about Filch and the Kwikspell course when the salamander suddenly whizzed into the air, emitting loud sparks and bangs as it whirled wildly round the room.

Text: The horses galloped into the middle of the dance floor and halted, rearing and plunging. At the front of the pack was a large ghost who held his bearded head under his arm, from which position he was blowing the horn. The ghost leapt down, lifted his head high in the air so he could see over the crowd -- everyone laughed --, and strode over to Nearly Headless Nick, squashing his head back onto his neck.

Text: "Why do you bore me with these dreams of yours? They get more childish every time! You can't dream anything but sentimental nonsense!"

Text: "Oh, to travel, to travel!" cried he; "there is no greater happiness in the world: it is the height of my ambition."

Text: "Then commit them over again," he said gravely. "To get back one's youth, one has merely to repeat one's follies."

Text: "Who do you think you are?" he said, in a harsh voice. "How dare you insult my sister?"

Text: How the sunshine cheers me, and how sweet and refreshing is the rain; my happiness overpowers me, no one in the world can feel happier than I am.