Poised for Mass Adoption? Synthesized Voices for the Media and Entertainment Industry

Oct 16 2022 VITAC

Synthesized speech is nothing new. We’ve long been accustomed to hearing the robotic tones of computer-generated voices in messaging systems, access services and, more recently, in virtual assistant technologies like Siri and Alexa. But we’d never mistake them for the real thing. And we’d certainly never consider replacing the talent in our TV and film productions with a synthesized voice. Or would we?

At VITAC, we’ve always been big advocates of combining the power of technology and a skilled team to deliver the best results in the transcription, access and localization services we provide to our broadcast, streaming and production clients. In recent years, AI-powered tools have become increasingly valuable for improving the efficiency and speed of these workflows. But when it comes to recording dubbing tracks and audio descriptions, we would always choose live talent over a synthesized voice. However, recent demonstrations have our team wondering whether synthesized speech is now poised for mass adoption across the media and entertainment industry.

Setting the Stage for Synthesized Voices

Until just a few years ago, synthetic voices were created by recording an actual voice, chopping the speech into component sounds, and then ‘mixing and matching’ these sounds to make new words. With no way to alter the tone or inflection of these recordings, it’s no wonder that the results weren’t particularly convincing. From around 2016, however, developments in deep learning spurred massive progress in speech recognition science, and neural networks are now trained on sets of speech data to generate far more realistic results. The speech and voice recognition market has since boomed and is expected to reach a value of $26.8 billion by 2025.

These developments couldn’t come at a better time for the media and entertainment (M&E) sector. The growth of streaming platforms over the last decade means that the industry is facing an unprecedented demand for accessible and localized content. Netflix subtitled seven million and dubbed five million run-time minutes in 2021 alone, and ResearchAndMarkets.com predicts that the dubbing market will grow from US$2.4 billion in 2019 to US$3.6 billion in 2027, an increase of 50% in less than a decade. But the industry is also facing a talent and facility shortage that threatens to undermine our ability to meet these ever-increasing demands.

So, could these two sets of circumstances combine to provide an elegant solution? 

In Pursuit of Synthesized Speech Perfection

Despite popular misconception, not all speech synthesis is the same.

Text-to-speech, or synthesized speech, makes it possible to convert text into a computer simulation of human speech by using machine learning to create voice robots that can ‘read’ any written content aloud. These solutions represent a dramatic improvement on the original approach and are often used for corporate and brand messages or for access services. But these voices are still often criticized for sounding robotic and lacking the emotion of a human voice, and they are very unlikely to replace the professional voice-over artists used by the media and entertainment sector any time soon.

Voice cloning software, or speech-to-speech technology, on the other hand, uses one person’s speech to generate another voice. This AI-powered software uses audio recordings to clone a voice, identifying patterns like tone, pace, emphasis, and pronunciation to create a model that can be used to voice completely new scripts. The results are far more convincing because they include pauses, breathing points, emotion, and the other typical characteristics of a real voice – in short clips, in particular, they can be indistinguishable from human recordings. This technology has already been used to recreate Val Kilmer’s voice in “Top Gun: Maverick” and to de-age actor Mark Hamill’s voice in “The Mandalorian” – and it could be a game changer for access and localization services in M&E.

As media localization expert Yota Georgakopoulou explains, “Synthesized voices that provide access to information don’t need to be perfect but, if they’re going to be used in entertainment programming, they need to be flawless. That means we need to have full control over the voice – to be able to adjust the speech rate, pronunciation, and the emotion of the voice – to edit it in the same way that you edit text. Otherwise, instead of being entertained, you’ll be irritated. We saw great demos with synthetic voices at IBC Show this year and people were wowed.”

The benefits of using speech synthesis to create voices that sound realistic are obvious: saving on studio recording time, cutting out performance and usage fees for voice artists, and making revisions to voice-over tracks simply by editing the text in the script. We’d also be able to ensure that renowned and loved voices like Sir David Attenborough’s would be available for centuries to come and provide producers with any-time access to an almost limitless catalogue of voices to suit any creative brief.

From an access and localization perspective, voice cloning could solve many of the budget and capacity issues that we’re currently facing. In fact, the first feature-length film has already been dubbed into Latin-American Spanish using AI voices. Tel Aviv-based start-up Deepdub cloned the original voices of the English-speaking cast from “Every Time I Die” to create the Latin-American Spanish dubs and is now reportedly working with various Hollywood studios on similar projects.

But, before we consider all our problems solved, there are some issues to iron out.

The (Potential) Issues with Synthesized Voices

As in other sectors, one of the biggest concerns in the media and entertainment industry about the increasing use of AI is that it will put people out of work. In this case, voice artists and actors could face the possibility of being replaced. However, voice cloning companies argue that the software could also provide opportunities for talent to increase their earnings. The process requires real voices to build the models initially and gives actors the opportunity to license their voices and make them available to much wider markets. This means voice talent could potentially earn royalties for a limitless number of products without having to attend any recordings other than the original data capture.

Rights management is another area that will need to be considered. If voice artists license the use of their voice at a premium to make up for the loss in performance fees, the industry will need to devise systems to ensure that the sources are authorized suppliers and that the appropriate usage fees make their way back to the original artist. The potential issues around rights management for cloned voices have already been seen in the lawsuit against TikTok by voice actor Bev Standing, who alleged that the social platform used her voice without her permission. Organizations like the Screen Actors Guild-American Federation of Television and Radio Artists (SAG-AFTRA) are already working to help members protect their ‘digital self’ by making sure that performers know their rights, that their digital data is protected, and that artists are fairly compensated for any use of their digital identities. No doubt there are many other underlying processes and workflows that will need to be reconsidered should voice cloning become an intrinsic part of future production workflows.

The term “fake news” was rarely heard before 2016, yet fictitious reports shared over social media are now blamed for influencing elections, inciting violence and even threatening democracy. And that’s just the written content. While skeptical audiences might question the accuracy of a text-based piece, when confronted with a video or audio recording that looks and sounds like the real thing, most of us wouldn’t doubt its legitimacy. For example, filmmaker Morgan Neville recently admitted to creating 45 seconds of an AI voice for late chef Anthony Bourdain in the documentary “Roadrunner,” sparking controversy about whether he had permission from the family and whether he should have informed the audience. Distinguishing fact from fiction will become increasingly difficult.

There has also been some resistance from the blind and low-vision community to using synthesized voices in the production of audio descriptions. They contend that, if every element of production – from wardrobes to set dressing, make-up, lighting, and camera angles – is so carefully considered for viewing audiences, it isn’t fair to compromise the experience for audiences with visual disabilities by using synthesized voices that can’t match the dramatic delivery of professional voice actors.

The Future Potential of Synthesized Speech

Clearly there’s work to be done, but all indications are that synthesized speech will have a big impact on production, access, and localization workflows in the future. Just how far away that future is remains to be determined. The media and entertainment industry can be slow to adopt new approaches, and processes like proofs of concept, vendor agreements, API integration, staff training, and large-scale implementation have a habit of eating up a lot of time. So, it may be a good few years until we notice a difference in the content we’re watching – although if we get it right, the viewer shouldn’t actually notice a difference in the content at all.

Get in touch with the VITAC team to discuss your dubbing or audio description requirements.