Helping ASR ‘Name that Tune’

Oct 31 2023 VITAC

Francis Scott Key would not be happy.

Videos recently making the rounds on social media show how automated captions can mishear and misrepresent audio with disastrous results. The clips show captions created by an automatic speech recognition (ASR) program displayed on a center court scoreboard during “The Star-Spangled Banner” at a pro basketball game.

In one of the videos, “…broad stripes and bright stars, through the perilous fight” was captioned as “STARS. PASS THROUGH THE PAYROLL. BUS FIRE.”

In another, “gallantly streaming” became “MENTALLY SCREAMING” while “say does that Star-Spangled Banner yet wave” was captioned as “OH, SO SAY HART STAR SPACE ANGLE. OLD NO. HIRED AWAY.”

Though caption fails like these can be funny, the truth is they are no laughing matter for the millions who rely on captions every day for information and entertainment. Poor-quality captions, even on something as well-known as the national anthem, can leave viewers confused, frustrated, and angry.

The Problem with Live Songs

Though the quality of ASR captions has improved in recent years, there are still problems that auto-captions have yet to solve. Song lyrics are one of them.

The national anthem, like many songs, can be difficult for an ASR engine to figure out, said Adi Margolin, Verbit’s Director of Product Management.

One problem is that songs can feature a variety of vocal techniques, such as rapping, singing in different registers, or varying the intensity with which lyrics are delivered. ASR systems, however, typically are trained on spoken language and may not accurately recognize these different vocal styles.

Furthermore, vocalists often stress words in ways that ASR doesn’t expect. And in a live performance, like “The Star-Spangled Banner” at a sporting event, the engine has only a split second to decide what it’s hearing; when words are drawn out and pronounced in ways they aren’t normally spoken, the engine struggles.

Songs also can contain non-standard vocabulary and grammar, including slang, idioms, and poetic language not commonly found in everyday speech (think “O’er the ramparts”). As a result, the ASR may not have been trained on enough data to accurately recognize these words and phrases.
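This out-of-vocabulary problem can be illustrated with a toy sketch. The word list below is invented for illustration and is vastly smaller than a real ASR lexicon; the point is simply that words absent from the training data cannot be recognized directly and tend to be replaced by acoustically similar everyday words:

```python
# Toy illustration: words outside an ASR model's training vocabulary
# ("out-of-vocabulary", or OOV) can't be emitted directly, so the engine
# substitutes the closest-sounding words it does know.
# This tiny vocabulary is a made-up stand-in for a real lexicon.
everyday_vocab = {
    "the", "through", "night", "that", "our", "flag", "was", "still",
    "there", "we", "watched", "pass", "payroll", "bus", "fire",
    "stars", "stripes", "bright",
}

def find_oov(lyric: str, vocab: set[str]) -> list[str]:
    """Return the lyric words the model has never seen in training."""
    words = [w.strip(",.!?").lower() for w in lyric.split()]
    return [w for w in words if w not in vocab]

# Poetic words like "o'er" and "ramparts" rarely appear in everyday
# speech data, so they get flagged as unseen.
print(find_oov("O'er the ramparts we watched", everyday_vocab))
```

A real recognizer works on sounds rather than spelled words, but the failure mode is the same: no amount of acoustic evidence can produce a word the model was never trained on.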

That said, ASR models can be trained to identify “The Star-Spangled Banner” or any other national anthem. The problem, however, is that captioning a national anthem (or a halftime pop-song performance, for that matter) is not the engine’s sole purpose. The ASR also needs to properly caption the sporting event itself – player information, game action, scoring details, in-stadium announcements, and so on.

And it’s in this dual responsibility – where the engine is asked to caption accurately across two content worlds with two different training sets (the language and tempo of a sporting event vs. the vocal range and lyrical nature of songs) – that the challenge lies.
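One common mitigation for this kind of domain switch (the article doesn’t describe any vendor’s internals, so this is a generic technique, not VITAC’s or Verbit’s method) is contextual biasing: while the anthem plays, candidate transcripts that match a list of expected phrases get a score bonus. A minimal sketch with invented scores and phrases:

```python
# Toy sketch of contextual biasing: rescore candidate transcripts by
# adding a bonus when they contain an expected "bias phrase"
# (e.g., anthem lyrics loaded only while the anthem is playing).
BIAS_PHRASES = ["gallantly streaming", "star-spangled banner"]
BONUS_PER_MATCH = 2.0

def rescore(hypotheses: list[tuple[str, float]]) -> str:
    """Pick the best hypothesis after applying bias-phrase bonuses."""
    def biased_score(text: str, acoustic_score: float) -> float:
        bonus = sum(BONUS_PER_MATCH
                    for p in BIAS_PHRASES if p in text.lower())
        return acoustic_score + bonus

    return max(hypotheses, key=lambda h: biased_score(h[0], h[1]))[0]

# The mishearing scores slightly higher acoustically (-3.1 vs. -3.4),
# but the bias bonus lets the real lyric win.
candidates = [
    ("mentally screaming", -3.1),
    ("gallantly streaming", -3.4),
]
print(rescore(candidates))  # → gallantly streaming
```

The catch, as the paragraph above notes, is knowing when to switch the bias on and off: the same bonus that rescues the lyric during the anthem would distort play-by-play captions the rest of the night.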

Customized, Trained ASR Solutions

Though ASR might not always be able to get the lyric right, the technology is constantly advancing.

At VITAC, we’re always looking to improve our already high-quality ASR offering. Our proprietary ASR solution is built on a custom, continuously trained engine developed by captioning, speech, and machine-learning experts. It’s designed with customers in mind and is adaptable to specific programs, events, and needs. Our preparation team trains the engine on customers’ individual content, reviewing previously captioned material and adding prep terms to improve the accuracy of client-specific ASR models.

And because live music performances often include background noise (loud instruments, crowds, applause) that can make it difficult for ASR systems to distinguish between speech and non-speech sounds, our engine features audio filtering that helps it isolate the clearest possible speech signal.
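The simplest form of that speech/non-speech separation is energy-based voice activity detection, sketched below. Real captioning pipelines use far more sophisticated models; the frame size and threshold here are arbitrary, and the audio is synthetic:

```python
# Minimal sketch of energy-based speech/non-speech filtering: keep only
# the audio frames whose RMS energy clears a threshold, dropping quiet
# crowd murmur before it ever reaches the recognizer.
import numpy as np

def keep_speech_frames(audio: np.ndarray, frame_len: int = 400,
                       threshold: float = 0.01) -> list[int]:
    """Return indices of frames whose RMS energy exceeds the threshold."""
    n_frames = len(audio) // frame_len
    kept = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms > threshold:
            kept.append(i)
    return kept

# Synthetic example: quiet crowd noise (frame 0), then a louder sung
# note (frame 1). Only the loud frame survives the filter.
rng = np.random.default_rng(0)
quiet = 0.001 * rng.standard_normal(400)
loud = 0.1 * np.sin(np.linspace(0, 40 * np.pi, 400))
print(keep_speech_frames(np.concatenate([quiet, loud])))  # → [1]
```

Energy alone can’t tell applause from a belted high note, which is why production systems layer learned models on top; but even this crude gate shows the idea of filtering audio before recognition.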

Through further testing and training, we’re confident that we’ll have an ever-stronger solution for captioning live music, one that helps everyone correctly “name that tune.”