Francis Scott Key would not be happy.
Videos recently making the rounds on social media show how automated captions can mishear and misrepresent audio, with disastrous results. The clips show captions created by an automatic speech recognition (ASR) program and displayed on a center-court scoreboard during “The Star-Spangled Banner” at a pro basketball game.
In one of the videos, “…broad stripes and bright stars, through the perilous fight” was captioned as “STARS. PASS THROUGH THE PAYROLL. BUS FIRE.”
In another, “gallantly streaming” became “MENTALLY SCREAMING” while “say does that Star-Spangled Banner yet wave” was captioned as “OH, SO SAY HART STAR SPACE ANGLE. OLD NO. HIRED AWAY.”
Though caption fails like these can be funny, the truth is they are no laughing matter for the millions who rely on captions every day for information and entertainment. Poor-quality captions, even on something as well-known as the national anthem, can leave viewers confused, frustrated, and angry.
The Problem with Live Songs
Though the quality of ASR captions has improved in recent years, there are still problems that auto-captions have yet to solve. Song lyrics are one of them.
The national anthem, like many songs, can be difficult for an ASR engine to figure out, said Adi Margolin, Verbit’s Director of Product Management.
One problem area is that songs can feature a variety of vocal techniques, such as rapping, singing in different registers, or varying the intensity with which lyrics are delivered. ASR systems, however, typically are trained on spoken language and may not accurately recognize these vocal styles.
Furthermore, vocalists often stress words in ways that ASR doesn’t expect. And when it’s a live performance, like “The Star-Spangled Banner” at a sporting event, the engine has split seconds to decide what it’s hearing; words that are drawn out or pronounced in ways they aren’t normally spoken make that decision even harder.
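For readers curious about the mechanics, here’s a minimal sketch of that real-time constraint using the Google Cloud Speech-to-Text streaming API. This is one widely available engine used purely for illustration, not VITAC’s own system; the audio source is a hypothetical iterable of PCM chunks. With interim results enabled, the engine must commit to provisional guesses as audio arrives, with only a brief window to revise them before a caption line is finalized.

```python
# Illustrative sketch of live (streaming) recognition; not VITAC's engine.
# Assumes 16 kHz, 16-bit mono PCM audio delivered in small chunks.
from google.cloud import speech

def stream_captions(audio_chunks):
    """audio_chunks: an iterable of raw PCM byte strings (hypothetical input)."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=True,  # emit provisional guesses before the audio ends
    )
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks
    )
    for response in client.streaming_recognize(config=streaming_config,
                                               requests=requests):
        for result in response.results:
            tag = "FINAL" if result.is_final else "interim"
            print(f"[{tag}] {result.alternatives[0].transcript}")
```

On a drawn-out sung phrase, the interim lines are exactly where guesses like “MENTALLY SCREAMING” can appear before the engine has enough audio to reconsider.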
Songs also can contain non-standard vocabulary and grammar, including slang, idioms, and poetic language not commonly found in everyday speech (think “O’er the ramparts”). As a result, the ASR may not have been trained on enough data to accurately recognize these words and phrases.
That said, ASR models can be trained to identify “The Star-Spangled Banner” or any other national anthem. The problem, however, is that captioning a national anthem (or a halftime pop-song performance, for that matter) is not the sole purpose of the engine. The ASR also needs to properly caption the sporting event itself – player information, game action, scoring details, in-stadium announcements, etc.
And it’s in this dual responsibility – asking the engine to caption accurately across two content worlds, with two different training sets (the language and tempo of a sporting event vs. the vocal range and lyrical nature of songs) – that the challenge lies.
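One common mitigation for this vocabulary problem is phrase biasing, sometimes called speech adaptation: priming a general-purpose recognizer with terms from both content worlds so that lyrics like “o’er the ramparts” aren’t forced into everyday-speech guesses. The sketch below shows the idea with the Google Cloud Speech-to-Text API; it’s an illustration of the general technique, not a description of any particular vendor’s engine, and the phrase lists and file path are hypothetical.

```python
# Illustrative sketch of phrase biasing ("speech adaptation").
from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical prep terms covering both "content worlds":
anthem_phrases = ["o'er the ramparts", "gallantly streaming",
                  "the perilous fight", "star-spangled banner"]
game_phrases = ["three-pointer", "shot clock", "technical foul"]

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[
        # Higher boost nudges the engine toward these phrases when the
        # audio is ambiguous.
        speech.SpeechContext(phrases=anthem_phrases, boost=15.0),
        speech.SpeechContext(phrases=game_phrases, boost=10.0),
    ],
)

audio = speech.RecognitionAudio(uri="gs://example-bucket/anthem.wav")  # placeholder
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```

Biasing helps with rare words, but it can’t fix everything: the acoustic mismatch between sung and spoken audio remains, which is why training data matters as much as vocabulary.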
Customized, Trained ASR Solutions
Though ASR might not always get the lyrics right, the technology is constantly advancing.
At VITAC, we’re always looking to improve our already high-quality ASR offering. Our proprietary ASR solution is built on a custom, continuously trained engine developed by captioning, speech, and machine-learning experts. It’s designed with customers in mind and is adaptable to specific programs, events, and needs. Our preparation team trains the engine on each customer’s individual content, reviewing previously captioned material and adding prep terms to improve the accuracy of the client-specific ASR models.
And because live music performances often include background noise (loud instruments, crowds, applause) that can make it difficult for ASR systems to distinguish between speech and non-speech sounds, our engine features audio filtering that isolates the cleanest possible speech signal.
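At its simplest, that kind of filtering can begin with voice-activity detection (VAD), which discards audio frames classified as non-speech (crowd roar, applause) before they ever reach the recognizer. The sketch below uses the open-source webrtcvad package as a stand-in for the idea; it is a basic cousin of the filtering described above, not our proprietary approach.

```python
# Illustrative voice-activity detection sketch using the open-source
# webrtcvad package; not VITAC's proprietary audio filter.
import webrtcvad

def speech_frames(pcm_audio: bytes, sample_rate: int = 16000,
                  frame_ms: int = 30, aggressiveness: int = 3):
    """Yield only the frames the VAD classifies as speech.

    pcm_audio: 16-bit mono PCM; frame_ms must be 10, 20, or 30;
    aggressiveness ranges 0-3 (3 filters non-speech most aggressively).
    """
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm_audio) - frame_bytes + 1, frame_bytes):
        frame = pcm_audio[start:start + frame_bytes]
        if vad.is_speech(frame, sample_rate):
            yield frame
```

A production pipeline does far more than this (noise suppression, source separation, channel selection), but even simple VAD shows why filtering out applause before recognition makes the engine’s job easier.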
Through further testing and training, we’re confident that we’ll deliver a stronger solution for captioning live music and help everyone correctly “name that tune.”