>> Welcome to the "Bars and Tone" radio program, an in-depth look at the news and issues facing AHECTA members today. Now, here are your hosts, Hal Meeks and B.J. Attarian. >> All right, so, here's the deal. Our voiceover person, Megan, recorded a brand-new introduction that included you, Brandon, and forgot to save it. Now she is the Harley-Davidson intern girl riding across the country on the motorcycles. So, we can't get a new intro until she gets back here in the fall. So, you've got me, B.J., and Hal. Hey, Hal, how you doing? >> I'm doing great. >> And Brandon is here. It's the Fourth of July week. You have big plans for the Fourth of July? >> Oh, yeah, yeah. I'm going to Bellhaven, North Carolina, to watch the Fourth of July parade. It's a wonderful slice of Americana. I highly recommend it. >> It's for everyone who is in the vicinity of Bellhaven. What's the parade like? >> Pretty much anything that has wheels can go. And that includes riding tractors, ATVs, semis, tractors. >> Big Wheels? >> Yeah, tractors and just all kinds of stuff. It's pretty awesome. >> Cool. Enjoy the Fourth of July, whatever you're doing for the Fourth of July week. Today on the show, we have a great show for us here today. We're going to be talking captioning, all types of captioning -- captioning after the fact, captioning for the web, live captioning. Our guests include Daniell Krawczyk, president of Municipal Captioning, which is a live-caption aggregator. We've also got the vice president of product for rev.com. Mark Chen will be with us. And John Capobianco, chief marketing officer of Vitac, a leading company in the live-caption sector. So, let's get right into it. And joining us now is friend of the show, really, Daniell Krawczyk. He is the founder and president of Municipal Captioning, formerly of TelVue, then Tightrope. >> Yeah, and LiveU in between. >> That's right. And you were on the show, I think, for each one of those. >> I think I may have been. >> So, yeah. So, welcome back to the show again. >> Thank you. >> And talk about Municipal Captioning. What is the latest thing you're doing? >> Yeah, sure. So, last year, I was at the AHECTA conference, and I met a bunch of folks who were doing closed captioning through Georgia Tech, AMAC -- Accessibility Media Research Center -- and learned what was happening in the world of higher education, that due to ADA lawsuits, higher education was close-captioning everything -- all of the videos being distributed online, post-production-wise, but also pushed to close-caption all things that were happening live. And it made me realize that the world of cities and counties, public access, government access, that larger world I've been working with four years, was also going to need to be resolving this issue of providing effective communication to all the citizens through live, close-captioning of meetings, sports events, other live events, and then captioning of the other content of broadcasting. So, shortly after AHECTA, I left the job I was at, at Tightrope, and I started Municipal Captioning to help these community-television organizations, universities, different groups evaluate all the different options that are out there for captioning the content live or non-live and help them project the costs out for all the content that they have, compare three or more different options, and then be able to buy something that meets their needs. >> So, you don't actually do the captioning itself. >> That's right, yeah. 
Rather than be the person who is serving as a human professional captioner when there are many different services that provide that, or try to launch a new technology product that does it automatically with AI, I'm aggregating the needs of hundreds of different cities and pulling them together so that we can get better pricing from all the different solutions that are out there, and then helping cities combine the different elements -- hardware from here, software from there, maybe correction via this interface, correction on their own, or correction with a third party -- so that they can use the scale that they need, and the scale of all the other communities around them, to get something that fits their budget. >> So, why, if I am a company -- and you kind of hit on it there a little bit -- but if I'm a company, why would I come to you instead of going directly to the captioning source? >> Sure, sure. So, if someone's trying to figure out what they're going to do, the first thing that tends to be a problem is trying to find three or more different options so that they can get multiple quotes, compare, and see what the fit is. So, I make it a lot easier for them. Rather than starting from scratch, I can help them see what all the different combinations and live-hardware solutions look like -- whether it's an up-front-only cost or something you pay for by the hour -- and I can help them project out those costs. And then, this is a field where things are changing really quickly, so to presume that we can figure out what the perfect solution is right now and that it will still be the best solution for everybody in six months or a year is really unlikely. So, by working with me, they get to see what all the options are now, and then I'm going to keep them abreast of what all the options are as time goes on, so if in six months, a year, a year and a half, there's a better solution, they can easily switch without having to create a whole new contract. >> And I want to come back to that in a minute because I have a different question -- kind of what you're talking about there. But as you know, we're at a university here, and we are getting into live captioning. What are some of the big obstacles that you see for a university or a smaller municipality in getting into this live captioning? Because it's one thing if I want to go ahead and caption after the fact. >> Sure. >> Because there are a lot of opportunities for that. But actually to do the live-captioning part, that's another whole ballgame. >> Yeah, for sure. So, there's a couple of elements there. Obviously, you need hardware. You need equipment that's capturing the audio in real time and either feeding it to a human captioner, who is being paid by the hour, or feeding it into a machine-learning or artificial-intelligence system that is doing the speech recognition. So, you have the initial hardware that's sitting there in your broadcast path, taking the audio from the meeting or the game, and then you have what I call the engine. That engine could be a human engine. It could be a person who is typing furiously on their keyboard and swapping every hour and a half, two hours with another person for a long event. It could be a physical server that sits right next to that encoder and runs the software locally. It could be an engine in the cloud, so that the audio's going off and then coming back. So, those are the two main things. You need to have the encoder that's putting the closed captions into the signal, and you need to have the "engine" that's generating those captions. And there are a lot of barriers, because people have to figure out how they're going to pay for this, who's going to be doing the work, and what's going to be compatible.
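To make that encoder-plus-engine split concrete, here is a minimal sketch of the live path. Every name in it is a hypothetical stand-in for illustration, not any particular vendor's hardware or API:

```python
# A sketch of the live-captioning path described above: audio from the event
# feeds an "engine" (a human captioner or a speech-recognition service), and
# the resulting text is handed to the caption encoder sitting in the
# broadcast chain. All three functions are hypothetical placeholders.

def read_audio_chunks():
    """Stand-in for the capture hardware: yields short chunks of program audio."""
    ...

def transcribe(chunk):
    """Stand-in for the 'engine': a human keying text, a local server,
    or a cloud speech-recognition service."""
    ...

def send_to_encoder(text):
    """Stand-in for the encoder that embeds the captions (CEA-608/708 for
    broadcast, or cues for a webstream) into the outgoing signal."""
    ...

def run_live_captions():
    for chunk in read_audio_chunks():   # 1. capture audio in real time
        text = transcribe(chunk)        # 2. the engine turns speech into text
        if text:
            send_to_encoder(text)       # 3. the encoder puts it in the signal
```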
>> And it does sound like that could be quite a bit more expensive than if I was just doing an AI caption-after-the-fact. >> It can be. I will say, though, that there are so many options now that it's actually surprising: real-time AI captioning can be less expensive than a lot of the after-the-fact post-production solutions. So, traditional live human captioning tends to be over $100 an hour. There are some that are less, but for the most part, $150, $125, sometimes even more, is what you pay for real-time human captioning. But because there's such a range of artificial-intelligence solutions now, they tend to be a fraction of that, and some of them are even a fraction of the cost of the more expensive AI solutions. >> So, if I was a small school or a big school, how would I start this process of trying to research this? >> Sure. Again, this is mostly what I serve folks as -- the central research person to help them with it -- but if I wanted to give a couple pieces of advice for someone who wanted to do the research on their own, it would be to look at the various pieces of equipment and try to figure out the compatibility with the various engines. See what's flexible enough that if, six months or a year from now, the best technology for changing audio to text is different, you can still reuse the things you've already invested in -- or have you sunk costs into something that you can't reuse? >> Or they could come to you. >> Yeah, of course. >> And make it a lot easier, right? >> I'd be happy to walk them through what the different options are, what the different pricing models are, and figure out what's relevant. >> Okay, one thing I wanted to ask: when you're talking to your clients, are they primarily focusing on captioning for broadcast? >> Yeah. So, a lot of my clients both put their meetings and other content on a television channel, a cable-television-broadcast channel, and they stream it online. They have a webstream that can either be watched in the browser or be watched on people's over-the-top devices. So, when I talk about broadcast with my customers, it's usually both television and the web. >> Right. Okay, so, do you have these folks doing any post-production captioning, as well? >> Sure. So, there's both elements, right? Captioning it live doesn't give you the corrected version for post. So, I also have a database of solutions that are post. They range from ones that are entirely automated and don't have corrections to ones that do three layers, where you have the AI followed by two levels of human correction. And then there are even options that combine live with post, where AI generates captions in real time and then humans correct it to get it to perfect, or close to perfect, post-production quality.
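As a rough worked example of the cost projections Daniell describes, here is a back-of-the-envelope comparison using the ballpark rates from this conversation. The hours and the two AI rates are invented for illustration; none of these figures are real quotes:

```python
# Projecting annual live-captioning cost for a hypothetical organization.
# Rates are illustrative: ~$125/hour for live human captioning was mentioned
# above; the AI rates below are made-up placeholders at "a fraction of that."

HOURS_PER_WEEK = 6      # e.g., two meetings and one game per week
WEEKS_PER_YEAR = 50

rates_per_hour = {
    "human captioner": 125.00,
    "premium AI":       30.00,  # hypothetical
    "budget AI":         5.00,  # hypothetical
}

annual_hours = HOURS_PER_WEEK * WEEKS_PER_YEAR  # 300 hours/year
for option, rate in rates_per_hour.items():
    print(f"{option:>16}: ${rate * annual_hours:>10,.2f} per year")

#  human captioner: $ 37,500.00 per year
#       premium AI: $  9,000.00 per year
#        budget AI: $  1,500.00 per year
```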
>> So, you kind of hit on it a little bit earlier, talking about doing the research and seeing where we are in six months. Technology is changing so quickly. >> Oh, it's crazy. >> So, look into your crystal ball. Where do you think the market's going, and what do you see coming? >> Sure. So, I went to IBC last fall and NAB this spring. What I noticed already as a marked difference was the sheer number of things that were advertised as using AI -- not just for what I would think of as the first generation of transcribing purposes, but also using AI to do screen scraping. I saw products that will read all the lower thirds in your video, or even just the name badge in front of the speaker on the desk -- it can read that text and incorporate it into your search. I saw a lot of things coming out that incorporated real-time speech transcription so that you could do a better job of searching your giant video archive. So, I think what we're going to see is a lot of secondary services, tertiary services -- things that are built to help people deal with their thousands of hours of video in a more efficient way, now that they have searchable speech-to-text. >> I've seen some of those things, too, and Final Cut is actually incorporating some of that now, as well, where they're going out and looking at it -- not even with the metadata, but AI is determining what the metadata should be without anyone actually having to go in and enter it. And then it groups it. So, I can see that coming down the line. >> That's a really good point. So, I think it's shifting from something that has traditionally always happened after the production -- captioning was, like, the last step -- to where we're now starting to see tools that are built to be used inside the nonlinear editor, and then they advertise that it can help you with your editing because you can use the transcripts to find the things that people are saying. Some of the tools allow you to trim the text transcript, and then it gives you an edit list for your video. So, if we were editing this podcast, and I found the part where I said the wrong phrase, we could just delete that phrase I said incorrectly, and it would stitch the audio together so it didn't sound like I stumbled over myself.
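That transcript-driven editing idea reduces to simple bookkeeping once every word carries a timestamp: deleting words from the transcript yields the list of media segments to keep. A minimal sketch, with invented words and timings:

```python
# Transcript-driven editing: each transcript word has start/end times in
# seconds, so deleting words yields "keep" segments for the audio/video.
# The words and timestamps below are invented for illustration.

words = [
    ("we",     0.00, 0.20), ("said",   0.20, 0.45), ("the",  0.45, 0.55),
    ("wrong",  0.55, 0.90), ("phrase", 0.90, 1.40), ("here", 1.40, 1.70),
]

def cut_list(words, delete_indices):
    """Return (start, end) media segments to keep after deleting words."""
    segments, cursor = [], 0.0
    for i, (_, start, end) in enumerate(words):
        if i in delete_indices:
            if start > cursor:
                segments.append((cursor, start))  # keep audio up to the cut
            cursor = end                          # skip past the deleted word
    media_end = words[-1][2]
    if cursor < media_end:
        segments.append((cursor, media_end))      # keep the tail
    return segments

# Delete "the wrong phrase" (words 2-4); the editor stitches what remains.
print(cut_list(words, {2, 3, 4}))  # -> [(0.0, 0.45), (1.4, 1.7)]
```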
>> Amazing where it's going. So, if we want to find out more about Municipal Captioning, where do we go? >> All right, so, we have our website, obviously, MunicipalCaptioning.com. You can also find us on Facebook, and you can also e-mail me at DanK@MunicipalCaptioning.com. >> Okay, Daniell Krawczyk, founder and president of Municipal Captioning. Thanks for joining us here today. >> Thank you, guys. >> We're talking with Mark Chen, who's the vice president of product -- is that correct? >> Yes, VP of product. >> For Rev, which provides captioning and transcription services. In what instances are closed captions required? Do you have any thoughts on that? >> Yeah. The requirement for closed captions on video comes from two different sources, two different regulatory regimes. One is the FCC, and the other is the ADA. For purposes of educational content, it's primarily governed by the ADA, the Americans with Disabilities Act. It basically says that you have to accommodate students who might be deaf or hard of hearing. So, if you put any video online -- if you're capturing lectures and making them available online for other students -- and you potentially have students who are deaf or hard of hearing, you need to have those videos captioned, right? Provide an alternative so they can get the same value out of that video as hearing students. And the FCC steps in when content is put on television. Essentially anything that goes on television has to be captioned. There are some carve-outs -- if it's broadcast between 2:00 and 4:00 in the morning, for instance, or if it's in a foreign language, it doesn't have to be captioned. But if you're using public airwaves, basically you need to have your video captioned. And then, of course, that sort of becomes pervasive online, as well, because the FCC a couple of years ago said that if any content has ever been on television and is then put online, that content needs to be captioned, as well, right? So, any TV shows or movies that were broadcast on television and are now on Netflix -- well, those need to be captioned, right? If you have a talk show that's broadcast at 10:30 at night, but then you take a 5-minute clip of it and put it on Facebook -- well, that 5-minute clip was on television, and so it has to be captioned, as well. And so, basically, anything that you want to be shown to a larger audience and be accessible to deaf or hard-of-hearing people has to be captioned. >> Okay, so, you've touched on something here that I think is very important, which is that content that was originally broadcast or in some other medium has to be captioned when it's put online. What about content that is native to an online environment, that was never broadcast or anything like that? >> Yes, technically it doesn't have to be captioned. At least, the FCC has primarily steered clear of it. In fact, there was an interview a number of months ago -- I forget with whom -- that basically declared that Netflix originals, right, content that Netflix develops on its own and that only goes on Netflix, don't have to be captioned. So, from a regulatory perspective, you're not required to have it captioned. But from a business standpoint, sort of for customer satisfaction, most content owners are moving that way. I think interesting models for this would be the online-education platforms, like Craftsy, Pluralsight, Khan Academy, lynda.com -- all of those sites. I'm not sure how familiar your listeners are with those, but they're subscription sites for the most part, where you can go online and learn, right -- further your career, learn personal skills, et cetera. And because they charge for it, for the most part, customers are looking for a better, premium experience. So, if you go to those sites, all of their videos are captioned, because that's what customers are looking for. You know, 30% or more of all online-video viewers are playing video with captions turned on. Even though the total population of people with hearing difficulties is somewhere around 6%, a much larger share of the audience is actually getting value out of captions. >> Why is that? Why do you think people are actually using captioning? >> I think there's a wide variety of reasons. One big driver of it is mobile. When you are mobile, you're listening off headphones. You're on the move. So, even if you're watching, say, Netflix or Amazon Video, you might be wearing headphones, possibly low-quality ones. Another driver is watching with other people in the room. Some people may not be considered hard of hearing, but they have less-sensitive ears than their partners. They don't want to turn the volume way up. And sometimes, what I've heard from some viewers, it's when you're watching a show with accents, right?
People are watching, I don't know, "Game of Thrones" or something along those lines, and sometimes you want the captions just so you can understand what's being said, because it's spoken with a heavy accent. But going back to mobile, with Facebook or Instagram and auto-roll, video basically starts to play as soon as you scroll through your feed, so it's much more important to have captions. We heard from one content owner that their videos on Facebook are viewed three times as often with captions as without captions. And that's primarily because when Facebook uses auto-roll, it's automatically muted. Your videos play, but there's no sound -- just imagine all the people either on the bus or in their office setting discreetly scrolling through Facebook and watching a video. And because they play without sound, captions are critical to getting your content understood. >> That's a great answer. I think you've touched on some things that explain why captioning is relevant for people who are not hearing-disabled. One question I think that comes up a lot is how are subtitles different from captioning? Do you have some thoughts on that? >> Yeah, I have some thoughts. Unfortunately, there's not really an industry standard. Subtitles and captions -- those terms get used quite often interchangeably. For us at Rev, and I think it's probably the most common usage in the industry, captioning is putting words on screen in the same language that the content was originally recorded in, right? So, imagine English video, English content, with English words on screen -- and that can be in closed or open form -- whereas subtitles tend to refer to words that are in a different language, right? So, English video with French subtitles, or vice versa. >> Okay, so, what we're going to do now is we're going to switch gears a little bit. We're going to talk about your company and some things that you do. First of all, what media formats do you accept for captioning? >> We essentially accept any nonproprietary video format, right? So, .mp4, QuickTime movies, Windows Media files, even .avi files. Essentially, if you can open it up in any sort of video player, like VLC, we'll be able to caption it. On our side, what we do is transcode it all into a standard .mp4 format at a lower resolution, which makes it easier to move around. Some people will send us ProRes files that are gigabytes per hour, or per 10 minutes, which are just impossible to move around. So, yes, we take pretty much anything, as you can tell. >> So, that includes audio-file formats, like .mp3, as well, right? >> Yes -- .mp3, .wav, et cetera. With audio files, it gets a little bit trickier, because we do have some audio recorders, from, like, Olympus or Sony, that will record into their own formats, which are a little bit more problematic, but yes. >> Oh, yeah. Yeah, I'm familiar with Olympus. Yeah, they use a weird audio format. A lot of times, if the files are going to be really large, what we'll do is basically submit just the .mp3 -- the audio-only portion of the video -- and then afterwards, in post, we'll take the caption content that you provide and marry it back into the video. And that works fine. >> Yeah, that works fine for us, as well. I'd say more common is that people will create low-res proxies and send those over to us, because there are some cases where having the video in conjunction with the audio leads to better output. Like, you know somebody is off-screen, and you can refer to that person as being off-screen. It helps with speaker tracking a little better if you do have video. But .mp3, just audio, is fine, as well.
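For anyone making those low-res proxies or audio-only files themselves, a common approach is a quick ffmpeg transcode. A minimal sketch -- the filenames are placeholders, and the resolution and bitrate choices are arbitrary:

```
# Make a small H.264/AAC proxy of a large master file for upload:
ffmpeg -i master.mov -vf scale=-2:480 -c:v libx264 -crf 28 -c:a aac -b:a 96k proxy.mp4

# Or extract just the audio as an .mp3, as described above:
ffmpeg -i master.mov -vn -c:a libmp3lame -b:a 128k audio.mp3
```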
>> Okay, actually, that's a great point -- something I hadn't thought about. Let's say I submit a video file to you that's roughly an hour long. What should I expect in terms of the turnaround time? >> We'll get an hour-long caption file back to you within two days. That's our guarantee -- 48 hours. I'd say what's probably more typical, what you can expect, is about a day, and that turnaround time is highly dependent on length, right? At the end of the day, it takes time to go through a video, type it out, synchronize it, and then quality-check it. And so, if you were to send in something that's 30 minutes long, we guarantee 24 hours, and what's more typical is probably about 10. And we have a lot of clients who use us for shorter clips for social media, YouTube, et cetera, and 5-minute videos will get turned around in about an hour. >> Wow. That's fantastic. What is the minimum cost for captioning? >> The minimum cost is just a dollar. So, our pricing is pretty simple. It's $1 per minute of content. So, if you have a 30-minute video, it costs $30. If it's an hour, it'll cost $60, with a one-minute minimum. If somebody sends us 10 seconds of video, we'll charge them $1. >> What an outrage. [ Laughs ] >> [ Laughs ] We have had a lot of people ask for discounts, but, frankly, if people want to stitch five of those together into one video that's still under a minute long, it'll still be a dollar. So, that's fine. >> Yeah, I've told my students about that -- that they can submit their student projects to you guys, and they're typically about 3 minutes long, and then for $3, they can have captioning for their videos. They could also have a transcription, as well. >> Right, right. >> That's great. Your pricing is wonderful, and it's nice that it's easy to understand. How do you guys handle multiple speakers in, let's say, a video? If you've got, let's say, two or three people, how do you handle that? >> Yeah, our standard is to note that it is a different speaker by putting a dash in front of the dialogue block. That is customizable. In other words, some clients don't like having the dash. They think it distracts from the viewing experience, and so we can actually remove it. In other cases, you can have us add names, as well. We do all that on the back end, anyway. So, our transcriptionists are identifying speakers and noting them as such already. But if you don't want those, obviously we can relatively easily remove those.
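For reference, here is what that dash convention looks like in a plain .srt caption file -- the dialogue and the timings below are invented for illustration:

```
1
00:00:01,000 --> 00:00:03,200
- How do you handle multiple speakers?
- We put a dash before each new speaker.

2
00:00:03,400 --> 00:00:06,000
- Or, if the client prefers, we can
add speaker names instead of dashes.
```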
>> Okay, okay. Let's say I've got some video content that's got very specialized terminology -- let's say, for instance, medical terminology. Can you guys handle that? >> Yeah, I'd say for the most part, yes. You know, just to give you a little bit of context for how we do things, how Rev works behind the scenes: when a customer uploads a video and places an order, the video goes into our system, as I mentioned before. We do a few things to transcode it, clean it up, and get it to a format that's easy to ingest. Then, essentially, we make it available to our captioners, and it's first come, first served, right? They have access to all of the different projects or jobs that are available at any given time and how much, essentially, they'll get paid to do that work. And they can listen to clips. And so, as you might imagine, what ends up happening is that the people who are best suited to work on certain content, or most excited to caption a particular project, will claim those projects first, right? So, somebody who wants to learn about physics, or knows about physics, is most likely going to be the first person to claim a project, a video, that is a physics lecture. Or somebody who has experience doing medical transcription will claim medical jobs first. For everyone else, what we expect is that you will do the necessary research to look up terms that are new to you. For somebody who has experience, that's relatively easy, and for somebody who doesn't have that experience, they can do those projects, too, but they're still expected to look terms up, right, or to identify things. >> Right. >> Yeah, so, we can do it within limits. Basically, if it's an actual word, our transcriptionists and captioners will typically find it. >> What is your accuracy, in terms of your transcription? >> We guarantee 99% word-level accuracy, so that every audible word is captured properly. And then, as far as time alignment is concerned, we guarantee down to a hundred milliseconds -- the caption group, the block of text, will appear on screen within a hundred milliseconds of when it was actually spoken. >> That's fantastic. That's absolutely wonderful, because as you know, for captioning content, because of accessibility guidelines, it's very critical that you have a high degree of accuracy. And that's actually one of the problems that we see with machine transcription -- typically, the accuracy is not good enough. >> Yeah, it's very common -- particularly in the academic world, but sometimes even in broadcast, and certainly online, where the requirements are not so stringent in terms of accuracy -- for people to look for an automated solution because the cost is lower. Our costs are low, but there are automated options that are even lower, and the accuracy just isn't there, particularly in the words that matter. The speech-rec systems always seem to be able to get words like "the" and "and." It's the proper names of companies or individuals or products, et cetera, that they don't get correct. >> Okay, I've got one last question for you, and this has to do with a scenario that some people have probably experienced. Let's just say that you have someone who has a YouTube video, and they want it to be captioned. What would they actually do to use your service? >> Yeah, it's pretty easy, and there are two methods. The first method is probably the easiest. You just go to rev.com/caption, click "get started," and copy the link over from YouTube, right? So, you can go to YouTube, copy the link for your video, and literally paste it into the order form. We will automatically get it and detect the length, and you can add in your credit card and check out, and then you'll get it back in .srt format, which is a text file that you can then go back to YouTube and upload. That's probably the easiest way to get started.
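If you manage a channel and want to script that last upload step yourself, the YouTube Data API exposes a captions endpoint. A minimal sketch using the Google API Python client, assuming you have already completed the OAuth flow and hold credentials in `creds`; the video ID and filename are placeholders:

```python
# Upload an .srt caption track to one of your own YouTube videos.
# Assumes `creds` holds OAuth credentials with a YouTube scope.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

youtube = build("youtube", "v3", credentials=creds)

request = youtube.captions().insert(
    part="snippet",
    body={
        "snippet": {
            "videoId": "YOUR_VIDEO_ID",  # placeholder
            "language": "en",
            "name": "English",           # track label shown to viewers
        }
    },
    media_body=MediaFileUpload("captions.srt"),
)
response = request.execute()
print("Uploaded caption track:", response["id"])
```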
I'd say most of our YouTube customers, particularly those who have ongoing needs and manage a channel, connect their YouTube account with Rev. So, the process is similar. It's just that when you go to rev.com/caption and you go to place an order, there's an option there to connect your YouTube account. And by clicking that, you basically log into YouTube and give Rev authorization to view your channel. And then you get a little list with thumbnails of all the videos that you've uploaded to your channel. You can check or uncheck the box for any videos you want or don't want captioned, and then you click "okay." Then the checkout process is the same. We charge you $1 per minute. We automatically detect all the lengths. But when we're done with the captions, instead of sending you an .srt file, we push those captions back to YouTube on your behalf, right? So, when they're done, you just go onto your YouTube video, and you'll automatically see the captions appear on your video without having to touch it. >> Okay. Well, you know, Mark, that's it. That's all I've got for you today. >> Great. Thank you so much. >> I really appreciate your time. Okay? >> No problem -- our pleasure. Clearly, we love talking about videos and captions. We're trying to drive down the cost as much as possible so that more people can have access to the technology, to text on their videos. So, happy to help. >> Okay. I was talking with Mark Chen from Rev, which provides transcription and captioning services. Mark, I really appreciate your time, and I hope you have a great day. >> Thank you, Hal. Thank you. >> Thank you so much for tuning into the "Bars and Tone" podcast. Today, we are talking about captioning, and I'm joined by a very special guest, the chief marketing officer of Vitac, John Capobianco. He is joining us today over the phone. It's a big company -- the biggest captioning company, the biggest accessibility company, in the country. You'll have seen some of their captions if you watched the recent Stanley Cup finals, "America's Got Talent," "The Tonight Show with Jimmy Fallon" -- all the things that they caption. They also do conferences, graduations, events, and sports, which will be a little bit more relevant to our listeners here in the education field. John, what else can you tell us about Vitac and what an average day is like? How much stuff are you captioning every single day? >> [ Chuckles ] Well, sometimes it kind of amazes people, just the volume of captioning we do on a daily basis. We do about 550,000 hours of captioning a year. That's a little bit more than a minute's worth of captioning for every second of every day, 24/7/365. >> Wow. >> So, 2 billion seconds of captioning on an annual basis -- just kind of an amazing thought when you consider that it's people that do this. There are people that are listening to whatever the event or the broadcast is, and they are transcribing that into the written word and transmitting it to -- it could be an event center, it could be a classroom, it could be the NBC News. And that gets put onto the screen. Now, most people think of captions as real-time, on the morning news and stuff like that, where you can see it, or for a sporting event, if you're in a restaurant or another establishment where it might be kind of noisy and they're kind enough to put the captions on so you can actually know what's going on when you can't hear it. So, the average day here is a lot of what we call realtime, which is live broadcast, and we're captioning those. And, again, that's true whether it's a lecture hall, or it's an event center and there's some baseball game going on, for instance, or it's a major event with a major corporation and the keynote speech is being transmitted in text, as well as through sound. So, a lot of people think about that, but there's also a lot of what we call "offline," which you might think of as prerecorded programs.
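Incidentally, the volume figures John just quoted check out as straightforward arithmetic:

```python
# Sanity-checking the quoted volume: 550,000 hours of captioning per year
# versus the number of seconds in a year.
captioned_seconds = 550_000 * 3600          # ~1.98 billion seconds captioned
seconds_in_year   = 365 * 24 * 3600         # 31,536,000 seconds in a year
print(captioned_seconds / seconds_in_year)  # ~62.8 -> a bit more than a
                                            # minute captioned per second
```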
So, those files are sent to us, as well. And you mentioned some of those in the open. A lot of the TV shows and those kinds of things are prerecorded programs, so they come in, and then they go to what's called "offline" -- that's what we call it, anyway. The captioners actually make a transcript of what's being said. They then take all that, and they time and place that verbiage. You'll notice the difference when you see captioning. If there's a lag between what somebody says and the words that are popping up, that's because it's being done in real time. It's got to move from the person who's speaking to the captioner's ears. It has to be transcribed. It has to be sent back. And typically, in that environment, it's going through a bunch of technology, like encoders and those kinds of things, that causes a little bit of delay. If the words come up right at the time that somebody is speaking, that's a prerecorded program. And if it's really done properly, it's timed and placed -- that is, the words are placed near the people that are speaking. If it's done properly, which we take great pride in, the captions don't cover anything important on the screen. By the way, those are also FCC standards. And we also include things that can be heard but are not necessarily the speakers. So, there's some description of what's going on -- you know, "dog barks," "clap," "music playing." You'll also see, if live captioning is being done properly, the words to songs that are being sung -- lyrics and those kinds of things. That's also a requirement. So, there's a lot of stuff that's going on. We also do up to 50 different languages in our multilanguage services. And we do multicasting, where we are putting together the same transmission in both English and Spanish simultaneously. And, by the way, that's done both in real time and in the offline. There's a lot of activity, with hundreds and hundreds of people online right now transcribing some audio that's going on and turning it into the written word, which benefits lots and lots of people. >> That's really amazing. And under ideal conditions, what are your live captioners -- like, what's their lag time from when they hear it to when they actually type it, minus all the encoders -- just from their ear to the type? >> Well, the actual lag that's introduced by the captioner is about a second or two. That's really all it is. The rest of that time is all technology delays. Encoders and those kinds of things cause additional delays. >> Right. >> But the captioner themselves -- and this is what's really strange, too; people don't think about this, and I'm very familiar with that, because I've only been in this business for about a year and a half, and before that I thought the TV did the captioning, just like everybody else. [ Laughs ] A normal typist can type at about -- a fast typist does what, 40 to 60 words a minute? Most people talk in normal, casual conversation at about 180 words a minute. The average broadcaster is at about 225 words a minute and usually ramps up to about 280, sometimes higher than that. These captioners keep up with that level. Our captioners think of a couple of hundred words a minute -- 200 words a minute -- as normal speaking, and that's how fast they translate this information from the spoken word to the written word. In our company, we mandate a minimum of 98% accuracy, and most of our people are well above that.
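Accuracy figures like that 98% are conventionally measured as word accuracy, i.e., one minus the word error rate (WER), computed as a word-level edit distance against a reference transcript. A minimal sketch, with a made-up reference and hypothesis:

```python
# Word error rate: minimum word-level edits (substitutions, insertions,
# deletions) to turn the caption output into the reference transcript,
# divided by the number of reference words. Example strings are made up.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to match the first i reference words to j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the city council meeting will begin at seven"
hyp = "the city counsel meeting will begin at seven"
print(f"{(1 - wer(ref, hyp)) * 100:.1f}% word accuracy")  # -> 87.5%
```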
And that's just their normal daily work; this is what they do all day, every day. And they enjoy it. I was amazed a year and a half ago, when I came to the company and went out and met many, many -- you know, hundreds of -- captioners, because most of them work remotely, and they mostly work from home. It's a lifestyle business for them that they enjoy. And beyond the fact that they enjoy their work, they feel great pride in their delivery of service to the community, because there are 50 million deaf and hard-of-hearing people in the United States alone, and they rely on the captions not only to be included in society -- we call that accessibility -- but also, more importantly, in disaster preparedness and emergency situations, when it's the only way they can get information, because, of course, they can't hear it. You can add to that that there are 83 million millennials, and 58 million of them watch videos without sound, according to the projections that we've seen on places like Facebook, where 85% of all videos are watched with the sound off. That means there's another 58 million millennials that are receiving information, typically on their mobile handhelds, through videos, and if your video is not captioned, it's meaningless, because they're not getting anything out of it. And what's worse than that is if you let the machines caption it, and then you're the recipient of the stupid remarks that the machines make [Laughs] since they are generally in the 70%-correct range. Anyway, that's what we see with the ASR engines. Most of them don't work all that well. They're fine for some things. You know, Siri works, but how often does it get a word wrong? And the problem is, unless captions are at least 98% accurate, they don't work for people who can't hear. They could be funny, but they're not -- by the way, the deaf and hard of hearing don't think that's funny at all. >> Mm-hmm. >> But having it be highly accurate is not only part of the law -- it's the right thing to do. >> Right, and you just talked about something I was going to hit on there. You're doing all your captioning with human captioners. >> Correct. >> And we have this huge surge of AI and computer-generated captioning. But it's still not there yet, especially for things that are critical, like health and safety information, tornado warnings, weather information. You just really can't get anywhere close to that level, right? You still have to use humans. >> It's just not accurate enough. Listen -- I don't throw any cold water on new technologies that are coming along. We use some ASR here, too, because we have voice writers, as well as stenocaptioners, so they use interpretive language. But there's a person there, and if something goes wrong -- the problem with the automated engines is that nobody's monitoring them. Every word they produce is a guess, right? That's what it's doing. It's guessing: "I think it's this." If it gets it wrong, there's nobody there to correct it. The deaf and hard of hearing are used to this. Most people who aren't don't know it, but if you're watching captions and you see a dash followed by words, that means the word prior to the dash was an error. The dash means "I'm correcting the error," and the correction immediately follows. So, the fact that we have captioners associated with this means that even if it's ASR doing the work, we have humans actually overseeing what's going on.
When you try to use them without that -- listen, they're making great strides, and we're all proud of the work that's going on in ASR, but it can't caption the way a human can. It doesn't have the human intelligence behind it that the captioner does. So, typically, you see things like synchronization problems: the words come too fast, or they come too slow and then catch up. The accuracy and completeness can be way off. It's usually on things like proper nouns and foreign phrases that you can tell when people are using engines instead of humans. An engine works pretty well if you can feed it a script -- if people are working in scripted environments. The problem is, as soon as they go off script, you wind up seeing a bunch of blanks on the screen, because the ASR engine doesn't know what they said. And speaker accents can cause all kinds of problems with that. So, there's a lot to the human element. When you think of captions, you've got to think of them as a combination of art and science. The science part can be dealt with, but the human part is really important, because the recipient is a human, and what they're looking for is the human context and the punctuation and all the things that the machines still have problems with. Maybe someday they'll get to the right spot. I don't think that's going to be in my lifetime, but they continue to get better every day. We believe in human captioning because our job is not just the captions. It's the quality and service that we provide to the industry, not just the words themselves. >> Absolutely. That's a big thing here in the education field -- this huge accessibility push that's been going on more and more recently, especially as so much has moved over to digital and technology -- making sure that everybody on campus is included and everybody is able to get the information. Can you tell us a little bit about where we can expand this in the education field? >> Well, when we think about education, we've got to think about more than just accessibility for the deaf and hard of hearing, even though that's the primary mission that we have. We also need to think about English as a second language. We need to think about the benefit of the transcriptions that become available when you do captioning for sessions, whether they're training sessions, seminars, lectures, or whatever they are. Think about the fact that if you do realtime captioning for a lecture, let's say, not only do you make sure that the words are presented for those who speak English as a second language rather than a first language, but everybody has the transcript of that spoken session available. That's of great appeal, I believe, to the educational community, because it's effectively notes that everybody can use to better understand what happened. By the way, that's not confined to the education world, even though that's what we're talking about. Corporations are finding the same thing. We see a huge increase in corporations captioning their training sessions and keynote speeches and their seminars and their big meetings. Again, not only because they're presenting the information in another view -- that is, not just auditory but in the written word -- but because they also have the benefit of the transcripts that come from all of that, which I think is greatly important for the education community. >> Absolutely.
And we're almost out of time here with you today, but can you give us some information on how to get in contact with you, if someone's interested in reaching out? >> Well, I think the best way to contact us and find out more about us is just to go straight to our website. We take a lot of pride in what we put out there. It's vitac.com, and you can find out all about us. You can contact us, you can get a hold of us there, and you can see all the different things we do and all the different kinds of programs that we offer in the marketplace. And, by the way, just to make sure everybody knows this: getting captions on your files is, A, easy; B, quick; and C, not all that expensive, when you think about the quality and the value you get out of it. So, I just want to make sure that everybody knows that, and again, just come see us at vitac.com. We'd love to help you out. >> Thank you so much for your time, John. It was a very interesting interview, and I think it's going to be a huge benefit to our listeners. Thank you for joining us on this edition of "Bars and Tone." >> Great. Thank you very much. >> John Capobianco, thanks for joining us here today. Now, Hal, we've heard a lot of things here today, but when it gets right down to it, captioning shouldn't be something that's "Oh, my gosh, I have to go caption this stuff." It should be something that we want to do. >> Right. So, Mark Chen said something really important, which is that captioning is something that benefits all of us. We often think of captioning for Section 508 compliance, for accessibility, and also because of broadcast guidelines. But, really, when we're watching a video in a noisy environment and we turn captioning on, suddenly we're able to follow the video along. My dad, for instance, was hard of hearing. He was functionally deaf. He could follow conversation, but for him, captioning was a godsend. In fact, he sought out theaters that provided captioning equipment, which some actually do. So, he could go to a movie theater and follow along with the movie along with everyone else. So, when we think of captioning, we often think of captioning for a special case, but the reality is that captioning is something that impacts all of us. So, it's really something that you want to do for your own work, but you also want to be an advocate for other people, as well. >> And, you know, it's becoming easier and easier to do. Heck, you can just drop the files onto Vimeo or YouTube, and with Final Cut now, you can caption right inside the NLE. So, it's really becoming easier. It's becoming much more of a commodity for colleges, universities -- really, everybody -- to be able to do. >> Yes, actually, you know, one of the things that you can probably take from the conversation we've had today is that there are standards in place for doing captioning that are easy to follow, and now you've got multiple paths in terms of being able to get your captioning. You can do captioning yourself for short-form content. Certainly there are tools now -- like MovieCaptioner and tools like that -- that allow you to do it. But if that is a burdensome effort for you, there are, as you have heard, commercial services that can handle captioning for you, and in most cases fairly reasonably priced. The price of captioning in general has come down a whole lot. And, again, the technology is there now. One thing that we always have to keep in mind, though, is that machine transcription is still not quite there yet.
And so, while it can be useful for things such as keyword searching and stuff like that, we're not at a point where machine transcription gets you 100% accuracy. It's still just not quite there. >> Our thanks to John Capobianco. He is the chief marketing officer at Vitac, vitac.com. Mark Chen, rev.com. He is the vice president of product at Rev. And Daniell Krawczyk, the founder and president of Municipal Captioning. You can get to them at MunicipalCaptioning.com, or you can e-mail him. And it says DanK@MunicipalCaptioning.com, but the way I'm going to remember that is it also spells "dank." DanK@MunicipalCaptioning.com. Everybody have a great Fourth of July week. Any final thoughts? >> I have none, other than be sure that you grill and don't burn the hot dogs. >> All right. For Hal Meeks and Brandon Boucher, I'm B.J. Attarian. We will see you next time right here on the "Bars and Tone" podcast.