I listen to a lot of podcasts. Some people might say I have a problem. Pocket Casts would confirm it. The app has clocked me at about two hours a day for the past year. That’s a lot of ideas and insights flying at my head. I decided to find a way to record the particularly tasty tidbits for posterity so I could catalog and share them with other inquiring MINDs.* Thus, I was thrust into the ever-evolving world of Automatic Speech Recognition (ASR), an increasingly AI-driven technology for converting spoken words into text. ASR is just one of the AI language services poised to revolutionize business. It’s already empowering people to work smarter.
The Promise of AI-Enabled Language Services
Most companies are drowning in phone calls, social media posts, emails, audio and video. Like podcasts, all this unstructured data contains good info. It’s trying to tell you something, but it’s trapped in a form that doesn’t easily lend itself to data analysis. Artificial intelligence helps you liberate your data, organize it, and put it to use.
Example: Let’s say you run a customer call center, and you’ve noticed an uptick in poor survey results. Who is going to listen to all those recorded conversations and try to figure out the problem? Will you have to drop what you’re doing and take it on yourself?
This is a perfect opportunity for AI-powered technology like ASR to augment your team. Imagine if you had a system to (1) transcribe all the calls, (2) identify keywords and use sentiment analysis to understand your callers’ needs and opinions, (3) dump all that info into a database, and (4) present it to you in a daily dashboard.
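To make the four steps concrete, here is a minimal Python sketch of that pipeline. Everything in it is hypothetical: `transcribe_call` is a stub standing in for a real ASR service, and the sentiment scoring is a naive keyword count, not a real sentiment-analysis model.

```python
# Hypothetical sketch of the four-step call-center pipeline:
# (1) transcribe, (2) keywords + sentiment, (3) store, (4) present.

def transcribe_call(audio_id):
    """Stub: a real system would send the call audio to an ASR service here."""
    canned = {
        "call-001": "I waited forty minutes and nobody could help me",
        "call-002": "the agent was great and fixed my billing issue",
    }
    return canned[audio_id]

NEGATIVE = {"waited", "nobody", "problem", "angry"}
POSITIVE = {"great", "fixed", "thanks", "helpful"}

def analyze(text):
    """Naive keyword spotting: count positive vs. negative words."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    keywords = sorted(words & (POSITIVE | NEGATIVE))
    return {"sentiment": score, "keywords": keywords}

def daily_dashboard(audio_ids):
    """Transcribe each call, analyze it, and collect rows for a daily report."""
    rows = []
    for audio_id in audio_ids:
        text = transcribe_call(audio_id)           # step 1: transcribe
        result = analyze(text)                     # step 2: keywords + sentiment
        rows.append({"call": audio_id, **result})  # step 3: store
    return rows                                    # step 4: present

report = daily_dashboard(["call-001", "call-002"])
for row in report:
    print(row["call"], row["sentiment"], row["keywords"])
```

In a production version, the stub would be replaced by a cloud speech-to-text call and the rows would land in a database feeding your dashboard, but the shape of the workflow is the same.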
Can you hear angels singing right now? I thought you’d be impressed. Kinda makes you want to learn more about this tech so you can make it work for you. I’m on it.
Speech to Text: Challenge & Progress
ASR is not new. It actually has a long and storied history dating back to the 1950s. (If you’re a tech history buff, you’re gonna want to follow that link, especially if you appreciate archival footage.) But it didn’t hit the consumer market until 1990, when Dragon Dictate burst onto the business scene with its 80,000-word dictionary and $9,000 price tag.
Speech recognition has gotten considerably cheaper and significantly more accurate since then, but there is still room for improvement. This isn’t a simple technology we’re talking about here. Processing verbal communication can be difficult for two humans standing face-to-face. (Think husband/wife, parent/teenager.) Factor in a poor audio source, an accent or local dialect, overlapping speakers, and things get confusing quickly – for humans and computers.
Have you ever played the game Mad Gab? It’s an excellent exercise in the importance of context to speech recognition. Players read phrases off cards, and their team members must translate the gibberish into coherent ideas. With a little trial, error and luck, “These if hill wore” becomes “the Civil War.”
Now I’ve got one for you: Mayan dough for May sheens. (Hint: It’s not an ancient beauty secret.)
Context is key to understanding the spoken word, and it’s something computers didn’t have much of until recent AI advancements. Now, cloud providers like AWS and Microsoft Azure offer ASR tools that “recognize speech” rather than “wreck a nice beach.” These powerful, accessible speech-to-text capabilities are ready for developers to integrate into their products.
ASR for Business
In addition to the call center example above, ASR offers lots of simple automations to make your work life easier. Use it to take meeting notes, subtitle videos, or transcribe your latest conference and make it available online for people to read at their convenience.
Speech to Text can also be part of a larger whole. You can string AI-enabled language services together to accomplish all kinds of tasks. AWS gives this example for translating foreign languages: “Use Amazon Transcribe to convert voice to text, send the text to Amazon Translate to translate it into another language, and send the translated text to Amazon Polly to speak the translated text.” So in the same workflow that subtitles videos, you could take those subtitles, translate them, turn them into spoken Spanish, and add a second video for Spanish speakers.
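That chaining idea is really just function composition, which a short Python sketch can illustrate. Each function below is a stub standing in for the corresponding cloud service (the real workflow would call Amazon Transcribe, Amazon Translate, and Amazon Polly); the function names, the tiny phrase dictionary, and the fake audio tag are all assumptions made for illustration.

```python
# Hypothetical sketch of chaining language services:
# speech -> text -> translated text -> speech.

def speech_to_text(audio):
    """Stub for an ASR service (e.g., Amazon Transcribe)."""
    return "welcome to the conference"

def translate_text(text, target_lang):
    """Stub for a translation service (e.g., Amazon Translate)."""
    tiny_dictionary = {
        "welcome to the conference": "bienvenidos a la conferencia",
    }
    return tiny_dictionary.get(text, text)

def text_to_speech(text):
    """Stub for a TTS service (e.g., Amazon Polly); returns a fake audio tag."""
    return f"<audio:{text}>"

def subtitle_and_dub(audio, target_lang="es"):
    subtitles = speech_to_text(audio)          # subtitles for the original video
    translated = translate_text(subtitles, target_lang)
    dubbed = text_to_speech(translated)        # audio track for the second video
    return subtitles, translated, dubbed

subs, translated, dubbed = subtitle_and_dub(b"fake-audio-bytes")
print(subs)
print(translated)
print(dubbed)
```

Swap each stub for the real API call and you have the subtitle-translate-dub workflow described above, with each service's output feeding the next one's input.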
If the idea of recording phone calls, meetings or training sessions for data analysis creeps you out a little right now, bear in mind all the cultural shifts in acceptable use we’ve seen just in the past 20-30 years. Back in the ‘90s, you never would’ve considered jumping into a stranger’s car. Now, you summon the stranger and their car, hop right in, and call it Uber or Lyft.
The Future is Coming Quick
ASR technology has come a long way since the ‘50s. It still has a long way to go, but we’ve seen the tremendous resources being poured into AI development, and there’s certainly no sign of investment letting up. If you have a need for speech-to-text technology, it may make sense to get in on the ground floor. You can keep building on a cloud-based system that will grow and improve as the major service providers learn more about dictation, conversation modeling, and interactive speech.
Nobody has a crystal ball, but I have Director of Emerging Technologies Tim Kulp right down the hall, so I’m doing better than most. When I asked Tim where AI-driven language services were headed, he of course went right to Star Trek’s universal translator. “In 2-3 years, we’ll be there: real-time, live translations; traveling the world talking with native language speakers. In 3-5 years, we’ll have technology that doesn’t just say what you said; it will say what you intended – smoothing out the bumps for people speaking a second language or anyone who struggles with verbal communication,” Tim predicts.
Speech to Text is the base building block for all the language services AI is starting to make possible. It’s wrangling data into a framework where you can play with it and learn from it. If you have a use case in mind, I’d love to help you explore it. Connect with me on LinkedIn or email to talk through your business and how AI could help transform the way you work.
If you’re still debating whether Speech to Text is for you, stay tuned for Part 2. The big AI providers (AWS, Google, IBM and Microsoft) are constantly innovating and improving their AI speech service offerings. In my next installment, we’ll explore ASR best practices: what you need to consider before embarking on an ASR project, tips for staying on top of the industry, and how to avoid making the wrong move.
Colin Reynolds is a modern-day philosopher and a self-professed “growth-minded productivity junky trying to automate everything.” Those two traits would seem to be at odds, but the more you automate, the more time you have to think. Thinking is one of Colin’s favorite pastimes. As a devotee of Cal Newport’s Deep Work, he consistently makes a point to turn off the distractions (yes, even the podcasts) and enjoy some dedicated thought time.
While finishing up his bachelor’s degree in philosophy at Towson University, Colin took a few business courses and poured himself into learning Excel. It was his voluntary application of those self-taught Excel skills at a college internship that landed Colin his first paying gig. Since then, he has continued to ingratiate himself with employers and clients alike by showing them cool things they didn’t know were possible.
Now, Colin has someone new in his life showing him things he didn’t know were possible. His 7-month-old daughter loves to take Colin and his wife on weekend hikes, drawing their attention to the sights and sounds of nature with enthusiastic coos of approval. Dedicated thought time may get harder to come by as family obligations increase, but where there’s a will, there’s a way (and it probably involves automation).
*If you're interested in Colin's speech-to-text podcast experimentation, explore here.