
February 08, 2005

Speech Recognition's Holy Grail

A reader named Ravi posted a comment to my post about speech recognition asking:

What are your thoughts on using VOIP to do dictation (as opposed to speaker independent, command based) recognition on the server side for an intranet deployed application? Has this been done with any success? Also, any idea on vendors with Java based server side solutions for this?

I think he's asking a question that's been on a lot of people's minds, with VoIP thrown into the equation.  In our world there is a technology called "Text to Speech".  It's the technology you would use to get your computer to read your screen, so you could listen to what I'm writing via a computer-synthesized voice.  In self-service telephone applications we use text to speech (TTS) to deal with situations where we can't use a prerecorded voice.  For instance, it isn't practical to record all the possible street addresses that might be needed in an account inquiry application.

TTS has evolved from the early days, when the voice sounded like a drunken Swede, to today, when the quality is so good that many people cannot distinguish TTS from a real human voice.  I predict that in the future we'll only use TTS, but that is still a few years off.

Here's where I answer Ravi's question.  People often turn TTS around and come up with Speech to Text.  If only it were as easy as rearranging the words.  Speaker-dependent speech to text has been around for years.  Most people have heard of the Dragon product or IBM's ViaVoice, both of which are now available from our partner ScanSoft.  Both are very good products, but both assume that the user will spend time training the recognition engine.  Microsoft has incorporated a simpler product into Windows XP, but I don't think they expect people to use it for heavy dictation like the ScanSoft products.

The Holy Grail, though, is speaker-independent speech to text.  In other words, a speech recognition engine that can understand whatever you say without any prior training by the user.  After all, it works in Star Trek, so it should work in real life. 

Speaker-independent recognition works today on a limited basis - it's the technology that we use to build telephone self-service applications.  We just finished an application for a large Midwestern insurance company and bank.  Their customers can call up and say things like "what's my checking account balance?" or "I want to transfer $300 from savings to checking tomorrow."  No more ugly and complicated touch-tone menus.  However, if the caller for some reason decides to say "By the way, how's the weather where you are?" our application will have no idea how to respond.  We're working with a very narrow domain of possible responses to our questions, and, done correctly, we get a high recognition rate.
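To make the "narrow domain" idea concrete, here's a toy sketch in Python.  It is purely illustrative - real telephone applications of this era would define the allowed utterances in a VoiceXML/SRGS grammar, not in regular expressions, and the intent names below are made up - but it shows the key property: in-grammar phrases map to an action, and anything outside the grammar (like weather small talk) simply fails to match.

```python
import re

# Each "intent" lists the utterance patterns the application is prepared
# to handle.  This stands in for a speech grammar: the recognizer only
# succeeds on phrases the grammar covers.
GRAMMAR = {
    "balance_inquiry": re.compile(
        r"\bwhat('s| is) my (checking|savings) (account )?balance\b"),
    "transfer": re.compile(
        r"\btransfer \$?\d+ from (checking|savings) to (checking|savings)\b"),
}

def match_intent(utterance: str):
    """Return the name of the matching intent, or None if the caller
    said something outside the grammar."""
    text = utterance.lower()
    for intent, pattern in GRAMMAR.items():
        if pattern.search(text):
            return intent
    return None

print(match_intent("What's my checking account balance?"))
print(match_intent("I want to transfer $300 from savings to checking tomorrow."))
print(match_intent("By the way, how's the weather where you are?"))
```

The first two calls match an intent; the weather question returns None, which is exactly the point - the application has nothing to say about the weather in Illinois.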

I believe we might see decent speaker-independent recognition engines with large vocabularies and high accuracy in the next five to ten years, but it isn't practical today.  By then VoIP will be everywhere, and it will be practical to distribute speech recognition throughout the network.

Next time you hear the recording "This call may be monitored for quality purposes", you just might be speaking to one of our human factors experts.  While tuning this same banking application, we discovered that some people didn't realize they weren't talking to a live human.  They went all the way through the call flow but would fail when they tried to make small talk with the computer.  We had to change the opening prompts to make it just a little clearer that they were in fact talking to a computer with very little information about the weather in Illinois!
