
January 12, 2006

Speech Recognition and the trough of disillusionment

In 1995 Gartner came up with what they call The Hype Cycle to explain how new technologies get hyped, fall out of favor with the press, and then ultimately (sometimes) go on to be mainstream.  One phase is the Trough of Disillusionment, and I believe that Speech Recognition may be in the trough now.  All great technologies must go through it.  Even as the technology continues to improve and some amazing things are happening, it seems to me that some people are getting tired of hearing how great it is going to be and they just want it to understand everything they say with little tolerance for errors.

There are two issues that have little to do with the science of speech recognition.  The first is Human Factors.  (I capitalize it because I believe it is so important.)  No one would disagree with me that Human Factors is important, but we still see applications being built that seem to go out of their way to make life difficult for the user.  That's another soap box for another time - I'll just say that it is very hard to make something very simple, but it is worth the effort.

The other issue is documentation, or at least expectation setting.  If you encounter speech recognition on the telephone, there is almost never documentation in hand for what the system can understand, and since we're years away from a system that can understand everything a person might say (hey, people can't even do it!) you have to guess at what you might be able to say, or you have to wait for the system to prompt you.

Lately I've been trying to surround myself with speech recognition, just to live with it and understand what works and what doesn't work.  I have "Wise Crackin' Shrek and Donkey" and all sorts of gadgets that do speech recognition.  My latest is the new Palm Treo 700w, which is a Palm Treo phone that runs Windows Mobile.  Sort of like an Intel-based Apple - they both came out this month, causing many people to wonder if in fact Hell has frozen over.  My very first Palm was made by U.S. Robotics and until switching to the Pocket PC a few years ago I always liked the Palms, so I was happy to have the best of both worlds when the Treo came out last week.

I quickly loaded a cool little application called Microsoft Voice Command.  (I think it comes with it - not sure)

It's been around for a while and runs on Windows Mobile and Pocket PC Phone Edition.  You push a button on the phone and then speak to it.  You can say "Call Terry Gold at work", or use the name of anyone else in your contacts.  No training required.  I tried 20 different names and it got every one right except for "Dan DeGolier".  I have over 900 contacts, so there were a lot to differentiate.  Then I looked in my contacts and saw that I had Dan in as "Daniel B. DeGolier".  When I changed it to "Dan DeGolier" and let it automatically sync, it immediately got it right.  (Sorry Dan, the text-to-speech still makes a mess of your last name.)
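I don't know how Voice Command builds its list of names, but the Dan/Daniel problem suggests one thing a contact list could do on its own: generate the forms a person is likely to say.  Here is a minimal sketch in Python.  The nickname table and the spoken_variants function are my own inventions for illustration, not anything from the product.

```python
import re

# Hypothetical nickname table for illustration; a real one would be much larger.
NICKNAMES = {"daniel": ["dan", "danny"]}

def spoken_variants(contact_name):
    """Return the name forms a person might plausibly say for a contact."""
    # Drop middle initials such as "B." so "Daniel B. DeGolier" becomes "Daniel DeGolier".
    parts = [p for p in contact_name.split() if not re.fullmatch(r"[A-Za-z]\.?", p)]
    if len(parts) < 2:
        return [" ".join(parts)]
    first, rest = parts[0], " ".join(parts[1:])
    variants = [f"{first} {rest}"]
    # Add one variant per known nickname of the first name.
    for nick in NICKNAMES.get(first.lower(), []):
        variants.append(f"{nick.capitalize()} {rest}")
    return variants

print(spoken_variants("Daniel B. DeGolier"))
# ['Daniel DeGolier', 'Dan DeGolier', 'Danny DeGolier']
```

With something like this, "Dan DeGolier" would have been a candidate without my editing the contact by hand.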

It isn't just for speed dialing though.  You can say things like "What are my appointments", "What calls have I missed", and even "Start Program", where Program is any program you have loaded on your phone.  I'm going to see if I can do most of the command and control using just speech recognition.  This is what Bill Joy once called "prototyping the future."  You figure out some way to live with the technology of the future, and that lets you think even farther ahead.

But back to the second challenge of great speech recognition.  The one thing it couldn't recognize was me saying "Display Terry Gold".  According to the website, that is supposed to bring up my contact.  It wouldn't work on any other name either.  Determined to make it work, I kept at it.  No matter how carefully I spoke it, up would pop Media Player and Bill Monroe would start to sing "Long Black Veil".  Bill Monroe is the Father of Bluegrass music and I'm an amateur Bluegrass mandolin player, so I took it as a great compliment that Voice Command was getting us confused.  After all, we did grow up only 37 miles and 50 years apart.

Since Voice Command had worked so well up to this point, I didn't give up.  After figuring out that I could say "Help" and then "Contacts", I realized that the software was actually looking for me to say "Show Terry Gold", not "Display Terry Gold."

My guess is that the developers realized late in the product life cycle that "Display" and "Play" were too similar, especially for guys like me who pronounce "Display" as two words - "Dis" "Play".  "Terry Gold" sounds enough like "Bill Monroe" that I can see that mistake.  The documentation on the web didn't get changed, and now some people are having a bad experience through no fault of the technology.  The product is so great, though, that hopefully this won't turn anyone off.  I'll see if I can get in touch with them to point out the typo.
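A conflict like "Display" versus "Play" is the kind of thing a simple check on the command list could flag early.  Here is a rough sketch in Python that reports any command verb that ends with another verb.  The verb list is made up for the example and is not Voice Command's real grammar, and comparing spellings is only a crude stand-in for comparing how words actually sound.

```python
from itertools import combinations

def confusable_pairs(verbs):
    """Return (short, long) pairs where the longer verb ends with the shorter one."""
    pairs = []
    for a, b in combinations(sorted(verbs), 2):
        short, long_ = sorted((a.lower(), b.lower()), key=len)
        # "display" ends with "play", so a recognizer may hear one as the other.
        if long_.endswith(short):
            pairs.append((short, long_))
    return pairs

# Hypothetical verb list for illustration only.
verbs = ["call", "show", "play", "display", "start", "help"]
print(confusable_pairs(verbs))
# [('play', 'display')]
```

Swapping "display" for "show" in that list makes the check come back empty, which matches what the software ended up doing.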

Simply having a cheat sheet is a great help with speech recognition devices.  That's how I found this mistake - I was making my own little cheat sheet.  It is easy enough to just ask for help, but I wanted something on paper that I could have on my desk until I figured out the common commands that I would be using.

I have another application that can recognize hundreds of commands, and it does a great job, but the documentation listed all of the commands in alphabetical order.  Again I made my own cheat sheet of the ten commands that I cared about, and now I can't imagine not using the product.  I'll bet most people tried it, didn't know exactly what to say, and gave up on it.  I'll write about that one another day.

When I first learned the vi editor, someone gave me a dog-eared card of the most common commands.  It made all the difference in the world and I was soon raving about how superior vi was to any other editor in the world, especially Emacs.  All because of that card.  Until speech recognition advances to the point where we really can just say anything, let's see more cheat sheets, more obvious commands and help prompts that don't make the user feel like an idiot.

January 12, 2006 in Speech Recognition | Permalink




Great post Terry, especially the stuff about the Trough of Disillusionment. It seems like an especially deep trough for Speech Rec when companies start to use "press 0 for a live person" as a way to differentiate their service. Have you seen that commercial? I forget the company. Validation that the disillusionment with Speech Rec is "troughing".

I don't think we've met, although I've been to your company to demo some stuff a couple of years back. My company, Finali, was at the beginning of entering the speech rec biz when we decided to sell to Convergys. I am a big believer that successful Speech Rec is almost entirely about design, and only marginally about the technology. Can't wait for you to post on that topic.

Posted by: Daniel Burgin | Jan 16, 2006 9:17:16 AM