From: Bob C. <bob...@sp...> - 2002-02-13 17:03:54
I have three points: (1) you don't need to use all of VoiceXML -- it can be
used just as an ASR/TTS engine, (2) VoiceXML cleanly separates dialog, ASR
and TTS, and (3) states + memory = Turing machine, so I'd contend that
"state-based" systems impose *no* limitations.

> >Peter Gober
> >
> >As somebody already also pointed out, I see the Communicator and VoiceXML
> >as being complementary: Though in principle it is possible to define a
> >voice dialog with the Communicator's hub scripting language, you will
> >typically want something more focused for that task. (That's why the CSLR
> >travel demo has a "dialog server".) VoiceXML seems to be predestined for
> >defining a dialog. On the other hand, one major strength of the
> >Communicator lies in efficiently and flexibly connecting different
> >components together.
> >
> >Another thing is politics: People who give you money tend to ask for
> >buzzwords that they have recently read in large letters on some magazine
> >cover. So I am convinced that it would be advantageous for the
> >Communicator to be describable as a "superset" of VoiceXML.
> >
> >So, I think it would be very nice to have a kind of VoiceXML-based dialog
> >server for the Communicator. (Actually, I found a presentation from Sam
> >from October 2000, where he suggested such a thing.)

(1) VoiceXML can also be used for something like Communicator as a simple
synthesis and recognition server. Any old HTTP server (e.g. Apache) can be
used as a VoiceXML server. In conjunction with a VoiceXML client (e.g.
IBM's or VXI), you get a speech recognition and synthesis engine. One
benefit is that you can use clients with telephony connections such as
VoiceGenie or Tellme, or you can build or buy your own. By running the
HTTP server through your favorite brand of CGI (e.g. Perl, servlets, ASP),
the service could communicate with the DARPA Communicator Hub as a server.
Then you're off and running with your favorite Hub-compliant dialog
management scheme.
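To make point (1) concrete, here is a minimal sketch (not from the original
post) of the kind of HTTP endpoint that could sit between a VoiceXML client
and the Hub: it generates a one-field VoiceXML page on the fly, and the
`submit` target is where a CGI script would forward the recognition result
to the Communicator Hub. The names (`make_vxml`, `/hub-relay`,
`cities.grxml`) are all hypothetical.

```python
# Sketch only: a toy HTTP server that plays the role of "any old HTTP
# server + CGI" serving dynamically generated VoiceXML. All names are
# hypothetical; a real deployment would forward results to the Hub.
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_vxml(prompt_text, submit_url):
    """Build a one-field VoiceXML page: play a prompt, recognize an
    utterance against a grammar, and POST the result to submit_url."""
    return f"""<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="utterance">
      <prompt>{prompt_text}</prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
      <filled>
        <submit next="{submit_url}" namelist="utterance"/>
      </filled>
    </field>
  </form>
</vxml>"""

class VoiceXMLHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every GET serves a fresh page; a CGI script at /hub-relay
        # would translate the posted result into a Hub frame.
        body = make_vxml("Where would you like to fly?",
                         "/hub-relay").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/voicexml+xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve:
# HTTPServer(("localhost", 8080), VoiceXMLHandler).serve_forever()
```

The VoiceXML client (IBM's, VXI, VoiceGenie, Tellme, ...) fetches the page,
does the ASR/TTS work, and posts back -- the server never touches audio.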
Of course, if you already have your components in place, and you're happy
with them, and your customers (e.g. DARPA) aren't seeing "VoiceXML" in
red, then I don't see any advantage to doing things this way.

> Alex Rudnicky
>
> As to the relationship between OpenVXI and Communicator, I would tend to
> agree that they are in some sense complementary. Probably a good way to
> think about it is that Communicator provides an exploded view of a dialog
> system, allowing the developer to work on what are nominally the
> component stages (understanding, dialog management, generation, etc.) of
> dialog; a vxml-based system conflates such components into a single
> representation.

(2) VoiceXML provides a way of specifying an entire dialog system in XML
(assuming you don't need any server-side data dips or messages to be sent
externally). But it cleanly separates ASR, TTS, and dialog management. It
allows an ASR engine to be configured for grammar and language model, and
provides some useful underlying functionality like running grammars in
parallel and presenting scored N-best results. It allows whatever text you
want to be fed into TTS. And while it supplies mechanisms for client-side
dialog control, you don't even have to use these if you'd rather do it
server side. It's like the choice between using JavaScript (client side)
or CGI (server side) to implement a web site.

In fact, the point of VoiceXML was to turn the components into commodities
so that customers could avoid being locked into proprietary systems (e.g.
SpeechWorks!). So it'd be easy to hook up a VoiceXML-based system and try
it with various recognizers. Ditto for TTS or voice talent comparisons.
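As a sketch of what "doing dialog control server side" over scored N-best
results might look like (the data shapes and thresholds here are invented,
not from the post): the client posts back an N-best list, and the server
decides whether to accept, confirm, or reprompt.

```python
# Hypothetical server-side dialog decision over a scored N-best list.
# nbest is a list of (utterance, confidence) pairs, best hypothesis
# first, as a VoiceXML client might POST back after recognition.
def choose_hypothesis(nbest, accept=0.80, confirm=0.45):
    """Return (action, utterance) for the next dialog move:
    'accept' it, 'confirm' it ("Did you say ...?"), or 'reprompt'."""
    if not nbest:
        return ("reprompt", None)          # no-match / no-input
    text, score = nbest[0]
    if score >= accept:
        return ("accept", text)
    if score >= confirm:
        return ("confirm", text)
    return ("reprompt", None)
```

The client stays a commodity ASR/TTS front end; the policy (thresholds,
confirmation strategy) lives wherever the CGI and the Hub live.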
> Alex Rudnicky
>
> From another perspective, there are serious limitations to state-based
> representations of dialog, particularly for more complex tasks such as
> Communicator's Travel Planning domain (we know this from experience, as
> we first implemented script-based dialog management then had to replace
> it with a more flexible architecture, Agenda).

(3) I would contend that there are *no* limitations to state-based
representations of dialog, assuming that we allow "state" + "memory".
That gives you a Turing machine. I think that's what people using
"state-based" dialogs do. At least it's what we do at SpeechWorks. We tend
to use states for recognition contexts, which involve specialized language
models and grammars (to increase accuracy in the sense of precision).
Programming languages that allow function calls are essentially
state-based in that the memory provides a stack of variable values that
can be used to return to a previous "state".

Now one might argue that having to manage all this by oneself is ugly, and
we'd agree, which is why we're working on frameworks that would "hide" the
management of state the way programming languages do. But the point is
really like that of (1). The use of states may not be helpful, and you may
code around them with a lot of use of memory and conditional action, but
they won't restrict what you can do.

- Bob

Bob...@Sp...
17 State St, 12th Fl, NY, NY 10004 USA
Vox: +1.212.425.7600  Fax: +1.212.425.8845
Web: http://www.colloquial.com/carp
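The "state + memory" point above can be sketched in a few lines (all names
here are illustrative, not SpeechWorks code): each state is a recognition
context, and a stack of saved states lets a subdialog return to wherever it
was invoked from, exactly the way a function-call stack does.

```python
# Minimal "state + memory" dialog manager: states are recognition
# contexts; the stack is the memory that makes subdialog call/return
# work like function calls. Illustrative only.
class DialogManager:
    def __init__(self, start):
        self.state = start   # current recognition context
        self.stack = []      # the "memory": saved states
        self.slots = {}      # values collected so far

    def call(self, substate):
        """Enter a subdialog, remembering where to come back to."""
        self.stack.append(self.state)
        self.state = substate

    def ret(self):
        """Return from the subdialog to the saved state."""
        self.state = self.stack.pop()

# Usage: a low-confidence result triggers a confirmation subdialog.
dm = DialogManager("get_city")
dm.slots["city"] = "Boston"
dm.call("confirm_city")   # now in the confirmation context
dm.ret()                  # confirmed; back in "get_city"
```

Add conditional transitions driven by `slots` and this is a Turing-complete
scheme; the frameworks mentioned above would just hide the bookkeeping.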