From: Bob C. <bob...@sp...> - 2002-02-13 19:27:37
> I have three points: (1) you don't need to use all
> of VoiceXML -- it can be used just as an ASR/TTS engine,
> (2) VoiceXML cleanly separates dialog, ASR and TTS,
> and (3) states + memory = Turing machine, so I'd
> contend that "state-based" systems impose *no* limitations.
>
> VoiceXML 2.0, as I read the spec, interfaces to the output of SR and
> the input of TTS; I don't see how it can be used as an ASR/TTS
> engine.

Sasha Caskey (and a lot of the other members of our VoiceXML team) helped me get my head around the problem. There are two parts to VoiceXML, the client side and the server side. The speech rec and TTS are on the client, whereas the dialog control can be handled fully on the server side. Of course, control can be largely ceded to the form interpretation part of the VoiceXML client, but where's the fun in that?

If you set up the server to spit out very simple VoiceXML that just asks the client to do a single turn of TTS or reco and then call back to the server, you can handle everything with session state on the server side, for instance by using Java servlets.

The following two well-formed VoiceXML docs, for example, just play a wave file and play sound using TTS, respectively. Critically, they call back to the application URL when they're done. To start the app, the client requests a page from the URL:

    http://www.foo.com/app.vxml

The app will start tracking the session, see that this is the first turn, and do the right thing in terms of prompting and/or recognizing.
To "play a wave file", the server just sends the client the following:

    <vxml>
      <form id="foo1">
        <block>
          <audio src="http://www.foo.com/bar.wav"/>
          <goto next="http://www.foo.com/app.vxml"/>
        </block>
      </form>
    </vxml>

Playing a prompt with TTS is just as easy:

    <vxml>
      <form id="foo2">
        <block>
          <prompt>Welcome to foo dot com.</prompt>
          <goto next="http://www.foo.com/app.vxml"/>
        </block>
      </form>
    </vxml>

Now the client will play the audio and then, critically, make an HTTP request back to the application because of the goto. With servlets (or other things), you can manage session state on the server side. You can send data encoded in the URL or with an explicit data request (just like for a web page -- this is just HTTP after all!).

To do a recognition, the server generates the following:

    <vxml>
      <form id="foo3">
        <field name="field3">
          <grammar src="...include a pointer to your grammar...">
            ... or you can just include the grammar here ...
          </grammar>
          ... your parameters for the reco turn go here ....
          <prompt/>
          <nomatch>
            <goto next="http://www.foo.com/app.vxml?result=__nomatch__"/>
          </nomatch>
          <noinput>
            <goto next="http://www.foo.com/app.vxml?result=__noinput__"/>
          </noinput>
          <filled>
            <goto expr="'http://www.foo.com/app.vxml?result=' + field3"/>
          </filled>
        </field>
      </form>
    </vxml>

Same thing as with TTS. The server sends this page to the client. The server specifies the grammar, which it can either inline for dynamically generated grammars or reference with a URL for precompiled grammars. (The client typically caches those.) The server can also specify language model weightings, timeout timings, rejection thresholds, etc.

When the client reads the form, it'll load the grammar (which might be cached and compiled), execute a recognition turn, and then immediately pack up the results and send another request to the server. Note the call to ECMAScript on the client side to compute the URL of the return request; alternatively, you'd use an explicit request and post the data.
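On the server side, generating these one-turn pages is mostly string assembly. Here is a minimal sketch in plain Java (no servlet API, so it stands alone); the `TurnPages` class, its method names, the `version="2.0"` attribute, and the escaping policy are all assumptions made for illustration, not the code the team actually ran. Note that `URLEncoder` turns a space into `+`, which is just one possible encoding of the recognition result.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that renders single-turn VoiceXML pages.
// Names and the version attribute are illustrative assumptions.
final class TurnPages {
    private TurnPages() {}

    // Escape text for XML attribute values and element content.
    static String xml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Wrap one block-level element in a form that calls back to appUrl.
    private static String page(String formId, String body, String appUrl) {
        return "<vxml version=\"2.0\">\n"
             + "  <form id=\"" + xml(formId) + "\">\n"
             + "    <block>\n"
             + "      " + body + "\n"
             + "      <goto next=\"" + xml(appUrl) + "\"/>\n"
             + "    </block>\n"
             + "  </form>\n"
             + "</vxml>\n";
    }

    // Page that plays a recorded wave file, then returns to the app.
    static String playWave(String formId, String wavUrl, String appUrl) {
        return page(formId, "<audio src=\"" + xml(wavUrl) + "\"/>", appUrl);
    }

    // Page that speaks text with TTS, then returns to the app.
    static String speak(String formId, String text, String appUrl) {
        return page(formId, "<prompt>" + xml(text) + "</prompt>", appUrl);
    }

    // Callback URL carrying a recognition result as a query parameter.
    static String resultUrl(String appUrl, String result) {
        return appUrl + "?result="
             + URLEncoder.encode(result, StandardCharsets.UTF_8);
    }
}
```

A servlet (or anything else that answers HTTP) would write the returned string as the response body; the escaping matters as soon as prompts or URLs contain `&` or `<`.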
Let's say we're in the travel task and have recognized "yes boston" with high confidence. The server then sees its next request in the form:

    http://www.foo.com/app.vxml?result=yes_boston

if you're using URL encoding. The server then retrieves the state of the dialog from the session and acts conditionally on the result, which it can also retrieve from the URL. (Note that timeout and rejection are also reported. Their thresholds are set in the location indicated in the field above.) Then it's basically just a Java program figuring out what to do.

One possibility is to have it hooked up to the hub. When the VoiceXML server gets results from the VoiceXML client, it then acts as a Hub server and posts them. The relevant hub dialog server picks up the ball, figures out what to do, and can post an answer back. The answer gets picked up through the hub by the VoiceXML server servlet, translated into VoiceXML and shipped back to the client for some more ASR and TTS.

The real hassle is that you can't keep a program stack between calls. The program that computes what to do next in the VoiceXML server has to exit. State's maintained through an external object, like other callbacks. Thus rather than being able to make a call to recognition in the middle of some procedure and wait for the result, you need to re-enter the dialog server each time and take an action based on the session state and the answer you received. This can be hidden pretty well, but not entirely. We even built a framework that allows recursive specifications where the ASR/TTS look just like function calls, but clever behind-the-scenes state and stack management make sure everything will work in the required client/server configuration.

Piece of cake :-) The only tricky part conceptually is the notion that the client always makes the request and the server can't maintain any program state -- just data state.
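The "data state only" constraint can be sketched as a handler that is re-entered once per request and decides what to do purely from what is stored in the session. This is a minimal illustration in plain Java: the `DialogServer` class, its states, the prompts, and the tiny travel dialog are hypothetical, and a `Map` stands in for a servlet session. The `__nomatch__` and `__noinput__` values are the ones used in the recognition page's callback URLs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical re-entrant dialog server: no program stack survives between
// requests, so every call resumes from data kept in the session.
final class DialogServer {
    enum State { START, ASK_CITY, CONFIRM, DONE }

    // Per-caller data state; stands in for a servlet session object.
    static final class Session {
        State state = State.START;
        String city;
    }

    private final Map<String, Session> sessions = new HashMap<>();

    // One HTTP request = one call. `result` is null on the first turn,
    // otherwise the value of the ?result= parameter from the client.
    // Returns a description of the next turn to render as VoiceXML.
    String handle(String sessionId, String result) {
        Session s = sessions.computeIfAbsent(sessionId, id -> new Session());
        switch (s.state) {
            case START:
                s.state = State.ASK_CITY;
                return "RECO: Where would you like to go?";
            case ASK_CITY:
                if (result == null || result.equals("__nomatch__")
                        || result.equals("__noinput__")) {
                    // Stay in the same state and re-prompt.
                    return "RECO: Sorry, where would you like to go?";
                }
                s.city = result;
                s.state = State.CONFIRM;
                return "RECO: Did you say " + s.city + "?";
            case CONFIRM:
                if ("yes".equals(result)) {
                    s.state = State.DONE;
                    return "TTS: Booking a trip to " + s.city + ".";
                }
                s.state = State.ASK_CITY;
                return "RECO: Where would you like to go?";
            default:
                return "TTS: Goodbye.";
        }
    }
}
```

Each `handle` call returns and exits, which is exactly why the procedure cannot simply block on a recognition call; anything it needs on the next turn has to be written into `Session` first. A framework that makes ASR/TTS look like function calls would have to save an explicit stack in that same session object.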
And I know it works, because we're doing it -- Sasha Caskey built most of it, in fact, porting a design and implementation of Roberto Pieraccini's and his for running in-the-skins (where reco and TTS really were function calls).

- Bob