[Communicator-user] VoiceXML as ASR/TTS server mini-tutorial

>    I have three points:  (1) you don't need to use all
>    of VoiceXML -- it can be used just as an ASR/TTS engine,
>    (2) VoiceXML cleanly separates dialog, ASR and TTS,
>    and (3) states + memory = Turing machine, so I'd
>    contend that "state-based" systems impose *no* limitations.
> 
> VoiceXML 2.0, as I read the spec, interfaces to the output of SR and
> the input of TTS; I don't see how it can be used as an ASR/TTS
> engine. 

Sasha Caskey (and a lot of the other members of our VoiceXML
team) helped me get my head around the problem. 

There are two parts to VoiceXML, the client side and the server
side.  The speech rec and TTS are on the client, whereas the
dialog control can be handled fully on the server side. Of course,
control can be largely ceded to the form interpretation part of the
VoiceXML client, but where's the fun in that?  If you set up the server
to spit out very simple VoiceXML that just asks the client to
do a single turn of TTS or reco and then call back to
the server, you can handle everything with session state
on the server side, for instance by using Java servlets.

For instance, the following two well-formed VoiceXML docs
play a wave file and speak a prompt using TTS, respectively.
Critically, they call back to the application URL when they're
done.  To start the app, the client requests a page
from the URL:  http://www.foo.com/app.vxml

The app will start tracking the session, and will see that
this is the first turn, and will do the right thing in
terms of prompting and/or recognizing.

To "play a wave file", the server just sends the
client the following:

<vxml>
<form id="foo1">
<block>
<audio src="http://www.foo.com/bar.wav"/>
<goto next="http://www.foo.com/app.vxml"/>
</block>
</form>
</vxml>

Playing a prompt with TTS is just as easy:

<vxml>
<form id="foo2">
<block>
<audio> Welcome to foo dot com.</audio>
<goto next="http://www.foo.com/app.vxml"/>
</block>
</form>
</vxml>

Now the client will play the audio and then, critically,
make an HTTP request back to the application because
of the goto.  With servlets (or other things), you can manage
session state on the server side.  You can send data
encoded in the URL or with an explicit data request (just like
for a web page -- this is just HTTP after all!).
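
Here's a rough sketch of how the server side might generate those two
prompt pages.  It's plain Java with no servlet API, and the class and
method names (PromptPages, playWave, speak) are made up for
illustration; they're not from our actual code.  The only real work is
escaping the text and making sure the goto points back at the app URL.

```java
/** Sketch only: builds the one-turn prompt pages as strings. */
final class PromptPages {
    private PromptPages() {}

    /** Escapes the five XML special characters for content or attributes. */
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Page that plays a recorded wave file, then calls back to the app. */
    static String playWave(String wavUrl, String callbackUrl) {
        return page("<audio src=\"" + escapeXml(wavUrl) + "\"/>", callbackUrl);
    }

    /** Page that speaks text with TTS, then calls back to the app. */
    static String speak(String text, String callbackUrl) {
        return page("<audio>" + escapeXml(text) + "</audio>", callbackUrl);
    }

    // Wraps one audio element in the form/block/goto skeleton.
    // The form id "turn" is an arbitrary placeholder.
    private static String page(String audioElement, String callbackUrl) {
        return "<vxml>\n"
             + "<form id=\"turn\">\n"
             + "<block>\n"
             + audioElement + "\n"
             + "<goto next=\"" + escapeXml(callbackUrl) + "\"/>\n"
             + "</block>\n"
             + "</form>\n"
             + "</vxml>\n";
    }
}
```

A servlet would just write the returned string as the response body,
e.g. PromptPages.speak("Welcome to foo dot com.", "http://www.foo.com/app.vxml").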

To do a recognition, the server generates the following:

<vxml>
<form id="foo3">
<field name="field3">
<grammar src="...include a pointer to your grammar..."> 
    ... or you can just include the grammar here ... 
</grammar>
... your parameters for the reco turn go here ....

<prompt/>
<nomatch>
  <goto next="http://www.foo.com/app.vxml?result=__nomatch__"/>
</nomatch>
<noinput>
  <goto next="http://www.foo.com/app.vxml?result=__noinput__"/>
</noinput>
<filled>
  <goto expr="'http://www.foo.com/app.vxml?result=' + field3"/>
</filled>
</field>
</form>
</vxml>

Same thing as with TTS.  The server sends this page to the
client.  The server specifies the grammar, which it can either
inline for dynamically generated grammars or reference with a URL
for precompiled grammars.  (The client typically caches these.)
The server can also specify language model weightings, timeout
timings, rejection thresholds, etc.  When the client reads the
form, it'll load the grammar (which might be cached and compiled),
execute a recognition turn, and then immediately pack up the
results and send another request to the server (note the call
to ECMAScript on the client side to compute the URL of the
return request;  alternatively, you'd use an explicit request
and post the data).  Let's say we're in the travel task
and have recognized "yes boston" with high confidence.
The server then sees its next request in the form:

   http://www.foo.com/app.vxml?result=yes_boston

if you're using URL encoding.  The server then retrieves the
state of the dialog from the session and then acts conditionally
on the result, which it can also retrieve from the URL.  (Note
that timeout and rejection are also reported.  Their thresholds
are set in the location indicated in the field above.)
Then it's basically just a Java program figuring out what
to do.  One possibility is to have it hooked up to the
hub.  When the VoiceXML server gets results from the VoiceXML
client, it then acts as a Hub server and posts them.  The
relevant hub dialog server picks up the ball, figures out what
to do, and can post an answer back. The answer gets picked up
through the hub by the VoiceXML server servlet, translated
into VoiceXML and shipped back to the client for some more
ASR and TTS.
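
To make the result-handling step a bit more concrete, here's a sketch
of pulling the result off the query string and separating the two
sentinel values (__nomatch__ and __noinput__, as in the form above)
from a real recognition.  The RecoResult class and its Kind enum are
hypothetical names for illustration.  In a servlet you'd normally let
the container parse the query for you; this just shows the logic.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Sketch only: classifies the "result" parameter of a callback request. */
final class RecoResult {
    enum Kind { NO_MATCH, NO_INPUT, RECOGNIZED, MISSING }

    final Kind kind;
    final String text;  // recognized string; empty unless kind is RECOGNIZED

    private RecoResult(Kind kind, String text) {
        this.kind = kind;
        this.text = text;
    }

    /** Parses a raw query string such as "result=yes_boston". */
    static RecoResult fromQuery(String query) {
        if (query != null) {
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) continue;  // skip parameters without a value
                if (decode(pair.substring(0, eq)).equals("result")) {
                    return classify(decode(pair.substring(eq + 1)));
                }
            }
        }
        // First turn of a session: no result has been sent yet.
        return new RecoResult(Kind.MISSING, "");
    }

    private static RecoResult classify(String value) {
        if (value.equals("__nomatch__")) return new RecoResult(Kind.NO_MATCH, "");
        if (value.equals("__noinput__")) return new RecoResult(Kind.NO_INPUT, "");
        return new RecoResult(Kind.RECOGNIZED, value);
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);  // UTF-8 is always supported
        }
    }
}
```

A MISSING result is how this sketch recognizes the first request of a
session, where the client has asked for the app URL with no result attached.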

The real hassle is that you can't keep a program stack between
calls.  The program that computes what to do next in the
VoiceXML server has to exit.  State is maintained through an
external object, as with other callback-style code.
Thus rather than being able to make a call to recognition in
the middle of some procedure and wait for the result, you need
to re-enter the dialog server each time and take an action based
on the session state and the answer you received.  This can
be hidden pretty well, but not entirely.  We even built a framework
that allows recursive specifications where the ASR/TTS look just
like function calls, but clever behind-the-scenes state and stack
management make sure everything will work in the required
client/server configuration.
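
For what the plain (non-framework) version looks like, here's a minimal
sketch of a dialog step that gets re-entered on every request.  This is
not our framework; the DialogStep class, the state names, and the
"RECO ..." / "TTS ..." return strings are all invented for illustration,
and the Map stands in for whatever session object the servlet container
hands you.  The point is that each call reads the state, takes one
action, writes the state back, and returns.

```java
import java.util.Map;

/** Sketch only: one re-entrant dialog step with all state in the session. */
final class DialogStep {
    /** Hypothetical dialog states for a toy travel task. */
    enum State { START, AWAIT_CITY, DONE }

    /**
     * Called once per client request.  Returns a description of the next
     * page to send: "RECO <what>" or "TTS <text>".  Nothing survives
     * between calls except what is stored in the session map.
     */
    static String next(Map<String, String> session, String result) {
        State state = State.valueOf(
            session.getOrDefault("state", State.START.name()));
        switch (state) {
            case START:
                // First turn: ask for a city and remember that we asked.
                session.put("state", State.AWAIT_CITY.name());
                return "RECO city";
            case AWAIT_CITY:
                if (result == null
                        || result.equals("__nomatch__")
                        || result.equals("__noinput__")) {
                    return "RECO city";  // state unchanged, so we ask again
                }
                session.put("city", result);
                session.put("state", State.DONE.name());
                return "TTS Booking a trip to " + result;
            default:
                return "TTS Goodbye";
        }
    }
}
```

Compare that with the in-process style, where you'd just write
city = recognize(cityGrammar) in the middle of a method and carry on.
Here the "carry on" part has to live in a separate case of the switch.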

Piece of cake :-)  The only tricky part conceptually is the
notion that the client always makes the request and the server
can't maintain any program state -- just data state.  

And I know it works, because we're doing it -- Sasha Caskey built most
of it, in fact, porting a design and implementation of Roberto
Pieraccini's and his for running in-the-skins (where reco and TTS 
really were function calls). 

- Bob