From: johny j. <joh...@gm...> - 2010-01-10 23:50:05
Thanks for your answer! Summing up the most crucial points from what I wrote below: is there any way to move the whole logic out of the vxml and grammar files into a .java file? If there is, could you give some example code showing this approach, please? Could you also write a bit more about the "analyze the raw results" and "select the name of a wav file" parts?

------------------------

You say you are not a vxml expert. Unfortunately, neither am I. I tried to find some kind of internet forum or mailing list for vxml users, but I couldn't. One of those "forums" didn't actually contain a forum, and the other was a forum for the 'Java VoiceXML Interpreter' application. Do you know of any vxml forum?

I found this great vxml example: http://www.w3.org/TR/voicexml20/#dml2.1.4 (the second code sample in the section, the one about credit card information). It uses digits.grxml (http://mail-archives.apache.org/mod_mbox/jakarta-taglibs-dev/200506.mbox/%3C14269939D726BB43BAD3D11BBE5B8B0C6FA361@ukflumail01.FLUENCYVOICE.LOCAL%3E). From this example it looks like I can ask the user to speak twelve digits, but it doesn't help with adding the control-sum calculation to the application. So I thought it might be a good idea to use some kind of inline script in vxml. I found http://www.w3.org/TR/voicexml20/#dml1.5.3, which uses a value returned by acct_info.vxml#basic, which in turn I found here: http://msdn.microsoft.com/en-us/library/bb857574.aspx . I also found http://www.w3.org/TR/voicexml20/#dml5.3.12, about using inline scripts. (Calculating the control sum is not the only thing that is hard to do from the vxml side; I also need to save the recognized digits to a database or a text file.) By the way, can I use ECMAScript in a vxml file with Cairo/Zanzibar?

I also found http://www.w3.org/TR/voicexml20/#dml2.3.4 (the part which says "Sorry, your credit card"), which validates the credit card information. It is somewhat similar to my application, which also needs to check something (a control sum) based on a sequence of digits.
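To make the control-sum part more concrete: I don't have a final checksum specification yet, so the sketch below uses the standard Luhn mod-10 check purely as a placeholder (my real scheme may well differ), together with the "save the digits to a text file" part I mentioned. The class and method names are all just my own made-up example:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DigitUtils {

    /** Placeholder control sum: the standard Luhn mod-10 check.
        Swap in the real checksum once it is specified. */
    public static boolean controlSumOk(String digits) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = Character.digit(digits.charAt(i), 10);
            if (d < 0) return false;   // non-digit character -> reject
            if (doubleIt) {            // double every second digit from the right
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return digits.length() > 0 && sum % 10 == 0;
    }

    /** Appends an accepted digit sequence as one line to a plain text file. */
    public static void saveDigits(Path file, String digits) throws IOException {
        Files.write(file,
                (digits + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

If ECMAScript in vxml turns out to be usable with Cairo/Zanzibar, the same mod-10 loop should translate into an inline `<script>` almost line for line.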
I just wonder how to use it in my case; the above still doesn't help much. Life would be much easier for me if I could simply specify a grammar of the form <list_of_words> = first_word | ... | last_word, then have a string variable in the source code containing the word that was actually recognized, and do everything in the Java source, without any vxml or grammar files. It wouldn't be an elegant solution, and the only difficulty would again be playing wav files as answers. Of course I would also need a loop which checks whether 'String r != ""' and lets me read the string and proceed with it if it is not null. That would involve some variables to store the current position. Is there any way to move the whole logic into a .java file? Doing that, together with being able to play wav files from inside it, would save a lot of the time and effort of trying to do the same things with vxml. It looks like the easiest way, if it is possible. If it is, could you give some example code showing this approach, please?

Anyway, let me show what I've got now:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd">
  <form id="get_sequences_of_digits">
    <block>Welcome to the application. You can speak twelve digits, then say next.
      If you make a mistake, you can simply say the word mistake
      /that's another difficulty, adding this feature/.
      After you say twelve digits, the control sum will be checked. If it is all right,
      you will be informed about it and allowed to say the next twelve digits.
      If it is not correct, you will be asked whether you would like to repeat the
      digits or accept them with an improper code. If you say fewer or more than
      twelve digits and then say next, you will also be asked to accept or reject
      the number. You can exit the application by saying exit.</block>
    <field name="twelve_digits">
      <grammar type="application/srgs+xml" src="/grammars/digits.grxml"/>
      <prompt bargein="false"><audio src="/audio/speak_twelve_digits.wav"/></prompt>
      <!-- from the example dialog it looks like the 'if' I erased from here wasn't used -->
      <filled>
        <!-- the code below is useless, I can't specify it properly -->
        <if cond="control_sum_is_ok != true">
          Control sum improper. Would you like to accept it anyway?
          <clear namelist="twelve_digits"/>
          <throw event="nomatch"/>
        <else/>
          Control sum is OK. You can proceed with a new set of digits.
          <clear namelist="twelve_digits"/>
          <throw event="nomatch"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>

And the code of my version of digits.grxml:

<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:nuance="http://voicexml.nuance.com/grammar"
         version="1.0" mode="voice" xml:lang="en-US"
         tag-format="Nuance" root="digit">
  <!-- http://www.apache.org/licenses/LICENSE-2.0 -->
  <rule id="digit">
    <one-of>
      <item> zero <tag> return("0") </tag></item>
      <item> jeden <tag> return("1") </tag></item>
      <item> dwa <tag> return("2") </tag></item>
      <item> trzy <tag> return("3") </tag></item>
      <item> cztery <tag> return("4") </tag></item>
      <item> piec <tag> return("5") </tag></item>
      <item> szesc <tag> return("6") </tag></item>
      <item> siedem <tag> return("7") </tag></item>
      <item> osiem <tag> return("8") </tag></item>
      <item> dziewiec <tag> return("9") </tag></item>
    </one-of>
  </rule>
</grammar>

--------------------------

I still wonder why, in your first mail, you suggested using rserver, transmitter1, receiver1 and asteriskConnector rather than launch.sh, if launch.sh is supposed to run all four things.

> You can always try to parse and analyze the raw results
> In any case, the code that analyzes the results (either the raw results or the tags) can select the name of a wav file that will be passed into the next call to playAndRecognize(..)
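To make my question concrete, here is roughly the all-in-Java loop I imagine, building on that quoted suggestion. I don't know the real Cairo/Zanzibar client API, so the SpeechSession interface, the playAndRecognize signature and all the wav file names below are invented by me; only the grammar words and the mistake/next/exit commands come from my description above:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client interface -- the real Cairo/Zanzibar API may differ;
    playAndRecognize(..) is only known to me from the quoted suggestion. */
interface SpeechSession {
    /** Plays a wav prompt, then recognizes one utterance; returns the raw result. */
    String playAndRecognize(String wavFile, String grammarUrl);
}

public class DigitDialog {

    // Fallback mapping for raw Polish words, in case the grammar
    // tags ("0".."9") are not returned by the recognizer.
    private static final Map<String, Character> WORDS = new HashMap<>();
    static {
        String[] w = {"zero", "jeden", "dwa", "trzy", "cztery",
                      "piec", "szesc", "siedem", "osiem", "dziewiec"};
        for (int i = 0; i < w.length; i++) WORDS.put(w[i], (char) ('0' + i));
    }

    /** Collects twelve digits; returns the accepted sequence, or null on "exit".
        All wav file names below are placeholders. */
    public static String collect(SpeechSession session) {
        StringBuilder digits = new StringBuilder();
        String prompt = "/audio/speak_twelve_digits.wav";
        while (true) {
            String r = session.playAndRecognize(prompt, "/grammars/digits.grxml");
            if ("exit".equals(r)) {
                return null;
            } else if ("mistake".equals(r)) {
                if (digits.length() > 0) digits.deleteCharAt(digits.length() - 1);
                prompt = "/audio/last_digit_deleted.wav";
            } else if ("next".equals(r)) {
                // Here the control sum would also be checked before accepting.
                if (digits.length() == 12) return digits.toString();
                prompt = "/audio/wrong_number_of_digits.wav";
            } else {
                Character d = WORDS.get(r);              // raw word, e.g. "dwa"
                if (d == null && r.length() == 1
                        && Character.isDigit(r.charAt(0))) {
                    d = r.charAt(0);                     // or a tag, e.g. "2"
                }
                if (d != null) digits.append(d);
                prompt = "/audio/say_next_digit.wav";
            }
        }
    }
}
```

With something like this, the checking, the database/text-file saving and the position tracking would all be ordinary Java, and at least the vxml file could go away entirely.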
May I ask for some further tips on how to implement it, please?

> Given a number like 1234, will your grammar allow for "one two three four" or "one thousand two hundred and thirty four" or "twelve thirtyfour"

I would only allow the ten single digits. You are fortunate, because good-quality English acoustic models already exist on VoxForge. I'm not that fortunate (I don't use English), and the only thing I can do is create a speaker-dependent model. I don't want to force users through the amount of training that a more sophisticated acoustic model would require.

> Sorry about the bug, can you let me know what you fixed, so I can check it in of others. Maybe send the fixed file?

I changed only one character in one short file (/zanzibar-0.1/bin/cairo/rserver.sh), from:

sh ..\launch.sh $CLASS -sipPort 5050 -sipTransport udp

to:

sh ../launch.sh $CLASS -sipPort 5050 -sipTransport udp

In the first case it told me that it cannot find ..launch.sh.

> Also do you remember which script had trouble? There should be no need to copy things around like that.

As far as I remember it was /zanzibar-0.1/bin/cairo/transmitter1.sh.

> if you can send the dialplan, I can take a look too?

It is just the original dialplan of a newly installed Asterisk. The only changes I have made are as follows:

sip.conf (http://www.spokentech.org/openivr/aik.html):

[Zanzibar]
type=peer
host=localhost   ; I wasn't sure what to write here
port=5090
dtmfmode=info
canreinvite=no

extensions.conf (Is [demo] the proper place to put it? Would I need to change it somehow in the future, when I will be using a SIP provider account with a PSTN number instead of the Twinkle softphone?):

exten => 1001,1,SIPAddHeader(x-channel:${CHANNEL})
exten => 1001,n,SIPAddHeader(x-application:basic|org.speechforge.apps.demos.Parrot)
exten => 1001,n,Dial(SIP/Zanzibar)

And now I connect from Twinkle to 1001@127.0.0.1. This time I hear "connecting" but it never reaches the "established" state.
In the future I will use a number from the SIP provider, which gave this example configuration: http://forum.ipfon.pl/index.php?topic=64

> I am interested in building models too. Have you had much success yet?

I created all the files required to build an acoustic model: a list of phonemes for my language, a list of words with their transcriptions, a language model, wav files with transcriptions for both training and testing, and so on. I also configured SphinxTrain, ran all of those perl scripts (there were some difficulties with that), created a model from those files, tested it with the decode.pl script and a test package (my wav files with their transcriptions), obtained good results for a speaker-dependent model, and packed it into a jar file. Now I'd like to test it in Cairo/Zanzibar instead of using the WSJ model. If you've got any questions about building models, just ask. If I know the answer, I'll share it for sure.

Let me also ask about the process of compiling. Would it be just like this:

1. I extract cairo-rtp-0.2-src.tar.bz2,
2. I create my application in src (java/.../demos -> myapp.java, resources -> grammar and wav files, voicexml -> vxml file),
3. I edit the configuration in src/resources/config/sphinx-config.xml,
4. I type the "ant" command in the zanzibar-0.1 directory from a terminal.

I hope these changes in the xml would be enough.

The linguist part:

<property name="acousticModel" value="pl1"/>

or maybe I should leave the default value <property name="acousticModel" value="acousticModel"/>.

The dictionary part:

value="/home/mainaccount/acoustic/pl1/etc/pl1.dic"/>
<property name="fillerPath" value="/home/mainaccount/acoustic/pl1/etc/pl1.filler"/>

The acoustic model part (it is because of these model loaders that I cannot use the loaders from WSJ; they suggested I use the ones from sphinx3, as shown in the 'transcriber' demo application; I hope it will work like this):
<!-- I took the whole section from the transcriber config.xml and changed tidigits to pl1 -->
<component name="pl1" type="edu.cmu.sphinx.model.acoustic.pl1.Model">
    <property name="loader" value="sphinx3Loader"/>
    <property name="unitManager" value="unitManager"/>
</component>
<component name="sphinx3Loader" type="edu.cmu.sphinx.model.acoustic.pl1.ModelLoader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="unitManager"/>
</component>

Thanks for sharing your precious time :-)

Regards!
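P.S. Because it is easy to leave a dangling name while renaming tidigits to pl1 (for example a component still called tidigits while a property already points at pl1), I hacked together a tiny stdlib-only sanity check for such references. This is only a throwaway helper of my own, nothing from Cairo or Sphinx, and which property names are actually component references is my guess:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ConfigSanityCheck {

    /** Returns values of the given reference-style properties (e.g. "loader")
        that do not match any <component name="..."> entry in the config. */
    public static Set<String> danglingRefs(String configXml, Set<String> refProps) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            configXml.getBytes(StandardCharsets.UTF_8)));
            Set<String> components = new HashSet<>();
            NodeList comps = doc.getElementsByTagName("component");
            for (int i = 0; i < comps.getLength(); i++) {
                components.add(((Element) comps.item(i)).getAttribute("name"));
            }
            Set<String> dangling = new LinkedHashSet<>();
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                if (refProps.contains(p.getAttribute("name"))
                        && !components.contains(p.getAttribute("value"))) {
                    dangling.add(p.getAttribute("value"));
                }
            }
            return dangling;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Running it over sphinx-config.xml with refProps = {"loader", "unitManager", "logMath"} before typing "ant" would at least catch typos in the component wiring.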