We have set up Cassandra on the ITL dev machines (Cairo instance TJR199), but the recognition results are completely off. For example, in callId 2c816b2d8054e1e3e881d25425d916a9@141.31.8.61, I said "yes, this is correct", which was recognized as "the school". While such results are theoretically possible, the WER is clearly above 50% on average, suggesting there is a major issue.
I pulled the above-mentioned recording from the server (201508250128420867.wav) and sent it through the Cassandra test web page and, guess what? The recognition hypothesis was
{ YES THIS IS CORRECT }
So, there is a discrepancy between how Cairo calls Cassandra and how I am doing it through the test page.
I created Ext 7751 which is pointing at Patrick's dev Halef instance TJR211/12. Unfortunately, the Kaldi hypotheses are currently not logged there, nor do the recorded audio utterances show up. @Patrick? At any rate, I called in, and this is what I got (utterance -> hypothesis)
hello this is the david pizza service -> { WELL IS THE THE PROFESSOR IS }
is that for delivery or take-out -> { IS THAT FOR DISAGREE ABOUT THE ABOUT }
is this for delivery -> { IS THE FOR THE }
can you please tell me your name -> { UNIVERSITY'S TELL YOUR NAME }
what is your address -> { ONE IS TO ATTRACT }
can you tell me what your address is please -> { CHANGED ONLY WHAT YOUR ADDRESSES FEES }
now please tell me your phone number -> { APPEARS TELL YOUR PHONE NUMBER }
now what do you want on your pizza's topping -> { WELL WHAT YOU WANT TO PROTEST TOPIC }
which toppings would you like on your pizza -> { WHEN SHOPPING WHICH I LIKE GOING SHOPPING }
allright we will deliver the pizza within thirty minutes to your home place -> { OLD LIKE TO GO TO THE PROFESSOR GIVES A WITH THE FEATHER TO MANAGE TO YOUR FRIENDS }
we will deliver the pizza within thirty minutes to your home location -> { <unk> OR YOUR SELF WITH THE STUDYING IT'S TO ON THE TEACHING }
Still not great, but seems a little better
Just sent the last recording to the web service and got the following recognition result:
{ WE WILL DELIVER YOUR FEET THAT WITHIN THIRTY MINUTES TO YOUR HOME LOCATED }
This is much, much closer to the expected results. Why is there such a discrepancy?
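To put a number on the gap between the live and the web-service hypotheses, a word error rate can be computed per utterance. The following is only a minimal sketch: the function name `wer` is made up for illustration, it uses plain word-level edit distance with case folding, and it does no text normalization beyond whitespace splitting.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                         # deletion
                cur[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# The last utterance from the call above: live hypothesis vs. web service.
reference = "we will deliver the pizza within thirty minutes to your home location"
live = "<unk> OR YOUR SELF WITH THE STUDYING IT'S TO ON THE TEACHING"
web = "WE WILL DELIVER YOUR FEET THAT WITHIN THIRTY MINUTES TO YOUR HOME LOCATED"
print(f"live: {wer(reference, live):.2f}  web: {wer(reference, web):.2f}")
```

Running the same function over all utterance/hypothesis pairs of a call gives a per-call figure that can be tracked while the live path is being debugged.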
The file name on TJR 211 is 201508260238080104.wav
@Alex, please take a look ASAP.
It definitely looks like a problem with the signal buffer ordering, i.e. the sequence of audio chunks sent to the recognizer gets shuffled. A very characteristic symptom is "toppings" => "SHOPPING WHICH I LIKE GOING SHOPPING".
Dumping the contents of the buffer as the server sees it will allow us to verify this hypothesis.
We could also add an extra step in which the client and the server each compute a summary value of their respective buffers (e.g. the signal energy or an md5 sum) and exchange these values after recognition to verify the communication channel.
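A rough sketch of such a check is given below. Everything in it is an assumption for illustration: the function names, the 16-bit little-endian mono sample format, and the fixed chunk size are not taken from the actual Cairo or Cassandra code. Note that an energy value is unchanged when chunks are merely reordered, so only the md5 sum (or a per-chunk comparison) would expose a shuffled buffer.

```python
import hashlib
import struct


def buffer_digest(pcm: bytes) -> str:
    """md5 of the whole buffer; changes if chunks are reordered."""
    return hashlib.md5(pcm).hexdigest()


def buffer_energy(pcm: bytes) -> float:
    """Mean squared amplitude, assuming 16-bit little-endian mono samples."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: 2 * n])
    return sum(s * s for s in samples) / n


def chunk_order(client: bytes, server: bytes, chunk_size: int) -> list:
    """For each server chunk, the index of the identical client chunk (None if absent)."""
    index = {}
    for i in range(0, len(client), chunk_size):
        key = hashlib.md5(client[i:i + chunk_size]).digest()
        index.setdefault(key, i // chunk_size)
    return [
        index.get(hashlib.md5(server[i:i + chunk_size]).digest())
        for i in range(0, len(server), chunk_size)
    ]


# Synthetic demo: four 4-byte chunks, with chunks 1 and 2 swapped on the server side.
client_buf = struct.pack("<8h", *range(8))
chunks = [client_buf[i:i + 4] for i in range(0, len(client_buf), 4)]
server_buf = chunks[0] + chunks[2] + chunks[1] + chunks[3]

print(buffer_energy(client_buf) == buffer_energy(server_buf))  # energy cannot see the swap
print(buffer_digest(client_buf) == buffer_digest(server_buf))  # digest can
print(chunk_order(client_buf, server_buf, chunk_size=4))
```

If the order list comes back as anything other than 0, 1, 2, ... the chunks arrived out of sequence; entries of `None` would instead point at corrupted or missing audio.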
Thanks, Alex. Patrick told me you are storing the audio files on the Cassandra server as well. Would you please check what these look like and how they differ from those stored on the Cairo server? Please send me some examples from the Cassandra server.
Dear David, Patrick,
Please find attached the buffer dumps that I have obtained from the
CASSANDRA server on TJR1001 with the test sessions via
test_bin_file.html while recognizing the file 201410291608300064.wav
(also in the attachment).
I have made 4 successive recognitions and apparently:
WANT TO STUDY });
./201410291608300064.wav ./test.data_expX'
In order to extract the signal buffer dump on the server one needs to do
the following:
signal buffer dump.
After that the buffer can be compared to the client buffer version.
May I ask you to perform a similar experiment with a live recognition
from Cairo server?
Thank you,
AI
On 08/26/2015 05:56 PM, David Suendermann-Oeft wrote:
Related
Tickets: #84
Dear Alex,
Would you please check the buffer dumps of Cassandra which were generated at the time of recognition? For example when the recent audio file 201508251904000497.wav was generated as part of call ID 01da31a4a0360086fd3ea1f470257e63@141.31.8.61 at 2015-08-25 17:56:14 database time? At runtime, the hypothesis was { WHAT EPIPHYTES } while the test page produces { WHAT ABOUT THE EDUCATION } (I said "what about your address").
Yours,
DSO
That is not supported. The buffer dump file has the same name all the time. It always gets overwritten by the most recent recognition session.
Dear Alex, please enable storing of past speech buffers as separate files on the Cassandra server. This is essential for troubleshooting of past calls.
Current Kaldi test extensions are 7723, 7731, among others, for you to test.
Dear All,
This functionality is now available for the server on TJR1001. A raw audio stream corresponding to each individual recognition is now stored in
~/halef-cassandra/CASSANDRA_STRM2/processed_streams/stream_{date}-{time}.raw
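With the per-recognition streams stored, a stream can be compared byte by byte against the corresponding Halef recording. The sketch below assumes that the .raw file is headerless PCM in the same sample format as the .wav file and that the .wav is PCM-encoded (Python's `wave` module rejects other encodings); neither has been confirmed for these servers. The helper names and the demo file names are made up.

```python
import os
import tempfile
import wave


def read_wav_pcm(path: str) -> bytes:
    """Return the sample data of a PCM wav file, without the header."""
    with wave.open(path, "rb") as w:
        return w.readframes(w.getnframes())


def first_mismatch(a: bytes, b: bytes):
    """Byte offset of the first difference, or None if the buffers are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Self-contained demo with synthetic audio instead of real recordings.
with tempfile.TemporaryDirectory() as tmp:
    pcm = bytes(range(200)) * 4
    wav_path = os.path.join(tmp, "halef_recording.wav")
    raw_path = os.path.join(tmp, "stream_demo.raw")
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)     # mono
        w.setsampwidth(2)     # 16-bit samples
        w.setframerate(8000)  # telephony rate, chosen only for the demo
        w.writeframes(pcm)
    with open(raw_path, "wb") as f:
        f.write(pcm)
    with open(raw_path, "rb") as f:
        raw = f.read()
    print(first_mismatch(read_wav_pcm(wav_path), raw))
```

A result of `None` means the recognizer received exactly what Halef recorded; any offset narrows down where the two paths start to diverge.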
Sure, that is an essential feature for ASR-related troubleshooting.
Patrick,
You resolved this bug last week. However, there still seem to be discrepancies between what Halef recognizes in live mode vs. batch mode. An example:
CallId 1b813b4b19ff76c1bf57b977913eb8c6@141.31.8.61
recording 201510140000040604.wav
resulted in
{ THAT PERSON NOW I WOULD LIKE TO }
in live mode, but in
{ THAT PERMITTED TO NOW I WOULDN'T LIKE TO }
in batch mode.
In the same call,
recording 201510140000220858.wav
has
{ I LIKE TO TALK ABOUT IS THAT THE IDEA }
vs.
{ I LIKE TO TALK ABOUT IN THE NOTE AND TO }
Yours,
DSO
I will check whether the following three files are identical: 1. the Halef recording, 2. my recording of what I send, 3. the recording on the Kaldi server.
However, this is atm not possible. I think we only store the latest recording of 2) and 3).
Alex changed the Kaldi server to store historic recordings
Looking at a call Keelan pointed out:
922f4f06211353f2d5c494648ca22e1f@141.31.8.61
201510140819380887.wav
was recognized as
{ URN PLANT THE PLANT }
in Halef, but calling the command line tool, I am getting
$ kaldi 201510140819380887.wav
Establishing the connection took 849172561 nanoseconds
{ ACTUALLY I DIDN'T USE THE }
Recognition took 2640942734 nanoseconds
When I ran it again, I got yet another hypothesis:
$ kaldi 201510140819380887.wav
Establishing the connection took 914194415 nanoseconds
{ REGARDING AND DIDN'T USE THE }
Recognition took 2764824757 nanoseconds
Apparently, hypotheses are not consistent.
The EPIPHYTES example is even worse:
846b7507bbde434b243e3e8c260697a7@141.31.8.61
201510140838230289.wav
Halef recognized
{ EPIPHYTES }
while the command line tool returns
$ kaldi 201510140838230289.wav
Establishing the connection took 907055569 nanoseconds
{ EPIPHYTES CAN MAKE A LINE TO THE UM THE THE }
Recognition took 2472632539 nanoseconds