
Tutorial-iSR

Timo Baumann

Tutorial Part 1: Incremental Speech Recognition

This section will make you familiar with InproTK's speech input component. First, the interface for receiving (or simulating) incremental spoken input will be explained, before we dive into the details of incremental input evaluation and the code that optimizes iSR behaviour.

You will have to use some of the programs that come with InproTK (either from the command line or from Eclipse), and you should look at and try to understand some of the code.

Tasks:

1. Starting Incremental Speech Recognition

InproTK's entry point to iSR is inpro.apps.SimpleReco, a command-line application for starting up InproTK using a default or custom configuration. SimpleReco bundles various setup tasks and variations for speech recognition, such as the audio source, language modelling, and output selection.

First, start up SimpleReco without any options; this will print some usage information. To do this, you can:

  • Right-click inpro.apps.SimpleReco in Eclipse and select Run as->Java Application, or
  • in a terminal window, change to the bin/ directory of your InproTK installation and run java inpro.apps.SimpleReco. You may have to fiddle with your Java classpath to make this work.
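For the terminal route, an invocation might look as follows; the classpath entries here are an assumption and depend on your installation layout:

```shell
# From the bin/ directory of your InproTK installation; the classpath
# entries below are an assumption -- adjust them to your layout.
java -cp ".:../lib/*" inpro.apps.SimpleReco
```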

Now, let's give SimpleReco something to recognize. There's an audio file in res/DE_1234.wav which contains some spoken digits (you can guess which digits, but you should listen to the file to be sure). Can you find out what the command should look like? (Hint: a filesystem URL is not just a path specification, but starts with file:, see Wikipedia.)

You certainly noticed that there was no incremental output, just a final recognition result. This is because no IU module listened for the incremental output.

  • Add the runtime parameter -L for LabelWriter output, and/or -C for the current hypothesis viewer (what does it do?)
  • Certainly, you want to get live speech recognition results; use -M. By default, InproTK listens for German speech (German speech about elephant puzzles in particular).

You may have noticed that results are not always correct (with DE_1234.wav, LabelWriter first recognizes ja, ei, ein, before settling for eins). There are built-in techniques for optimizing incremental results. Optimization trades the timeliness of results (that is, when a word is first recognized) against their overall quality (how often words that later turn out to be wrong are passed on to a listening module). The best method for this is smoothing, which takes a smoothing factor as a parameter; this roughly equates to how long a delay is introduced for new results. (Google has a marginally better method, see McGraw & Gruenstein, Interspeech 2012.)

Switch on smoothing with factor 7 by adding the switches -Is 7. What smoothing factor do you need to get rid of any mis-recognitions at the beginning of DE_1234.wav?
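Putting the switches from this section together, a live-recognition run with smoothing could be started like this (assuming SimpleReco is on your classpath as set up above):

```shell
# Microphone input (-M), LabelWriter output (-L), smoothing with factor 7 (-Is 7)
java inpro.apps.SimpleReco -M -L -Is 7
```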

2. Simulating iSR with text

For testing, it is often convenient to have a textual interface instead of true ASR, especially because speech recognition makes many errors that make repeatable testing very difficult.

inpro.apps.SimpleText can be used to supply text as if it were coming from incremental speech recognition, either interactively (default) or with text from a file or the command-line. Do play around with it a little bit. What happens if you use the -L switch? In particular: Do you have an idea what happens behind the scenes? You may also want to try out the non-interactive modes.

3. Code-analysis: What does an IU module do?

When a new result is generated, InproTK sends the partial hypothesis as a sequence of IUs (compare lecture last Friday, see also inpro.incremental.unit.IU). InproTK sends both the full list as well as a list of edits since the last update (see inpro.incremental.unit.EditMessage and EditType); we call this a dual representation of incremental hypotheses. Depending on the task, the receiving module can use either representation, whichever is more suitable.
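To make the dual representation concrete, here is a self-contained sketch of how edit messages can be derived from two successive full hypotheses. All names here are hypothetical stand-ins, not InproTK's actual classes (see inpro.incremental.unit.EditMessage for the real thing):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (not InproTK's implementation): derives the
 * edit-message representation (ADD/REVOKE) from two successive full
 * word hypotheses, as in the dual representation described above.
 */
public class EditDiff {

    public enum EditType { ADD, REVOKE }

    public record Edit(EditType type, String word) {
        @Override public String toString() { return type + "(" + word + ")"; }
    }

    /** Edits that turn oldHyp into newHyp: revoke back to the point of divergence, then add. */
    public static List<Edit> diff(List<String> oldHyp, List<String> newHyp) {
        int common = 0;
        while (common < oldHyp.size() && common < newHyp.size()
                && oldHyp.get(common).equals(newHyp.get(common))) {
            common++;
        }
        List<Edit> edits = new ArrayList<>();
        // Revoke in reverse order: the newest word is retracted first.
        for (int i = oldHyp.size() - 1; i >= common; i--) {
            edits.add(new Edit(EditType.REVOKE, oldHyp.get(i)));
        }
        for (int i = common; i < newHyp.size(); i++) {
            edits.add(new Edit(EditType.ADD, newHyp.get(i)));
        }
        return edits;
    }

    public static void main(String[] args) {
        // "ei" is first recognized, then revised to "ein", then to "eins zwei".
        System.out.println(diff(List.of("ei"), List.of("ein")));
        System.out.println(diff(List.of("ein"), List.of("eins", "zwei")));
    }
}
```

Given either representation, the other can always be reconstructed this way, which is exactly what makes the dual representation cheap to maintain.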

To simplify this, each IUModule (see inpro.incremental.IUModule) contains a RightBuffer object (see the encapsulated definition in IUModule) which can be provided with either representation and automatically constructs the missing representation.

IUModules do not contain left buffers (see important concepts below). Instead, input is passed on in the call to leftBufferUpdate(List<IU>, List<EditMessage<IU>>).

Thus, the workflow of a typical IU module is to:

  • consume the list of IUs or the list of edits,
  • process them (often with the same steps for every item in the list of IUs or edits),
  • generate new output (in the form of IUs or edits since the previous call) and put it into the right buffer;
  • that's it -- passing it on to the next module(s) is automated by IUModule.
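The four steps above can be sketched as a self-contained toy module. The names and the simplified signature are hypothetical; a real module would extend IUModule and use its encapsulated RightBuffer:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration of the four-step IU-module workflow (hypothetical
 * names, not InproTK's actual classes): consume input, process each
 * item, put new output into the right buffer; propagation to the next
 * module(s) is the framework's job.
 */
public class ToyModule {
    /** Stands in for IUModule's encapsulated RightBuffer. */
    final List<String> rightBuffer = new ArrayList<>();

    /** Simplified leftBufferUpdate: receives the words added since the last call. */
    public void leftBufferUpdate(List<String> addedWords) {
        for (String w : addedWords) {           // 1. consume the edits
            String processed = w.toUpperCase(); // 2. process each item
            rightBuffer.add(processed);         // 3. put output into the right buffer
        }                                       // 4. passing it on would be automated
    }

    public static void main(String[] args) {
        ToyModule m = new ToyModule();
        m.leftBufferUpdate(List.of("eins", "zwei"));
        System.out.println(m.rightBuffer);
    }
}
```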

Can you identify these four steps in inpro.incremental.processor.Tagger? What exactly is the processing step of Tagger? (Bonus: improve the tagger by integrating the results from Beuck et al.'s NODALIDA 2011 paper.)

There is also a simpler form of IU module: one without a right buffer. Such modules are defined by inpro.incremental.PushBuffer, a class that IUModule builds on. Most modules that do not need to produce any output (we call them sinks) are simply PushBuffers.

4. Implementing a very simple module

Take a close look at inpro.incremental.sink.LabelWriter (this is the module that is triggered by SimpleReco and SimpleText with the -L switch).

Can you write your own class -- similarly to LabelWriter -- which reacts to a certain combination of IUs (e.g. to eins eins zwei) and performs some action (notify the fire department -- please only simulate this step)?
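One possible shape for such a class, reduced to plain Java so it runs stand-alone (in InproTK you would instead implement a sink and inspect the word IUs you receive; all names here are hypothetical):

```java
import java.util.List;

/**
 * Sketch of the exercise: a sink that fires an action once the current
 * word hypothesis ends in a given trigger sequence. Hypothetical names;
 * a real InproTK sink would receive IUs, not plain strings.
 */
public class TriggerSink {
    private final List<String> trigger;
    private boolean fired = false;

    public TriggerSink(List<String> trigger) { this.trigger = trigger; }

    /** Simplified hypothesis update: receives the full current word hypothesis. */
    public void hypChange(List<String> words) {
        int n = words.size(), t = trigger.size();
        if (!fired && n >= t && words.subList(n - t, n).equals(trigger)) {
            fired = true;
            // Only simulate the action, as requested:
            System.out.println("ALERT: would notify the fire department now (simulated).");
        }
    }

    public boolean hasFired() { return fired; }

    public static void main(String[] args) {
        TriggerSink sink = new TriggerSink(List.of("eins", "eins", "zwei"));
        sink.hypChange(List.of("eins"));
        sink.hypChange(List.of("eins", "eins"));
        sink.hypChange(List.of("eins", "eins", "zwei"));
    }
}
```

Note that a full solution should also deal with revoked words, e.g. by resetting its state when the trigger sequence disappears from the hypothesis again.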

Alternatively, write an IU sink that gives more detailed output about the data that it receives: At every call, it should print all the IUs and all the edits that it has received. Also, you could follow the IUs' grounded-in links and print any IUs that you find there (do this recursively, or use IU's deepToString() method).
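The recursive traversal of grounded-in links can be sketched as follows. The Unit record is a stand-in for InproTK's IU class (whose deepToString() method plays a similar role); the labels are made up for illustration:

```java
import java.util.List;

/**
 * Sketch of recursively following grounded-in links. The Unit record is
 * a hypothetical stand-in for inpro.incremental.unit.IU.
 */
public class GroundedPrinter {

    public record Unit(String label, List<Unit> groundedIn) {}

    /** Render a unit and, recursively, everything it is grounded in. */
    public static String deepToString(Unit u, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth)).append(u.label()).append('\n');
        for (Unit g : u.groundedIn()) {
            sb.append(deepToString(g, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A word IU grounded in its speech segments (phonemes).
        Unit word = new Unit("word: eins", List.of(
                new Unit("segment: ai", List.of()),
                new Unit("segment: n", List.of()),
                new Unit("segment: s", List.of())));
        System.out.print(deepToString(word, 0));
    }
}
```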

More (some advanced, some peripheral) tasks:

  • recognize speech in a different language; InproTK comes with built-in support for German and English. Switching to English is a matter of changing the configuration file reference in inpro/apps/config.xml from ./sphinx-de.xml to ../../demo/inpro/apps/sphinx-en.xml. You could also use the -c configuration option of InproTK to load the configuration in demo/inpro/apps/config-en.xml which already contains sphinx-en.xml. In addition you need to define a good default language model, or a grammar to recognize from. Can you set up Sphinx and InproTK for other languages like French?
  • SimpleReco can also receive audio via RTP (the Real-time Transport Protocol) that is sent via the internet from somewhere else (or from the same computer) using SimpleRTP. Try this on your machine, or try sending from one computer to another (your network may be restricted, making this impossible).
  • SimpleReco can send partial hypotheses to TEDview, a viewer for incremental data; find TEDview (it's somewhere in the DSG-Subversion repository), open TEDview and then run SimpleReco with the -T switch.
  • implement your own input module (for speech, gestures, whatever); you might want to have a look at inpro.incremental.source.IUDocument which is the source of IUs when working with SimpleText.

Important Concepts

Incremental Unit

In the theoretical model on which InproTK is built, information is kept as small pieces of information: incremental units. As their name indicates, these are the units that are produced and/or consumed by a processing module; their size depends on the type of information they contain: phonemes are smaller than words, ideas are larger.

Incremental Modules

Incremental units are produced and consumed by incremental modules. Incremental modules are connected to form an acyclic graph (most often just a pipeline). Connections can be set up from code, or are specified in iu-config.xml where successors of a module are listed as hypChangeListeners.

Two interconnected IU modules in the abstract general model

In David's abstract general model (AgMo) of incremental processing, IU modules have both a left buffer and a right buffer. In InproTK, they have only a right buffer and the information that would be on the left buffer in AgMo is passed on in the call to leftBufferUpdate(List<IU>, List<EditMessage<IU>>).

There are three places for different types of incremental modules in InproTK:

  • inpro.incremental.processor contains full-fledged incremental modules such as RMRS parsing and semantics modules, which have IUs as input and IUs of a different type as output,
  • inpro.incremental.source contains modules that produce IUs out of thin air (well, vibrations of thin air go into the iSR module which turns them into word IUs),
  • inpro.incremental.sink contains modules that consume IUs but do not generate new ones. These are primarily debugging and logging modules.

IU Network

Incremental units are connected with each other via two types of links:

  • same-level links bind together units on the same level; words that are uttered one after the other are linked to each other via a sequence of same-level links,
  • grounded-in links refer to units that are on a lower level (closer to the speech signal): ideas are grounded in words, which are grounded in syllables, which are grounded in speech segments (phonemes).

The IU network is thus the data structure that links all the IUs together, and the changes in the IU network over time reflect the state of knowledge in the system as information accumulates.

Quite importantly, InproTK (at the current stage) only partially automates the setting/unsetting/changing of links. In most cases, you (or the module that you implement) should check very carefully whether all links that you set up actually point where they should.

Examining the IU network is not always easy, although inpro.incremental.sinks like IUNetworkToDOT and IUNetworkJGraphX exist (but do not currently work as expected -- your help is greatly appreciated).

Background Information

InproTK's incremental speech recognition is based on Sphinx-4 which is a (pretty much) state-of-the-art speech recognizer written in Java.


Related

Wiki: Tutorial
