CMU Sphinx / Forums / Help: Using flite for tokenization

Daniel Wolf - 2016-05-17

I'm looking for a simple way to perform tokenization on an input text. For instance, the text

In 1982, Mr. Smith spent $42 to find 1+1=2

should be converted to

in nineteen eighty two mister smith spent forty two dollars to find one plus one equals two.

I had a look at the source code to Flite and it looks as if it does this kind of tokenization as part of its speech synthesis process. However, I couldn't find out how to use just the tokenizer.

Does anybody have experience with Flite?

- Nickolay V. Shmyrev - 2016-05-18
  
  Something like
  
  const char *text;
  cst_utterance *u = new_utterance();
  utt_set_input_text(u, text);
  u = flite_do_synth(u, voice, utt_synth_tokens);
  for (cst_item *t = relation_head(utt_relation(u, "Token")); t; t = item_next(t)) {
      const char *name = item_feat_string(t, "name");
  }
  
  Festival was not really designed for use as a library. You can check espeak; it may be easier to use. For the most advanced implementation there is https://github.com/google/sparrowhawk, but it is not complete either.
  

Daniel Wolf - 2016-05-18

Thanks for your suggestions! Unfortunately, I can't use espeak or sparrowhawk due to their licenses (GPL and Apache, respectively; my project is MIT).

So I'm left with Flite. Thank you for the code snippet! It confirms what I was afraid of: there is no straightforward way to perform just text normalization without the full synthesis overhead, including specifying a voice.

I expanded on your draft code and created the following program. It compiles and runs, but the flite_do_synth call terminates the application with the message "Relation: Token not present in utterance".

#include <cst_utt_utils.h>
#include <flite.h>
#include <lang/usenglish/usenglish.h>

int main() {
    const char *text = "In 1982, Mr. Smith of 1982 Dr. Dolittle Dr. spent $42 to find 1+1=2.";

    cst_utterance* utterance = new_utterance();
    utt_set_input_text(utterance, text);
    cst_voice *voice = new_voice();
    voice->name = "dummy_voice";
    usenglish_init(voice);
    utterance = flite_do_synth(utterance, voice, utt_synth_tokens);
    for (cst_item* item = relation_head(utt_relation(utterance, "Token")); item; item = item_next(item)) {
        const char* word = item_feat_string(item, "name");
        printf("%s ", word);
    }
}

I'm at a loss. Do you have any idea what the problem might be?

The call stack is

utt_relation(cst_utterance_struct * u, const char * name) Line 106
default_textanalysis(cst_utterance_struct * u) Line 224
apply_synth_module(cst_utterance_struct * u, const cst_synth_module_struct * mod) Line 126
apply_synth_method(cst_utterance_struct * u, const cst_synth_module_struct * meth) Line 135
utt_synth_tokens(cst_utterance_struct * u) Line 160
flite_do_synth(cst_utterance_struct * u, cst_voice_struct * voice, cst_utterance_struct *(*)(cst_utterance_struct *) synth) Line 108
main(...) Line 13

Daniel Wolf - 2016-05-20

I got it working! I was using the wrong utterance function (utt_synth_tokens instead of utt_synth). Also, I needed to specify a dictionary. The resulting test program looks like this:

#include <cst_utt_utils.h>
#include <flite.h>
#include <lang/usenglish/usenglish.h>
#include <lang/cmulex/cmu_lex.h>

cst_voice* createDummyVoice() {
    cst_voice *voice = new_voice();
    voice->name = "dummy_voice";
    usenglish_init(voice);
    cst_lexicon *lex = cmu_lex_init();
    feat_set(voice->features, "lexicon", lexicon_val(lex));
    return voice;
}

int main() {
    const char *text = "In 1982, Mr. Smith of 1982 Dr. Dolittle Dr. spent $42 to find 1+1=2.";

    cst_utterance* utterance = new_utterance();
    utt_set_input_text(utterance, text);
    cst_voice* voice = createDummyVoice();
    utterance = flite_do_synth(utterance, voice, utt_synth);
    for (cst_item* item = relation_head(utt_relation(utterance, "Word")); item; item = item_next(item)) {
        const char* word = item_feat_string(item, "name");
        printf("%s ", word);
    }
}

And the output is this:

in nineteen eighty two mister smith of nineteen eighty two doctor dolittle drive spent forty two dollars to find one + one = two

It's not perfect -- the second '1982' really should be 'one thousand nine hundred and eighty two' rather than 'nineteen eighty two', and I'd expected the '+' and '=' to be expanded to 'plus' and 'equals', respectively. But I'm very happy nonetheless!
- Nickolay V. Shmyrev - 2016-05-21
  
  Congratulations. I believe utt_synth does too much work: it creates a waveform. It should be enough to run utt_synth_text2segs.
  

Daniel Wolf - 2016-05-21

You're right, it was doing too much. In fact, I've now defined my own synth method:

static const cst_synth_module synth_method_normalize[] = {
    { "tokenizer_func", default_tokenization },       // split text into tokens
    { "textanalysis_func", default_textanalysis },    // transform tokens into words
    { "pos_tagger_func", default_pos_tagger },        // add position information to words
    { NULL, NULL }
};

I've still got one problem: Flite splits words at apostrophes, so "won't" becomes won and 't. An easy way to fix this would be to join each word that starts with an apostrophe with the word before. Unfortunately, there are words that actually start with an apostrophe, like 'tis or 'twas. I don't want to merge them with the previous word.

The best approach would probably be to check whether the original text had any space between the words, that is, whether the text index of the first character of the second word immediately follows the text index of the last character of the first word.
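
The adjacency check described above can be sketched as follows. This is a minimal illustration, not Flite code: the word_span struct and the functions should_merge and join_words are hypothetical, and the sketch assumes you have some way of recording, for each word, where it started and ended in the original text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical word record. The offsets are assumed to come from your own
   bookkeeping; they are not a feature that Flite is known to provide. */
typedef struct {
    const char *text;  /* the word, e.g. "won" or "'t" */
    size_t start;      /* index of its first character in the original text */
    size_t end;        /* index one past its last character */
} word_span;

/* Returns 1 if `next` begins with an apostrophe and directly follows `prev`
   in the original text (nothing in between), 0 otherwise. */
static int should_merge(const word_span *prev, const word_span *next) {
    return next->text[0] == '\'' && next->start == prev->end;
}

/* Joins the words into `out`, separated by single spaces, gluing a word onto
   its predecessor whenever should_merge() says so.
   Returns 0 on success, -1 if `out` is too small. */
static int join_words(const word_span *words, size_t count,
                      char *out, size_t out_size) {
    size_t used = 0;
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        int glue = i > 0 && should_merge(&words[i - 1], &words[i]);
        size_t sep = (i > 0 && !glue) ? 1 : 0;
        size_t len = strlen(words[i].text);
        if (used + sep + len + 1 > out_size) return -1;
        if (sep) out[used++] = ' ';
        memcpy(out + used, words[i].text, len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

With spans taken from the text "I won't say 'tis fine", the 't is glued back onto won because it starts exactly where won ends, while 'tis stays separate because a space precedes it.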

There seems to be position information for each word, but it seems to always be 0 (or I'm doing something wrong). Do you have an idea?
- Nickolay V. Shmyrev - 2016-05-22
  
  > I've still got one problem: Flite splits words at apostrophes, so "won't" becomes won and 't. An easy way to fix this would be to join each word that starts with an apostrophe with the word before. Unfortunately, there are words that actually start with an apostrophe, like 'tis or 'twas. I don't want to merge them with the previous word.
  
  This was explicitly hardcoded in Flite. You can disable it in the function us_tokentowords_one, line 682:
  
  else if ((p=(cst_strrchr(name,'\''))))
  {
      static const char * const pc[] = { "'s", "'ll", "'ve", "'d", NULL };
  
  > There seems to be position information for each word, but it seems to always be 0 (or I'm doing something wrong). Do you have an idea?
  
  pos is not a position; it stands for "part of speech". To make it work, you need to call
  
  feat_set(v->features,"pos_tagger_cart",cart_val(&us_pos_cart));
  

Daniel Wolf - 2016-06-02

Nickolay, thanks so much for your help on this! Just in case someone else wants to perform tokenization/normalization using Flite, here are a few pointers.

I tried using Flite 2.0, but all its source files are tightly coupled. This means that you need to build everything, including files you'll never use. What's more, some of these files refused to link for me due to unresolved dependencies. So I ended up using Flite 1.4 instead. It's a bit less monolithic. See this CMake file (starting at line 97) for a list of the files actually required.

It turns out that both Flite and sphinxbase define a function named feat_print. That means that any program using both will get a linker error. I ended up hacking Flite, renaming the function to flite_feat_print.

For working C++ code for tokenization using Flite, see tokenization.cpp. It's the code I posted above translated to C++ (RAII), plus some post-processing.

One post-processing step I do is re-merge words containing apostrophes. For instance, he'd gets split into he and 'd. I'm converting it back to he'd. Strangely, Flite treats some cases differently. For instance, wouldn't doesn't become wouldn and 't, but wouldnt. It's not ideal, but I can live with that.
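
A re-merge step of this kind can be sketched with a fixed suffix list. This is not the code from tokenization.cpp: the names clitics, is_clitic, merge_clitics and WORD_MAX are made up for illustration, and the list extends the four suffixes hardcoded in Flite ('s, 'll, 've, 'd) with three assumed ones ('re, 'm, 't).

```c
#include <stddef.h>
#include <string.h>

#define WORD_MAX 64  /* assumed upper bound on word length, including the NUL */

/* Suffixes that get glued back onto the preceding word. The first four match
   the list hardcoded in Flite; the last three are illustrative additions. */
static const char *const clitics[] = {
    "'s", "'ll", "'ve", "'d", "'re", "'m", "'t", NULL
};

/* Returns 1 if `word` is exactly one of the listed suffixes. */
static int is_clitic(const char *word) {
    for (size_t i = 0; clitics[i] != NULL; i++) {
        if (strcmp(word, clitics[i]) == 0) return 1;
    }
    return 0;
}

/* Merges suffixes into the preceding word, in place, and returns the new
   word count. A suffix with no predecessor, or one that would overflow the
   predecessor's buffer, is kept as a separate word. */
static size_t merge_clitics(char words[][WORD_MAX], size_t count) {
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (kept > 0 && is_clitic(words[i]) &&
            strlen(words[kept - 1]) + strlen(words[i]) < WORD_MAX) {
            strcat(words[kept - 1], words[i]);
        } else {
            if (kept != i) strcpy(words[kept], words[i]);
            kept++;
        }
    }
    return kept;
}
```

Because only exact matches against the list are merged, words such as 'tis or 'twas are left alone even though they start with an apostrophe.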

Another post-processing step is that I search the output for any characters other than a-z and ', either turning them into words or removing them. There's quite a number of symbols Flite will let pass as words, and this solves it.
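
The character filter can be sketched like this. It is only an illustration under assumptions: the helper names is_clean_word and clean_word and the contents of the symbol table are hypothetical, and the a-z range test assumes an ASCII-compatible character set.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `word` is non-empty and consists only of a-z and apostrophes. */
static int is_clean_word(const char *word) {
    if (word[0] == '\0') return 0;
    for (const char *p = word; *p != '\0'; p++) {
        if (!((*p >= 'a' && *p <= 'z') || *p == '\'')) return 0;
    }
    return 1;
}

/* Illustrative table of symbols that should be spoken rather than dropped. */
typedef struct {
    const char *symbol;
    const char *spoken;
} symbol_word;

static const symbol_word symbol_words[] = {
    { "+", "plus" }, { "=", "equals" }, { "&", "and" }, { "%", "percent" }
};

/* Returns `word` itself if it is already clean, the spoken form if it is a
   known symbol, or NULL if the word should be removed from the output. */
static const char *clean_word(const char *word) {
    if (is_clean_word(word)) return word;
    for (size_t i = 0; i < sizeof symbol_words / sizeof symbol_words[0]; i++) {
        if (strcmp(word, symbol_words[i].symbol) == 0) {
            return symbol_words[i].spoken;
        }
    }
    return NULL;
}
```

Applied to the earlier output, this would turn + and = into plus and equals, while any symbol missing from the table is dropped.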

I've written some unit tests.

Bottom line: Tokenization via Flite works, but it's quite a hassle and the results are far from perfect. My recommendation is to use Flite if you absolutely need an MIT-compatible open-source license. Otherwise, take a look at the alternatives Nickolay mentions above.
- Nickolay V. Shmyrev - 2016-06-03
  
  I think you would probably write the whole thing from scratch faster. Not that many complex rules are currently implemented anyway.
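
To give a sense of what one such hand-written rule might look like, here is a minimal sketch that spells out two-digit numbers and reads years as two pairs. It is not derived from Flite's rules; the function names number_to_words and year_to_words and the supported ranges are assumptions chosen to cover the "1982" and "42" cases from the first post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

static const char *const ones[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"
};
static const char *const tens[] = {
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety"
};

/* Spells out n (0..99). Returns 0 on success, -1 if n is out of range or
   the buffer is too small. */
static int number_to_words(int n, char *out, size_t out_size) {
    int written;
    assert(out != NULL);
    if (n < 0 || n > 99) return -1;
    if (n < 20) written = snprintf(out, out_size, "%s", ones[n]);
    else if (n % 10 == 0) written = snprintf(out, out_size, "%s", tens[n / 10]);
    else written = snprintf(out, out_size, "%s %s", tens[n / 10], ones[n % 10]);
    return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
}

/* Spells out a year as two two-digit groups ("nineteen eighty two").
   Only years 1100..1999 whose last two digits are at least 10 are handled;
   anything else (1905, 2000, ...) returns -1 so the caller can fall back
   to another rule. */
static int year_to_words(int year, char *out, size_t out_size) {
    char high[16], low[16];
    int written;
    if (year < 1100 || year > 1999 || year % 100 < 10) return -1;
    if (number_to_words(year / 100, high, sizeof high) != 0) return -1;
    if (number_to_words(year % 100, low, sizeof low) != 0) return -1;
    written = snprintf(out, out_size, "%s %s", high, low);
    return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
}
```

Deciding when a four-digit token is a year rather than a quantity or a house number is the harder part, and this sketch leaves that decision to the caller.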
  
