I'm looking for a simple way to perform tokenization on an input text. For instance, the text
In 1982, Mr. Smith spent $42 to find 1+1=2
should be converted to
in nineteen eighty two mister smith spent forty two dollars to find one plus one equals two
I had a look at the source code to Flite and it looks as if it does this kind of tokenization as part of its speech synthesis process. However, I couldn't find out how to use just the tokenizer.
Does anybody have experience with Flite?
Something like
const char *text;
cst_utterance *u = new_utterance();
utt_set_input_text(u, text);
u = flite_do_synth(u, voice, utt_synth_tokens);
/* walk the Token relation and read each token's name */
for (cst_item *t = relation_head(utt_relation(u, "Token")); t; t = item_next(t)) {
    const char *name = item_feat_string(t, "name");
}
Festival was not really designed for use as a library. You could check espeak; it may be easier to use. The most advanced implementation is https://github.com/google/sparrowhawk, but it is not complete either.
Thanks for your suggestions! Unfortunately, I can't use espeak or sparrowhawk due to their licenses (GPL and Apache, respectively; my project is MIT).
So I'm left with Flite. Thank you for the code snippet! It confirms what I was afraid of: that there is no straightforward way to perform just text normalization without the full synthesis overhead, including specifying a voice.
I expanded on your draft code and created the following program. It compiles and runs, but the flite_do_synth call terminates the application with the message "Relation: Token not present in utterance".
#include <cst_utt_utils.h>
#include <flite.h>
#include <lang/usenglish/usenglish.h>

void main() {
    const char* text = "In 1982, Mr. Smith of 1982 Dr. Dolittle Dr. spent $42 to find 1+1=2.";
    cst_utterance* utterance = new_utterance();
    utt_set_input_text(utterance, text);

    cst_voice* voice = new_voice();
    voice->name = "dummy_voice";
    usenglish_init(voice);
    utterance = flite_do_synth(utterance, voice, utt_synth_tokens);

    for (cst_item* item = relation_head(utt_relation(utterance, "Token")); item; item = item_next(item)) {
        const char* word = item_feat_string(item, "name");
        printf("%s ", word);
    }
}
I'm at a loss. Do you have any idea what the problem might be?
The call stack is
utt_relation(cst_utterance_struct * u, const char * name) Line 106
default_textanalysis(cst_utterance_struct * u) Line 224
apply_synth_module(cst_utterance_struct * u, const cst_synth_module_struct * mod) Line 126
apply_synth_method(cst_utterance_struct * u, const cst_synth_module_struct * meth) Line 135
utt_synth_tokens(cst_utterance_struct * u) Line 160
flite_do_synth(cst_utterance_struct * u, cst_voice_struct * voice, cst_utterance_struct *(*)(cst_utterance_struct *) synth) Line 108
main(...) Line 13
I got it working! I was using the wrong utterance function (utt_synth_tokens instead of utt_synth). Also, I needed to specify a dictionary. The resulting test program looks like this:
#include <cst_utt_utils.h>
#include <flite.h>
#include <lang/usenglish/usenglish.h>
#include <lang/cmulex/cmu_lex.h>

cst_voice* createDummyVoice() {
    cst_voice* voice = new_voice();
    voice->name = "dummy_voice";
    usenglish_init(voice);
    cst_lexicon* lex = cmu_lex_init();
    feat_set(voice->features, "lexicon", lexicon_val(lex));
    return voice;
}

void main() {
    const char* text = "In 1982, Mr. Smith of 1982 Dr. Dolittle Dr. spent $42 to find 1+1=2.";
    cst_utterance* utterance = new_utterance();
    utt_set_input_text(utterance, text);

    cst_voice* voice = createDummyVoice();
    utterance = flite_do_synth(utterance, voice, utt_synth);

    for (cst_item* item = relation_head(utt_relation(utterance, "Word")); item; item = item_next(item)) {
        const char* word = item_feat_string(item, "name");
        printf("%s ", word);
    }
}
And the output is this:
in nineteen eighty two mister smith of nineteen eighty two doctor dolittle drive spent forty two dollars to find one + one = two
It's not perfect -- the second '1982' really should be 'one thousand nine hundred and eighty two' rather than 'nineteen eighty two', and I'd expected the '+' and '=' to be expanded to 'plus' and 'equals', respectively. But I'm very happy nonetheless!
You're right, it was doing too much. In fact, I've now defined my own synth method:
static const cst_synth_module synth_method_normalize[] = {
    { "tokenizer_func", default_tokenization },    // split text into tokens
    { "textanalysis_func", default_textanalysis }, // transform tokens into words
    { "pos_tagger_func", default_pos_tagger },     // add position information to words
    { NULL, NULL }
};
I've still got one problem: Flite splits words at apostrophes, so "won't" becomes "won" and "'t". An easy way to fix this would be to join each word that starts with an apostrophe with the word before it. Unfortunately, there are words that actually start with an apostrophe, like 'tis or 'twas. I don't want to merge them with the previous word.
The best approach would probably be to check whether the original text had any space between the words, that is, whether the text index of the first character of the second word immediately follows the text index of the last character of the first word.
There seems to be position information for each word, but it seems to always be 0 (or I'm doing something wrong). Do you have an idea?
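The adjacency check described above could be sketched roughly as follows. This is only an illustration, not Flite code: the Token struct and mergeAdjacentApostrophes are hypothetical names, and the sketch assumes that each token carries the index of its first character in the original text and that its text is the unmodified source substring (which does not hold for expanded tokens such as numbers).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record (not a Flite type): the token text plus the
// index of its first character in the original input text.
struct Token {
    std::string text;
    std::size_t start;
};

// Merge a token that starts with an apostrophe into the previous token,
// but only if the two were directly adjacent in the original text.
std::vector<Token> mergeAdjacentApostrophes(const std::vector<Token>& tokens) {
    std::vector<Token> result;
    for (const Token& token : tokens) {
        const bool startsWithApostrophe = !token.text.empty() && token.text[0] == '\'';
        if (startsWithApostrophe && !result.empty()) {
            Token& previous = result.back();
            // Adjacent means: previous token ends exactly where this one starts.
            if (previous.start + previous.text.size() == token.start) {
                previous.text += token.text;
                continue;
            }
        }
        result.push_back(token);
    }
    return result;
}
```

With the input "won't 'tis", the tokens won (index 0), 't (index 3) and 'tis (index 6) would come out as won't and 'tis, because only 't directly follows its predecessor.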
The splitting at apostrophes is explicitly hardcoded in Flite; you can disable it in the function us_tokentowords_one, line 682.
Nickolay, thanks so much for your help on this! Just in case someone else wants to perform tokenization/normalization using Flite, here are a few pointers.
I tried using Flite 2.0, but all its source files are tightly coupled. This means that you need to build everything, including files you'll never use. What's more, some of these files refused to link for me due to unresolved dependencies. So I ended up using Flite 1.4 instead. It's a bit less monolithic. See this CMake file (starting at line 97) for a list of the files actually required.
It turns out that both Flite and sphinxbase define a function named feat_print. That means that any program using both will get a linker error. I ended up hacking Flite, renaming the function to flite_feat_print.
For working C++ code for tokenization using Flite, see tokenization.cpp. It's the code I posted above translated to C++ (RAII), plus some post-processing.
One post-processing step I do is re-merge words containing apostrophes. For instance, he'd gets split into he and 'd. I'm converting it back to he'd. Strangely, Flite treats some cases differently. For instance, wouldn't doesn't become wouldn and 't, but wouldnt. It's not ideal, but I can live with that.
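A minimal sketch of such a re-merge step, assuming the words are available as a plain list of strings. This is not the code from tokenization.cpp; the function name remergeApostrophes and the small exception list are made up for illustration, and a real list of words that legitimately start with an apostrophe would need to be longer.

```cpp
#include <set>
#include <string>
#include <vector>

// Join every word that starts with an apostrophe onto the preceding word,
// except for words known to stand on their own (illustrative list only).
std::vector<std::string> remergeApostrophes(const std::vector<std::string>& words) {
    static const std::set<std::string> standalone = {"'tis", "'twas"};
    std::vector<std::string> result;
    for (const std::string& word : words) {
        const bool isSuffix = !word.empty() && word[0] == '\''
            && standalone.count(word) == 0;
        if (isSuffix && !result.empty()) {
            result.back() += word;   // he + 'd -> he'd
        } else {
            result.push_back(word);
        }
    }
    return result;
}
```

Given he, 'd, said, 'tis, this yields he'd, said, 'tis. It relies on an exception list rather than on positions in the original text, so it will wrongly merge any apostrophe-initial word that is missing from the list.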
Another post-processing step is that I search the output for any characters other than a-z and ', either turning them into words or removing them. There's quite a number of symbols Flite will let pass as words, and this solves it.
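Roughly, that cleanup pass could look like the sketch below. The function name cleanWords and the symbol table are assumptions for illustration (only + and = are mapped here); the input is assumed to be lowercase already, as in the Flite output shown earlier.

```cpp
#include <map>
#include <string>
#include <vector>

// Keep only a-z and apostrophes. Known symbols become spoken words,
// everything else is dropped. Any dropped or replaced character also
// ends the word collected so far.
std::vector<std::string> cleanWords(const std::vector<std::string>& words) {
    static const std::map<char, std::string> spoken = {{'+', "plus"}, {'=', "equals"}};
    std::vector<std::string> result;
    for (const std::string& word : words) {
        std::string current;
        auto flush = [&]() {
            if (!current.empty()) {
                result.push_back(current);
                current.clear();
            }
        };
        for (char c : word) {
            if ((c >= 'a' && c <= 'z') || c == '\'') {
                current += c;
            } else {
                flush();
                auto it = spoken.find(c);
                if (it != spoken.end()) {
                    result.push_back(it->second);
                }
            }
        }
        flush();
    }
    return result;
}
```

Applied to one, +, one, =, two, this gives one, plus, one, equals, two; a stray symbol such as # simply disappears.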
Bottom line: Tokenization via Flite works, but it's quite a hassle and the results are far from perfect. My recommendation is to use Flite if you absolutely need an MIT-compatible open-source license. Otherwise, take a look at the alternatives Nickolay mentions above.
Congratulations. utt_synth does too much work, I believe: it creates a waveform. It should be enough to run utt_synth_text2segs.
pos is not a position; it is "part of speech". And to make it work, you need to call ...
I think you'd probably write the whole thing from scratch faster. There are not that many complex rules currently implemented anyway.
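To give an idea of the scale of such rules, here is a rough sketch of one of them written from scratch: spelling out two-digit numbers and reading a four-digit number as a year. The function names are made up, this is not derived from Flite's rules, and the year reading is deliberately simplistic (it assumes 1000 to 9999 and handles round years only crudely).

```cpp
#include <string>

// Spell out a number from 0 to 99, e.g. 42 -> "forty two".
std::string twoDigits(int n) {
    static const char* ones[] = {
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
    static const char* tens[] = {
        "", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"};
    if (n < 20) {
        return ones[n];
    }
    std::string result = tens[n / 10];
    if (n % 10 != 0) {
        result += std::string(" ") + ones[n % 10];
    }
    return result;
}

// Read a four-digit number (1000 to 9999) as a year,
// e.g. 1982 -> "nineteen eighty two".
std::string yearToWords(int year) {
    const int high = year / 100;
    const int low = year % 100;
    if (low == 0) {
        return twoDigits(high) + " hundred";
    }
    if (low < 10) {
        return twoDigits(high) + " oh " + twoDigits(low);
    }
    return twoDigits(high) + " " + twoDigits(low);
}
```

Deciding when a number is a year rather than a quantity (the second "1982" in the example sentence) is the harder part, and this sketch does not attempt it.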