While working with the HUB4 files for Sphinx4, I found that 1133 words used in the HUB4 Language Model are missing from the cmudict.0.6d file.

Since I will probably need to add many more words to the dictionary as my Language Model requirements change, I looked for a quick way to add words to the designated pronunciation dictionary.

Under SphinxTrain, you'll find a script called make_dict.pl - this was inadequate for my needs.

So, I created a set of tools, written in Java, to make this easy:
1. newwords.jar - compares a vocabulary list from a LM to a desired dictionary and generates a list of any missing words.
2. newdict.jar - takes a list of words (such as the output from newwords.jar) and generates a compatible pronunciation key for them. The output from newdict.jar is ready for immediate incorporation into cmudict.0.6d or any other pronunciation dictionary in the same format.
3. collate.jar - takes two sorted pronunciation dictionaries and merges them. I found that it was necessary to create this utility because "cat" followed by "sort" screwed my dictionaries up.
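
For anyone curious what the comparison and merge steps involve, here is a minimal sketch. It is not the code in the jars. The class and method names (DictTools, headWord, missingWords, merge) are made up for illustration. It assumes each dictionary line is a word, then whitespace, then the phonemes, with alternate pronunciations marked like WORD(2); that words are matched case-insensitively; and that "sorted" means plain String.compareTo order on the head word, which may not be the order your dictionary actually uses.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class; illustrates the idea only.
final class DictTools {

    // Head word of a dictionary line: the text before the first whitespace,
    // with an alternate-pronunciation suffix such as "(2)" removed (assumed format).
    static String headWord(String line) {
        String word = line.trim().split("\\s+", 2)[0];
        return word.replaceFirst("\\(\\d+\\)$", "");
    }

    // Vocabulary words that have no entry in the dictionary,
    // upper-cased and returned in sorted order without duplicates.
    static List<String> missingWords(Collection<String> vocab,
                                     Collection<String> dictLines) {
        Set<String> known = new HashSet<>();
        for (String line : dictLines) {
            if (!line.trim().isEmpty()) {
                known.add(headWord(line).toUpperCase(Locale.ROOT));
            }
        }
        Set<String> missing = new TreeSet<>();
        for (String w : vocab) {
            String key = w.trim().toUpperCase(Locale.ROOT);
            if (!key.isEmpty() && !known.contains(key)) {
                missing.add(key);
            }
        }
        return new ArrayList<>(missing);
    }

    // Two-way merge of dictionaries that are already sorted by head word.
    // Lines are compared on the head word only, so the phonemes never
    // influence the ordering; exact duplicate lines are written once.
    static List<String> merge(List<String> a, List<String> b) {
        Set<String> out = new LinkedHashSet<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int c = headWord(a.get(i)).compareTo(headWord(b.get(j)));
            if (c <= 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return new ArrayList<>(out);
    }
}
```

Unlike "cat" followed by "sort", a merge like this never reorders lines within either input, so each dictionary's existing order is kept.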

Anyway, newdict.jar requires that Festival be installed and on the executable path, but that's about the only external requirement.

You'll find that the LTS (Letter To Sound) rules used by Festival are pretty good, but some irregular words get erroneous pronunciation keys. Still, in the long run, it beats having to add 1133 new entries to the dictionary by hand.

If you're interested in this, go ahead and email me.

HTH,

Darren