It's been several years since I last used the tools, so I can't give you much feedback. I used them for two things: I seem to remember that the LM generation scripts were somehow hosted on the site, allowing me to download and study them. It's surprisingly hard to find good (and simple!) code that generates language models, and those scripts served as a great starting point for writing my own, application-specific implementation. I then used the output of the web interface as a kind of ground truth...
This article lists two other options for building language models. I haven't used them myself, but it sounds like they will give you the same result, albeit with a bit more effort.
Oh, that sucks. Good luck restoring the contents! Is there a public repo with the source code? I did some searching, but all links pointed to the website only.
The LMTool site (http://www.speech.cs.cmu.edu/tools/lmtool-new.html) appears to be down. It would be great if it could be brought back up!
I'm thinking of training a custom acoustic model. The dictionary contains words with multiple pronunciations, like this:
either a ɪ ð ə
either(2) i ð ə
Now let's suppose one of the training samples is the phrase "You say either [i ð ə] and I say either [a ɪ ð ə]". What should the transcript file contain? Is the trainer smart enough to determine the correct pronunciation from context, so that the transcript can be "<s> you say either and i say either </s>"? Or do I need to give it the exact word alternatives,...
I'm thinking of training a custom acoustic model. The dictionary contains words with multiple pronunciations, like this:
either a ɪ ð ə
either(2) i ð ə
Now let's suppose one of the training samples is the phrase "You say either [i ð ə] and I say either [a ɪ ð ə]". What should the transcript file contain? Is the trainer smart enough to determine the correct pronunciation from context, so that the transcript file can be "<s> you say either and i say either </s>"? Or do I need to give it the exact word...
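To make the dictionary format in the question concrete, here is a minimal Python sketch of how entries with a "(2)"-style suffix could be grouped under one base word. The function name parse_dictionary and the in-memory list of entries are hypothetical, chosen only for illustration. The sketch only reads the format; it says nothing about how the trainer itself chooses between the alternatives.

```python
import re
from collections import defaultdict

# Matches a dictionary key with an optional alternate-pronunciation
# suffix such as "(2)". The suffix convention is taken from the example
# entries in the post; everything else here is an illustrative assumption.
_VARIANT = re.compile(r"^(?P<word>.+?)(?:\((?P<index>\d+)\))?$")


def parse_dictionary(lines):
    """Group pronunciations by base word.

    Returns {word: [[phone, ...], ...]} with variants in file order.
    """
    pronunciations = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and keys without phones
        match = _VARIANT.match(fields[0])
        pronunciations[match.group("word")].append(fields[1:])
    return dict(pronunciations)


# The two entries quoted in the post.
entries = [
    "either a ɪ ð ə",
    "either(2) i ð ə",
]

lexicon = parse_dictionary(entries)
for word, variants in lexicon.items():
    for number, phones in enumerate(variants, start=1):
        print(word, number, " ".join(phones))
```

Both entries end up under the single key "either", which is the same word form that would appear in a transcript written without explicit alternatives.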
Thank you very much!
I also looked at other C/C++ implementations of MFCC extraction. But those that looked good either came with a non-permissive license (GPL) or were part of huge libraries. Any help on how to extract MFCCs in code with PocketSphinx would be appreciated!