Rich: A few days ago I spoke with Alex Rudnicky, Evandro Gouvea, Bhiksha Raj, and Rita Singh, who all work on the CMU Sphinx project.
Alex, Bhiksha, and Rita are currently with the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University. Evandro used to be there as well, but he’s currently working as a speech consultant in Germany, and at the moment he’s working with a Brazilian company called Vocalize. I spoke with these four about the history of the CMU Sphinx project and what its aims are. Here’s some of my interview with them.
Alex: Let me give you a quick overview of it. The original Sphinx was developed as part of a dissertation by Kai-Fu Lee in the Computer Science department here. Kai-Fu has gone on to bigger things since then, and, in fact, you might know his name. The original project demonstrated something that people didn’t think was possible, which is to simultaneously handle continuous, connected speech and be speaker-independent. These are things we take for granted today, but it was still a bit of a head-scratcher at that time.
That was Sphinx 1.
There was a Sphinx 2, which was developed by Xuedong Huang and others while he was a post-doc here. He’s subsequently gone on to Microsoft. And at some point after that, we had Sphinx 3, which was …
Rita: Mosur Ravishankar
Alex: … and it’s something that featured continuous HMM models …
Bhiksha: And Eric Thayer.
Alex: The whole activity was overseen by Professor Raj Reddy, in Computer Science. At some point – I think this was around the time that Kevin was here – Raj declared that from now on, Sphinx would be Open Source. And you have to realize that this was back in the time when everybody sat on top of their software and was generally suspicious.
Rita: Sphinx was export controlled at one point.
Alex: Yeah, so it was kind of a mess. But then suddenly it was Open Source. Nobody knew quite what would happen, and since then, things have seemed to have gone fairly well. The code gets maintained. There was one more version, called Sphinx 4, which Bhiksha and Rita can speak to better than I can. Why don’t you guys talk a bit about that?
Bhiksha: The Sphinx 4 project was somewhat different from the earlier Sphinxes, in that the earlier Sphinxes had been entirely CMU-internal projects that got Open Sourced. But Sphinx 4 was actually a multi-institutional collaboration. We had engineers from Sun Microsystems, engineers from Mitsubishi Electric Research Labs, and of course, lots of people from CMU. Since Sun was involved, it was also different in that it was Java-based, and so we had to architect it in a manner that let Java actually be used and taken advantage of. That was the fourth version.
Alex: One thing to point out is that there are several components to the software collection. And people don’t always distinguish this. There are decoders, of which there are three currently available. And then there’s a suite of training software that creates the statistical models that drive these systems. Currently there’s Pocketsphinx, which is a small-footprint, fast decoder that, for example, I use for interactive systems. There’s Sphinx 3, which is considered a research system, and has a variety of features that don’t normally show up in real-time systems. And finally there’s Sphinx 4, which actually is quite popular at this point, and people are incorporating it into a variety of applications.
Rita: It’s extremely modular, and enables a lot of mix-and-match in the core technology itself. Different kinds of searches and different kinds of language models and things like that.
Bhiksha: I think the big attraction for Sphinx 4 really is that it’s in Java, which seems to be the language of the day.
Alex: Some of us who actually kind of remember C and still use it a bit scratch our heads about this Java stuff, but, you know, what can you do?
Evandro: One interesting thing about Sphinx 4 is that, as Bhiksha mentioned, it was built by several institutions. The different institutions had different goals for Sphinx 4. And one of the goals that Sun Microsystems had was exactly to popularize Java by building tools that used Java as the language. At that time, 2001-2002, Java was still establishing itself as a language. It’s interesting that nowadays Java is very popular, and Sphinx 4 has become popular partly because it’s in Java.
Rich: Now, this software, it’s a collection of libraries, is that correct, primarily, or is there actually an application that I can download and run on my Android phone or whatever?
Alex: Actually, there is an Android version, which you’ll find on the repository. And there are a few basic applications that you compile and run on a desktop under … actually, both under Linux and Windows, I believe.
Bhiksha: And Mac.
Alex: And Mac. But there are also other things out there that incorporate Sphinx in one way or another. There’s actually quite a lot, and I can’t really go through them all. Although of course I’ll mention some of the work we have here, and that’s the Olympus/RavenClaw dialog manager, which allows you to build interactive systems; Sphinx is used there as the recognition component. That’s also a fully Open Source system.
Bhiksha: In response to whether it’s a library or an application, there is a core set of libraries, but we also have various demo apps that hook into these libraries. So you could actually download programs that you can run on various platforms and that will give you an output. Now that may not necessarily be the output that you require, so to get exactly what you need, you’d have to customize it and build things around the libraries.
Rita: Sphinx 4 includes three or four demos.
Alex: All the Sphinxes have, minimally, a command-line version of the application.
Rich: Is there a connection with the Festival/Festvox project?
Alex: Well, they’re both speech. And something like our dialog work incorporates Festival and its variants. The useful thing there is that it’s Open Source and we can fool with it for our purposes. That work is done by Alan Black, and in the scheme of things, once you get down to speech processing, it’s an area with diverse, almost self-contained subfields. Recognition and synthesis are like that: people sort of know about each other, and maybe even do work in each other’s areas, but really they’re separate activities.
Bhiksha: There’s also a slightly deeper connection here to Sphinx 4 in particular. When Sphinx 4 came to be, the folks at Sun were trying to demonstrate the capabilities of Java, as Evandro mentioned. So the first thing they did was to port the synthesizer, which is the Festival toolkit, to Java. And once they’d warmed up on Festvox, they switched to the recognizer, and collaborated with us on Sphinx 4.
Rich: What sort of things are you all working on these days?
Alex: I should mention that we’re working on human-robot communication through language, and we’re also trying to work gesture into that, so you can kind of wave your arms and talk.
Rita: Yeah, we’re looking at distant speech recognition, and focus of attention, things like that.
Alex: Another thing that we’re interested in has to do with model training, and one particular project at this point is trying to induce vocabulary from just speech. This is something that would be useful for working with languages you’ve never seen or heard, so to speak.
Rich: Can you elaborate on that? Is that like learning a language just by being immersed in it?
Alex: Well, it’s maybe best characterized as taking a whole bunch of speech and identifying recurrences of particular patterns, on the assumption that people reuse words and so on.
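The idea Alex describes can be illustrated with a toy sketch (this is not the actual Sphinx code, and the label stream, n-gram lengths, and threshold below are all invented for illustration): treat the audio as a sequence of acoustic-unit labels and look for subsequences that recur often enough to be word candidates.

```python
from collections import Counter

def recurring_patterns(units, min_len=2, max_len=4, min_count=2):
    """Find label subsequences that recur, on the assumption that
    repeated acoustic patterns correspond to reused words."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return {pat: c for pat, c in counts.items() if c >= min_count}

# A made-up stream of phone-like labels; "HH EH L OW" occurs twice.
stream = ["HH", "EH", "L", "OW", "K", "AA", "T", "HH", "EH", "L", "OW"]
print(recurring_patterns(stream))
```

In a real system the labels themselves would also have to be induced from the signal, and the matching would be approximate rather than exact, but the core intuition is this kind of recurrence counting.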
Rita: And then we continue to work on the usual thing: speech recognition in highly noisy environments, in automotive environments, in open environments. We’re working on recognition with speech captured through multi-sensor arrays, from various distances, in open spaces, by moving devices, and so on.
Bhiksha: And then of course there’s the inevitable business of dealing with the large amounts of data that we’ve currently got. We’re also working on how we can leverage the data that have now become available. Most of these data are not transcribed, so you don’t know exactly what the people who spoke them said. We have to figure out how best to use these data to improve our models and come up with better recognition in various languages. So that is something we’re working on.
Rich: What languages does the software currently recognize?
Bhiksha: There are a number of them on the website, because at this point it’s very strictly Open Source, in the sense that we have contributions from around the world. For instance, Nickolay Shmyrev, whose name has been on the list, contributes a lot. There are people who have uploaded models in French, Spanish, and whatnot.
Rita: The bottom line is that you can actually build models for any language, provided you have the training data, and the software will allow you to recognize speech in that language. So people around the world are doing this. Some put out their products and byproducts on their own websites. Not everything makes it into the Open Source software bundle. The software currently is capable of recognizing a few major languages. English, Spanish, I think French.
Alex: We did Korean at some point.
Bhiksha: But then again, the correct way to think about it is not as a black box that recognizes specific languages, but as a toolkit that you can use to build a recognizer …
Rita: For any language
Bhiksha: … in a language you want to work in.
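Concretely, the “training data” Rita mentions boils down to audio plus a few plain-text resources. A minimal sketch of the kind of inputs Sphinx-style acoustic-model training consumes (the words and utterance id here are invented, and the exact file layout varies between versions of the training tools):

```
# Pronunciation dictionary: each word mapped to its phone sequence
HELLO    HH AH L OW
WORLD    W ER L D

# Transcript file: one utterance per line, tagged with its audio file id
<s> HELLO WORLD </s> (utt_0001)
```

Given such a dictionary, a phone list, transcripts, and the matching recordings, the training suite estimates the statistical models that any of the decoders can then load, which is why the toolkit is language-agnostic.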
Rich: Now, one thing that I deal with every single week, is I do a recording like this, and it has … as in this case, we have several different voices on here, each of which has a distinct accent. Then I have to transcribe that for the web site. This sounds like something that I’m going to spend quite a bit of time playing with.
Bhiksha: And I thought I was understandable!
Rich: Oh, you’re all very understandable, but I’ve played with several transcription programs, and I can train one to transcribe myself perfectly, and as soon as I have a second speaker, it gets confused.
Alex: Yes, so that’s the “speaker-dependent” bit, which people originally thought was going to be inevitable. And you still get much better performance if you train for an individual, but really that’s not quite what you want.
Rich: I see that Sphinx is going to be involved in the Google Summer of Code this summer. Can you tell me something about that?
Evandro: Sphinx is going to be part of the Google Summer of Code, so there are some interesting projects going on there with participation from students, who haven’t been chosen yet. So that’s something that’s coming up this summer. This year, one of the projects, for example, is a better user interface, a graphical user interface for training and then decoding using Sphinx.
Bhiksha: And last year we built tools for allowing people to read entire books, and to align the recorded audio to the actual text.
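The text side of the alignment Bhiksha describes can be sketched with nothing but Python’s standard library: run the recognizer over the audio, then find stretches where its hypothesized words match the book’s reference text, and use those stretches as anchors for mapping audio timestamps onto the text. This is only an illustration of the matching step, not the actual Sphinx tooling, and the word sequences below are invented.

```python
from difflib import SequenceMatcher

def anchor_regions(hypothesis, reference, min_words=2):
    """Return (hyp_start, ref_start, length) triples for stretches where
    the recognizer's word output matches the reference text; these serve
    as anchors tying audio positions to positions in the book."""
    matcher = SequenceMatcher(a=hypothesis, b=reference, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks() if m.size >= min_words]

# The recognizer misheard one word ("climbs" for "times").
hyp = "it was the best of climbs it was the worst of times".split()
ref = "it was the best of times it was the worst of times".split()
print(anchor_regions(hyp, ref))
```

Between anchors, the unmatched audio can then be attributed to the unmatched reference words, which is how a long recording and its text get stitched together even when recognition is imperfect.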
Rich: Thank you all so much for talking with me about this. I’m sure that there’s a lot more that we could talk about.
Alex: We hope we gave you some reasonably coherent answers!
Rich: Thanks a lot!
All: Thank you!