I am working on a wireless light switch/dimmer (like the ones you have in your
house). It is based on the Freescale MC13224 CPU, a 32-bit ARM7 part with 128K
of flash and 96K of RAM that runs at 24MHz. It has an integrated 2.4GHz radio
and an 802.15.4 stack. Nice part, but it doesn't have an FPU.
During a design discussion it was suggested that somebody look into adding
voice command. I was elected. :)
The number of voice commands that the switch would have to recognize seems
small: things like 'Lights xxx', where xxx is one of 'On, Off, Up, Down,
Sleep, Party, Away', etc. With a little thought, I would think the set could
be limited to a couple dozen. The commands would have to work for various
speakers (speaker independent) and be identifiable against a moderate amount
of background noise.
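To make the size of the task concrete, here is a minimal sketch of the
'Lights xxx' command set as a lookup table. The numeric IDs and the
`parse_command` helper are hypothetical names made up for illustration (they
are not part of pocketsphinx), and the word list is just the examples above.

```python
# Hypothetical command table: the IDs are arbitrary, chosen for illustration.
COMMANDS = {
    "on": 1, "off": 2, "up": 3, "down": 4,
    "sleep": 5, "party": 6, "away": 7,
}

def parse_command(hypothesis):
    """Map recognizer text such as 'lights on' to a command ID, or None."""
    words = hypothesis.lower().split()
    if len(words) != 2 or words[0] != "lights":
        return None  # reject anything that is not 'Lights xxx'
    return COMMANDS.get(words[1])

print(parse_command("Lights On"))
print(parse_command("lights disco"))
```

Anything outside the table comes back as None, which is probably what you want
from a light switch: ignoring a phrase is cheaper than acting on a wrong one.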
I have downloaded pocketsphinx and I was starting to go through it. I have the
demos working. The functionality of 'tidigits' is close to what I think I
need.
I have also contacted the Sensory guys thinking that maybe one of their RSC-4x
chips might help.
So, I was wondering if I could tap the experience base at this site for a
little guidance:
1) Would a 'cut down' version of pocketsphinx have a chance of fitting on my
CPU? The compiled sizes of the libraries and applications seem a little large
(compared to 96K RAM), but I assume that an experienced pocketsphinx person
could improve on that.
2) How much CPU horsepower is generally required? The MC13224 is a 32-bit ARM
running at 24MHz. Is that enough computing power for a small vocabulary?
3) If my CPU is not large enough, any suggestions? What would be the minimum
microcontroller (non-FPU) I would have to use to run pocketsphinx on a small
vocabulary? What speed would it need to be? How much program and data memory
would it need?
4) Anybody tried the Sensory chips (the ones that they used in the Furby)? How
do they compare to the pocketsphinx technology?
> During a design discussion it was suggested that somebody look into adding
> voice command. I was elected. :)
I think it's a great project.
> 1) Would a 'cut down' version of pocketsphinx have a chance of fitting on my
> CPU? The compiled sizes of the libraries and applications seem a little large
> (compared to 96K RAM), but I assume that an experienced pocketsphinx person
> could improve on that.
I think it's too small for a generic HMM engine. You can probably implement a
DTW recognizer in that amount of memory, but even a good frontend will require
a lot. Overall, processors are getting more powerful nowadays; spend a few more
bucks to get 10 times more power ;)
There are specialized solutions for such small chips, but they are carefully
developed with the limited hardware in mind.
For example, there is an engine for a similar Fujitsu chip, but it took
https://mcu.emea.fujitsu.com/emea_content/downloads/MICRO/fme/micros/Fujitsu_FlexRay_Solutions_-_from_systems_support_to_silicon.pdf
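For reference, the DTW idea mentioned above can be sketched in a few lines.
This is only a prototype under my own assumptions: the features are small
integer vectors (how you compute them is the frontend problem, not shown
here), the frame distance is a plain city-block distance, and `dtw_distance`
is a name invented for this example. It keeps just two rows of the cost
matrix, which is the kind of trick that matters when you only have 96K of RAM.

```python
def dtw_distance(a, b):
    """DTW cost between two sequences of integer feature vectors.

    Uses only integer adds/compares and keeps two rows of the cost
    matrix, so memory grows with len(b), not len(a) * len(b).
    """
    INF = 1 << 30
    prev = [INF] * (len(b) + 1)
    prev[0] = 0
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # City-block distance between frames: no multiplies needed.
            d = sum(abs(x - y) for x, y in zip(a[i - 1], b[j - 1]))
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]

template = [[0, 0], [4, 1], [8, 2], [4, 1]]
slow = [[0, 0], [0, 0], [4, 1], [8, 2], [8, 2], [4, 1]]
other = [[8, 2], [4, 1], [0, 0], [0, 0]]
print(dtw_distance(template, slow))   # 0: same shape, spoken more slowly
print(dtw_distance(template, other))
```

The point of the warping is visible in the first call: the slower utterance
still matches the template exactly, while the different pattern does not.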
> 2) How much CPU horsepower is generally required? The MC13224 is a 32-bit
> ARM running at 24MHz. Is that enough computing power for a small vocabulary?
The target hardware platform for pocketsphinx was the Sharp Zaurus SL-5500
hand-held computer. The Zaurus is typical of the previous generation of
hand-held PCs, having a 206MHz StrongARM® processor, 64MB of SDRAM, 16MB of
flash memory, and a quarter-VGA color LCD screen.
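A quick back-of-the-envelope comparison of that reference platform with the
MC13224 figures quoted in this thread. It only compares clock rate and RAM
size and ignores architecture, caches and memory speed, so treat the ratios
as a rough indication and nothing more.

```python
# Figures taken from this thread; the comparison itself is deliberately crude.
zaurus_mhz, zaurus_ram_kb = 206, 64 * 1024
mc13224_mhz, mc13224_ram_kb = 24, 96

clock_ratio = zaurus_mhz / mc13224_mhz
ram_ratio = zaurus_ram_kb / mc13224_ram_kb

print(round(clock_ratio, 1))  # 8.6
print(round(ram_ratio))       # 683
```

So the clock gap is a bit under 10x, but the RAM gap is closer to three orders
of magnitude.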
> 3) If my CPU is not large enough, any suggestions? What would be the minimum
> microcontroller (non-FPU) I would have to use to run pocketsphinx on a small
> vocabulary? What speed would it need to be? How much program and data memory
> would it need?
See above. An FPU is not an issue: pocketsphinx can work on a fixed-point
processor.
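To illustrate what fixed-point means in practice, here is a generic Q15
sketch. This is an assumption for illustration only: I am not claiming this is
the number format pocketsphinx uses internally, and the helper names are made
up. The idea is simply that fractions are stored as scaled integers, so a
multiply becomes an integer multiply plus a shift.

```python
Q = 15          # Q15: 15 fractional bits
ONE = 1 << Q    # integer that represents 1.0

def to_q15(x):
    """Convert a float in [-1, 1) to a Q15 integer (done offline)."""
    return int(round(x * ONE))

def q15_mul(a, b):
    """Multiply two Q15 values using an integer multiply and a shift."""
    return (a * b) >> Q

def q15_to_float(a):
    """Convert back to a float, for inspection only."""
    return a / ONE

half, quarter = to_q15(0.5), to_q15(0.25)
print(q15_to_float(q15_mul(half, quarter)))  # 0.125
```

On a part without an FPU this replaces a slow software floating-point routine
with two cheap integer instructions, at the cost of limited range and
precision.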
> 4) Anybody tried the Sensory chips (the ones that they used in the Furby)?
> How do they compare to the pocketsphinx technology?
Sorry, no idea.
Thanks for the feedback, nshmyrev.
So, a quick search/read on the web tells me that DTW has largely been replaced
by HMM because HMM does a better job. However, it is your guess that my
processor and memory are too small for HMM.
Just one more question...
1) I really like the MC13224 (the price is right, the radio is easy to use,
and it does a bang-up job of running my dimmer), but it appears to be about
10x too small for HMM (even for a small vocabulary, I guess). However, there
is a chance it could do DTW. My question is, would you ship a product like
mine with a DTW recognizer? In other words, would it work in most situations,
or should I step up to a processor that can handle HMM? The thing has got to
work or I will just get a bunch of returned dimmers...
Another important question is that of distant vs close-talk speech. Is the
goal (as one would expect) to allow the user to speak from anywhere in the
room and control the lights? In that case, do you plan to use only a single
microphone on the dimmer? That's not gonna work very well, even for small
microphones. Typically, people use microphone arrays for that.
My recommendation would be to write an iPhone/Android app to wirelessly
control the dimmer. Then you can do ASR on the phone ;)
Doh!
Thanks anchan77. That is a good idea. In the back of our heads, the plan was
always to write an iPhone/Android app that lets us control an entire house
full of these things. Of course, it makes sense to just add voice command to
the program.
Thanks.
Although, it is a little disappointing as well. Wouldn't it be neat if you
could just go to Home Depot and pick up some light switches that responded to
voice command? No iPhone/Android required.
Also, smart phones go to sleep. They aren't on all the time. Once you have it
in your hand, you might as well poke a button as opposed to speaking a
command. Hmm...
You say people use arrays of microphones for room coverage. By this, do you
mean that you place microphones all over the room? So, in a normal-sized room
(say 20' x 15'), how many microphones would you need to do a decent job? Or
is it just one high-quality microphone?
> My question is, would you ship a product like mine with a DTW recognizer?
DTW is just a slightly different approach, and overall it's not really bad.
It was widely and successfully used. Moreover, if you add the functionality to
record user samples and recognize them later, that could solve many issues.
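The record-then-recognize idea might look roughly like this. Everything here
is a sketch under assumptions: `TemplateStore`, the `reject_above` threshold
and the toy feature lists are invented for the example, and `frame_diff` is
only a stand-in so the demo runs on its own. A real system would pass a DTW
cost function instead.

```python
class TemplateStore:
    """Holds user-recorded feature sequences, keyed by command name."""

    def __init__(self, distance, reject_above):
        self.distance = distance          # e.g. a DTW cost function
        self.reject_above = reject_above  # hypothetical tuning threshold
        self.templates = []               # list of (command, features)

    def enroll(self, command, features):
        """Store one recorded example of a command."""
        self.templates.append((command, features))

    def recognize(self, features):
        """Return the nearest enrolled command, or None if nothing is close."""
        best_cmd, best_cost = None, None
        for command, template in self.templates:
            cost = self.distance(features, template)
            if best_cost is None or cost < best_cost:
                best_cmd, best_cost = command, cost
        if best_cost is None or best_cost > self.reject_above:
            return None
        return best_cmd

# Stand-in distance for the demo only; a real system would use DTW here.
def frame_diff(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) + 10 * abs(len(a) - len(b))

store = TemplateStore(frame_diff, reject_above=5)
store.enroll("on", [1, 5, 9, 5])
store.enroll("off", [9, 5, 1, 1])
print(store.recognize([1, 6, 9, 4]))   # on
print(store.recognize([5, 5, 5, 5]))   # None
```

The rejection threshold is the part that decides whether the product gets
returned: too loose and the lights react to the TV, too tight and nobody can
turn them on.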
> how many microphones would you require to do a decent job? Or is it just one
> high quality microphone?
The issue here is not covering the room but fighting reverberation (echo).
Most systems are trained on close-talk microphone data, where there is no
reverberation. In a room recording, echo significantly corrupts the spectrum
and lowers HMM accuracy. The problem is also that the corruption depends on
the position in the room. In a research system you need to collect data from
all microphones and take room geometry into account to be able to clean up the
speech.
Well, you can test this effect with pocketsphinx on your computer.
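If walking around with a microphone is inconvenient, a crude way to get a feel
for the effect is to add artificial reverberation to a clean recording and
decode the result. The sketch below is an assumption-laden toy: it models the
room as exponentially decaying noise (a real room response looks different),
the function names are mine, and reading/writing the audio file is left out.

```python
import math
import random

def synthetic_impulse_response(rt60_s, rate_hz, seed=0):
    """Noise with an exponential decay reaching -60 dB after rt60_s seconds."""
    rng = random.Random(seed)
    n = int(rt60_s * rate_hz)
    # Amplitude falls by a factor of 1000 (60 dB) over n samples.
    decay = math.log(1000.0) / n
    ir = [rng.uniform(-1.0, 1.0) * math.exp(-decay * k) for k in range(n)]
    ir[0] = 1.0  # direct path from speaker to microphone
    return ir

def convolve(signal, ir):
    """Direct-form convolution; slow but dependency-free."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for k, h in enumerate(ir):
            out[i + k] += s * h
    return out

# A single click smeared by a short, low-rate response (kept small for speed).
ir = synthetic_impulse_response(rt60_s=0.05, rate_hz=1000)
click = [1.0] + [0.0] * 99
wet = convolve(click, ir)
print(len(ir), len(wet))
```

With a real recording you would use the actual sample rate and a longer decay
time, then compare recognition results on the dry and the reverberated
versions.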
I agree with Nickolay: the number and position of microphones, as well as
their effect on performance, are all empirical questions... I don't have any
hands-on experience with distant speech recognition, so I can't really give
you any insight. It could be that with a very limited vocabulary (as in your
case) and a small room that is not too reverberant (e.g. no glass walls), you
can still get away with one microphone and reasonable accuracy, but that's not
certain. You could have several dimmers for the same light, in which case you
could have an algorithm pick the ASR result from the one closest to the
speaker (using power or SNR). But then the switches need to talk to each other
(what happens if two switches each hear a different command?). That might be
more complex than what you want for light switches... As Nick suggests, putting
a laptop at different places in different rooms and seeing how well it works
might be a good first step.
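The "pick the closest switch" idea could be sketched like this. The `Report`
record, the `arbitrate` function and the 10 dB cut-off are all hypothetical,
and how each switch would estimate its own SNR is not addressed here.

```python
from collections import namedtuple

# Hypothetical report each switch would broadcast after hearing something.
Report = namedtuple("Report", "switch_id command snr_db")

def arbitrate(reports, min_snr_db=10.0):
    """Pick the command from the switch that heard the speaker best.

    Returns None when no report is loud enough to trust.
    """
    usable = [r for r in reports if r.snr_db >= min_snr_db]
    if not usable:
        return None
    return max(usable, key=lambda r: r.snr_db).command

reports = [
    Report("hall", "off", 8.5),
    Report("kitchen", "on", 21.0),
    Report("den", "off", 14.0),
]
print(arbitrate(reports))  # on
```

Here two switches disagree with the third, and the one with the best SNR wins.
Whether that beats a majority vote is exactly the kind of empirical question
mentioned above.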
Thank you nshmyrev and anchan77.
I took your advice and walked around my house with my laptop and tidigits. I
was using a Logitech QuickCAM Pro 9000 as a microphone (I use it for Skype and
GTalk with no problems). I also set it up in several rooms and walked around.
I could get it to work in most locations but 'distance speech' was a problem.
I have trained myself to speak clearly to get a better response (although I
can't seem to get 'six' to work very often). Overall, I would say that if I
was slowly/loudly talking directly into the microphone, it worked in just
about any room. If I was wandering around the room, however, accuracy was
pretty poor (or zero).
My wife had a lot more trouble. I think that was because she hadn't taken the
time to train herself on the device and she wasn't speaking very
loudly/clearly.
It does appear to me that putting voice command into a light switch is beyond
the cost/performance curve right now. Also, as anchan77 points out, voice
command in a light switch is probably misplaced.
Instead, I think the world needs a 'voice pod'. Something you set around the
house that can understand a limited number of voice commands and that has a
radio in it to transmit those commands to a device (or server of some sort).
The time for this device is pretty soon. Smart energy issues are starting to
make home device manufacturers seriously consider networking their devices.
When that happens, a market will form for command and control software. That
software will run on a server. That server will need input devices (something
more convenient than just a web page).
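What a pod would actually send over the radio could be very small. As a rough
sketch only (the 4-byte layout, field names and checksum are invented for this
example and are not part of any 802.15.4 profile), a recognized command fits
comfortably in one frame:

```python
import struct

# Hypothetical 4-byte message: pod ID, command ID, sequence number, checksum.
FRAME = struct.Struct("BBBB")

def encode(pod_id, command_id, seq):
    """Pack one recognized command into bytes for the radio payload."""
    seq &= 0xFF
    checksum = (pod_id + command_id + seq) & 0xFF
    return FRAME.pack(pod_id, command_id, seq, checksum)

def decode(payload):
    """Return (pod_id, command_id, seq), or None if the checksum fails."""
    pod_id, command_id, seq, checksum = FRAME.unpack(payload)
    if (pod_id + command_id + seq) & 0xFF != checksum:
        return None
    return pod_id, command_id, seq

msg = encode(pod_id=3, command_id=1, seq=42)
print(len(msg), decode(msg))  # 4 (3, 1, 42)
```

The sequence number is there so the server can drop duplicates if the same
command arrives from a retry.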
Voice processing is outside my expertise... but if there are some EE/CS
students reading this blog and looking for something interesting to do...
<fill_in_your_own_plan_here>. The endpoint of this project would be a device
you set around the house that understood your commands. Then take that device
to <fill_in_your_favorite_gadget_manufacturer_here> and ask them to buy you so
you can continue to advance this work.