Hi,
I want to make recordings of bird song, but automatically ignore any recordings that have also accidentally captured people talking. I'm doing this on an Android phone and thought PocketSphinx might be a good way to achieve this. If you agree that this sounds reasonable, are you able to help me (point me in the right direction) to configure pocketsphinx-android-demo to do this?
I've started to modify the code so that I have a button that calls switchSearch() when pressed, but I don't know how to configure this method to listen for any words (or for a list of the most common ones).
Thanks in advance for any assistance you can give.
PS: this is part of a project to help save the native birds of New Zealand: https://cacophony.org.nz/
Cheers
Tim
This is not really a speech recognition problem, so pocketsphinx will not work out of the box.
To separate speech from bird singing you might want to build an HMM-GMM voice activity detector. You can read about the algorithms in detail here:
http://static.googleusercontent.com/media/research.google.com/ru//pubs/archive/40362.pdf
In pocketsphinx you need to train an acoustic model with three single-phoneme words: speech, silence and bird singing. Each GMM should have many mixtures, like 128. You can download any speech database to model speech, and you can use your clean bird singing recordings to model bird singing and other noises. Then you can recognize incoming audio with a simple grammar of three variants; it will give you segments with speech, birds and other noises. It might require a slight modification of the pocketsphinx code, since pocketsphinx HMMs are three-state and you need a single-state HMM.
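For anyone who wants to see the scoring idea above in isolation, here is a minimal sketch in plain Java of how three per-class GMMs could label one feature frame. It is not pocketsphinx code and does not use the pocketsphinx API. The class names (GmmFrameClassifier, Gmm), the label strings and the use of double[] feature vectors are all assumptions made for illustration, and the GMM parameters are assumed to come from a separate training step that is not shown.

```java
/** Toy frame classifier: one diagonal-covariance GMM per class (hypothetical names). */
final class GmmFrameClassifier {

    /** A Gaussian mixture with diagonal covariances. */
    static final class Gmm {
        final double[] weights;     // mixture weights, assumed > 0 and summing to 1
        final double[][] means;     // [mixture][dimension]
        final double[][] variances; // [mixture][dimension], assumed > 0

        Gmm(double[] weights, double[][] means, double[][] variances) {
            this.weights = weights;
            this.means = means;
            this.variances = variances;
        }

        /** Log-likelihood of one feature vector, summed over mixtures with log-sum-exp. */
        double logLikelihood(double[] x) {
            double[] logTerms = new double[weights.length];
            double max = Double.NEGATIVE_INFINITY;
            for (int m = 0; m < weights.length; m++) {
                double lp = Math.log(weights[m]);
                for (int d = 0; d < x.length; d++) {
                    double diff = x[d] - means[m][d];
                    lp -= 0.5 * (Math.log(2 * Math.PI * variances[m][d])
                            + diff * diff / variances[m][d]);
                }
                logTerms[m] = lp;
                max = Math.max(max, lp);
            }
            double sum = 0.0;
            for (double lp : logTerms) {
                sum += Math.exp(lp - max); // subtract max for numerical stability
            }
            return max + Math.log(sum);
        }
    }

    // Assumed class order: models[0] = speech, models[1] = silence, models[2] = bird.
    static final String[] LABELS = {"speech", "silence", "bird"};

    /** Returns the label whose GMM gives the frame the highest log-likelihood. */
    static String classify(Gmm[] models, double[] frame) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < models.length; i++) {
            double score = models[i].logLikelihood(frame);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return LABELS[best];
    }
}
```

In a real system the frames would be acoustic features such as MFCCs and each GMM would have many mixtures, as suggested above; the sketch only shows the per-frame comparison of the three models.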
Thanks Nickolay for your quick and knowledgeable/informative response. Looks like I have a lot to learn - will keep me busy for a while!
BUT - I don't need to separate the speech from birdsong - just identify if a recording has speech in it (any speech) and then just ignore that recording. Can that be done?
cheers
Tim
Identification of something is always a separation of that something from alternatives.
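To make that concrete: once every frame of a recording has been labelled by some classifier, the "does this recording contain speech?" question reduces to a simple rule over those labels. The sketch below is one possible rule, not anything built into pocketsphinx. The class name RecordingFilter, the "speech" label string and the minRun threshold are all hypothetical, and a suitable threshold would have to be tuned on real recordings.

```java
/** Hypothetical helper: decides whether a labelled recording should be discarded. */
final class RecordingFilter {

    /**
     * Returns true if frameLabels contains at least minRun consecutive
     * "speech" frames. Requiring a run rather than a single frame is an
     * assumption meant to tolerate isolated misclassified frames.
     */
    static boolean containsSpeech(String[] frameLabels, int minRun) {
        if (minRun < 1) {
            throw new IllegalArgumentException("minRun must be at least 1");
        }
        int run = 0;
        for (String label : frameLabels) {
            run = "speech".equals(label) ? run + 1 : 0; // reset on any non-speech frame
            if (run >= minRun) {
                return true;
            }
        }
        return false;
    }
}
```

A recording for which containsSpeech returns true would simply be skipped; everything else is kept as bird song or background noise.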
I tried OK Google on my phone - it recognised my talking, but ignored bird song. So do you think Google has already done something similar to what you suggest above? I don't want to use OK Google as I'll be using old phones with very limited data connectivity.
Google built a similar classifier to distinguish between the "ok google" phrase and everything else. You can read the details here:
http://static.googleusercontent.com/media/research.google.com/ru//pubs/archive/42537.pdf
Thanks Nickolay (I'm very impressed that you have the time/will to answer all these questions) -
Even once you go past the OK Google phrase, i.e. when it enters the listening stage, it is able to distinguish between 'general' speech (I tried many different phrases) and the recorded bird song. I was hoping that pocketsphinx could also do this 'out of the box', as I don't think I have the skill to build the HMM-GMM voice activity detection that you suggest - I'll ponder some more. Thanks again.
Last edit: Tim Hunt 2016-08-27