Apologies in advance if this is a topic that has been covered in great detail already.
I work for a small market research firm that conducts face-to-face interviews in developing countries. The interviews are carried out on tablet computers, which can record portions of each interview for quality control purposes. Quality control is one of our biggest challenges: at the moment we employ people to go through the collected files manually and note whether they can hear an interview being conducted.
A typical project has around 2000-3000 interviews, which makes this process very time-consuming. The files typically contain background noise, and some contain music or other distractions, so we know that we would still have to do some manual checking. However, any software that could make a basic VOICE vs NO VOICE assessment would be very helpful, as files with no voice could be flagged immediately, massively reducing the number of interviews we have to check manually. It would also have to work across a wide variety of languages. Many thanks for your help.
There are many implementations around. What you are looking for is neural-network-based audio tagging or speech activity detection, trained on music, noise, and speech. The solutions vary in complexity.
Which one fits depends on whether your developer is able to set it up, and on your programming language and OS requirements. I would simply use
http://kaldi-asr.org/models/m4
but there are also more advanced solutions at https://www.kaggle.com/c/freesound-audio-tagging/leaderboard and a reasonable baseline at https://github.com/DCASE-REPO/dcase2018_baseline/tree/master/task2
There are also projects on GitHub, like https://github.com/pyannote/pyannote-audio/tree/master/tutorials/speech-activity-detection
If you are looking for something plug-and-play, you can try any SaaS service, such as https://www.speechmatics.com/. It will detect the speech properly and return the transcript as well.
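Whichever detector you end up with, the triage step described in the question could look roughly like the sketch below. It is not tied to any of the tools above: it assumes the detector has already produced, for each file, a list of (start, end) speech segments in seconds. The function names, the file names, and the 5-second threshold are all hypothetical and only there for illustration; the threshold in particular would need tuning on your own recordings.

```python
from typing import Dict, List, Tuple

# A speech segment as (start_seconds, end_seconds). This format is an
# assumption; adapt it to whatever your detector actually outputs.
Segment = Tuple[float, float]


def total_speech_seconds(segments: List[Segment]) -> float:
    """Sum the duration of speech segments, counting overlaps only once."""
    total = 0.0
    current_end = float("-inf")
    for start, end in sorted(segments):
        if end <= current_end:
            continue  # lies entirely inside speech that was already counted
        total += end - max(start, current_end)
        current_end = end
    return total


def triage(
    detections: Dict[str, List[Segment]],
    min_speech_seconds: float = 5.0,  # hypothetical threshold, tune on real data
) -> Dict[str, str]:
    """Label each file VOICE or NO VOICE from its detected speech time."""
    labels = {}
    for filename, segments in detections.items():
        speech = total_speech_seconds(segments)
        labels[filename] = "VOICE" if speech >= min_speech_seconds else "NO VOICE"
    return labels


if __name__ == "__main__":
    # Made-up detector output for three made-up files.
    example = {
        "interview_001.wav": [(2.0, 9.5), (11.0, 30.0)],
        "interview_002.wav": [],
        "interview_003.wav": [(0.5, 1.2)],
    }
    for name, label in triage(example).items():
        print(name, label)
```

Files labelled NO VOICE can be flagged straight away, and the manual check can then concentrate on files close to the threshold.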
Thanks for the replies!