Hello,
My organization is interested in synchronizing audio output with word-highlighting of the corresponding text. Put more tangibly, our clients are children with cognitive or learning disabilities who make frequent use of audio books, and would like to be able to follow along with the text in real time. It looks to me like Sphinx may be helpful here, but I would love to know what the developers and user base think. Would it be feasible to develop software using Sphinx for this application?
Many thanks.
Thanks for the replies. I just wanted to make sure that I wasn't going to dump a lot of time into Sphinx only to find out that it wasn't an appropriate suite for what I wanted. In reading the forums, I discovered a couple of noteworthy threads dealing with audio/text alignment:
https://sourceforge.net/forum/forum.php?thread_id=1987549&forum_id=5471
https://sourceforge.net/forum/forum.php?thread_id=2123213&forum_id=5471
There were also threads touching on associated topics like silence recognition and time-stamping of results, all very interesting and potentially (hopefully) helpful.
As for text to speech, this is not an option. The reason is that the emphasis in our application is on natural speech with children. The client who asked us about text highlighting based on audio book readout has experience with, for example, Kurzweil, but has noticed that children who spend too much time in text to speech applications start speaking in the same sort of monotone often found in such programs. Obviously, this is an undesired side effect.
Cheers,
Nathanael
> Would it be feasible to develop software using Sphinx for this application?
Yes. This question has been discussed many times on this forum; I suggest you use the search.
You might also want to look at text to speech - that way you could take any text (not just ones with audio books) and have them read aloud. And if you are controlling the output of the text, highlighting the word currently being converted seems pretty easy. I also think that would be simpler than trying to recognize all the speech being spoken with the audio book and to then try and correlate that to a text passage.
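Whichever route is taken (aligning an audio book against its text, or driving a TTS engine), the highlighting step itself comes down to mapping the current playback time to a word. Here is a minimal sketch of that lookup, assuming word-level timestamps are already available as (word, start, end) tuples in seconds. The function name `current_word_index` and the sample timings are made up for illustration; this is not Sphinx's API, just one way the lookup could be done.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# One entry per word: (word, start_seconds, end_seconds).
# Assumed to be sorted by start time and non-overlapping.
WordTiming = Tuple[str, float, float]


def current_word_index(timings: List[WordTiming], t: float) -> Optional[int]:
    """Return the index of the word being spoken at playback time t,
    or None if t falls before the first word or in a pause between words."""
    # Rebuilt on every call for simplicity; cache it for long texts.
    starts = [start for _, start, _ in timings]
    i = bisect_right(starts, t) - 1  # last word whose start is <= t
    if i < 0:
        return None
    _, _, end = timings[i]
    return i if t < end else None


# Hypothetical timings, as an aligner or TTS engine might report them.
timings = [("the", 0.00, 0.18), ("quick", 0.22, 0.55), ("fox", 0.60, 0.95)]

idx = current_word_index(timings, 0.30)
if idx is not None:
    print("highlight:", timings[idx][0])
```

A player would call this on each timer tick with the current audio position and move the highlight whenever the returned index changes. Returning None during pauses lets the caller decide whether to keep the previous word highlighted or clear it.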