I'm using pocketsphinx5prealpha from GitHub. I want to run multiple transcriptions of audio files in parallel with PocketSphinx; for example, I have one audio file and I want to transcribe it using 4 or more different configurations (each configuration has its own acoustic model + language model). The aim is to take advantage of the number of processors and cores in modern computers. To do that, I created a class PocketSphinx (this code is indicative, to make my question easier to understand and to not confuse myself):
#include <pocketsphinx.h>
#include <cstdio>
#include <string>
using std::string;

class PocketSphinx {
private:
    ps_decoder_t *ps;
    cmd_ln_t *config;
    char const *hyp, *uttid;
    int16 buf[512];
    int rv;
    int32 score;
    FILE *fh;
public:
    // Functions to operate the class
    bool initialiseModel(string pathToLanguageModel, string pathToAcousticModel);
    string transcribe(string pathToAudioFile);
    bool terminateModel(); // free memory functions
};
I create multiple instances of the class PocketSphinx using:
vector<PocketSphinx> pp(4); // 4, for example, is the number of decoders that will run in parallel
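I initialize the models sequentially using a plain loop along these lines (a sketch only; the model paths are placeholders, and each instance would get its own paths in the real code):

for (int i = 0; i < 4; i++)
    pp[i].initialiseModel("someLanguageModelPath", "someAcousticModelPath");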
and then I run the decoders in parallel using OpenMP, from the main C++ program:
vector<string> result(4);
#pragma omp parallel for
for (int i = 0; i < 4; i++)
{
result[i] = pp[i].transcribe("someaudioPath");
}
The procedure transcribes the audio in parallel from different instances of the PocketSphinx class, each returning its own string (the hypothesis/result).
To initialise a model, I used the code from the official tutorial on building a program with PocketSphinx, with some OOP changes, and to transcribe I again use the code from the documentation, also with some OOP changes.
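The exact methods aren't shown here; roughly, they wrap the tutorial calls like this (a sketch only, assuming the 5prealpha API and the member variables from the class above; a real setup would normally also pass a dictionary via "-dict"):

bool PocketSphinx::initialiseModel(string pathToLanguageModel, string pathToAcousticModel)
{
    // Each instance builds its own private config and decoder (nothing shared)
    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-hmm", pathToAcousticModel.c_str(),
                         "-lm",  pathToLanguageModel.c_str(),
                         NULL);
    if (config == NULL)
        return false;
    ps = ps_init(config);
    return ps != NULL;
}

string PocketSphinx::transcribe(string pathToAudioFile)
{
    fh = fopen(pathToAudioFile.c_str(), "rb");
    if (fh == NULL)
        return "";
    rv = ps_start_utt(ps);
    size_t nsamp;
    while ((nsamp = fread(buf, 2, 512, fh)) > 0)
        rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE); // feed raw 16-bit samples
    rv = ps_end_utt(ps);
    fclose(fh);
    hyp = ps_get_hyp(ps, &score); // best hypothesis for the utterance
    return hyp ? string(hyp) : string("");
}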
There are 2 issues I can't seem to understand:
1) If I initialise the models in parallel (like I do for transcription), the program crashes with a "malloc(): memory corruption" error, so I initialise the models sequentially and then run the transcriptions in parallel as shown above.
2) The parallel transcriptions are much slower than I expected. My server has 6 processors with 8 cores each (48 cores in total; "nproc" returns 48) and 256 GB of RAM, running Debian, so I did some testing by initializing different numbers of instances of the same acoustic model and the same language model, and using the same audio file:
If I transcribe an audio file using only 1 instance (traditional way), it takes 39 seconds to finish (I tried 4 times for each test to make sure the results were very similar)
If I transcribe with 2 threads: 41 seconds
If I transcribe with 4 threads: 43 seconds
If I transcribe with 8 threads: 61 seconds
If I transcribe with 16 threads: 97 seconds
If I transcribe with 32 threads: 152 seconds
I was expecting n threads to run only slightly slower than 1 thread, not 3 times slower. If I run "htop" while executing, I can see that a number of cores equal to the number of parallel threads is fully loaded, so the program is effectively computing the transcriptions in parallel. My guess is that the instances are racing for some shared component that slows down the entire process; they are probably sharing something inside the pocketsphinx library. Any ideas on how to make the different instances share nothing, so that the parallel transcription runs faster?
(please let me know if anything wasn't clear)
Last edit: Orest 2015-03-05
If you share a complete implementation, it would be easier for me to check what is going on.
Overall, decoding is a memory-intensive process and you cannot run 48 decoders in parallel. Most likely the decoders should be optimized for throughput (for example, you can implement acoustic and language model sharing).
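For example, a single decoder can keep one acoustic model in memory and switch between several language models registered as named searches. A rough sketch of what that could look like, assuming the ps_set_lm_file/ps_set_search API and placeholder model paths:

// One decoder: the acoustic model and dictionary are loaded once and
// shared across several language-model "searches" (paths are placeholders)
cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
                               "-hmm", "path/to/acoustic/model",
                               "-dict", "path/to/dictionary.dict",
                               NULL);
ps_decoder_t *ps = ps_init(config);

ps_set_lm_file(ps, "lm_general", "path/to/general.lm.bin");
ps_set_lm_file(ps, "lm_domain", "path/to/domain.lm.bin");

const char *searches[] = { "lm_general", "lm_domain" };
for (int i = 0; i < 2; i++) {
    ps_set_search(ps, searches[i]);   // switch the active language model
    FILE *fh = fopen("someaudioPath", "rb");
    ps_decode_raw(ps, fh, -1);        // decode the whole file with this LM
    fclose(fh);
    int32 score;
    printf("%s: %s\n", searches[i], ps_get_hyp(ps, &score));
}

This decodes with the different language models one after another inside a single decoder, so it trades wall-clock parallelism for much lower memory traffic.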
Thanks for the answer, Nickolay. Unfortunately I can't share the code at this time; I understand that this makes it much more difficult to analyze the problem.
decoding is a memory-intensive process and you cannot run 48 decoders in parallel
What exactly stops me from running 48 decoders in parallel if I have enough memory and enough processors and cores? (In my case: 256 GB of RAM and 6 processors with 8 cores each, running Debian Linux.)
The idea is that they don't share anything, so I can have different acoustic models and language models in each of them.
Do you have any suggestions about data (apart from the audio file) that those decoders running in parallel might be sharing, and that slows down the overall execution time?
1 thread takes 39 seconds and 8 threads take 61 seconds, which is reasonable to me, because running 8 transcriptions sequentially (with the same audio file) takes about 39 x 8 = 312 seconds. With 16 threads it takes 97 seconds, which is a bit more than twice the time of a single thread and way better than 16 sequential transcriptions (about 39 x 16 = 624 seconds), and so on. But I believe it can be faster than that, and I believe they are racing for some shared element.
If I initialize the decoders in parallel, the program crashes. From my understanding this happens because they are sharing something that they shouldn't; do you have any advice on that as well?
EDIT: As an additional note, if I check the processor load while running the threads in parallel, I can see that, for example, 8 threads use 8 cores at 100%, 10 threads use 10 cores at 100%, and so on. This is an additional element that makes me guess that this configuration is sharing something.
EDIT: Initialising the models in parallel (with the code shown above) not only causes the "malloc(): memory corruption" error, but sometimes causes a second error as well. If I initialise the models sequentially, there are no problems, apart from the speed of initializing n decoders one after another.
Last edit: Orest 2015-03-06
If I initialize the decoders in parallel, the program crashes. From my understanding this happens because they are sharing something that they shouldn't; do you have any advice on that as well?
Our strtod implementation was not thread-safe; I've just committed a fix for that. You can update, or just keep initializing the recognizers sequentially. It does not affect decoding.
Do you have any suggestions about data (apart from the audio file) that those decoders running in parallel might be sharing, and that slows down the overall execution time?
I wrote above that you are running out of memory bandwidth. It is not related to memory size. On each frame the decoder reads and writes about 100 MB of memory; at roughly 100 frames per second that is about 10 GB/s per decoder. If you want 48 parallel processes, that would be 480 GB/s, something that only GPU cards provide for sequential reads, and for random reads/writes the achievable speed must be way lower.
You can read about memory bandwidth in HPC here:
http://en.wikipedia.org/wiki/Memory_bandwidth
http://www.cs.virginia.edu/stream
Overall, designing a parallel system for processing speech is not trivial and is worth research in itself. It might make more sense in your case to process a single stream on multiple cores than to process many independent streams. It also might make more sense to use several servers with 8 cores each than one big server with 48 cores.
Hi, I am running pocketsphinx/C++ on multiple threads, invoking from Java (SWIG). It appears that the log file (-logfn) is shared across the threads.