From: Doug C. <idi...@us...> - 2012-03-14 02:53:49
> 0. As a prerequisite, we should commit to using OpenMP (when supported at
> compile time). At the only place where it is used currently (within
> flow_effect()), there are two almost identical code blocks--one using
> OpenMP, one not--just so that multithreading can be controlled via
> sox_globals.use_threads. If we had to use similar guards everywhere, it
> would discourage using OpenMP directives or make the code less maintainable
> due to duplication. So I propose to

You are correct that the current structure of that code is more complex than
necessary for the general case. It was previously just a single loop. I split
it up because there is a little bit of extra efficiency to be gained by
committing to the single-threaded case (and it gives the optimizer more to
work with). Since this was the only parallel loop in the program and didn't
look like a hot area for changes, I figured it deserved any extra care we
could give it to ensure the multi-threading didn't penalize the
single-threaded case. I agree that if we start doing more parallelization
spread throughout sox, it would be better to simplify the pattern used for
each one instead of trying to super-optimize each one.

I'm not totally enthusiastic about OpenMP. It is supported in fewer
environments than pthreads or other threading libraries. It would be really
easy for me to throw something together that works a lot like OpenMP but
would be based on pthreads for Unix and Win32 threads on Win32. It would use
a function like

    void parallel_for(iteration_count, task_function, task_state)

instead of depending on OpenMP compiler transformations. Doing this would
also cut those two extra DLLs (which have always made me a bit nervous) out
of the Win32 distribution.
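The parallel_for() idea above could be sketched roughly as follows. This is a minimal illustration using POSIX threads with a fixed pool size and static range splitting; all names (parallel_for, task_fn, worker) are hypothetical, not actual sox/libsox identifiers, and error handling is omitted:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch: run fn(i, task_state) for i in [0, iteration_count),
 * with the iterations split statically across a small pool of pthreads. */

typedef void (*task_fn)(size_t i, void *task_state);

typedef struct {
    task_fn fn;
    void *state;
    size_t begin, end;       /* half-open iteration range for this worker */
} worker_arg;

static void *worker(void *p)
{
    worker_arg *a = p;
    size_t i;
    for (i = a->begin; i < a->end; i++)
        a->fn(i, a->state);
    return NULL;
}

void parallel_for(size_t iteration_count, task_fn fn, void *task_state)
{
    enum { NTHREADS = 4 };   /* would really come from configuration */
    pthread_t tid[NTHREADS];
    worker_arg arg[NTHREADS];
    size_t chunk = (iteration_count + NTHREADS - 1) / NTHREADS;
    int t;

    for (t = 0; t < NTHREADS; t++) {
        size_t b = (size_t)t * chunk, e = b + chunk;
        if (b > iteration_count) b = iteration_count;
        if (e > iteration_count) e = iteration_count;
        arg[t].fn = fn;
        arg[t].state = task_state;
        arg[t].begin = b;
        arg[t].end = e;
        pthread_create(&tid[t], NULL, worker, &arg[t]);
    }
    for (t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

A real version would presumably keep the pool alive between calls and add a Win32-threads backend, but the calling convention stays the same either way, which is the point: no compiler support needed.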
> a) give up on sox_globals.use_threads with its code duplication and instead
> refer users to the OMP_NUM_THREADS/OMP_THREAD_LIMIT environment variables,
> which is more flexible too (a quad-core user might want to assign two
> threads); for the transition, we can call omp_set_num_threads() according
> to command-line options;

Actually, I consider this to be a liability of OpenMP. When sox is run as a
command-line tool where sox owns the whole process, this automatic
configuration is great. But if we are serious about providing a libsox (are
we serious about providing a libsox?), then sox can't assume it owns the
process, and sox should neither read nor write process-global settings (like
OpenMP's num_threads) unless requested to do so by the caller.

> b) make multi-threaded the default again when/since we've sorted out the
> problems that made MT slower than ST.
>
> There is a penalty on using OpenMP with only one thread, but it is not
> high. When comparing "sox --single-threaded" and "export OMP_NUM_THREADS=1;
> sox --multi-threaded" (simulation of the proposal), the latter takes at
> most a few percent more time.

Yeah, I agree. If you do your threading correctly, the single-threaded case
shouldn't be any slower.

> 2. Give a separate FFT cache to each thread, so that it doesn't have to
> wait until all other threads have completed their FFT calculations. See
> attached fftcache.patch for my idea of how to do that; running a
> sinc-spectrogram-dither chain on a 6-channel file, for me, takes 20 seconds
> single-threaded, 14 seconds multi-threaded without that patch (27 seconds
> of CPU time), but 11 seconds (21 CPU) with that patch. There is one
> blemish, though: I can't see a way to reliably free all buffers
> afterwards. We can perhaps live with that.

I agree about the buffers. I didn't mention them before, but I have given
them some thought. I would be interested to see what happens when you just
allocate/free the buffers on each iteration.
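For illustration, the "just allocate/free the buffers on each iteration" suggestion might look something like this. The shape of the loop is modeled loosely on a per-channel effect flow; flow_channels() and process_channel() are hypothetical stand-ins, not real sox functions (without OpenMP enabled the pragma is simply ignored and the loop runs sequentially):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch: give each iteration its own freshly malloc'd scratch
 * buffer instead of sharing a cached FFT buffer across threads. There is
 * no contention and, because every buffer is freed in the same iteration
 * that allocated it, nothing is left behind for a library to clean up. */

typedef void (*channel_fn)(float *chan, float *scratch, size_t len);

void flow_channels(float *const *chans, size_t nchans, size_t len,
                   channel_fn process_channel)
{
    long c;
    #pragma omp parallel for
    for (c = 0; c < (long)nchans; c++) {
        float *scratch = malloc(len * sizeof *scratch);  /* per iteration */
        if (scratch) {
            process_channel(chans[c], scratch, len);
            free(scratch);             /* nothing cached, nothing leaks */
        }
    }
}
```

Whether this is actually slower than the cached-buffer approach would need measuring; a decent malloc should make per-iteration allocation cheap relative to an FFT-sized workload.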
For a single-threaded sox, caching the buffer is likely to be faster. For a
multi-threaded sox, it's probably no slower to malloc/free than to keep
track of multiple buffers and find a free one. Maybe there is a way to be
clever and use the buffer cache in the case of num_threads==1? Or maybe it
isn't worth it.

As for freeing the buffers, this is another one of those areas where it
depends on whether we're serious about providing libsox or whether we just
provide the sox program. It is acceptable (at least in my book) for a
process to leave memory and threads behind for the OS to clean up after
process termination. It is not acceptable for a library to unload and leave
behind allocated dynamic memory or running threads. The general solution for
libraries is to require the client to open and close the library, and I am
strongly in favor of doing this for libsox. Without it, sox can't be safely
used by multiple clients in the same process: for example, two plugins
running in the same application might both want to use libsox, and if both
touch the global settings at the same time, they will stomp on each other.

> 3. Exploit the fact that some effects operate on each (wide) sample
> independently, among them gain (in some of its modes) and remix. Instead
> of processing the whole buffer in one go, flow_effect() could give a part
> each to several threads.

I'm less enthusiastic about this. The effects that operate on samples
independently are unlikely to spend enough CPU cycles for it to be worth
splitting them up.
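The open/close discipline argued for above is usually implemented with a reference count, so that several independent clients in one process can each open and close the library safely. A minimal sketch, assuming pthreads for the lock; the my_lib_* names are hypothetical, not the real libsox API:

```c
#include <pthread.h>

/* Hypothetical sketch of reference-counted library open/close: global
 * resources are created by the first open and destroyed by the last close,
 * so the library never leaves memory or threads behind after all clients
 * have closed it. */

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;
static int lib_refcount = 0;

int my_lib_open(void)
{
    pthread_mutex_lock(&lib_lock);
    if (lib_refcount++ == 0) {
        /* First client: allocate caches, start worker threads, etc. */
    }
    pthread_mutex_unlock(&lib_lock);
    return 0;
}

void my_lib_close(void)
{
    pthread_mutex_lock(&lib_lock);
    if (lib_refcount > 0 && --lib_refcount == 0) {
        /* Last client: free caches and join threads; nothing is left
         * behind, so the library can be safely unloaded. */
    }
    pthread_mutex_unlock(&lib_lock);
}
```

Two plugins in the same host would each call my_lib_open() when loaded and my_lib_close() when unloaded; the shared state lives exactly as long as at least one client needs it.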