From: Doug C. <idi...@us...> - 2012-03-14 02:53:49
> 0. As a prerequisite, we should commit to using OpenMP (when supported at
> compile time). At the only place where it is used currently (within
> flow_effect()), there are two almost identical code blocks--one using
> OpenMP, one not--just so that multithreading can be controlled via
> sox_globals.use_threads. If we had to use similar guards everywhere, it
> would discourage using OpenMP directives or make the code less maintainable
> due to duplication. So I propose to

You are correct that the current structure of that code is more complex than
necessary for the general case. It was previously just a single loop. I split
it up because there is a little bit of extra efficiency to be gained by
committing to the single-threaded case (and it gives the optimizer more to
work with). Since this was the only parallel loop in the program and didn't
look like a hot area for changes, I figured it deserved any extra care we
could give it to ensure the multi-threading didn't penalize the
single-threaded case. I agree that if we start doing more parallelization
spread throughout sox, it would be better to simplify the pattern used for
each one instead of trying to super-optimize each one.

I'm not totally enthusiastic about OpenMP. It is supported in fewer
environments than pthreads or other threading libraries. It would be really
easy for me to throw something together that works a lot like OpenMP but
would be based on pthreads for Unix and Win32 threads on Win32. It would use
a function like

    void parallel_for(iteration_count, task_function, task_state)

instead of depending on OpenMP compiler transformations. Doing this would
also cut those two extra DLLs (which have always made me a bit nervous) out
of the Win32 distribution.
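The parallel_for() idea above could be sketched roughly as follows. This is a minimal illustration using POSIX threads with a fixed pool size and static range splitting; all names (parallel_for, task_fn, worker) are hypothetical, not actual sox/libsox identifiers, and error handling is omitted:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch: run fn(i, task_state) for i in [0, iteration_count),
 * with the iterations split statically across a small pool of pthreads. */

typedef void (*task_fn)(size_t i, void *task_state);

typedef struct {
    task_fn fn;
    void *state;
    size_t begin, end;       /* half-open iteration range for this worker */
} worker_arg;

static void *worker(void *p)
{
    worker_arg *a = p;
    size_t i;
    for (i = a->begin; i < a->end; i++)
        a->fn(i, a->state);
    return NULL;
}

void parallel_for(size_t iteration_count, task_fn fn, void *task_state)
{
    enum { NTHREADS = 4 };   /* would really come from configuration */
    pthread_t tid[NTHREADS];
    worker_arg arg[NTHREADS];
    size_t chunk = (iteration_count + NTHREADS - 1) / NTHREADS;
    int t;

    for (t = 0; t < NTHREADS; t++) {
        size_t b = (size_t)t * chunk, e = b + chunk;
        if (b > iteration_count) b = iteration_count;
        if (e > iteration_count) e = iteration_count;
        arg[t].fn = fn;
        arg[t].state = task_state;
        arg[t].begin = b;
        arg[t].end = e;
        pthread_create(&tid[t], NULL, worker, &arg[t]);
    }
    for (t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

A real version would presumably keep the pool alive between calls and add a Win32-threads backend, but the calling convention stays the same either way, which is the point: no compiler support needed.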
> a) give up on sox_globals.use_threads with its code duplication and instead
> refer users to the OMP_NUM_THREADS/OMP_THREAD_LIMIT environment variables,
> which is more flexible too (a quad-core user might want to assign two
> threads); for the transition, we can call omp_set_num_threads() according
> to command-line options;

Actually, I consider this to be a liability of OpenMP. When sox is run as a
command-line tool where sox owns the whole process, this automatic
configuration is great. But if we are serious about providing a libsox (are
we serious about providing a libsox?), then sox can't assume it owns the
process, and sox should neither read nor write process-global settings (like
OpenMP's num_threads) unless requested to do so by the caller.

> b) make multi-threaded the default again when/since we've sorted out the
> problems that made MT slower than ST.
>
> There is a penalty on using OpenMP with only one thread, but it is not
> high. When comparing "sox --single-threaded" and "export OMP_NUM_THREADS=1;
> sox --multi-threaded" (simulation of the proposal), the latter takes at
> most a few percent more time.

Yeah, I agree. If you do your threading correctly, the single-threaded case
shouldn't be any slower.

> 2. Give a separate FFT cache to each thread, so that it doesn't have to
> wait until all other threads have completed their FFT calculations. See
> attached fftcache.patch for my idea of how to do that; running a
> sinc-spectrogram-dither chain on a 6-channel file, for me, takes 20 seconds
> single-threaded, 14 seconds multi-threaded without that patch (27 seconds
> of CPU time), but 11 seconds (21 CPU) with that patch. There is one
> blemish, though: I can't see a way to reliably free all buffers
> afterwards. We can perhaps live with that.

I agree about the buffers. I didn't mention them before, but I have given
them some thought. I would be interested to see what happens when you just
allocate/free the buffers on each iteration.
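For illustration, the "just allocate/free the buffers on each iteration" suggestion might look something like this. The shape of the loop is modeled loosely on a per-channel effect flow; flow_channels() and process_channel() are hypothetical stand-ins, not real sox functions (without OpenMP enabled the pragma is simply ignored and the loop runs sequentially):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch: give each iteration its own freshly malloc'd scratch
 * buffer instead of sharing a cached FFT buffer across threads. There is
 * no contention and, because every buffer is freed in the same iteration
 * that allocated it, nothing is left behind for a library to clean up. */

typedef void (*channel_fn)(float *chan, float *scratch, size_t len);

void flow_channels(float *const *chans, size_t nchans, size_t len,
                   channel_fn process_channel)
{
    long c;
    #pragma omp parallel for
    for (c = 0; c < (long)nchans; c++) {
        float *scratch = malloc(len * sizeof *scratch);  /* per iteration */
        if (scratch) {
            process_channel(chans[c], scratch, len);
            free(scratch);             /* nothing cached, nothing leaks */
        }
    }
}
```

Whether this is actually slower than the cached-buffer approach would need measuring; a decent malloc should make per-iteration allocation cheap relative to an FFT-sized workload.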
For a single-threaded sox, caching the buffer is likely to be faster. For a
multi-threaded sox, it's probably no slower to malloc/free than to keep
track of multiple buffers and find a free one. Maybe there is a way to be
clever and use the buffer cache in the case of num_threads==1? Or maybe it
isn't worth it.

As for freeing the buffers, this is another one of those areas where it
depends on whether we're serious about providing libsox or whether we just
provide the sox program. It is acceptable (at least in my book) for a
process to leave memory and threads behind for the OS to clean up after
process termination. It is not acceptable for a library to unload and leave
behind allocated dynamic memory or running threads. The general solution for
libraries is to require the client to open and close the library, and I am
strongly in favor of doing this for libsox. Without it, sox can't be safely
used by multiple clients in the same process: for example, two plugins
running in the same application might both want to use libsox, and if both
touch the global settings at the same time, they will stomp on each other.

> 3. Exploit the fact that some effects operate on each (wide) sample
> independently, among them gain (in some of its modes) and remix. Instead
> of processing the whole buffer in one go, flow_effect() could give a part
> each to several threads.

I'm less enthusiastic about this. The effects that operate on samples
independently are unlikely to spend enough CPU cycles for it to be worth
splitting them up.
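The open/close discipline argued for above is usually implemented with a reference count, so that several independent clients in one process can each open and close the library safely. A minimal sketch, assuming pthreads for the lock; the my_lib_* names are hypothetical, not the real libsox API:

```c
#include <pthread.h>

/* Hypothetical sketch of reference-counted library open/close: global
 * resources are created by the first open and destroyed by the last close,
 * so the library never leaves memory or threads behind after all clients
 * have closed it. */

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;
static int lib_refcount = 0;

int my_lib_open(void)
{
    pthread_mutex_lock(&lib_lock);
    if (lib_refcount++ == 0) {
        /* First client: allocate caches, start worker threads, etc. */
    }
    pthread_mutex_unlock(&lib_lock);
    return 0;
}

void my_lib_close(void)
{
    pthread_mutex_lock(&lib_lock);
    if (lib_refcount > 0 && --lib_refcount == 0) {
        /* Last client: free caches and join threads; nothing is left
         * behind, so the library can be safely unloaded. */
    }
    pthread_mutex_unlock(&lib_lock);
}
```

Two plugins in the same host would each call my_lib_open() when loaded and my_lib_close() when unloaded; the shared state lives exactly as long as at least one client needs it.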