prll is a pearl of a utility for parallel command execution

May 14th, 2010

SourceForge.net hosts projects of grand enterprise scope, as well as simple utilities that do just one thing to improve your life online. One project in the latter category is prll, a tool that lets you execute commands in parallel. It’s useful for anyone who uses the Linux command line and wants to execute the same command on multiple pieces of data (e.g. files) in parallel.

There are, of course, other such tools, but prll is different in that it takes as a command the name of a shell function (or its body directly), which makes it easy to use when working interactively in a shell. prll doesn’t require you to write scripts for complex commands, as is the case with, for example, xargs. And with prll the shell function has access to the same context (profile settings and environment variables) as the parent shell. Functions can request an abort of a whole batch, which makes it easy to stop on an error or to implement an ad-hoc parallel search. prll also does internal output buffering and locking, preventing interleaving of outputs from different jobs.
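As a rough illustration of that workflow, the sketch below defines a shell function and hands its name to prll. The function `count_lines` and the wrapper `count_all` are hypothetical names invented for this example. It assumes prll has been sourced into the current shell, that it accepts the function name followed by the items to process, and that each item reaches the function as `$1`. When prll is not available, the wrapper falls back to a plain serial loop.

```shell
# Hypothetical example function: print the line count of one file.
# Assumption: prll hands each item to the function as $1.
count_lines() {
    n=$(wc -l < "$1")
    printf '%s %s\n' "$((n))" "$1"
}

# Run count_lines over all arguments: in parallel if prll has been
# sourced into this shell, serially otherwise.
count_all() {
    if command -v prll >/dev/null 2>&1; then
        prll count_lines "$@"
    else
        for f in "$@"; do
            count_lines "$f"
        done
    fi
}

# Usage: count_all *.txt
```

Because `count_lines` is an ordinary shell function, it can read any variable or alias defined in the session, which is the convenience the paragraph above describes.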

Project leader Jure Varlec created prll to meet a common need. “Everyone who uses a CLI accumulates, with time, many aliases in their shell init file. My preferred way to work on a particular task is to open a shell, then source a file that contains specialized environment variables, aliases, and functions for that task. That way, I can have different environments for different tasks without polluting my normal environment. Using shell functions instead of utility scripts not only prevents polluting PATH, but allows me to use environment variables to share state between functions.

“When I got to use a multicore machine three years ago, I found that parallelizing execution in a shell isn’t as simple as I expected it to be. The easiest way is, of course, a for loop with a counter that calls ‘wait’ for every N processes backgrounded, where N is the number of cores. But waiting for all N jobs to complete before starting the next batch is quite inefficient. I did some further research and found other parallelization tools in time, but none that would make use of shell functions simple.
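The batched loop Varlec describes might look something like the following sketch. The names `process_item` and `run_batched` are made up for illustration, and the work itself is a placeholder. Note how the `wait` inside the loop holds back the next batch until the slowest job of the current one has finished, which is the inefficiency he mentions.

```shell
# Hypothetical stand-in for real per-item work.
process_item() {
    echo "processed $1"
}

# run_batched N item...: start up to N background jobs, then wait for
# the whole batch to finish before starting the next one.
run_batched() {
    batch_size=$1
    shift
    running=0
    for item in "$@"; do
        process_item "$item" &
        running=$((running + 1))
        if [ "$running" -ge "$batch_size" ]; then
            wait            # blocks until every job in the batch exits
            running=0
        fi
    done
    wait                    # collect the final, possibly partial, batch
}

run_batched 2 a b c d e
```

With a batch size of 2, a core sits idle whenever one job of a pair finishes early. The order of the output lines within a batch is not guaranteed.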

“prll is implemented as a shell function itself. That enables it to inherit the whole environment. It utilizes subshells, so it can’t modify this environment, but that’s not much of a limitation – modifying shared variables in a parallel environment is complex and demands a complex solution. That is not prll’s scope; rather, it makes simple things simple, and moderately complex things a bit easier.

“prll also has helper programs, written in C, to handle System V IPC. C is the obvious choice here since these programs interface with the operating system directly. In the current version, there are two of them; one handles message queues, the other semaphores and buffering. The latter is another important feature of prll. When multiple processes write to the same output, there are problems when they write at the same time. To see what I mean, try using xargs’ parallel execution mode to md5sum a large number of files and redirect the results to a file. You will find that the output is mangled, even though each job only writes a small amount of data. If a job’s output is larger than the OS’s stream buffer, the jobs’ outputs are sure to be interleaved. prll’s buffers read the whole output from each job and synchronise to only write one at a time.”
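The buffer-then-lock idea can be approximated in plain shell. This is not how prll does it: prll uses its own C helpers and System V semaphores, whereas the sketch below swaps in temporary files and the `flock` utility from util-linux. The function names `noisy_job`, `buffered_job`, and `run_buffered` are hypothetical.

```shell
# Hypothetical job producing several lines of output.
noisy_job() {
    echo "begin $1"
    echo "end $1"
}

# Buffer the job's whole output in a temp file, then copy it to stdout
# while holding an exclusive lock, so only one job writes at a time.
buffered_job() {
    lock_file=$1
    buffer=$(mktemp)
    noisy_job "$2" > "$buffer"
    {
        flock 9             # block until we own the lock on fd 9
        cat "$buffer"
    } 9>>"$lock_file"       # closing fd 9 releases the lock
    rm -f "$buffer"
}

# Start one buffered job per argument and wait for all of them.
run_buffered() {
    lock_file=$(mktemp)
    for item in "$@"; do
        buffered_job "$lock_file" "$item" &
    done
    wait
    rm -f "$lock_file"
}

run_buffered a b c
```

The jobs may finish in any order, but each job's "begin" and "end" lines stay together because no other job can write while the lock is held.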

But Varlec cautions that prll isn’t a tool for all uses. “prll is made to solve a certain class of problems, and it makes parallelization very easy in a lot of cases. prll uses IPC to facilitate locking and synchronisation so you don’t have to deal with it. But when the functions you want it to execute need to utilise locking themselves, you’re dealing with a more complex problem. Don’t expect to find a solution that is both simple and general enough.”

Varlec is working on some cosmetic changes for the next version of prll. “I will also try to make prll POSIX-compliant. Most of the work has already been done by a helpful user, but I have yet to check it, merge it with the latest version, and see if I can maintain it in the long run. I’m also considering some internal utilities to be made available to the functions being executed. For example, I’d like to provide a locking mechanism to users, further expanding the usefulness of prll. But I want to keep prll as simple as possible.

“prll is a small utility and that’s how I’d like to keep it. I can do the work myself, but I’m always looking for comments and ideas. A tool like prll has to be focused on usability, so I’d like to hear from anyone who has an idea about how to make it more usable. There are also technical issues I’d like to discuss with interested people. There is, for instance, a need to limit the amount of memory a buffer uses, but what to do when that limit is exceeded? I’m available through e-mail.”
