I have a number of very large binary files (cross-compilers, etc.) in my source trees, and when I run genxref (using swish-e) it uses far more CPU, memory, and time than I would expect. I examined the code and found an algorithmic error.
For swish-e, genxref first reads the entire contents of each file into memory (in Perl), and only THEN checks whether the file is binary, using File::MMagic::checktype_contents(). If the file is determined to be binary (or empty), it is skipped.
This is obviously extremely inefficient, since File::MMagic never needs to see the entire file to know its file type: it needs to look at the filename (which it doesn't have here!) and possibly the first 8K or so of the file.
So, I rewrote this section of genxref to get a filehandle for the file, then test to see if it's binary, and only if it's not do we read the contents of the file into memory. This sped up my genxref significantly and greatly reduced the load on my server during indexing.
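For anyone who wants the idea without opening the patch, here is a minimal sketch of the reordering. It is not the attached patch itself: the helper name slurp_if_text is made up for illustration, and treating only MIME types starting with "text/" as indexable is my assumption about what counts as non-binary. It uses File::MMagic's checktype_filehandle() on an IO::File handle, then rewinds before reading the contents.

```perl
use strict;
use warnings;
use IO::File;
use File::MMagic;

# Hypothetical helper (not the name used in genxref): returns the file's
# contents if it looks like text, or undef if the file is skipped.
sub slurp_if_text {
    my ($mm, $path) = @_;

    my $fh = IO::File->new($path, 'r') or return;
    binmode $fh;

    # Empty files are skipped without reading anything.
    return if -z $fh;

    # File::MMagic inspects only the start of the file through the handle,
    # so a huge binary is never pulled into memory.
    my $type = $mm->checktype_filehandle($fh);

    # Assumption: anything that is not text/* is treated as binary.
    return unless defined $type && $type =~ m{^text/};

    # The type check moved the file position, so rewind before slurping.
    seek($fh, 0, 0) or return;
    local $/;    # slurp mode: read the whole file in one go
    my $contents = <$fh>;
    return $contents;
}
```

In the indexing loop, something like "my $contents = slurp_if_text($mm, $file); next unless defined $contents;" would replace the read-then-check sequence, with $mm created once via File::MMagic->new.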
Patch from latest CVS attached.