#126 genxref/swish-e/binary files is VERY inefficient

Milestone: current_cvs
Status: closed-fixed
Owner: Malcolm Box
Labels: genxref (49)
Priority: 5
Updated: 2009-03-23
Created: 2007-03-30
Creator: Paul D. Smith
Private: No

I have a number of very large binary files (cross-compilers, etc.) in my source trees, and when I run genxref (using swish-e) it uses far more CPU, memory, and time than I would expect. I examined the code and found an algorithmic problem.

For swish-e, genxref reads the contents of the files into memory (in Perl), THEN it examines the file to determine whether it's binary or not using File::MMagic::checktype_contents(). If the file is determined to be binary (or empty) we skip it.

This is obviously extremely inefficient, since File::MMagic never needs to see the entire file to know its file type: it needs to look at the filename (which it doesn't have here!) and possibly the first 8K or so of the file.

So, I rewrote this section of genxref to get a filehandle for the file, then test to see if it's binary, and only if it's not do we read the contents of the file into memory. This sped up my genxref significantly and greatly reduced the load on my server during indexing.
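The check-before-read order described above can be sketched roughly as follows. This is an illustration in Python, not the attached patch (genxref itself is Perl). The names `SNIFF_BYTES`, `looks_binary`, and `read_if_text` are hypothetical, and the NUL-byte test is a crude stand-in for File::MMagic's type detection, assumed here only to keep the example self-contained.

```python
SNIFF_BYTES = 8192  # "the first 8K or so" mentioned in the report


def looks_binary(head: bytes) -> bool:
    """Crude stand-in for File::MMagic (an assumption for this sketch):
    treat an empty sample, or one containing a NUL byte, as skippable."""
    return len(head) == 0 or b"\x00" in head


def read_if_text(path):
    """Return the file's bytes, or None if it is skipped as binary/empty.

    Only the first SNIFF_BYTES are read before the decision is made, so a
    large binary file never has to be loaded into memory in full.
    """
    with open(path, "rb") as fh:
        head = fh.read(SNIFF_BYTES)
        if looks_binary(head):
            return None
        # The file looks like text: now read the remainder.
        return head + fh.read()
```

With this ordering, the cost of rejecting a binary file is bounded by the size of the sample rather than the size of the file. A real detector would be more thorough than the NUL-byte check, which misses binary content that only appears after the sampled prefix.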

Patch from latest CVS attached.

Discussion

  • Paul D. Smith
    2007-03-30

    Patch to fix binary file detection with swish-e

  • Malcolm Box
    2009-03-23

    Thanks for the patch, have applied this to CVS.

  • Malcolm Box
    2009-03-23

    • assigned_to: nobody --> mbox
    • status: open --> closed-fixed