#131 Tagger & Plain.pm inefficient

Milestone: current_cvs
Status: closed-fixed
Owner: Andre-Littoz
Labels: genxref (49)
Priority: 3
Updated: 2012-03-29
Created: 2009-03-26
Creator: Malcolm Box
Private: No

Tagger.pm uses Files::tmpfile to get a file for indexing. In the plain file case, this is created by copying the existing file into a temporary file. After indexing, the temporary file is deleted.

The interface from Tagger.pm to Files.pm should be fixed so that Tagger.pm simply gets a read-only reference to the file and closes it when it is no longer needed. Leave it up to Files.pm to worry about where and how the file is stored on disk.

Alternatively, on Unix systems Plain.pm could just create a hard link and give that back to Tagger.pm, but this isn't portable.
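
The hard-link alternative can be illustrated with a small sketch. It is written in Python purely for illustration (LXR itself is Perl), and the helper name link_or_copy is hypothetical. It tries a hard link first and falls back to a plain copy where linking is unavailable, which is the portability concern raised above.

```python
import os
import shutil


def link_or_copy(source, workdir):
    """Return (path, linked): a scratch name in workdir that holds source's content."""
    target = os.path.join(workdir, os.path.basename(source))
    try:
        os.link(source, target)          # hard link: no file data is copied
        return target, True
    except (OSError, NotImplementedError, AttributeError):
        shutil.copyfile(source, target)  # fallback: ordinary copy
        return target, False
```

Either way the caller can delete the returned path afterwards without touching the original file.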

Discussion

  • Andre-Littoz
    2012-03-29

    Files.pm interface changed as follows:
    - method tmpfile is renamed realfilename; it returns the name of a file containing the text of the requested file and version
    - new method releaserealfilename signals that the real filename is no longer needed.

    Plain.pm:
    - realfilename returns the OS absolute path to the requested file version
    - releaserealfilename does nothing

    CVS.pm, BK.pm (the latter on a guess, result not tested):
    - realfilename makes a temp copy of the requested file version and returns the OS absolute path of the temp file
    - releaserealfilename deletes the temp file

    Git.pm: no fix because Git support is broken in the CPAN library.
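
The realfilename / releaserealfilename pairing described above can be sketched as follows. This is a Python illustration of the contract only, not the actual Perl code. The class names PlainFiles and CheckoutFiles, the fetch callable, the index_file helper and the <rootpath>/<version>/<filename> layout are all assumptions made for the example.

```python
import os
import tempfile


class PlainFiles:
    """Files already on disk: hand out the real path, nothing to clean up."""

    def __init__(self, rootpath):
        self.rootpath = rootpath

    def realfilename(self, filename, version):
        # Assumed layout: <rootpath>/<version>/<filename>
        relative = filename.lstrip("/")
        return os.path.abspath(os.path.join(self.rootpath, version, relative))

    def releaserealfilename(self, path):
        pass  # the file belongs to the source tree, so it is never deleted


class CheckoutFiles:
    """Stand-in for a VCS backend: content is extracted into a temp file."""

    def __init__(self, fetch):
        self.fetch = fetch  # hypothetical callable: (filename, version) -> str

    def realfilename(self, filename, version):
        fd, path = tempfile.mkstemp(suffix=os.path.basename(filename))
        with os.fdopen(fd, "w") as handle:
            handle.write(self.fetch(filename, version))
        return path

    def releaserealfilename(self, path):
        os.remove(path)  # the temp copy is ours to delete


def index_file(files, filename, version, indexer):
    """Caller side (the role Tagger.pm plays): always pair the two calls."""
    path = files.realfilename(filename, version)
    try:
        return indexer(path)
    finally:
        files.releaserealfilename(path)
```

With this split the caller no longer knows or cares whether a copy was made; only backends that need a temporary file pay for one.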

     
  • Andre-Littoz
    2012-03-29

    • assigned_to: mbox --> ajlittoz
    • status: open --> closed-fixed
     
  • Andre-Littoz
    2012-04-06

    Made extensive tests and I'm a bit disappointed. The overall time improvement seems to be only 10-15%. It means either the Linux caching mechanism is very, very efficient or other factors dominate indexing time.

    One of these could be --reindexall processing. On a moderately sized test (8 LXR releases), genxref takes 70 seconds (high-end computer, 4 GB memory, 3.4 GHz, fast SATA) for a first indexing run (empty database) and 88 seconds with --reindexall. In both cases, the fluctuation in run time is about ±1 second.

    For Linux kernel 3.1 (more than 31,000 files), genxref takes 3 hours 53 minutes 58 seconds (single run, no deviation computed) with a fresh database, while --reindexall takes 5 hours 30 minutes! Unless you want to keep cross-references on other versions, it is much better to erase the database first than to use --reindexall.
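
To put the two measurements on the same scale, the extra cost of --reindexall can be expressed as a percentage of the fresh-database run. The short Python calculation below uses only the figures quoted above; the helper names are made up for the example.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


def overhead_percent(fresh, reindex):
    """Extra run time of --reindexall relative to a fresh-database run."""
    return 100.0 * (reindex - fresh) / fresh


# 8 LXR releases: 70 s fresh, 88 s with --reindexall
small = overhead_percent(seconds(secs=70), seconds(secs=88))
# Linux kernel 3.1: 3 h 53 min 58 s fresh, 5 h 30 min with --reindexall
kernel = overhead_percent(seconds(3, 53, 58), seconds(5, 30))

print(f"LXR sample: +{small:.0f}%  kernel 3.1: +{kernel:.0f}%")
```

On these figures the overhead is roughly 26% for the small sample and roughly 41% for the kernel, so the penalty appears to grow with the size of the tree.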

    Another examination shows that the ctags step is quite fast (ctags is compiled, not interpreted) while the references step is slow. Most of the time is spent in SimpleParse.pm's nextfrag subroutine. A real improvement would result from using a compiled parser, probably all the more so if it were a real finite state automaton parser instead of the surrogate regexp-based, Perl-interpreted parser.

    Also, what happens during the glimpseindex step is not clear: a line "This is glimpseindex version ..." is printed, then nothing more for a very long time (kernel case), as if glimpseindex were frozen (but it is not). This needs some explanation or caveat from the glimpse team.

    What also puzzles me: I made tests on my old low-end laptop (512 MB memory, 650 MHz, standard PATA) with the LXR sample. The results do not reflect the clock ratio between the machines. I would expect an extra penalty on the slow PC (less memory to hold the caches), yet it does better than the clock ratio predicts: 3 minutes 58 seconds on average against 1 minute 12 seconds on the fast machine (same sample). The run-time ratio is 3.3 for a clock ratio of 5.23.
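
The two ratios can be reproduced from the quoted figures with a few lines of Python. The variable names are illustrative, and the fast machine is taken to be the 3.4 GHz computer mentioned earlier in this comment.

```python
slow_run = 3 * 60 + 58    # seconds on the 650 MHz laptop
fast_run = 1 * 60 + 12    # seconds on the 3.4 GHz machine
slow_clock_mhz = 650
fast_clock_mhz = 3400

time_ratio = slow_run / fast_run                # how much longer the laptop takes
clock_ratio = fast_clock_mhz / slow_clock_mhz   # how much faster the big machine clocks

print(f"run-time ratio {time_ratio:.1f}, clock ratio {clock_ratio:.2f}")
```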

    All times are the 'real' results from time. I have not compared 'user' and 'sys'.

     
  • Andre-Littoz
    2012-04-07

    Accurate time for the 3.1 kernel is 5 hours 46 minutes 56 seconds.