#1011 Startup is slow

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-08-21
Created: 2006-04-19
Creator: Jon Shemitz
Private: No

Loading POPFile is very slow when the bucket files get
big. (For example, I have 54,894 distinct 'words' in my
spam folder.) This has a very strong negative impact on
other processes that run at startup. For example, I
have a couple of applets that require a database to be
running. Pre-POPFile, they ran just fine at startup;
once I'd been using POPFile for a while, they started
to fail because the db manager wasn't initialized yet. I
had to add a sleep() to the processes that load these
applets.

* One quick fix I've made is to write an app that
runs at startup and lowers the popfileib.exe priority.
This doesn't seem to hurt POPFile, and it does let
other processes load more normally. This is something
that POPFile itself could do, very trivially.

* A more involved fix might involve a UI to prune the
bucket files. Perhaps an option to show the word frequency
distribution (i.e., five words have been used thousands
of times; dozens of words have been used hundreds of
times; hundreds of words have been used tens of times;
while most words have only been used two or three
times) and then a way to delete words that have
been used fewer than N times.
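
As a rough illustration of the second suggestion, here is a minimal Python sketch of a frequency distribution and a "fewer than N uses" prune. It assumes the bucket has already been loaded into a plain word-to-count dictionary; the function names and the sample data are hypothetical, and nothing here reads or writes POPFile's actual corpus.

```python
from collections import Counter


def frequency_histogram(word_counts):
    """Count how many words fall into each order of magnitude of use.

    Keys are 1 (used 1-9 times), 10 (10-99 times), 100, 1000, ...
    """
    histogram = Counter()
    for count in word_counts.values():
        if count <= 0:
            continue  # ignore words with no recorded uses
        magnitude = 10 ** (len(str(count)) - 1)
        histogram[magnitude] += 1
    return dict(sorted(histogram.items()))


def prune(word_counts, min_count):
    """Return a copy that keeps only words used at least min_count times."""
    return {word: n for word, n in word_counts.items() if n >= min_count}


# Hypothetical sample bucket, for illustration only.
sample = {"viagra": 4200, "free": 350, "offer": 42, "xyzzy": 2, "qwerty": 3}
histogram = frequency_histogram(sample)
pruned = prune(sample, 10)
```

The histogram gives the user something to look at before choosing N, and `prune` returns a new dictionary rather than modifying the original, so the result can be previewed before anything is saved.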

Discussion

  • Brian Smith - 2006-04-24

    Logged In: YES
    user_id=696850

    Have you checked for corpus corruption? (The Wiki explains
    how to do this)

    >> prune the bucket files ... a way to delete words <<

    Manni (one of the POPFile developers) has written a useful
    script which can be used to clean your corpus. The default
    mode removes "meaningless" words from the corpus. It also
    has a mode where it removes words which appear in every
    bucket with similar probabilities.

    You can find out more about this script (and download it)
    here:

    http://popfile.manni-heumann.de/cc/

    I have been using his script frequently for some time now
    and have found it very useful.

    By default the script makes no changes to your corpus; it
    simply creates a text file listing all the words that the
    script will remove from your corpus if you re-run the script
    and tell it to update your corpus.

    The size of the corpus database depends upon more than just
    the number of words in your corpus. POPFile also uses the
    database to store some information on every message kept in
    the message history, so if you keep thousands of messages
    there, this can make the corpus very big.

    You may be able to reduce the corpus size by defragmenting
    it using the SQLite command-line utility (the VACUUM command
    cleans up the database).
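
For readers without the command-line utility, the same VACUUM command can be issued from a short script. This is only a sketch under stated assumptions: it uses Python's standard sqlite3 module, which opens SQLite 3 files only (a database written by an SQLite 2.x library will be rejected), the `vacuum` helper name is made up for illustration, and the path to the database is whatever your installation uses. It is presumably safest to run it while POPFile is shut down and after making a backup copy.

```python
import os
import sqlite3


def vacuum(db_path):
    """Run VACUUM on db_path and return (size_before, size_after) in bytes."""
    before = os.path.getsize(db_path)
    # isolation_level=None selects autocommit mode; VACUUM cannot run
    # inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return before, os.path.getsize(db_path)
```

Comparing the two returned sizes shows how much space, if any, the cleanup recovered.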

    You didn't mention the operating system and the version of
    POPFile so I cannot give detailed instructions here.

    Brian

     
  • Texas Fett - 2006-04-28

    Logged In: YES
    user_id=663087

    A large message history really does have an effect on
    POPFile's startup speed. It seems to me to matter much more
    than a large number of words in buckets. I have the exact same
    problem at startup (except that it doesn't cause other
    programs to fail) on Windows XP. POPFile is very slow to
    start up and makes the entire machine slow to finish loading
    enough to be usable.

    I have 30,000 messages in history and used to have even
    more. Getting rid of some helped a lot, but I would
    recommend keeping far fewer than 30,000 messages. That is
    365 days' worth for me. You can change how many days POPFile
    keeps messages on the Configuration page under History View.

    I also think the Clean Corpus utility may help. It should
    do exactly what you want and remove unimportant words.

     
  • Jon Shemitz - 2006-04-28

    Logged In: YES
    user_id=999520

    I rarely have more than 12 hours' worth of message history.
    I scan for false negatives regularly, and delete immediately
    after scanning. I, too, have found how expensive message
    history is.

    Clean corpus did help a bit, but POPFile loading is still
    slow and does impact other processes, though not as much. I
    would still suggest that clean corpus belongs in the UI, not
    on the command line.

    FWIW, even after cleaning the corpus, I still use my drop
    priority hack, and would recommend that POPFile incorporate
    it. Dropping process priority should be a one-line patch (or
    two lines, if you have to access a library), in Perl as in C#.
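
To give a sense of how small such a change is, here is a sketch of a process lowering its own priority. It is written in Python rather than Perl or C#, it is not POPFile code, and the function name is hypothetical. On Windows it calls the Win32 `SetPriorityClass` function through ctypes; elsewhere it uses `os.nice`.

```python
import os
import sys

BELOW_NORMAL_PRIORITY_CLASS = 0x00004000  # Win32 priority class constant


def lower_own_priority(nice_increment=5):
    """Lower the scheduling priority of the current process.

    Returns True on success. nice_increment is only used on Unix-like systems.
    """
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        # Declare signatures so the pseudo-handle survives on 64-bit Windows.
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetPriorityClass.argtypes = (wintypes.HANDLE, wintypes.DWORD)
        kernel32.SetPriorityClass.restype = wintypes.BOOL
        handle = kernel32.GetCurrentProcess()
        return bool(kernel32.SetPriorityClass(handle, BELOW_NORMAL_PRIORITY_CLASS))
    # Unix: a higher niceness value means a lower priority.
    os.nice(nice_increment)
    return True
```

Most of the length is the ctypes declarations; the actual work is one call on each platform, which is consistent with the "one or two lines" estimate.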
