POPFile - Automatic Email Classification / Feature Requests / #1011 Startup is slow

Brian Smith - 2006-04-24

Logged In: YES
user_id=696850

Have you checked for corpus corruption? (The Wiki explains
how to do this.)

>> prune the bucket files ... a way to delete words <<

Manni (one of the POPFile developers) has written a useful
script which can be used to clean your corpus. The default
mode removes "meaningless" words from the corpus. It also
has a mode where it removes words which appear in every
bucket with similar probabilities.
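To make the second mode concrete, here is a rough Python sketch of the idea of "words which appear in every bucket with similar probabilities". It is not taken from Manni's script: the function name, the in-memory `corpus` layout (bucket name mapped to word counts) and the `max_ratio` threshold are all assumptions made up for illustration.

```python
def words_common_to_all_buckets(corpus, max_ratio=1.5):
    """Return words found in every bucket with roughly equal probability.

    corpus: {bucket_name: {word: count}}  (hypothetical layout)
    max_ratio: largest allowed ratio between a word's highest and lowest
               per-bucket probability (hypothetical threshold)
    """
    if not corpus:
        return []
    # Total word count per bucket, used to turn counts into probabilities.
    totals = {bucket: sum(counts.values()) for bucket, counts in corpus.items()}
    # Only words with a positive count in every bucket are candidates.
    shared = set.intersection(
        *(set(w for w, n in counts.items() if n > 0) for counts in corpus.values())
    )
    result = []
    for word in sorted(shared):
        probs = [corpus[bucket][word] / totals[bucket] for bucket in corpus]
        # A word that is about equally likely everywhere tells the
        # classifier very little, so it is a candidate for removal.
        if max(probs) / min(probs) <= max_ratio:
            result.append(word)
    return result
```

A word such as "the", with nearly the same share of every bucket, would be reported, while a word concentrated in one bucket would be kept.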

You can find out more about this script (and download it)
here:

http://popfile.manni-heumann.de/cc/

I have been using his script frequently for some time now
and have found it very useful.

By default the script makes no changes to your corpus. It
simply creates a text file listing all the words that it
will remove from your corpus if you re-run the script and
tell it to update your corpus.

The size of the corpus database depends upon more than just
the number of words in your corpus. POPFile also uses the
database to store some information on every message kept in
the message history, so keeping thousands of messages there
can make the corpus database very big.

You may be able to reduce the corpus size by defragmenting
it using the SQLite command-line utility (the VACUUM command
cleans up the database).
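For anyone who would rather script the VACUUM step than type it into the command-line utility, a minimal Python sketch follows. It assumes the database file is in SQLite 3 format, because that is the only format Python's `sqlite3` module opens; for a file in an older format, use the matching command-line utility instead. The function name is made up, and the file name in the usage note is an assumption. Shut POPFile down and back up the file before trying this.

```python
import os
import sqlite3


def vacuum_database(db_path):
    """Run VACUUM on an SQLite file; return (size_before, size_after) in bytes."""
    size_before = os.path.getsize(db_path)
    # isolation_level=None selects autocommit mode, because VACUUM
    # cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return size_before, os.path.getsize(db_path)
```

Calling `vacuum_database("popfile.db")` (file name assumed) returns the size before and after, so you can see whether the defragmentation recovered any space.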

You didn't mention your operating system or your version of
POPFile, so I cannot give detailed instructions here.

Brian

Texas Fett - 2006-04-28

Logged In: YES
user_id=663087

A large message history really does have an effect on
POPFile's startup speed. It seems to me to matter much more
than a large number of words in the buckets. I have exactly
the same problem at startup (except that it doesn't cause
other programs to fail) on Windows XP. POPFile is very slow
to start up, and the entire machine is slow to finish
loading enough to be usable.

I have 30,000 messages in history and used to have even
more. Getting rid of some helped a lot, but I would
recommend keeping far fewer than 30,000 messages. That is
365 days' worth for me. You can change how many days POPFile
keeps messages on the Configuration page under History View.

I also think the Clean Corpus utility may help. It should
do exactly what you want and remove unimportant words.

Jon Shemitz - 2006-04-28

Logged In: YES
user_id=999520

I rarely have more than 12 hours' worth of message history:
I scan for false negatives regularly, and delete immediately
after scanning. I, too, have found how expensive message
history is.

Clean corpus did help a bit, but POPFile loading is still
slow and does impact other processes, though not as much. I
would still suggest that clean corpus belongs in the UI, not
on the command line.

FWIW, even after cleaning the corpus, I still use my
drop-priority hack, and would recommend that POPFile
incorporate it. Dropping process priority should be a
one-line patch (or two lines, if you have to access a
library), in Perl as in C# ....
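To show roughly how small such a change is, here is the same idea sketched in Python rather than Perl or C#. This is not the actual hack and not POPFile code: the function name and the default increment are assumptions. On Unix-like systems it uses `os.nice`; on Windows it calls the Win32 `SetPriorityClass` function through `ctypes`.

```python
import os
import sys


def lower_priority(increment=10):
    """Ask the OS to schedule the current process at a lower priority."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        BELOW_NORMAL_PRIORITY_CLASS = 0x00004000
        kernel32 = ctypes.windll.kernel32
        # Declare the signatures so the process handle is not truncated
        # on 64-bit Windows.
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetPriorityClass.argtypes = [wintypes.HANDLE, wintypes.DWORD]
        kernel32.SetPriorityClass.restype = wintypes.BOOL
        handle = kernel32.GetCurrentProcess()
        return bool(kernel32.SetPriorityClass(handle, BELOW_NORMAL_PRIORITY_CLASS))
    # Unix-like systems: a higher niceness means a lower priority.
    os.nice(increment)
    return True
```

Calling `lower_priority()` once, early in startup, would be enough; the process keeps the lowered priority until it exits or changes it again.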
