Loading POPFile is very slow when the bucket files get
big. (For example, I have 54,894 distinct 'words' in my
spam folder.) This has a very strong negative impact on
other processes that run at startup. For example, I
have a couple of applets that require a database to be
running; pre-POPFile, they ran just fine at startup;
once I'd been using POPFile for a while, they started
to fail because the db manager wasn't initialized. I
had to add a sleep() to the processes that loaded these
applets ....
* One quick fix I've made is to write an app that
runs at startup and lowers the popfileib.exe priority.
This doesn't seem to hurt POPFile, and it does let
other processes load more normally. This is something
that POPFile itself could do, very trivially.
* A more involved fix might involve a UI to prune the
bucket files. Perhaps an option to show word frequency
distribution (i.e., five words have been used thousands
of times; dozens of words have been used hundreds of
times; hundreds of words have been used tens of times;
while most words have only been used two or three
times) and then a way to delete words that have been
used fewer than N times.
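To make the pruning idea concrete, here is a minimal sketch in Python. It assumes the bucket has already been read into a plain word-to-count mapping; the function names, the band labels, and the sample data are all hypothetical and are not part of POPFile.

```python
from collections import Counter
from typing import Dict


def frequency_distribution(word_counts: Dict[str, int]) -> Dict[str, int]:
    """Count how many words fall into each order-of-magnitude band."""
    bands = Counter()
    for count in word_counts.values():
        if count >= 1000:
            bands["1000+"] += 1
        elif count >= 100:
            bands["100-999"] += 1
        elif count >= 10:
            bands["10-99"] += 1
        else:
            bands["1-9"] += 1
    return dict(bands)


def prune_rare_words(word_counts: Dict[str, int], min_count: int) -> Dict[str, int]:
    """Return a copy of the bucket without words used fewer than min_count times."""
    return {word: count for word, count in word_counts.items() if count >= min_count}


# Hypothetical sample bucket, for illustration only.
sample_bucket = {"viagra": 4200, "free": 310, "offer": 45, "xyzzy": 2, "qwerty": 3}
print(frequency_distribution(sample_bucket))
print(sorted(prune_rare_words(sample_bucket, min_count=10)))
```

A UI could show the distribution first and only then ask for N, so the user can see how many words a given threshold would remove.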
Logged In: YES
user_id=696850
Have you checked for corpus corruption? (The Wiki explains
how to do this)
>> prune the bucket files ... a way to delete words <<
Manni (one of the POPFile developers) has written a useful
script which can be used to clean your corpus. The default
mode removes "meaningless" words from the corpus. It also
has a mode where it removes words which appear in every
bucket with similar probabilities.
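As a rough illustration of the second mode, the sketch below flags words that occur in every bucket with a similar relative frequency. This is not Manni's script or its actual algorithm; the function name, the `max_ratio` threshold, and the sample data are assumptions made for the example, and it is written in Python rather than Perl.

```python
from typing import Dict, List


def words_with_similar_share(
    buckets: Dict[str, Dict[str, int]], max_ratio: float = 1.5
) -> List[str]:
    """List words found in every bucket whose relative frequency differs
    by at most max_ratio between the highest and lowest bucket."""
    if not buckets:
        return []
    # Total word occurrences per bucket, used to turn counts into shares.
    totals = {name: sum(words.values()) for name, words in buckets.items()}
    # Only words present in all buckets are candidates.
    common = set.intersection(*(set(words) for words in buckets.values()))
    similar = []
    for word in sorted(common):
        shares = [buckets[name][word] / totals[name] for name in buckets]
        if max(shares) <= max_ratio * min(shares):
            similar.append(word)
    return similar


# Hypothetical two-bucket corpus, for illustration only.
corpus = {
    "spam": {"the": 100, "viagra": 50, "meeting": 1},
    "inbox": {"the": 95, "meeting": 40, "viagra": 1},
}
print(words_with_similar_share(corpus))
```

Words flagged this way carry little information for classification, which is why removing them should shrink the corpus without hurting accuracy much.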
You can find out more about this script (and download it)
here:
http://popfile.manni-heumann.de/cc/
I have been using his script frequently for some time now
and have found it very useful.
By default the script makes no changes to your corpus; it
simply creates a text file listing all the words that the
script would remove from your corpus if you re-ran it
and told it to update your corpus.
The size of the corpus database depends upon more than just
the number of words in your corpus. POPFile uses the
database to store some information on every message kept in
the message history, so if you keep thousands of messages
there, this can make the corpus very big.
You may be able to reduce the corpus size by defragmenting
it using the SQLite command-line utility (the VACUUM command
cleans up the database).
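For reference, the same VACUUM command can be issued from a short script instead of the command-line utility. The sketch below uses Python's standard sqlite3 module and assumes the database file is in a format that module can open (SQLite 3); if it is not, use the command-line utility that matches your POPFile version. The function name and the path are placeholders, POPFile should be shut down first, and it is wise to keep a backup copy of the file.

```python
import os
import sqlite3
from typing import Tuple


def vacuum_database(db_path: str) -> Tuple[int, int]:
    """Run VACUUM on an SQLite file and return (size_before, size_after) in bytes."""
    size_before = os.path.getsize(db_path)
    # isolation_level=None keeps the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return size_before, os.path.getsize(db_path)


# Usage (the path is a placeholder, not a real POPFile location):
# before, after = vacuum_database("/path/to/corpus-database-file")
# print(f"{before} bytes -> {after} bytes")
```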
You didn't mention the operating system and the version of
POPFile so I cannot give detailed instructions here.
Brian
Logged In: YES
user_id=663087
A large message history really does have an effect on
POPFile's startup speed. It seems to me to matter much more
than a large number of words in buckets. I have the exact
same problem at startup (except that it doesn't cause other
programs to fail) on Windows XP. POPFile is very slow to
start up and makes the entire machine slow to finish loading
enough to be usable.
I have 30,000 messages in history and used to have even
more. Getting rid of some helped a lot, but I would
recommend keeping far fewer than 30,000 messages. That is
365 days' worth for me. You can change how many days POPFile
keeps messages on the Configuration page under History View.
I also think the Clean Corpus utility may help. It should
do exactly what you want and remove unimportant words.
Logged In: YES
user_id=999520
I rarely have more than 12 hours' worth of message history -
I scan for false-negatives regularly, and delete immediately
after scanning. I, too, have found how expensive message
history is.
Clean corpus did help a bit, but POPFile loading is still
slow and does impact other processes, though not as much. I
would still suggest that clean corpus belongs in the UI, not
on the command line.
Fwiw, even after cleaning corpus, I still use my drop
priority hack, and would recommend that POPFile incorporate
it. Dropping process priority should be a one (or two, if
you have to access a library) line patch, in Perl as in C# ....
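To show roughly how small such a change is, here is a minimal sketch of a process lowering its own priority. It is written in Python rather than Perl or C#, it is not POPFile code, and it is not the startup app described earlier in this thread; the function name and the default increment are assumptions. On Windows it calls the Win32 `SetPriorityClass` function through ctypes, and elsewhere it uses `os.nice`.

```python
import os
import sys


def lower_own_priority(nice_increment: int = 10) -> bool:
    """Ask the OS to run this process below normal priority.

    Returns True if the request was accepted, False otherwise.
    """
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        BELOW_NORMAL_PRIORITY_CLASS = 0x00004000  # documented Win32 constant
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetPriorityClass.argtypes = [wintypes.HANDLE, wintypes.DWORD]
        kernel32.SetPriorityClass.restype = wintypes.BOOL
        handle = kernel32.GetCurrentProcess()  # pseudo-handle for this process
        return bool(kernel32.SetPriorityClass(handle, BELOW_NORMAL_PRIORITY_CLASS))

    try:
        os.nice(nice_increment)  # POSIX: a positive increment lowers priority
        return True
    except OSError:
        return False


if __name__ == "__main__":
    print("priority lowered:", lower_own_priority())
```

Called once early in startup, this leaves the rest of the program unchanged; the process simply yields to other work while it loads.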