Is it possible for bogofilter to run in daemon mode
(e.g. server-client style)? That way it wouldn't have to
reload the wordlists every time a message arrives.
I'm working at an ISP with more than 1 million email
users. Email traffic is VERY high. We implemented bogofilter
on six MTA (mail transport agent) servers: two handle
outgoing mail and four handle incoming mail.
I'm very satisfied with bogofilter's results. Most of the
time it correctly classifies a message as spam or non-spam.
However, the machine load gets VERY high, especially
disk usage. I guess this is because of the repeated
reading of the wordlists (auto-update is disabled).
If bogofilter could run in daemon mode, the wordlist would
only be loaded once and stay in memory, thus improving
overall system performance (antivirus programs do that
with their virus patterns).
Logged In: YES
user_id=2788
Bogofilter does not "reload" the word lists each time a
message arrives. Bogofilter uses the BerkeleyDB, and opening
a data base for read-only access is as cheap as opening a
regular file. BerkeleyDB only loads the parts of the file
that contain the tokens to look at, and the kernel will
cache these pages, so make sure the machine is not too
short on memory.
Note also that the more recent BerkeleyDB versions use the
mmap(2) system call, which "maps" the file into memory where
it's read on-demand only, and which avoids copying data
back and forth between the kernel and the application.
mmap may not work across networked file systems, depending
on your operating system and version. BerkeleyDB will then
silently use regular read/write operations, but it will
still only read the data that it actually needs, and not the
whole data base.
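To illustrate the on-demand reading described above, here is a minimal Python sketch. It is not bogofilter or BerkeleyDB code; the function name read_slice and the dummy file are made up for the illustration. It maps a file read-only and touches only a few bytes in the middle, so the kernel only needs to bring in the pages that cover that range.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Map a file read-only and return only the requested bytes.

    Hypothetical helper for illustration; only the pages covering
    [offset, offset + length) have to be faulted in by the kernel.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]


if __name__ == "__main__":
    # Build a ~2 MB dummy file with a short marker in the middle.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * 1_000_000 + b"token" + b"\0" * 1_000_000)
    try:
        print(read_slice(tmp.name, 1_000_000, 5))  # prints b'token'
    finally:
        os.unlink(tmp.name)
```

Opening the mapping does not read the file; the cost is paid per page when the slice is accessed, and later processes can be served from the kernel's page cache.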
If we switched to use a "daemon", we might have to send
enormous amounts of data between client and server, and I
wonder if that is really faster than mapping disk blocks
into the application's data memory.
Logged In: YES
user_id=722099
It would be a great thing to make this a daemon; it would run
several times faster, especially on smaller files.
On high-load systems it is impossible to install bogofilter
system-wide because of the "low speed" startup, and that's a pity
since bogofilter is a great thing.
Logged In: YES
user_id=2788
Well, my home setup, which runs bogofilter 0.11.1.x from
maildrop, takes about 30 ms wallclock time to process a
short mail with bogofilter, out of 180 ms total for maildrop
(without registering, i.e. without the -u option to
bogofilter). With bogofilter -u, it's between 50 and 200 ms
more. (AMD Duron 700, Linux 2.4, plenty of RAM, 7200/min
U160-SCSI drive, ext3fs).
I wouldn't call that "low speed" startup. However, this
doesn't constitute a statement about high-load systems. If
anyone could come up with details where exactly bogofilter
takes so long, that would be much appreciated. One way to
obtain such logs is to run (Linux/FreeBSD):
strace -tt -o bogofilter.dump.$$ bogofilter OPTIONS
Replace OPTIONS with your options; the output will be in
files named bogofilter.dump.12345, bogofilter.dump.32463 and
so on.
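Once such dump files exist, a small script can point at the slowest steps. The following is only a sketch: it assumes the HH:MM:SS.microseconds prefix that strace -tt writes at the start of each line, and the function name slowest_gaps is made up for the illustration. It reports the largest pauses between consecutive lines, together with the call that preceded each pause.

```python
import re
import sys
from datetime import datetime

# Assumed line shape from "strace -tt": "12:34:56.789012 open(...) = 3".
# An optional leading PID column (as written with -f) is tolerated.
LINE = re.compile(r"^(?:\d+\s+)?(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def slowest_gaps(lines, top=5):
    """Return the `top` largest (seconds, preceding_call) gaps.

    Lines without a timestamp are skipped. Traces that cross midnight
    are not handled and would show up as negative gaps.
    """
    stamped = []
    for line in lines:
        m = LINE.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            stamped.append((when, m.group(2).rstrip()))
    gaps = [
        ((later[0] - earlier[0]).total_seconds(), earlier[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the name of one dump file as the only argument.
    with open(sys.argv[1]) as dump:
        for seconds, call in slowest_gaps(dump):
            print(f"{seconds:9.6f}s after {call}")
```

A large gap after a read or write on the wordlist file would point at the data base; gaps spread evenly over many small calls would point elsewhere.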
Are you using "bogofilter -u"?
Logged In: YES
user_id=715651
I'm not using bogofilter -u. When I did, the machine just
went berserk (VERY HIGH LOAD) and I had to reboot it because
it wouldn't respond anymore.
I don't know whether my system uses mmap or not.
It's a sun4u sparc SUNW,UltraAX-i2 running Solaris 8 with
Berkeley DB 4.1, local disk, bogofilter version 0.10.0.
A daemon doesn't necessarily mean lots of data transfer.
Antivirus daemons (e.g. ClamAV) only pass a filename over the
socket, so very little data is transferred.
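To make the filename-over-socket idea concrete, here is a rough Python sketch. It is purely hypothetical: bogofilter has no such interface, and the names ScanHandler, start_server and ask, the one-line protocol, and the classify callback are all invented for the illustration. The client sends one line containing a path; the server, which would keep its wordlist in memory, answers with a one-word verdict.

```python
import socket
import socketserver
import threading


class ScanHandler(socketserver.StreamRequestHandler):
    """Reads one path per connection and replies with a verdict line."""

    def handle(self):
        path = self.rfile.readline().decode().strip()
        # classify() is a placeholder supplied by the caller; a real
        # daemon would open the file and score it against the wordlist.
        verdict = self.server.classify(path)
        self.wfile.write((verdict + "\n").encode())


def start_server(classify, host="127.0.0.1", port=0):
    """Start the hypothetical daemon in a background thread."""
    server = socketserver.ThreadingTCPServer((host, port), ScanHandler)
    server.daemon_threads = True
    server.classify = classify
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(address, path):
    """Client side: send only the filename, read back the verdict."""
    with socket.create_connection(address) as s:
        s.sendall((path + "\n").encode())
        return s.makefile().readline().strip()


if __name__ == "__main__":
    # Dummy classifier standing in for the real scoring code.
    server = start_server(lambda path: "unsure")
    print(ask(server.server_address, "/var/spool/example/msg1"))
    server.shutdown()
    server.server_close()
```

Only the path crosses the socket, so the transfer is tiny, but the message must already be in a file that the daemon can read.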
Another thing: I tried replacing bogofilter with spamd
(spamassassin), but the load was much higher, so I stopped
using it. That's not a surprise, however, since spamassassin
is written in perl.
Logged In: YES
user_id=2788
bogofilter -u causes synchronous writes to the data base
(which means processes in I/O wait state, adding to the
load), and you may have to limit the number of bogofilter
processes running at the same time when you have a loaded
mail system. It would be possible to make bogofilter do
asynchronous writes, at the risk of a much higher chance of
data base corruption.
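One possible way to cap the number of simultaneous bogofilter processes is a small wrapper that takes one of N lock files before starting the filter. This is only a sketch for Unix systems, not part of bogofilter: the helper names try_acquire_slot and run_limited and the lock directory are invented, and the MTA would have to call the wrapper instead of bogofilter directly.

```python
import fcntl
import os
import subprocess
import time


def try_acquire_slot(lock_dir, slots):
    """Try to lock one of `slots` lock files; return the open file or None."""
    os.makedirs(lock_dir, exist_ok=True)
    for i in range(slots):
        f = open(os.path.join(lock_dir, f"slot{i}.lock"), "w")
        try:
            # Non-blocking exclusive lock; fails if another holder exists.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return f
        except BlockingIOError:
            f.close()
    return None


def run_limited(argv, lock_dir, slots, poll=0.05):
    """Run argv once a slot is free, so at most `slots` copies run at once."""
    slot = try_acquire_slot(lock_dir, slots)
    while slot is None:
        time.sleep(poll)
        slot = try_acquire_slot(lock_dir, slots)
    try:
        # stdin is inherited, so the message can still be piped in.
        return subprocess.run(argv).returncode
    finally:
        slot.close()  # closing the file releases the lock
```

The wrapper would be invoked as, for example, run_limited(["bogofilter", "-u"], "/var/lock/bogofilter-slots", 4), where the directory and the limit of 4 are arbitrary choices. Waiting mail deliveries then queue up in the wrapper instead of all hitting the disk at once.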
Solaris + BDB 4.1 will do mmap(). This means all data will
stay on disk until accessed, and the kernel will take care
of caching the data. There will be virtually no copying of
data around (even if Sparcs are quite good at that).
As to the daemon mode, are you familiar with profiling
software? Getting gprof output might be useful to identify
the places that limit performance. I should like to look at
the figures to find out if we need to tune the lexer and
parser or if it's really the data base. If it's the lexer,
we can get along without adding a daemon mode (which adds a
lot of complexity). If it's indeed the data base access that
limits bogofilter performance, we'll have to do some
research to figure out how to do this well.
Logged In: YES
user_id=715651
Unfortunately I have never used profiling software before.
Could you tell me how to do it?
Logged In: YES
user_id=715651
Sorry, I just read the gprof manual. Here's the result. I
have NO IDEA how to read it, though. Hope it's useful for you.
Gprof output for bogofilter
Logged In: YES
user_id=2788
Well, I can read it, but it does not contain useful
information -- the reason is not that you did something
wrong, but that the program has exited within 30 ms, and the
profile information does not contain useful time values
(only three single code samples have been made); and the
function calls don't look bad or suspicious.
Logged In: YES
user_id=715651
I've been doing some more experiments with exim and
bogofilter. It seems that no matter how efficient the filter
is, exim will still use twice as many resources, because
bogofilter must run as a transport filter and the message is
rejected with a system filter. Thus, for every message
received, exim must send the mail to itself (to use the
transport filter).
I've been using exiscan + clamav as a mail virus scanner,
which works great. It has built-in support for spamd, which
is not so great (perl, slower, higher resource demand). It
would be great if there were a spamd-like interface to
bogofilter, so I could use it with exiscan. There would then
be no need for exim to deliver mail to itself.
It would be easier if exiscan supported bogofilter.