| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2002 | | | | | | | | | | (21) | (9) | (13) |
| 2003 | (9) | | | (6) | (13) | | (13) | | | | | |
| 2004 | | | | | (2) | | | | | | | |
| 2007 | | | | | | | | | | (4) | | |

From: James S. <jl...@jl...> - 2007-10-25 16:38:02

Maildb is dead, long live maildb... er, gmail... whatever...

JLS

From: Darrell K. <ap...@kr...> - 2007-10-25 16:23:24

Damn it. Should have sold to G. Fare well, mail-db. You will be missed.

On Oct 25, 2007, at 9:08 AM, Liza Weissler wrote:
> Funny, I was just thinking about this project the other day and how
> for me it was superseded by using gmail... so I think my response is
> "works for me". :-)
>
> - Liza
>
> [snipped]

From: Liza W. <lyw...@gm...> - 2007-10-25 16:08:41

Funny, I was just thinking about this project the other day and how for
me it was superseded by using gmail... so I think my response is "works
for me". :-)

- Liza

On 10/25/07, Jeff Squyres <jsq...@os...> wrote:
> [snipped]

From: Jeff S. <jsq...@os...> - 2007-10-25 15:12:12
Given that there has been zero progress on this SF project for years,
and given that Gmail now supports IMAP, I think all the ideas of
maildb have "been done." Gmail isn't an open source implementation,
but that doesn't matter to me anymore (meaning: I certainly don't have
the cycles to do this stuff myself). I'm very glad that others have
implemented these ideas; I think that e-mail clients will benefit
greatly (gmail is great; others are copying the ideas to other
clients).
In particular, look at Gmail's mapping of IMAP actions:
http://mail.google.com/support/bin/answer.py?answer=77657
So, unless someone else wants to take over this project, I think it's
time to officially declare this SF project "dead."
--
{+} Jeff Squyres
From: Jeff S. <jsq...@os...> - 2004-05-30 12:43:05
I finally had a free window this past week and spent some time
working on maildb (!). Woo hoo!
Specifically, I wrote a perl script to import mbox archives into the
maildb database (well, actually, I used the CPAN module Mail::Box
which natively handles lots of kinds of mail archives -- not just
mbox). I wrote this script to understand the database schema as we
had in CVS and to take it into the proof-of-concept realm.
It seems to work. I imported over 53K messages from my current mbox
archives (damn, I get a lot of mail) into a maildb MySQL database.
Woot!
I then wrote another script to look up categories and messages in
those categories -- i.e., display the messages that had been imported.
After finding some bugs in MySQL (!) and Mail::Box (more on these
later), that script now also seems to work. Double woot!
There is still much work to be done. I'm not much of a database guy,
and it's a little hard for me to think along those lines -- I'm sure
that my queries and indices can be optimized (doing the import of 53K
messages takes many hours on a reasonably fast Linux x86
box)... which I'll let you other DBA-types argue about. :-) Also,
nothing has been done on the IMAP-server side -- this past week was
just spent understanding the DB schema and trying to do some
practical stuff with it.
-----
During this process, I either exposed weaknesses in the design that
are inevitable during a first implementation of a design, or I didn't
fully understand Liza's original intent (which is quite probable).
I've committed changes to the MySQL schema that Liza proposed --
mysql/libmaildb/db/mysql/doc/cr_maildb.sql. If I'm totally off-base
and simply misunderstood the original intent, we can always roll back
CVS to the original stuff.
Rather than try to explain all the changes, let me explain the schema
philosophy that I've committed:
- During my work, an epiphany came to me: we really don't need to
*interpret* much of the data that we're storing. We really only
need to *store* and *retrieve* it. For example, we had a scheme to
normalize MIME types. This is good for space saving, but I ditched
it in favor of just storing message header data that is transparent
to maildb. Specifically, all we have to do is store a set of
message headers and then be able to output them upon request -- we
don't have to know what any of them *mean*.
That being said, there are good reasons for normalization (e.g.,
space savings). And we might still want to do that -- but let's get
it working first, and then go back to that (e.g., selectively
interpret some of the header lines, such as the MIME type, and
therefore be able to normalize them).
- I think we had also thought of saving the entirety of the original
message in a separate table (headers and all). Does anyone remember
why we were thinking of doing this? Was it just for debugging? I
ditched this table as well; it seemed to simply double the storage
space required.
- Terminology: Ignoring a lot of details -- a RFC 2822 message is
comprised of a header and a body. The body may be plain text or one
or more "parts." Each part may or may not have its own sub-header,
and may actually be another RFC 822 / 2822 message itself. So it's
really a recursive thing -- a message will have one or more body
parts, each of which may be another message in itself.
- Keep in mind that some of the stuff described below is because it
was the way we originally designed it (2+ years ago!). I don't
remember all the reasons for what we did -- and I actually question
at least some of it -- but I stuck with most of the original
decisions.
- Here's a breakdown of the major tables:
- users: a simple maildb-UID to username mapping. The maildb-UID is
a maildb-specific UID used for establishing the ownership of
messages in the database. It's referenced in most of the other
tables. Remember -- we don't want to implement an authentication
scheme (that's a job for other tools); we only need a simple
username-to-UID mapping.
- cats: mapping of category names to category IDs, including the
concepts of user ownership and hierarchical organization of
categories (i.e., nested categories, like filesystem directories).
- messages: every message (including embedded RFC822 messages) has
exactly one entry in the message table, giving it a unique ID.
This message ID value is extensively cross-referenced in other
tables to bind header and body data to a single message. Messages
are [currently] owned by a single UID, and have flags that, among
other things, indicate whether the record is a valid message or
not (e.g., partially inserted messages will have their "valid"
flag set to 0).
- msg_cats: A message will have a msg_cats record for every category
that it is in. Hence, it's mainly a cross reference between
message ID's and category ID's.
- msg_hdrs: A series of key=value records of header lines from any
part in a single message (remember that body parts can have header
lines). Header lines are attached to a specific part in a
specific message (e.g., part=0 means the main header). The
ordering of the header lines is, of course, maintained.
- msg_parts: Each message has at least one body part. Each record
in this table is tied to a specific message, and has an ordered
part ID (i.e., all parts, in order, are the "body" of the
message). Each part will either be stored in the record itself
(as a mediumtext BLOB) if it's under a specific size, or will be
stored in the filesystem if it's over that size.
- msg_quick_search: this is the one table where we actually
interpret several of the "common" fields in the RFC 2822 header
(to, cc, bcc, from, subject, date, etc.). We store them all in
text blobs for quick searching. The entire point of this table is
for quick searching that resolves down to a message ID where we
can actually get to the real message.
- config: a simple key=value table where maildb configuration can be
stored. For example, the max length (in bytes) of messages that
will be stored in the DB is in this table. I anticipate that
we'll eventually have lots of tunable maildb parameters in here.
Users can put their own overrides in here (where it makes sense),
so there's a UID field as well (UID=0 are system-level config
options).
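The breakdown above maps naturally onto DDL. Here's a runnable sqlite3 sketch (Python used so it runs anywhere); the column names and types are my illustrative assumptions -- the committed schema lives in libmaildb/db/mysql/doc/cr_maildb.sql and differs in detail:

```python
import sqlite3

# Illustrative stand-in for the tables described above. Column
# names/types are assumptions, not the committed cr_maildb.sql schema.
SCHEMA = """
CREATE TABLE users     (uid INTEGER PRIMARY KEY, username TEXT UNIQUE);
CREATE TABLE cats      (cat_id INTEGER PRIMARY KEY, uid INTEGER,
                        parent_id INTEGER, name TEXT);
CREATE TABLE messages  (msg_id INTEGER PRIMARY KEY, uid INTEGER,
                        valid INTEGER DEFAULT 0);
CREATE TABLE msg_cats  (msg_id INTEGER, cat_id INTEGER);
CREATE TABLE msg_hdrs  (msg_id INTEGER, part_id INTEGER, seq INTEGER,
                        key TEXT, value TEXT);   -- seq preserves order
CREATE TABLE msg_parts (msg_id INTEGER, part_id INTEGER, body TEXT,
                        spool_path TEXT);        -- body OR filesystem path
CREATE TABLE msg_quick_search (msg_id INTEGER, sender TEXT, rcpt TEXT,
                        subject TEXT, sent_date TEXT);
CREATE TABLE config    (uid INTEGER, key TEXT, value TEXT);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
tables = [r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Note the `seq` column on msg_hdrs: since we only store and retrieve headers without interpreting them, preserving their original order is the one thing the schema must guarantee.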
So here's how a message is inserted:
------------------------------------
1. A record is created in messages so that the message ID is
created. The "valid" flag is set to 0 upon its creation, so any
other threads/agents looking at the database won't think that this
is a message that can be read.
2. For each part (to include the main header):
2a. The "quick search" record is inserted, cross referenced to the
message ID.
2b. If headers exist for this part, they are inserted in the
msg_hdrs table, cross referenced to the message ID and part ID.
2c. The body part is either stored in the msg_parts table or in the
filesystem; either way, a new entry is inserted in the
msg_parts table and is cross referenced to the message ID and
part ID.
2d. A record is created in msg_cats tying the new message ID to a
category ID.
2e. The "valid" flag on the messages record is changed to "1",
indicating that this is now a valid message that can be read.
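The sequence above can be sketched in a few lines (Python/sqlite3 stand-in with a trimmed two-table schema; column names are assumptions, not the committed schema):

```python
import sqlite3

# Sketch of insertion steps 1-2e above, against a trimmed-down schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages  (msg_id INTEGER PRIMARY KEY, valid INTEGER);
CREATE TABLE msg_parts (msg_id INTEGER, part_id INTEGER, body TEXT);
CREATE TABLE msg_cats  (msg_id INTEGER, cat_id INTEGER);
""")

def insert_message(db, parts, cat_id):
    # 1. Create the record invalid, so concurrent readers skip it.
    cur = db.execute("INSERT INTO messages (valid) VALUES (0)")
    msg_id = cur.lastrowid
    # 2b/2c. Insert each part, cross-referenced to the message ID.
    for part_id, body in enumerate(parts):
        db.execute("INSERT INTO msg_parts VALUES (?, ?, ?)",
                   (msg_id, part_id, body))
    # 2d. Tie the message to its category.
    db.execute("INSERT INTO msg_cats VALUES (?, ?)", (msg_id, cat_id))
    # 2e. Only now flip the flag: the message becomes readable.
    db.execute("UPDATE messages SET valid = 1 WHERE msg_id = ?", (msg_id,))
    db.commit()
    return msg_id

mid = insert_message(db, ["hello body"], cat_id=1)
valid, = db.execute("SELECT valid FROM messages WHERE msg_id = ?",
                    (mid,)).fetchone()
print(mid, valid)
```

The key property is that the valid flag flips only after all parts and cross-references are in place.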
Scripts that I wrote:
---------------------
Both scripts are located in libmaildb/db/mysql/doc (we can change the
directory structure later).
Both of these scripts assume that you've followed the instructions in
libmaildb/db/mysql/doc/README to create the maildb MySQL database and
all of its tables. You'll also need to create /var/spool/maildb and
give it the same permissions as /tmp (777 and chmod +t; I forget what
t is offhand :-). This directory is where long messages are stored.
- import_mbox.pl: Takes argv listing mbox files to import. Ensures
that your unix username is in the users table. Ensures that you
have a special INBOX category. Sets some default config values in
the config table if they aren't already set. Each mbox file is then
read and parsed; messages are inserted in a category name matching
the filename of the mbox file being imported (please only use
forward relative filenames -- I didn't put any logic in for absolute
directories or "." or ".."). For example:
./import_mbox.pl foo bar/baz bar/moog/cow
will import the messages in 3 mbox files, and make the following
categories along the way:
foo
bar
baz, child of bar (i.e., "bar/baz")
moog, child of bar (i.e., "bar/moog")
cow, child of moog (i.e., "bar/moog/cow")
- index_cat.pl: for a given category, show all of its sub-categories,
list the number of messages in that category, and display the
headers of all the messages.
These are both works-in-progress; they'll probably change a bit more
over this weekend (e.g., showing the bodies of the messages in
index_cat.pl is a trivial addition).
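The category-creation behavior of import_mbox.pl described above can be sketched like this (Python/sqlite3 stand-in; `ensure_category` and the cats columns are illustrative names, not the script's actual internals):

```python
import sqlite3

# A relative path like "bar/moog/cow" becomes nested categories,
# creating each missing level along the way.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cats (cat_id INTEGER PRIMARY KEY,"
           " parent_id INTEGER, name TEXT)")

def ensure_category(db, path):
    parent = 0                      # 0 = top level (an assumption)
    for name in path.split("/"):
        row = db.execute(
            "SELECT cat_id FROM cats WHERE parent_id=? AND name=?",
            (parent, name)).fetchone()
        if row:
            parent = row[0]         # level already exists; descend
        else:                       # create the missing level
            parent = db.execute(
                "INSERT INTO cats (parent_id, name) VALUES (?, ?)",
                (parent, name)).lastrowid
    return parent

for mbox in ["foo", "bar/baz", "bar/moog/cow"]:
    ensure_category(db, mbox)

cats = [tuple(r) for r in db.execute(
    "SELECT cat_id, parent_id, name FROM cats ORDER BY cat_id")]
print(cats)
```

Running it reproduces the example above: foo and bar at top level, baz and moog as children of bar, cow as a child of moog.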
Some DB issues:
---------------
- I'm not sure we understand what we need for indexes. Indexes are
easy to add/modify, so I'm not worried about it now. But when we're
done with most of the design/coding, we should probably look up the
searches that we're always doing and make indices to support those.
- Another minor point -- mbox archives of my 53K messages occupy
approximately 530MB of disk space. After the import of these 53K
messages, the resulting MySQL DB used 881MB of disk space. I
suspect that at least some of this is because of the oodles of
indexes that we're creating now. This is not a huge deal, but we
should try to not take up *too* much more space than is really
necessary...
- I don't have an exact number, but importing the 53K messages took
something like 10+ hours on a reasonably fast Linux box. More
specifically, the further into the import it got, the slower it
became. I suspect that this has something to do with the indexes we
currently have, and it may simply be the nature of the beast. But
it's something we should look at optimizing (average time to insert
a single message is going to be a critical performance factor in the
long run).
- Because we're storing long messages in the filesystem (and not in
the database), we can't do full-text searches on messages. I know
we talked about this at least a little bit, but I don't remember why
we decided to do this instead of either allowing a bigger BLOB
and/or splitting the message part across multiple records (which
would seem necessary, regardless of the max part size that we have,
unless we outright reject messages that are too large). Does anyone
remember?
Bugs in other software:
-----------------------
- I was working with the perl CPAN module Mail::Box v2.055. There are
two bugs in Mail/Message/Head/Complete.pm where you can get warnings
at run-time in perl about uninitialized variables used with the ">"
operator. These are fairly harmless for our purposes; I've mailed
the Mail::Box author about them. You can ignore them.
- Mail::Box was relatively sensitive to improperly-formatted messages.
It rejected a few messages that had malformed addresses in the CC
line, had incorrect MIME separator lines, etc. This is not a
problem for maildb itself (i.e., this has no bearing on our actual
run-time -- remember, we don't need to *interpret* the data that we
store) -- it's just an issue for the importer script that I wrote.
I think it rejected something like 10 messages out of the 53K that I
imported. This definitely falls within the bounds of "good enough
for prototyping." :-)
- MySQL v4.0.17 (the default for fink on OSX) seems to have a bug with
inserting indexed text field values that have spaces at the end of
them (or spaces at the end of the indexed portion). This does *not*
happen in v4.0.15 or 4.0.20. Specifically, here's a case that
will trip the bug:
-----
create table bogus ( subject text, index subject_index(subject(16)) );
insert into bogus values ("hello");
insert into bogus values ("hello"); # this works fine
insert into bogus values ("hello ");
insert into bogus values ("hello "); # this will barf, complaining of
# a duplicate key
-----
Clearly, this should not happen. I put a workaround in the perl
scripts that I wrote to ensure that there is never any whitespace at
the end of an imported field. But we shouldn't need to do that.
- Although the MySQL mediumtext BLOB allows values up to 16M in
length, the client and/or server is only configured to allow
max_allowed_packet (a MySQL parameter) bytes to be sent between the
client and server in a single query. This value defaults to 1M for
the server on my OSX laptop (and I think on all systems...?).
Hence, the upper bound for mediumtext is effectively 1M unless you
increase the max_allowed_packet value. But 1M is probably ok -- in my
~53K imported messages, I had only 65 parts that were >1MB.
(you can easily change the value on the MySQL server -- supply a
parameter to mysqld_safe when you start it).
Note that the 1M rule applies to the entire insertion SQL string
sent to import a *part* into the database -- not to the entire
*message*. Hence, there's roughly a 1MB limit on each *part* of a
message.
Open questions:
---------------
- How to do deletions? I *think* I know the answer to this one, but it
still requires a little more thought (haven't done any prototyping
code yet). MySQL doesn't have trigger procedures, so the
possibility of a race condition in a multi-threaded server, or a
server allowing multiple simultaneous user connections (like UW
IMAP) is real -- need to think about this a little more. Current
thought is that when a message is removed from a category, do
another search to see if it's referenced in *any* category. If it's
not, then delete it (this is effectively reference counting). Any
other opinions here?
- Should multiple users be able to own the same message? This implies
-- at the very least -- separating the UID out of the messages
table (and probably some other minor re-organization). This would
seem nice for when a 20MB e-mail is sent to 500 users on the same
server -- only one copy of the message needs to exist, and it's just
"owned" by multiple users. When all users delete it, it actually
gets deleted (i.e., reference counting, in some form).
- Is the msg_quick_search table worth it? It duplicates much of the
data in the msg_headers table, and probably causes a lot of space
to be used in indexes. Can we effect the same searches in msg_hdrs
without this table?
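For the deletion question above, the reference-counting idea could look like the following sketch (Python/sqlite3 stand-in; it deliberately ignores the race-condition caveat, and the table layout is illustrative):

```python
import sqlite3

# One message, referenced by two categories.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (msg_id INTEGER PRIMARY KEY);
CREATE TABLE msg_cats (msg_id INTEGER, cat_id INTEGER);
INSERT INTO messages VALUES (1);
INSERT INTO msg_cats VALUES (1, 10), (1, 20);
""")

def remove_from_category(db, msg_id, cat_id):
    db.execute("DELETE FROM msg_cats WHERE msg_id=? AND cat_id=?",
               (msg_id, cat_id))
    # If *no* category still references the message, delete it outright
    # (effectively reference counting).
    refs, = db.execute("SELECT COUNT(*) FROM msg_cats WHERE msg_id=?",
                       (msg_id,)).fetchone()
    if refs == 0:
        db.execute("DELETE FROM messages WHERE msg_id=?", (msg_id,))
    db.commit()

remove_from_category(db, 1, 10)
left_after_first, = db.execute("SELECT COUNT(*) FROM messages").fetchone()
remove_from_category(db, 1, 20)
left_after_second, = db.execute("SELECT COUNT(*) FROM messages").fetchone()
print(left_after_first, left_after_second)
```

The message survives the first removal and disappears on the second, once no category references it.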
Work still to be done:
----------------------
- Look at the UW IMAP docs and see what actions it requires, how a
maildb device should be designed, etc.
- Postgres version of the same stuff that I've extended from Liza's
work in MySQL.
- "Views" (stored searches).
- Create and maintain logs.
- Had an interesting idea about pre-defined views -- we should
probably offer a set of time-based pre-defined views (e.g.,
"yesterday", "within last week", "within last month", etc.). And we
should offer these views as a sub-view of any view and category. So
you can see "yesterday" mails in the "foo" category, for example.
Could be handy.
- Expand the set of configuration options, and allow users to have
their own overrides (where it makes sense). Hence, I added a UID
field to the config table (UID=0 means system values).
...and probably a lot more that I'm not thinking of right now. :-)
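The time-based views idea can be prototyped as date-windowed queries against the quick-search table (sketch only; the sent_date column and ISO-string comparison are assumptions):

```python
import sqlite3
from datetime import datetime, timedelta

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE msg_quick_search (msg_id INTEGER, sent_date TEXT)")

# Three messages: 3 hours ago (today), 20 hours ago (yesterday),
# and 10 days ago.
now = datetime(2004, 5, 30, 12, 0, 0)
for msg_id, hours_ago in [(1, 3), (2, 20), (3, 240)]:
    db.execute("INSERT INTO msg_quick_search VALUES (?, ?)",
               (msg_id, (now - timedelta(hours=hours_ago)).isoformat(" ")))

def view(db, since, until):
    # ISO-formatted date strings compare in chronological order, so
    # plain string comparison yields the half-open window [since, until).
    return [r[0] for r in db.execute(
        "SELECT msg_id FROM msg_quick_search "
        "WHERE sent_date >= ? AND sent_date < ? ORDER BY msg_id",
        (since.isoformat(" "), until.isoformat(" ")))]

start_today = now.replace(hour=0, minute=0, second=0)
yesterday = view(db, start_today - timedelta(days=1), start_today)
last_week = view(db, now - timedelta(days=7), now)
print(yesterday, last_week)
```

Scoping such a view to a category is then just an extra join against msg_cats.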
That's it!
----------
Comments appreciated on any of the above!
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Post Doctoral Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
From: Jeff S. <jsq...@os...> - 2004-05-12 12:37:21
Looks like someone else has finally taken up the idea of using what we
called categories -- Google's Gmail uses something called "labels" that
looks almost exactly like what we were thinking of. Check out this review (who
cares about the review -- you can see the features that Gmail is going to
have):
http://www.extremetech.com/article2/0,1558,1586090,00.asp
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
From: Jeff S. <jsq...@os...> - 2003-07-09 20:58:50
On Wed, 9 Jul 2003, Jeff Squyres wrote:
> -----
> :0 fc
> * ^TO_...@li...
> | /usr/local/bin/maildb.insert --category maildb/devel
>
> :0 fc
> * ^FROM_.*@squyres.com
> | /usr/local/bin/maildb.insert --category received/squyres/family
> -----
I forgot to mention a critical point here -- the user has no concept of
what "maildb.insert --category abc" actually *does*. We can implement it
however we want. So if that means add an X-Maildb-Category header line,
or whether that means frobbing the DB -- we can do whatever we want
(including totally changing how it works, as long as the end result is the
same) and the user interface stays the same (i.e., we don't break any
procmail rules).
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
From: Jeff S. <jsq...@os...> - 2003-07-09 20:55:43
On Wed, 9 Jul 2003, Darrell Kresge wrote:
> Even if only temporary, I see using procmail as choosing a sledgehammer
> to hang a picture (not that I'm necessarily opposed to such things ;-) )
>
> Since you've already a requirement to parse/extract header information,
> why not just implement regex -> folder filters directly?
Not sure what you mean here...?
> 1) I realize that configuration of the rule file will be an issue, but
> no worse so than dealing w/ .procmailrc.
Yes and no. I mentioned procmail because it's well known/loved/trusted,
and it would be good to be able to support it (in some way). This would
give us the leverage to have flexible filtering even in 1.0 (when we don't
have native/internal filtering).
That being said, I just thought of a problem with my proposed approach --
see below.
> 2) Provided the interface to the "pattern select/route" mechanism is
> well defined, the regex stuff could easily be replaced downstream with
> something more powerful.
Agreed -- some kind of generalized mechanism would be good.
> 3) Using an X-Header to determine routing inside the maildb proper is
> going to end up being a hack on top of a hack -- you'll end up needing
> to eliminate it later when you decide to do the Right Thing
Possibly. But it could be good to be able to support *both* procmail
*and* native filtering (if, perhaps, on the back end, they actually end up
doing the same thing -- then it wouldn't be nasty. i.e., separate the
decision-making process from the acting-on-the-decision process).
But I did just think of a problem with the X-Maildb-Category approach:
what if someone sends you a message with:
X-Maildb-Category: inbox
That is -- anyone can force a message to go into any of your categories
simply by adding header lines to messages that they send to you. And
that's clearly not a Good Thing. :-)
So back to what I said above -- perhaps we could do a "do no harm"
approach in a .procmailrc, where instead of adding a header line, you
actually run some maildb executable that adds the message to that
category (this may get a little complicated, but bear with me for this
thought experiment...). So instead of:
-----
:0 fc
* ^TO_...@li...
| formail -A "X-Maildb-Category: maildb/devel"
:0 fc
* ^FROM_.*@squyres.com
| formail -A "X-Maildb-Category: received/squyres/family"
-----
Instead, you'd have:
-----
:0 fc
* ^TO_...@li...
| /usr/local/bin/maildb.insert --category maildb/devel
:0 fc
* ^FROM_.*@squyres.com
| /usr/local/bin/maildb.insert --category received/squyres/family
-----
...and so on.
The real trick/complication would be for a message that matches multiple
rules: maildb.insert (or whatever) will have to recognize that it's
the same message and simply add another category to the message that's
already in the db. Since we can't rely on the Message-Id, this is the
part that I don't really know how to do... :-(
It seems that these procmail rules would need to put in some kind of
forward reference saying "there's an incoming message coming, make sure
that it gets added to category ABC" (but don't forget that
procmail/mail.local/etc. can be run asynchronously, so 2 different
messages with the same Message-ID can come in and be processed
"simultaneously. So it still reduces to the same problem as above).
And perhaps procmail isn't the thing we want to support. But any
rules-based agent will follow the same general principles. So this is
probably still worth discussing...
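One plausible mitigation for the header-spoofing problem raised above: have the trusted insertion path strip any X-Maildb-Category lines that arrived with the message, so only lines added locally (by procmail / maildb.insert) survive. A sketch of the idea (Python; `strip_untrusted_categories` and the example category are hypothetical, not part of maildb):

```python
from email import message_from_string

def strip_untrusted_categories(raw):
    # Drop every X-Maildb-Category line the sender supplied, so a
    # remote sender cannot force a message into one of your categories.
    msg = message_from_string(raw)
    del msg["X-Maildb-Category"]        # removes *all* such lines
    return msg

raw = (
    "From: ev...@example.com\n"
    "X-Maildb-Category: inbox\n"        # spoofed by the sender
    "Subject: gotcha\n"
    "\n"
    "body\n"
)
msg = strip_untrusted_categories(raw)
msg["X-Maildb-Category"] = "received/unfiltered"  # added by the trusted side
print(msg.get_all("X-Maildb-Category"))
```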
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
From: Jeff S. <jsq...@os...> - 2003-07-09 20:24:57
On Wed, 9 Jul 2003, Darrell Kresge wrote:
> > That's one way to do it. Another way that we used at A Former Company
> > of Mine (Collective Technologies, Austin TX) was to do a simple
> > encryption of the database username/password into a "keyring", and our
> > perl subroutine that handled the database connection extracted/used
> > that information. It was more obfuscated than it was secure ... but
> > we figured every little bit helped. :-)
>
> [snipped]
> Presumably, you're not going to be writing implementations for each and
> every potential database that someone might use. Additionally,
> different DB vendors will use different authentication strategies.
> Assuming that the DB shim is developed externally (using Yet Another
> Well Defined Interface (soap, odbc, sql), it seems that for the purposes
> of release you'd want to keep the underlying mechanism as simple as
> possible; both functional and tutorial. To that end, I would think that
> even an environment variable in a root owned start script would be
> sufficient. Sure it's ugly, but it's easy to understand.
I think the central issue is that the mail server process has to be able
to access a secret somehow. If you need one secret to get to another,
then that really doesn't solve the problem -- the maildb server
process (or proxy that continually gets launched via mail.local or
whatever) needs to be able to connect in an automated fashion.
And since we're not trying to protect from root -- we're only trying to
protect from other users -- a 0400 file seems like a nice, simple solution
(and easy to debug/maintain).
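The 0400-file approach could look like this sketch (Python; the file format, colon separator, and permission check are assumptions, not an agreed design):

```python
import os
import stat
import tempfile

def read_db_credentials(path):
    # Refuse to run if the file is readable by group/other; only the
    # server's own user (and root) should be able to read the secret.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        raise PermissionError(
            "%s is readable by group/other (%o)" % (path, mode))
    with open(path) as f:
        user, password = f.read().split(":", 1)
    return user, password.rstrip("\n")

# Demo: create a 0400 credentials file and read it back.
fd, path = tempfile.mkstemp()
os.write(fd, b"maildb:s3cret\n")
os.close(fd)
os.chmod(path, 0o400)
print(read_db_credentials(path))
```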
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
From: Darrell K. <dk...@ya...> - 2003-07-09 17:44:20

li...@av... wrote:
> [snipped]

I really like that idea -- and there's no reason that the encryption
would need to be simple -- it could be PKI: when you start the daemon,
you specify a passphrase to get your private key, which can decrypt the
passwords on the publicly encrypted ring.

But... unlike filtering, is this really a maildb issue? Presumably,
you're not going to be writing implementations for each and every
potential database that someone might use. Additionally, different DB
vendors will use different authentication strategies.

Assuming that the DB shim is developed externally (using Yet Another
Well Defined Interface (soap, odbc, sql)), it seems that for the
purposes of release you'd want to keep the underlying mechanism as
simple as possible; both functional and tutorial. To that end, I would
think that even an environment variable in a root-owned start script
would be sufficient. Sure it's ugly, but it's easy to understand.

-D

From: Darrell K. <rep...@kr...> - 2003-07-09 17:22:29

Even if only temporary, I see using procmail as choosing a sledgehammer
to hang a picture (not that I'm necessarily opposed to such things ;-) ).

Since you've already a requirement to parse/extract header information,
why not just implement regex -> folder filters directly?

1) I realize that configuration of the rule file will be an issue, but
no worse so than dealing w/ .procmailrc.

2) Provided the interface to the "pattern select/route" mechanism is
well defined, the regex stuff could easily be replaced downstream with
something more powerful.

3) Using an X-Header to determine routing inside the maildb proper is
going to end up being a hack on top of a hack -- you'll end up needing
to eliminate it later when you decide to do the Right Thing.

Just my $0.02 USD

-D

Jeff Squyres wrote:
> We've talked about maildb built-in filtering before. Indeed, that's
> one of the main strengths of maildb -- that you can/should have
> millions of rules that will attach all kinds of categories to messages
> (e.g., as opposed to common thinking/usage today where most users file
> a message away in a *single* target folder; with maildb you should add
> *lots* of categories to each message -- this actually *increases* the
> possibility of you seeing important mails, as opposed to only filing a
> message away in a single [potentially obscure] folder).
>
> So we [eventually] need to support server-side filtering somehow.
>
> Up until now, we've only concentrated on the storage of messages -- we
> need to get this thing working before we tackle the complex issues of
> built-in server-side filtering. I think that's been a good decision.
>
> But since this is a major feature/capability of maildb, it would be
> good to support it *somehow* -- even in our initial versions.
>
> The thought occurred to me today: what about procmail?
>
> Procmail is a slick server-side user filtering agent that is typically
> invoked directly by the MTA (e.g., via .forward).
>
> Obviously, procmail can write to the conventional mbox and mh formats,
> but it won't know how to write to the maildb data store. But perhaps
> there's a quick-n-dirty way to make procmail work with maildb: instead
> of having procmail write the actual output message to a mailbox file,
> have it simply add a header line telling maildb what to do when the
> message eventually gets written to the database. Perhaps something
> like:
>
> -----
> :0 fc
> * ^TO_...@li...
> | formail -A "X-Maildb-Category: maildb/devel"
>
> :0 fc
> * ^FROM_.*@squyres.com
> | formail -A "X-Maildb-Category: received/squyres/family"
> -----
>
> Then when the message finally gets written to the db, maildb will see
> any X-Maildb-Category line(s) and attach the appropriate category
> name(s) to the message in the database. This is actually more
> efficient, because procmail won't write out the message N times --
> it'll only add N header lines and then write out the message *once* to
> the backing store.
>
> To make it work, there will need to be a final, all-encompassing
> procmail rule that actually writes the resulting message (including
> any added X-Maildb-Category header lines) into maildb by invoking some
> custom executable:
>
> -----
> :0 f
> | /usr/local/bin/maildb.insert
> -----
>
> ...or something along those lines.
>
> Does this sound too hack-ish? Any other thoughts/ideas?

From: <li...@av...> - 2003-07-09 16:54:09
|
> At least for 1.0...?

yes indeed. :-) |
|
From: Jeff S. <jsq...@os...> - 2003-07-09 16:47:32
|
On Wed, 9 Jul 2003 li...@av... wrote:
> > Does this sound too hack-ish? Any other thoughts/ideas?
>
> Hmmm...yes, but I can't say that I have any other ideas, and this sounds
> pretty workable. :-)
At least for 1.0...?
;-)
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
|
|
From: <li...@av...> - 2003-07-09 16:23:26
|
> Does this sound too hack-ish? Any other thoughts/ideas?

Hmmm...yes, but I can't say that I have any other ideas, and this sounds pretty workable. :-)

- Liza |
|
From: <li...@av...> - 2003-07-09 16:08:14
|
> My question is: how do we authenticate to the database?
> ...
> Do we just put a 0400 file somewhere on the local filesystem that only
> root and the mail.local user (probably "mail" or "daemon" or ...?) can
> read that contains the DB username and password? The only other way that I
> can think of would be to compile the DB username/pw in the mail.local
> executable, but that might make it vulnerable to "strings mail.local", or
> something along those lines. Is there a standard way to do this kind of
> thing? We're not trying to protect from root in this case -- we're only
> trying to protect from other users (right?) -- so I'm thinking that a 0400
> file might not be totally evil (one way to think of it: it's no less
> secure than 0600 /var/spool/mail/* mbox files).

That's one way to do it. Another way that we used at A Former Company of Mine (Collective Technologies, Austin TX) was to do a simple encryption of the database username/password into a "keyring", and our perl subroutine that handled the database connection extracted/used that information. It was more obfuscated than it was secure ... but we figured every little bit helped. :-)

- Liza |
|
From: <li...@av...> - 2003-07-09 16:03:14
|
> Can our current DB schema handle this?
>
> I *think* it can -- it seems like we're using message ID + the unique
> integer (msg_ids.m_id). Does that make sense?

We're keying everything by our own internal identifier, msg_ids.m_id, which then gets referenced in the other tables (msg_attach, etc.). The message ID itself is in msg_ids.m_msg_id, so presumably if we get a second copy of the same message (forwarded, whatever) we could drop it or keep it, depending on what you want to do.

- Liza |
|
From: Jeff S. <jsq...@os...> - 2003-07-09 03:18:40
|
We've talked about maildb built-in filtering before. Indeed, that's
one of the main strengths of maildb -- that you can/should have
millions of rules that will attach all kinds of categories to messages
(e.g., as opposed to common thinking/usage today where most users file
a message away in a *single* target folder; with maildb you should add
*lots* of categories to each message -- this actually *increases* the
possibility of you seeing important mails, as opposed to only filing a
message away in a single [potentially obscure] folder).
So we [eventually] need to support server-side filtering somehow.
Up until now, we've only concentrated on the storage of messages -- we
need to get this thing working before we tackle the complex issues of
built-in server-side filtering. I think that's been a good decision.
But since this is a major feature/capability of maildb, it would be
good to support it *somehow* -- even in our initial versions.
The thought occurred to me today: what about procmail?
Procmail is a slick server-side user filtering agent that is typically
invoked directly by the MTA (e.g., via .forward).
Obviously, procmail can write to the conventional mbox and mh formats,
but it won't know how to write to the maildb data store. But perhaps
there's a quick-n-dirty way to make procmail work with maildb: instead
of having procmail write the actual output message to a mailbox file,
have it simply add a header line telling maildb what to do when the
message eventually gets written to the database. Perhaps, something
like:
-----
:0 fc
* ^TO_...@li...
| formail -A "X-Maildb-Category: maildb/devel"
:0 fc
* ^FROM_.*@squyres.com
| formail -A "X-Maildb-Category: received/squyres/family"
-----
Then when the message finally gets written to the db, maildb will see
any X-Maildb-Category line(s) and attach the appropriate category
name(s) to the message in the database. This is actually more
efficient, because procmail won't write out the message N times --
it'll only add N header lines and then write out the message *once* to
the backing store.
To make it work, there will need to be a final, all-encompassing
procmail rule that actually writes the resulting message (including
any added X-Maildb-Category header lines) into maildb by invoking some
custom executable:
-----
:0 f
| /usr/local/bin/maildb.insert
-----
...or something along those lines.
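[Editor's sketch: a minimal illustration of the header-scraping half such a maildb.insert could perform. This is hypothetical code, not part of maildb; it assumes only the X-Maildb-Category header name proposed above, and uses Python's stdlib email parser.]

```python
# Hypothetical sketch for maildb.insert: collect any X-Maildb-Category
# lines that procmail/formail added, so the insert step can attach those
# categories to the message when it is written to the database.
from email import message_from_string

def extract_categories(raw_msg: str):
    """Return (categories, parsed_message); categories is every
    X-Maildb-Category value found in the header, in order."""
    msg = message_from_string(raw_msg)
    return msg.get_all("X-Maildb-Category") or [], msg

sample = (
    "X-Maildb-Category: maildb/devel\n"
    "X-Maildb-Category: received/squyres/family\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
categories, message = extract_categories(sample)
print(categories)  # ['maildb/devel', 'received/squyres/family']
```

Note that because formail -A appends headers without touching the body, the insert step sees one message with N category headers rather than N copies of the message, matching the efficiency argument above.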
Does this sound too hack-ish? Any other thoughts/ideas?
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
|
|
From: Jeff S. <jsq...@os...> - 2003-07-09 02:59:54
|
For you DBAs out there... is there a common way to do this? (I think this
is an easy question, but I managed to confuse myself earlier today and
want to run it by you guys to ensure that I'm not crazy)
We've talked about user authentication before, and we decided to leave it
as the responsibility of the IMAP daemon. This allows the possibility of
a bunch of different schemes, like passwd/shadow, pam, LDAP, etc. i.e.:
it's not our problem. I think this is the Right Thing. I'm talking about
different authentication -- authentication to the database.
My question is: how do we authenticate to the database?
There's [at least] two different places where a process will need to be
executed on the server to insert a message into maildb: mail.local and a
server-side user filtering agent (e.g., procmail). Let's look at
mail.local, although they both essentially come down to the same issue.
At some point, the MTA is going to invoke mail.local on the server to
actually deliver the message to the backing store (remember that UW IMAP
provides a mail.local replacement that will be able to write to the
maildb). This mail.local process has to be able to connect to the
[MySQL|Postgres|whatever] database, authenticate, and then do its thing.
How do we do that?
Do we just put a 0400 file somewhere on the local filesystem that only
root and the mail.local user (probably "mail" or "daemon" or ...?) can
read that contains the DB username and password? The only other way that I
can think of would be to compile the DB username/pw in the mail.local
executable, but that might make it vulnerable to "strings mail.local", or
something along those lines. Is there a standard way to do this kind of
thing? We're not trying to protect from root in this case -- we're only
trying to protect from other users (right?) -- so I'm thinking that a 0400
file might not be totally evil (one way to think of it: it's no less
secure than 0600 /var/spool/mail/* mbox files).
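[Editor's sketch: one way the 0400-file idea could look in practice. The file path, one-line "username:password" format, and permission check are all assumptions for illustration, not existing maildb code.]

```python
# Hypothetical reader for a root/mail-only credentials file holding a
# single "username:password" line; it refuses group/world-readable files,
# mirroring the 0600 /var/spool/mail/* comparison above.
import os
import stat
import tempfile

def read_db_credentials(path):
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        raise PermissionError(path + " must not be group/world readable")
    with open(path) as f:
        user, _, password = f.read().strip().partition(":")
    return user, password

# Demonstration with a throwaway 0400 file:
fd, path = tempfile.mkstemp()
os.write(fd, b"maildb:s3cret")
os.close(fd)
os.chmod(path, 0o400)
creds = read_db_credentials(path)
os.unlink(path)
print(creds)  # ('maildb', 's3cret')
```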
For the procmail issue, whatever process is launched (perhaps a variant of
mail.local) will likely be launched under the UID of the recipient user.
So will this executable need to be setuid to the mail user? Or is that
asking for trouble?
Thoughts?
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
|
|
From: Jeff S. <jsq...@os...> - 2003-07-09 02:34:50
|
Here's a new issue that I thought of while I was driving home from
Bloomington today...
For my personal mail use, I recently switched over from client-side
filtering to server-side filtering (procmail). In doing so, I learned
that the same message (i.e., a message with the same Message-Id) can
legitimately arrive at a single mailbox multiple times -- and possibly
even with different headers.
It's as simple as sending a message to two recipients: bo...@wo... and
bo...@ho.... Bob has his home address forwarded to work. So Bob
actually gets two copies of the same message in his work mailbox, but
aside from some similarities (including an identical message ID, To, From,
Subject, Dates, etc.), the headers of the two messages may be very
different. For example, the routes may be entirely different. The
Subjects may be similar, but they may be different.
Consider an even worse case -- someone sends a virus to bo...@wo... and
som...@ex.... Bob's a member of somelist, so he gets two copies.
But the mailing list adds its own header lines and footer to the body.
So the message ID is the same, but for all intents and purposes, everything
else is different.
Can our current DB schema handle this?
I *think* it can -- it seems like we're using message ID + the unique
integer (msg_ids.m_id). Does that make sense?
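[Editor's sketch: the "internal id vs. Message-Id" point can be exercised in a toy example. SQLite stands in for the real MySQL database, and only the relevant msg_ids columns from the schema thread are kept; this is an illustration, not maildb code.]

```python
# Toy demonstration that keying on an auto-increment m_id (rather than the
# RFC 2822 Message-Id stored in m_msg_id) lets two copies of "the same"
# message coexist with different headers.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE msg_ids (
        m_id      INTEGER PRIMARY KEY AUTOINCREMENT,
        m_msg_id  TEXT NOT NULL,
        m_to      TEXT,
        m_subject TEXT
    )
""")
# The same Message-Id arrives twice, via different routes/headers:
db.execute("INSERT INTO msg_ids (m_msg_id, m_to, m_subject) VALUES (?,?,?)",
           ("<abc@example.com>", "bob-at-work", "hello"))
db.execute("INSERT INTO msg_ids (m_msg_id, m_to, m_subject) VALUES (?,?,?)",
           ("<abc@example.com>", "somelist", "[somelist] hello"))
copies = db.execute(
    "SELECT m_id FROM msg_ids WHERE m_msg_id = ?",
    ("<abc@example.com>",)).fetchall()
print(copies)  # two distinct internal ids for one Message-Id
```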
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
|
|
From: <li...@av...> - 2003-05-22 18:36:34
|
Oops. I don't imagine m_parent_id can be -1 if I define the field as integer unsigned. Make that integer. :-)

li...@av... writes:
> Ok, so here are today's changes made on queeg.
>
> -- msg_ids adds some "header" cols, and m_parent_id (which would map to
> another m_id, or be -1 if message is standalone).
>
> create table msg_ids (
>   m_id integer unsigned not null auto_increment,
>   m_msg_id varchar(255) not null,
>   m_to text,
>   m_cc text,
>   m_bcc text,
>   m_subject text,
>   m_date datetime,
>   m_from text,
>   m_in_reply_to text,
>   m_sender text,
>   m_parent_id integer unsigned,
>   m_vw_incl text,
>   m_vw_excl text,
>   primary key (m_id),
>   index m_msg_id_idx (m_msg_id)
> );
|
|
From: <li...@av...> - 2003-05-22 18:30:56
|
Ok, so here are today's changes made on queeg.

-- msg_ids adds some "header" cols, and m_parent_id (which would map to
another m_id, or be -1 if message is standalone).

create table msg_ids (
  m_id integer unsigned not null auto_increment,
  m_msg_id varchar(255) not null,
  m_to text,
  m_cc text,
  m_bcc text,
  m_subject text,
  m_date datetime,
  m_from text,
  m_in_reply_to text,
  m_sender text,
  m_parent_id integer unsigned,
  m_vw_incl text,
  m_vw_excl text,
  primary key (m_id),
  index m_msg_id_idx (m_msg_id)
);

-- msg_owners loses date, from, subject fields - an earlier attempt (that I
forgot about completely) to put a few "common" fields somewhere to avoid a
join. I wish I remembered why I put those in this table and not in
msg_ids. Oh well.

create table msg_owners (
  mu_id integer unsigned not null auto_increment,
  mu_m_id integer unsigned not null references msg_ids(m_id),
  mu_u_id integer unsigned not null references users(u_id),
  mu_ca_id integer unsigned not null references cats(ca_id),
  -- mu_date integer unsigned not null references msg_hdrs(mh_id),
  -- mu_from integer unsigned not null references msg_hdrs(mh_id),
  -- mu_subject integer unsigned not null references msg_hdrs(mh_id),
  mu_flags integer,
  primary key (mu_id),
  index mu_m_id_idx (mu_m_id),
  index mu_u_id_idx (mu_u_id),
  index mu_ca_id_idx (mu_ca_id)
);

-- msg_attach loses ma_in_fs, ma_in_msg fields.

These changes are reflected in cr_maildb.sql and also documented in the design.html document in the mysql part of the cvs tree. I need to review the indexing though. Well, lots to be reviewed and fine-tuned as we go along (including text vs varchar as James notes, etc.).

- Liza |
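[Editor's sketch: the m_parent_id convention (signed integer, -1 for standalone, otherwise the parent's m_id) can be exercised in a toy example. SQLite stands in for MySQL and only the threading-relevant columns are kept; this is an illustration, not maildb code.]

```python
# Toy check of the m_parent_id convention: a signed integer column can
# hold -1 for standalone messages and a real m_id for replies.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE msg_ids (
        m_id        INTEGER PRIMARY KEY AUTOINCREMENT,
        m_msg_id    TEXT NOT NULL,
        m_parent_id INTEGER   -- signed, so -1 can mark a standalone message
    )
""")
db.execute("INSERT INTO msg_ids (m_msg_id, m_parent_id) VALUES (?, ?)",
           ("<root@example.com>", -1))          # standalone message
root_id = db.execute("SELECT m_id FROM msg_ids").fetchone()[0]
db.execute("INSERT INTO msg_ids (m_msg_id, m_parent_id) VALUES (?, ?)",
           ("<reply@example.com>", root_id))    # reply points at parent
rows = db.execute(
    "SELECT m_msg_id, m_parent_id FROM msg_ids ORDER BY m_id").fetchall()
print(rows)
```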
|
From: James S. <jl...@do...> - 2003-05-22 17:55:03
|
li...@av... wrote:
> je...@sq... wrote:
>> Heck, let's err on the side of lots of options. The idea here is that we
>> can search and sort in a million different ways:
>> [...]
>
> Errr...that's a lot. Sure you don't want to keep just the "well-known"
> ones in msg_ids and shove the rest off into msg_hdrs? Or do you want to
> get rid of msg_hdrs and put everything into msg_ids? Actually I guess
> we'd need a msg_hdrs in any case to catch anything that we don't account
> for in msg_ids. Just wondering how far you want to go here.

I think the impact of indexing is going to be our guide here, I'm just not sure if there's a difference in performance. I don't know if it's faster to put n single-column indexes on one row, or one single-column index on n rows, when inserting message headers.

I think RFC 2822 is the latest one governing mail messages...when I get a chance I'll see if it calls out required headers. If it does, I say we stick to those in msg_ids for a first pass. If not, I say we just do the "common" ones and put the optional/variable ones (like X- headers) in msg_hdrs.

>> Here's an issue, though: what happens if the value of any of these header
>> lines is longer than the length of the "text" field?
>
> Well, text is 64k, mediumtext is 16m, and longtext is 4g. (I misspoke
> earlier when I said text was 16m.) So...I'd make them mediumtext, I
> suspect, since I think 64k may be too small, but 16m has got to be
> overkill. I mean, I can't even get 1m message bodies sent half the time,
> much less anything with a header that big.

I'd say even text is overkill...but I don't know if there's a spec limit on header length. A big VARCHAR might be more efficient than text. If we can't find a reference we may need to do some analysis here.

JLS |
|
From: Jeff S. <jsq...@os...> - 2003-05-22 17:16:43
|
On Thu, 22 May 2003 li...@av... wrote:
> Sure, we can cram it somewhere. :-) I could just make a table called
> 'msg_orig' that consists of the msg_id and a largetext field to hold the
> whole thing. Ok?
Sounds good to me.
> > Heck, let's err on the side of lots of options. The idea here is that we
> > can search and sort in a million different ways:
> >
> > - To
> > [snipped]
>
> Errr...that's a lot. Sure you don't want to keep just the "well-known"
> ones in msg_ids and shove the rest off into msg_hdrs? Or do you want to
> get rid of msg_hdrs and put everything into msg_ids? Actually I guess
> we'd need a msg_hdrs in any case to catch anything that we don't account
> for in msg_ids. Just wondering how far you want to go here.
True.
Ok, let's just do a few for now, and if performance really sucks, we can
add more later. How about:
- To
- CC
- BCC (for outgoing messages)
- Subject
- Date
- From
- In-Reply-To
- Sender
> > Here's an issue, though: what happens if the value of any of these header
> > lines is longer than the length of the "text" field?
>
> Well, text is 64k, mediumtext is 16m, and longtext is 4g. (I misspoke
> earlier when I said text was 16m.) So...I'd make them mediumtext, I
> suspect, since I think 64k may be too small, but 16m has got to be
> overkill. I mean, I can't even get 1m message bodies sent half the
> time, much less anything with a header that big.
I think text (64k) is fine for a single line in a header. If you've got a
header line that's longer than 64k, you've got other issues! ;-)
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
|
|
From: <li...@av...> - 2003-05-22 16:52:46
|
> For safety's sake (and probably only while we're developing/debugging),
> should we stash the entire (unmodified text) header in a table somewhere?

Sure, we can cram it somewhere. :-) I could just make a table called 'msg_orig' that consists of the msg_id and a longtext field to hold the whole thing. Ok?

> Heck, let's err on the side of lots of options. The idea here is that we
> can search and sort in a million different ways:
>
> - To
> - CC
> - BCC (for outgoing messages)
> - Subject
> - Date
> - In-Reply-To
> - Precedence
> - Reply-To
> - Sender
> - Message-ID (I think we have that already, right? Just mention it to be
> complete...)
> - Return-Path
> - List-Id
> - X-Sender
> - X-Mailer
> - User-agent
> - Thread-topic
> - Thread-index
> - References
> - ...?

Errr...that's a lot. Sure you don't want to keep just the "well-known" ones in msg_ids and shove the rest off into msg_hdrs? Or do you want to get rid of msg_hdrs and put everything into msg_ids? Actually I guess we'd need a msg_hdrs in any case to catch anything that we don't account for in msg_ids. Just wondering how far you want to go here.

> Here's an issue, though: what happens if the value of any of these header
> lines is longer than the length of the "text" field?

Well, text is 64k, mediumtext is 16m, and longtext is 4g. (I misspoke earlier when I said text was 16m.) So...I'd make them mediumtext, I suspect, since I think 64k may be too small, but 16m has got to be overkill. I mean, I can't even get 1m message bodies sent half the time, much less anything with a header that big.

- Liza |
|
From: Jeff S. <jsq...@os...> - 2003-05-22 16:00:05
|
On Thu, 22 May 2003 li...@av... wrote:
> Ok, I agree as well about the messages being in the database in their
> entirety, and that the ma_in_fs field can go away. Jeff, I also dimly
> remember that we decided to treat all message bodies as attachments.
> To indicate that a given message is not a standalone, yes, how about we
> add a "m_parent" field to msg_ids that would contain the m_id of the
> parent message, or would be -1 if the message is a standalone.
Excellent. :-)
> I waffled a lot on this and went the more generic route. Also I
> originally assumed that with a line in msg_headers per header, there
> would be a separate entry (row) for each recipient on the to, cc, bcc
> lines. I have no problem putting the "required" headers into msg_ids --
> [snipped]
I agree here that for ease of searching, we probably want to put "well
known" header lines in specific fields.
> but I believe then we would be simply putting the entire comma-separated
> list as the value of m_to, m_cc, etc. Is this ok with everyone?
I think that's ok. Otherwise we'd have to make yet another table indexed
by message ID, right?
Let's try this approach and see how it works (i.e., that the field
contains the value of the "To:" line, etc.).
For safety's sake (and probably only while we're developing/debugging),
should we stash the entire (unmodified text) header in a table somewhere?
i.e., in case we decide to re-do the schema, we have all the original
header that we can re-build all the tables from? I don't know if this is
a huge deal, and/or if it's helpful, but it might not be a bad idea --
could help with debugging (e.g., compare what ended up in the tables to
what the unmodified header is). Just an idea. :-)
> So...assuming I modify msg_ids ... what will we consider the required
> headers? to, cc, subject, date, ... ?
Heck, let's err on the side of lots of options. The idea here is that we
can search and sort in a million different ways:
- To
- CC
- BCC (for outgoing messages)
- Subject
- Date
- In-Reply-To
- Precedence
- Reply-To
- Sender
- Message-ID (I think we have that already, right? Just mention it to be
complete...)
- Return-Path
- List-Id
- X-Sender
- X-Mailer
- User-agent
- Thread-topic
- Thread-index
- References
- ...?
(I know the X-* ones are not standard, but enough mailers use the ones
that I mentioned that it could be worthwhile -- didn't you always want to
be able to filter by who sends using Outlook Express? ;-)
Here's an issue, though: what happens if the value of any of these header
lines is longer than the length of the "text" field?
--
{+} Jeff Squyres
{+} jsq...@os...
{+} Research Associate, Open Systems Lab, Indiana University
{+} http://www.osl.iu.edu/
|