From: Eric S. J. <es...@ha...> - 2007-10-27 19:32:11
Paolo wrote:
> On Thu, Oct 25, 2007 at 11:07:59PM -0400, Eric S. Johansson wrote:
>>> not a problem if we make the reavercache RO for the imapd - imapd usually
>>> can be configured to keep index+cache (whatever metainfo they use)
>>> elsewhere than the msg store.
>> this is a rather problematic approach. you need to define who is going to keep
>
> as stated above, imapd takes care of metadata, to serve msg to IMAP
> clients. mailreaver doesn't need them, won't use IMAP anyway.
> When queried, imapd would refresh its metadata, returning either OK, NO, BAD;
> then it's up to the client to take proper action.
> There's no need for RW access for imapd as long as its metadata are stored
> elsewhere.

I took a look at dovecot, and supporting arbitrary metadata is on the "to do"
list. At least, that's how I read what's documented. As we all know,
documentation and actual project status are two different things.

>>>> "ls blah*" to ask for a generic name and then open that.
> ...
>> reliable to call a glob function than syscall 'ls.'
>
> ok - provided it's avail ;)

yes, yeah, yeah. glob was available on DOS (usually ;-)

>>> It's more likely, for spam filtering, that people have an IMAPd handy
>>> rather than an SQL engine.
>> even though maildir is common, for me an sql engine (specifically sqlite) is
>> faster and easier to set up than trying to modify an imapd configuration and
>
> o-k. But that's you :)

and everybody knows I'm *special*

> OTOH one may do a bargain moving its mail store to eg. MySQL - I know of
> very few apps that talk SQL to the mail store. In general, a wrapper
> IMAP<->SQL would be needed. Hmmm... what a nice world, seems that dovecot(8)
> does have such (as plugin): http://www.dovecot.org/patches/mail-sql.tar.gz.
> Don't have time to play with it AnyTimeSoon, but should anyone on list know &
> use it, pls let us know.

interesting. It might make sense to also look into LDAs that use MySQL.

>> believe the repository should not be directly managed by any application but
>> should go through a library front-end which handles all of the message
>
> yep, saner, like for cssfile handling.
> Hm, that'd need testing to assess actual impact on performance by protocol
> overhead.
>
> ...
>> retrieve message, cache ID by cache ID, message ID, real from, message from,
> ...
>> address... (set allows you to change meta values that are not derived from
>> message meta-values)
>
> ehm ... isn't that what IMAP(+extensions) is all about ;)
> eg. here's what I see cached in one of dovecot's indexes
>
> Bcc
> CC
> Content-Disposition
> Content-Language
> Content-Transfer-Encoding
> Content-Type
> Date
> FROM
> Importance
> In-Reply-To
> Message-ID
> Priority
> References
> Reply-To
> Sender
> SUBJECT
> TO
> X-Priority

and here is my metadata. How would we expand the IMAP metadata to handle
these little bits? Admittedly, some of it would go away, but it gives you an
idea of what I'm using:

X-Two-Penny-Blue: using format; delivery_headers
X-Two-Penny-Blue: attic_ID; bddc0d6c54d81e2c
X-Two-Penny-Blue: tpblue_ID; esj
X-Two-Penny-Blue: xforward; 66.35.250.225
X-Two-Penny-Blue: passed filter; fast white list
X-Crm114-score: 89.2369

>> we need to ask a question, do we want to allow the same message to exist in the
>> cache twice differing only by meta-information?
>
> not with same msg-id. If I got your question right.

okay. That's the direction I was heading in myself.
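
To make the library front-end idea quoted above a little more concrete,
here's roughly the interface I have in mind. This is an untested sketch and
every name in it is invented for illustration; it assumes one plain-text file
per message, named by cache ID, with metadata kept in a companion ".meta"
file of name: value lines (so the messages themselves stay grep-able):

    import os

    class MessageRepository:
        def __init__(self, root):
            self.root = root

        def _path(self, cache_id):
            return os.path.join(self.root, cache_id)

        def get_message(self, cache_id):
            # raw RFC 2822 text; stays grep-able on disk
            with open(self._path(cache_id)) as f:
                return f.read()

        def get_meta(self, cache_id):
            # companion "name: value" file -> dict
            meta = {}
            try:
                with open(self._path(cache_id) + '.meta') as f:
                    for line in f:
                        name, _, value = line.partition(':')
                        meta[name.strip()] = value.strip()
            except IOError:
                pass                  # no metadata yet
            return meta

        def set_meta(self, cache_id, name, value):
            # only for values NOT derived from the message itself;
            # derived values should stay read-only
            meta = self.get_meta(cache_id)
            meta[name] = value
            with open(self._path(cache_id) + '.meta', 'w') as f:
                for k in sorted(meta):
                    f.write('%s: %s\n' % (k, meta[k]))

The point is that applications only ever see cache IDs and this interface;
whether the back end is flat files, SQLite, or an imapd is the library's
business.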
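
And since I dumped my metadata above, here's how those headers pull apart
with the stdlib email module -- the "name; value" split after the colon is
just the convention I happen to use:

    import email

    with open('some_message') as f:       # any message file; name invented
        msg = email.message_from_string(f.read())

    meta = {}
    for header in msg.get_all('X-Two-Penny-Blue') or []:
        name, _, value = header.partition(';')
        meta[name.strip()] = value.strip()

    print(meta)   # {'attic_ID': 'bddc0d6c54d81e2c', 'tpblue_ID': 'esj', ...}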
I can see certain circumstances where you might want to have two different
sets of metadata as an audit trail/historical record, but those are really
rare cases that I can easily convince myself to ignore.

>> perfectly reasonable given the general nature of the search. On my inbox which
>> has about 4803 messages, a body search for a string never completes.
>
> search in *BODY* is a PITA anyway: no cache, MIME-decode on the fly (if
> implemented server-side, else it's even worse: client needs to fetch msg and
> do on its own) :(

Ain't that the truth. Anyway, I think this might be where the IMAP model
breaks down, in that the kind of indexing on metadata we would find most
useful doesn't get done.

>> messages stored as plain text in a file because I do find myself using grep
>> occasionally and the plaintext form is nice and helpful.
>
> would SQLite satisfy that requirement?

Other than the fact that it uses SQL? Yeah, it satisfies the requirement of
properly indexed data. It doesn't satisfy the requirement of grep-able
messages. It may just be that perfect synchronization, indexed metadata, and
messages searchable with existing tools like grep just aren't all going to
happen at once. It may be necessary to provide a tool or two that will dump
messages according to certain simple parameters.

>> face while talking about using an IMAP server for access, you should have no
>> problems using a regular database for storage and retrieval. The real
>> discussion should be around metadata, its representation, and how to expand the
>> metadata representation on a per (external) application basis.
>
> well, as long as email msgs is the text stream to deal with, IMAP has an edge
> in that it's designed for that, it's a commands set and parsing functions to
> properly deal with data structured as email.
> For the caching+indexing, and a good number of other features, much depends on
> actual server implementation.

quite true, and that's the only reason to consider using IMAP for everything.
But I think we will find the performance is not adequate. As it is, my times
for scoring range from 0.3 seconds up to three seconds. I know that if we use
a file store abstraction, those times will increase, but I don't think they
will increase excessively.

>> just a thought.
>
> +a few others on the heap ;)

and good ones they are too. here's one more (which will undoubtedly spawn
lots of others). I'm probably using a lot of shorthand because I've been
thinking about this problem for a while.

In twopenny blue I use a file-based queue to feed messages to a stamper and a
trainer (interface to CRM114 training). One of the mistakes I made in an
early implementation was actively scanning the filesystem for new entries.
Performance was bad, and in a virtual machine it was even worse. One of the
right things I did was how I named queue entries. The format of a path to a
queue entry was "/preamble/unique ID/state", where state was the file. It was
easy to create an entry and not worry about race conditions, because creating
a directory is an atomic operation and, as a rule, UNIX systems bark at you
if you try to create a directory of the same name. Making the state
independent from the unique ID allows you to change the state without
worrying about collision with any other entry. This organization also allows
you to store additional files for something like meta-information under the
same unique ID. A sketch of the atomic-create trick follows.
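
Roughly like this -- untested, with the preamble path and the state names
invented for illustration:

    import errno
    import os
    import time

    QUEUE_ROOT = '/var/spool/tpblue'      # the "preamble"; path invented

    def create_entry(data):
        # grab a unique ID by mkdir: atomic, and the OS barks (EEXIST)
        # if two writers ever pick the same name
        while True:
            unique_id = '%d.%d' % (time.time() * 1000, os.getpid())
            entry_dir = os.path.join(QUEUE_ROOT, unique_id)
            try:
                os.mkdir(entry_dir)
                break
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise                 # a real error, not a collision
        # write the payload while the entry is invisible to readers
        with open(os.path.join(entry_dir, 'new'), 'w') as f:
            f.write(data)
        # changing state = renaming the file; atomic on the same
        # filesystem, and no collisions because the unique ID is ours
        os.rename(os.path.join(entry_dir, 'new'),
                  os.path.join(entry_dir, 'active'))
        return unique_id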
I solved the performance problem with a queue management process. On start,
the management process reads the queue off disk (ordered by the unique ID and
grouped by state). All requests from the application are mediated by this
process, but all access to entries is handled by the client-side code. For
example, to insert an entry, you ask the queue management process to create
an entry. It returns the path/filename, and the application code puts the
data into the entry. The entry is activated when the application code changes
the entry state to the active state value. Pulling an entry off of the queue
is similar. The state is changed to the processing state, the filename is
returned, the client-side code reads the data and, when it's done, deletes
the entry. This model lets you reclaim queue entries if processing should be
interrupted.

The same model I used for file queues could be used for general file storage.
The management process would create and hold all of the indices of metadata.
The use of a management process instead of a traditional database should
create greater flexibility with regard to data storage but reduce flexibility
with regard to how indices are handled. This may or may not be a problem.

One potential solution to the dynamic metadata problem and SQL is to use
SQLite in a management process as described above, but store the database in
memory. Then we can dynamically define tables according to the metadata found
in the message repository. We regain query flexibility and take advantage of
other people's expertise in making queries fast.

Communication with this mediator process would be through some form of remote
procedure call. Unfortunately, I haven't found a good one because, well, just
because. Recently, I started using PYRO (Python remote objects), but the
threading model is a bit odd and Python locks suck eggs. There's another
remote object model which allows you to make calls to a remote Python
interpreter. Kind of scary. Then there is SOAP, which is a whole bunch of
froth and not a lot of substance, not to mention poor interoperability. Last,
and apparently best, is XML-RPC or JSON-RPC: function-oriented instead of
object-oriented, but with lots of implementations in different languages, so
interoperability is pretty much a given.

I suggest storing all metadata in the simplest format possible. Something
like name: value (yes, very much like header formats in RFC 2822) or, at
worst, JSON.

anyway, that's a few more thoughts. Rough sketches of the queue protocol and
the in-memory SQLite idea are in the postscripts below.

---eric

--
Speech-recognition in use. It makes mistakes, I correct some.
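
PS: to make the mediator-process-plus-RPC idea concrete, here's an untested
sketch of the server side using the stdlib SimpleXMLRPCServer (xmlrpc.server
in later Pythons). The root path, port, and method names are all invented.
Note that the stock server handles one request at a time, which conveniently
serializes access -- exactly the mediation I'm after:

    import os
    from SimpleXMLRPCServer import SimpleXMLRPCServer  # xmlrpc.server in py3

    QUEUE_ROOT = '/var/spool/tpblue'                   # path invented

    class QueueManager:
        def __init__(self, root):
            self.root = root
            # on start, read the queue off disk: unique ID -> state
            self.entries = {}
            for uid in sorted(os.listdir(root)):
                entry = os.path.join(root, uid)
                if os.path.isdir(entry):
                    for state in os.listdir(entry):
                        self.entries[uid] = state

        def pull(self):
            # flip the oldest active entry to 'processing' and hand the
            # path back; the client reads the data itself
            for uid in sorted(self.entries):
                if self.entries[uid] == 'active':
                    path = os.path.join(self.root, uid, 'processing')
                    os.rename(os.path.join(self.root, uid, 'active'), path)
                    self.entries[uid] = 'processing'
                    return uid, path
            return '', ''                  # XML-RPC can't return None

        def done(self, uid):
            # client finished: remove the entry entirely
            os.remove(os.path.join(self.root, uid, 'processing'))
            os.rmdir(os.path.join(self.root, uid))
            del self.entries[uid]
            return True

    server = SimpleXMLRPCServer(('localhost', 8814))   # port invented
    server.register_instance(QueueManager(QUEUE_ROOT))
    server.serve_forever()

The client side is then just xmlrpclib.ServerProxy('http://localhost:8814/')
and calls to pull() and done(uid); the data itself never travels over the RPC
channel, only paths do.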
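
PPS: and the in-memory SQLite idea, equally untested; the table layout and
column names are made up. The mediator grows the schema as new metadata names
turn up in the repository:

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE meta (cache_id TEXT PRIMARY KEY)')
    known = set(['cache_id'])

    def index_metadata(cache_id, meta):
        # meta is a dict of name: value pairs pulled out of a message;
        # add a column whenever a new metadata name shows up
        for name in meta:
            if name not in known:
                db.execute('ALTER TABLE meta ADD COLUMN "%s" TEXT' % name)
                known.add(name)
        cols = ['cache_id'] + list(meta)
        sql = ('INSERT OR REPLACE INTO meta (%s) VALUES (%s)'
               % (', '.join('"%s"' % c for c in cols),
                  ', '.join('?' * len(cols))))
        db.execute(sql, [cache_id] + [meta[c] for c in meta])

    # and we get real queries back:
    index_metadata('bddc0d6c54d81e2c',
                   {'tpblue_ID': 'esj', 'xforward': '66.35.250.225'})
    for row in db.execute('SELECT cache_id FROM meta WHERE "tpblue_ID" = ?',
                          ('esj',)):
        print(row[0])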