From: Eric S. J. <es...@ha...> - 2007-10-27 19:32:11
Paolo wrote:
> On Thu, Oct 25, 2007 at 11:07:59PM -0400, Eric S. Johansson wrote:
>>> not a problem if we make the reavercache RO for the imapd - imapd usually
>>> can be configured to keep index+cache (whatever metainfo they use)
>>> elsewhere than the msg store.
>> this is a rather problematic approach. you need to define who is going to keep
>
> as stated above, imapd takes care of metadata, to serve msg to IMAP
> clients. mailreaver doesn't need them, won't use IMAP anyway.
> When queried, imapd would refresh its metadata, returning either OK, NO, BAD;
> then it's up to the client to take proper action.
> There's no need for RW access for imapd as long as its metadata are stored
> elsewhere.

I took a look at dovecot, and supporting arbitrary metadata is on the "to do"
list. At least, that's how I read what's documented. As we all know,
documentation and actual project status are two different things.

>>>> "ls blah*" to ask for a generic name and then open that.
> ...
>> reliable to call a glob function than syscall 'ls.'
>
> ok - provided it's avail ;)

yes, yeah, yeah. glob was available on DOS (usually ;-)

>>> It's more likely, for spam filtering, that people have an IMAPd handy
>>> rather than an SQL engine.
>> even though maildir is common, for me an sql engine (specifically sqlite) is
>> faster and easier to set up than trying to modify an imapd configuration and
>
> o-k. But that's you :)

and everybody knows I'm *special*

> OTOH one may do a bargain moving its mail store to eg. MySQL - I know of
> very few apps that talk SQL to the mail store. In general, a wrapper
> IMAP<->SQL would be needed. Hmmm... what a nice world, seems that dovecot(8)
> does have such (as plugin): http://www.dovecot.org/patches/mail-sql.tar.gz.
> Don't have time to play with it AnyTimeSoon, but should anyone on list know &
> use it, pls let us know.

interesting. It might make sense to also look into LDAs that use MySQL.

>> believe the repository should not be directly managed by any application but
>> should go through a library front-end which handles all of the message
>
> yep, saner, like for cssfile handling.
> Hm, that'd need testing to assess actual impact on performance by protocol
> overhead.
>
> ...
>> retrieve message, cache ID by cache ID, message ID, real from, message from,
> ...
>> address... (set allows you to change meta values that are not derived from
>> message meta-values)
>
> ehm ... isn't that what IMAP(+extensions) is all about ;)
> eg. here's what I see cached in one of dovecot's indexes
>
> Bcc
> CC
> Content-Disposition
> Content-Language
> Content-Transfer-Encoding
> Content-Type
> Date
> FROM
> Importance
> In-Reply-To
> Message-ID
> Priority
> References
> Reply-To
> Sender
> SUBJECT
> TO
> X-Priority

and here is my metadata. How would we expand the IMAP metadata to handle
these little bits? Admittedly, some of it would go away, but it gives you an
idea of what I'm using:

X-Two-Penny-Blue: using format; delivery_headers
X-Two-Penny-Blue: attic_ID; bddc0d6c54d81e2c
X-Two-Penny-Blue: tpblue_ID; esj
X-Two-Penny-Blue: xforward; 66.35.250.225
X-Two-Penny-Blue: passed filter; fast white list
X-Crm114-score: 89.2369

>> we need to ask a question, do we want to allow the same message to exist in the
>> cache twice differing only by meta-information?
>
> not with same msg-id. If I got your question right.

okay. That's the direction I was heading in myself.
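
To make the library front-end idea quoted above a little more concrete,
here's roughly the interface I have in mind. This is an untested sketch and
every name in it is invented for illustration; it assumes one plain-text file
per message, named by cache ID, with metadata kept in a companion ".meta"
file of name: value lines (so the messages themselves stay grep-able):

    import os

    class MessageRepository:
        def __init__(self, root):
            self.root = root

        def _path(self, cache_id):
            return os.path.join(self.root, cache_id)

        def get_message(self, cache_id):
            # raw RFC 2822 text; stays grep-able on disk
            with open(self._path(cache_id)) as f:
                return f.read()

        def get_meta(self, cache_id):
            # companion "name: value" file -> dict
            meta = {}
            try:
                with open(self._path(cache_id) + '.meta') as f:
                    for line in f:
                        name, _, value = line.partition(':')
                        meta[name.strip()] = value.strip()
            except IOError:
                pass                  # no metadata yet
            return meta

        def set_meta(self, cache_id, name, value):
            # only for values NOT derived from the message itself;
            # derived values should stay read-only
            meta = self.get_meta(cache_id)
            meta[name] = value
            with open(self._path(cache_id) + '.meta', 'w') as f:
                for k in sorted(meta):
                    f.write('%s: %s\n' % (k, meta[k]))

The point is that applications only ever see cache IDs and this interface;
whether the back end is flat files, SQLite, or an imapd is the library's
business.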
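
And since I dumped my metadata above, here's how those headers pull apart
with the stdlib email module -- the "name; value" split after the colon is
just the convention I happen to use:

    import email

    with open('some_message') as f:       # any message file; name invented
        msg = email.message_from_string(f.read())

    meta = {}
    for header in msg.get_all('X-Two-Penny-Blue') or []:
        name, _, value = header.partition(';')
        meta[name.strip()] = value.strip()

    print(meta)   # {'attic_ID': 'bddc0d6c54d81e2c', 'tpblue_ID': 'esj', ...}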
I can see certain circumstances where you might want to have two different
sets of metadata as an audit trail/historical record, but those are really
rare cases that I can easily convince myself to ignore.

>> perfectly reasonable given the general nature of the search. On my inbox which
>> has about 4803 messages, a body search for a string never completes.
>
> search in *BODY* is a PITA anyway: no cache, MIME-decode on the fly (if
> implemented server-side, else it's even worse: client needs to fetch msg and
> do on its own) :(

Ain't that the truth. Anyway, I think this might be where the IMAP model
breaks down, in that the kind of indexing on metadata we would find most
useful doesn't get done.

>> messages stored as plain text in a file because I do find myself using grep
>> occasionally and the plaintext form is nice and helpful.
>
> would SQLite satisfy that requirement?

Other than the fact that it uses SQL? Yeah, it satisfies the requirement of
properly indexed data. It doesn't satisfy the requirement of grep-able
messages. It may just be that perfect synchronization, indexed metadata, and
messages searchable with existing tools like grep just aren't all going to
happen at once. It may be necessary to provide a tool or two that will dump
messages according to certain simple parameters.

>> face while talking about using an IMAP server for access, you should have no
>> problems using a regular database for storage and retrieval. The real
>> discussion should be around metadata, its representation, and how to expand the
>> metadata representation on a per (external) application basis.
>
> well, as long as email msgs is the text stream to deal with, IMAP has an edge
> in that it's designed for that, it's a commands set and parsing functions to
> properly deal with data structured as email.
> For the caching+indexing, and a good number of other features, much depends on
> actual server implementation.

quite true, and that's the only reason to consider using IMAP for everything.
But I think we will find the performance is not adequate. As it is, my times
for scoring range from 0.3 seconds up to three seconds. I know that if we use
a file store abstraction, those times will increase, but I don't think they
will increase excessively.

>> just a thought.
>
> +a few others on the heap ;)

and good ones they are too. here's one more (which will undoubtedly spawn
lots of others). I'm probably using a lot of shorthand because I've been
thinking about this problem for a while.

In twopenny blue I use a file-based queue to feed messages to a stamper and a
trainer (interface to CRM114 training). One of the mistakes I made in an
early implementation was actively scanning the filesystem for new entries.
Performance was bad, and in a virtual machine it was even worse. One of the
right things I did was how I named queue entries. The format of a path to a
queue entry was "/preamble/unique ID/state", where state was the file. It was
easy to create an entry and not worry about race conditions, because creating
a directory is an atomic operation and, as a rule, UNIX systems bark at you
if you try to create a directory of the same name. Making the state
independent from the unique ID allows you to change the state without
worrying about collision with any other entry. This organization also allows
you to store additional files for something like meta-information under the
same unique ID. A sketch of the atomic-create trick follows.
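
Roughly like this -- untested, with the preamble path and the state names
invented for illustration:

    import errno
    import os
    import time

    QUEUE_ROOT = '/var/spool/tpblue'      # the "preamble"; path invented

    def create_entry(data):
        # grab a unique ID by mkdir: atomic, and the OS barks (EEXIST)
        # if two writers ever pick the same name
        while True:
            unique_id = '%d.%d' % (time.time() * 1000, os.getpid())
            entry_dir = os.path.join(QUEUE_ROOT, unique_id)
            try:
                os.mkdir(entry_dir)
                break
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise                 # a real error, not a collision
        # write the payload while the entry is invisible to readers
        with open(os.path.join(entry_dir, 'new'), 'w') as f:
            f.write(data)
        # changing state = renaming the file; atomic on the same
        # filesystem, and no collisions because the unique ID is ours
        os.rename(os.path.join(entry_dir, 'new'),
                  os.path.join(entry_dir, 'active'))
        return unique_id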
I solved the performance problem with a queue management process. On start,
the management process reads the queue off disk (ordered by the unique ID and
grouped by state). All requests from the application are mediated by this
process, but all access to entries is handled by the client-side code. For
example, to insert an entry, you ask the queue management process to create
an entry. It returns the path/filename, and the application code puts the
data into the entry. The entry is activated when the application code changes
the entry state to the active state value. Pulling an entry off of the queue
is similar. The state is changed to the processing state, the filename is
returned, the client-side code reads the data and, when it's done, deletes
the entry. This model lets you reclaim queue entries if processing should be
interrupted.

The same model I used for file queues could be used for general file storage.
The management process would create and hold all of the indices of metadata.
The use of a management process instead of a traditional database should
create greater flexibility with regard to data storage but reduce flexibility
with regard to how indices are handled. This may or may not be a problem.

One potential solution to the dynamic metadata problem and SQL is to use
SQLite in a management process as described above, but store the database in
memory. Then we can dynamically define tables according to the metadata found
in the message repository. We regain query flexibility and take advantage of
other people's expertise in making queries fast.

Communication with this mediator process would be through some form of remote
procedure call. Unfortunately, I haven't found a good one because, well, just
because. Recently, I started using PYRO (Python remote objects), but the
threading model is a bit odd and Python locks suck eggs. There's another
remote object model which allows you to make calls to a remote Python
interpreter. Kind of scary. Then there is SOAP, which is a whole bunch of
froth and not a lot of substance, not to mention poor interoperability. Last,
and apparently best, is XML-RPC or JSON-RPC: function-oriented instead of
object-oriented, but with lots of implementations in different languages, so
interoperability is pretty much a given.

I suggest storing all metadata in the simplest format possible. Something
like name: value (yes, very much like header formats in RFC 2822) or, at
worst, JSON.

anyway, that's a few more thoughts. Rough sketches of the queue protocol and
the in-memory SQLite idea are in the postscripts below.

---eric

--
Speech-recognition in use. It makes mistakes, I correct some.
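
PS: to make the mediator-process-plus-RPC idea concrete, here's an untested
sketch of the server side using the stdlib SimpleXMLRPCServer (xmlrpc.server
in later Pythons). The root path, port, and method names are all invented.
Note that the stock server handles one request at a time, which conveniently
serializes access -- exactly the mediation I'm after:

    import os
    from SimpleXMLRPCServer import SimpleXMLRPCServer  # xmlrpc.server in py3

    QUEUE_ROOT = '/var/spool/tpblue'                   # path invented

    class QueueManager:
        def __init__(self, root):
            self.root = root
            # on start, read the queue off disk: unique ID -> state
            self.entries = {}
            for uid in sorted(os.listdir(root)):
                entry = os.path.join(root, uid)
                if os.path.isdir(entry):
                    for state in os.listdir(entry):
                        self.entries[uid] = state

        def pull(self):
            # flip the oldest active entry to 'processing' and hand the
            # path back; the client reads the data itself
            for uid in sorted(self.entries):
                if self.entries[uid] == 'active':
                    path = os.path.join(self.root, uid, 'processing')
                    os.rename(os.path.join(self.root, uid, 'active'), path)
                    self.entries[uid] = 'processing'
                    return uid, path
            return '', ''                  # XML-RPC can't return None

        def done(self, uid):
            # client finished: remove the entry entirely
            os.remove(os.path.join(self.root, uid, 'processing'))
            os.rmdir(os.path.join(self.root, uid))
            del self.entries[uid]
            return True

    server = SimpleXMLRPCServer(('localhost', 8814))   # port invented
    server.register_instance(QueueManager(QUEUE_ROOT))
    server.serve_forever()

The client side is then just xmlrpclib.ServerProxy('http://localhost:8814/')
and calls to pull() and done(uid); the data itself never travels over the RPC
channel, only paths do.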
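
PPS: and the in-memory SQLite idea, equally untested; the table layout and
column names are made up. The mediator grows the schema as new metadata names
turn up in the repository:

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE meta (cache_id TEXT PRIMARY KEY)')
    known = set(['cache_id'])

    def index_metadata(cache_id, meta):
        # meta is a dict of name: value pairs pulled out of a message;
        # add a column whenever a new metadata name shows up
        for name in meta:
            if name not in known:
                db.execute('ALTER TABLE meta ADD COLUMN "%s" TEXT' % name)
                known.add(name)
        cols = ['cache_id'] + list(meta)
        sql = ('INSERT OR REPLACE INTO meta (%s) VALUES (%s)'
               % (', '.join('"%s"' % c for c in cols),
                  ', '.join('?' * len(cols))))
        db.execute(sql, [cache_id] + [meta[c] for c in meta])

    # and we get real queries back:
    index_metadata('bddc0d6c54d81e2c',
                   {'tpblue_ID': 'esj', 'xforward': '66.35.250.225'})
    for row in db.execute('SELECT cache_id FROM meta WHERE "tpblue_ID" = ?',
                          ('esj',)):
        print(row[0])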