[Assorted-commits] SF.net SVN: assorted: [725] mailing-list-filter
From: <yan...@us...> - 2008-05-08 08:29:51
Revision: 725
          http://assorted.svn.sourceforge.net/assorted/?rev=725&view=rev
Author:   yangzhang
Date:     2008-05-08 01:29:57 -0700 (Thu, 08 May 2008)

Log Message:
-----------
tagged 0.1 release

Added Paths:
-----------
    mailing-list-filter/tags/
    mailing-list-filter/tags/0.1/
    mailing-list-filter/tags/0.1/README
    mailing-list-filter/tags/0.1/publish.bash
    mailing-list-filter/tags/0.1/setup.py
    mailing-list-filter/tags/0.1/src/mlf.py

Removed Paths:
-------------
    mailing-list-filter/tags/0.1/src/filter.py

Copied: mailing-list-filter/tags/0.1 (from rev 704, mailing-list-filter/trunk)

Copied: mailing-list-filter/tags/0.1/README (from rev 720, mailing-list-filter/trunk/README)
===================================================================
--- mailing-list-filter/tags/0.1/README (rev 0)
+++ mailing-list-filter/tags/0.1/README 2008-05-08 08:29:57 UTC (rev 725)
@@ -0,0 +1,63 @@
+Overview
+--------
+
+I have a Gmail account that I use for subscribing to and posting to mailing
+lists. When dealing with high-volume mailing lists, I am typically only
+interested in those threads that I participated in. This is a simple filter
+for starring and marking unread any messages belonging to such threads.
+
+This is accomplished by looking at the set of messages that were either sent
+from me or explicitly addressed to me. From this "root set" of messages, we
+can use the `Message-ID`, `References`, and `In-Reply-To` headers to determine
+threads, and thus the other messages that we care about.
+
+I have found this to be more accurate than my two original approaches. I used
+to have Gmail filters that starred/marked unread any messages containing my
+name anywhere in the message. This worked OK since my name is not too common,
+but it produced some false positives (not that bad, just unstar messages) and
+some false negatives (much harder to detect).
+
+A second approach is to tag all subjects with some signature string. This
+usually is fine, but it doesn't work when you did not start the thread (and
+thus determine the subject). You can try to change the subject line, but this
+is (1) poor netiquette, (2) unreliable because your reply may not register in
+other mail clients as being part of the same thread (and thus other
+participants may miss your reply), and (3) unreliable because replies might not
+directly reference your post (either intentionally or unintentionally). It
+also fails when others change the subject. Finally, this approach is
+unsatisfactory because it pollutes subject lines, and it essentially replicates
+exactly what Message-ID was intended for.
+
+This script is not intended to be a replacement for the Gmail filters. I still
+keep those active so that I can get immediate first-pass filtering. I execute
+this script on a daily basis to perform second-pass filtering/unfiltering to
+catch those false negatives that may have been missed.
+
+Setup
+-----
+
+Requirements:
+
+- [argparse](http://argparse.python-hosting.com/)
+- [Python Commons](http://assorted.sf.net/python-commons/) 0.4
+- [path](http://www.jorendorff.com/articles/python/path/)
+
+Install the program using the standard `setup.py` program.
+
+Future Work Ideas
+-----------------
+
+- Currently, we assume that the server specification points to a mailbox
+  containing all messages (both sent and received), and a message is determined
+  to have been sent by you by looking at the From: header field. This works
+  well with Gmail. An alternative strategy is to look through two folders, one
+  that's the Inbox and one that's the Sent mailbox, and treat all messages in
+  Sent as having been sent by you. This is presumably how most other IMAP
+  servers work.
+
+- Implement incremental maintenance of local cache.
+
+- Accept custom operations for filtered/unfiltered messages
+  (trashing/untrashing, labeling/unlabeling, etc.).
+
+- Refactor the message fetching/management part out into its own library.
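
The thread reconstruction described in the README (start from a "root set", then follow `Message-ID`, `References`, and `In-Reply-To`) can be sketched roughly as below. This is a minimal Python 3 illustration, not the committed `mlf.py`, which targets Python 2 and works on `email` message objects fetched over IMAP. The function names `extract_refs`, `group_threads`, and `relevant_ids`, and the plain-dict message representation, are assumptions made for the example.

```python
from collections import defaultdict


def extract_refs(headers):
    """Return the set of Message-IDs named in In-Reply-To and References.

    `headers` is a hypothetical dict of header name -> list of raw values.
    """
    raw = ' '.join(headers.get('In-Reply-To', []) + headers.get('References', []))
    # Adjacent IDs are sometimes written without a separating space.
    return set(raw.replace('><', '> <').split())


def group_threads(messages):
    """Map each Message-ID to a thread number (one per connected component)."""
    neighbors = defaultdict(set)
    for mid, headers in messages.items():
        for ref in extract_refs(headers):
            if ref in messages:  # ignore references to messages we do not have
                neighbors[mid].add(ref)
                neighbors[ref].add(mid)

    thread_of = {}
    next_tid = 0
    for mid in messages:
        if mid in thread_of:
            continue
        # Iterative depth-first search over the undirected reference graph.
        thread_of[mid] = next_tid
        stack = [mid]
        while stack:
            for nb in neighbors[stack.pop()]:
                if nb not in thread_of:
                    thread_of[nb] = next_tid
                    stack.append(nb)
        next_tid += 1
    return thread_of


def relevant_ids(messages, root_set):
    """Return every Message-ID that shares a thread with a root-set message."""
    thread_of = group_threads(messages)
    wanted = {thread_of[mid] for mid in root_set if mid in thread_of}
    return {mid for mid, tid in thread_of.items() if tid in wanted}
```

Treating references as undirected edges means a reply that cites only an ancestor still lands in the same thread as its siblings, which is the property the README relies on when it prefers headers over subject-line tagging.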
Copied: mailing-list-filter/tags/0.1/publish.bash (from rev 717, mailing-list-filter/trunk/publish.bash)
===================================================================
--- mailing-list-filter/tags/0.1/publish.bash (rev 0)
+++ mailing-list-filter/tags/0.1/publish.bash 2008-05-08 08:29:57 UTC (rev 725)
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+fullname='Mailing List Filter'
+version=0.1
+license=psf
+websrcs=( README )
+rels=( pypi: )
+. assorted.bash "$@"

Copied: mailing-list-filter/tags/0.1/setup.py (from rev 721, mailing-list-filter/trunk/setup.py)
===================================================================
--- mailing-list-filter/tags/0.1/setup.py (rev 0)
+++ mailing-list-filter/tags/0.1/setup.py 2008-05-08 08:29:57 UTC (rev 725)
@@ -0,0 +1,28 @@
+#!/usr/bin/env python
+
+from commons.setup import run_setup
+
+pkg_info_text = """
+Metadata-Version: 1.1
+Name: mailing-list-filter
+Version: 0.1
+Author: Yang Zhang
+Author-email: yaaang NOSPAM at REMOVECAPS gmail
+Home-page: http://assorted.sourceforge.net/mailing-list-filter/
+Download-url: http://pypi.python.org/pypi/mailing-list-filter/
+Summary: Mailing List Filter
+License: Python Software Foundation License
+Description: Filter mailing list email for relevant threads only.
+Keywords: mailing,list,email,filter,IMAP,Gmail
+Platform: any
+Provides: commons
+Classifier: Development Status :: 4 - Beta
+Classifier: Environment :: No Input/Output (Daemon)
+Classifier: Intended Audience :: End Users/Desktop
+Classifier: License :: OSI Approved :: Python Software Foundation License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python
+Classifier: Topic :: Communications :: Email
+"""
+
+run_setup(pkg_info_text, scripts = ['src/mlf.py'])

Deleted: mailing-list-filter/tags/0.1/src/filter.py
===================================================================
--- mailing-list-filter/trunk/src/filter.py 2008-05-07 16:06:28 UTC (rev 704)
+++ mailing-list-filter/tags/0.1/src/filter.py 2008-05-08 08:29:57 UTC (rev 725)
@@ -1,149 +0,0 @@
-#!/usr/bin/env python
-
-"""
-Given an IMAP mailbox, mark all messages as read except for those threads in
-which you were a participant, where thread grouping is performed via the
-In-Reply-To and References headers.
-
-Currently, we assume that the server specification points to a mailbox
-containing all messages (both sent and received), and a message is determined
-to have been sent by you by looking at the From: header field. This should work
-well with Gmail. An alternative strategy is to look through two folders, one
-that's the Inbox and one that's the Sent mailbox, and treat all messages in
-Sent as having been sent by you.
-"""
-
-from __future__ import with_statement
-from collections import defaultdict
-from email import message_from_string
-from getpass import getpass
-from imaplib import IMAP4_SSL
-from argparse import ArgumentParser
-from path import path
-from re import match
-from functools import partial
-from commons.decs import pickle_memoized
-from commons.log import *
-from commons.files import cleanse_filename, soft_makedirs
-from commons.misc import default_if_none
-from commons.networking import logout
-from commons.seqs import concat, grouper
-from commons.startup import run_main
-from contextlib import closing
-
-info = partial(info, '')
-debug = partial(debug, '')
-error = partial(error, '')
-die = partial(die, '')
-
-def getmail(imap):
-  info( 'finding max seqno' )
-  ok, [seqnos] = imap.search(None, 'ALL')
-  maxseqno = int( seqnos.split()[-1] )
-  del seqnos
-
-  info( 'actually fetching the messages in chunks' )
-  # The syntax/fields of the FETCH command is documented in RFC 2060. Also,
-  # this article contains a brief overview:
-  # http://www.devshed.com/c/a/Python/Python-Email-Libraries-part-2-IMAP/3/
-  # BODY.PEEK prevents the message from automatically being flagged as \Seen.
-  query = '(FLAGS BODY.PEEK[HEADER.FIELDS (Message-ID References In-Reply-To From Subject)])'
-  step = 1000
-  return list( concat(
-    imap.fetch('%d:%d' % (start, start + step - 1), query)[1]
-    for start in xrange(1, maxseqno + 1, step) ) )
-
-def main(argv):
-  import logging
-  config_logging(level = logging.INFO, do_console = True)
-
-  p = ArgumentParser(description = __doc__)
-  p.add_argument('--credfile', default = path( '~/.mlf.auth' ).expanduser(),
-      help = """File containing your login credentials, with the username on the
-      first line and the password on the second line. Ignored iff --prompt.""")
-  p.add_argument('--cachedir', default = path( '~/.mlf.cache' ).expanduser(),
-      help = "Directory to use for caching our data.")
-  p.add_argument('--prompt', action = 'store_true',
-      help = "Interactively prompt for the username and password.")
-  p.add_argument('sender',
-      help = "Your email address.")
-  p.add_argument('server',
-      help = "The server in the format: <host>[:<port>][/<mailbox>].")
-
-  cfg = p.parse_args(argv[1:])
-
-  if cfg.prompt:
-    print "username:",
-    cfg.user = raw_input()
-    print "password:",
-    cfg.passwd = getpass()
-  else:
-    with file(cfg.credfile) as f:
-      [cfg.user, cfg.passwd] = map(lambda x: x.strip('\r\n'), f.readlines())
-
-  try:
-    m = match( r'(?P<host>[^:/]+)(:(?P<port>\d+))?(/(?P<mailbox>.+))?$', cfg.server )
-    cfg.host = m.group('host')
-    cfg.port = int( default_if_none(m.group('port'), 993) )
-    cfg.mailbox = default_if_none(m.group('mailbox'), 'INBOX')
-  except:
-    p.error('Need to specify the server in the correct format.')
-
-  soft_makedirs(cfg.cachedir)
-
-  with logout(IMAP4_SSL(cfg.host, cfg.port)) as imap:
-    imap.login(cfg.user, cfg.passwd)
-    with closing(imap) as imap:
-      # Select the main mailbox (INBOX).
-      imap.select(cfg.mailbox)
-
-      # Fetch message IDs, references, and senders.
-      xs = pickle_memoized \
-          (lambda imap: cfg.cachedir / cleanse_filename(cfg.sender)) \
-          (getmail) \
-          (imap)
-
-      debug('fetched:', xs)
-
-      info('determining the set of messages that were sent by you')
-
-      sent = set()
-      for (envelope, data), paren in grouper(2, xs):
-        msg = message_from_string(data)
-        if cfg.sender in msg['From']:
-          sent.add( msg['Message-ID'] )
-
-      info( 'find the threads in which I am a participant' )
-
-      # Every second item is just a closing paren.
-      # Example data:
-      # [('13300 (BODY[HEADER.FIELDS (Message-ID References In-Reply-To)] {67}',
-      #   'Message-ID: <mai...@py...>\r\n\r\n'),
-      #  ')',
-      #  ('13301 (BODY[HEADER.FIELDS (Message-ID References In-Reply-To)] {59}',
-      #   'Message-Id: <200...@hv...>\r\n\r\n'),
-      #  ')',
-      #  ('13302 (BODY[HEADER.FIELDS (Message-ID References In-Reply-To)] {92}',
-      #   'Message-ID: <C43EAFC0.2E3AE%ni...@ya...>\r\nIn-Reply-To: <481...@gm...>\r\n\r\n')]
-      for (envelope, data), paren in grouper(2, xs):
-        m = match( r"(?P<seqno>\d+) \(FLAGS \((?P<flags>[^)]+)\)", envelope )
-        seqno = m.group('seqno')
-        flags = m.group('flags')
-        if r'\Flagged' in flags: # flags != r'\Seen' and flags != r'\Seen NonJunk':
-          print 'FLAG'
-          print seqno, flags
-          print '\n'.join( map( str, msg.items() ) )
-          print
-        msg = message_from_string(data)
-        id = msg['Message-ID']
-        irt = default_if_none( msg.get_all('In-Reply-To'), [] )
-        refs = default_if_none( msg.get_all('References'), [] )
-        refs = set( ' '.join( irt + refs ).split() )
-        if refs & sent:
-          print 'SENT'
-          print seqno, flags
-          print '\n'.join( map( str, msg.items() ) )
-          print
-#        if refs & sent:
-
-run_main()

Copied: mailing-list-filter/tags/0.1/src/mlf.py (from rev 722, mailing-list-filter/trunk/src/mlf.py)
===================================================================
--- mailing-list-filter/tags/0.1/src/mlf.py (rev 0)
+++ mailing-list-filter/tags/0.1/src/mlf.py 2008-05-08 08:29:57 UTC (rev 725)
@@ -0,0 +1,236 @@
+#!/usr/bin/env python
+
+"""
+Given a Gmail IMAP mailbox, star all messages in which you were a participant
+(either a sender or an explicit recipient in To: or Cc:), where thread grouping
+is performed via the In-Reply-To and References headers.
+"""
+
+from __future__ import with_statement
+from collections import defaultdict
+from email import message_from_string
+from getpass import getpass
+from imaplib import IMAP4_SSL
+from argparse import ArgumentParser
+from path import path
+from re import match
+from functools import partial
+from itertools import count
+from commons.decs import pickle_memoized
+from commons.files import cleanse_filename, soft_makedirs
+from commons.log import *
+from commons.misc import default_if_none, seq
+from commons.networking import logout
+from commons.seqs import concat, grouper
+from commons.startup import run_main
+from contextlib import closing
+import logging
+from commons import log
+
+info = partial(log.info, 'main')
+debug = partial(log.debug, 'main')
+warning = partial(log.warning, 'main')
+error = partial(log.error, 'main')
+die = partial(log.die, 'main')
+
+def thread_dfs(msg, tid, tid2msgs):
+  assert msg.tid is None
+  msg.tid = tid
+  tid2msgs[tid].append(msg)
+  for ref in msg.refs:
+    if ref.tid is None:
+      thread_dfs(ref, tid, tid2msgs)
+    else:
+      assert ref.tid == tid
+
+def getmail(imap):
+  info( 'finding max UID' )
+  # We use UIDs rather than the default of sequence numbers because UIDs are
+  # guaranteed to be persistent across sessions. This means that we can, for
+  # instance, fetch messages in one session and operate on this locally cached
+  # data before marking messages in a separate session.
+  ok, [uids] = imap.uid('SEARCH', None, 'ALL')
+  maxuid = int( uids.split()[-1] )
+  del uids
+
+  info( 'actually fetching the messages in chunks up to max', maxuid )
+  # The syntax/fields of the FETCH command is documented in RFC 2060. Also,
+  # this article contains a brief overview:
+  # http://www.devshed.com/c/a/Python/Python-Email-Libraries-part-2-IMAP/3/
+  # BODY.PEEK prevents the message from automatically being flagged as \Seen.
+  query = '(FLAGS BODY.PEEK[HEADER.FIELDS ' \
+          '(Message-ID References In-Reply-To From To Cc Subject)])'
+  step = 1000
+  return list( concat(
+    seq( lambda: info('fetching', start, 'to', start + step - 1),
+         lambda: imap.uid('FETCH', '%d:%d' % (start, start + step - 1),
+                          query)[1] )
+    for start in xrange(1, maxuid + 1, step) ) )
+
+def main(argv):
+  p = ArgumentParser(description = __doc__)
+  p.add_argument('--credfile', default = path( '~/.mlf.auth' ).expanduser(),
+      help = """File containing your login credentials, with the username on the
+      first line and the password on the second line. Ignored iff --prompt.""")
+  p.add_argument('--cachedir', default = path( '~/.mlf.cache' ).expanduser(),
+      help = "Directory to use for caching our data.")
+  p.add_argument('--prompt', action = 'store_true',
+      help = "Interactively prompt for the username and password.")
+  p.add_argument('--pretend', action = 'store_true',
+      help = """Do not actually carry out any updates to the server. Use in
+      conjunction with --debug to observe what would happen.""")
+  p.add_argument('--no-mark-unseen', action = 'store_true',
+      help = "Do not mark newly relevant threads as unread.")
+  p.add_argument('--no-mark-seen', action = 'store_true',
+      help = "Do not mark newly irrelevant threads as read.")
+  p.add_argument('--debug', action = 'append',
+      help = """Enable logging for messages of the given flags. Flags include:
+      refs (references to missing Message-IDs), dups (duplicate Message-IDs),
+      main (the main program logic), star (which messages are being
+      starred), and unstar (which messages are being unstarred).""")
+  p.add_argument('sender',
+      help = "Your email address.")
+  p.add_argument('server',
+      help = "The server in the format: <host>[:<port>][/<mailbox>].")
+
+  cfg = p.parse_args(argv[1:])
+
+  config_logging(level = logging.ERROR, do_console = True, flags = cfg.debug)
+
+  if cfg.prompt:
+    print "username:",
+    cfg.user = raw_input()
+    print "password:",
+    cfg.passwd = getpass()
+  else:
+    with file(cfg.credfile) as f:
+      [cfg.user, cfg.passwd] = map(lambda x: x.strip('\r\n'), f.readlines())
+
+  try:
+    m = match( r'(?P<host>[^:/]+)(:(?P<port>\d+))?(/(?P<mailbox>.+))?$',
+        cfg.server )
+    cfg.host = m.group('host')
+    cfg.port = int( default_if_none(m.group('port'), 993) )
+    cfg.mailbox = default_if_none(m.group('mailbox'), 'INBOX')
+  except:
+    p.error('Need to specify the server in the correct format.')
+
+  soft_makedirs(cfg.cachedir)
+
+  with logout(IMAP4_SSL(cfg.host, cfg.port)) as imap:
+    imap.login(cfg.user, cfg.passwd)
+    # Close is only valid in the authenticated state.
+    with closing(imap) as imap:
+      # Select the main mailbox (INBOX).
+      imap.select(cfg.mailbox)
+
+      # Fetch message IDs, references, and senders.
+      xs = pickle_memoized \
+          (lambda imap: cfg.cachedir / cleanse_filename(cfg.sender)) \
+          (getmail) \
+          (imap)
+
+      log.debug('fetched', xs)
+
+      info('building message-id map and determining the set of messages sent '
+           'by you or addressed to you (the "source set")')
+
+      srcs = []
+      mid2msg = {}
+      # Every second item is just a closing paren.
+      # Example data:
+      # [('13300 (BODY[HEADER.FIELDS (Message-ID References In-Reply-To)] {67}',
+      #   'Message-ID: <mai...@py...>\r\n\r\n'),
+      #  ')',
+      #  ('13301 (BODY[HEADER.FIELDS (Message-ID References In-Reply-To)] {59}',
+      #   'Message-Id: <200...@hv...>\r\n\r\n'),
+      #  ')',
+      #  ('13302 (BODY[HEADER.FIELDS (Message-ID References In-Reply-To)] {92}',
+      #   'Message-ID: <C43EAFC0.2E3AE%ni...@ya...>\r\nIn-Reply-To: <481...@gm...>\r\n\r\n')]
+      for (envelope, data), paren in grouper(2, xs):
+        # Parse the body.
+        msg = message_from_string(data)
+
+        # Parse the envelope.
+        m = match(
+            r"(?P<seqno>\d+) \(UID (?P<uid>\d+) FLAGS \((?P<flags>[^)]+)\)",
+            envelope )
+        msg.seqno = m.group('seqno')
+        msg.uid = m.group('uid')
+        msg.flags = m.group('flags').split()
+
+        # Prepare a container for references to other msgs, and initialize the
+        # thread ID.
+        msg.refs = []
+        msg.tid = None
+
+        # Add these to the map.
+        if msg['Message-ID'] in mid2msg:
+          log.warning( 'dups', 'duplicate message IDs:',
+              msg['Message-ID'], msg['Subject'] )
+        mid2msg[ msg['Message-ID'] ] = msg
+
+        # Add to "srcs" set if sent by us or addressed to us.
+        if ( cfg.sender in default_if_none( msg['From'], '' ) or
+             cfg.sender in default_if_none( msg['To'], '' ) or
+             cfg.sender in default_if_none( msg['Cc'], '' ) ):
+          srcs.append( msg )
+
+      info( 'constructing undirected graph' )
+
+      for mid, msg in mid2msg.iteritems():
+        # Extract any references.
+        irt = default_if_none( msg.get_all('In-Reply-To'), [] )
+        refs = default_if_none( msg.get_all('References'), [] )
+        refs = set( ' '.join( irt + refs ).replace('><', '> <').split() )
+
+        # Connect nodes in graph bidirectionally. Ignore references to MIDs
+        # that don't exist.
+        for ref in refs:
+          try:
+            refmsg = mid2msg[ref]
+            # We can use lists/append (not worry about duplicates) because the
+            # original sources should be acyclic. If a -> b, then there is no b ->
+            # a, so when crawling a we can add a <-> b without worrying that later
+            # we may re-add b -> a.
+            msg.refs.append(refmsg)
+            refmsg.refs.append(msg)
+          except:
+            log.warning( 'refs', ref )
+
+      info('finding connected components (grouping the messages into threads)')
+
+      tids = count()
+      tid2msgs = defaultdict(list)
+      for mid, msg in mid2msg.iteritems():
+        if msg.tid is None:
+          thread_dfs(msg, tids.next(), tid2msgs)
+
+      info( 'starring the relevant threads, in which I am a participant' )
+
+      rel_tids = set()
+      for srcmsg in srcs:
+        if srcmsg.tid not in rel_tids:
+          rel_tids.add(srcmsg.tid)
+          for msg in tid2msgs[srcmsg.tid]:
+            if r'\Flagged' not in msg.flags:
+              log.info( 'star', '\n', msg )
+              if not cfg.pretend:
+                imap.uid('STORE', msg.uid, '+FLAGS', r'\Flagged')
+                if not cfg.no_mark_unseen and r'\Seen' in msg.flags:
+                  imap.uid('STORE', msg.uid, '-FLAGS', r'\Seen')
+
+      info( 'unstarring irrelevant threads, in which I am not a participant' )
+
+      all_tids = set( tid2msgs.iterkeys() )
+      irrel_tids = all_tids - rel_tids
+      for tid in irrel_tids:
+        for msg in tid2msgs[tid]:
+          if r'\Flagged' in msg.flags:
+            log.info( 'unstar', '\n', msg )
+            if not cfg.pretend:
+              imap.uid('STORE', msg.uid, '-FLAGS', r'\Flagged')
+              if not cfg.no_mark_seen and r'\Seen' not in msg.flags:
+                imap.uid('STORE', msg.uid, '+FLAGS', r'\Seen')
+
+run_main()
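
The `server` argument format `<host>[:<port>][/<mailbox>]` that `main` parses can be tried out in isolation. The following is a minimal Python 3 sketch that reuses the regular expression from `mlf.py`. The helper name `parse_server` and the sample host and mailbox strings in the comments are hypothetical; the defaults (port 993, mailbox `INBOX`) are the ones used in the committed code.

```python
import re

# Same pattern as in mlf.py: host, optional ":port", optional "/mailbox".
SERVER_RE = re.compile(r'(?P<host>[^:/]+)(:(?P<port>\d+))?(/(?P<mailbox>.+))?$')


def parse_server(spec, default_port=993, default_mailbox='INBOX'):
    """Split a server spec into (host, port, mailbox).

    Hypothetical helper; raises ValueError instead of calling the argument
    parser's error() as mlf.py does.
    """
    m = SERVER_RE.match(spec)
    if m is None:
        raise ValueError('Need to specify the server in the correct format.')
    port = int(m.group('port')) if m.group('port') is not None else default_port
    mailbox = m.group('mailbox') if m.group('mailbox') is not None else default_mailbox
    return m.group('host'), port, mailbox


# Illustrative inputs only (the host and mailbox names are made up for the example):
#   parse_server('mail.example.org')
#   parse_server('mail.example.org:143/Lists/All Mail')
```

Because the mailbox group is `.+`, everything after the first `/` is taken as the mailbox name, so names that themselves contain slashes or spaces pass through unchanged.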