[Assorted-commits] SF.net SVN: assorted:[1586] mailing-list-filter/trunk/README

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Revision: 1586
          http://assorted.svn.sourceforge.net/assorted/?rev=1586&view=rev
Author:   yangzhang
Date:     2010-03-31 02:49:43 +0000 (Wed, 31 Mar 2010)

Log Message:
-----------
added story (orig blog post) to README

Modified Paths:
--------------
    mailing-list-filter/trunk/README

Modified: mailing-list-filter/trunk/README
===================================================================

--- mailing-list-filter/trunk/README	2010-03-23 20:47:28 UTC (rev 1585)
+++ mailing-list-filter/trunk/README	2010-03-31 02:49:43 UTC (rev 1586)
@@ -43,6 +43,51 @@
 
 Install the program using the standard `setup.py` program.
 
+Story
+-----
+
+First it was fine. Few msgs. Then OOM. Then wrote disk-based version using
+shelf. Ran for way too long. Custom shelf using sqlite. Naive bidirectional
+flagging = slow.
+
+Just make a pass, and whenever you come across a message you sent, then add it
+to a set `flagged`, and recursively scan everything it points to. This will
+miss things pointing to it, though.
+
+What if we make a second pass? That doesn't work, because messages may have
+been received out of order.
+
+What if we sort by timestamp? There are no reliable timestamp headers.
+
+What if we do a topological sort (using buffered repository trees
+http://docs.google.com/viewer?a=v&q=cache:ps9Mi3deobAJ:www.learnignorance.com/un/papers/a-curtis-topo-sorting-in-external-memory.pdf+external+topological+graph+sort&hl=en&gl=us&sig=AHIEtbR0xTrytGJ2iKBQoJHGCD8datGXYQ)?
+There may be missing references/gaps; not necessarily trees.
+
+The solution is to patch up these.
+
+Take advantage of the spatial locality during the scan.
+
+http://1060.org/blogxter/entry?publicid=4C423C78012A7BEF0C9C6BB33B8E406C&token=
+
+The Map-Reduce programming model doesn't lend itself well to certain graph
+processing tasks---for instance, computing shortest paths requires $n$ MR jobs,
+where $n$ is the longest path (though once enough paths have been resolved and
+the graph can be pruned to a small-enough size, one can crunch on a smaller
+problem).
+
+Google has a large scale graph processing system called [Pregel], of which
+there's not a lot of information. It implements the [Bulk Synchronous
+Parallel][BSP] model of computation, where each node does a bunch of work; this
+can in turn be implemented in terms of Map-Reduce (even though the Google
+project doesn't necessarily do so). [Apache Hama] is the open-source
+counterpart to Pregel and builds atop Hadoop.
+
+[Pregel]: http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
+[BSP]: http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel
+[Apache Hama]: http://incubator.apache.org/hama/
+
+http://portal.acm.org/citation.cfm?id=1582716.1582723
+
 Todo
 ----
 


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.