[Assorted-commits] SF.net SVN: assorted:[1586] mailing-list-filter/trunk/README
Brought to you by:
yangzhang
From: <yan...@us...> - 2010-03-31 02:49:50
|
Revision: 1586 http://assorted.svn.sourceforge.net/assorted/?rev=1586&view=rev Author: yangzhang Date: 2010-03-31 02:49:43 +0000 (Wed, 31 Mar 2010) Log Message: ----------- added story (orig blog post) to README Modified Paths: -------------- mailing-list-filter/trunk/README Modified: mailing-list-filter/trunk/README =================================================================== --- mailing-list-filter/trunk/README 2010-03-23 20:47:28 UTC (rev 1585) +++ mailing-list-filter/trunk/README 2010-03-31 02:49:43 UTC (rev 1586) @@ -43,6 +43,51 @@ Install the program using the standard `setup.py` program. +Story +----- + +First it was fine. Few msgs. Then OOM. Then wrote disk-based version using +shelf. Ran for way too long. Custom shelf using sqlite. Naive bidirectional +flagging = slow. + +Just make a pass, and whenever you come across a message you sent, then add it +to a set `flagged`, and recursively scan everything it points to. This will +miss things pointing to it, though. + +What if we make a second pass? That doesn't work, because messages may have +been received out of order. + +What if we sort by timestamp? There are no reliable timestamp headers. + +What if we do a topological sort (using buffered repository trees +http://docs.google.com/viewer?a=v&q=cache:ps9Mi3deobAJ:www.learnignorance.com/un/papers/a-curtis-topo-sorting-in-external-memory.pdf+external+topological+graph+sort&hl=en&gl=us&sig=AHIEtbR0xTrytGJ2iKBQoJHGCD8datGXYQ)? +There may be missing references/gaps; not necessarily trees. + +The solution is to patch up these. + +Take advantage of the spatial locality during the scan. + +http://1060.org/blogxter/entry?publicid=4C423C78012A7BEF0C9C6BB33B8E406C&token= + +The Map-Reduce programming model doesn't lend itself well to certain graph +processing tasks---for instance, computing shortest paths requires $n$ MR jobs, +where $n$ is the longest path (though once enough paths have been resolved and +the graph can be pruned to a small-enough size, one can crunch on a smaller +problem). + +Google has a large scale graph processing system called [Pregel], of which +there's not a lot of information. It implements the [Bulk Synchronous +Parallel][BSP] model of computation, where each node does a bunch of work; this +can in turn be implemented in terms of Map-Reduce (even though the Google +project doesn't necessarily do so). [Apache Hama] is the open-source +counterpart to Pregel and builds atop Hadoop. + +[Pregel]: http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html +[BSP]: http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel +[Apache Hama]: http://incubator.apache.org/hama/ + +http://portal.acm.org/citation.cfm?id=1582716.1582723 + Todo ---- This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |