[Assorted-commits] SF.net SVN: assorted: [414] hash-join/trunk/README
From: <yan...@us...> - 2008-02-14 20:33:30
Revision: 414
          http://assorted.svn.sourceforge.net/assorted/?rev=414&view=rev
Author:   yangzhang
Date:     2008-02-14 12:33:35 -0800 (Thu, 14 Feb 2008)

Log Message:
-----------
added more to readme

Modified Paths:
--------------
    hash-join/trunk/README

Modified: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README	2008-02-14 20:33:16 UTC (rev 413)
+++ hash-join/trunk/README	2008-02-14 20:33:35 UTC (rev 414)
@@ -7,34 +7,79 @@
 This is a simple implementation of parallel hash joins. I'm using this as a
 first step in studying the performance problems in multicore systems
 programming. This implementation is tailored for a particular dataset, the IMDB
-`movies.list` and `actresses.list` files, which may be found [here].
+`movies.list` and `actresses.list` files, which may be found at [here].
 
+To learn more about this algorithm, have a look at [Multiprocessor Hash-Based
+Join Algorithms]. In short:
+
+- The task is to join the set of (movie title, movie release year) with
+  (actress name, movie title) on equal movie titles. For instance, we may be
+  trying to answer the question "What is the average duration of an actress'
+  career?" (Movie titles and actress names are unique.)
+- Each of the $n$ nodes begins with an equal fraction of both datasets.
+- Stage 1 (_hash-partition_): each node hash-partitions its fraction into $n$
+  buckets, where bucket $i$ is destined for node $i$. This is done for each
+  dataset.
+- Stage 2 (_build_): each node receives the movie buckets destined for it from
+  all nodes, and builds a hash table mapping the join key (movie titles) to
+  tuples. This is done for the smaller of the two datasets, which in our case
+  is the movies dataset.
+- Stage 3 (_probe_): each node receives the actress buckets destined for it
+  from all nodes, and probes into the hash table using the movie title. If
+  there is a match, it emits the resulting joined tuple (movie title, movie
+  release year, actress name).
+
+Here are some [results].
+
 Requirements
 ------------
 
-- [C++ Commons] svn r370+
-- [libstdc++] v4.1
+- [C++ Commons]
+- [g++] 4.2
+- [libstdc++] 4.2
 
+Building
+--------
+
+C++ Commons is a utility library I maintain; you'll need to make it visible to
+hash-join:
+
+    $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/cpp-commons/trunk cpp-commons
+    $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/hash-join/trunk hash-join
+    $ ln -s "$PWD/cpp-commons/src/commons" hash-join/src/
+    $ cd hash-join/src/
+    $ make opt
+    $ ./hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
+
 Supporting Tools
 ----------------
 
 `DbPrep` filters the `.list` files to prepare them to be parsed by the hash
-join.
+join. This can additionally inflate the dataset size while maintaining roughly
+the same join selectivity as the original dataset. I have also made [some
+datasets] available for your use.
 
 `LogProc` processes stdout concatenated from multiple runs of the program. This
 will produce the time and speedup plots illustrating the scalability of the
 system. This has actually been made into a generic tool and will be moved to
-its own project directory later.
+a utility project later.
 
 `Titles` extracts the titles from the output of `DbPrep` on `movies.list`.
 
+These tools all depend on the [Scala Commons].
+
 Related
 -------
 
 I used [HashDist] to experiment with the chaining of various hash functions on
-this dataset and observe the distribution.
+this dataset and to observe the resulting distributions.
 
+[C++ Commons]: http://assorted.sf.net/cpp-commons/
+[HashDist]: http://assorted.sf.net/
+[Multiprocessor Hash-Based Join Algorithms]: http://citeseer.ist.psu.edu/50143.html
+[Scala Commons]: http://assorted.sf.net/scala-commons/
+[g++]: http://gcc.gnu.org/
 [here]: http://us.imdb.com/interfaces#plain
 [libstdc++]: http://gcc.gnu.org/libstdc++/
-[C++ Commons]: http://assorted.sf.net/
-[HashDist]: http://assorted.sf.net/
+[results]: analysis.html
+[some datasets]: http://people.csail.mit.edu/yang/movie-data/
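
For readers skimming this commit, here is a minimal, single-process sketch of the three stages the new README describes (hash-partition, build, probe). It is not the code in hash-join/trunk: the record types, the node count, the toy data, and the in-memory bucket vectors standing in for the inter-node exchange are all illustrative assumptions.

    // Minimal sketch of the three-stage parallel hash join described in the
    // README above.  Record types, node count, and toy data are illustrative
    // assumptions, not the actual hash-join implementation.
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Movie  { std::string title; int year; };             // (movie title, release year)
    struct Acted  { std::string actress; std::string title; };  // (actress name, movie title)
    struct Joined { std::string title; int year; std::string actress; };

    const std::size_t kNodes = 4;  // n: number of nodes (arbitrary here)

    // Which node owns a given join key (movie title).
    std::size_t node_of(const std::string& title) {
      return std::hash<std::string>{}(title) % kNodes;
    }

    int main() {
      // Tiny made-up dataset standing in for the IMDB lists; in the real
      // program each node would start with its own fraction of both inputs.
      std::vector<Movie> movies = {{"Alien", 1979}, {"Clue", 1985}, {"Gia", 1998}};
      std::vector<Acted> acted  = {{"Sigourney Weaver", "Alien"},
                                   {"Madeline Kahn", "Clue"},
                                   {"Angelina Jolie", "Gia"},
                                   {"Veronica Cartwright", "Alien"}};

      // Stage 1 (hash-partition): route every tuple to the node that owns its
      // join key.  movie_buckets[i] / acted_buckets[i] model what node i
      // would receive from all of the nodes combined.
      std::vector<std::vector<Movie>> movie_buckets(kNodes);
      std::vector<std::vector<Acted>> acted_buckets(kNodes);
      for (const Movie& m : movies) movie_buckets[node_of(m.title)].push_back(m);
      for (const Acted& a : acted)  acted_buckets[node_of(a.title)].push_back(a);

      std::vector<Joined> output;
      for (std::size_t node = 0; node < kNodes; ++node) {
        // Stage 2 (build): hash table over the smaller relation (movies),
        // mapping the join key to its tuple.
        std::unordered_map<std::string, int> year_by_title;
        for (const Movie& m : movie_buckets[node]) year_by_title[m.title] = m.year;

        // Stage 3 (probe): probe with each actress tuple; on a match, emit the
        // joined tuple (movie title, release year, actress name).
        for (const Acted& a : acted_buckets[node]) {
          auto it = year_by_title.find(a.title);
          if (it != year_by_title.end())
            output.push_back({a.title, it->second, a.actress});
        }
      }

      for (const Joined& j : output)
        std::cout << j.title << '\t' << j.year << '\t' << j.actress << '\n';
    }

In the real multicore setting each iteration of the per-node loop would run on its own core and overlap with shipping buckets between nodes; the hash table in Stage 2 is built over the movies side because it is the smaller relation.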