Thread: [Assorted-commits] SF.net SVN: assorted: [336] hash-join/trunk/README
From: <yan...@us...> - 2008-02-07 02:07:12
Revision: 336
          http://assorted.svn.sourceforge.net/assorted/?rev=336&view=rev
Author:   yangzhang
Date:     2008-02-06 18:07:16 -0800 (Wed, 06 Feb 2008)

Log Message:
-----------
added readme

Added Paths:
-----------
    hash-join/trunk/README

Added: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README                          (rev 0)
+++ hash-join/trunk/README  2008-02-07 02:07:16 UTC (rev 336)
@@ -0,0 +1,14 @@
+This is a simple implementation of parallel hash joins. I'm using this as a
+first step in studying the performance problems in multicore systems
+programming. This implementation is tailored for a particular dataset, the IMDB
+`movies.list` and `actresses.list` files, which may be found [here].
+
+The `tools/` directory contains `DbPrep.scala`, which is a filter for the
+`.list` files to prepare them to be more easily parsed by the hash join
+application.
+
+The `tools/` directory also contains `LogProc.scala`, which processes stdout
+concatenated from multiple runs of the program. This will produce the time and
+speedup plots illustrating the scalability of the system.
+
+[here]: http://us.imdb.com/interfaces#plain
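At its core the program described in this README is the classic build/probe hash join: build an in-memory hash table on the smaller input, then stream the larger input past it. As a point of reference, here is a minimal single-threaded sketch in C++ (my own illustration, not the project's code); it assumes the `.list` files have already been reduced to (title, year) and (actress, title) records, the kind of preprocessing `DbPrep` is described as doing.

    // Minimal single-threaded hash join sketch (illustration only, not the
    // project's code). Inputs are assumed to be pre-parsed records: movies as
    // (title, year) and credits as (actress, title); the join key is the title.
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
      std::vector<std::pair<std::string, int>> movies = {
          {"Casablanca", 1942}, {"Notorious", 1946}};
      std::vector<std::pair<std::string, std::string>> credits = {
          {"Ingrid Bergman", "Casablanca"}, {"Ingrid Bergman", "Notorious"}};

      // Build: hash the smaller relation (movies) on the join key.
      std::unordered_map<std::string, int> year_by_title;
      for (const auto& m : movies) year_by_title[m.first] = m.second;

      // Probe: stream the larger relation (credits) and emit joined tuples.
      for (const auto& c : credits) {
        auto it = year_by_title.find(c.second);
        if (it != year_by_title.end())
          std::cout << c.second << '\t' << it->second << '\t' << c.first << '\n';
      }
    }

Under the g++ 4.x toolchain that later revisions of the README name, the container would be `std::tr1::unordered_map`; the shape of the code is the same.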
From: <yan...@us...> - 2008-02-11 23:19:43
Revision: 379
          http://assorted.svn.sourceforge.net/assorted/?rev=379&view=rev
Author:   yangzhang
Date:     2008-02-11 15:19:42 -0800 (Mon, 11 Feb 2008)

Log Message:
-----------
more informative readme

Modified Paths:
--------------
    hash-join/trunk/README

Modified: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README  2008-02-11 23:11:38 UTC (rev 378)
+++ hash-join/trunk/README  2008-02-11 23:19:42 UTC (rev 379)
@@ -1,3 +1,9 @@
+% Parallel Hash Join
+% Yang Zhang
+
+Overview
+--------
+
 This is a simple implementation of parallel hash joins. I'm using this as a
 first step in studying the performance problems in multicore systems
 programming. This implementation is tailored for a particular dataset, the IMDB
@@ -3,11 +9,31 @@
 `movies.list` and `actresses.list` files, which may be found [here].
 
-The `tools/` directory contains `DbPrep.scala`, which is a filter for the
-`.list` files to prepare them to be more easily parsed by the hash join
-application.
+Requirements
+------------
 
-The `tools/` directory also contains `LogProc.scala`, which processes stdout
-concatenated from multiple runs of the program. This will produce the time and
-speedup plots illustrating the scalability of the system.
+- [C++ Commons] svn r370+
+- [libstdc++] v4.1
 
+Supporting Tools
+----------------
+
+`DbPrep` filters the `.list` files to prepare them to be parsed by the hash
+join.
+
+`LogProc` processes stdout concatenated from multiple runs of the program. This
+will produce the time and speedup plots illustrating the scalability of the
+system. This has actually been made into a generic tool and will be moved to
+its own project directory later.
+
+`Titles` extracts the titles from the output of `DbPrep` on `movies.list`.
+
+Related
+-------
+
+I used [HashDist] to experiment with the chaining of various hash functions on
+this dataset and observe the distribution.
+
 [here]: http://us.imdb.com/interfaces#plain
+[libstdc++]: http://gcc.gnu.org/libstdc++/
+[C++ Commons]: http://assorted.sf.net/
+[HashDist]: http://assorted.sf.net/
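The HashDist experiment mentioned under "Related" is about how evenly a hash function (or a chain of hash functions applied one after another) spreads the join keys across buckets, which directly affects how balanced the partitions in the parallel join will be. A rough sketch of that kind of measurement, assuming nothing about HashDist's actual code or input format:

    // Rough sketch of a bucket-distribution check in the spirit of HashDist
    // (illustration only; HashDist's real code and interface are not shown in
    // this thread). Reads one key per line on stdin, assigns it to one of n
    // buckets, and prints a per-bucket count so skew is easy to spot.
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char** argv) {
      const std::size_t nbuckets = argc > 1 ? std::stoul(argv[1]) : 16;
      std::vector<std::size_t> counts(nbuckets, 0);
      std::hash<std::string> h;  // swap in, or compose, other hash functions here
      std::string key;
      while (std::getline(std::cin, key)) counts[h(key) % nbuckets]++;
      for (std::size_t i = 0; i < nbuckets; ++i)
        std::cout << "bucket " << i << ": " << counts[i] << '\n';
    }

Feeding it the movie titles from the dataset and comparing the per-bucket counts for different hash functions is the kind of check the README is referring to.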
From: <yan...@us...> - 2008-02-14 20:33:30
Revision: 414
          http://assorted.svn.sourceforge.net/assorted/?rev=414&view=rev
Author:   yangzhang
Date:     2008-02-14 12:33:35 -0800 (Thu, 14 Feb 2008)

Log Message:
-----------
added more to readme

Modified Paths:
--------------
    hash-join/trunk/README

Modified: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README  2008-02-14 20:33:16 UTC (rev 413)
+++ hash-join/trunk/README  2008-02-14 20:33:35 UTC (rev 414)
@@ -7,34 +7,79 @@
 This is a simple implementation of parallel hash joins. I'm using this as a
 first step in studying the performance problems in multicore systems
 programming. This implementation is tailored for a particular dataset, the IMDB
-`movies.list` and `actresses.list` files, which may be found [here].
+`movies.list` and `actresses.list` files, which may be found at [here].
 
+To learn more about this algorithm, have a look at [Multiprocessor Hash-Based
+Join Algorithms]. In short:
+
+- The task is to join the set of (movie title, movie release year) with
+  (actress name, movie title) on equal movie titles. For instance, we may be
+  trying to answer the question "What is the average duration of an actress'
+  career?" (Movie titles and actress names are unique.)
+- Each of the $n$ nodes begins with an equal fraction of both datasets.
+- Stage 1 (_hash-partition_): each node hash-partitions its fraction into $n$
+  buckets, where bucket $i$ is destined for node $i$. This is done for each
+  dataset.
+- Stage 2 (_build_): each node receives the movie buckets destined for it from
+  all nodes, and builds a hash table mapping the join key (movie titles) to
+  tuples. This is done for the smaller of the two datasets, which in our case
+  is the movies dataset.
+- Stage 3 (_probe_): each node receives the actress buckets destined for it
+  from all nodes, and probes into the hash table using the movie title. If
+  there is a match, then emit the resulting joined tuple (movie title, movie
+  release year, actress name).
+
+Here are some [results].
+
 Requirements
 ------------
 
-- [C++ Commons] svn r370+
-- [libstdc++] v4.1
+- [C++ Commons]
+- [g++] 4.2
+- [libstdc++] 4.2
 
+Building
+--------
+
+C++ Commons is a utility library I maintain; you'll need to make it visible to
+hash-join:
+
+    $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/cpp-commons/trunk cpp-commons
+    $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/hash-join/trunk hash-join
+    $ ln -s "$PWD/cpp-commons/src/commons" hash-join/src/
+    $ cd hash-join/src/
+    $ make opt
+    $ ./hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
+
 Supporting Tools
 ----------------
 
 `DbPrep` filters the `.list` files to prepare them to be parsed by the hash
-join.
+join. This can additionally inflate the dataset size while maintaining roughly
+the same join-selectivity of the original dataset. I have also prepared [some
+datasets] available for your use.
 
 `LogProc` processes stdout concatenated from multiple runs of the program. This
 will produce the time and speedup plots illustrating the scalability of the
 system. This has actually been made into a generic tool and will be moved to
-its own project directory later.
+a utility project later.
 
 `Titles` extracts the titles from the output of `DbPrep` on `movies.list`.
 
+These tools all depend on the [Scala Commons].
+
 Related
 -------
 
 I used [HashDist] to experiment with the chaining of various hash functions on
-this dataset and observe the distribution.
+this dataset and to observe the resulting distributions.
 
+[C++ Commons]: http://assorted.sf.net/cpp-commons/
+[HashDist]: http://assorted.sf.net/
+[Multiprocessor Hash-Based Join Algorithms]: http://citeseer.ist.psu.edu/50143.html
+[Scala Commons]: http://assorted.sf.net/scala-commons/
+[g++]: http://gcc.gnu.org/
 [here]: http://us.imdb.com/interfaces#plain
 [libstdc++]: http://gcc.gnu.org/libstdc++/
-[C++ Commons]: http://assorted.sf.net/
-[HashDist]: http://assorted.sf.net/
+[results]: analysis.html
+[some datasets]: http://people.csail.mit.edu/yang/movie-data/
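The three stages listed in this revision of the README map fairly directly onto code. The following is a compact shared-memory sketch of the partition/build/probe pipeline, my own illustration rather than the project's implementation; it uses C++11 threads in place of whatever threading and exchange machinery C++ Commons provides, and it fakes the "receives the buckets destined for it from all nodes" step with shared per-(source, destination) buckets and a barrier between stages.

    // Sketch of the three-stage parallel hash join described above (partition,
    // build, probe). Illustration only: a real implementation would exchange
    // buckets through queues or the network; here each stage is a loop over
    // shared buffers with thread joins acting as the barrier between stages.
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using Movie = std::pair<std::string, int>;           // (title, year)
    using Credit = std::pair<std::string, std::string>;  // (actress, title)

    int main() {
      const std::size_t n = 4;  // number of nodes/threads
      std::vector<std::vector<Movie>> movie_frac(n);   // node i's input fraction
      std::vector<std::vector<Credit>> credit_frac(n);
      // ... load each node's fraction of the movie and actress data here ...

      std::hash<std::string> h;
      // mbuf[src][dst] holds the tuples node src partitions out for node dst.
      std::vector<std::vector<std::vector<Movie>>> mbuf(
          n, std::vector<std::vector<Movie>>(n));
      std::vector<std::vector<std::vector<Credit>>> cbuf(
          n, std::vector<std::vector<Credit>>(n));

      auto run_stage = [&](const std::function<void(std::size_t)>& stage) {
        std::vector<std::thread> ts;
        for (std::size_t i = 0; i < n; ++i) ts.emplace_back(stage, i);
        for (auto& t : ts) t.join();  // the joins double as the stage barrier
      };

      // Stage 1: each node hash-partitions its fraction by the join key (title).
      run_stage([&](std::size_t i) {
        for (auto& m : movie_frac[i]) mbuf[i][h(m.first) % n].push_back(m);
        for (auto& c : credit_frac[i]) cbuf[i][h(c.second) % n].push_back(c);
      });

      // Stages 2 and 3: each node builds a table from its movie buckets, then
      // probes it with its actress buckets and emits (title, year, actress).
      run_stage([&](std::size_t i) {
        std::unordered_map<std::string, int> table;  // build (smaller relation)
        for (std::size_t src = 0; src < n; ++src)
          for (auto& m : mbuf[src][i]) table[m.first] = m.second;
        for (std::size_t src = 0; src < n; ++src)
          for (auto& c : cbuf[src][i]) {
            auto it = table.find(c.second);  // probe
            if (it != table.end())
              std::cout << c.second << '\t' << it->second << '\t' << c.first << '\n';
          }
      });
    }

The thread joins between the two `run_stage` calls play the role of the barrier: no node starts building until every node has finished partitioning. Output from different threads may interleave in this sketch; the real program presumably emits results more carefully.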
From: <yan...@us...> - 2008-02-29 16:30:37
Revision: 544
          http://assorted.svn.sourceforge.net/assorted/?rev=544&view=rev
Author:   yangzhang
Date:     2008-02-29 08:30:30 -0800 (Fri, 29 Feb 2008)

Log Message:
-----------
tweaked readme

Modified Paths:
--------------
    hash-join/trunk/README

Modified: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README  2008-02-29 16:30:20 UTC (rev 543)
+++ hash-join/trunk/README  2008-02-29 16:30:30 UTC (rev 544)
@@ -41,6 +41,10 @@
 - [g++] 4.2
 - [libstdc++] 4.2
 
+[libstdc++]: http://gcc.gnu.org/libstdc++/
+[C++ Commons]: http://assorted.sf.net/cpp-commons/
+[g++]: http://gcc.gnu.org/
+
 Building
 --------
 
@@ -51,8 +55,8 @@
     $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/hash-join/trunk hash-join
     $ ln -s "$PWD/cpp-commons/src/commons" hash-join/src/
     $ cd hash-join/src/
-    $ make opt
-    $ ./hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
+    $ make hashjoin-opt
+    $ out/hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
 
 Supporting Tools
 ----------------
@@ -77,12 +81,9 @@
 I used [HashDist] to experiment with the chaining of various hash functions on
 this dataset and to observe the resulting distributions.
 
-[C++ Commons]: http://assorted.sf.net/cpp-commons/
 [HashDist]: http://assorted.svn.sourceforge.net/viewvc/assorted/hash-dist/trunk/
 [Multiprocessor Hash-Based Join Algorithms]: http://citeseer.ist.psu.edu/50143.html
 [Scala Commons]: http://assorted.sf.net/scala-commons/
-[g++]: http://gcc.gnu.org/
 [here]: http://us.imdb.com/interfaces#plain
-[libstdc++]: http://gcc.gnu.org/libstdc++/
 [results]: analysis.html
 [some datasets]: http://people.csail.mit.edu/yang/movie-data/
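The time and speedup plots that `LogProc` produces from the concatenated stdout of runs like `out/hashjoin-opt 16 ...` boil down to speedup(n) = T(1)/T(n). The real tool is a Scala program and its input format is not shown in this thread, so the following sketch assumes a made-up format of one `<threads> <seconds>` pair per run:

    // Sketch of the speedup computation behind LogProc-style plots. The
    // "<threads> <seconds>" line format is an assumption for illustration;
    // the actual log format of the hash-join program is not shown here.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
      std::map<int, double> best;  // thread count -> best (minimum) elapsed time
      std::string line;
      while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        int threads;
        double seconds;
        if (in >> threads >> seconds) {
          auto it = best.find(threads);
          if (it == best.end() || seconds < it->second) best[threads] = seconds;
        }
      }
      if (!best.count(1)) {
        std::cerr << "need a 1-thread baseline run\n";
        return 1;
      }
      const double t1 = best[1];
      std::cout << "threads\ttime\tspeedup\n";
      for (const auto& kv : best)
        std::cout << kv.first << '\t' << kv.second << '\t' << t1 / kv.second << '\n';
    }

Concatenating the stdout of runs at 1, 2, 4, ... threads and piping it through something like this yields the table behind the speedup plot.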