Thread: [Assorted-commits] SF.net SVN: assorted: [336] hash-join/trunk/README
From: <yan...@us...> - 2008-02-07 02:07:12
Revision: 336
          http://assorted.svn.sourceforge.net/assorted/?rev=336&view=rev
Author:   yangzhang
Date:     2008-02-06 18:07:16 -0800 (Wed, 06 Feb 2008)

Log Message:
-----------
added readme

Added Paths:
-----------
    hash-join/trunk/README

Added: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README                          (rev 0)
+++ hash-join/trunk/README  2008-02-07 02:07:16 UTC (rev 336)
@@ -0,0 +1,14 @@
+This is a simple implementation of parallel hash joins. I'm using this as a
+first step in studying the performance problems in multicore systems
+programming. This implementation is tailored for a particular dataset, the IMDB
+`movies.list` and `actresses.list` files, which may be found [here].
+
+The `tools/` directory contains `DbPrep.scala`, which is a filter for the
+`.list` files to prepare them to be more easily parsed by the hash join
+application.
+
+The `tools/` directory also contains `LogProc.scala`, which processes stdout
+concatenated from multiple runs of the program. This will produce the time and
+speedup plots illustrating the scalability of the system.
+
+[here]: http://us.imdb.com/interfaces#plain
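At its core the program described in this README is the classic build/probe hash join: build an in-memory hash table on the smaller input, then stream the larger input past it. As a point of reference, here is a minimal single-threaded sketch in C++ (my own illustration, not the project's code); it assumes the `.list` files have already been reduced to (title, year) and (actress, title) records, the kind of preprocessing `DbPrep` is described as doing.

    // Minimal single-threaded hash join sketch (illustration only, not the
    // project's code). Inputs are assumed to be pre-parsed records: movies as
    // (title, year) and credits as (actress, title); the join key is the title.
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
      std::vector<std::pair<std::string, int>> movies = {
          {"Casablanca", 1942}, {"Notorious", 1946}};
      std::vector<std::pair<std::string, std::string>> credits = {
          {"Ingrid Bergman", "Casablanca"}, {"Ingrid Bergman", "Notorious"}};

      // Build: hash the smaller relation (movies) on the join key.
      std::unordered_map<std::string, int> year_by_title;
      for (const auto& m : movies) year_by_title[m.first] = m.second;

      // Probe: stream the larger relation (credits) and emit joined tuples.
      for (const auto& c : credits) {
        auto it = year_by_title.find(c.second);
        if (it != year_by_title.end())
          std::cout << c.second << '\t' << it->second << '\t' << c.first << '\n';
      }
    }

Under the g++ 4.x toolchain that later revisions of the README name, the container would be `std::tr1::unordered_map`; the shape of the code is the same.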
From: <yan...@us...> - 2008-02-11 23:19:43
Revision: 379
          http://assorted.svn.sourceforge.net/assorted/?rev=379&view=rev
Author:   yangzhang
Date:     2008-02-11 15:19:42 -0800 (Mon, 11 Feb 2008)

Log Message:
-----------
more informative readme

Modified Paths:
--------------
    hash-join/trunk/README

Modified: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README  2008-02-11 23:11:38 UTC (rev 378)
+++ hash-join/trunk/README  2008-02-11 23:19:42 UTC (rev 379)
@@ -1,3 +1,9 @@
+% Parallel Hash Join
+% Yang Zhang
+
+Overview
+--------
+
 This is a simple implementation of parallel hash joins. I'm using this as a
 first step in studying the performance problems in multicore systems
 programming. This implementation is tailored for a particular dataset, the IMDB
@@ -3,11 +9,31 @@
 `movies.list` and `actresses.list` files, which may be found [here].
 
-The `tools/` directory contains `DbPrep.scala`, which is a filter for the
-`.list` files to prepare them to be more easily parsed by the hash join
-application.
+Requirements
+------------
 
-The `tools/` directory also contains `LogProc.scala`, which processes stdout
-concatenated from multiple runs of the program. This will produce the time and
-speedup plots illustrating the scalability of the system.
+- [C++ Commons] svn r370+
+- [libstdc++] v4.1
 
+Supporting Tools
+----------------
+
+`DbPrep` filters the `.list` files to prepare them to be parsed by the hash
+join.
+
+`LogProc` processes stdout concatenated from multiple runs of the program. This
+will produce the time and speedup plots illustrating the scalability of the
+system. This has actually been made into a generic tool and will be moved to
+its own project directory later.
+
+`Titles` extracts the titles from the output of `DbPrep` on `movies.list`.
+
+Related
+-------
+
+I used [HashDist] to experiment with the chaining of various hash functions on
+this dataset and observe the distribution.
+
 [here]: http://us.imdb.com/interfaces#plain
+[libstdc++]: http://gcc.gnu.org/libstdc++/
+[C++ Commons]: http://assorted.sf.net/
+[HashDist]: http://assorted.sf.net/
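The HashDist experiment mentioned under "Related" is about how evenly a hash function (or a chain of hash functions applied one after another) spreads the join keys across buckets, which directly affects how balanced the partitions in the parallel join will be. A rough sketch of that kind of measurement, assuming nothing about HashDist's actual code or input format:

    // Rough sketch of a bucket-distribution check in the spirit of HashDist
    // (illustration only; HashDist's real code and interface are not shown in
    // this thread). Reads one key per line on stdin, assigns it to one of n
    // buckets, and prints a per-bucket count so skew is easy to spot.
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char** argv) {
      const std::size_t nbuckets = argc > 1 ? std::stoul(argv[1]) : 16;
      std::vector<std::size_t> counts(nbuckets, 0);
      std::hash<std::string> h;  // swap in, or compose, other hash functions here
      std::string key;
      while (std::getline(std::cin, key)) counts[h(key) % nbuckets]++;
      for (std::size_t i = 0; i < nbuckets; ++i)
        std::cout << "bucket " << i << ": " << counts[i] << '\n';
    }

Feeding it the movie titles from the dataset and comparing the per-bucket counts for different hash functions is the kind of check the README is referring to.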
From: <yan...@us...> - 2008-02-14 20:33:30
Revision: 414
          http://assorted.svn.sourceforge.net/assorted/?rev=414&view=rev
Author:   yangzhang
Date:     2008-02-14 12:33:35 -0800 (Thu, 14 Feb 2008)

Log Message:
-----------
added more to readme

Modified Paths:
--------------
    hash-join/trunk/README

Modified: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README  2008-02-14 20:33:16 UTC (rev 413)
+++ hash-join/trunk/README  2008-02-14 20:33:35 UTC (rev 414)
@@ -7,34 +7,79 @@
 This is a simple implementation of parallel hash joins. I'm using this as a
 first step in studying the performance problems in multicore systems
 programming. This implementation is tailored for a particular dataset, the IMDB
-`movies.list` and `actresses.list` files, which may be found [here].
+`movies.list` and `actresses.list` files, which may be found at [here].
 
+To learn more about this algorithm, have a look at [Multiprocessor Hash-Based
+Join Algorithms]. In short:
+
+- The task is to join the set of (movie title, movie release year) with
+  (actress name, movie title) on equal movie titles. For instance, we may be
+  trying to answer the question "What is the average duration of an actress'
+  career?" (Movie titles and actress names are unique.)
+- Each of the $n$ nodes begins with an equal fraction of both datasets.
+- Stage 1 (_hash-partition_): each node hash-partitions its fraction into $n$
+  buckets, where bucket $i$ is destined for node $i$. This is done for each
+  dataset.
+- Stage 2 (_build_): each node receives the movie buckets destined for it from
+  all nodes, and builds a hash table mapping the join key (movie titles) to
+  tuples. This is done for the smaller of the two datasets, which in our case
+  is the movies dataset.
+- Stage 3 (_probe_): each node receives the actress buckets destined for it
+  from all nodes, and probes into the hash table using the movie title. If
+  there is a match, then emit the resulting joined tuple (movie title, movie
+  release year, actress name).
+
+Here are some [results].
+
 Requirements
 ------------
 
-- [C++ Commons] svn r370+
-- [libstdc++] v4.1
+- [C++ Commons]
+- [g++] 4.2
+- [libstdc++] 4.2
 
+Building
+--------
+
+C++ Commons is a utility library I maintain; you'll need to make it visible to
+hash-join:
+
+    $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/cpp-commons/trunk cpp-commons
+    $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/hash-join/trunk hash-join
+    $ ln -s "$PWD/cpp-commons/src/commons" hash-join/src/
+    $ cd hash-join/src/
+    $ make opt
+    $ ./hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
+
 Supporting Tools
 ----------------
 
 `DbPrep` filters the `.list` files to prepare them to be parsed by the hash
-join.
+join. This can additionally inflate the dataset size while maintaining roughly
+the same join-selectivity of the original dataset. I have also prepared [some
+datasets] available for your use.
 
 `LogProc` processes stdout concatenated from multiple runs of the program. This
 will produce the time and speedup plots illustrating the scalability of the
 system. This has actually been made into a generic tool and will be moved to
-its own project directory later.
+a utility project later.
 
 `Titles` extracts the titles from the output of `DbPrep` on `movies.list`.
 
+These tools all depend on the [Scala Commons].
+
 Related
 -------
 
 I used [HashDist] to experiment with the chaining of various hash functions on
-this dataset and observe the distribution.
+this dataset and to observe the resulting distributions.
 
+[C++ Commons]: http://assorted.sf.net/cpp-commons/
+[HashDist]: http://assorted.sf.net/
+[Multiprocessor Hash-Based Join Algorithms]: http://citeseer.ist.psu.edu/50143.html
+[Scala Commons]: http://assorted.sf.net/scala-commons/
+[g++]: http://gcc.gnu.org/
 [here]: http://us.imdb.com/interfaces#plain
 [libstdc++]: http://gcc.gnu.org/libstdc++/
-[C++ Commons]: http://assorted.sf.net/
-[HashDist]: http://assorted.sf.net/
+[results]: analysis.html
+[some datasets]: http://people.csail.mit.edu/yang/movie-data/
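The three stages listed in this revision of the README map fairly directly onto code. The following is a compact shared-memory sketch of the partition/build/probe pipeline, my own illustration rather than the project's implementation; it uses C++11 threads in place of whatever threading and exchange machinery C++ Commons provides, and it fakes the "receives the buckets destined for it from all nodes" step with shared per-(source, destination) buckets and a barrier between stages.

    // Sketch of the three-stage parallel hash join described above (partition,
    // build, probe). Illustration only: a real implementation would exchange
    // buckets through queues or the network; here each stage is a loop over
    // shared buffers with thread joins acting as the barrier between stages.
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using Movie = std::pair<std::string, int>;           // (title, year)
    using Credit = std::pair<std::string, std::string>;  // (actress, title)

    int main() {
      const std::size_t n = 4;  // number of nodes/threads
      std::vector<std::vector<Movie>> movie_frac(n);   // node i's input fraction
      std::vector<std::vector<Credit>> credit_frac(n);
      // ... load each node's fraction of the movie and actress data here ...

      std::hash<std::string> h;
      // mbuf[src][dst] holds the tuples node src partitions out for node dst.
      std::vector<std::vector<std::vector<Movie>>> mbuf(
          n, std::vector<std::vector<Movie>>(n));
      std::vector<std::vector<std::vector<Credit>>> cbuf(
          n, std::vector<std::vector<Credit>>(n));

      auto run_stage = [&](const std::function<void(std::size_t)>& stage) {
        std::vector<std::thread> ts;
        for (std::size_t i = 0; i < n; ++i) ts.emplace_back(stage, i);
        for (auto& t : ts) t.join();  // the joins double as the stage barrier
      };

      // Stage 1: each node hash-partitions its fraction by the join key (title).
      run_stage([&](std::size_t i) {
        for (auto& m : movie_frac[i]) mbuf[i][h(m.first) % n].push_back(m);
        for (auto& c : credit_frac[i]) cbuf[i][h(c.second) % n].push_back(c);
      });

      // Stages 2 and 3: each node builds a table from its movie buckets, then
      // probes it with its actress buckets and emits (title, year, actress).
      run_stage([&](std::size_t i) {
        std::unordered_map<std::string, int> table;  // build (smaller relation)
        for (std::size_t src = 0; src < n; ++src)
          for (auto& m : mbuf[src][i]) table[m.first] = m.second;
        for (std::size_t src = 0; src < n; ++src)
          for (auto& c : cbuf[src][i]) {
            auto it = table.find(c.second);  // probe
            if (it != table.end())
              std::cout << c.second << '\t' << it->second << '\t' << c.first << '\n';
          }
      });
    }

The thread joins between the two `run_stage` calls play the role of the barrier: no node starts building until every node has finished partitioning. Output from different threads may interleave in this sketch; the real program presumably emits results more carefully.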
From: <yan...@us...> - 2008-02-29 16:30:37
Revision: 544
          http://assorted.svn.sourceforge.net/assorted/?rev=544&view=rev
Author:   yangzhang
Date:     2008-02-29 08:30:30 -0800 (Fri, 29 Feb 2008)

Log Message:
-----------
tweaked readme

Modified Paths:
--------------
    hash-join/trunk/README

Modified: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README  2008-02-29 16:30:20 UTC (rev 543)
+++ hash-join/trunk/README  2008-02-29 16:30:30 UTC (rev 544)
@@ -41,6 +41,10 @@
 - [g++] 4.2
 - [libstdc++] 4.2
 
+[libstdc++]: http://gcc.gnu.org/libstdc++/
+[C++ Commons]: http://assorted.sf.net/cpp-commons/
+[g++]: http://gcc.gnu.org/
+
 Building
 --------
 
@@ -51,8 +55,8 @@
     $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/hash-join/trunk hash-join
     $ ln -s "$PWD/cpp-commons/src/commons" hash-join/src/
     $ cd hash-join/src/
-    $ make opt
-    $ ./hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
+    $ make hashjoin-opt
+    $ out/hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
 
 Supporting Tools
 ----------------
@@ -77,12 +81,9 @@
 I used [HashDist] to experiment with the chaining of various hash functions on
 this dataset and to observe the resulting distributions.
 
-[C++ Commons]: http://assorted.sf.net/cpp-commons/
 [HashDist]: http://assorted.svn.sourceforge.net/viewvc/assorted/hash-dist/trunk/
 [Multiprocessor Hash-Based Join Algorithms]: http://citeseer.ist.psu.edu/50143.html
 [Scala Commons]: http://assorted.sf.net/scala-commons/
-[g++]: http://gcc.gnu.org/
 [here]: http://us.imdb.com/interfaces#plain
-[libstdc++]: http://gcc.gnu.org/libstdc++/
 [results]: analysis.html
 [some datasets]: http://people.csail.mit.edu/yang/movie-data/
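The time and speedup plots that `LogProc` produces from the concatenated stdout of runs like `out/hashjoin-opt 16 ...` boil down to speedup(n) = T(1)/T(n). The real tool is a Scala program and its input format is not shown in this thread, so the following sketch assumes a made-up format of one `<threads> <seconds>` pair per run:

    // Sketch of the speedup computation behind LogProc-style plots. The
    // "<threads> <seconds>" line format is an assumption for illustration;
    // the actual log format of the hash-join program is not shown here.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
      std::map<int, double> best;  // thread count -> best (minimum) elapsed time
      std::string line;
      while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        int threads;
        double seconds;
        if (in >> threads >> seconds) {
          auto it = best.find(threads);
          if (it == best.end() || seconds < it->second) best[threads] = seconds;
        }
      }
      if (!best.count(1)) {
        std::cerr << "need a 1-thread baseline run\n";
        return 1;
      }
      const double t1 = best[1];
      std::cout << "threads\ttime\tspeedup\n";
      for (const auto& kv : best)
        std::cout << kv.first << '\t' << kv.second << '\t' << t1 / kv.second << '\n';
    }

Concatenating the stdout of runs at 1, 2, 4, ... threads and piping it through something like this yields the table behind the speedup plot.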