[Assorted-commits] SF.net SVN: assorted: [414] hash-join/trunk/README
From: <yan...@us...> - 2008-02-14 20:33:30
Revision: 414
          http://assorted.svn.sourceforge.net/assorted/?rev=414&view=rev
Author:   yangzhang
Date:     2008-02-14 12:33:35 -0800 (Thu, 14 Feb 2008)

Log Message:
-----------
added more to readme

Modified Paths:
--------------
    hash-join/trunk/README

Modified: hash-join/trunk/README
===================================================================
--- hash-join/trunk/README	2008-02-14 20:33:16 UTC (rev 413)
+++ hash-join/trunk/README	2008-02-14 20:33:35 UTC (rev 414)
@@ -7,34 +7,79 @@
 This is a simple implementation of parallel hash joins. I'm using this as a
 first step in studying the performance problems in multicore systems
 programming. This implementation is tailored for a particular dataset, the IMDB
-`movies.list` and `actresses.list` files, which may be found [here].
+`movies.list` and `actresses.list` files, which may be found at [here].
 
+To learn more about this algorithm, have a look at [Multiprocessor Hash-Based
+Join Algorithms]. In short:
+
+- The task is to join the set of (movie title, movie release year) with
+  (actress name, movie title) on equal movie titles. For instance, we may be
+  trying to answer the question "What is the average duration of an actress'
+  career?" (Movie titles and actress names are unique.)
+- Each of the $n$ nodes begins with an equal fraction of both datasets.
+- Stage 1 (_hash-partition_): each node hash-partitions its fraction into $n$
+  buckets, where bucket $i$ is destined for node $i$. This is done for each
+  dataset.
+- Stage 2 (_build_): each node receives the movie buckets destined for it from
+  all nodes, and builds a hash table mapping the join key (movie titles) to
+  tuples. This is done for the smaller of the two datasets, which in our case
+  is the movies dataset.
+- Stage 3 (_probe_): each node receives the actress buckets destined for it
+  from all nodes, and probes into the hash table using the movie title. If
+  there is a match, it emits the resulting joined tuple (movie title, movie
+  release year, actress name).
+
+Here are some [results].
+
 Requirements
 ------------
 
-- [C++ Commons] svn r370+
-- [libstdc++] v4.1
+- [C++ Commons]
+- [g++] 4.2
+- [libstdc++] 4.2
 
+Building
+--------
+
+C++ Commons is a utility library I maintain; you'll need to make it visible to
+hash-join:
+
+    $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/cpp-commons/trunk cpp-commons
+    $ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/hash-join/trunk hash-join
+    $ ln -s "$PWD/cpp-commons/src/commons" hash-join/src/
+    $ cd hash-join/src/
+    $ make opt
+    $ ./hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
+
 Supporting Tools
 ----------------
 
 `DbPrep` filters the `.list` files to prepare them to be parsed by the hash
-join.
+join. This can additionally inflate the dataset size while maintaining roughly
+the same join selectivity as the original dataset. I have also made [some
+datasets] available for your use.
 
 `LogProc` processes stdout concatenated from multiple runs of the program. This
 will produce the time and speedup plots illustrating the scalability of the
 system. This has actually been made into a generic tool and will be moved to
-its own project directory later.
+a utility project later.
 
 `Titles` extracts the titles from the output of `DbPrep` on `movies.list`.
 
+These tools all depend on the [Scala Commons].
+
 Related
 -------
 
 I used [HashDist] to experiment with the chaining of various hash functions on
-this dataset and observe the distribution.
+this dataset and to observe the resulting distributions.
 
+[C++ Commons]: http://assorted.sf.net/cpp-commons/
+[HashDist]: http://assorted.sf.net/
+[Multiprocessor Hash-Based Join Algorithms]: http://citeseer.ist.psu.edu/50143.html
+[Scala Commons]: http://assorted.sf.net/scala-commons/
+[g++]: http://gcc.gnu.org/
 [here]: http://us.imdb.com/interfaces#plain
 [libstdc++]: http://gcc.gnu.org/libstdc++/
-[C++ Commons]: http://assorted.sf.net/
-[HashDist]: http://assorted.sf.net/
+[results]: analysis.html
+[some datasets]: http://people.csail.mit.edu/yang/movie-data/
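
For readers skimming this commit, here is a minimal, single-process sketch of the three stages the new README describes (hash-partition, build, probe). It is not the code in hash-join/trunk: the record types, the node count, the toy data, and the in-memory bucket vectors standing in for the inter-node exchange are all illustrative assumptions.

    // Minimal sketch of the three-stage parallel hash join described in the
    // README above.  Record types, node count, and toy data are illustrative
    // assumptions, not the actual hash-join implementation.
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Movie  { std::string title; int year; };             // (movie title, release year)
    struct Acted  { std::string actress; std::string title; };  // (actress name, movie title)
    struct Joined { std::string title; int year; std::string actress; };

    const std::size_t kNodes = 4;  // n: number of nodes (arbitrary here)

    // Which node owns a given join key (movie title).
    std::size_t node_of(const std::string& title) {
      return std::hash<std::string>{}(title) % kNodes;
    }

    int main() {
      // Tiny made-up dataset standing in for the IMDB lists; in the real
      // program each node would start with its own fraction of both inputs.
      std::vector<Movie> movies = {{"Alien", 1979}, {"Clue", 1985}, {"Gia", 1998}};
      std::vector<Acted> acted  = {{"Sigourney Weaver", "Alien"},
                                   {"Madeline Kahn", "Clue"},
                                   {"Angelina Jolie", "Gia"},
                                   {"Veronica Cartwright", "Alien"}};

      // Stage 1 (hash-partition): route every tuple to the node that owns its
      // join key.  movie_buckets[i] / acted_buckets[i] model what node i
      // would receive from all of the nodes combined.
      std::vector<std::vector<Movie>> movie_buckets(kNodes);
      std::vector<std::vector<Acted>> acted_buckets(kNodes);
      for (const Movie& m : movies) movie_buckets[node_of(m.title)].push_back(m);
      for (const Acted& a : acted)  acted_buckets[node_of(a.title)].push_back(a);

      std::vector<Joined> output;
      for (std::size_t node = 0; node < kNodes; ++node) {
        // Stage 2 (build): hash table over the smaller relation (movies),
        // mapping the join key to its tuple.
        std::unordered_map<std::string, int> year_by_title;
        for (const Movie& m : movie_buckets[node]) year_by_title[m.title] = m.year;

        // Stage 3 (probe): probe with each actress tuple; on a match, emit the
        // joined tuple (movie title, release year, actress name).
        for (const Acted& a : acted_buckets[node]) {
          auto it = year_by_title.find(a.title);
          if (it != year_by_title.end())
            output.push_back({a.title, it->second, a.actress});
        }
      }

      for (const Joined& j : output)
        std::cout << j.title << '\t' << j.year << '\t' << j.actress << '\n';
    }

In the real multicore setting each iteration of the per-node loop would run on its own core and overlap with shipping buckets between nodes; the hash table in Stage 2 is built over the movies side because it is the smaller relation.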