Thread: [Assorted-commits] SF.net SVN: assorted:[1081] ydb/trunk
From: <yan...@us...> - 2008-11-30 23:46:33
Revision: 1081 http://assorted.svn.sourceforge.net/assorted/?rev=1081&view=rev Author: yangzhang Date: 2008-11-30 23:46:31 +0000 (Sun, 30 Nov 2008) Log Message: ----------- - added simplest state-sending recovery - verifiably produce (dump) the same state on both machines - general clean up - filled out the README Modified Paths: -------------- ydb/trunk/src/Makefile ydb/trunk/src/main.lzz ydb/trunk/src/ydb.proto Added Paths: ----------- ydb/trunk/README ydb/trunk/publish.bash Removed Paths: ------------- ydb/trunk/src/ydb.thrift Added: ydb/trunk/README =================================================================== --- ydb/trunk/README (rev 0) +++ ydb/trunk/README 2008-11-30 23:46:31 UTC (rev 1081) @@ -0,0 +1,120 @@ +Overview +-------- + +YDB (Yang's Database) is a simple replicated memory store, developed for the +purpose of researching various approaches to recovery in such OLTP-optimized +databases as [VOLTDB] (formerly H-Store/Horizontica). + +[VOLTDB]: http://db.cs.yale.edu/hstore/ + +Currently, the only recovery implemented mechanism is to have one of the +replicas serialize the entire database state and send that to the joining node. + +If you start a system of $n$ replicas, then the leader will wait for $n-1$ of +them to join before it starts issuing transactions. (Think of $n-1$ as the +minimum number of replicas the system requires before it is willing to process +transactions.) Then when replica $n$ joins, it will need to catch up to the +current state of the system, and it will do so by contacting one of the other +replicas and requesting a complete dump of its DB state. + +The leader will report the current txn seqno to the joiner, and start streaming +txns beyond that seqno to the joiner, which the joiner will push onto its +backlog. It will also instruct the standing replicas to snapshot and send +their DB state at this txn seqno. 
As a result, the standing replicas will +pause once they get this message until they can send their state to the joiner. + +Setup +----- + +Requirements: + +- [boost] 1.35 +- [C++ Commons] svn r1074 +- [GCC] 4.3.2 +- [Lazy C++] 2.8.0 +- [Protocol Buffers] 2.0.0 +- [State Threads] 1.8 + +[boost]: http://www.boost.org/ +[C++ Commons]: http://assorted.sourceforge.net/cpp-commons/ +[GCC]: http://gcc.gnu.org/ +[Lazy C++]: http://www.lazycplusplus.com/ +[Protocol Buffers]: http://code.google.com/p/protobuf/ +[State Threads]: http://state-threads.sourceforge.net/ + +Usage +----- + +To start a leader to manage 2 replicas, run: + + ./ydb 2 + +This will listen on port 7654. Then to start a replica, run: + + ./ydb localhost 7654 7655 + +This means "connect to the leader at localhost:7654, and listen on port 7655." +The replicas have to listen for connections from other replicas. + +Currently handles only 2 replicas. + +Pseudo-code +----------- + +### Leader + + foreach event + if event == departure + remove replica + if event == join + add replica + send init msg to new replica + who else is in the system + which txn we're on + start sending txns to new replica + start handling responses from new replica + read responses up till the current seqno + +### Replica + + start listening for conns from new replicas + generate recovery msg from map + send recovery msg to new replica + send join msg to leader + recv init msg from leader + start recving txns from leader + if map is caught up + apply txn directly + else + push onto backlog + foreach replica + connect to replica + recv recovery msg from replica + apply the state + apply backlog + +Todo +---- + +- Add benchmarking information, e.g.: + - txns/second normally + - txns/second during recovery + - time to recover + - bytes used to recover + +- Run some benchmarks, esp. on multiple physical hosts. + +- Figure out why things are running so slowly with >2 replicas. 
+ +- Add a variant of the recovery scheme so that the standing replicas can just + send any snapshot of their DB beyond a certain seqno. The joiner can simply + discard from its leader-populated backlog any txns before the seqno of the + actual state it receives. This way, no communication between the leader and + the standing replicas needs to take place, and the replicas don't need to + wait for the new guy to join before they can continue processing txns. + +- Add a recovery scheme to recover from multiple replicas simultaneously. + +- Add richer transactions/queries/operations. + +- Add disk-based recovery methods. Copied: ydb/trunk/publish.bash (from rev 1067, hash-join/trunk/publish.bash) =================================================================== --- ydb/trunk/publish.bash (rev 0) +++ ydb/trunk/publish.bash 2008-11-30 23:46:31 UTC (rev 1081) @@ -0,0 +1,9 @@ +#!/usr/bin/env bash + +fullname='YDB' +version=0.1 +license=gpl3 +websrcs=( README ) +rels=( src-tgz: ) +nodl=true +. 
assorted.bash "$@" Property changes on: ydb/trunk/publish.bash ___________________________________________________________________ Added: svn:executable + * Added: svn:mergeinfo + Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2008-11-30 23:45:26 UTC (rev 1080) +++ ydb/trunk/src/Makefile 2008-11-30 23:46:31 UTC (rev 1081) @@ -1,4 +1,5 @@ TARGET := ydb +WTF := wtf LZZS := $(wildcard *.lzz) LZZHDRS := $(foreach lzz,$(LZZS),$(patsubst %.lzz,%.hh,$(lzz))) @@ -31,10 +32,10 @@ $(CXX) -o $@ $^ $(LDFLAGS) %.o: %.cc $(PBHDRS) - wtf $(CXX) $(CXXFLAGS) -c -o $@ $< + $(WTF) $(CXX) $(CXXFLAGS) -c -o $@ $< %.o: %.pb.cc %.pb.h - wtf $(CXX) $(PBCXXFLAGS) -c -o $@ $< + $(WTF) $(CXX) $(PBCXXFLAGS) -c -o $@ $< %.cc: %.lzz lzz -hx hh -sx cc -hl -sl -hd -sd $< Modified: ydb/trunk/src/main.lzz =================================================================== --- ydb/trunk/src/main.lzz 2008-11-30 23:45:26 UTC (rev 1080) +++ ydb/trunk/src/main.lzz 2008-11-30 23:46:31 UTC (rev 1081) @@ -6,13 +6,20 @@ #hdr #include <boost/bind.hpp> #include <boost/foreach.hpp> +#include <boost/lambda/lambda.hpp> +#include <boost/scoped_array.hpp> #include <commons/nullptr.h> #include <commons/st/st.h> +#include <csignal> #include <cstdio> #include <cstdlib> #include <iostream> +#include <fstream> #include <map> +#include <set> #include <sstream> +#include <sys/types.h> +#include <unistd.h> #include <vector> #include "ydb.pb.h" #define foreach BOOST_FOREACH @@ -21,11 +28,123 @@ using namespace std; #end -extern int chkpt = 1000; +typedef pair<int, int> pii; + +// Why does just timeout require the `extern`? extern const st_utime_t timeout = 1000000; -extern const bool verbose = false; +const int chkpt = 1000; +const bool verbose = true; +const uint16_t base_port = 7654; +st_intr_bool stop_hub, kill_hub; /** + * The list of threads. 
+ */ +set<st_thread_t> threads; + +class thread_eraser +{ + public: + thread_eraser() { threads.insert(st_thread_self()); } + ~thread_eraser() { threads.erase(st_thread_self()); } +}; + +/** + * Delegate for running thread targets. + * \param[in] f The function to execute. + * \param[in] intr Whether to signal stop_hub on an exception. + */ +void +my_spawn_helper(const function0<void> f, bool intr) +{ + thread_eraser eraser; + try { + f(); + } catch (const exception &ex) { + cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; + if (intr) stop_hub.set(); + } +} + +/** + * Spawn a thread using ST but wrap it in an exception handler that interrupts + * all other threads (hopefully causing them to unwind). + */ +st_thread_t +my_spawn(const function0<void> &f, bool intr = true) +{ + st_thread_t t = st_spawn(bind(my_spawn_helper, f, intr)); + threads.insert(t); + return t; +} + +/** + * Used by the leader to bookkeep information about replicas. + */ +class replica_info +{ + public: + replica_info(st_netfd_t fd, uint16_t port) : fd_(fd), port_(port) {} + st_netfd_t fd() const { return fd_; } + /** The port on which the replica is listening. */ + uint16_t port() const { return port_; } +#hdr +#define GETSA sockaddr_in sa; sockaddr(sa); return sa +#end + /** The port on which the replica connected to us. */ + uint16_t local_port() const { GETSA.sin_port; } + uint32_t host() const { GETSA.sin_addr.s_addr; } + sockaddr_in sockaddr() const { GETSA; } + void sockaddr(sockaddr_in &sa) const { + socklen_t salen = sizeof sa; + check0x(getpeername(st_netfd_fileno(fd_), + reinterpret_cast<struct sockaddr*>(&sa), + &salen)); + } + private: + st_netfd_t fd_; + uint16_t port_; +}; + +/** + * RAII to close all contained netfds. 
+ */ +class st_closing_all +{ + public: + st_closing_all(const vector<replica_info>& rs) : rs_(rs) {} + ~st_closing_all() { + foreach (replica_info r, rs_) + check0x(st_netfd_close(r.fd())); + } + private: + const vector<replica_info> &rs_; +}; + +/** + * RAII for dumping the final state of the DB to a file on disk. + */ +class dump_state +{ + public: + dump_state(const map<int, int> &map, const int &seqno) + : map_(map), seqno_(seqno) {} + ~dump_state() { + stringstream fname; + fname << "/tmp/ydb" << getpid(); + cout << "dumping DB state (" << seqno_ << ") to " << fname.str() << endl; + ofstream of(fname.str().c_str()); + of << "seqno: " << seqno_ << endl; + foreach (const pii &p, map_) { + of << p.first << ": " << p.second << endl; + } + } + private: + const map<int, int> &map_; + const int &seqno_; +}; + +/** * Send a message to some destinations (sequentially). */ template<typename T> @@ -44,7 +163,7 @@ foreach (st_netfd_t dst, dsts) { checkeqnneg(st_write(dst, static_cast<void*>(&len), sizeof len, timeout), static_cast<int>(sizeof len)); - checkeqnneg(st_write(dst, buf, s.size(), timeout), + checkeqnneg(st_write(dst, buf, s.size(), ST_UTIME_NO_TIMEOUT), static_cast<int>(s.size())); } } @@ -52,7 +171,7 @@ /** * Send a message to a single recipient. */ -template<typename T> +template<typename T> void sendmsg(st_netfd_t dst, const T &msg) { @@ -65,19 +184,28 @@ */ template <typename T> void -readmsg(st_netfd_t src, T & msg) +readmsg(st_netfd_t src, T & msg, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) { // Read the message length. uint32_t len; checkeqnneg(st_read_fully(src, static_cast<void*>(&len), sizeof len, - ST_UTIME_NO_TIMEOUT), + timeout), static_cast<int>(sizeof len)); len = ntohl(len); +#define GETMSG(buf) \ + checkeqnneg(st_read_fully(src, buf, len, timeout), (int) len); \ + check(msg.ParseFromArray(buf, len)); + // Parse the message body. 
- char buf[len]; - checkeqnneg(st_read_fully(src, buf, len, timeout), (int) len); - check(msg.ParseFromArray(buf, len)); + if (len < 4096) { + char buf[len]; + GETMSG(buf); + } else { + cout << "receiving large msg; heap-allocating " << len << " bytes" << endl; + scoped_array<char> buf(new char[len]); + GETMSG(buf.get()); + } } inline int @@ -90,239 +218,403 @@ * Keep issuing transactions to the replicas. */ void -issue_txns(const vector<st_netfd_t> &replicas) +issue_txns(st_channel<replica_info> &newreps, int &seqno) { Op_OpType types[] = {Op::read, Op::write, Op::del}; - size_t lastsize = replicas.size(); - cout << "replicas = " << &replicas << endl; - int i = 0; - while (true) { - if (replicas.size() != lastsize) { - cout << "size changed from " << lastsize << " to " << replicas.size() - << endl; - lastsize = replicas.size(); + vector<st_netfd_t> fds; + + while (!stop_hub) { + // Did we get a new member? + while (!newreps.empty()) { + if (seqno > 0) { + Txn txn; + bcastmsg(fds, txn); + } + fds.push_back(newreps.take().fd()); } + // Generate a random transaction. Txn txn; + txn.set_seqno(seqno++); int count = rand32(5) + 1; for (int o = 0; o < count; o++) { Op *op = txn.add_op(); - int rtype = rand32(3), rkey = rand32(), rvalue = rand32(); + int rtype = rand32(3), rkey = rand32(), rvalue = rand32(); op->set_type(types[rtype]); op->set_key(rkey); op->set_value(rvalue); } - bcastmsg(replicas, txn); - if (++i % chkpt == 0) { - if (verbose) cout << "issued txn " << i << endl; + + // Broadcast. + bcastmsg(fds, txn); + + // Checkpoint. + if (txn.seqno() % chkpt == 0) { + if (verbose) cout << "issued txn " << txn.seqno() << endl; st_sleep(0); } } } /** - * Keep swallowing replica responses. + * Process a transaction: update DB state (incl. seqno) and send response to + * leader. 
*/ void -handle_responses(st_netfd_t replica) +process_txn(st_netfd_t leader, map<int, int> &map, const Txn &txn, int &seqno) { - int i = 0; - while (true) { - Response res; - readmsg(replica, res); - if (++i % chkpt == 0) { - if (verbose) - cout << "got response " << i << " from " << replica << " of size " - << res.result_size() << endl; - st_sleep(0); + checkeq(txn.seqno(), seqno + 1); + Response res; + res.set_seqno(txn.seqno()); + seqno = txn.seqno(); + for (int o = 0; o < txn.op_size(); o++) { + const Op &op = txn.op(o); + switch (op.type()) { + case Op::read: + res.add_result(map[op.key()]); + break; + case Op::write: + map[op.key()] = op.value(); + break; + case Op::del: + map.erase(op.key()); + break; } } + sendmsg(leader, res); } /** * Actually do the work of executing a transaction and sending back the reply. */ void -process_txns(st_netfd_t leader, map<int, int> &map) +process_txns(st_netfd_t leader, map<int, int> &map, int &seqno, + st_bool &send_state, st_bool &sent_state, + st_channel<Txn*> &backlog) { - int i = 0; while (true) { Txn txn; - readmsg(leader, txn); + //{ + //st_intr intr(stop_hub); + readmsg(leader, txn); + //} - Response res; - for (int o = 0; o < txn.op_size(); o++) { - const Op &op = txn.op(o); - switch (op.type()) { - case Op::read: - res.add_result(map[op.key()]); - break; - case Op::write: - map[op.key()] = op.value(); - break; - case Op::del: - map.erase(op.key()); - break; + if (txn.has_seqno()) { + if (txn.seqno() == seqno + 1) { + process_txn(leader, map, txn, seqno); + } else { + // Queue up for later processing once a snapshot has been received. + backlog.push(new Txn(txn)); } + } else { + // Wait for the snapshot to be generated. 
+ send_state.set(); + cout << "waiting for state to be sent" << endl; + sent_state.waitset(); + sent_state.reset(); } - sendmsg(leader, res); - if (++i % chkpt == 0) { - if (verbose) cout << "processed txn " << i << endl; + if (txn.seqno() % chkpt == 0) { + if (verbose) cout << "processed txn " << txn.seqno() << endl; st_sleep(0); } } } /** + * Keep swallowing replica responses. + */ +void +handle_responses(st_netfd_t replica, const int &seqno) +{ + while (true) { + Response res; + { + st_intr intr(kill_hub); + readmsg(replica, res); + } + if (res.seqno() % chkpt == 0) { + if (verbose) + cout << "got response " << res.seqno() << " from " << replica << endl; + st_sleep(0); + } + if (stop_hub && res.seqno() == seqno - 1) { + cout << "seqno = " << seqno - 1 << endl; + break; + } + } +} + +/** * Help the recovering node. */ void -recover_joiner(st_netfd_t listener, const map<int, int> &map) +recover_joiner(st_netfd_t listener, const map<int, int> &map, const int &seqno, + st_bool &send_state, st_bool &sent_state) { - st_netfd_t joiner = checkpass(st_accept(listener, nullptr, nullptr, - ST_UTIME_NO_TIMEOUT)); - cout << "got the joiner! " << joiner << endl; + st_netfd_t joiner; + { + st_intr intr(stop_hub); + joiner = checkerr(st_accept(listener, nullptr, nullptr, + ST_UTIME_NO_TIMEOUT)); + } + st_closing closing(joiner); + cout << "got recoverer's connection" << endl; + + // Wait for the right time to generate the snapshot. + send_state.waitset(); + send_state.reset(); + + cout << "snapshotting state for recovery" << endl; Recovery recovery; - typedef pair<int, int> pii; - foreach (pii p, map) { + foreach (const pii &p, map) { Recovery_Pair *pair = recovery.add_pair(); pair->set_key(p.first); pair->set_value(p.second); } + recovery.set_seqno(seqno); + + // Notify process_txns that it may continue processing. 
+ sent_state.set(); + + cout << "sending recovery" << endl; sendmsg(joiner, recovery); + cout << "sent" << endl; } -int -main(int argc, char **argv) +/** + * Run the leader. + */ +void +run_leader(int nreps) { - check0x(st_init()); - if (argc < 2) - die("leader: ydb <nreplicas>\n" - "replica: ydb <leaderhost> <leaderport> <listenport>\n" - "joiner: ydb <leaderhost> <leaderport>\n"); - bool is_leader = argc == 2; - bool is_joiner = argc == 3; - if (is_leader) { - cout << "starting as leader" << endl; + cout << "starting as leader" << endl; - // Wait until all replicas have joined. - st_netfd_t listener = st_tcp_listen(7654); - vector<st_netfd_t> replicas; - for (int i = 1; i < atoi(argv[1]); i++) { - replicas.push_back(checkpass( - st_accept(listener, nullptr, nullptr, ST_UTIME_NO_TIMEOUT))); + // Wait until all replicas have joined. + st_netfd_t listener = st_tcp_listen(base_port); + st_closing close_listener(listener); + // TODO rename these + int min_reps = nreps - 1; + vector<replica_info> replicas; + st_closing_all close_replicas(replicas); + for (int i = 0; i < min_reps; i++) { + st_netfd_t fd; + { + st_intr intr(stop_hub); + fd = checkerr(st_accept(listener, nullptr, nullptr, + ST_UTIME_NO_TIMEOUT)); } + Join join; + readmsg(fd, join); + replicas.push_back(replica_info(fd, static_cast<uint16_t>(join.port()))); + } - // Construct the initialization message. - Init init; - foreach (st_netfd_t r, replicas) { - // Get socket addresses. + // Construct the initialization message. + Init init; + init.set_txnseqno(0); + foreach (replica_info r, replicas) { + SockAddr *psa = init.add_node(); + psa->set_host(r.host()); + psa->set_port(r.port()); + } - sockaddr_in sa; - socklen_t salen = sizeof sa; - check0x(getpeername(st_netfd_fileno(r), reinterpret_cast<sockaddr*>(&sa), &salen)); + // Send init to each initial replica. 
+ foreach (replica_info r, replicas) { + init.set_yourhost(r.host()); + sendmsg(r.fd(), init); + } - SockAddr *psa = init.add_node(); - psa->set_host(sa.sin_addr.s_addr); - psa->set_port(sa.sin_port); - } + // Start dispatching queries. + int seqno = 0; + st_channel<replica_info> newreps; + const function0<void> f = bind(issue_txns, ref(newreps), ref(seqno)); + st_thread_t swallower = my_spawn(bind(swallow, f)); + foreach (const replica_info &r, replicas) newreps.push(r); + st_joining join_swallower(swallower); - bcastmsg(replicas, init); + // Start handling responses. + st_thread_group handlers; + foreach (replica_info r, replicas) { + handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno)))); + } - // Start dispatching queries. - const function0<void> f = bind(issue_txns, ref(replicas)); - st_thread_t t = st_spawn(bind(swallow, f)); + // Accept the recovering node, and tell it about the online replicas. + st_netfd_t joiner; + { + st_intr intr(stop_hub); + joiner = checkerr(st_accept(listener, nullptr, nullptr, + ST_UTIME_NO_TIMEOUT)); + } + Join join; + readmsg(joiner, join); + cout << "setting seqno to " << seqno << endl; + init.set_txnseqno(seqno); + sendmsg(joiner, init); - // Start handling responses. - vector<st_thread_t> handlers(replicas.size()); - foreach (st_netfd_t r, replicas) { - handlers.push_back(st_spawn(bind(handle_responses, r))); - } + // Start streaming txns to joiner. + cout << "start streaming txns to joiner" << endl; + replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); + newreps.push(replicas.back()); + handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno)))); +} - // Accept the recovering node, and tell it about the online replicas. - st_netfd_t joiner = checkpass(st_accept(listener, nullptr, nullptr, - ST_UTIME_NO_TIMEOUT)); - sendmsg(joiner, init); +/** + * Run a replica. 
+ */ +void +run_replica(char *leader_host, uint16_t leader_port, uint16_t listen_port) +{ + // Initialize database state. + map<int, int> map; + int seqno = -1; + dump_state ds(map, seqno); + st_bool send_state, sent_state; - // Bring the new guy "back" into action. - Ready ready; - readmsg(joiner, ready); - cout << "the prodigal son has returned" << endl; - cout << "replicas = " << &replicas << endl; - replicas.push_back(joiner); - handlers.push_back(st_spawn(bind(handle_responses, joiner))); + cout << "starting as replica" << endl; - // Wait on other threads. - check0x(st_thread_join(t, nullptr)); + // Listen for connections from other replicas. + st_netfd_t listener = + st_tcp_listen(listen_port); + st_thread_t rec = my_spawn(bind(recover_joiner, listener, ref(map), + ref(seqno), ref(send_state), + ref(sent_state))); - // Cleanly close all connections. - foreach (st_netfd_t r, replicas) { - check0x(st_netfd_close(r)); + // Connect to the leader and join the system. + st_netfd_t leader = st_tcp_connect(leader_host, leader_port, timeout); + Join join; + join.set_port(listen_port); + sendmsg(leader, join); + Init init; + readmsg(leader, init); + uint32_t listen_host = init.yourhost(); + + // Display the info. + cout << "got init msg with txn seqno " << init.txnseqno() + << " and hosts:" << endl; + vector<st_netfd_t> replicas; + for (uint16_t i = 0; i < init.node_size(); i++) { + const SockAddr &sa = init.node(i); + char buf[INET_ADDRSTRLEN]; + in_addr host = { sa.host() }; + bool is_self = sa.host() == listen_host && sa.port() == listen_port; + cout << "- " << checkerr(inet_ntop(AF_INET, &host, buf, + INET_ADDRSTRLEN)) + << ':' << sa.port() << (is_self ? " (self)" : "") << endl; + if (!is_self) { + replicas.push_back(st_tcp_connect(host, + static_cast<uint16_t>(sa.port()), + timeout)); } - check0x(st_netfd_close(listener)); - } else { - map<int, int> map; + } - // Connect to the leader. 
- char *host = argv[1]; - uint16_t port = static_cast<uint16_t>(atoi(argv[2])); + // Process txns. + st_channel<Txn*> backlog; + st_thread_t proc = my_spawn(bind(process_txns, leader, ref(map), ref(seqno), + ref(send_state), ref(sent_state), + ref(backlog))); - if (!is_joiner) { - // Listen for then talk to the joiner. - st_netfd_t listener = - st_tcp_listen(static_cast<uint16_t>(atoi(argv[3]))); - st_spawn(bind(recover_joiner, listener, ref(map))); + // If there's anything to recover. + if (init.txnseqno() > 0) { + cout << "waiting for recovery" << endl; + + // Read the recovery message. + Recovery recovery; + foreach (st_netfd_t r, replicas) { + readmsg(r, recovery); } + for (int i = 0; i < recovery.pair_size(); i++) { + const Recovery_Pair &p = recovery.pair(i); + map[p.key()] = p.value(); + } + assert(seqno == -1 && + static_cast<typeof(seqno)>(recovery.seqno()) > seqno); + seqno = recovery.seqno(); + cout << "recovered." << endl; - st_sleep(0); - cout << "here" << endl; - st_netfd_t leader = st_tcp_connect(host, port, timeout); - cout << "there" << endl; + while (!backlog.empty()) { + Txn *p = backlog.take(); + process_txn(leader, map, *p, seqno); + delete p; + } + cout << "caught up." << endl; + } - // Read the initialization message. - Init init; - readmsg(leader, init); + st_join(proc); + st_join(rec); + foreach (st_netfd_t r, replicas) { + check0x(st_netfd_close(r)); + } + check0x(st_netfd_close(leader)); +} - // Display the info. - cout << "hosts:" << endl; - vector<st_netfd_t> replicas; - for (uint16_t i = 0; i < init.node_size(); i++) { - const SockAddr &sa = init.node(i); - char buf[INET_ADDRSTRLEN]; - in_addr host; - host.s_addr = sa.host(); - cout << checkpass(inet_ntop(AF_INET, &host, buf, INET_ADDRSTRLEN)) - << ':' << sa.port() << endl; - if (is_joiner) - replicas.push_back(st_tcp_connect(host, static_cast<uint16_t>(7655+i), - timeout)); - } +int sig_pipe[2]; - if (is_joiner) { - // Read the recovery message. 
- Recovery recovery; - readmsg(replicas[0], recovery); - for (int i = 0; i < recovery.pair_size(); i++) { - const Recovery_Pair &p = recovery.pair(i); - map[p.key()] = p.value(); +/** + * Raw signal handler that triggers the (synchronous) handler. + */ +void handle_sig(int sig) { + int err = errno; + cerr << "got signal: " << sig << endl; + checkeqnneg(write(sig_pipe[1], &sig, sizeof sig), sizeof sig); + errno = err; +} + +/** + * Synchronous part of the signal handler; cleanly interrrupts any threads that + * have marked themselves as interruptible. + */ +void handle_sig_sync() { + stfd fd = checkerr(st_netfd_open(sig_pipe[0])); + while (true) { + int sig; + checkeqnneg(st_read(fd, &sig, sizeof sig, ST_UTIME_NO_TIMEOUT), + sizeof sig); + if (sig == SIGINT) { + if (!stop_hub) stop_hub.set(); + else kill_hub.set(); + } else if (sig == SIGTERM) { + foreach (st_thread_t t, threads) { + st_thread_interrupt(t); } - cout << "recovered." << endl; - - // Notify the leader. - Ready ready; - sendmsg(leader, ready); } + break; + } +} - // Process txns. 
- st_thread_t t = st_spawn(bind(process_txns, leader, ref(map))); - check0x(st_thread_join(t, nullptr)); +int +main(int argc, char **argv) +{ + try { + GOOGLE_PROTOBUF_VERIFY_VERSION; - foreach (st_netfd_t r, replicas) { - check0x(st_netfd_close(r)); + check0x(pipe(sig_pipe)); + struct sigaction sa; + sa.sa_handler = handle_sig; + check0x(sigemptyset(&sa.sa_mask)); + sa.sa_flags = 0; + check0x(sigaction(SIGINT, &sa, nullptr)); + + //check0x(st_set_eventsys(ST_EVENTSYS_ALT)); + check0x(st_init()); + thread_eraser eraser; + st_spawn(bind(handle_sig_sync)); + threads.insert(st_thread_self()); + if (argc != 2 && argc != 4) + die("leader: ydb <nreplicas>\n" + "replica: ydb <leaderhost> <leaderport> <listenport>\n"); + bool is_leader = argc == 2; + + if (is_leader) { + run_leader(atoi(argv[1])); + } else { + run_replica(argv[1], + static_cast<uint16_t>(atoi(argv[2])), + static_cast<uint16_t>(atoi(argv[3]))); } - check0x(st_netfd_close(leader)); + + return 0; + } catch (const exception &ex) { + cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; + return 1; } - - return 0; } Modified: ydb/trunk/src/ydb.proto =================================================================== --- ydb/trunk/src/ydb.proto 2008-11-30 23:45:26 UTC (rev 1080) +++ ydb/trunk/src/ydb.proto 2008-11-30 23:46:31 UTC (rev 1081) @@ -1,10 +1,34 @@ +option optimize_for = SPEED; + +// A socket address (host:port). message SockAddr { - required int32 host = 1; - required int32 port = 2; + required uint32 host = 1; + required uint32 port = 2; } + +// Join request sent from nodes to leader. +message Join { + // The port on which the joining replica will listen for connections. + required uint32 port = 1; +} + +// Initialization message sent to a nodes when it joins. message Init { - repeated SockAddr node = 1; + // The current seqno that the server is on. + required uint32 txnseqno = 1; + // What the leader perceives to be the joining replica's IP address. 
+ required uint32 yourhost = 2; + // The nodes that have joined (including the joining node); the ports here + // are the ports on which the nodes are listening. + repeated SockAddr node = 3; } + +// Sent to already-joined nodes to inform them of a newly joining node. +message Joined { + required SockAddr node = 1; +} + +// A single operation in a transaction. message Op { enum OpType { read = 0; @@ -15,19 +39,35 @@ required int32 key = 2; optional int32 value = 3; } + +// A transaction. Currently just a simple sequence of Ops. message Txn { - repeated Op op = 1; + optional uint32 seqno = 1; + repeated Op op = 2; } + +// Response to a transaction, containing a list of results. message Response { - repeated int32 result = 1; + // The txn that this is a response for. + required uint32 seqno = 1; + // The list of answers to read operations. + repeated int32 result = 2; } -message Ready { - optional int32 ready = 1; -} + +// Message from a running node to a joining node to bring it up to speed. message Recovery { message Pair { required int32 key = 1; required int32 value = 2; } - repeated Pair pair = 1; + // The seqno that this recovery message will bring us up through (the last + // txn seqno before the snapshot was generated). + required uint32 seqno = 1; + // The data map. + repeated Pair pair = 2; } + +// Message from a joining node to the leader to inform it that it is fully back +// into action. 
+message Ready { +} Deleted: ydb/trunk/src/ydb.thrift =================================================================== --- ydb/trunk/src/ydb.thrift 2008-11-30 23:45:26 UTC (rev 1080) +++ ydb/trunk/src/ydb.thrift 2008-11-30 23:46:31 UTC (rev 1081) @@ -1,9 +0,0 @@ -enum op_type { read, write, rm } -struct sock_addr { i32 host, i16 port } -struct init { list<sock_addr> node } -struct op { op_type type, i32 key, optional i32 value } -struct txn { list<op> op } -struct response { list<i32> results } -struct ready {} -struct pair { i32 key, i32 value } -struct recovery { list<pair> pairs } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2008-12-02 05:37:05
Revision: 1083 http://assorted.svn.sourceforge.net/assorted/?rev=1083&view=rev Author: yangzhang Date: 2008-12-02 05:37:00 +0000 (Tue, 02 Dec 2008) Log Message: ----------- - updated code and docs so that the system works with >2 replicas - only instruct a single node to help recover - only the recoverer connects to the helping node, and only that node - added another mode where the recoverer can process the backlog in concurrently with pushing live txns onto it - re-enabled killing of process_txns - bumped up the chkpt threshold - print strerror of errors - fixed signed/unsigned comparisons and other warnings (changed seqnos in pb msgs to be signed) - more clean-up - migrated to boost 1.37 (mainly had to avoid) Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/Makefile ydb/trunk/src/main.lzz ydb/trunk/src/ydb.proto Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2008-12-02 05:34:16 UTC (rev 1082) +++ ydb/trunk/README 2008-12-02 05:37:00 UTC (rev 1083) @@ -7,29 +7,29 @@ [VOLTDB]: http://db.cs.yale.edu/hstore/ -Currently, the only recovery implemented mechanism is to have one of the -replicas serialize the entire database state and send that to the joining node. +Currently, the only recovery implemented mechanism is to have the first-joining +replica serialize its entire database state and send that to the joining node. If you start a system of $n$ replicas, then the leader will wait for $n-1$ of them to join before it starts issuing transactions. (Think of $n-1$ as the minimum number of replicas the system requires before it is willing to process transactions.) Then when replica $n$ joins, it will need to catch up to the -current state of the system, and it will do so by contacting one of the other -replicas and requesting a complete dump of its DB state. +current state of the system, and it will do so by contacting that first replica +and receiving a complete dump of its DB state. 
The leader will report the current txn seqno to the joiner, and start streaming txns beyond that seqno to the joiner, which the joiner will push onto its -backlog. It will also instruct the standing replicas to snapshot and send -their DB state at this txn seqno. As a result, the standing replicas will -pause once they get this message until they can send their state to the joiner. +backlog. It will also instruct that first replica to snapshot its DB state at +this txn seqno and prepare to send it to the recovering node as soon as it +connects. Setup ----- Requirements: -- [boost] 1.35 -- [C++ Commons] svn r1074 +- [boost] 1.37 +- [C++ Commons] svn r1082 - [GCC] 4.3.2 - [Lazy C++] 2.8.0 - [Protocol Buffers] 2.0.0 @@ -45,19 +45,29 @@ Usage ----- -To start a leader to manage 2 replicas, run: +To start a leader to manage 3 replicas, run: - ./ydb 2 + ./ydb 3 -This will listen on port 7654. Then to start a replica, run: +This will listen on port 7654. Then to start the first two replicas, run: ./ydb localhost 7654 7655 + ./ydb localhost 7654 7656 This means "connect to the leader at localhost:7654, and listen on port 7655." -The replicas have to listen for connections from other replicas. +The replicas have to listen for connections from other replicas (namely the +recovering replica). -Currently handles only 2 replicas. +The recovering replica then joins: + ./ydb localhost 7654 7657 + +It will connect to the first replica (on port 7655) and receive a DB dump from it. + +To terminate the system, send a sigint (ctrl-c) to the leader, and a clean +shutdown should take place. The replicas dump their DB state to a tmp file, +which you can then verify to be identical. + Pseudo-code ----------- @@ -96,6 +106,10 @@ Todo ---- +- Expose program options. + +- Add test suite. 
+ - Add benchmarking information, e.g.: - txns/second normally - txns/second during recovery Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2008-12-02 05:34:16 UTC (rev 1082) +++ ydb/trunk/src/Makefile 2008-12-02 05:37:00 UTC (rev 1083) @@ -22,8 +22,9 @@ LDFLAGS := -lstx -lst -lresolv -lpthread -lprotobuf CXXFLAGS := -g3 -Wall -Werror -Wextra -Woverloaded-virtual -Wconversion \ -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings \ - -Winit-self -Wno-sign-compare -Wno-unused-parameter -Wc++0x-compat \ - -Wparentheses + -Winit-self -Wsign-promo -Wno-unused-parameter -Wc++0x-compat \ + -Wparentheses -Wmissing-format-attribute -Wfloat-equal \ + -Winline -Wsynth PBCXXFLAGS := -g3 -Wall -Werror all: $(TARGET) Modified: ydb/trunk/src/main.lzz =================================================================== --- ydb/trunk/src/main.lzz 2008-12-02 05:34:16 UTC (rev 1082) +++ ydb/trunk/src/main.lzz 2008-12-02 05:37:00 UTC (rev 1083) @@ -1,8 +1,3 @@ -// TODO -// - how does replication affect overhead? -// - implement other recovery schemes (disk-based) -// - verify correctness of the simple recovery scheme - #hdr #include <boost/bind.hpp> #include <boost/foreach.hpp> @@ -13,6 +8,7 @@ #include <csignal> #include <cstdio> #include <cstdlib> +#include <cstring> #include <iostream> #include <fstream> #include <map> @@ -32,7 +28,7 @@ // Why does just timeout require the `extern`? extern const st_utime_t timeout = 1000000; -const int chkpt = 1000; +const int chkpt = 10000; const bool verbose = true; const uint16_t base_port = 7654; st_intr_bool stop_hub, kill_hub; @@ -60,7 +56,7 @@ thread_eraser eraser; try { f(); - } catch (const exception &ex) { + } catch (const std::exception &ex) { cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; if (intr) stop_hub.set(); } @@ -162,9 +158,9 @@ // Broadcast the length-prefixed message to replicas. 
foreach (st_netfd_t dst, dsts) { checkeqnneg(st_write(dst, static_cast<void*>(&len), sizeof len, timeout), - static_cast<int>(sizeof len)); + static_cast<ssize_t>(sizeof len)); checkeqnneg(st_write(dst, buf, s.size(), ST_UTIME_NO_TIMEOUT), - static_cast<int>(s.size())); + static_cast<ssize_t>(s.size())); } } @@ -190,7 +186,7 @@ uint32_t len; checkeqnneg(st_read_fully(src, static_cast<void*>(&len), sizeof len, timeout), - static_cast<int>(sizeof len)); + static_cast<ssize_t>(sizeof len)); len = ntohl(len); #define GETMSG(buf) \ @@ -225,11 +221,10 @@ while (!stop_hub) { // Did we get a new member? + if (!newreps.empty() && seqno > 0) { + sendmsg(fds[0], Txn()); + } while (!newreps.empty()) { - if (seqno > 0) { - Txn txn; - bcastmsg(fds, txn); - } fds.push_back(newreps.take().fd()); } @@ -294,10 +289,10 @@ { while (true) { Txn txn; - //{ - //st_intr intr(stop_hub); + { + st_intr intr(kill_hub); readmsg(leader, txn); - //} + } if (txn.has_seqno()) { if (txn.seqno() == seqno + 1) { @@ -306,18 +301,19 @@ // Queue up for later processing once a snapshot has been received. backlog.push(new Txn(txn)); } + + if (txn.seqno() % chkpt == 0) { + if (verbose) cout << "processed txn " << txn.seqno() << endl; + st_sleep(0); + } } else { // Wait for the snapshot to be generated. 
send_state.set(); cout << "waiting for state to be sent" << endl; sent_state.waitset(); sent_state.reset(); + cout << "state sent" << endl; } - - if (txn.seqno() % chkpt == 0) { - if (verbose) cout << "processed txn " << txn.seqno() << endl; - st_sleep(0); - } } } @@ -338,8 +334,8 @@ cout << "got response " << res.seqno() << " from " << replica << endl; st_sleep(0); } - if (stop_hub && res.seqno() == seqno - 1) { - cout << "seqno = " << seqno - 1 << endl; + if (stop_hub && res.seqno() + 1 == seqno) { + cout << "seqno = " << res.seqno() << endl; break; } } @@ -352,17 +348,11 @@ recover_joiner(st_netfd_t listener, const map<int, int> &map, const int &seqno, st_bool &send_state, st_bool &sent_state) { - st_netfd_t joiner; + // Wait for the right time to generate the snapshot. { st_intr intr(stop_hub); - joiner = checkerr(st_accept(listener, nullptr, nullptr, - ST_UTIME_NO_TIMEOUT)); + send_state.waitset(); } - st_closing closing(joiner); - cout << "got recoverer's connection" << endl; - - // Wait for the right time to generate the snapshot. - send_state.waitset(); send_state.reset(); cout << "snapshotting state for recovery" << endl; @@ -377,7 +367,16 @@ // Notify process_txns that it may continue processing. sent_state.set(); - cout << "sending recovery" << endl; + // Wait for the new joiner. + st_netfd_t joiner; + { + st_intr intr(stop_hub); + joiner = checkerr(st_accept(listener, nullptr, nullptr, + ST_UTIME_NO_TIMEOUT)); + } + st_closing closing(joiner); + + cout << "got joiner's connection, sending recovery" << endl; sendmsg(joiner, recovery); cout << "sent" << endl; } @@ -402,7 +401,7 @@ { st_intr intr(stop_hub); fd = checkerr(st_accept(listener, nullptr, nullptr, - ST_UTIME_NO_TIMEOUT)); + ST_UTIME_NO_TIMEOUT)); } Join join; readmsg(fd, join); @@ -500,10 +499,10 @@ cout << "- " << checkerr(inet_ntop(AF_INET, &host, buf, INET_ADDRSTRLEN)) << ':' << sa.port() << (is_self ? 
" (self)" : "") << endl; - if (!is_self) { + if (!is_self && init.txnseqno() > 0) { replicas.push_back(st_tcp_connect(host, static_cast<uint16_t>(sa.port()), - timeout)); + ST_UTIME_NO_TIMEOUT)); } } @@ -515,13 +514,11 @@ // If there's anything to recover. if (init.txnseqno() > 0) { - cout << "waiting for recovery" << endl; + cout << "waiting for recovery from " << replicas[0] << endl; // Read the recovery message. Recovery recovery; - foreach (st_netfd_t r, replicas) { - readmsg(r, recovery); - } + readmsg(replicas[0], recovery); for (int i = 0; i < recovery.pair_size(); i++) { const Recovery_Pair &p = recovery.pair(i); map[p.key()] = p.value(); @@ -534,6 +531,10 @@ while (!backlog.empty()) { Txn *p = backlog.take(); process_txn(leader, map, *p, seqno); + if (p->seqno() % chkpt == 0) { + cout << "processed txn " << p->seqno() << " off the backlog" << endl; + st_sleep(0); + } delete p; } cout << "caught up." << endl; @@ -554,8 +555,9 @@ */ void handle_sig(int sig) { int err = errno; - cerr << "got signal: " << sig << endl; - checkeqnneg(write(sig_pipe[1], &sig, sizeof sig), sizeof sig); + cerr << "got signal: " << strsignal(sig) << " (" << sig << ")" << endl; + checkeqnneg(write(sig_pipe[1], &sig, sizeof sig), + static_cast<ssize_t>(sizeof sig)); errno = err; } @@ -568,7 +570,7 @@ while (true) { int sig; checkeqnneg(st_read(fd, &sig, sizeof sig, ST_UTIME_NO_TIMEOUT), - sizeof sig); + static_cast<ssize_t>(sizeof sig)); if (sig == SIGINT) { if (!stop_hub) stop_hub.set(); else kill_hub.set(); @@ -613,7 +615,7 @@ } return 0; - } catch (const exception &ex) { + } catch (const std::exception &ex) { cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; return 1; } Modified: ydb/trunk/src/ydb.proto =================================================================== --- ydb/trunk/src/ydb.proto 2008-12-02 05:34:16 UTC (rev 1082) +++ ydb/trunk/src/ydb.proto 2008-12-02 05:37:00 UTC (rev 1083) @@ -15,7 +15,7 @@ // Initialization message sent to a nodes when it 
joins. message Init { // The current seqno that the server is on. - required uint32 txnseqno = 1; + required int32 txnseqno = 1; // What the leader perceives to be the joining replica's IP address. required uint32 yourhost = 2; // The nodes that have joined (including the joining node); the ports here @@ -42,14 +42,14 @@ // A transaction. Currently just a simple sequence of Ops. message Txn { - optional uint32 seqno = 1; + optional int32 seqno = 1; repeated Op op = 2; } // Response to a transaction, containing a list of results. message Response { // The txn that this is a response for. - required uint32 seqno = 1; + required int32 seqno = 1; // The list of answers to read operations. repeated int32 result = 2; } @@ -62,7 +62,7 @@ } // The seqno that this recovery message will bring us up through (the last // txn seqno before the snapshot was generated). - required uint32 seqno = 1; + required int32 seqno = 1; // The data map. repeated Pair pair = 2; } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2008-12-02 21:05:16
Revision: 1084 http://assorted.svn.sourceforge.net/assorted/?rev=1084&view=rev Author: yangzhang Date: 2008-12-02 21:05:08 +0000 (Tue, 02 Dec 2008) Log Message: ----------- - added information on when the recovering joiner catches up (as measured by the leader) - made the optional recovery yields an option - more cleanup/comments, updated docs Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/main.lzz ydb/trunk/src/ydb.proto Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2008-12-02 05:37:00 UTC (rev 1083) +++ ydb/trunk/README 2008-12-02 21:05:08 UTC (rev 1084) @@ -62,12 +62,23 @@ ./ydb localhost 7654 7657 -It will connect to the first replica (on port 7655) and receive a DB dump from it. +It will connect to the first replica (on port 7655) and receive a DB dump from +it. To terminate the system, send a sigint (ctrl-c) to the leader, and a clean shutdown should take place. The replicas dump their DB state to a tmp file, which you can then verify to be identical. +Recovery Mechanisms +------------------- + +The following are currently implemented: + +- Network recovery + - From a single node + - Interleave the state recovery/catch up with the backlogging of live txns + - Recover/catch up in one swoop, then backlog the live txns + Pseudo-code ----------- @@ -110,8 +121,13 @@ - Add test suite. 
+- Add benchmarking hooks, e.g.: + - start the recovering joiner at a well-defined time (after a certain # txns + or after the DB reaches a certain size) + - Add benchmarking information, e.g.: - txns/second normally + - txns during recovery - txns/second during recovery - time to recover - bytes used to recover Modified: ydb/trunk/src/main.lzz =================================================================== --- ydb/trunk/src/main.lzz 2008-12-02 05:37:00 UTC (rev 1083) +++ ydb/trunk/src/main.lzz 2008-12-02 21:05:08 UTC (rev 1084) @@ -5,6 +5,7 @@ #include <boost/scoped_array.hpp> #include <commons/nullptr.h> #include <commons/st/st.h> +#include <commons/time.h> #include <csignal> #include <cstdio> #include <cstdlib> @@ -30,11 +31,15 @@ extern const st_utime_t timeout = 1000000; const int chkpt = 10000; const bool verbose = true; +const bool yield_during_recovery = false; +const bool yield_during_catch_up = false; +const bool use_epoll = false; const uint16_t base_port = 7654; st_intr_bool stop_hub, kill_hub; /** - * The list of threads. + * The list of all threads. Keep track of these so that we may cleanly shut + * down all threads. */ set<st_thread_t> threads; @@ -256,11 +261,13 @@ * leader. */ void -process_txn(st_netfd_t leader, map<int, int> &map, const Txn &txn, int &seqno) +process_txn(st_netfd_t leader, map<int, int> &map, const Txn &txn, int &seqno, + bool caught_up) { checkeq(txn.seqno(), seqno + 1); Response res; res.set_seqno(txn.seqno()); + res.set_caught_up(caught_up); seqno = txn.seqno(); for (int o = 0; o < txn.op_size(); o++) { const Op &op = txn.op(o); @@ -296,7 +303,7 @@ if (txn.has_seqno()) { if (txn.seqno() == seqno + 1) { - process_txn(leader, map, txn, seqno); + process_txn(leader, map, txn, seqno, true); } else { // Queue up for later processing once a snapshot has been received. backlog.push(new Txn(txn)); @@ -321,14 +328,19 @@ * Keep swallowing replica responses. 
*/ void -handle_responses(st_netfd_t replica, const int &seqno) +handle_responses(st_netfd_t replica, const int &seqno, bool caught_up) { + long long start_time = current_time_millis(); while (true) { Response res; { st_intr intr(kill_hub); readmsg(replica, res); } + if (!caught_up && res.caught_up()) { + caught_up = true; + cout << "recovering node caught up; took " << current_time_millis() - start_time << "ms" << endl; + } if (res.seqno() % chkpt == 0) { if (verbose) cout << "got response " << res.seqno() << " from " << replica << endl; @@ -434,7 +446,7 @@ // Start handling responses. st_thread_group handlers; foreach (replica_info r, replicas) { - handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno)))); + handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno), true))); } // Accept the recovering node, and tell it about the online replicas. @@ -454,7 +466,7 @@ cout << "start streaming txns to joiner" << endl; replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); newreps.push(replicas.back()); - handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno)))); + handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), false))); } /** @@ -522,6 +534,9 @@ for (int i = 0; i < recovery.pair_size(); i++) { const Recovery_Pair &p = recovery.pair(i); map[p.key()] = p.value(); + if (i % chkpt == 0) { + if (yield_during_recovery) st_sleep(0); + } } assert(seqno == -1 && static_cast<typeof(seqno)>(recovery.seqno()) > seqno); @@ -530,10 +545,10 @@ while (!backlog.empty()) { Txn *p = backlog.take(); - process_txn(leader, map, *p, seqno); + process_txn(leader, map, *p, seqno, false); if (p->seqno() % chkpt == 0) { cout << "processed txn " << p->seqno() << " off the backlog" << endl; - st_sleep(0); + if (yield_during_catch_up) st_sleep(0); } delete p; } @@ -583,12 +598,16 @@ } } +/** + * Initialization and command-line parsing. 
+ */ int main(int argc, char **argv) { try { GOOGLE_PROTOBUF_VERIFY_VERSION; + // Initialize support for ST working with asynchronous signals. check0x(pipe(sig_pipe)); struct sigaction sa; sa.sa_handler = handle_sig; @@ -596,16 +615,22 @@ sa.sa_flags = 0; check0x(sigaction(SIGINT, &sa, nullptr)); - //check0x(st_set_eventsys(ST_EVENTSYS_ALT)); + // Initialize ST. + if (use_epoll) check0x(st_set_eventsys(ST_EVENTSYS_ALT)); check0x(st_init()); + st_spawn(bind(handle_sig_sync)); + + // Initialize thread manager for clean shutdown of all threads. thread_eraser eraser; - st_spawn(bind(handle_sig_sync)); threads.insert(st_thread_self()); + + // Parse command-line arguments. if (argc != 2 && argc != 4) die("leader: ydb <nreplicas>\n" "replica: ydb <leaderhost> <leaderport> <listenport>\n"); bool is_leader = argc == 2; + // Which role are we? if (is_leader) { run_leader(atoi(argv[1])); } else { @@ -616,6 +641,7 @@ return 0; } catch (const std::exception &ex) { + // Must catch all exceptions at the top to make the stack unwind. cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; return 1; } Modified: ydb/trunk/src/ydb.proto =================================================================== --- ydb/trunk/src/ydb.proto 2008-12-02 05:37:00 UTC (rev 1083) +++ ydb/trunk/src/ydb.proto 2008-12-02 21:05:08 UTC (rev 1084) @@ -52,6 +52,8 @@ required int32 seqno = 1; // The list of answers to read operations. repeated int32 result = 2; + // Whether the replica has caught_up. + required bool caught_up = 3; } // Message from a running node to a joining node to bring it up to speed. This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2008-12-04 10:24:43
Revision: 1093 http://assorted.svn.sourceforge.net/assorted/?rev=1093&view=rev Author: yangzhang Date: 2008-12-04 10:24:36 +0000 (Thu, 04 Dec 2008) Log Message: ----------- - added command-line options - added a "higher-level" readmsg() that relies on RVO - fixed and lifted random number generator - cleaned up includes, RAII - updated doc Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/Makefile ydb/trunk/src/main.lzz Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2008-12-04 10:24:21 UTC (rev 1092) +++ ydb/trunk/README 2008-12-04 10:24:36 UTC (rev 1093) @@ -10,12 +10,11 @@ Currently, the only recovery implemented mechanism is to have the first-joining replica serialize its entire database state and send that to the joining node. -If you start a system of $n$ replicas, then the leader will wait for $n-1$ of -them to join before it starts issuing transactions. (Think of $n-1$ as the -minimum number of replicas the system requires before it is willing to process -transactions.) Then when replica $n$ joins, it will need to catch up to the -current state of the system, and it will do so by contacting that first replica -and receiving a complete dump of its DB state. +If you start a system with a minimum of $n$ replicas, then the leader will wait +for that many to them to join before it starts issuing transactions. Then when +replica $n+1$ joins, it will need to catch up to the current state of the +system; it will do so by contacting the first-joining replica and receiving a +complete dump of its DB state. The leader will report the current txn seqno to the joiner, and start streaming txns beyond that seqno to the joiner, which the joiner will push onto its @@ -45,22 +44,23 @@ Usage ----- -To start a leader to manage 3 replicas, run: +To start a leader, run: - ./ydb 3 + ./ydb -l -This will listen on port 7654. 
Then to start the first two replicas, run: +Then to start the first two replicas, run: - ./ydb localhost 7654 7655 - ./ydb localhost 7654 7656 + ./ydb -p 7655 + ./ydb -p 7656 -This means "connect to the leader at localhost:7654, and listen on port 7655." -The replicas have to listen for connections from other replicas (namely the -recovering replica). +This means "connect to the leader at localhost:7654, and listen on port +7655/7656." The replicas have to listen for connections from other replicas +(namely the recovering replica). The leader waits for the minimum number +(default of 2) of replicas to join before beginning to issue transactions. The recovering replica then joins: - ./ydb localhost 7654 7657 + ./ydb -p 7657 It will connect to the first replica (on port 7655) and receive a DB dump from it. Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2008-12-04 10:24:21 UTC (rev 1092) +++ ydb/trunk/src/Makefile 2008-12-04 10:24:36 UTC (rev 1093) @@ -19,8 +19,9 @@ SRCS := $(GENSRCS) OBJS := $(GENOBJS) -LDFLAGS := -lstx -lst -lresolv -lpthread -lprotobuf -CXXFLAGS := -g3 -Wall -Werror -Wextra -Woverloaded-virtual -Wconversion \ +LDFLAGS := -lstx -lst -lresolv -lpthread -lprotobuf \ + -lboost_program_options-gcc43-mt +CXXFLAGS := -g3 -Wall -Werror -Wextra -Woverloaded-virtual -Wconversion -Wno-conversion -Wno-ignored-qualifiers \ -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings \ -Winit-self -Wsign-promo -Wno-unused-parameter -Wc++0x-compat \ -Wparentheses -Wmissing-format-attribute -Wfloat-equal \ Modified: ydb/trunk/src/main.lzz =================================================================== --- ydb/trunk/src/main.lzz 2008-12-04 10:24:21 UTC (rev 1092) +++ ydb/trunk/src/main.lzz 2008-12-04 10:24:36 UTC (rev 1093) @@ -1,22 +1,23 @@ #hdr #include <boost/bind.hpp> #include <boost/foreach.hpp> -#include <boost/lambda/lambda.hpp> +#include <boost/program_options.hpp> #include 
<boost/scoped_array.hpp> #include <commons/nullptr.h> +#include <commons/rand.h> #include <commons/st/st.h> #include <commons/time.h> -#include <csignal> +#include <csignal> // sigaction etc. #include <cstdio> -#include <cstdlib> -#include <cstring> +#include <cstring> // strsignal #include <iostream> #include <fstream> #include <map> +#include <netinet/in.h> // in_addr etc. #include <set> -#include <sstream> -#include <sys/types.h> -#include <unistd.h> +#include <sys/socket.h> // getpeername +#include <sys/types.h> // ssize_t +#include <unistd.h> // pipe, write #include <vector> #include "ydb.pb.h" #define foreach BOOST_FOREACH @@ -26,15 +27,11 @@ #end typedef pair<int, int> pii; - -// Why does just timeout require the `extern`? -extern const st_utime_t timeout = 1000000; -const int chkpt = 10000; -const bool verbose = true; -const bool yield_during_recovery = false; -const bool yield_during_catch_up = false; -const bool use_epoll = false; -const uint16_t base_port = 7654; +st_utime_t timeout; +int chkpt; +bool verbose; +bool yield_during_build_up; +bool yield_during_catch_up; st_intr_bool stop_hub, kill_hub; /** @@ -110,11 +107,11 @@ /** * RAII to close all contained netfds. */ -class st_closing_all +class st_closing_all_infos { public: - st_closing_all(const vector<replica_info>& rs) : rs_(rs) {} - ~st_closing_all() { + st_closing_all_infos(const vector<replica_info>& rs) : rs_(rs) {} + ~st_closing_all_infos() { foreach (replica_info r, rs_) check0x(st_netfd_close(r.fd())); } @@ -123,6 +120,21 @@ }; /** + * RAII to close all contained netfds. + */ +class st_closing_all +{ + public: + st_closing_all(const vector<st_netfd_t>& rs) : rs_(rs) {} + ~st_closing_all() { + foreach (st_netfd_t r, rs_) + check0x(st_netfd_close(r)); + } + private: + const vector<st_netfd_t> &rs_; +}; + +/** * RAII for dumping the final state of the DB to a file on disk. 
*/ class dump_state @@ -131,10 +143,9 @@ dump_state(const map<int, int> &map, const int &seqno) : map_(map), seqno_(seqno) {} ~dump_state() { - stringstream fname; - fname << "/tmp/ydb" << getpid(); - cout << "dumping DB state (" << seqno_ << ") to " << fname.str() << endl; - ofstream of(fname.str().c_str()); + string fname = string("/tmp/ydb") + lexical_cast<string>(getpid()); + cout << "dumping DB state (" << seqno_ << ") to " << fname << endl; + ofstream of(fname.c_str()); of << "seqno: " << seqno_ << endl; foreach (const pii &p, map_) { of << p.first << ": " << p.second << endl; @@ -209,10 +220,18 @@ } } -inline int -rand32(int max = RAND_MAX) +/** + * Same as the above readmsg(), but returns an internally constructed message. + * This is a "higher-level" readmsg() that relies on return-value optimization + * for avoiding unnecessary copies. + */ +template <typename T> +T +readmsg(st_netfd_t src, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) { - return static_cast<int>( random() / ( RAND_MAX / max ) ); + T msg; + readmsg(src, msg, timeout); + return msg; } /** @@ -236,10 +255,10 @@ // Generate a random transaction. Txn txn; txn.set_seqno(seqno++); - int count = rand32(5) + 1; + int count = randint(5) + 1; for (int o = 0; o < count; o++) { Op *op = txn.add_op(); - int rtype = rand32(3), rkey = rand32(), rvalue = rand32(); + int rtype = randint(3), rkey = randint(), rvalue = randint(); op->set_type(types[rtype]); op->set_key(rkey); op->set_value(rvalue); @@ -339,7 +358,8 @@ } if (!caught_up && res.caught_up()) { caught_up = true; - cout << "recovering node caught up; took " << current_time_millis() - start_time << "ms" << endl; + cout << "recovering node caught up; took " + << current_time_millis() - start_time << "ms" << endl; } if (res.seqno() % chkpt == 0) { if (verbose) @@ -397,26 +417,25 @@ * Run the leader. 
*/ void -run_leader(int nreps) +run_leader(int minreps, uint16_t leader_port) { cout << "starting as leader" << endl; + cout << "waiting for at least " << minreps << " replicas to join" << endl; // Wait until all replicas have joined. - st_netfd_t listener = st_tcp_listen(base_port); + st_netfd_t listener = st_tcp_listen(leader_port); st_closing close_listener(listener); // TODO rename these - int min_reps = nreps - 1; vector<replica_info> replicas; - st_closing_all close_replicas(replicas); - for (int i = 0; i < min_reps; i++) { + st_closing_all_infos close_replicas(replicas); + for (int i = 0; i < minreps; i++) { st_netfd_t fd; { st_intr intr(stop_hub); fd = checkerr(st_accept(listener, nullptr, nullptr, ST_UTIME_NO_TIMEOUT)); } - Join join; - readmsg(fd, join); + Join join = readmsg<Join>(fd); replicas.push_back(replica_info(fd, static_cast<uint16_t>(join.port()))); } @@ -456,8 +475,7 @@ joiner = checkerr(st_accept(listener, nullptr, nullptr, ST_UTIME_NO_TIMEOUT)); } - Join join; - readmsg(joiner, join); + Join join = readmsg<Join>(joiner); cout << "setting seqno to " << seqno << endl; init.set_txnseqno(seqno); sendmsg(joiner, init); @@ -473,7 +491,7 @@ * Run a replica. */ void -run_replica(char *leader_host, uint16_t leader_port, uint16_t listen_port) +run_replica(string leader_host, uint16_t leader_port, uint16_t listen_port) { // Initialize database state. map<int, int> map; @@ -481,28 +499,30 @@ dump_state ds(map, seqno); st_bool send_state, sent_state; - cout << "starting as replica" << endl; + cout << "starting as replica on port " << listen_port << endl; // Listen for connections from other replicas. st_netfd_t listener = st_tcp_listen(listen_port); - st_thread_t rec = my_spawn(bind(recover_joiner, listener, ref(map), - ref(seqno), ref(send_state), - ref(sent_state))); + st_joining join_rec(my_spawn(bind(recover_joiner, listener, ref(map), + ref(seqno), ref(send_state), + ref(sent_state)))); // Connect to the leader and join the system. 
- st_netfd_t leader = st_tcp_connect(leader_host, leader_port, timeout); + st_netfd_t leader = st_tcp_connect(leader_host.c_str(), leader_port, + timeout); + st_closing closing(leader); Join join; join.set_port(listen_port); sendmsg(leader, join); - Init init; - readmsg(leader, init); + Init init = readmsg<Init>(leader); uint32_t listen_host = init.yourhost(); // Display the info. cout << "got init msg with txn seqno " << init.txnseqno() << " and hosts:" << endl; vector<st_netfd_t> replicas; + st_closing_all close_replicas(replicas); for (uint16_t i = 0; i < init.node_size(); i++) { const SockAddr &sa = init.node(i); char buf[INET_ADDRSTRLEN]; @@ -520,9 +540,9 @@ // Process txns. st_channel<Txn*> backlog; - st_thread_t proc = my_spawn(bind(process_txns, leader, ref(map), ref(seqno), - ref(send_state), ref(sent_state), - ref(backlog))); + st_joining join_proc(my_spawn(bind(process_txns, leader, ref(map), + ref(seqno), ref(send_state), + ref(sent_state), ref(backlog)))); // If there's anything to recover. if (init.txnseqno() > 0) { @@ -530,12 +550,15 @@ // Read the recovery message. Recovery recovery; - readmsg(replicas[0], recovery); + { + st_intr intr(stop_hub); + readmsg(replicas[0], recovery); + } for (int i = 0; i < recovery.pair_size(); i++) { const Recovery_Pair &p = recovery.pair(i); map[p.key()] = p.value(); if (i % chkpt == 0) { - if (yield_during_recovery) st_sleep(0); + if (yield_during_build_up) st_sleep(0); } } assert(seqno == -1 && @@ -554,13 +577,6 @@ } cout << "caught up." << endl; } - - st_join(proc); - st_join(rec); - foreach (st_netfd_t r, replicas) { - check0x(st_netfd_close(r)); - } - check0x(st_netfd_close(leader)); } int sig_pipe[2]; @@ -604,9 +620,65 @@ int main(int argc, char **argv) { + namespace po = boost::program_options; try { GOOGLE_PROTOBUF_VERIFY_VERSION; + bool is_leader, use_epoll; + int minreps; + uint16_t leader_port, listen_port; + string leader_host; + + // Parse options. 
+ po::options_description desc("Allowed options"); + desc.add_options() + ("help,h", "show this help message") + ("verbose,v", "enable periodic printing of txn processing progress") + ("epoll,e", po::bool_switch(&use_epoll), + "use epoll (select is used by default)") + ("yield-build-up", po::bool_switch(&yield_during_build_up), + "yield periodically during build-up phase of recovery") + ("yield-catch-up", po::bool_switch(&yield_during_catch_up), + "yield periodically during catch-up phase of recovery") + ("leader,l", po::bool_switch(&is_leader), + "run the leader (run replica by default)") + ("leader-host,H", + po::value<string>(&leader_host)->default_value(string("localhost")), + "hostname or address of the leader") + ("leader-port,P", + po::value<uint16_t>(&leader_port)->default_value(7654), + "port the leader listens on") + ("chkpt,c", po::value<int>(&chkpt)->default_value(10000), + "number of txns before yielding/verbose printing") + ("listen-port,p", po::value<uint16_t>(&listen_port), + "port to listen on (replicas only)") + ("timeout,t", po::value<st_utime_t>(&timeout)->default_value(1000000), + "timeout for IO operations (in microseconds)") + ("minreps,n", po::value<int>(&minreps)->default_value(2), + "minimum number of replicas the system is willing to process txns on"); + + po::variables_map vm; + try { + po::store(po::parse_command_line(argc, argv, desc), vm); + po::notify(vm); + + if (vm.count("help")) { + cout << desc << endl; + return 0; + } + if (!is_leader && !vm.count("listen-port")) { + class parse_exception : public std::exception { + virtual const char *what() const throw() { + return "running replica requires listen port to be specified"; + } + }; + throw parse_exception(); + } + } catch (std::exception &ex) { + cerr << ex.what() << endl << endl << desc << endl; + return 1; + } + // Initialize support for ST working with asynchronous signals. 
check0x(pipe(sig_pipe)); struct sigaction sa; @@ -624,19 +696,11 @@ thread_eraser eraser; threads.insert(st_thread_self()); - // Parse command-line arguments. - if (argc != 2 && argc != 4) - die("leader: ydb <nreplicas>\n" - "replica: ydb <leaderhost> <leaderport> <listenport>\n"); - bool is_leader = argc == 2; - // Which role are we? if (is_leader) { - run_leader(atoi(argv[1])); + run_leader(minreps, leader_port); } else { - run_replica(argv[1], - static_cast<uint16_t>(atoi(argv[2])), - static_cast<uint16_t>(atoi(argv[3]))); + run_replica(leader_host, leader_port, listen_port); } return 0; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2008-12-11 19:16:42
Revision: 1098 http://assorted.svn.sourceforge.net/assorted/?rev=1098&view=rev Author: yangzhang Date: 2008-12-11 19:16:31 +0000 (Thu, 11 Dec 2008) Log Message: ----------- - started using clamp - added benchmark performance summary output - fixed unnecessary premature spawn and final join with recover_joiner - fixed issue where stop_hub wasn't being interrupted - added optional time limit as an option, causing issue_txns to stop after some time - added default value for listen port (since nodes are frequently run on different hosts) - added some comment documentation - my_spawn doesn't interrupt other threads by default on exceptions - reworked the communication/synchronization between process_txns and recover_joiner - made channels safer - added full test/benchmark setup & deployment scripts - added patch to get clamp to build Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/Makefile Added Paths: ----------- ydb/trunk/src/main.lzz.clamp ydb/trunk/tools/ ydb/trunk/tools/clamp.patch ydb/trunk/tools/test.bash Removed Paths: ------------- ydb/trunk/src/main.lzz Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2008-12-08 07:57:26 UTC (rev 1097) +++ ydb/trunk/README 2008-12-11 19:16:31 UTC (rev 1098) @@ -29,6 +29,7 @@ - [boost] 1.37 - [C++ Commons] svn r1082 +- [clamp] 153 - [GCC] 4.3.2 - [Lazy C++] 2.8.0 - [Protocol Buffers] 2.0.0 @@ -36,6 +37,7 @@ [boost]: http://www.boost.org/ [C++ Commons]: http://assorted.sourceforge.net/cpp-commons/ +[clamp]: http://home.clara.net/raoulgough/clamp/ [GCC]: http://gcc.gnu.org/ [Lazy C++]: http://www.lazycplusplus.com/ [Protocol Buffers]: http://code.google.com/p/protobuf/ @@ -67,8 +69,28 @@ To terminate the system, send a sigint (ctrl-c) to the leader, and a clean shutdown should take place. The replicas dump their DB state to a tmp file, -which you can then verify to be identical. +which you can then verify to be identical. 
You can also send a sigint to a +replica to stop just that node. If something goes awry, you can send a second +sigint to try to force all working threads to shut down (any node, including +replicas, respond to ctrl-c). +Full System Test +---------------- + + ./test.bash full + +will configure all the farm machines to (1) have my proper initial environment, +(2) have all the prerequisite software, and (3) build ydb. This may take a +long time (particularly the boost-building phase). + + range='10 13' wait=5 ./test.bash run + +will run a leader on farm10, replicas on farm11 and farm12, and a recovering +replica on farm13 after 5 seconds. Pipe several runs of this to some files +(`*.out`), and plot the results with + + ./test.bash plot *.out + Recovery Mechanisms ------------------- @@ -117,11 +139,7 @@ Todo ---- -- Expose program options. - -- Add test suite. - -- Add benchmarking hooks, e.g.: +- Add benchmarking/testing hooks, e.g.: - start the recovering joiner at a well-defined time (after a certain # txns or after the DB reaches a certain size) @@ -136,6 +154,9 @@ - Figure out why things are running so slowly with >2 replicas. +- Add a network recovery scheme that grabs state partitions in parallel from + all other replicas. + - Add a variant of the recovery scheme so that the standing replicas can just send any snapshot of their DB beyond a certain seqno. 
The joiner can simply discard from its leader-populated backlog any txns before the seqno of the Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2008-12-08 07:57:26 UTC (rev 1097) +++ ydb/trunk/src/Makefile 2008-12-11 19:16:31 UTC (rev 1098) @@ -1,7 +1,7 @@ TARGET := ydb WTF := wtf -LZZS := $(wildcard *.lzz) +LZZS := $(patsubst %.clamp,%,$(wildcard *.lzz.clamp)) LZZHDRS := $(foreach lzz,$(LZZS),$(patsubst %.lzz,%.hh,$(lzz))) LZZSRCS := $(foreach lzz,$(LZZS),$(patsubst %.lzz,%.cc,$(lzz))) LZZOBJS := $(foreach lzz,$(LZZS),$(patsubst %.lzz,%.o,$(lzz))) @@ -51,12 +51,15 @@ %.pb.h: %.proto protoc --cpp_out=. $< +%.lzz: %.lzz.clamp + clamp < $< | sed "`echo -e '1i#src\n1a#end'`" > $@ + clean: - rm -f $(GENSRCS) $(GENHDRS) $(OBJS) $(TARGET) + rm -f $(GENSRCS) $(GENHDRS) $(OBJS) $(TARGET) main.lzz *.clamp_h doc: $(SRCS) $(HDRS) doxygen .PHONY: clean -.SECONDARY: $(SRCS) $(HDRS) $(OBJS) +.SECONDARY: $(SRCS) $(HDRS) $(OBJS) main.lzz Deleted: ydb/trunk/src/main.lzz =================================================================== --- ydb/trunk/src/main.lzz 2008-12-08 07:57:26 UTC (rev 1097) +++ ydb/trunk/src/main.lzz 2008-12-11 19:16:31 UTC (rev 1098) @@ -1,712 +0,0 @@ -#hdr -#include <boost/bind.hpp> -#include <boost/foreach.hpp> -#include <boost/program_options.hpp> -#include <boost/scoped_array.hpp> -#include <commons/nullptr.h> -#include <commons/rand.h> -#include <commons/st/st.h> -#include <commons/time.h> -#include <csignal> // sigaction etc. -#include <cstdio> -#include <cstring> // strsignal -#include <iostream> -#include <fstream> -#include <map> -#include <netinet/in.h> // in_addr etc. 
-#include <set> -#include <sys/socket.h> // getpeername -#include <sys/types.h> // ssize_t -#include <unistd.h> // pipe, write -#include <vector> -#include "ydb.pb.h" -#define foreach BOOST_FOREACH -using namespace boost; -using namespace commons; -using namespace std; -#end - -typedef pair<int, int> pii; -st_utime_t timeout; -int chkpt; -bool verbose; -bool yield_during_build_up; -bool yield_during_catch_up; -st_intr_bool stop_hub, kill_hub; - -/** - * The list of all threads. Keep track of these so that we may cleanly shut - * down all threads. - */ -set<st_thread_t> threads; - -class thread_eraser -{ - public: - thread_eraser() { threads.insert(st_thread_self()); } - ~thread_eraser() { threads.erase(st_thread_self()); } -}; - -/** - * Delegate for running thread targets. - * \param[in] f The function to execute. - * \param[in] intr Whether to signal stop_hub on an exception. - */ -void -my_spawn_helper(const function0<void> f, bool intr) -{ - thread_eraser eraser; - try { - f(); - } catch (const std::exception &ex) { - cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; - if (intr) stop_hub.set(); - } -} - -/** - * Spawn a thread using ST but wrap it in an exception handler that interrupts - * all other threads (hopefully causing them to unwind). - */ -st_thread_t -my_spawn(const function0<void> &f, bool intr = true) -{ - st_thread_t t = st_spawn(bind(my_spawn_helper, f, intr)); - threads.insert(t); - return t; -} - -/** - * Used by the leader to bookkeep information about replicas. - */ -class replica_info -{ - public: - replica_info(st_netfd_t fd, uint16_t port) : fd_(fd), port_(port) {} - st_netfd_t fd() const { return fd_; } - /** The port on which the replica is listening. */ - uint16_t port() const { return port_; } -#hdr -#define GETSA sockaddr_in sa; sockaddr(sa); return sa -#end - /** The port on which the replica connected to us. 
*/ - uint16_t local_port() const { GETSA.sin_port; } - uint32_t host() const { GETSA.sin_addr.s_addr; } - sockaddr_in sockaddr() const { GETSA; } - void sockaddr(sockaddr_in &sa) const { - socklen_t salen = sizeof sa; - check0x(getpeername(st_netfd_fileno(fd_), - reinterpret_cast<struct sockaddr*>(&sa), - &salen)); - } - private: - st_netfd_t fd_; - uint16_t port_; -}; - -/** - * RAII to close all contained netfds. - */ -class st_closing_all_infos -{ - public: - st_closing_all_infos(const vector<replica_info>& rs) : rs_(rs) {} - ~st_closing_all_infos() { - foreach (replica_info r, rs_) - check0x(st_netfd_close(r.fd())); - } - private: - const vector<replica_info> &rs_; -}; - -/** - * RAII to close all contained netfds. - */ -class st_closing_all -{ - public: - st_closing_all(const vector<st_netfd_t>& rs) : rs_(rs) {} - ~st_closing_all() { - foreach (st_netfd_t r, rs_) - check0x(st_netfd_close(r)); - } - private: - const vector<st_netfd_t> &rs_; -}; - -/** - * RAII for dumping the final state of the DB to a file on disk. - */ -class dump_state -{ - public: - dump_state(const map<int, int> &map, const int &seqno) - : map_(map), seqno_(seqno) {} - ~dump_state() { - string fname = string("/tmp/ydb") + lexical_cast<string>(getpid()); - cout << "dumping DB state (" << seqno_ << ") to " << fname << endl; - ofstream of(fname.c_str()); - of << "seqno: " << seqno_ << endl; - foreach (const pii &p, map_) { - of << p.first << ": " << p.second << endl; - } - } - private: - const map<int, int> &map_; - const int &seqno_; -}; - -/** - * Send a message to some destinations (sequentially). - */ -template<typename T> -void -bcastmsg(const vector<st_netfd_t> &dsts, const T & msg) -{ - // Serialize message to a buffer. - string s; - check(msg.SerializeToString(&s)); - const char *buf = s.c_str(); - - // Prefix the message with a four-byte length. - uint32_t len = htonl(static_cast<uint32_t>(s.size())); - - // Broadcast the length-prefixed message to replicas. 
- foreach (st_netfd_t dst, dsts) { - checkeqnneg(st_write(dst, static_cast<void*>(&len), sizeof len, timeout), - static_cast<ssize_t>(sizeof len)); - checkeqnneg(st_write(dst, buf, s.size(), ST_UTIME_NO_TIMEOUT), - static_cast<ssize_t>(s.size())); - } -} - -/** - * Send a message to a single recipient. - */ -template<typename T> -void -sendmsg(st_netfd_t dst, const T &msg) -{ - vector<st_netfd_t> dsts(1, dst); - bcastmsg(dsts, msg); -} - -/** - * Read a message. - */ -template <typename T> -void -readmsg(st_netfd_t src, T & msg, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) -{ - // Read the message length. - uint32_t len; - checkeqnneg(st_read_fully(src, static_cast<void*>(&len), sizeof len, - timeout), - static_cast<ssize_t>(sizeof len)); - len = ntohl(len); - -#define GETMSG(buf) \ - checkeqnneg(st_read_fully(src, buf, len, timeout), (int) len); \ - check(msg.ParseFromArray(buf, len)); - - // Parse the message body. - if (len < 4096) { - char buf[len]; - GETMSG(buf); - } else { - cout << "receiving large msg; heap-allocating " << len << " bytes" << endl; - scoped_array<char> buf(new char[len]); - GETMSG(buf.get()); - } -} - -/** - * Same as the above readmsg(), but returns an internally constructed message. - * This is a "higher-level" readmsg() that relies on return-value optimization - * for avoiding unnecessary copies. - */ -template <typename T> -T -readmsg(st_netfd_t src, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) -{ - T msg; - readmsg(src, msg, timeout); - return msg; -} - -/** - * Keep issuing transactions to the replicas. - */ -void -issue_txns(st_channel<replica_info> &newreps, int &seqno) -{ - Op_OpType types[] = {Op::read, Op::write, Op::del}; - vector<st_netfd_t> fds; - - while (!stop_hub) { - // Did we get a new member? - if (!newreps.empty() && seqno > 0) { - sendmsg(fds[0], Txn()); - } - while (!newreps.empty()) { - fds.push_back(newreps.take().fd()); - } - - // Generate a random transaction. 
- Txn txn; - txn.set_seqno(seqno++); - int count = randint(5) + 1; - for (int o = 0; o < count; o++) { - Op *op = txn.add_op(); - int rtype = randint(3), rkey = randint(), rvalue = randint(); - op->set_type(types[rtype]); - op->set_key(rkey); - op->set_value(rvalue); - } - - // Broadcast. - bcastmsg(fds, txn); - - // Checkpoint. - if (txn.seqno() % chkpt == 0) { - if (verbose) cout << "issued txn " << txn.seqno() << endl; - st_sleep(0); - } - } -} - -/** - * Process a transaction: update DB state (incl. seqno) and send response to - * leader. - */ -void -process_txn(st_netfd_t leader, map<int, int> &map, const Txn &txn, int &seqno, - bool caught_up) -{ - checkeq(txn.seqno(), seqno + 1); - Response res; - res.set_seqno(txn.seqno()); - res.set_caught_up(caught_up); - seqno = txn.seqno(); - for (int o = 0; o < txn.op_size(); o++) { - const Op &op = txn.op(o); - switch (op.type()) { - case Op::read: - res.add_result(map[op.key()]); - break; - case Op::write: - map[op.key()] = op.value(); - break; - case Op::del: - map.erase(op.key()); - break; - } - } - sendmsg(leader, res); -} - -/** - * Actually do the work of executing a transaction and sending back the reply. - */ -void -process_txns(st_netfd_t leader, map<int, int> &map, int &seqno, - st_bool &send_state, st_bool &sent_state, - st_channel<Txn*> &backlog) -{ - while (true) { - Txn txn; - { - st_intr intr(kill_hub); - readmsg(leader, txn); - } - - if (txn.has_seqno()) { - if (txn.seqno() == seqno + 1) { - process_txn(leader, map, txn, seqno, true); - } else { - // Queue up for later processing once a snapshot has been received. - backlog.push(new Txn(txn)); - } - - if (txn.seqno() % chkpt == 0) { - if (verbose) cout << "processed txn " << txn.seqno() << endl; - st_sleep(0); - } - } else { - // Wait for the snapshot to be generated. 
- send_state.set(); - cout << "waiting for state to be sent" << endl; - sent_state.waitset(); - sent_state.reset(); - cout << "state sent" << endl; - } - } -} - -/** - * Keep swallowing replica responses. - */ -void -handle_responses(st_netfd_t replica, const int &seqno, bool caught_up) -{ - long long start_time = current_time_millis(); - while (true) { - Response res; - { - st_intr intr(kill_hub); - readmsg(replica, res); - } - if (!caught_up && res.caught_up()) { - caught_up = true; - cout << "recovering node caught up; took " - << current_time_millis() - start_time << "ms" << endl; - } - if (res.seqno() % chkpt == 0) { - if (verbose) - cout << "got response " << res.seqno() << " from " << replica << endl; - st_sleep(0); - } - if (stop_hub && res.seqno() + 1 == seqno) { - cout << "seqno = " << res.seqno() << endl; - break; - } - } -} - -/** - * Help the recovering node. - */ -void -recover_joiner(st_netfd_t listener, const map<int, int> &map, const int &seqno, - st_bool &send_state, st_bool &sent_state) -{ - // Wait for the right time to generate the snapshot. - { - st_intr intr(stop_hub); - send_state.waitset(); - } - send_state.reset(); - - cout << "snapshotting state for recovery" << endl; - Recovery recovery; - foreach (const pii &p, map) { - Recovery_Pair *pair = recovery.add_pair(); - pair->set_key(p.first); - pair->set_value(p.second); - } - recovery.set_seqno(seqno); - - // Notify process_txns that it may continue processing. - sent_state.set(); - - // Wait for the new joiner. - st_netfd_t joiner; - { - st_intr intr(stop_hub); - joiner = checkerr(st_accept(listener, nullptr, nullptr, - ST_UTIME_NO_TIMEOUT)); - } - st_closing closing(joiner); - - cout << "got joiner's connection, sending recovery" << endl; - sendmsg(joiner, recovery); - cout << "sent" << endl; -} - -/** - * Run the leader. 
- */ -void -run_leader(int minreps, uint16_t leader_port) -{ - cout << "starting as leader" << endl; - cout << "waiting for at least " << minreps << " replicas to join" << endl; - - // Wait until all replicas have joined. - st_netfd_t listener = st_tcp_listen(leader_port); - st_closing close_listener(listener); - // TODO rename these - vector<replica_info> replicas; - st_closing_all_infos close_replicas(replicas); - for (int i = 0; i < minreps; i++) { - st_netfd_t fd; - { - st_intr intr(stop_hub); - fd = checkerr(st_accept(listener, nullptr, nullptr, - ST_UTIME_NO_TIMEOUT)); - } - Join join = readmsg<Join>(fd); - replicas.push_back(replica_info(fd, static_cast<uint16_t>(join.port()))); - } - - // Construct the initialization message. - Init init; - init.set_txnseqno(0); - foreach (replica_info r, replicas) { - SockAddr *psa = init.add_node(); - psa->set_host(r.host()); - psa->set_port(r.port()); - } - - // Send init to each initial replica. - foreach (replica_info r, replicas) { - init.set_yourhost(r.host()); - sendmsg(r.fd(), init); - } - - // Start dispatching queries. - int seqno = 0; - st_channel<replica_info> newreps; - const function0<void> f = bind(issue_txns, ref(newreps), ref(seqno)); - st_thread_t swallower = my_spawn(bind(swallow, f)); - foreach (const replica_info &r, replicas) newreps.push(r); - st_joining join_swallower(swallower); - - // Start handling responses. - st_thread_group handlers; - foreach (replica_info r, replicas) { - handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno), true))); - } - - // Accept the recovering node, and tell it about the online replicas. - st_netfd_t joiner; - { - st_intr intr(stop_hub); - joiner = checkerr(st_accept(listener, nullptr, nullptr, - ST_UTIME_NO_TIMEOUT)); - } - Join join = readmsg<Join>(joiner); - cout << "setting seqno to " << seqno << endl; - init.set_txnseqno(seqno); - sendmsg(joiner, init); - - // Start streaming txns to joiner. 
- cout << "start streaming txns to joiner" << endl; - replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); - newreps.push(replicas.back()); - handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), false))); -} - -/** - * Run a replica. - */ -void -run_replica(string leader_host, uint16_t leader_port, uint16_t listen_port) -{ - // Initialize database state. - map<int, int> map; - int seqno = -1; - dump_state ds(map, seqno); - st_bool send_state, sent_state; - - cout << "starting as replica on port " << listen_port << endl; - - // Listen for connections from other replicas. - st_netfd_t listener = - st_tcp_listen(listen_port); - st_joining join_rec(my_spawn(bind(recover_joiner, listener, ref(map), - ref(seqno), ref(send_state), - ref(sent_state)))); - - // Connect to the leader and join the system. - st_netfd_t leader = st_tcp_connect(leader_host.c_str(), leader_port, - timeout); - st_closing closing(leader); - Join join; - join.set_port(listen_port); - sendmsg(leader, join); - Init init = readmsg<Init>(leader); - uint32_t listen_host = init.yourhost(); - - // Display the info. - cout << "got init msg with txn seqno " << init.txnseqno() - << " and hosts:" << endl; - vector<st_netfd_t> replicas; - st_closing_all close_replicas(replicas); - for (uint16_t i = 0; i < init.node_size(); i++) { - const SockAddr &sa = init.node(i); - char buf[INET_ADDRSTRLEN]; - in_addr host = { sa.host() }; - bool is_self = sa.host() == listen_host && sa.port() == listen_port; - cout << "- " << checkerr(inet_ntop(AF_INET, &host, buf, - INET_ADDRSTRLEN)) - << ':' << sa.port() << (is_self ? " (self)" : "") << endl; - if (!is_self && init.txnseqno() > 0) { - replicas.push_back(st_tcp_connect(host, - static_cast<uint16_t>(sa.port()), - ST_UTIME_NO_TIMEOUT)); - } - } - - // Process txns. 
- st_channel<Txn*> backlog; - st_joining join_proc(my_spawn(bind(process_txns, leader, ref(map), - ref(seqno), ref(send_state), - ref(sent_state), ref(backlog)))); - - // If there's anything to recover. - if (init.txnseqno() > 0) { - cout << "waiting for recovery from " << replicas[0] << endl; - - // Read the recovery message. - Recovery recovery; - { - st_intr intr(stop_hub); - readmsg(replicas[0], recovery); - } - for (int i = 0; i < recovery.pair_size(); i++) { - const Recovery_Pair &p = recovery.pair(i); - map[p.key()] = p.value(); - if (i % chkpt == 0) { - if (yield_during_build_up) st_sleep(0); - } - } - assert(seqno == -1 && - static_cast<typeof(seqno)>(recovery.seqno()) > seqno); - seqno = recovery.seqno(); - cout << "recovered." << endl; - - while (!backlog.empty()) { - Txn *p = backlog.take(); - process_txn(leader, map, *p, seqno, false); - if (p->seqno() % chkpt == 0) { - cout << "processed txn " << p->seqno() << " off the backlog" << endl; - if (yield_during_catch_up) st_sleep(0); - } - delete p; - } - cout << "caught up." << endl; - } -} - -int sig_pipe[2]; - -/** - * Raw signal handler that triggers the (synchronous) handler. - */ -void handle_sig(int sig) { - int err = errno; - cerr << "got signal: " << strsignal(sig) << " (" << sig << ")" << endl; - checkeqnneg(write(sig_pipe[1], &sig, sizeof sig), - static_cast<ssize_t>(sizeof sig)); - errno = err; -} - -/** - * Synchronous part of the signal handler; cleanly interrrupts any threads that - * have marked themselves as interruptible. - */ -void handle_sig_sync() { - stfd fd = checkerr(st_netfd_open(sig_pipe[0])); - while (true) { - int sig; - checkeqnneg(st_read(fd, &sig, sizeof sig, ST_UTIME_NO_TIMEOUT), - static_cast<ssize_t>(sizeof sig)); - if (sig == SIGINT) { - if (!stop_hub) stop_hub.set(); - else kill_hub.set(); - } else if (sig == SIGTERM) { - foreach (st_thread_t t, threads) { - st_thread_interrupt(t); - } - } - break; - } -} - -/** - * Initialization and command-line parsing. 
- */ -int -main(int argc, char **argv) -{ - namespace po = boost::program_options; - try { - GOOGLE_PROTOBUF_VERIFY_VERSION; - - bool is_leader, use_epoll; - int minreps; - uint16_t leader_port, listen_port; - string leader_host; - - // Parse options. - po::options_description desc("Allowed options"); - desc.add_options() - ("help,h", "show this help message") - ("verbose,v", "enable periodic printing of txn processing progress") - ("epoll,e", po::bool_switch(&use_epoll), - "use epoll (select is used by default)") - ("yield-build-up", po::bool_switch(&yield_during_build_up), - "yield periodically during build-up phase of recovery") - ("yield-catch-up", po::bool_switch(&yield_during_catch_up), - "yield periodically during catch-up phase of recovery") - ("leader,l", po::bool_switch(&is_leader), - "run the leader (run replica by default)") - ("leader-host,H", - po::value<string>(&leader_host)->default_value(string("localhost")), - "hostname or address of the leader") - ("leader-port,P", - po::value<uint16_t>(&leader_port)->default_value(7654), - "port the leader listens on") - ("chkpt,c", po::value<int>(&chkpt)->default_value(10000), - "number of txns before yielding/verbose printing") - ("listen-port,p", po::value<uint16_t>(&listen_port), - "port to listen on (replicas only)") - ("timeout,t", po::value<st_utime_t>(&timeout)->default_value(1000000), - "timeout for IO operations (in microseconds)") - ("minreps,n", po::value<int>(&minreps)->default_value(2), - "minimum number of replicas the system is willing to process txns on"); - - po::variables_map vm; - try { - po::store(po::parse_command_line(argc, argv, desc), vm); - po::notify(vm); - - if (vm.count("help")) { - cout << desc << endl; - return 0; - } - if (!is_leader && !vm.count("listen-port")) { - class parse_exception : public std::exception { - virtual const char *what() const throw() { - return "running replica requires listen port to be specified"; - } - }; - throw parse_exception(); - } - } catch 
(std::exception &ex) { - cerr << ex.what() << endl << endl << desc << endl; - return 1; - } - - // Initialize support for ST working with asynchronous signals. - check0x(pipe(sig_pipe)); - struct sigaction sa; - sa.sa_handler = handle_sig; - check0x(sigemptyset(&sa.sa_mask)); - sa.sa_flags = 0; - check0x(sigaction(SIGINT, &sa, nullptr)); - - // Initialize ST. - if (use_epoll) check0x(st_set_eventsys(ST_EVENTSYS_ALT)); - check0x(st_init()); - st_spawn(bind(handle_sig_sync)); - - // Initialize thread manager for clean shutdown of all threads. - thread_eraser eraser; - threads.insert(st_thread_self()); - - // Which role are we? - if (is_leader) { - run_leader(minreps, leader_port); - } else { - run_replica(leader_host, leader_port, listen_port); - } - - return 0; - } catch (const std::exception &ex) { - // Must catch all exceptions at the top to make the stack unwind. - cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; - return 1; - } -} Copied: ydb/trunk/src/main.lzz.clamp (from rev 1093, ydb/trunk/src/main.lzz) =================================================================== --- ydb/trunk/src/main.lzz.clamp (rev 0) +++ ydb/trunk/src/main.lzz.clamp 2008-12-11 19:16:31 UTC (rev 1098) @@ -0,0 +1,787 @@ +#hdr +#include <boost/bind.hpp> +#include <boost/foreach.hpp> +#include <boost/program_options.hpp> +#include <boost/scoped_array.hpp> +#include <boost/shared_ptr.hpp> +#include <commons/nullptr.h> +#include <commons/rand.h> +#include <commons/st/st.h> +#include <commons/time.h> +#include <csignal> // sigaction etc. +#include <cstdio> +#include <cstring> // strsignal +#include <iostream> +#include <fstream> +#include <map> +#include <netinet/in.h> // in_addr etc. 
+#include <set> +#include <sys/socket.h> // getpeername +#include <sys/types.h> // ssize_t +#include <unistd.h> // pipe, write +#include <vector> +#include "ydb.pb.h" +#define foreach BOOST_FOREACH +using namespace boost; +using namespace commons; +using namespace std; +#end + +typedef pair<int, int> pii; +st_utime_t timeout; +int chkpt; +bool verbose, yield_during_build_up, yield_during_catch_up; +long long timelim; +st_intr_bool stop_hub, kill_hub; + +/** + * The list of all threads. Keep track of these so that we may cleanly shut + * down all threads. + */ +set<st_thread_t> threads; + +/** + * RAII for adding/removing the current thread from the global threads set. + */ +class thread_eraser +{ + public: + thread_eraser() { threads.insert(st_thread_self()); } + ~thread_eraser() { threads.erase(st_thread_self()); } +}; + +/** + * Delegate for running thread targets. + * \param[in] f The function to execute. + * \param[in] intr Whether to signal stop_hub on an exception. + */ +void +my_spawn_helper(const function0<void> f, bool intr) +{ + thread_eraser eraser; + try { + f(); + } catch (const std::exception &ex) { + cerr << "thread " << st_thread_self() << ": " << ex.what() + << (intr ? "; interrupting!" : "") << endl; + if (intr) stop_hub.set(); + } +} + +/** + * Spawn a thread using ST but wrap it in an exception handler that interrupts + * all other threads (hopefully causing them to unwind). + * \param[in] f The function to execute. + * \param[in] intr Whether to signal stop_hub on an exception. Not actually + * used anywhere. + */ +st_thread_t +my_spawn(const function0<void> &f, bool intr = false) +{ + st_thread_t t = st_spawn(bind(my_spawn_helper, f, intr)); + threads.insert(t); + return t; +} + +/** + * Used by the leader to bookkeep information about replicas. + */ +class replica_info +{ + public: + replica_info(st_netfd_t fd, uint16_t port) : fd_(fd), port_(port) {} + st_netfd_t fd() const { return fd_; } + /** The port on which the replica is listening. 
*/ + uint16_t port() const { return port_; } +#hdr +#define GETSA sockaddr_in sa; sockaddr(sa); return sa +#end + /** The port on which the replica connected to us. */ + uint16_t local_port() const { GETSA.sin_port; } + uint32_t host() const { GETSA.sin_addr.s_addr; } + sockaddr_in sockaddr() const { GETSA; } + void sockaddr(sockaddr_in &sa) const { + socklen_t salen = sizeof sa; + check0x(getpeername(st_netfd_fileno(fd_), + reinterpret_cast<struct sockaddr*>(&sa), + &salen)); + } + private: + st_netfd_t fd_; + uint16_t port_; +}; + +/** + * RAII to close all contained netfds. + */ +class st_closing_all_infos +{ + public: + st_closing_all_infos(const vector<replica_info>& rs) : rs_(rs) {} + ~st_closing_all_infos() { + foreach (replica_info r, rs_) + check0x(st_netfd_close(r.fd())); + } + private: + const vector<replica_info> &rs_; +}; + +/** + * RAII to close all contained netfds. + */ +class st_closing_all +{ + public: + st_closing_all(const vector<st_netfd_t>& rs) : rs_(rs) {} + ~st_closing_all() { + foreach (st_netfd_t r, rs_) + check0x(st_netfd_close(r)); + } + private: + const vector<st_netfd_t> &rs_; +}; + +/** + * RAII for dumping the final state of the DB to a file on disk. + */ +class dump_state +{ + public: + dump_state(const map<int, int> &map, const int &seqno) + : map_(map), seqno_(seqno) {} + ~dump_state() { + string fname = string("/tmp/ydb") + lexical_cast<string>(getpid()); + cout << "dumping DB state (" << seqno_ << ") to " << fname << endl; + ofstream of(fname.c_str()); + of << "seqno: " << seqno_ << endl; + foreach (const pii &p, map_) { + of << p.first << ": " << p.second << endl; + } + } + private: + const map<int, int> &map_; + const int &seqno_; +}; + +/** + * Send a message to some destinations (sequentially). + */ +template<typename T> +void +bcastmsg(const vector<st_netfd_t> &dsts, const T & msg) +{ + // Serialize message to a buffer. 
+ string s; + check(msg.SerializeToString(&s)); + const char *buf = s.c_str(); + + // Prefix the message with a four-byte length. + uint32_t len = htonl(static_cast<uint32_t>(s.size())); + + // Broadcast the length-prefixed message to replicas. + foreach (st_netfd_t dst, dsts) { + checkeqnneg(st_write(dst, static_cast<void*>(&len), sizeof len, timeout), + static_cast<ssize_t>(sizeof len)); + checkeqnneg(st_write(dst, buf, s.size(), ST_UTIME_NO_TIMEOUT), + static_cast<ssize_t>(s.size())); + } +} + +/** + * Send a message to a single recipient. + */ +template<typename T> +void +sendmsg(st_netfd_t dst, const T &msg) +{ + vector<st_netfd_t> dsts(1, dst); + bcastmsg(dsts, msg); +} + +/** + * Read a message. + */ +template <typename T> +void +readmsg(st_netfd_t src, T & msg, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) +{ + // Read the message length. + uint32_t len; + checkeqnneg(st_read_fully(src, static_cast<void*>(&len), sizeof len, + timeout), + static_cast<ssize_t>(sizeof len)); + len = ntohl(len); + +#define GETMSG(buf) \ + checkeqnneg(st_read_fully(src, buf, len, timeout), (int) len); \ + check(msg.ParseFromArray(buf, len)); + + // Parse the message body. + if (len < 4096) { + char buf[len]; + GETMSG(buf); + } else { + cout << "receiving large msg; heap-allocating " << len << " bytes" << endl; + scoped_array<char> buf(new char[len]); + GETMSG(buf.get()); + } +} + +/** + * Same as the above readmsg(), but returns an internally constructed message. + * This is a "higher-level" readmsg() that relies on return-value optimization + * for avoiding unnecessary copies. + */ +template <typename T> +T +readmsg(st_netfd_t src, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) +{ + T msg; + readmsg(src, msg, timeout); + return msg; +} + +/** + * Keep issuing transactions to the replicas. 
+ */ +void +issue_txns(st_channel<replica_info> &newreps, int &seqno) +{ + Op_OpType types[] = {Op::read, Op::write, Op::del}; + vector<st_netfd_t> fds; + long long start_time = current_time_millis(); + + finally f(lambda () { + showtput("issued", current_time_millis(), __ref(start_time), __ref(seqno), + 0); + }); + + while (!stop_hub) { + // Did we get a new member? + if (!newreps.empty() && seqno > 0) { + sendmsg(fds[0], Txn()); + } + while (!newreps.empty()) { + fds.push_back(newreps.take().fd()); + } + + // Generate a random transaction. + Txn txn; + txn.set_seqno(seqno++); + int count = randint(5) + 1; + for (int o = 0; o < count; o++) { + Op *op = txn.add_op(); + int rtype = randint(3), rkey = randint(), rvalue = randint(); + op->set_type(types[rtype]); + op->set_key(rkey); + op->set_value(rvalue); + } + + // Broadcast. + bcastmsg(fds, txn); + + // Checkpoint. + if (txn.seqno() % chkpt == 0) { + if (verbose) + cout << "issued txn " << txn.seqno() << endl; + if (timelim > 0 && current_time_millis() - start_time > timelim) { + cout << "time's up; issued " << txn.seqno() << " txns in " << timelim + << " ms" << endl; + stop_hub.set(); + } + st_sleep(0); + } + } +} + +/** + * Process a transaction: update DB state (incl. seqno) and send response to + * leader. 
+ */ +void +process_txn(st_netfd_t leader, map<int, int> &map, const Txn &txn, int &seqno, + bool caught_up) +{ + checkeq(txn.seqno(), seqno + 1); + Response res; + res.set_seqno(txn.seqno()); + res.set_caught_up(caught_up); + seqno = txn.seqno(); + for (int o = 0; o < txn.op_size(); o++) { + const Op &op = txn.op(o); + switch (op.type()) { + case Op::read: + res.add_result(map[op.key()]); + break; + case Op::write: + map[op.key()] = op.value(); + break; + case Op::del: + map.erase(op.key()); + break; + } + } + sendmsg(leader, res); +} + +void +showtput(const string &action, long long stop_time, long long start_time, + int stop_count, int start_count) +{ + long long time_diff = stop_time - start_time; + int count_diff = stop_count - start_count; + double rate = double(count_diff) * 1000 / time_diff; + cout << action << " " << count_diff << " txns in " << time_diff << " ms (" + << rate << "tps)" << endl; +} + +/** + * Actually do the work of executing a transaction and sending back the reply. + * + * \param[in] leader The connection to the leader. + * + * \param[in] map The data store. + * + * \param[in] seqno The sequence number last seen. This always starts at 0, + * but may be bumped up by the recovery procedure. + * + * \param[in] send_states Channel of snapshots of the database state to send to + * recovering nodes (sent to recover_joiner). + * + * \param[in] backlog The backlog of txns that need to be processed. + * + * \param[in] init_seqno The seqno that was sent in the Init message from the + * leader. Not entirely clear that this is necessary; could probably just go + * with seqno. + */ +void +process_txns(st_netfd_t leader, map<int, int> &map, int &seqno, + st_channel<shared_ptr<Recovery> > &send_states, + st_channel<shared_ptr<Txn> > &backlog, int init_seqno) +{ + bool caught_up = init_seqno == 0; + long long start_time = current_time_millis(), + time_caught_up = caught_up ? start_time : -1; + int seqno_caught_up = caught_up ? 
seqno : -1; + + finally f(lambda () { + long long now = current_time_millis(); + showtput("processed", now, __ref(start_time), __ref(seqno), + __ref(init_seqno)); + if (!__ref(caught_up)) { + cout << "live-processing: never entered this phase (never caught up)" << + endl; + } else { + showtput("live-processed", now, __ref(time_caught_up), __ref(seqno), + __ref(seqno_caught_up)); + } + __ref(send_states).push(shared_ptr<Recovery>()); + }); + + while (true) { + Txn txn; + { + st_intr intr(stop_hub); + readmsg(leader, txn); + } + + if (txn.has_seqno()) { + if (txn.seqno() == seqno + 1) { + if (!caught_up) { + time_caught_up = current_time_millis(); + seqno_caught_up = seqno; + showtput("backlogged", time_caught_up, start_time, seqno_caught_up, + init_seqno); + caught_up = true; + } + process_txn(leader, map, txn, seqno, true); + } else { + // Queue up for later processing once a snapshot has been received. + backlog.push(shared_ptr<Txn>(new Txn(txn))); + } + + if (txn.seqno() % chkpt == 0) { + if (verbose) + cout << "processed txn " << txn.seqno() + << "; db size = " << map.size() << endl; + st_sleep(0); + } + } else { + // Generate a snapshot. + shared_ptr<Recovery> recovery(new Recovery); + foreach (const pii &p, map) { + Recovery_Pair *pair = recovery->add_pair(); + pair->set_key(p.first); + pair->set_value(p.second); + } + recovery->set_seqno(seqno); + send_states.push(recovery); + } + } + +} + +/** + * Keep swallowing replica responses. 
+ */ +void +handle_responses(st_netfd_t replica, const int &seqno, bool caught_up) +{ + long long start_time = current_time_millis(); + while (true) { + Response res; + { + st_intr intr(kill_hub); + readmsg(replica, res); + } + if (!caught_up && res.caught_up()) { + caught_up = true; + long long timediff = current_time_millis() - start_time; + cout << "recovering node caught up; took " + << timediff << "ms" << endl; + } + if (res.seqno() % chkpt == 0) { + if (verbose) + cout << "got response " << res.seqno() << " from " << replica << endl; + st_sleep(0); + } + if (stop_hub && res.seqno() + 1 == seqno) { + cout << "seqno = " << res.seqno() << endl; + break; + } + } +} + +/** + * Help the recovering node. + * + * \param[in] listener The connection on which we're listening for connections + * from recovering joiners. + * + * \param[in] map The database state. + * + * \param[in] seqno The sequence number. Always starts at 0. + * + * \param[in] send_states Channel of snapshots of the database state to receive + * from process_txns. + */ +void +recover_joiner(st_netfd_t listener, const map<int, int> &map, const int &seqno, + st_channel<shared_ptr<Recovery> > &send_states) +{ + st_netfd_t joiner; + shared_ptr<Recovery> recovery; + { + st_intr intr(stop_hub); + // Wait for the snapshot. + recovery = send_states.take(); + if (recovery == nullptr) { + return; + } + // Wait for the new joiner. + joiner = checkerr(st_accept(listener, nullptr, nullptr, + ST_UTIME_NO_TIMEOUT)); + } + + st_closing closing(joiner); + cout << "got joiner's connection, sending recovery" << endl; + sendmsg(joiner, *recovery); + cout << "sent" << endl; +} + +/** + * Run the leader. + */ +void +run_leader(int minreps, uint16_t leader_port) +{ + cout << "starting as leader" << endl; + + // Wait until all replicas have joined. 
+ st_netfd_t listener = st_tcp_listen(leader_port); + st_closing close_listener(listener); + vector<replica_info> replicas; + st_closing_all_infos close_replicas(replicas); + cout << "waiting for at least " << minreps << " replicas to join" << endl; + for (int i = 0; i < minreps; i++) { + st_netfd_t fd; + { + st_intr intr(stop_hub); + fd = checkerr(st_accept(listener, nullptr, nullptr, + ST_UTIME_NO_TIMEOUT)); + } + Join join = readmsg<Join>(fd); + replicas.push_back(replica_info(fd, static_cast<uint16_t>(join.port()))); + } + cout << "got all " << minreps << " replicas" << endl; + + // Construct the initialization message. + Init init; + init.set_txnseqno(0); + foreach (replica_info r, replicas) { + SockAddr *psa = init.add_node(); + psa->set_host(r.host()); + psa->set_port(r.port()); + } + + // Send init to each initial replica. + foreach (replica_info r, replicas) { + init.set_yourhost(r.host()); + sendmsg(r.fd(), init); + } + + // Start dispatching queries. + int seqno = 0; + st_channel<replica_info> newreps; + const function0<void> f = bind(issue_txns, ref(newreps), ref(seqno)); + st_thread_t swallower = my_spawn(bind(swallow, f)); + foreach (const replica_info &r, replicas) newreps.push(r); + st_joining join_swallower(swallower); + + // Start handling responses. + st_thread_group handlers; + foreach (replica_info r, replicas) { + handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno), true))); + } + + // Accept the recovering node, and tell it about the online replicas. + st_netfd_t joiner; + { + st_intr intr(stop_hub); + joiner = checkerr(st_accept(listener, nullptr, nullptr, + ST_UTIME_NO_TIMEOUT)); + } + Join join = readmsg<Join>(joiner); + cout << "setting seqno to " << seqno << endl; + init.set_txnseqno(seqno); + sendmsg(joiner, init); + + // Start streaming txns to joiner. 
+ cout << "start streaming txns to joiner" << endl; + replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); + newreps.push(replicas.back()); + handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), false))); +} + +/** + * Run a replica. + */ +void +run_replica(string leader_host, uint16_t leader_port, uint16_t listen_port) +{ + // Initialize database state. + map<int, int> map; + int seqno = -1; + dump_state ds(map, seqno); + st_channel<shared_ptr<Recovery> > send_states; + + cout << "starting as replica on port " << listen_port << endl; + + // Listen for connections from other replicas. + st_netfd_t listener = st_tcp_listen(listen_port); + + // Connect to the leader and join the system. + st_netfd_t leader = st_tcp_connect(leader_host.c_str(), leader_port, + timeout); + st_closing closing(leader); + Join join; + join.set_port(listen_port); + sendmsg(leader, join); + Init init; + { + st_intr intr(stop_hub); + readmsg(leader, init); + } + uint32_t listen_host = init.yourhost(); + + // Display the info. + cout << "got init msg with txn seqno " << init.txnseqno() + << " and hosts:" << endl; + vector<st_netfd_t> replicas; + st_closing_all close_replicas(replicas); + for (uint16_t i = 0; i < init.node_size(); i++) { + const SockAddr &sa = init.node(i); + char buf[INET_ADDRSTRLEN]; + in_addr host = { sa.host() }; + bool is_self = sa.host() == listen_host && sa.port() == listen_port; + cout << "- " << checkerr(inet_ntop(AF_INET, &host, buf, + INET_ADDRSTRLEN)) + << ':' << sa.port() << (is_self ? " (self)" : "") << endl; + if (!is_self && init.txnseqno() > 0) { + replicas.push_back(st_tcp_connect(host, + static_cast<uint16_t>(sa.port()), + timeout)); + } + } + + // Process txns. 
+ st_channel<shared_ptr<Txn> > backlog; + st_joining join_proc(my_spawn(bind(process_txns, leader, ref(map), + ref(seqno), ref(send_states), + ref(backlog), init.txnseqno()))); + st_joining join_rec(my_spawn(bind(recover_joiner, listener, ref(map), + ref(seqno), ref(send_states)))); + + // If there's anything to recover. + if (init.txnseqno() > 0) { + cout << "waiting for recovery from " << replicas[0] << endl; + + // Read the recovery message. + Recovery recovery; + { + st_intr intr(stop_hub); + readmsg(replicas[0], recovery); + } + for (int i = 0; i < recovery.pair_size(); i++) { + const Recovery_Pair &p = recovery.pair(i); + map[p.key()] = p.value(); + if (i % chkpt == 0) { + if (yield_during_build_up) st_sleep(0); + } + } + assert(seqno == -1 && + static_cast<typeof(seqno)>(recovery.seqno()) > seqno); + seqno = recovery.seqno(); + cout << "recovered." << endl; + + while (!backlog.empty()) { + shared_ptr<Txn> p = backlog.take(); + process_txn(leader, map, *p, seqno, false); + if (p->seqno() % chkpt == 0) { + if (verbose) + cout << "processed txn " << p->seqno() << " off the backlog" << endl; + if (yield_during_catch_up) + st_sleep(0); + } + } + cout << "caught up." << endl; + } + + stop_hub.insert(st_thread_self()); +} + +int sig_pipe[2]; + +/** + * Raw signal handler that triggers the (synchronous) handler. + */ +void handle_sig(int sig) { + int err = errno; + cerr << "got signal: " << strsignal(sig) << " (" << sig << ")" << endl; + checkeqnneg(write(sig_pipe[1], &sig, sizeof sig), + static_cast<ssize_t>(sizeof sig)); + errno = err; +} + +/** + * Synchronous part of the signal handler; cleanly interrupts any threads that + have marked themselves as interruptible.
+ */ +void handle_sig_sync() { + stfd fd = checkerr(st_netfd_open(sig_pipe[0])); + while (true) { + int sig; + checkeqnneg(st_read(fd, &sig, sizeof sig, ST_UTIME_NO_TIMEOUT), + static_cast<ssize_t>(sizeof sig)); + if (sig == SIGINT) { + if (!stop_hub) stop_hub.set(); + else kill_hub.set(); + } else if (sig == SIGTERM) { + foreach (st_thread_t t, threads) { + st_thread_interrupt(t); + } + } + break; + } +} + +/** + * Initialization and command-line parsing. + */ +int +main(int argc, char **argv) +{ + namespace po = boost::program_options; + try { + GOOGLE_PROTOBUF_VERIFY_VERSION; + + bool is_leader, use_epoll; + int minreps; + uint16_t leader_port, listen_port; + string leader_host; + + // Parse options. + po::options_description desc("Allowed options"); + desc.add_options() + ("help,h", "show this help message") + ("verbose,v", "enable periodic printing of txn processing progress") + ("epoll,e", po::bool_switch(&use_epoll), + "use epoll (select is used by default)") + ("yield-build-up", po::bool_switch(&yield_during_build_up), + "yield periodically during build-up phase of recovery") + ("yield-catch-up", po::bool_switch(&yield_during_catch_up), + "yield periodically during catch-up phase of recovery") + ("leader,l", po::bool_switch(&is_leader), + "run the leader (run replica by default)") + ("leader-host,H", + po::value<string>(&leader_host)->default_value(string("localhost")), + "hostname or address of the leader") + ("leader-port,P", + po::value<uint16_t>(&leader_port)->default_value(7654), + "port the leader listens on") + ("chkpt,c", po::value<int>(&chkpt)->default_value(10000), + "number of txns before yielding/verbose printing") + ("timelim,T", po::value<long long>(&timelim)->default_value(0), + "time limit in milliseconds, or 0 for none") + ("listen-port,p", po::value<uint16_t>(&listen_port)->default_value(7654), + "port to listen on (replicas only)") + ("timeout,t", po::value<st_utime_t>(&timeout)->default_value(1000000), + "timeout for IO operations (in 
microseconds)") + ("minreps,n", po::value<int>(&minreps)->default_value(2), + "minimum number of replicas the system is willing to process txns on"); + + po::variables_map vm; + try { + po::store(po::parse_command_line(argc, argv, desc), vm); + po::notify(vm); + + if (vm.count("help")) { + cout << desc << endl; + return 0; + } + } catch (std::exception &ex) { + cerr << ex.what() << endl << endl << desc << endl; + return 1; + } + + // Initialize support for ST working with asynchronous signals. + check0x(pipe(sig_pipe)); + struct sigaction sa; + sa.sa_handler = handle_sig; + check0x(sigemptyset(&sa.sa_mask)); + sa.sa_flags = 0; + check0x(sigaction(SIGINT, &sa, nullptr)); + + // Initialize ST. + if (use_epoll) check0x(st_set_eventsys(ST_EVENTSYS_ALT)); + check0x(st_init()); + st_spawn(bind(handle_sig_sync)); + + // Initialize thread manager for clean shutdown of all threads. + thread_eraser eraser; + threads.insert(st_thread_self()); + + // Which role are we? + if (is_leader) { + run_leader(minreps, leader_port); + } else { + run_replica(leader_host, leader_port, listen_port); + } + + return 0; + } catch (const std::exception &ex) { + // Must catch all exceptions at the top to make the stack unwind. 
+ cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; + return 1; + } +} Property changes on: ydb/trunk/src/main.lzz.clamp ___________________________________________________________________ Added: svn:mergeinfo + Added: ydb/trunk/tools/clamp.patch =================================================================== --- ydb/trunk/tools/clamp.patch (rev 0) +++ ydb/trunk/tools/clamp.patch 2008-12-11 19:16:31 UTC (rev 1098) @@ -0,0 +1,29 @@ +Only in clamp_053_new/: clamp +diff -u -r clamp_053/CodeGen.cc clamp_053_new/CodeGen.cc +--- clamp_053/CodeGen.cc 2003-09-30 18:44:04.000000000 -0400 ++++ clamp_053_new/CodeGen.cc 2008-12-11 01:25:30.000000000 -0500 +@@ -20,6 +20,7 @@ + + #include "CodeGen.hh" + ++#include <climits> + #include <sstream> + #include <cassert> + #include <iostream> +Binary files clamp_053/CodeGen.o and clamp_053_new/CodeGen.o differ +Only in clamp_053_new/: lambda_impl.clamp_h +diff -u -r clamp_053/Makefile clamp_053_new/Makefile +--- clamp_053/Makefile 2003-09-30 18:44:05.000000000 -0400 ++++ clamp_053_new/Makefile 2008-12-11 03:48:32.000000000 -0500 +@@ -27,9 +27,7 @@ + # pieces, depending on your set-up: CXX, CC, LEX and -I ...boost... + # + +-CXX = f:/mingw/bin/g++ +-CC = f:/mingw/bin/gcc +-LEX = f:/mingw/bin/flex ++LEX = flex + + CXXFLAGS = -g -Wall -Wno-unused -I d:/CVS/boost/boost + # Use -Wno-unused because lex.yy.c contains some unused labels and functions +Only in clamp_053_new/: test.cc Added: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash (rev 0) +++ ydb/trunk/tools/test.bash 2008-12-11 19:16:31 UTC (rev 1098) @@ -0,0 +1,274 @@ +#!/usr/bin/env bash + +set -o errexit -o nounset +if [[ "$1" != node-init-setup ]] +then . 
common.bash || exit 1 +fi + +script="$(basename "$0")" + +tagssh() { + ssh "$@" 2>&1 | sed "s/^/$1: /" +} + +check-remote() { + if [[ ${force:-asdf} != asdf && `hostname` == yang-xps410 ]] + then echo 'running a remote command on your pc!' 1>&2 && exit 1 + fi +} + +node-init-setup() { + check-remote + mkdir -p work + cd work + if [[ ! -d assorted ]] + then svn -q co https://assorted.svn.sourceforge.net/svnroot/assorted/ + fi + cd assorted/configs/trunk/ + ./bootstrap.bash local +} + +node-setup-lzz() { + check-remote + mkdir -p ~/.local/pkg/lzz/bin/ + mv /tmp/lzz.static ~/.local/pkg/lzz/bin/lzz + refresh-local +} + +node-setup-st() { + check-remote + mkdir -p ~/.local/pkg/st/{include,lib}/ + cd /tmp/ + tar xzf st-1.8.tar.gz + cd st-1.8 + CONFIG_GUESS_PATH=/tmp make -s default + make -C extensions -s + cp -f obj/st.h ~/.local/pkg/st/include/ + cp -f extensions/stx.h ~/.local/pkg/st/include/ + cp -f obj/{libst.{a,so*},libstx.a} ~/.local/pkg/st/lib/ + refresh-local +} + +node-setup-pb() { + check-remote + toast --quiet arm /tmp/protobuf-2.0.2.tar.bz2 +} + +node-setup-boost() { + check-remote + cd /tmp/ + tar xjf /tmp/boost_1_37_0.tar.bz2 + cd boost_1_37_0/ + ./configure --prefix=$HOME/.local/pkg/boost-1.37.0 + make -s install + ln -s ~/.local/pkg/boost-1.37.0/include/boost-1_37/boost/ ~/.local/pkg/boost-1.37.0/include/ + refresh-local +} + +node-setup-m4() { + check-remote + toast --quiet arm 'http://ftp.gnu.org/gnu/m4/m4-1.4.12.tar.bz2' +} + +node-setup-bison() { + check-remote + toast --quiet arm 'http://ftp.gnu.org/gnu/bison/bison-2.4.tar.bz2' +} + +node-setup-flex() { + check-remote + toast --quiet arm 'http://prdownloads.sourceforge.net/flex/flex-2.5.35.tar.bz2' +} + +node-setup-clamp() { + check-remote + cd /tmp/ + tar xzf clamp_053_src.tar.gz + cd clamp_053/ + chmod u+w * + patch -p1 < /tmp/clamp.patch + make -s clamp + mkdir -p ~/.local/pkg/clamp/bin/ + mv clamp ~/.local/pkg/clamp/bin/ + refresh-local +} + +node-setup-ydb-1() { + check-remote + if [[ ! 
-L ~/ydb ]] + then ln -s ~/work/assorted/ydb/trunk ~/ydb + fi + if [[ ! -L ~/ccom ]] + then ln -s ~/work/assorted/cpp-commons/trunk ~/ccom + fi +} + +node-setup-ydb-2() { + check-remote + cd ~/ccom/ + ./setup.bash -d -p ~/.local/pkg/cpp-commons + refresh-local + cd ~/ydb/src + make clean + make WTF= +} + +remote() { + local host="$1" + shift + scp -q "$(dirname "$0")/$script" "$host:" + tagssh "$host" "./$script" "$@" +} + +allhosts() { + if [[ ${host:-} ]] ; then + echo $host + elif [[ ${range:-} ]] ; then + seq $range | sed 's/^/farm/; s/$/.csail/' + else + cat << EOF +farm1.csail +farm2.csail +farm3.csail +farm4.csail +farm5.csail +farm6.csail +farm7.csail +farm8.csail +farm9.csail +farm10.csail +farm11.csail +farm12.csail +farm13.csail +farm14.csail +EOF + fi | xargs ${xargs--P9} -I^ "$@" +} + +allssh() { + allhosts ssh ^ "set -o errexit -o nounset; $@" +} + +allscp() { + allhosts scp -q "$@" +} + +allremote() { + allhosts "./$script" remote ^ "$@" +} + +init-setup() { + allremote node-init-setup +} + +get-deps() { + xargs -I_ -P9 wget -nv -P /tmp/ _ << EOF +http://www.lazycplusplus.com/lzz_2_8_0_linux.zip +http://downloads.sourceforge.net/state-threads/st-1.8.tar.gz +http://protobuf.googlecode.com/files/protobuf-2.0.2.tar.bz2 +http://downloads.sourceforge.net/boost/boost_1_37_0.tar.bz2 +http://home.clara.net/raoulgough/clamp/clamp_053_src.tar.gz +EOF + cd /tmp/ + unzip lzz_2_8_0_linux.zip lzz.static +} + +setup-deps() { + allscp \ + /usr/share/misc/config.guess \ + /tmp/lzz.static \ + /tmp/st-1.8.tar.gz \ + /tmp/protobuf-2.0.2.tar.bz2 \ + /tmp/boost_1_37_0.tar.bz2 \ + clamp.patch \ + ^:/tmp/ + + allremote node-setup-lzz + allremote node-setup-st + allremote node-setup-pb + allremote node-setup-boost + allremote node-setup-m4 + allremote node-setup-bison + allremote node-setup-clamp +} + +setup-ydb() { + allremote node-setup-ydb-1 + rm -r /tmp/{ydb,ccom}-src/ + svn export ~/ydb/src /tmp/ydb-src/ + svn export ~/ccom/src /tmp/ccom-src/ + allscp -r /tmp/ydb-src/* 
^:ydb/src/ + allscp -r /tmp/ccom-src/* ^:ccom/src/ + allremote node-setup-ydb-2 +} + +full() { + init-setup + setup-deps + setup-ydb +} + +hostinfos() { + xargs= allssh " + echo + hostname + echo ===== + fgrep 'model name' /proc/cpuinfo + head -2 /proc/meminfo + " +} + +hosttops() { + xargs= allssh " + echo + hostname + echo ===== + top -b -n 1 | fgrep -A3 COMMAND + " +} + +run-helper() { + tagssh $1 "ydb/src/ydb -l" & + sleep .1 + tagssh $2 "ydb/src/ydb -H $1" & + tagssh $3 "ydb/src/ydb -H $1" & + sleep ${wait:-10} + tagssh $4 "ydb/src/ydb -H $1" & + read + kill %1 +} + +range2args() { + "$@" $(seq $range | sed 's/^/farm/; s/$/.csail/') +} + +run() { + range2args run-helper +} + +stop-helper() { + tagssh $1 'pkill ydb' +} + +stop() { + range2args stop-helper +} + +kill-helper() { + tagssh $1 'pkill ydb' + tagssh $2 'pkill ydb' + tagssh $3 'pkill ydb' + tagssh $4 'pkill ydb' +} + +kill() { + range2args kill-helper +} + +#plot() { +# for i in "$@" ; do +# sed "s/farm$i.csail//" < "$i" +# done +#} + +"$@" Property changes on: ydb/trunk/tools/test.bash ___________________________________________________________________ Added: svn:executable + * This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2008-12-22 22:57:43
Revision: 1111 http://assorted.svn.sourceforge.net/assorted/?rev=1111&view=rev Author: yangzhang Date: 2008-12-22 22:33:22 +0000 (Mon, 22 Dec 2008) Log Message: ----------- - replaced dump_state with a finally lambda - added accept_joiner_seqno - tweaked output - fixed run() in test.bash Modified Paths: -------------- ydb/trunk/src/main.lzz.clamp ydb/trunk/tools/test.bash Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2008-12-19 23:46:26 UTC (rev 1110) +++ ydb/trunk/src/main.lzz.clamp 2008-12-22 22:33:22 UTC (rev 1111) @@ -29,7 +29,7 @@ typedef pair<int, int> pii; st_utime_t timeout; -int chkpt; +int chkpt, accept_joiner_seqno; bool verbose, yield_during_build_up, yield_during_catch_up; long long timelim; st_intr_bool stop_hub, kill_hub; @@ -143,28 +143,6 @@ }; /** - * RAII for dumping the final state of the DB to a file on disk. - */ -class dump_state -{ - public: - dump_state(const map<int, int> &map, const int &seqno) - : map_(map), seqno_(seqno) {} - ~dump_state() { - string fname = string("/tmp/ydb") + lexical_cast<string>(getpid()); - cout << "dumping DB state (" << seqno_ << ") to " << fname << endl; - ofstream of(fname.c_str()); - of << "seqno: " << seqno_ << endl; - foreach (const pii &p, map_) { - of << p.first << ": " << p.second << endl; - } - } - private: - const map<int, int> &map_; - const int &seqno_; -}; - -/** * Send a message to some destinations (sequentially). */ template<typename T> @@ -246,7 +224,8 @@ * Keep issuing transactions to the replicas. */ void -issue_txns(st_channel<replica_info> &newreps, int &seqno) +issue_txns(st_channel<replica_info> &newreps, int &seqno, + st_bool &accept_joiner) { Op_OpType types[] = {Op::read, Op::write, Op::del}; vector<st_netfd_t> fds; @@ -292,6 +271,10 @@ } st_sleep(0); } + + if (txn.seqno() == accept_joiner_seqno) { + accept_joiner.set(); + } } } @@ -423,10 +406,10 @@ } /** - * Keep swallowing replica responses. 
+ * Swallow replica responses. */ void -handle_responses(st_netfd_t replica, const int &seqno, +handle_responses(st_netfd_t replica, const int &seqno, int rid, st_multichannel<long long> &recover_signals, bool caught_up) { st_channel<long long> &sub = recover_signals.subscribe(); @@ -437,6 +420,7 @@ recovery_end_seqno = -1; finally f(lambda () { long long end_time = current_time_millis(); + cout << __ref(rid) << ": "; showtput("after recovery, finished", end_time, __ref(recovery_end_time), __ref(seqno), __ref(recovery_end_seqno)); }); @@ -449,11 +433,13 @@ if (recovery_start_time == -1 && !sub.empty()) { recovery_start_time = sub.take(); recovery_start_seqno = seqno; + cout << rid << ": "; showtput("before recovery, finished", recovery_start_time, start_time, recovery_start_seqno, 0); } else if (recovery_end_time == -1 && !sub.empty()) { recovery_end_time = sub.take(); recovery_end_seqno = seqno; + cout << rid << ": "; showtput("during recovery, finished", recovery_end_time, recovery_start_time, recovery_end_seqno, recovery_start_seqno); } @@ -461,16 +447,20 @@ long long t = current_time_millis(), timediff = t - start_time; caught_up = true; recover_signals.push(t); + cout << rid << ": "; cout << "recovering node caught up; took " << timediff << "ms" << endl; } if (res.seqno() % chkpt == 0) { - if (verbose) + if (verbose) { + cout << rid << ": "; cout << "got response " << res.seqno() << " from " << replica << endl; + } st_sleep(0); } // This is OK since the seqno will never grow again if stop_hub is set. if (stop_hub && res.seqno() + 1 == seqno) { + cout << rid << ": "; cout << "stopping seqno = " << res.seqno() << endl; break; } @@ -557,17 +547,20 @@ } // Start dispatching queries. 
+ st_bool accept_joiner; int seqno = 0; st_channel<replica_info> newreps; - const function0<void> f = bind(issue_txns, ref(newreps), ref(seqno)); + const function0<void> f = bind(issue_txns, ref(newreps), ref(seqno), + ref(accept_joiner)); st_thread_t swallower = my_spawn(bind(swallow, f), "issue_txns"); foreach (const replica_info &r, replicas) newreps.push(r); st_joining join_swallower(swallower); // Start handling responses. st_thread_group handlers; + int rid = 0; foreach (replica_info r, replicas) { - handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno), + handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno), rid++, ref(recover_signals), true), "handle_responses")); } @@ -578,6 +571,7 @@ st_intr intr(stop_hub); joiner = checkerr(st_accept(listener, nullptr, nullptr, ST_UTIME_NO_TIMEOUT)); + accept_joiner.waitset(); } Join join = readmsg<Join>(joiner); cout << "setting seqno to " << seqno << endl; @@ -589,7 +583,7 @@ cout << "start streaming txns to joiner" << endl; replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); newreps.push(replicas.back()); - handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), + handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), rid++, ref(recover_signals), false), "handle_responses")); } @@ -603,7 +597,16 @@ // Initialize database state. 
map<int, int> map; int seqno = -1; - dump_state ds(map, seqno); + finally f(lambda () { + string fname = string("/tmp/ydb") + lexical_cast<string>(getpid()); + cout << "dumping DB state (seqno = " << __ref(seqno) << ", size = " + << __ref(map).size() << ") to " << fname << endl; + ofstream of(fname.c_str()); + of << "seqno: " << __ref(seqno) << endl; + foreach (const pii &p, __ref(map)) { + of << p.first << ": " << p.second << endl; + } + }); st_channel<shared_ptr<Recovery> > send_states; cout << "starting as replica on port " << listen_port << endl; @@ -769,6 +772,9 @@ "yield periodically during catch-up phase of recovery") ("leader,l", po::bool_switch(&is_leader), "run the leader (run replica by default)") + ("accept-joiner-seqno,j", + po::value<int>(&accept_joiner_seqno)->default_value(0), + "accept recovering joiner (start recovery) after this seqno") ("leader-host,H", po::value<string>(&leader_host)->default_value(string("localhost")), "hostname or address of the leader") Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2008-12-19 23:46:26 UTC (rev 1110) +++ ydb/trunk/tools/test.bash 2008-12-22 22:33:22 UTC (rev 1111) @@ -227,21 +227,24 @@ " } +range2args() { + "$@" $(seq $range | sed 's/^/farm/; s/$/.csail/') +} + run-helper() { tagssh $1 "ydb/src/ydb -l" & sleep .1 tagssh $2 "ydb/src/ydb -H $1" & tagssh $3 "ydb/src/ydb -H $1" & - sleep ${wait:-10} + sleep ${wait1:-10} tagssh $4 "ydb/src/ydb -H $1" & - read - kill %1 + if [[ ${wait2:-} ]] + then sleep $wait2 + else read + fi + tagssh $1 "pkill -sigint ydb" } -range2args() { - "$@" $(seq $range | sed 's/^/farm/; s/$/.csail/') -} - run() { range2args run-helper } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2009-01-14 18:26:15
Revision: 1130 http://assorted.svn.sourceforge.net/assorted/?rev=1130&view=rev Author: yangzhang Date: 2009-01-14 18:25:52 +0000 (Wed, 14 Jan 2009) Log Message: ----------- - added analysis.py for aggregating and plotting measurement results (collected from test.bash) - added --dump for finally dumping state - added --exit-on-recovery to make automated runs easier - added --issuing-interval for debugging - fixed bug with response_handler hanging forever if the sequence numbers happen to be caught up; very visible with --issuing-interval - added scaling, full-scaling, full-yield, full-block, much more to test.bash - a bunch of general refactoring of test.bash - documented measurement/analysis tools, tool reqs, more usage notes in general, and personal TODOs/notes - added timestamping to tagssh - fixed bug with process_txn throws a timeout exception trying to sendmsg - added --read-thresh for debugging time spent waiting to read from network socket - added SIGUSR1 pausing - refactored threadnames - improved thread switching callback debugging - refactored/improved thread exception printing - added debugging of large message sending - added sendmsg timeout warnings - added --general-txns so system defaults to insert/update txns - added --count-updates, --show-updates - fixed after-recovery statistics bookkeeping - fixed stop-responsiveness in response_handler - added exception printing in case something goes awry before the RAII thread joins in the main `run_*` functions - removed the breaking in handle_sig_sync Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/main.lzz.clamp ydb/trunk/tools/test.bash Added Paths: ----------- ydb/trunk/tools/analysis.py Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-01-12 20:04:16 UTC (rev 1129) +++ ydb/trunk/README 2009-01-14 18:25:52 UTC (rev 1130) @@ -1,7 +1,7 @@ Overview -------- -YDB (Yang's Database) is a simple replicated memory store, 
developed for the +ydb (Yang's Database) is a simple replicated memory store, developed for the purpose of researching various approaches to recovery in such OLTP-optimized databases as [VOLTDB] (formerly H-Store/Horizontica). @@ -25,7 +25,7 @@ Setup ----- -Requirements: +Requirements for the ydb system: - [boost] 1.37 - [C++ Commons] svn r1082 @@ -43,6 +43,16 @@ [Protocol Buffers]: http://code.google.com/p/protobuf/ [State Threads]: http://state-threads.sourceforge.net/ +Requirements for tools: + +- [Assorted Shell Tools] (bash-commons, mssh) +- [Pylab] 0.98.3 +- [Python] 2.5 + +[Assorted Shell Tools]: http://assorted.sf.net/shell-tools/ +[Pylab]: http://matplotlib.sf.net/ +[Python]: http://python.org/ + Usage ----- @@ -74,23 +84,72 @@ sigint to try to force all working threads to shut down (any node, including replicas, respond to ctrl-c). -Full System Test ----------------- +To pause/resume the issuing of transactions, send a sigusr1 to the leader. - ./test.bash full +Measurements +------------ +Included is a suite of scripts to run ydb on the PMG farm machines. It is from +this deployment that performance measurements are collected. + + ./test.bash full-setup + will configure all the farm machines to (1) have my proper initial environment, (2) have all the prerequisite software, and (3) build ydb. This may take a -long time (particularly the boost-building phase). +long time (particularly the boost-building phase). Subsequently, - range='10 13' wait=5 ./test.bash run + ./test.bash setup-ydb -will run a leader on farm10, replicas on farm11 and farm12, and a recovering -replica on farm13 after 5 seconds. Pipe several runs of this to some files -(`*.out`), and plot the results with +should be sufficient for pushing out the source from the current working copy +of the source repository and building on each machine. 
- ./test.bash plot *.out +Find out which of the farm machines is free by looking at the top 3 items from +`top` on each machine: + ./test.bash hosttops + +Most commands you pass to `test.bash` accept (and some require) a set of hosts +on which to run. Look at the comment documentation in test.bash to find out +more about each function (command). You must specify the hosts by setting +either the `hosts` environment variable to a string of space-separated +hostnames or an array of hostnames, or you may set `range` to a string of a +start and end number to select all hosts from `farm<START>` to `farm<END>`. +Examples (the following all run in parallel across the specified hosts): + + hosts='farm3 farm5 farm7' ./test.bash full-setup + hosts=(farm2 farm3 farm4) ./test.bash setup-ydb # Arrays are also accepted. + range='2 4' ./test.bash hosttops # Same as last line. + +### Recovery experiments + +To run a leader on `farm10`, initial replicas on `farm11` and `farm12`, and a +recovering replica on `farm13` after 5 seconds: + + range='10 13' ./test.bash run 1000 # Command requires exactly 4 nodes. + +To run this experiment TODO +trials: + + range='10 13' + +### Scaling experiments + +To run a leader on `farm10` with initial replicas on the rest: + + range='10 15' ./test.bash scaling + +To run for 1 through 3 initial replicas, repeating each configuration for 3 +trials: + + range='10 13' ./test.bash full-scaling + +Pipe several runs of this to the file `scaling-log`, and plot the results with + + ./analysis.py scaling + +Hence the name "scaling"---this was a test of the scalability of the base +system (no recovery involved). 
+ Recovery Mechanisms ------------------- @@ -98,8 +157,9 @@ - Network recovery - From a single node - - Interleave the state recovery/catch up with the backlogging of live txns - - Recover/catch up in one swoop, then backlog the live txns + - **block**: Backlog live txns, then recover/catch up in one swoop + - **yield**: Interleave the state recovery/catch up with the backlogging of + live txns Pseudo-code ----------- @@ -133,12 +193,14 @@ foreach replica connect to replica recv recovery msg from replica - apply the state - apply backlog + apply the state (regularly yielding if desired) + apply backlog (regularly yielding if desired) Todo ---- +- Add a way to reliably obtain ST stack traces + - Add benchmarking/testing hooks, e.g.: - start the recovering joiner at a well-defined time (after a certain # txns or after the DB reaches a certain size) @@ -150,6 +212,12 @@ - time to recover - bytes used to recover +- Produce time series graphs of the txn throughput and mark the events on the + x-axis, which also conveys the duration of the various phases + - Overlay onto this the various recovery schemes + - Main benchmark: wait until the state grows to a certain size, then start + the recovery + - Run some benchmarks, esp. on multiple physical hosts. - Figure out why things are running so slowly with >2 replicas. @@ -169,3 +237,25 @@ - Add richer transactions/queries/operations. - Add disk-based recovery methods. 
+ +Plan/Notes +---------- + +Measurements + +- DONE find out how often prng yields same number + - not very often +- DONE baseline scaling (tps with number of nodes) + - inversely proportional to number of nodes, so bottlenecked at leader +- DONE recovery time as a function of amount of data + - TODO break down into various phases using bar graph of segmented bars +- DONE use only insert (and update) txns +- TODO try profiling +- TODO detailed view of tps during recovery over time (should see various phases) +- TODO later: runtime overhead of logging/tps under normal operation (scaled + with # nodes?) + +Presentation + +- TODO differences from: harbor, harp, aries +- TODO understand 2pc, paxos, etc. Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-01-12 20:04:16 UTC (rev 1129) +++ ydb/trunk/src/main.lzz.clamp 2009-01-14 18:25:52 UTC (rev 1130) @@ -28,12 +28,22 @@ #end typedef pair<int, int> pii; + +// Configuration. st_utime_t timeout; -int chkpt, accept_joiner_seqno; -bool verbose, yield_during_build_up, yield_during_catch_up; -long long timelim; +int chkpt, accept_joiner_seqno, issuing_interval; +size_t accept_joiner_size; +bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates, + count_updates, stop_on_recovery, general_txns; +long long timelim, read_thresh; + +// Control. st_intr_bool stop_hub, kill_hub; +st_bool do_pause; +// Statistics. +int updates; + /** * The list of all threads. Keep track of these so that we may cleanly shut * down all threads. @@ -50,7 +60,53 @@ ~thread_eraser() { threads.erase(st_thread_self()); } }; +map<st_thread_t, string> threadnames; +st_thread_t last_thread; + /** + * Look up thread name, or just show thread ID. 
+ */ +string +threadname(st_thread_t t = st_thread_self()) { + if (threadnames.find(t) != threadnames.end()) { + return threadnames[t]; + } else { + return lexical_cast<string>(t); + } +} + +/** + * Debug function for thread names. Remember what we're switching from. + */ +void +switch_out_cb() +{ + last_thread = st_thread_self(); +} + +/** + * Debug function for thread names. Show what we're switching from/to. + */ +void switch_in_cb() +{ + if (last_thread != st_thread_self()) { + cout << "switching"; + if (last_thread != 0) cout << " from " << threadname(last_thread); + cout << " to " << threadname() << endl; + } +} + +/** + * Print to cerr a thread exception. + */ +ostream& +cerr_thread_ex(const std::exception &ex) +{ + return cerr << "exception in thread " << threadname() + << ": " << ex.what(); +} + +/** * Delegate for running thread targets. * \param[in] f The function to execute. * \param[in] intr Whether to signal stop_hub on an exception. @@ -61,9 +117,8 @@ thread_eraser eraser; try { f(); - } catch (const std::exception &ex) { - cerr << "thread " << st_thread_self() << ": " << ex.what() - << (intr ? "; interrupting!" : "") << endl; + } catch (std::exception &ex) { + cerr_thread_ex(ex) << (intr ? "; interrupting!" : "") << endl; if (intr) stop_hub.set(); } } @@ -120,6 +175,7 @@ public: st_closing_all_infos(const vector<replica_info>& rs) : rs_(rs) {} ~st_closing_all_infos() { + cout << "closing all conns to replicas (replica_infos)" << endl; foreach (replica_info r, rs_) check0x(st_netfd_close(r.fd())); } @@ -154,15 +210,33 @@ check(msg.SerializeToString(&s)); const char *buf = s.c_str(); + if (s.size() > 1000000) + cout << "sending large message to " << dsts.size() << " dsts, size = " + << s.size() << " bytes" << endl; + // Prefix the message with a four-byte length. uint32_t len = htonl(static_cast<uint32_t>(s.size())); // Broadcast the length-prefixed message to replicas. 
+ int dstno = 0; foreach (st_netfd_t dst, dsts) { - checkeqnneg(st_write(dst, static_cast<void*>(&len), sizeof len, timeout), - static_cast<ssize_t>(sizeof len)); - checkeqnneg(st_write(dst, buf, s.size(), ST_UTIME_NO_TIMEOUT), - static_cast<ssize_t>(s.size())); + size_t resid = sizeof len; +#define checksize(x,y) checkeqnneg(x, static_cast<ssize_t>(y)) + int res = st_write_resid(dst, static_cast<void*>(&len), &resid, timeout); + if (res == -1 && errno == ETIME) { + cerr << "got timeout! " << resid << " of " << sizeof len + << " remaining, for dst #" << dstno << endl; + checksize(st_write(dst, + reinterpret_cast<char*>(&len) + sizeof len - resid, + resid, + ST_UTIME_NO_TIMEOUT), + resid); + } else { + check0x(res); + } + checksize(st_write(dst, buf, s.size(), ST_UTIME_NO_TIMEOUT), + s.size()); + dstno++; } } @@ -237,10 +311,13 @@ }); while (!stop_hub) { - // Did we get a new member? + // Did we get a new member? If so, notify an arbitrary member (the first + // one) to prepare to send recovery information (by sending an + // empty/default Txn). if (!newreps.empty() && seqno > 0) { sendmsg(fds[0], Txn()); } + // Bring in any new members. while (!newreps.empty()) { fds.push_back(newreps.take().fd()); } @@ -251,12 +328,14 @@ int count = randint(5) + 1; for (int o = 0; o < count; o++) { Op *op = txn.add_op(); - int rtype = randint(3), rkey = randint(), rvalue = randint(); + int rtype = general_txns ? randint(3) : 1, rkey = randint(), rvalue = randint(); op->set_type(types[rtype]); op->set_key(rkey); op->set_value(rvalue); } + if (do_pause) do_pause.waitreset(); + // Broadcast. 
bcastmsg(fds, txn); @@ -266,11 +345,14 @@ cout << "issued txn " << txn.seqno() << endl; if (timelim > 0 && current_time_millis() - start_time > timelim) { cout << "time's up; issued " << txn.seqno() << " txns in " << timelim - << " ms" << endl; + << " ms" << endl; stop_hub.set(); } st_sleep(0); } + if (issuing_interval > 0) { + st_sleep(issuing_interval); + } if (txn.seqno() == accept_joiner_seqno) { accept_joiner.set(); @@ -282,7 +364,7 @@ * Process a transaction: update DB state (incl. seqno) and send response to * leader. */ -void + void process_txn(st_netfd_t leader, map<int, int> &map, const Txn &txn, int &seqno, bool caught_up) { @@ -293,22 +375,29 @@ seqno = txn.seqno(); for (int o = 0; o < txn.op_size(); o++) { const Op &op = txn.op(o); + const int key = op.key(); + if (show_updates || count_updates) { + if (map.find(key) != map.end()) { + if (show_updates) cout << "existing key: " << key << endl; + if (count_updates) updates++; + } + } switch (op.type()) { case Op::read: - res.add_result(map[op.key()]); + res.add_result(map[key]); break; case Op::write: - map[op.key()] = op.value(); + map[key] = op.value(); break; case Op::del: - map.erase(op.key()); + map.erase(key); break; } } sendmsg(leader, res); } -void + void showtput(const string &action, long long stop_time, long long start_time, int stop_count, int start_count) { @@ -316,7 +405,7 @@ int count_diff = stop_count - start_count; double rate = double(count_diff) * 1000 / time_diff; cout << action << " " << count_diff << " txns in " << time_diff << " ms (" - << rate << "tps)" << endl; + << rate << "tps)" << endl; } /** @@ -338,7 +427,7 @@ * leader. Not entirely clear that this is necessary; could probably just go * with seqno. */ -void + void process_txns(st_netfd_t leader, map<int, int> &map, int &seqno, st_channel<shared_ptr<Recovery> > &send_states, st_channel<shared_ptr<Txn> > &backlog, int init_seqno) @@ -349,27 +438,38 @@ int seqno_caught_up = caught_up ? 
seqno : -1; finally f(lambda () { - long long now = current_time_millis(); - showtput("processed", now, __ref(start_time), __ref(seqno), - __ref(init_seqno)); - if (!__ref(caught_up)) { - cout << "live-processing: never entered this phase (never caught up)" << - endl; - } else { - showtput("live-processed", now, __ref(time_caught_up), __ref(seqno), - __ref(seqno_caught_up)); - } - __ref(send_states).push(shared_ptr<Recovery>()); - }); + long long now = current_time_millis(); + showtput("processed", now, __ref(start_time), __ref(seqno), + __ref(init_seqno)); + if (!__ref(caught_up)) { + cout << "live-processing: never entered this phase (never caught up)" << + endl; + } else { + showtput("live-processed", now, __ref(time_caught_up), __ref(seqno), + __ref(seqno_caught_up)); + } + __ref(send_states).push(shared_ptr<Recovery>()); + }); while (true) { Txn txn; + long long before_read; + if (read_thresh > 0) { + before_read = current_time_millis(); + } { st_intr intr(stop_hub); readmsg(leader, txn); } - + if (read_thresh > 0) { + long long read_time = current_time_millis() - before_read; + if (read_time > read_thresh) { + cout << "current_time_millis() - before_read = " << read_time << " > " + << read_thresh << endl; + } + } if (txn.has_seqno()) { + const char *action; if (txn.seqno() == seqno + 1) { if (!caught_up) { time_caught_up = current_time_millis(); @@ -379,20 +479,26 @@ caught_up = true; } process_txn(leader, map, txn, seqno, true); + action = "processed"; } else { // Queue up for later processing once a snapshot has been received. backlog.push(shared_ptr<Txn>(new Txn(txn))); + action = "backlogged"; } if (txn.seqno() % chkpt == 0) { - if (verbose) - cout << "processed txn " << txn.seqno() - << "; db size = " << map.size() << endl; + if (verbose) { + cout << action << " txn " << txn.seqno() + << "; db size = " << map.size() + << "; seqno = " << seqno + << "; backlog.size = " << backlog.queue().size() << endl; + } st_sleep(0); } } else { // Generate a snapshot. 
shared_ptr<Recovery> recovery(new Recovery); + cout << "generating recovery of " << map.size() << " records" << endl; foreach (const pii &p, map) { Recovery_Pair *pair = recovery->add_pair(); pair->set_key(p.first); @@ -408,7 +514,7 @@ /** * Swallow replica responses. */ -void + void handle_responses(st_netfd_t replica, const int &seqno, int rid, st_multichannel<long long> &recover_signals, bool caught_up) { @@ -418,38 +524,87 @@ recovery_end_time = -1; int recovery_start_seqno = caught_up ? -1 : seqno, recovery_end_seqno = -1; + int last_seqno = -1; finally f(lambda () { long long end_time = current_time_millis(); - cout << __ref(rid) << ": "; - showtput("after recovery, finished", end_time, __ref(recovery_end_time), - __ref(seqno), __ref(recovery_end_seqno)); + if (__ref(recovery_end_time) > -1) { + cout << __ref(rid) << ": "; + showtput("after recovery, finished", end_time, __ref(recovery_end_time), + __ref(seqno), __ref(recovery_end_seqno)); + } }); while (true) { + finally f(lambda () { + // TODO: convert the whole thing to an object so that we can have "scoped + // globals". + long long &recovery_start_time = __ref(recovery_start_time); + int &recovery_start_seqno = __ref(recovery_start_seqno); + long long &recovery_end_time = __ref(recovery_end_time); + int &recovery_end_seqno = __ref(recovery_end_seqno); + long long &start_time = __ref(start_time); + const int &seqno = __ref(seqno); + int &rid = __ref(rid); + st_channel<long long> &sub = __ref(sub); + // The first timestamp that comes down the subscription pipeline is the + // recovery start time, issued by the main thread. The second one is the + // recovery end time, issued by the response handler associated with the + // joiner. 
+ if (recovery_start_time == -1 && !sub.empty()) { + recovery_start_time = sub.take(); + recovery_start_seqno = seqno; + cout << rid << ": "; + showtput("before recovery, finished", recovery_start_time, start_time, + recovery_start_seqno, 0); + } else if (recovery_end_time == -1 && !sub.empty()) { + recovery_end_time = sub.take(); + recovery_end_seqno = seqno; + cout << rid << ": "; + showtput("during recovery, finished", recovery_end_time, + recovery_start_time, recovery_end_seqno, recovery_start_seqno); + } + }); Response res; - { + // Read the message, but correctly respond to interrupts so that we can + // cleanly exit (slightly tricky). + if (last_seqno + 1 == seqno) { + // Stop-interruptible in case we're already caught up. + try { + st_intr intr(stop_hub); + readmsg(replica, res); + } catch (...) { // TODO: only catch interruptions + // This check on seqnos is OK for termination since the seqno will + // never grow again if stop_hub is set. + if (last_seqno + 1 == seqno) { + cout << rid << ": "; + cout << "stopping seqno = " << res.seqno() << endl; + break; + } else { + continue; + } + } + } else { + // Only kill-interruptible because we want a clean termination (want + // to get all the acks back). st_intr intr(kill_hub); readmsg(replica, res); } - if (recovery_start_time == -1 && !sub.empty()) { - recovery_start_time = sub.take(); - recovery_start_seqno = seqno; - cout << rid << ": "; - showtput("before recovery, finished", recovery_start_time, start_time, - recovery_start_seqno, 0); - } else if (recovery_end_time == -1 && !sub.empty()) { - recovery_end_time = sub.take(); - recovery_end_seqno = seqno; - cout << rid << ": "; - showtput("during recovery, finished", recovery_end_time, - recovery_start_time, recovery_end_seqno, recovery_start_seqno); - } + // Determine if this response handler's host (the only joiner) has finished + // catching up. If it has, then broadcast a signal so that all response + // handlers will know about this event. 
if (!caught_up && res.caught_up()) { - long long t = current_time_millis(), timediff = t - start_time; + long long now = current_time_millis(), timediff = now - start_time; caught_up = true; - recover_signals.push(t); + recover_signals.push(now); cout << rid << ": "; cout << "recovering node caught up; took " << timediff << "ms" << endl; + // This will cause the program to exit eventually, but cleanly, such that + // the recovery time will be set first, before the eventual exit (which + // may not even happen in the current iteration). + if (stop_on_recovery) { + cout << "stopping on recovery" << endl; + stop_hub.set(); + } } if (res.seqno() % chkpt == 0) { if (verbose) { @@ -458,12 +613,7 @@ } st_sleep(0); } - // This is OK since the seqno will never grow again if stop_hub is set. - if (stop_hub && res.seqno() + 1 == seqno) { - cout << rid << ": "; - cout << "stopping seqno = " << res.seqno() << endl; - break; - } + last_seqno = res.seqno(); } } @@ -499,9 +649,10 @@ } st_closing closing(joiner); - cout << "got joiner's connection, sending recovery" << endl; + cout << "got joiner's connection, sending recovery of " + << recovery->pair_size() << " records" << endl; sendmsg(joiner, *recovery); - cout << "sent" << endl; + cout << "sent recovery" << endl; } /** @@ -556,36 +707,42 @@ foreach (const replica_info &r, replicas) newreps.push(r); st_joining join_swallower(swallower); - // Start handling responses. - st_thread_group handlers; - int rid = 0; - foreach (replica_info r, replicas) { - handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno), rid++, - ref(recover_signals), true), - "handle_responses")); - } + try { + // Start handling responses. + st_thread_group handlers; + int rid = 0; + foreach (replica_info r, replicas) { + handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno), rid++, + ref(recover_signals), true), + "handle_responses")); + } - // Accept the recovering node, and tell it about the online replicas. 
- st_netfd_t joiner; - { - st_intr intr(stop_hub); - joiner = checkerr(st_accept(listener, nullptr, nullptr, - ST_UTIME_NO_TIMEOUT)); - accept_joiner.waitset(); + // Accept the recovering node, and tell it about the online replicas. + st_netfd_t joiner; + { + st_intr intr(stop_hub); + joiner = checkerr(st_accept(listener, nullptr, nullptr, + ST_UTIME_NO_TIMEOUT)); + accept_joiner.waitset(); + } + Join join = readmsg<Join>(joiner); + cout << "setting seqno to " << seqno << endl; + init.set_txnseqno(seqno); + sendmsg(joiner, init); + recover_signals.push(current_time_millis()); + + // Start streaming txns to joiner. + cout << "start streaming txns to joiner" << endl; + replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); + newreps.push(replicas.back()); + handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), rid++, + ref(recover_signals), false), + "handle_responses_joiner")); + } catch (std::exception &ex) { + // TODO: maybe there's a cleaner way to do this final step before waiting with the join + cerr_thread_ex(ex) << endl; + throw; } - Join join = readmsg<Join>(joiner); - cout << "setting seqno to " << seqno << endl; - init.set_txnseqno(seqno); - sendmsg(joiner, init); - recover_signals.push(current_time_millis()); - - // Start streaming txns to joiner. 
- cout << "start streaming txns to joiner" << endl; - replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); - newreps.push(replicas.back()); - handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), rid++, - ref(recover_signals), false), - "handle_responses")); } /** @@ -598,13 +755,18 @@ map<int, int> map; int seqno = -1; finally f(lambda () { + cout << "REPLICA SUMMARY" << endl; + cout << "total updates = " << updates << endl; + cout << "final DB state: seqno = " << __ref(seqno) << ", size = " + << __ref(map).size() << endl; string fname = string("/tmp/ydb") + lexical_cast<string>(getpid()); - cout << "dumping DB state (seqno = " << __ref(seqno) << ", size = " - << __ref(map).size() << ") to " << fname << endl; - ofstream of(fname.c_str()); - of << "seqno: " << __ref(seqno) << endl; - foreach (const pii &p, __ref(map)) { - of << p.first << ": " << p.second << endl; + if (dump) { + cout << "dumping to " << fname << endl; + ofstream of(fname.c_str()); + of << "seqno: " << __ref(seqno) << endl; + foreach (const pii &p, __ref(map)) { + of << p.first << ": " << p.second << endl; + } } }); st_channel<shared_ptr<Recovery> > send_states; @@ -658,39 +820,46 @@ ref(seqno), ref(send_states)), "recover_joiner")); - // If there's anything to recover. - if (init.txnseqno() > 0) { - cout << "waiting for recovery from " << replicas[0] << endl; + try { + // If there's anything to recover. + if (init.txnseqno() > 0) { + cout << "waiting for recovery from " << replicas[0] << endl; - // Read the recovery message. - Recovery recovery; - { - st_intr intr(stop_hub); - readmsg(replicas[0], recovery); - } - for (int i = 0; i < recovery.pair_size(); i++) { - const Recovery_Pair &p = recovery.pair(i); - map[p.key()] = p.value(); - if (i % chkpt == 0) { - if (yield_during_build_up) st_sleep(0); + // Read the recovery message. 
+ Recovery recovery; + { + st_intr intr(stop_hub); + readmsg(replicas[0], recovery); } - } - assert(seqno == -1 && - static_cast<typeof(seqno)>(recovery.seqno()) > seqno); - seqno = recovery.seqno(); - cout << "recovered." << endl; + for (int i = 0; i < recovery.pair_size(); i++) { + const Recovery_Pair &p = recovery.pair(i); + map[p.key()] = p.value(); + if (i % chkpt == 0) { + if (yield_during_build_up) st_sleep(0); + } + } + assert(seqno == -1 && + static_cast<typeof(seqno)>(recovery.seqno()) > seqno); + seqno = recovery.seqno(); + cout << "recovered " << recovery.pair_size() << " records." << endl; - while (!backlog.empty()) { - shared_ptr<Txn> p = backlog.take(); - process_txn(leader, map, *p, seqno, false); - if (p->seqno() % chkpt == 0) { - if (verbose) - cout << "processed txn " << p->seqno() << " off the backlog" << endl; - if (yield_during_catch_up) + while (!backlog.empty()) { + shared_ptr<Txn> p = backlog.take(); + process_txn(leader, map, *p, seqno, false); + if (p->seqno() % chkpt == 0) { + if (verbose) + cout << "processed txn " << p->seqno() << " off the backlog; " + << "backlog.size = " << backlog.queue().size() << endl; + // Explicitly yield. (Note that yielding does still effectively + // happen anyway because process_txn is a yield point.) st_sleep(0); + } } + cout << "caught up." << endl; } - cout << "caught up." << endl; + } catch (std::exception &ex) { + cerr_thread_ex(ex) << endl; + throw; } stop_hub.insert(st_thread_self()); @@ -726,22 +895,13 @@ foreach (st_thread_t t, threads) { st_thread_interrupt(t); } + } else if (sig == SIGUSR1) { + toggle(do_pause); } - break; + //break; } } -map<st_thread_t, string> threadnames; - -void cb() -{ - if (threadnames.find(st_thread_self()) != threadnames.end()) { - cout << "switched to: " << threadnames[st_thread_self()] << endl; - } else { - cout << "switched to: " << st_thread_self() << endl; - } -} - /** * Initialization and command-line parsing. 
*/ @@ -763,18 +923,38 @@ ("help,h", "show this help message") ("debug-threads,d",po::bool_switch(&debug_threads), "enable context switch debug outputs") - ("verbose,v", "enable periodic printing of txn processing progress") + ("verbose,v", po::bool_switch(&verbose), + "enable periodic printing of txn processing progress") ("epoll,e", po::bool_switch(&use_epoll), "use epoll (select is used by default)") ("yield-build-up", po::bool_switch(&yield_during_build_up), "yield periodically during build-up phase of recovery") ("yield-catch-up", po::bool_switch(&yield_during_catch_up), "yield periodically during catch-up phase of recovery") + ("dump,D", po::bool_switch(&dump), + "replicas should finally dump their state to a tmp file for " + "inspection/diffing") + ("show-updates,U", po::bool_switch(&show_updates), + "log operations that touch (update/read/delete) an existing key") + ("count-updates,u",po::bool_switch(&count_updates), + "count operations that touch (update/read/delete) an existing key") + ("general-txns,g", po::bool_switch(&general_txns), + "issue read and delete transactions as well as the default of (only) insertion/update transactions (for leader only)") ("leader,l", po::bool_switch(&is_leader), "run the leader (run replica by default)") + ("exit-on-recovery,x", po::bool_switch(&stop_on_recovery), + "exit after the joiner fully recovers (for leader only)") + ("accept-joiner-size,s", + po::value<size_t>(&accept_joiner_size)->default_value(0), + "accept recovering joiner (start recovery) after DB grows to this size " + "(for leader only)") + ("issuing-interval,i", + po::value<int>(&issuing_interval)->default_value(0), + "seconds to sleep between issuing txns (for leader only)") ("accept-joiner-seqno,j", po::value<int>(&accept_joiner_seqno)->default_value(0), - "accept recovering joiner (start recovery) after this seqno") + "accept recovering joiner (start recovery) after this seqno (for leader " + "only)") ("leader-host,H", 
po::value<string>(&leader_host)->default_value(string("localhost")), "hostname or address of the leader") @@ -784,10 +964,13 @@ ("chkpt,c", po::value<int>(&chkpt)->default_value(10000), "number of txns before yielding/verbose printing") ("timelim,T", po::value<long long>(&timelim)->default_value(0), - "time limit in milliseconds, or 0 for none") + "general network IO time limit in milliseconds, or 0 for none") + ("read-thresh,r", po::value<long long>(&read_thresh)->default_value(0), + "if positive and any txn read exceeds this, then print a message " + "(for replicas only)") ("listen-port,p", po::value<uint16_t>(&listen_port)->default_value(7654), "port to listen on (replicas only)") - ("timeout,t", po::value<st_utime_t>(&timeout)->default_value(1000000), + ("timeout,t", po::value<st_utime_t>(&timeout)->default_value(200000), "timeout for IO operations (in microseconds)") ("minreps,n", po::value<int>(&minreps)->default_value(2), "minimum number of replicas the system is willing to process txns on"); @@ -813,13 +996,16 @@ check0x(sigemptyset(&sa.sa_mask)); sa.sa_flags = 0; check0x(sigaction(SIGINT, &sa, nullptr)); + check0x(sigaction(SIGTERM, &sa, nullptr)); + check0x(sigaction(SIGUSR1, &sa, nullptr)); // Initialize ST. if (use_epoll) check0x(st_set_eventsys(ST_EVENTSYS_ALT)); check0x(st_init()); st_spawn(bind(handle_sig_sync)); if (debug_threads) { - st_set_switch_in_cb(cb); + st_set_switch_out_cb(switch_out_cb); + st_set_switch_in_cb(switch_in_cb); } // Initialize thread manager for clean shutdown of all threads. @@ -835,9 +1021,9 @@ } return 0; - } catch (const std::exception &ex) { + } catch (std::exception &ex) { // Must catch all exceptions at the top to make the stack unwind. 
- cerr << "thread " << st_thread_self() << ": " << ex.what() << endl; + cerr_thread_ex(ex) << endl; return 1; } } Added: ydb/trunk/tools/analysis.py =================================================================== --- ydb/trunk/tools/analysis.py (rev 0) +++ ydb/trunk/tools/analysis.py 2009-01-14 18:25:52 UTC (rev 1130) @@ -0,0 +1,83 @@ +#!/usr/bin/env python + +from __future__ import with_statement +import re, sys, itertools, numpy +from pylab import * + +def check(path): + with file(path) as f: + if 'got timeout' in f.read(): + print 'warning: timeout occurred' + +def agg(src): + def gen(): + for seqno, pairs in itertools.groupby(src, lambda (a,b): a): + ts = numpy.array([t for seqno, t in pairs]) + yield seqno, ts.mean(), ts.std(), ts + return list(gen()) + +def scaling(path): + check(path) + def getpairs(): + with file(path) as f: + for line in f: + m = re.match( r'=== n=(?P<n>\d+) ', line ) + if m: + n = int(m.group('n')) + m = re.match( r'.*: issued .*[^.\d](?P<tps>[.\d]+)tps', line ) + if m: + tps = float(m.group('tps')) + yield (n, tps) + tups = agg(getpairs()) + + print 'num nodes, mean tps, stdev tps' + for n, mean, sd, raw in tups: print n, mean, sd, raw + + xs, ys, es, rs = zip(*tups) + errorbar(xs, ys, es) + title('Scaling of baseline throughput with number of nodes') + xlabel('Node count') + ylabel('Mean TPS (stdev error bars)') + xlim(.5, n+.5) + ylim(ymin = 0) + savefig('scaling.png') + +def run(blockpath, yieldpath): + for path, label in [(blockpath, 'blocking scheme'), (yieldpath, 'yielding scheme')]: + check(path) + def getpairs(): + with file(path) as f: + for line in f: + m = re.match( r'=== seqno=(?P<n>\d+) ', line ) + if m: + seqno = int(m.group('n')) + m = re.match( r'.*: recovering node caught up; took (?P<time>\d+)ms', line ) + if m: + t = float(m.group('time')) + yield (seqno, t) + tups = agg(getpairs()) + + print 'max seqno, mean time, stdev time [raw data]' + for seqno, mean, sd, raw in tups: print seqno, mean, sd, raw + + xs, ys, es, 
rs = zip(*tups) + errorbar(xs, ys, es, label = label) + + title('Recovery time over number of transactions') + xlabel('Transaction count (corresponds roughly to data size)') + ylabel('Mean TPS (stdev error bars)') + #xlim(.5, n+.5) + #ylim(ymin = 0) + savefig('run.png') + +def main(argv): + if len(argv) <= 1: + print >> sys.stderr, 'Must specify a command' + elif sys.argv[1] == 'scaling': + scaling(sys.argv[2] if len(sys.argv) > 2 else 'scaling-log') + elif sys.argv[1] == 'run': + run(*sys.argv[2:] if len(sys.argv) > 2 else ['block-log', 'yield-log']) + else: + print >> sys.stderr, 'Unknown command:', sys.argv[1] + +sys.exit(main(sys.argv)) Property changes on: ydb/trunk/tools/analysis.py ___________________________________________________________________ Added: svn:executable + * Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-01-12 20:04:16 UTC (rev 1129) +++ ydb/trunk/tools/test.bash 2009-01-14 18:25:52 UTC (rev 1130) @@ -8,7 +8,13 @@ script="$(basename "$0")" tagssh() { - ssh "$@" 2>&1 | sed "s/^/$1: /" + ssh "$@" 2>&1 | python -u -c ' +import time, sys +while True: + line = sys.stdin.readline() + if line == "": break + print sys.argv[1], time.time(), ":\t", line, +' $1 } check-remote() { @@ -120,7 +126,7 @@ tagssh "$host" "./$script" "$@" } -allhosts() { +hosts() { if [[ ${host:-} ]] ; then echo $host elif [[ ${range:-} ]] ; then @@ -142,23 +148,27 @@ farm13.csail farm14.csail EOF - fi | xargs ${xargs--P9} -I^ "$@" + fi } -allssh() { - allhosts ssh ^ "set -o errexit -o nounset; $@" +parhosts() { + hosts | xargs ${xargs--P9} -I^ "$@" } -allscp() { - allhosts scp -q "$@" +parssh() { + parhosts ssh ^ "set -o errexit -o nounset; $@" } -allremote() { - allhosts "./$script" remote ^ "$@" +parscp() { + parhosts scp -q "$@" } +parremote() { + parhosts "./$script" remote ^ "$@" +} + init-setup() { - allremote node-init-setup + parremote node-init-setup } get-deps() { @@ -174,7 
+184,7 @@ } setup-deps() { - allscp \ + parscp \ /usr/share/misc/config.guess \ /tmp/lzz.static \ /tmp/st-1.8.tar.gz \ @@ -183,33 +193,33 @@ clamp.patch \ ^:/tmp/ - allremote node-setup-lzz - allremote node-setup-st - allremote node-setup-pb - allremote node-setup-boost - allremote node-setup-m4 - allremote node-setup-bison - allremote node-setup-clamp + parremote node-setup-lzz + parremote node-setup-st + parremote node-setup-pb + parremote node-setup-boost + parremote node-setup-m4 + parremote node-setup-bison + parremote node-setup-clamp } setup-ydb() { - allremote node-setup-ydb-1 - rm -r /tmp/{ydb,ccom}-src/ + parremote node-setup-ydb-1 + rm -rf /tmp/{ydb,ccom}-src/ svn export ~/ydb/src /tmp/ydb-src/ svn export ~/ccom/src /tmp/ccom-src/ - allscp -r /tmp/ydb-src/* ^:ydb/src/ - allscp -r /tmp/ccom-src/* ^:ccom/src/ - allremote node-setup-ydb-2 + parscp -r /tmp/ydb-src/* ^:ydb/src/ + parscp -r /tmp/ccom-src/* ^:ccom/src/ + parremote node-setup-ydb-2 } -full() { +full-setup() { init-setup setup-deps setup-ydb } hostinfos() { - xargs= allssh " + xargs= parssh " echo hostname echo ===== @@ -219,7 +229,7 @@ } hosttops() { - xargs= allssh " + xargs= parssh " echo hostname echo ===== @@ -227,51 +237,134 @@ " } -range2args() { - "$@" $(seq $range | sed 's/^/farm/; s/$/.csail/') +hostargs() { + if [[ $range ]] + then "$@" $(seq $range | sed 's/^/farm/; s/$/.csail/') + else "$@" ${hosts[@]} + fi } -run-helper() { - tagssh $1 "ydb/src/ydb -l" & +scaling-helper() { + local leader=$1 + shift + tagssh $leader "ydb/src/ydb -l -n $#" & sleep .1 - tagssh $2 "ydb/src/ydb -H $1" & - tagssh $3 "ydb/src/ydb -H $1" & + for rep in "$@" + do tagssh $rep "ydb/src/ydb -n $# -H $leader" & + done sleep ${wait1:-10} - tagssh $4 "ydb/src/ydb -H $1" & - if [[ ${wait2:-} ]] - then sleep $wait2 - else read + tagssh $leader 'pkill -sigint ydb' + wait +} + +# This just tests how the system scales; no recovery involved. 
+scaling() { + hostargs scaling-helper +} + +# Repeat some experiment some number of trials and for some number of range +# configurations; e.g., "repeat scaling". +# TODO: fix this to work also with `hosts`; move into repeat-helper that's run +# via hostargs, and change the range= to hosts= +full-scaling() { + local base=$1 out=scaling-log-$(date +%Y-%m-%d-%H:%M:%S-%N) + shift + for n in {1..5} ; do # configurations + export range="$base $((base + n))" + stop + for i in {1..5} ; do # trials + echo === n=$n i=$i === + scaling + sleep 1 + stop + sleep .1 + echo + done + done >& $out + ln -sf $out scaling-log +} + +run-helper() { + local leader=$1 + shift + tagssh $leader "ydb/src/ydb -l -x --accept-joiner-seqno $seqno -n $(( $# - 1 )) ${extraargs:-}" & # -v --debug-threads + sleep .1 # pexpect 'waiting for at least' + # Run initial replicas. + while (( $# > 1 )) ; do + tagssh $1 "ydb/src/ydb -H $leader" & + shift + done + sleep .1 # pexpect 'got all \d+ replicas' leader + # Run joiner. + tagssh $1 "ydb/src/ydb -H $leader" & # -v --debug-threads -t 200000" & + if false ; then + if [[ ${wait2:-} ]] + then sleep $wait2 + else read + fi + tagssh $leader "pkill -sigint ydb" fi - tagssh $1 "pkill -sigint ydb" + wait } run() { - range2args run-helper + hostargs run-helper } +full-run() { + for seqno in 100000 300000 500000 700000 900000; do # configurations + stop + for i in {1..5} ; do # trials + echo === seqno=$seqno i=$i === + run + sleep 1 + stop + sleep .1 + echo + done + done +} + +full-block() { + local out=block-log-$(date +%Y-%m-%d-%H:%M:%S) + full-run >& $out + ln -sf $out block-log +} + +full-yield() { + local out=yield-log-$(date +%Y-%m-%d-%H:%M:%S) + extraargs='--yield-catch-up' full-run >& $out + ln -sf $out yield-log +} + +full() { + #full-block + full-yield + #full-scaling +} + stop-helper() { - tagssh $1 'pkill ydb' + tagssh $1 'pkill -sigint ydb' } stop() { - range2args stop-helper + hostargs stop-helper } kill-helper() { - tagssh $1 'pkill ydb' - tagssh 
$2 'pkill ydb' - tagssh $3 'pkill ydb' - tagssh $4 'pkill ydb' + for i in "$@" + do tagssh $i 'pkill ydb' + done } kill() { - range2args kill-helper + hostargs kill-helper } -#plot() { -# for i in "$@" ; do -# sed "s/farm$i.csail//" < "$i" -# done -#} +# Use mssh to log in with password as root to each machine. +mssh-root() { + : "${hosts:="$(hosts)"}" + mssh -l root "$@" +} "$@" This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2009-01-17 18:57:00
Revision: 1133
http://assorted.svn.sourceforge.net/assorted/?rev=1133&view=rev
Author: yangzhang
Date: 2009-01-17 18:56:53 +0000 (Sat, 17 Jan 2009)

Log Message:
-----------
- filled in more measurement documentation and rewrote parts of the overview
- added -pg to the Makefile for profiling
- added --profile-threads
- added --write-thresh
- improved detail of and tweaked some output messages
- more clearly distinguished the different phases of recovery on the joiner
- changed the joiner behavior to not send responses when catching up
- added clean shutdown messages from leader to replicas
- decreased the variance in the txns
- added --max-ops, --min-ops
- added some lineage to the analysis.py output
- generalized analysis code to support plotting multi-segmented bar charts

Modified Paths:
--------------
    ydb/trunk/README
    ydb/trunk/src/Makefile
    ydb/trunk/src/main.lzz.clamp
    ydb/trunk/tools/analysis.py
    ydb/trunk/tools/test.bash

Modified: ydb/trunk/README
===================================================================
--- ydb/trunk/README	2009-01-14 18:29:46 UTC (rev 1132)
+++ ydb/trunk/README	2009-01-17 18:56:53 UTC (rev 1133)
@@ -3,24 +3,23 @@
 ydb (Yang's Database) is a simple replicated memory store, developed for the
 purpose of researching various approaches to recovery in such OLTP-optimized
-databases as [VOLTDB] (formerly H-Store/Horizontica).
+databases as [H-Store] (or VOLTDB).

-[VOLTDB]: http://db.cs.yale.edu/hstore/
+[H-Store]: http://db.cs.yale.edu/hstore/

-Currently, the only recovery implemented mechanism is to have the first-joining
+Currently, the only recovery mechanism implemented is to have an already-joined
 replica serialize its entire database state and send that to the joining node.

-If you start a system with a minimum of $n$ replicas, then the leader will wait
-for that many to them to join before it starts issuing transactions.
Then when
-replica $n+1$ joins, it will need to catch up to the current state of the
-system; it will do so by contacting the first-joining replica and receiving a
-complete dump of its DB state.
+If you start a system with a minimum of $n$ replicas, then the leader waits for
+that many of them to join before it starts issuing transactions. Once replica
+$n+1$ joins, it needs to catch up to the current state of the system; it does
+so by contacting the first-joining replica and receiving a complete dump of its
+DB state.

-The leader will report the current txn seqno to the joiner, and start streaming
-txns beyond that seqno to the joiner, which the joiner will push onto its
-backlog. It will also instruct that first replica to snapshot its DB state at
-this txn seqno and prepare to send it to the recovering node as soon as it
-connects.
+The leader reports the current txn seqno to the joiner, and starts streaming
+txns beyond that seqno to the joiner, which the joiner pushes onto its backlog.
+It also instructs that first replica to snapshot its DB state at this txn seqno
+and prepare to send it to the recovering node as soon as it connects.

 Setup
 -----
@@ -123,23 +122,28 @@
 ### Recovery experiments

 To run a leader on `farm10`, initial replicas on `farm11` and `farm12`, and a
-recovering replica on `farm13` after 5 seconds:
+recovering replica on `farm13` once the 100,000th txn has been issued:

-    range='10 13' ./test.bash run 1000 # Command requires exactly 4 nodes.
+    range='10 13' seqno=100000 ./test.bash run

-To run this experiment TODO
-trials:
+The above experiment uses the `block` recovery scheme.
To use the `yield` +recovery scheme: - range='10 13' + range='10 13' seqno=100000 extraargs=--yield-catch-up ./test.bash run +To run `block` and `yield`, respectively, for varying values of `seqno` and for +some number of trials: + + range='10 13' ./test.bash full-block + range='10 13' ./test.bash full-yield + ### Scaling experiments To run a leader on `farm10` with initial replicas on the rest: range='10 15' ./test.bash scaling -To run for 1 through 3 initial replicas, repeating each configuration for 3 -trials: +To run for varying numbers of replicas and for some number of trials: range='10 13' ./test.bash full-scaling @@ -199,12 +203,44 @@ Todo ---- +- DONE add benchmarking/testing hooks + - start the recovering joiner at a well-defined time (after a certain # + txns or after the DB reaches a certain size) + - stop the system once recovery finishes +- DONE find out how often prng yields same number + - not very often +- DONE baseline scaling (tps with number of nodes) + - inversely proportional to number of nodes, so bottlenecked at leader +- DONE recovery time as a function of amount of data +- DONE use only insert (and update) txns + - db size blows up much faster +- DONE try gprof profiling + - output quirky; waiting on list response to question +- DONE optimize acks from joiner + - much faster, and much less variance +- DONE use more careful txn counting/data size + - added lower/upper bounds on the rand # ops per txn (5), combined with + above restricting of ops to writes + - 5 is a lot; db grows large; experiments take much longer +- DONE break down into various phases using bar graph of segmented bars +- TODO serialize outputs from the various clients to a single merger to (1) + have ordering over the (timestamped) messages, and (2) avoid interleaved + lines +- TODO detailed view of tps during recovery over time (should see various + phases) +- TODO later: runtime overhead of logging/tps under normal operation (scaled + with # nodes?) 
+- TODO later: timestamped logging? + +Longer term + +- Testing + - unit/regression/mock + - performance tests + - valgrind + - Add a way to reliably obtain ST stack traces -- Add benchmarking/testing hooks, e.g.: - - start the recovering joiner at a well-defined time (after a certain # txns - or after the DB reaches a certain size) - - Add benchmarking information, e.g.: - txns/second normally - txns during recovery @@ -238,24 +274,8 @@ - Add disk-based recovery methods. -Plan/Notes ----------- +Presentation Notes +------------------ -Measurements - -- DONE find out how often prng yields same number - - not very often -- DONE baseline scaling (tps with number of nodes) - - inversely proportional to number of nodes, so bottlenecked at leader -- DONE recovery time as a function of amount of data - - TODO break down into various phases using bar graph of segmented bars -- DONE use only insert (and update) txns -- TODO try profiling -- TODO detailed view of tps during recovery over time (should see various phases) -- TODO later: runtime overhead of logging/tps under normal operation (scaled - with # nodes?) - -Presentation - - TODO differences from: harbor, harp, aries - TODO understand 2pc, paxos, etc. 
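The recovery sequence the README describes (the leader reports a snapshot seqno, streams subsequent txns into the joiner's backlog, and a standing replica ships a state dump that the joiner installs before replaying the backlog) can be sketched as a small model. This is an illustrative Python sketch, not code from the ydb tree; all class and method names are made up:

```python
class Joiner:
    """Sketch of the joiner side of recovery: txns streamed past the
    snapshot seqno are backlogged, then replayed once the DB state dump
    has been received and built up."""

    def __init__(self):
        self.db = {}
        self.seqno = -1      # -1 until the snapshot has been installed
        self.backlog = []    # (seqno, ops) pairs awaiting replay

    def on_txn(self, seqno, ops):
        if self.seqno == -1:
            # Still waiting for the snapshot: queue for later processing.
            self.backlog.append((seqno, ops))
        else:
            self.apply(seqno, ops)

    def on_snapshot(self, state, snap_seqno):
        # Build-up phase: install the dumped state, then catch up by
        # replaying any backlogged txns the snapshot does not cover.
        self.db.update(state)
        self.seqno = snap_seqno
        for seqno, ops in self.backlog:
            if seqno > self.seqno:
                self.apply(seqno, ops)
        self.backlog = []

    def apply(self, seqno, ops):
        for key, value in ops:
            self.db[key] = value
        self.seqno = seqno
```

The key invariant, as in the README, is that every txn is applied exactly once: the replay loop skips backlogged txns already covered by the snapshot seqno.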
Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2009-01-14 18:29:46 UTC (rev 1132) +++ ydb/trunk/src/Makefile 2009-01-17 18:56:53 UTC (rev 1133) @@ -19,14 +19,21 @@ SRCS := $(GENSRCS) OBJS := $(GENOBJS) +ifneq ($(GPROF),) + GPROF := -pg +endif +ifneq ($(GCOV),) + GCOV := -fprofile-arcs -ftest-coverage +endif LDFLAGS := -lstx -lst -lresolv -lpthread -lprotobuf \ - -lboost_program_options-gcc43-mt -CXXFLAGS := -g3 -Wall -Werror -Wextra -Woverloaded-virtual -Wconversion -Wno-conversion -Wno-ignored-qualifiers \ + -lboost_program_options-gcc43-mt $(GPROF) +CXXFLAGS := -g3 $(GPROF) -Wall -Werror -Wextra -Woverloaded-virtual -Wconversion \ + -Wno-conversion -Wno-ignored-qualifiers \ -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings \ -Winit-self -Wsign-promo -Wno-unused-parameter -Wc++0x-compat \ -Wparentheses -Wmissing-format-attribute -Wfloat-equal \ -Winline -Wsynth -PBCXXFLAGS := -g3 -Wall -Werror +PBCXXFLAGS := -g3 -Wall -Werror $(GPROF) all: $(TARGET) Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-01-14 18:29:46 UTC (rev 1132) +++ ydb/trunk/src/main.lzz.clamp 2009-01-17 18:56:53 UTC (rev 1133) @@ -31,11 +31,12 @@ // Configuration. st_utime_t timeout; -int chkpt, accept_joiner_seqno, issuing_interval; +int chkpt, accept_joiner_seqno, issuing_interval, min_ops, max_ops; size_t accept_joiner_size; bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates, - count_updates, stop_on_recovery, general_txns; -long long timelim, read_thresh; + count_updates, stop_on_recovery, general_txns, profile_threads, + debug_threads; +long long timelim, read_thresh, write_thresh; // Control. st_intr_bool stop_hub, kill_hub; @@ -60,10 +61,19 @@ ~thread_eraser() { threads.erase(st_thread_self()); } }; +/** + * For debug/error-printing purposes. 
+ */ map<st_thread_t, string> threadnames; st_thread_t last_thread; /** + * For profiling. + */ +map<st_thread_t, long long> threadtimes; +long long thread_start_time; + +/** * Look up thread name, or just show thread ID. */ string @@ -81,7 +91,9 @@ void switch_out_cb() { - last_thread = st_thread_self(); + if (debug_threads) last_thread = st_thread_self(); + if (profile_threads) + threadtimes[st_thread_self()] += current_time_millis() - thread_start_time; } /** @@ -89,11 +101,13 @@ */ void switch_in_cb() { - if (last_thread != st_thread_self()) { + if (debug_threads && last_thread != st_thread_self()) { cout << "switching"; if (last_thread != 0) cout << " from " << threadname(last_thread); cout << " to " << threadname() << endl; } + if (profile_threads) + thread_start_time = current_time_millis(); } /** @@ -223,9 +237,11 @@ size_t resid = sizeof len; #define checksize(x,y) checkeqnneg(x, static_cast<ssize_t>(y)) int res = st_write_resid(dst, static_cast<void*>(&len), &resid, timeout); + long long before_write; + if (write_thresh > 0) { + before_write = current_time_millis(); + } if (res == -1 && errno == ETIME) { - cerr << "got timeout! " << resid << " of " << sizeof len - << " remaining, for dst #" << dstno << endl; checksize(st_write(dst, reinterpret_cast<char*>(&len) + sizeof len - resid, resid, @@ -234,6 +250,14 @@ } else { check0x(res); } + if (write_thresh > 0) { + long long write_time = current_time_millis() - before_write; + if (write_time > write_thresh) { + cout << "thread " << threadname() + << ": write to dst #" << dstno + << " took " << write_time << " ms" << endl; + } + } checksize(st_write(dst, buf, s.size(), ST_UTIME_NO_TIMEOUT), s.size()); dstno++; @@ -325,7 +349,7 @@ // Generate a random transaction. Txn txn; txn.set_seqno(seqno++); - int count = randint(5) + 1; + int count = randint(min_ops, max_ops + 1); for (int o = 0; o < count; o++) { Op *op = txn.add_op(); int rtype = general_txns ? 
randint(3) : 1, rkey = randint(), rvalue = randint(); @@ -358,6 +382,10 @@ accept_joiner.set(); } } + + Txn txn; + txn.set_seqno(-1); + bcastmsg(fds, txn); } /** @@ -394,7 +422,9 @@ break; } } - sendmsg(leader, res); + if (caught_up) { + sendmsg(leader, res); + } } void @@ -405,7 +435,7 @@ int count_diff = stop_count - start_count; double rate = double(count_diff) * 1000 / time_diff; cout << action << " " << count_diff << " txns in " << time_diff << " ms (" - << rate << "tps)" << endl; + << rate << " tps)" << endl; } /** @@ -427,7 +457,7 @@ * leader. Not entirely clear that this is necessary; could probably just go * with seqno. */ - void +void process_txns(st_netfd_t leader, map<int, int> &map, int &seqno, st_channel<shared_ptr<Recovery> > &send_states, st_channel<shared_ptr<Txn> > &backlog, int init_seqno) @@ -436,20 +466,24 @@ long long start_time = current_time_millis(), time_caught_up = caught_up ? start_time : -1; int seqno_caught_up = caught_up ? seqno : -1; + // Used by joiner only to tell where we actually started (init_seqno is just + // the seqno reported by the leader in the Init message, but it may have + // issued more since the Init message). 
+ int first_seqno = -1; finally f(lambda () { - long long now = current_time_millis(); - showtput("processed", now, __ref(start_time), __ref(seqno), - __ref(init_seqno)); - if (!__ref(caught_up)) { - cout << "live-processing: never entered this phase (never caught up)" << - endl; - } else { - showtput("live-processed", now, __ref(time_caught_up), __ref(seqno), - __ref(seqno_caught_up)); - } - __ref(send_states).push(shared_ptr<Recovery>()); - }); + long long now = current_time_millis(); + showtput("processed", now, __ref(start_time), __ref(seqno), + __ref(init_seqno)); + if (!__ref(caught_up)) { + cout << "live-processing: never entered this phase (never caught up)" << + endl; + } else { + showtput("live-processed", now, __ref(time_caught_up), __ref(seqno), + __ref(seqno_caught_up)); + } + __ref(send_states).push(shared_ptr<Recovery>()); + }); while (true) { Txn txn; @@ -464,23 +498,28 @@ if (read_thresh > 0) { long long read_time = current_time_millis() - before_read; if (read_time > read_thresh) { - cout << "current_time_millis() - before_read = " << read_time << " > " - << read_thresh << endl; + cout << "thread " << threadname() + << ": read took " << read_time << " ms" << endl; } } if (txn.has_seqno()) { const char *action; - if (txn.seqno() == seqno + 1) { + if (txn.seqno() < 0) { + break; + } else if (txn.seqno() == seqno + 1) { if (!caught_up) { time_caught_up = current_time_millis(); seqno_caught_up = seqno; - showtput("backlogged", time_caught_up, start_time, seqno_caught_up, - init_seqno); + showtput("process_txns caught up; backlogged", + time_caught_up, start_time, seqno_caught_up, + first_seqno == -1 ? init_seqno : first_seqno); caught_up = true; } process_txn(leader, map, txn, seqno, true); action = "processed"; } else { + if (first_seqno == -1) + first_seqno = txn.seqno(); // Queue up for later processing once a snapshot has been received. 
backlog.push(shared_ptr<Txn>(new Txn(txn))); action = "backlogged"; @@ -576,7 +615,8 @@ // never grow again if stop_hub is set. if (last_seqno + 1 == seqno) { cout << rid << ": "; - cout << "stopping seqno = " << res.seqno() << endl; + cout << "clean stop; next expected seqno is " << seqno + << " (last seqno was " << last_seqno << ")" << endl; break; } else { continue; @@ -597,7 +637,7 @@ recover_signals.push(now); cout << rid << ": "; cout << "recovering node caught up; took " - << timediff << "ms" << endl; + << timediff << " ms" << endl; // This will cause the program to exit eventually, but cleanly, such that // the recovery time will be set first, before the eventual exit (which // may not even happen in the current iteration). @@ -756,12 +796,12 @@ int seqno = -1; finally f(lambda () { cout << "REPLICA SUMMARY" << endl; - cout << "total updates = " << updates << endl; - cout << "final DB state: seqno = " << __ref(seqno) << ", size = " + cout << "- total updates = " << updates << endl; + cout << "- final DB state: seqno = " << __ref(seqno) << ", size = " << __ref(map).size() << endl; string fname = string("/tmp/ydb") + lexical_cast<string>(getpid()); if (dump) { - cout << "dumping to " << fname << endl; + cout << "- dumping to " << fname << endl; ofstream of(fname.c_str()); of << "seqno: " << __ref(seqno) << endl; foreach (const pii &p, __ref(map)) { @@ -824,6 +864,7 @@ // If there's anything to recover. if (init.txnseqno() > 0) { cout << "waiting for recovery from " << replicas[0] << endl; + long long before_recv = current_time_millis(); // Read the recovery message. 
Recovery recovery; @@ -831,6 +872,9 @@ st_intr intr(stop_hub); readmsg(replicas[0], recovery); } + long long build_start = current_time_millis(); + cout << "got recovery message in " << build_start - before_recv + << " ms" << endl; for (int i = 0; i < recovery.pair_size(); i++) { const Recovery_Pair &p = recovery.pair(i); map[p.key()] = p.value(); @@ -840,8 +884,11 @@ } assert(seqno == -1 && static_cast<typeof(seqno)>(recovery.seqno()) > seqno); - seqno = recovery.seqno(); - cout << "recovered " << recovery.pair_size() << " records." << endl; + int mid_seqno = seqno = recovery.seqno(); + long long mid_time = current_time_millis(); + cout << "receive and build-up took " << mid_time - before_recv + << " ms; built up map of " << recovery.pair_size() << " records in " + << mid_time - build_start << " ms; now at seqno " << seqno << endl; while (!backlog.empty()) { shared_ptr<Txn> p = backlog.take(); @@ -849,13 +896,14 @@ if (p->seqno() % chkpt == 0) { if (verbose) cout << "processed txn " << p->seqno() << " off the backlog; " - << "backlog.size = " << backlog.queue().size() << endl; + << "backlog.size = " << backlog.queue().size() << endl; // Explicitly yield. (Note that yielding does still effectively // happen anyway because process_txn is a yield point.) st_sleep(0); } } - cout << "caught up." 
<< endl; + showtput("replayer caught up; from backlog replayed", + current_time_millis(), mid_time, seqno, mid_seqno); } } catch (std::exception &ex) { cerr_thread_ex(ex) << endl; @@ -912,7 +960,7 @@ try { GOOGLE_PROTOBUF_VERIFY_VERSION; - bool is_leader, use_epoll, debug_threads; + bool is_leader, use_epoll; int minreps; uint16_t leader_port, listen_port; string leader_host; @@ -923,6 +971,8 @@ ("help,h", "show this help message") ("debug-threads,d",po::bool_switch(&debug_threads), "enable context switch debug outputs") + ("profile-threads,q",po::bool_switch(&profile_threads), + "enable profiling of threads") ("verbose,v", po::bool_switch(&verbose), "enable periodic printing of txn processing progress") ("epoll,e", po::bool_switch(&use_epoll), @@ -951,6 +1001,12 @@ ("issuing-interval,i", po::value<int>(&issuing_interval)->default_value(0), "seconds to sleep between issuing txns (for leader only)") + ("min-ops,o", + po::value<int>(&min_ops)->default_value(5), + "lower bound on randomly generated number of operations per txn (for leader only)") + ("max-ops,O", + po::value<int>(&max_ops)->default_value(5), + "upper bound on randomly generated number of operations per txn (for leader only)") ("accept-joiner-seqno,j", po::value<int>(&accept_joiner_seqno)->default_value(0), "accept recovering joiner (start recovery) after this seqno (for leader " @@ -965,9 +1021,10 @@ "number of txns before yielding/verbose printing") ("timelim,T", po::value<long long>(&timelim)->default_value(0), "general network IO time limit in milliseconds, or 0 for none") + ("write-thresh,w", po::value<long long>(&write_thresh)->default_value(200), + "if positive and any txn write exceeds this, then print a message (for replicas only)") ("read-thresh,r", po::value<long long>(&read_thresh)->default_value(0), - "if positive and any txn read exceeds this, then print a message " - "(for replicas only)") + "if positive and any txn read exceeds this, then print a message (for replicas only)") 
("listen-port,p", po::value<uint16_t>(&listen_port)->default_value(7654), "port to listen on (replicas only)") ("timeout,t", po::value<st_utime_t>(&timeout)->default_value(200000), @@ -984,6 +1041,11 @@ cout << desc << endl; return 0; } + + // Validate arguments. + check(min_ops > 0); + check(max_ops > 0); + check(max_ops >= min_ops); } catch (std::exception &ex) { cerr << ex.what() << endl << endl << desc << endl; return 1; } @@ -1002,8 +1064,8 @@ // Initialize ST. if (use_epoll) check0x(st_set_eventsys(ST_EVENTSYS_ALT)); check0x(st_init()); - st_spawn(bind(handle_sig_sync)); - if (debug_threads) { + my_spawn(bind(handle_sig_sync), "handle_sig_sync"); + if (debug_threads || profile_threads) { st_set_switch_out_cb(switch_out_cb); st_set_switch_in_cb(switch_in_cb); } @@ -1013,6 +1075,26 @@ threads.insert(st_thread_self()); threadnames[st_thread_self()] = "main"; + finally f(lambda() { + if (profile_threads) { + cout << "thread profiling results:" << endl; + long long total = 0; + typedef pair<st_thread_t, long long> entry; + foreach (entry p, threadtimes) { + const string &name = threadname(p.first); + if (name != "main" && name != "handle_sig_sync") + total += p.second; + } + foreach (entry p, threadtimes) { + const string &name = threadname(p.first); + if (name != "main" && name != "handle_sig_sync") + cout << "- " << threadname(p.first) << ": " << p.second + << " (" << (100.0 * p.second / total) << "%)" + << endl; + } + } + }); + // Which role are we? 
if (is_leader) { run_leader(minreps, leader_port); Modified: ydb/trunk/tools/analysis.py =================================================================== --- ydb/trunk/tools/analysis.py 2009-01-14 18:29:46 UTC (rev 1132) +++ ydb/trunk/tools/analysis.py 2009-01-17 18:56:53 UTC (rev 1133) @@ -1,9 +1,12 @@ #!/usr/bin/env python from __future__ import with_statement -import re, sys, itertools, numpy +import re, sys, itertools +from os.path import basename, realpath from pylab import * +def getname(path): return basename(realpath(path)) + def check(path): with file(path) as f: if 'got timeout' in f.read(): @@ -11,12 +14,21 @@ def agg(src): def gen(): - for seqno, pairs in itertools.groupby(src, lambda (a,b): a): - ts = numpy.array([t for seqno, t in pairs]) - yield seqno, ts.mean(), ts.std(), ts - return list(gen()) + for index, tups in itertools.groupby(src, lambda x: x[0]): + yield list(tups) + a = array(list(gen())) + indexes = a[:,0,0] + means = a.mean(1) + stds = a.std(1) + tup = (indexes,) + for i in range(1, len(a[0,0])): + tup += (means[:,i], stds[:,i]) + stacked = hstack(map(lambda x: x.reshape((len(indexes),1)), tup)) + return tup + (stacked, a) def scaling(path): + print '=== scaling ===' + print 'file:', getname(path) check(path) def getpairs(): with file(path) as f: @@ -24,62 +36,77 @@ m = re.match( r'=== n=(?P<n>\d+) ', line ) if m: n = int(m.group('n')) - m = re.match( r'.*: issued .*[^.\d](?P<tps>[.\d]+)tps', line ) + m = re.match( r'.*: issued .*[^.\d](?P<tps>[.\d]+) ?tps', line ) if m: tps = float(m.group('tps')) yield (n, tps) tups = agg(getpairs()) + ns, tpsmeans, tpssds, stacked, a = agg(getpairs()) + print 'n, tps mean, tps sd' + print stacked + print - print 'num nodes, mean tps, stdev tps' - for n, mean, sd, raw in tups: print n, mean, sd, raw - - xs, ys, es, rs = zip(*tups) - errorbar(xs, ys, es) + errorbar(ns, tpsmeans, tpssds) title('Scaling of baseline throughput with number of nodes') xlabel('Node count') ylabel('Mean TPS (stdev error 
bars)') - xlim(.5, n+.5) + xlim(ns.min() - .5, ns.max() + .5) ylim(ymin = 0) savefig('scaling.png') def run(blockpath, yieldpath): - for path, label in [(blockpath, 'blocking scheme'), (yieldpath, 'yielding scheme')]: + for path, label in [#(blockpath, 'blocking scheme'), + (yieldpath, 'yielding scheme')]: + print '===', label, '===' + print 'file:', getname(path) check(path) def getpairs(): with file(path) as f: + seqno = recv = buildup = catchup = total = None for line in f: - m = re.match( r'=== seqno=(?P<n>\d+) ', line ) - if m: - seqno = int(m.group('n')) - m = re.match( r'.*: recovering node caught up; took (?P<time>\d+)ms', line ) - if m: - t = float(m.group('time')) - yield (seqno, t) - tups = agg(getpairs()) + m = re.match( r'=== seqno=(?P<seqno>\d+) ', line ) + if m: seqno = int(m.group('seqno')) + m = re.search( r'got recovery message in (?P<time>\d+) ms', line ) + if m: recv = float(m.group('time')) + m = re.search( r'built up .* (?P<time>\d+) ms', line ) + if m: buildup = float(m.group('time')) + m = re.search( r'replayer caught up; from backlog replayed \d+ txns in (?P<time>\d+) ms', line ) + if m: catchup = float(m.group('time')) + m = re.match( r'.*: recovering node caught up; took (?P<time>\d+) ?ms', line ) + if m: total = float(m.group('time')) + tup = (seqno, recv, buildup, catchup, total) + if all(tup): + yield tup + seqno = recv = buildup = catchup = total = None + seqnos, recvmeans, recvsds, buildmeans, buildsds, catchmeans, catchsds, totalmeans, totalsds, stacked, a = agg(getpairs()) - print '===', label, '===' - print 'max seqno, mean time, stdev time [raw data]' - for seqno, mean, sd, raw in tups: print seqno, mean, sd, raw + print 'max seqno, recv mean, recv sd, build mean, build sd, catch mean, catch sd, total mean, total sd' + print stacked print - xs, ys, es, rs = zip(*tups) - errorbar(xs, ys, es, label = label) + width = 5e4 + a = bar(seqnos, recvmeans, yerr = recvsds, width = width, color = 'r', + label = 'State receive') + b = 
bar(seqnos, buildmeans, yerr = buildsds, width = width, color = 'g', + label = 'Build-up time', bottom = recvmeans) + c = bar(seqnos, catchmeans, yerr = catchsds, width = width, color = 'b', + label = 'Catch-up', bottom = recvmeans + buildmeans) title('Recovery time over number of transactions') xlabel('Transaction count (corresponds roughly to data size)') - ylabel('Mean recovery time in ms (stdev error bars)') - #xlim(.5, n+.5) - #ylim(ymin = 0) + ylabel('Mean time in ms (SD error bars)') + legend(loc = 'upper left') savefig('run.png') def main(argv): if len(argv) <= 1: print >> sys.stderr, 'Must specify a command' - elif sys.argv[1] == 'scaling': - scaling(sys.argv[2] if len(sys.argv) > 2 else 'scaling-log') - elif sys.argv[1] == 'run': - run(*sys.argv[2:] if len(sys.argv) > 2 else ['block-log', 'yield-log']) + elif argv[1] == 'scaling': + scaling(argv[2] if len(argv) > 2 else 'scaling-log') + elif argv[1] == 'run': + run(*argv[2:] if len(argv) > 2 else ['block-log', 'yield-log']) else: - print >> sys.stderr, 'Unknown command:', sys.argv[1] + print >> sys.stderr, 'Unknown command:', argv[1] -sys.exit(main(sys.argv)) +if __name__ == '__main__': + sys.exit(main(sys.argv)) Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-01-14 18:29:46 UTC (rev 1132) +++ ydb/trunk/tools/test.bash 2009-01-17 18:56:53 UTC (rev 1133) @@ -269,6 +269,7 @@ full-scaling() { local base=$1 out=scaling-log-$(date +%Y-%m-%d-%H:%M:%S-%N) shift + ln -sf $out scaling-log for n in {1..5} ; do # configurations export range="$base $((base + n))" stop @@ -281,13 +282,12 @@ echo done done >& $out - ln -sf $out scaling-log } run-helper() { local leader=$1 shift - tagssh $leader "ydb/src/ydb -l -x --accept-joiner-seqno $seqno -n $(( $# - 1 )) ${extraargs:-}" & # -v --debug-threads + tagssh $leader "ydb/src/ydb -l -x --accept-joiner-seqno $seqno -n $(( $# - 1 )) -o 1 -O 1 ${extraargs:-}" & # -v --debug-threads 
sleep .1 # pexpect 'waiting for at least' # Run initial replicas. while (( $# > 1 )) ; do @@ -312,7 +312,7 @@ } full-run() { - for seqno in 100000 300000 500000 700000 900000; do # configurations + for seqno in 500000 400000 300000 200000 100000 ; do # 200000 300000 400000 500000 ; do # 700000 900000; do # configurations stop for i in {1..5} ; do # trials echo === seqno=$seqno i=$i === @@ -327,20 +327,20 @@ full-block() { local out=block-log-$(date +%Y-%m-%d-%H:%M:%S) + ln -sf $out block-log full-run >& $out - ln -sf $out block-log } full-yield() { local out=yield-log-$(date +%Y-%m-%d-%H:%M:%S) + ln -sf $out yield-log extraargs='--yield-catch-up' full-run >& $out - ln -sf $out yield-log } full() { - #full-block + full-block full-yield - #full-scaling + full-scaling } stop-helper() { This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
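The reworked `agg` in analysis.py above groups per-trial tuples by their leading index (seqno or node count) and emits per-field means and standard deviations for the error bars. A numpy-free sketch of the same aggregation pattern, for illustration only (the committed code returns numpy arrays so pylab can plot them directly):

```python
import itertools
from statistics import mean, pstdev

def agg(rows):
    """Group (index, field1, field2, ...) tuples by index and return, per
    index, the mean and population stdev of each field across trials.
    Input must already be sorted by index (itertools.groupby is local)."""
    out = []
    for index, tups in itertools.groupby(rows, key=lambda r: r[0]):
        cols = list(zip(*tups))          # transpose trials into columns
        entry = [index]
        for col in cols[1:]:             # skip the index column itself
            entry += [mean(col), pstdev(col)]
        out.append(tuple(entry))
    return out

# Three trials at each of two seqno configurations, one timing field each.
rows = [(100000, 10.0), (100000, 12.0), (100000, 14.0),
        (300000, 20.0), (300000, 22.0), (300000, 24.0)]
```

Note `itertools.groupby` only merges *consecutive* equal keys, which is why the log parser can feed it directly: trials for a given configuration appear contiguously in the log files.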
From: <yan...@us...> - 2009-01-23 09:34:27
Revision: 1136 http://assorted.svn.sourceforge.net/assorted/?rev=1136&view=rev Author: yangzhang Date: 2009-01-23 09:34:16 +0000 (Fri, 23 Jan 2009) Log Message: ----------- - added multi-host recovery - added start of googletest tests - added --debug-memory, memory monitor - added place-holder for disk IO threads - fixed some bugs in test.bash - added gtest setup to test.bash - improved colors on plots - updated reqs, todo Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/Makefile ydb/trunk/src/main.lzz.clamp ydb/trunk/src/ydb.proto ydb/trunk/tools/analysis.py ydb/trunk/tools/test.bash Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-01-19 04:52:00 UTC (rev 1135) +++ ydb/trunk/README 2009-01-23 09:34:16 UTC (rev 1136) @@ -30,6 +30,7 @@ - [C++ Commons] svn r1082 - [clamp] 153 - [GCC] 4.3.2 +- [googletest] 1.2.1 - [Lazy C++] 2.8.0 - [Protocol Buffers] 2.0.0 - [State Threads] 1.8 @@ -38,6 +39,7 @@ [C++ Commons]: http://assorted.sourceforge.net/cpp-commons/ [clamp]: http://home.clara.net/raoulgough/clamp/ [GCC]: http://gcc.gnu.org/ +[googletest]: http://code.google.com/p/googletest/ [Lazy C++]: http://www.lazycplusplus.com/ [Protocol Buffers]: http://code.google.com/p/protobuf/ [State Threads]: http://state-threads.sourceforge.net/ @@ -203,6 +205,8 @@ Todo ---- +Period: -1/20 + - DONE add benchmarking/testing hooks - start the recovering joiner at a well-defined time (after a certain # txns or after the DB reaches a certain size) @@ -223,9 +227,31 @@ above restricting of ops to writes - 5 is a lot; db grows large; experiments take much longer - DONE break down into various phases using bar graph of segmented bars + +Period: 1/20-1/27 + +- DONE implement multihost +- TODO add simple, proper timestamped logging +- TODO see how much multihost recovery affects perf +- TODO look again at how much yielding affects perf +- TODO monitor memory usage +- TODO switch to btree +- TODO break down 
the red bar some more +- TODO see how much time difference there is +- TODO red bar: why are/aren't we saturating bandwidth? +- TODO understand the rest of the perf (eg stl map) +- TODO try scaling up +- TODO implement checkpointing disk-based scheme +- TODO implement log-based recovery; show that it sucks +- TODO implement group (batch) commit for log-based recovery +- TODO talk + - motivation: log-based sucks, look into alternatives - TODO serialize outputs from the various clients to a single merger to (1) have ordering over the (timestamped) messages, and (2) avoid interleaved lines + +Period: 1/27- + - TODO detailed view of tps during recovery over time (should see various phases) - TODO later: runtime overhead of logging/tps under normal operation (scaled Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2009-01-19 04:52:00 UTC (rev 1135) +++ ydb/trunk/src/Makefile 2009-01-23 09:34:16 UTC (rev 1136) @@ -25,10 +25,10 @@ ifneq ($(GCOV),) GCOV := -fprofile-arcs -ftest-coverage endif -LDFLAGS := -lstx -lst -lresolv -lpthread -lprotobuf \ - -lboost_program_options-gcc43-mt $(GPROF) -CXXFLAGS := -g3 $(GPROF) -Wall -Werror -Wextra -Woverloaded-virtual -Wconversion \ - -Wno-conversion -Wno-ignored-qualifiers \ +LDFLAGS := -pthread -lstx -lst -lresolv -lprotobuf -lgtest \ + -lboost_program_options-gcc43-mt -lboost_thread-gcc43-mt $(GPROF) +CXXFLAGS := -g3 -pthread $(GPROF) -Wall -Werror -Wextra -Woverloaded-virtual \ + -Wconversion -Wno-conversion -Wno-ignored-qualifiers \ -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings \ -Winit-self -Wsign-promo -Wno-unused-parameter -Wc++0x-compat \ -Wparentheses -Wmissing-format-attribute -Wfloat-equal \ Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-01-19 04:52:00 UTC (rev 1135) +++ ydb/trunk/src/main.lzz.clamp 2009-01-23 09:34:16 UTC (rev 1136) @@ 
-2,8 +2,10 @@ #include <boost/bind.hpp> #include <boost/foreach.hpp> #include <boost/program_options.hpp> +#include <boost/range/iterator_range.hpp> #include <boost/scoped_array.hpp> #include <boost/shared_ptr.hpp> +#include <boost/thread.hpp> #include <commons/nullptr.h> #include <commons/rand.h> #include <commons/st/st.h> @@ -13,6 +15,8 @@ #include <cstring> // strsignal #include <iostream> #include <fstream> +#include <gtest/gtest.h> +#include <malloc.h> #include <map> #include <netinet/in.h> // in_addr etc. #include <set> @@ -25,9 +29,11 @@ using namespace boost; using namespace commons; using namespace std; +using namespace testing; #end typedef pair<int, int> pii; +typedef map<int, int> mii; // Configuration. st_utime_t timeout; @@ -35,7 +41,7 @@ size_t accept_joiner_size; bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates, count_updates, stop_on_recovery, general_txns, profile_threads, - debug_threads; + debug_threads, multirecover, disk, debug_memory; long long timelim, read_thresh, write_thresh; // Control. @@ -339,7 +345,11 @@ // one) to prepare to send recovery information (by sending an // empty/default Txn). if (!newreps.empty() && seqno > 0) { - sendmsg(fds[0], Txn()); + if (multirecover) { + bcastmsg(fds, Txn()); + } else { + sendmsg(fds[0], Txn()); + } } // Bring in any new members. while (!newreps.empty()) { @@ -392,7 +402,7 @@ * Process a transaction: update DB state (incl. seqno) and send response to * leader. 
*/ - void +void process_txn(st_netfd_t leader, map<int, int> &map, const Txn &txn, int &seqno, bool caught_up) { @@ -427,18 +437,44 @@ } } - void +void showtput(const string &action, long long stop_time, long long start_time, int stop_count, int start_count) { long long time_diff = stop_time - start_time; int count_diff = stop_count - start_count; double rate = double(count_diff) * 1000 / time_diff; - cout << action << " " << count_diff << " txns in " << time_diff << " ms (" - << rate << " tps)" << endl; + cout << action << " " << count_diff << " txns [" + << start_count << ".." << stop_count + << "] in " << time_diff << " ms [" + << start_time << ".." << stop_time + << "] (" + << rate << " tps)" << endl; } /** + * Return range * part / nparts, but with proper casting. Assumes that part < + * nparts. + */ +inline int +interp(int range, int part, int nparts) { + return static_cast<int>(static_cast<long long>(range) * part / nparts); +} + +#src +TEST(interp_test, basics) { + EXPECT_EQ(0, interp(3, 0, 3)); + EXPECT_EQ(1, interp(3, 1, 3)); + EXPECT_EQ(2, interp(3, 2, 3)); + EXPECT_EQ(3, interp(3, 3, 3)); + + EXPECT_EQ(0, interp(RAND_MAX, 0, 2)); + EXPECT_EQ(RAND_MAX / 2, interp(RAND_MAX, 1, 2)); + EXPECT_EQ(RAND_MAX, interp(RAND_MAX, 2, 2)); +} +#end + +/** * Actually do the work of executing a transaction and sending back the reply. * * \param[in] leader The connection to the leader. @@ -454,13 +490,18 @@ * \param[in] backlog The backlog of txns that need to be processed. * * \param[in] init_seqno The seqno that was sent in the Init message from the - * leader. Not entirely clear that this is necessary; could probably just go - * with seqno. + * leader. The first expected seqno. + * + * \param[in] mypos This host's position in the Init message list. Used for + * calculating the sub-range of the map for which this node is responsible. + * + * \param[in] nnodes The total number nodes in the Init message list. 
*/ void process_txns(st_netfd_t leader, map<int, int> &map, int &seqno, st_channel<shared_ptr<Recovery> > &send_states, - st_channel<shared_ptr<Txn> > &backlog, int init_seqno) + st_channel<shared_ptr<Txn> > &backlog, int init_seqno, + int mypos, int nnodes) { bool caught_up = init_seqno == 0; long long start_time = current_time_millis(), @@ -503,6 +544,7 @@ } } if (txn.has_seqno()) { + // Regular transaction. const char *action; if (txn.seqno() < 0) { break; @@ -512,7 +554,7 @@ seqno_caught_up = seqno; showtput("process_txns caught up; backlogged", time_caught_up, start_time, seqno_caught_up, - first_seqno == -1 ? init_seqno : first_seqno); + first_seqno == -1 ? init_seqno - 1 : first_seqno); caught_up = true; } process_txn(leader, map, txn, seqno, true); @@ -535,10 +577,17 @@ st_sleep(0); } } else { - // Generate a snapshot. + // Empty (default) Txn means "generate a snapshot." shared_ptr<Recovery> recovery(new Recovery); - cout << "generating recovery of " << map.size() << " records" << endl; - foreach (const pii &p, map) { + mii::const_iterator begin = + map.lower_bound(multirecover ? interp(RAND_MAX, mypos, nnodes) : 0); + mii::const_iterator end = multirecover && mypos < nnodes - 1 ? + map.lower_bound(interp(RAND_MAX, mypos + 1, nnodes)) : map.end(); + cout << "generating recovery over " << begin->first << ".." + << (end == map.end() ? "end" : lexical_cast<string>(end->first)) + << " (node " << mypos << " of " << nnodes << ")" + << endl; + foreach (const pii &p, make_iterator_range(begin, end)) { Recovery_Pair *pair = recovery->add_pair(); pair->set_key(p.first); pair->set_value(p.second); @@ -553,7 +602,7 @@ /** * Swallow replica responses. 
*/ - void +void handle_responses(st_netfd_t replica, const int &seqno, int rid, st_multichannel<long long> &recover_signals, bool caught_up) { @@ -695,6 +744,15 @@ cout << "sent recovery" << endl; } +void +threadfunc() +{ + while (true) { + sleep(3); + cout << "AAAAAAAAAAAAAAAAAAAAAA" << endl; + } +} + /** * Run the leader. */ @@ -725,6 +783,7 @@ // Construct the initialization message. Init init; init.set_txnseqno(0); + init.set_multirecover(true); foreach (replica_info r, replicas) { SockAddr *psa = init.add_node(); psa->set_host(r.host()); @@ -791,6 +850,13 @@ void run_replica(string leader_host, uint16_t leader_port, uint16_t listen_port) { + if (disk) { + // Disk IO threads. + for (int i = 0; i < 5; i++) { + thread somethread(threadfunc); + } + } + // Initialize database state. map<int, int> map; int seqno = -1; @@ -829,13 +895,15 @@ readmsg(leader, init); } uint32_t listen_host = init.yourhost(); + multirecover = init.multirecover(); // Display the info. cout << "got init msg with txn seqno " << init.txnseqno() << " and hosts:" << endl; vector<st_netfd_t> replicas; st_closing_all close_replicas(replicas); - for (uint16_t i = 0; i < init.node_size(); i++) { + int mypos = -1; + for (int i = 0; i < init.node_size(); i++) { const SockAddr &sa = init.node(i); char buf[INET_ADDRSTRLEN]; in_addr host = { sa.host() }; @@ -843,6 +911,7 @@ cout << "- " << checkerr(inet_ntop(AF_INET, &host, buf, INET_ADDRSTRLEN)) << ':' << sa.port() << (is_self ? 
" (self)" : "") << endl; + if (is_self) mypos = i; if (!is_self && init.txnseqno() > 0) { replicas.push_back(st_tcp_connect(host, static_cast<uint16_t>(sa.port()), @@ -854,7 +923,8 @@ st_channel<shared_ptr<Txn> > backlog; st_joining join_proc(my_spawn(bind(process_txns, leader, ref(map), ref(seqno), ref(send_states), - ref(backlog), init.txnseqno()), + ref(backlog), init.txnseqno(), + mypos, init.node_size()), "process_txns")); st_joining join_rec(my_spawn(bind(recover_joiner, listener, ref(map), ref(seqno), ref(send_states)), @@ -863,33 +933,46 @@ try { // If there's anything to recover. if (init.txnseqno() > 0) { - cout << "waiting for recovery from " << replicas[0] << endl; + cout << "waiting for recovery message" << (multirecover ? "s" : "") + << endl; long long before_recv = current_time_millis(); - // Read the recovery message. - Recovery recovery; - { - st_intr intr(stop_hub); - readmsg(replicas[0], recovery); + vector<st_thread_t> recovery_builders; + assert(seqno == -1); + for (int i = 0; i < (multirecover ? init.node_size() : 1); i++) { + recovery_builders.push_back(my_spawn(lambda() { + // Read the recovery message. 
+ Recovery recovery; + { + st_intr intr(stop_hub); + readmsg(__ref(replicas)[__ctx(i)], recovery); + } + long long build_start = current_time_millis(); + cout << "got recovery message in " + << build_start - __ref(before_recv) << " ms" << endl; + for (int i = 0; i < recovery.pair_size(); i++) { + const Recovery_Pair &p = recovery.pair(i); + __ref(map)[p.key()] = p.value(); + if (i % chkpt == 0) { + if (yield_during_build_up) st_sleep(0); + } + } + check(recovery.seqno() >= 0); + int seqno = __ref(seqno) = recovery.seqno(); + long long build_end = current_time_millis(); + cout << "receive and build-up took " + << build_end - __ref(before_recv) + << " ms; built up map of " << recovery.pair_size() + << " records in " << build_end - build_start + << " ms; now at seqno " << seqno << endl; + }, "recovery_builder" + lexical_cast<string>(i))); } - long long build_start = current_time_millis(); - cout << "got recovery message in " << build_start - before_recv - << " ms" << endl; - for (int i = 0; i < recovery.pair_size(); i++) { - const Recovery_Pair &p = recovery.pair(i); - map[p.key()] = p.value(); - if (i % chkpt == 0) { - if (yield_during_build_up) st_sleep(0); - } + foreach (st_thread_t t, recovery_builders) { + st_join(t); } - assert(seqno == -1 && - static_cast<typeof(seqno)>(recovery.seqno()) > seqno); - int mid_seqno = seqno = recovery.seqno(); long long mid_time = current_time_millis(); - cout << "receive and build-up took " << mid_time - before_recv - << " ms; built up map of " << recovery.pair_size() << " records in " - << mid_time - build_start << " ms; now at seqno " << seqno << endl; + int mid_seqno = seqno; while (!backlog.empty()) { shared_ptr<Txn> p = backlog.take(); process_txn(leader, map, *p, seqno, false); @@ -951,6 +1034,21 @@ } /** + * Memory monitor. + */ +void +memmon() +{ + while (!stop_hub) { + { + st_intr intr(stop_hub); + st_sleep(1); + } + malloc_stats(); + } +} + +/** * Initialization and command-line parsing. 
*/ int @@ -969,6 +1067,8 @@ po::options_description desc("Allowed options"); desc.add_options() ("help,h", "show this help message") + ("debug-memory,M", po::bool_switch(&debug_memory), + "enable memory monitoring/debug outputs") ("debug-threads,d",po::bool_switch(&debug_threads), "enable context switch debug outputs") ("profile-threads,q",po::bool_switch(&profile_threads), @@ -978,9 +1078,13 @@ ("epoll,e", po::bool_switch(&use_epoll), "use epoll (select is used by default)") ("yield-build-up", po::bool_switch(&yield_during_build_up), - "yield periodically during build-up phase of recovery") + "yield periodically during build-up phase of recovery (for recoverer only)") ("yield-catch-up", po::bool_switch(&yield_during_catch_up), - "yield periodically during catch-up phase of recovery") + "yield periodically during catch-up phase of recovery (for recoverer only)") + ("multirecover,m", po::bool_switch(&multirecover), + "recover from multiple hosts, instead of just one (specified via leader only)") + ("disk,k", po::bool_switch(&disk), + "use disk-based recovery") ("dump,D", po::bool_switch(&dump), "replicas should finally dump their state to a tmp file for " "inspection/diffing") @@ -1029,6 +1133,7 @@ "port to listen on (replicas only)") ("timeout,t", po::value<st_utime_t>(&timeout)->default_value(200000), "timeout for IO operations (in microseconds)") + ("test", "execute unit tests instead of running the normal system") ("minreps,n", po::value<int>(&minreps)->default_value(2), "minimum number of replicas the system is willing to process txns on"); @@ -1051,6 +1156,12 @@ return 1; } + // Run unit-tests. + if (vm.count("test")) { + InitGoogleTest(&argc, argv); + return RUN_ALL_TESTS(); + } + // Initialize support for ST working with asynchronous signals. check0x(pipe(sig_pipe)); struct sigaction sa; @@ -1075,6 +1186,12 @@ threads.insert(st_thread_self()); threadnames[st_thread_self()] = "main"; + // Print memory debugging information. 
+ if (debug_memory) { + my_spawn(memmon, "memmon"); + } + + // At the end, print thread profiling information. finally f(lambda() { if (profile_threads) { cout << "thread profiling results:" << endl; Modified: ydb/trunk/src/ydb.proto =================================================================== --- ydb/trunk/src/ydb.proto 2009-01-19 04:52:00 UTC (rev 1135) +++ ydb/trunk/src/ydb.proto 2009-01-23 09:34:16 UTC (rev 1136) @@ -14,13 +14,15 @@ // Initialization message sent to a nodes when it joins. message Init { - // The current seqno that the server is on. + // The next seqno that the server is going to send. required int32 txnseqno = 1; // What the leader perceives to be the joining replica's IP address. required uint32 yourhost = 2; + // Which recovery scheme we're using. + required bool multirecover = 3; // The nodes that have joined (including the joining node); the ports here // are the ports on which the nodes are listening. - repeated SockAddr node = 3; + repeated SockAddr node = 4; } // Sent to already-joined nodes to inform them of a newly joining node. 
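The `interp()` helper added in this commit (with its gtest cases) pins down how the key space is split for multi-host recovery: each standing replica serializes only the `[interp(RAND_MAX, mypos, nnodes), interp(RAND_MAX, mypos+1, nnodes))` slice of its map, with the last node's slice running to `map.end()`. A minimal Python sketch of the same arithmetic — the `RAND_MAX` value and the `recovery_range` helper are illustrative assumptions, not part of the source:

```python
RAND_MAX = 2**31 - 1  # glibc's RAND_MAX; an assumption, not taken from the source

def interp(rng, part, nparts):
    # Port of interp() from main.lzz: rng * part / nparts with integer
    # division. The C++ version widens to long long so rng * part cannot
    # overflow; Python ints are arbitrary precision, so no cast is needed.
    return rng * part // nparts

def recovery_range(mypos, nnodes, multirecover=True):
    # Hypothetical helper (not in the source): the [lo, hi) key slice that
    # node `mypos` of `nnodes` serializes in process_txns. The last node's
    # slice is open-ended (map.end() in the C++), represented here as hi=None.
    if not multirecover:
        return (0, None)  # single-host recovery dumps the whole map
    lo = interp(RAND_MAX, mypos, nnodes)
    hi = interp(RAND_MAX, mypos + 1, nnodes) if mypos < nnodes - 1 else None
    return (lo, hi)
```

The values agree with the `interp_test` expectations in the diff, e.g. `interp(RAND_MAX, 1, 2) == RAND_MAX // 2`.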
Modified: ydb/trunk/tools/analysis.py =================================================================== --- ydb/trunk/tools/analysis.py 2009-01-19 04:52:00 UTC (rev 1135) +++ ydb/trunk/tools/analysis.py 2009-01-23 09:34:16 UTC (rev 1136) @@ -85,12 +85,21 @@ print width = 5e4 - a = bar(seqnos, recvmeans, yerr = recvsds, width = width, color = 'r', - label = 'State receive') - b = bar(seqnos, buildmeans, yerr = buildsds, width = width, color = 'g', - label = 'Build-up time', bottom = recvmeans) - c = bar(seqnos, catchmeans, yerr = catchsds, width = width, color = 'b', - label = 'Catch-up', bottom = recvmeans + buildmeans) + # From "zen and tea" on kuler.adobe.com + hue = lambda i: tuple(map(lambda x: float(x)/255, + [( 16, 34, 43), + (149,171, 99), + (189,214,132), + (226,240,214), + (246,255,224)][i+1])) + ehue = lambda i: hue(-1) # tuple(map(lambda x: min(1, x + .3), hue(i))) + a = bar(seqnos, recvmeans, yerr = recvsds, width = width, color = hue(0), + ecolor = ehue(0), label = 'State receive') + b = bar(seqnos, buildmeans, yerr = buildsds, width = width, color = hue(1), + ecolor = ehue(1), label = 'Build-up time', bottom = recvmeans) + c = bar(seqnos, catchmeans, yerr = catchsds, width = width, color = hue(2), + ecolor = ehue(2), label = 'Catch-up', + bottom = recvmeans + buildmeans) title('Recovery time over number of transactions') xlabel('Transaction count (corresponds roughly to data size)') Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-01-19 04:52:00 UTC (rev 1135) +++ ydb/trunk/tools/test.bash 2009-01-23 09:34:16 UTC (rev 1136) @@ -99,6 +99,12 @@ refresh-local } +node-setup-gtest() { + check-remote + cd /tmp/ + toast --quiet arm googletest +} + node-setup-ydb-1() { check-remote if [[ ! 
-L ~/ydb ]] @@ -200,6 +206,7 @@ parremote node-setup-m4 parremote node-setup-bison parremote node-setup-clamp + parremote node-setup-gtest } setup-ydb() { @@ -287,6 +294,7 @@ run-helper() { local leader=$1 shift + : ${seqno:=100000} tagssh $leader "ydb/src/ydb -l -x --accept-joiner-seqno $seqno -n $(( $# - 1 )) -o 1 -O 1 ${extraargs:-}" & # -v --debug-threads sleep .1 # pexpect 'waiting for at least' # Run initial replicas. @@ -296,7 +304,7 @@ done sleep .1 # pexpect 'got all \d+ replicas' leader # Run joiner. - tagssh $1 "ydb/src/ydb -H $leader" & # -v --debug-threads -t 200000" & + tagssh $1 "ydb/src/ydb -H $leader ${extraargs:-}" & # -v --debug-threads -t 200000" & if false ; then if [[ ${wait2:-} ]] then sleep $wait2 This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
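The `showtput()` change earlier in this commit widens the log line to include the seqno and timestamp ranges, but the underlying throughput arithmetic is unchanged: transactions completed over elapsed milliseconds, scaled to per-second. A hedged Python sketch of that computation — the function shape and output format approximate the C++ version, not a verbatim port:

```python
def showtput(action, stop_time_ms, start_time_ms, stop_count, start_count):
    # Throughput = txns completed over elapsed wall time, scaled from
    # per-millisecond to per-second (tps), as in showtput() in main.lzz.
    time_diff = stop_time_ms - start_time_ms
    count_diff = stop_count - start_count
    rate = count_diff * 1000.0 / time_diff
    return ("%s %d txns [%d..%d] in %d ms [%d..%d] (%g tps)"
            % (action, count_diff, start_count, stop_count,
               time_diff, start_time_ms, stop_time_ms, rate))
```

For example, 500 txns over a 250 ms window comes out to 2000 tps.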
From: <yan...@us...> - 2009-01-23 22:44:14
|

Revision: 1137 http://assorted.svn.sourceforge.net/assorted/?rev=1137&view=rev Author: yangzhang Date: 2009-01-23 21:44:55 +0000 (Fri, 23 Jan 2009) Log Message: ----------- - added further breakdown of the xfer time - upgraded readmsg - mean -> median in analysis plot Modified Paths: -------------- ydb/trunk/src/Makefile ydb/trunk/src/main.lzz.clamp ydb/trunk/tools/analysis.py ydb/trunk/tools/test.bash Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2009-01-23 09:34:16 UTC (rev 1136) +++ ydb/trunk/src/Makefile 2009-01-23 21:44:55 UTC (rev 1137) @@ -27,6 +27,7 @@ endif LDFLAGS := -pthread -lstx -lst -lresolv -lprotobuf -lgtest \ -lboost_program_options-gcc43-mt -lboost_thread-gcc43-mt $(GPROF) +# The -Wno- warnings are for boost. CXXFLAGS := -g3 -pthread $(GPROF) -Wall -Werror -Wextra -Woverloaded-virtual \ -Wconversion -Wno-conversion -Wno-ignored-qualifiers \ -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings \ Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-01-23 09:34:16 UTC (rev 1136) +++ ydb/trunk/src/main.lzz.clamp 2009-01-23 21:44:55 UTC (rev 1137) @@ -282,17 +282,31 @@ } /** - * Read a message. + * Read a message. This is done in two steps: first by reading the length + * prefix, then by reading the actual body. + * + * \param[in] src The socket from which to read. + * + * \param[in] msg The protobuf to read into. + * + * \param[in] timed Whether to make a note of the time at which the first piece of the + * message (the length) was received. Such measurement only makes sense for + * large messages which take a long time to receive. + * + * \param[in] timeout on each of the two read operations (first one is on + * length, second one is on the rest). 
*/ template <typename T> -void -readmsg(st_netfd_t src, T & msg, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) +long long +readmsg(st_netfd_t src, T & msg, bool timed = false, st_utime_t timeout = + ST_UTIME_NO_TIMEOUT) { // Read the message length. uint32_t len; checkeqnneg(st_read_fully(src, static_cast<void*>(&len), sizeof len, timeout), static_cast<ssize_t>(sizeof len)); + long long start_receive = timed ? current_time_millis() : -1; len = ntohl(len); #define GETMSG(buf) \ @@ -308,6 +322,8 @@ scoped_array<char> buf(new char[len]); GETMSG(buf.get()); } + + return start_receive; } /** @@ -320,7 +336,7 @@ readmsg(st_netfd_t src, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) { T msg; - readmsg(src, msg, timeout); + readmsg(src, msg, false, timeout); return msg; } @@ -943,13 +959,15 @@ recovery_builders.push_back(my_spawn(lambda() { // Read the recovery message. Recovery recovery; + long long receive_start = -1; { st_intr intr(stop_hub); - readmsg(__ref(replicas)[__ctx(i)], recovery); + receive_start = readmsg(__ref(replicas)[__ctx(i)], recovery, true); } long long build_start = current_time_millis(); cout << "got recovery message in " - << build_start - __ref(before_recv) << " ms" << endl; + << build_start - __ref(before_recv) << " ms (xfer took " + << build_start - receive_start << " ms)" << endl; for (int i = 0; i < recovery.pair_size(); i++) { const Recovery_Pair &p = recovery.pair(i); __ref(map)[p.key()] = p.value(); Modified: ydb/trunk/tools/analysis.py =================================================================== --- ydb/trunk/tools/analysis.py 2009-01-23 09:34:16 UTC (rev 1136) +++ ydb/trunk/tools/analysis.py 2009-01-23 21:44:55 UTC (rev 1137) @@ -18,7 +18,7 @@ yield list(tups) a = array(list(gen())) indexes = a[:,0,0] - means = a.mean(1) + means = median(a,1) #a.mean(1) stds = a.std(1) tup = (indexes,) for i in range(1, len(a[0,0])): @@ -62,25 +62,27 @@ check(path) def getpairs(): with file(path) as f: - seqno = recv = buildup = catchup = total = None + seqno 
= dump = recv = buildup = catchup = total = None for line in f: m = re.match( r'=== seqno=(?P<seqno>\d+) ', line ) if m: seqno = int(m.group('seqno')) - m = re.search( r'got recovery message in (?P<time>\d+) ms', line ) - if m: recv = float(m.group('time')) + m = re.search( r'got recovery message in (?P<dump>\d+) ms \(xfer took (?P<recv>\d+) ms\)', line ) + if m: dump, recv = float(m.group('dump')), float(m.group('recv')) m = re.search( r'built up .* (?P<time>\d+) ms', line ) if m: buildup = float(m.group('time')) - m = re.search( r'replayer caught up; from backlog replayed \d+ txns in (?P<time>\d+) ms', line ) + m = re.search( r'replayer caught up; from backlog replayed \d+ txns .* in (?P<time>\d+) ms', line ) if m: catchup = float(m.group('time')) m = re.match( r'.*: recovering node caught up; took (?P<time>\d+) ?ms', line ) if m: total = float(m.group('time')) - tup = (seqno, recv, buildup, catchup, total) + tup = (seqno, dump, recv, buildup, catchup, total) if all(tup): yield tup - seqno = recv = buildup = catchup = total = None - seqnos, recvmeans, recvsds, buildmeans, buildsds, catchmeans, catchsds, totalmeans, totalsds, stacked, a = agg(getpairs()) + seqno = dump = recv = buildup = catchup = total = None + seqnos, dumpmeans, dumpsds, recvmeans, recvsds, buildmeans, buildsds, \ + catchmeans, catchsds, totalmeans, totalsds, stacked, a = \ + agg(getpairs()) - print 'max seqno, recv mean, recv sd, build mean, build sd, catch mean, catch sd, total mean, total sd' + print 'max seqno, dump mean, dump sd, recv mean, recv sd, build mean, build sd, catch mean, catch sd, total mean, total sd' print stacked print @@ -93,13 +95,16 @@ (226,240,214), (246,255,224)][i+1])) ehue = lambda i: hue(-1) # tuple(map(lambda x: min(1, x + .3), hue(i))) - a = bar(seqnos, recvmeans, yerr = recvsds, width = width, color = hue(0), - ecolor = ehue(0), label = 'State receive') - b = bar(seqnos, buildmeans, yerr = buildsds, width = width, color = hue(1), - ecolor = ehue(1), label = 
'Build-up time', bottom = recvmeans) - c = bar(seqnos, catchmeans, yerr = catchsds, width = width, color = hue(2), - ecolor = ehue(2), label = 'Catch-up', - bottom = recvmeans + buildmeans) + bar(seqnos, dumpmeans, yerr = dumpsds, width = width, color = hue(0), + ecolor = ehue(0), label = 'State serialization') + bar(seqnos, recvmeans, yerr = recvsds, width = width, color = hue(0), + ecolor = ehue(0), label = 'State receive', bottom = dumpmeans) + bar(seqnos, buildmeans, yerr = buildsds, width = width, color = hue(1), + ecolor = ehue(1), label = 'Build-up', + bottom = dumpmeans + recvmeans) + bar(seqnos, catchmeans, yerr = catchsds, width = width, color = hue(2), + ecolor = ehue(2), label = 'Catch-up', + bottom = dumpmeans + recvmeans + buildmeans) title('Recovery time over number of transactions') xlabel('Transaction count (corresponds roughly to data size)') Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-01-23 09:34:16 UTC (rev 1136) +++ ydb/trunk/tools/test.bash 2009-01-23 21:44:55 UTC (rev 1137) @@ -369,6 +369,10 @@ hostargs kill-helper } +times() { + parssh date +%s.%N +} + # Use mssh to log in with password as root to each machine. mssh-root() { : "${hosts:="$(hosts)"}" This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
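The upgraded `readmsg()` in this commit reads a frame in two steps: a 4-byte length prefix (converted from network order with `ntohl`), then the serialized body, timing the first byte's arrival along the way. A small Python sketch of just the wire format, over file-like objects — the names mirror the C++ helpers but the implementation is illustrative:

```python
import io
import struct

def sendmsg(out, payload: bytes) -> None:
    # Frame = 4-byte network-order (big-endian) length, then the body.
    # struct's '!I' is the equivalent of htonl() on a uint32_t length.
    out.write(struct.pack('!I', len(payload)) + payload)

def readmsg(src) -> bytes:
    # Real socket code must loop until all bytes arrive (st_read_fully
    # does this in main.lzz); a BytesIO never short-reads, so single
    # read() calls suffice for this sketch.
    (length,) = struct.unpack('!I', src.read(4))
    return src.read(length)

def roundtrip(payload: bytes) -> bytes:
    # Write a frame into an in-memory buffer and read it back.
    buf = io.BytesIO()
    sendmsg(buf, payload)
    buf.seek(0)
    return readmsg(buf)
```

Splitting the read this way is what lets the instrumented version distinguish queueing delay (waiting for the length) from actual transfer time of a large recovery message.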
From: <yan...@us...> - 2009-01-27 23:25:29
|

Revision: 1147 http://assorted.svn.sourceforge.net/assorted/?rev=1147&view=rev Author: yangzhang Date: 2009-01-27 23:25:16 +0000 (Tue, 27 Jan 2009) Log Message: ----------- - added --exit-on-seqno - refactored analysis.py, settled in the sloppy regex extraction approach - brought back scaling analysis - improved the colors and shapes - pretty print raw data tables - cleaned up range/hosts configuration in test.bash (incl. propagation to subscripts) - added len-plotting to analysis - fixed erratic behavior by lowering chkpt to 1K from 10K - added recovery-generation timing - fixed always-multirecover bug - fixed set_yourhost omission for the joiner - added somewhat-off analysis for multirecover Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/main.lzz.clamp ydb/trunk/tools/analysis.py ydb/trunk/tools/test.bash Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-01-26 05:35:42 UTC (rev 1146) +++ ydb/trunk/README 2009-01-27 23:25:16 UTC (rev 1147) @@ -231,32 +231,45 @@ Period: 1/20-1/27 - DONE implement multihost -- TODO add simple, proper timestamped logging -- TODO see how much multihost recovery affects perf -- TODO look again at how much yielding affects perf -- TODO monitor memory usage -- TODO switch to btree -- TODO break down the red bar some more -- TODO see how much time difference there is -- TODO red bar: why are/aren't we saturating bandwidth? -- TODO understand the rest of the perf (eg stl map) -- TODO try scaling up + - not much, it only decreases the xfer time (which orig was thought to be the bottleneck) +- DONE see how much multihost recovery affects perf + - quite a bit! +- DONE look again at how much yielding affects perf + - not much +- DONE break down the red bar some more + - most of the time is spent in the dumping +- DONE understand the rest of the perf (eg stl map) + - DONE why the big jump in 400,000 ops? why all the unexpected ups & downs? 
+ - due to the 10,000-txn quantum; lowering this to 1,000 made everything much + saner + - DONE how does the recovery state xfer time compare to what's expected? + - msgs smaller than expected, eg 300,000 txns * 2*4 bytes per txn = 2.4MB, + but msgs are ~2MB (compression, some random overwrites) + - xfer takes much longer than the theoretical time; 2MB on 1GbE = 16 ms, but + takes more around 50 ms +- DONE start building infrastructure for disk IO + +Period: 1/27- + +- TODO fix up analysis of multihost recovery - TODO implement checkpointing disk-based scheme - TODO implement log-based recovery; show that it sucks - TODO implement group (batch) commit for log-based recovery -- TODO talk - - motivation: log-based sucks, look into alternatives +- TODO try scaling up - TODO serialize outputs from the various clients to a single merger to (1) have ordering over the (timestamped) messages, and (2) avoid interleaved lines - -Period: 1/27- - +- TODO add simple, proper timestamped logging +- TODO see how much clock difference there is among the hosts +- TODO monitor memory usage +- TODO try improving map perf; switch to btree; try bulk loading - TODO detailed view of tps during recovery over time (should see various phases) - TODO later: runtime overhead of logging/tps under normal operation (scaled with # nodes?) - TODO later: timestamped logging? +- TODO talk + - motivation: log-based sucks, look into alternatives Longer term Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-01-26 05:35:42 UTC (rev 1146) +++ ydb/trunk/src/main.lzz.clamp 2009-01-27 23:25:16 UTC (rev 1147) @@ -37,7 +37,8 @@ // Configuration. 
st_utime_t timeout; -int chkpt, accept_joiner_seqno, issuing_interval, min_ops, max_ops; +int chkpt, accept_joiner_seqno, issuing_interval, min_ops, max_ops, + stop_on_seqno; size_t accept_joiner_size; bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates, count_updates, stop_on_recovery, general_txns, profile_threads, @@ -283,34 +284,47 @@ /** * Read a message. This is done in two steps: first by reading the length - * prefix, then by reading the actual body. + * prefix, then by reading the actual body. This function also provides a way + * to measure how much time is spent actually reading the message from the + * network. Such measurement only makes sense for large messages which take a + * long time to receive. * * \param[in] src The socket from which to read. * * \param[in] msg The protobuf to read into. * - * \param[in] timed Whether to make a note of the time at which the first piece of the - * message (the length) was received. Such measurement only makes sense for - * large messages which take a long time to receive. + * \param[out] start_time If not null, record the time at which we start to + * receive the message (after the length is received). * + * \param[out] stop_time If not null, record the time at which we finish + * receiving the message (before we deserialize the protobuf). + * + * \param[out] len If not null, record the size of the serialized message + * in bytes. + * * \param[in] timeout on each of the two read operations (first one is on * length, second one is on the rest). + * + * \return The length of the serialized message. */ template <typename T> -long long -readmsg(st_netfd_t src, T & msg, bool timed = false, st_utime_t timeout = - ST_UTIME_NO_TIMEOUT) +size_t +readmsg(st_netfd_t src, T & msg, long long *start_time = nullptr, long long + *stop_time = nullptr, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) { // Read the message length. 
uint32_t len; checkeqnneg(st_read_fully(src, static_cast<void*>(&len), sizeof len, timeout), static_cast<ssize_t>(sizeof len)); - long long start_receive = timed ? current_time_millis() : -1; + if (start_time != nullptr) + *start_time = current_time_millis(); len = ntohl(len); #define GETMSG(buf) \ checkeqnneg(st_read_fully(src, buf, len, timeout), (int) len); \ + if (stop_time != nullptr) \ + *stop_time = current_time_millis(); \ check(msg.ParseFromArray(buf, len)); // Parse the message body. @@ -323,7 +337,7 @@ GETMSG(buf.get()); } - return start_receive; + return len; } /** @@ -336,7 +350,7 @@ readmsg(st_netfd_t src, st_utime_t timeout = ST_UTIME_NO_TIMEOUT) { T msg; - readmsg(src, msg, false, timeout); + readmsg(src, msg, nullptr, nullptr, timeout); return msg; } @@ -407,6 +421,11 @@ if (txn.seqno() == accept_joiner_seqno) { accept_joiner.set(); } + + if (txn.seqno() == stop_on_seqno) { + cout << "stopping on issue of seqno " << txn.seqno() << endl; + stop_hub.set(); + } } Txn txn; @@ -600,14 +619,18 @@ mii::const_iterator end = multirecover && mypos < nnodes - 1 ? map.lower_bound(interp(RAND_MAX, mypos + 1, nnodes)) : map.end(); cout << "generating recovery over " << begin->first << ".." - << (end == map.end() ? "end" : lexical_cast<string>(end->first)) - << " (node " << mypos << " of " << nnodes << ")" - << endl; + << (end == map.end() ? "end" : lexical_cast<string>(end->first)); + if (multirecover) + cout << " (node " << mypos << " of " << nnodes << ")"; + cout << endl; + long long start_snap = current_time_millis(); foreach (const pii &p, make_iterator_range(begin, end)) { Recovery_Pair *pair = recovery->add_pair(); pair->set_key(p.first); pair->set_value(p.second); } + cout << "generating recovery took " + << current_time_millis() - start_snap << " ms" << endl; recovery->set_seqno(seqno); send_states.push(recovery); } @@ -799,7 +822,7 @@ // Construct the initialization message. 
Init init; init.set_txnseqno(0); - init.set_multirecover(true); + init.set_multirecover(multirecover); foreach (replica_info r, replicas) { SockAddr *psa = init.add_node(); psa->set_host(r.host()); @@ -841,14 +864,15 @@ accept_joiner.waitset(); } Join join = readmsg<Join>(joiner); + replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); cout << "setting seqno to " << seqno << endl; init.set_txnseqno(seqno); + init.set_yourhost(replicas.back().host()); sendmsg(joiner, init); recover_signals.push(current_time_millis()); // Start streaming txns to joiner. cout << "start streaming txns to joiner" << endl; - replicas.push_back(replica_info(joiner, static_cast<uint16_t>(join.port()))); newreps.push(replicas.back()); handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), rid++, ref(recover_signals), false), @@ -959,15 +983,18 @@ recovery_builders.push_back(my_spawn(lambda() { // Read the recovery message. Recovery recovery; - long long receive_start = -1; + long long receive_start = 0, receive_end = 0; + size_t len = 0; { st_intr intr(stop_hub); - receive_start = readmsg(__ref(replicas)[__ctx(i)], recovery, true); + len = readmsg(__ref(replicas)[__ctx(i)], recovery, &receive_start, + &receive_end); } long long build_start = current_time_millis(); - cout << "got recovery message in " - << build_start - __ref(before_recv) << " ms (xfer took " - << build_start - receive_start << " ms)" << endl; + cout << "got recovery message of " << len << " bytes in " + << build_start - __ref(before_recv) << " ms: xfer took " + << receive_end - receive_start << " ms, deserialization took " + << build_start - receive_end << " ms" << endl; for (int i = 0; i < recovery.pair_size(); i++) { const Recovery_Pair &p = recovery.pair(i); __ref(map)[p.key()] = p.value(); @@ -1116,6 +1143,8 @@ "run the leader (run replica by default)") ("exit-on-recovery,x", po::bool_switch(&stop_on_recovery), "exit after the joiner fully recovers (for leader only)") + 
("exit-on-seqno,X", po::value<int>(&stop_on_seqno)->default_value(-1), + "exit after txn seqno is issued (for leader only)") ("accept-joiner-size,s", po::value<size_t>(&accept_joiner_size)->default_value(0), "accept recovering joiner (start recovery) after DB grows to this size " @@ -1139,7 +1168,7 @@ ("leader-port,P", po::value<uint16_t>(&leader_port)->default_value(7654), "port the leader listens on") - ("chkpt,c", po::value<int>(&chkpt)->default_value(10000), + ("chkpt,c", po::value<int>(&chkpt)->default_value(1000), "number of txns before yielding/verbose printing") ("timelim,T", po::value<long long>(&timelim)->default_value(0), "general network IO time limit in milliseconds, or 0 for none") Modified: ydb/trunk/tools/analysis.py =================================================================== --- ydb/trunk/tools/analysis.py 2009-01-26 05:35:42 UTC (rev 1146) +++ ydb/trunk/tools/analysis.py 2009-01-27 23:25:16 UTC (rev 1147) @@ -1,10 +1,12 @@ #!/usr/bin/env python from __future__ import with_statement -import re, sys, itertools +import re, sys, itertools, colorsys from os.path import basename, realpath from pylab import * +class struct(object): pass + def getname(path): return basename(realpath(path)) def check(path): @@ -12,106 +14,137 @@ if 'got timeout' in f.read(): print 'warning: timeout occurred' -def agg(src): +def show_table(pairs): + def fmt(x): + s = str(x) + if s.endswith('.0'): return s[:-2] + p = s.index('.') + return s if p < 0 else s[:p+4] + cols = [ [heading] + map(fmt, col) for (heading, col) in pairs ] + widths = [ max(map(len, col)) for col in cols ] + return '\n'.join( + '|'.join( ('%%%ds' % width) % val for width, val in zip(widths, row) ) + for row in zip(*cols) ) + +def show_table1(dicts): + keys = dicts[0].keys() + return show_table([(k, [d[k] for d in dicts]) for k in keys]) + +def logextract(path, indexkey, pats): + check(path) + # Capture values from log using regex pats. 
+ def getcaps(): + with file(path) as f: + caps = {} # captures: name -> int/float + sats = [ False for pat in pats ] + for line in f: +# if line == '\n': print '===', caps.keys(), ''.join('1' if s else '0' for s in sats) + for i, pat in enumerate(pats): + m = re.search(pat, line) + if m: + for k in m.groupdict(): + if k in caps: + caps[k + '0'] = caps[k] + caps.update((k, float(v)) for k,v in m.groupdict().iteritems()) + sats[i] = True + break + if all(sats): + sats = [ False for pat in pats ] +# print '!!!' + yield caps.copy() # [ caps[k] for k in keys ] + caps.clear() + # Aggregate the captured values. + caps = list(getcaps()) +# print show_table1(caps) + keys = [indexkey] + filter(lambda x: x != indexkey, caps[0].keys()) def gen(): - for index, tups in itertools.groupby(src, lambda x: x[0]): - yield list(tups) - a = array(list(gen())) + for index, ds in itertools.groupby(caps, lambda d: d[indexkey]): + ds = list(ds) + print [d['len'] for d in ds] + yield [ [d[k] for k in keys] for d in ds ] + a = array(list(gen())) # raw results indexes = a[:,0,0] - means = median(a,1) #a.mean(1) - stds = a.std(1) - tup = (indexes,) - for i in range(1, len(a[0,0])): - tup += (means[:,i], stds[:,i]) - stacked = hstack(map(lambda x: x.reshape((len(indexes),1)), tup)) - return tup + (stacked, a) + means = median(a,1) # or a.mean(1) + sds = a.std(1) + # Build result dict. 
+ stacks = [ (indexkey, indexes) ] # no need to agg the index + for i,k in list(enumerate(keys))[1:]: # everything but index + stacks.append((k + ' mean', means[:,i])) + stacks.append((k + ' sd', sds[:,i])) + res = dict(stacks) + res['stacked'] = hstack(map(lambda (_,x): x.reshape((len(indexes), 1)), stacks)) + res['raw'] = a + print show_table(stacks) + print + return res def scaling(path): print '=== scaling ===' print 'file:', getname(path) - check(path) - def getpairs(): - with file(path) as f: - for line in f: - m = re.match( r'=== n=(?P<n>\d+) ', line ) - if m: - n = int(m.group('n')) - m = re.match( r'.*: issued .*[^.\d](?P<tps>[.\d]+) ?tps', line ) - if m: - tps = float(m.group('tps')) - yield (n, tps) - tups = agg(getpairs()) - ns, tpsmeans, tpssds, stacked, a = agg(getpairs()) - print 'n, tps mean, tps sd' - print stacked - print + res = logextract(path, 'n', [ + r'=== n=(?P<n>\d+) ', + r'issued .*\((?P<tps>[.\d]+) tps\)' ]) - errorbar(ns, tpsmeans, tpssds) + errorbar(res['n'], res['tps mean'], res['tps sd']) title('Scaling of baseline throughput with number of nodes') xlabel('Node count') ylabel('Mean TPS (stdev error bars)') - xlim(ns.min() - .5, ns.max() + .5) + xlim(res['n'].min() - .5, res['n'].max() + .5) ylim(ymin = 0) savefig('scaling.png') def run(blockpath, yieldpath): - for path, label in [#(blockpath, 'blocking scheme'), - (yieldpath, 'yielding scheme')]: - print '===', label, '===' + for path, titlestr, name in [#(blockpath, 'blocking scheme', 'block'), + (yieldpath, 'yielding scheme', 'yield')]: + print '===', titlestr, '===' print 'file:', getname(path) - check(path) - def getpairs(): - with file(path) as f: - seqno = dump = recv = buildup = catchup = total = None - for line in f: - m = re.match( r'=== seqno=(?P<seqno>\d+) ', line ) - if m: seqno = int(m.group('seqno')) - m = re.search( r'got recovery message in (?P<dump>\d+) ms \(xfer took (?P<recv>\d+) ms\)', line ) - if m: dump, recv = float(m.group('dump')), float(m.group('recv')) - m = 
re.search( r'built up .* (?P<time>\d+) ms', line ) - if m: buildup = float(m.group('time')) - m = re.search( r'replayer caught up; from backlog replayed \d+ txns .* in (?P<time>\d+) ms', line ) - if m: catchup = float(m.group('time')) - m = re.match( r'.*: recovering node caught up; took (?P<time>\d+) ?ms', line ) - if m: total = float(m.group('time')) - tup = (seqno, dump, recv, buildup, catchup, total) - if all(tup): - yield tup - seqno = dump = recv = buildup = catchup = total = None - seqnos, dumpmeans, dumpsds, recvmeans, recvsds, buildmeans, buildsds, \ - catchmeans, catchsds, totalmeans, totalsds, stacked, a = \ - agg(getpairs()) + res = logextract(path, 'seqno', + [ r'=== seqno=(?P<seqno>\d+) ', + r'got recovery message of (?P<len>\d+) bytes in (?P<dump>\d+) ms: xfer took (?P<recv>\d+) ms, deserialization took (?P<deser>\d+)', + r'built up .* (?P<buildup>\d+) ms', + r'generating recovery took (?P<gen>\d+) ms', + r'replayer caught up; from backlog replayed \d+ txns .* in (?P<catchup>\d+) ms', + r'.*: recovering node caught up; took (?P<total>\d+) ?ms' ] ) - print 'max seqno, dump mean, dump sd, recv mean, recv sd, build mean, build sd, catch mean, catch sd, total mean, total sd' - print stacked - print - + # Colors and positioning width = 5e4 - # From "zen and tea" on kuler.adobe.com - hue = lambda i: tuple(map(lambda x: float(x)/255, - [( 16, 34, 43), - (149,171, 99), - (189,214,132), - (226,240,214), - (246,255,224)][i+1])) - ehue = lambda i: hue(-1) # tuple(map(lambda x: min(1, x + .3), hue(i))) - bar(seqnos, dumpmeans, yerr = dumpsds, width = width, color = hue(0), - ecolor = ehue(0), label = 'State serialization') - bar(seqnos, recvmeans, yerr = recvsds, width = width, color = hue(0), - ecolor = ehue(0), label = 'State receive', bottom = dumpmeans) - bar(seqnos, buildmeans, yerr = buildsds, width = width, color = hue(1), - ecolor = ehue(1), label = 'Build-up', - bottom = dumpmeans + recvmeans) - bar(seqnos, catchmeans, yerr = catchsds, width = width, 
color = hue(2), - ecolor = ehue(2), label = 'Catch-up', - bottom = dumpmeans + recvmeans + buildmeans) + step = 1.0 / 5 + hues = ( colorsys.hls_to_rgb(step * i, .7, .5) for i in itertools.count() ) + ehues = ( colorsys.hls_to_rgb(step * i, .3, .5) for i in itertools.count() ) + widths = ( 2 * width - 2 * width / 5 * i for i in itertools.count() ) + offsets = ( width - 2 * width / 5 * i for i in itertools.count() ) + self = struct() + self.bottom = 0 - title('Recovery time over number of transactions') - xlabel('Transaction count (corresponds roughly to data size)') - ylabel('Mean time in ms (SD error bars)') - legend(loc = 'upper left') - savefig('run.png') + clf() + def mybar(yskey, eskey, label): + bar(res['seqno'] - offsets.next(), res[yskey], yerr = res[eskey], width = + widths.next(), color = hues.next(), edgecolor = (1,1,1), ecolor = + ehues.next(), label = label, bottom = self.bottom) + self.bottom += res[yskey] + mybar('dump mean', 'dump sd', 'State dump') + mybar('recv mean', 'recv sd', 'State receive') + mybar('deser mean', 'deser sd', 'State deserialization') + mybar('buildup mean', 'buildup sd', 'Build-up') + mybar('catchup mean', 'catchup sd', 'Catch-up') + + title('Recovery time of ' + titlestr + ' over data size') + xlabel('Transaction count (corresponds roughly to data size)') + ylabel('Mean time in ms (SD error bars)') + legend(loc = 'upper left') + + ax2 = twinx() + col = colorsys.hls_to_rgb(.6, .4, .4) + ax2.errorbar(res['seqno'], res['len mean'] / 1024, res['len sd'] / 1024, marker = 'o', + color = col) + ax2.set_ylabel('Size of serialized state (KB)', color = col) + ax2.set_ylim(ymin = 0) + for tl in ax2.get_yticklabels(): tl.set_color(col) + + xlim(xmin = min(res['seqno']) - width, xmax = max(res['seqno']) + width) + savefig(name + '.png') + def main(argv): if len(argv) <= 1: print >> sys.stderr, 'Must specify a command' Modified: ydb/trunk/tools/test.bash =================================================================== --- 
ydb/trunk/tools/test.bash 2009-01-26 05:35:42 UTC (rev 1146) +++ ydb/trunk/tools/test.bash 2009-01-27 23:25:16 UTC (rev 1147) @@ -9,17 +9,22 @@ tagssh() { ssh "$@" 2>&1 | python -u -c ' -import time, sys +import time, sys, socket +# def fmt(*xs): return " ".join(map(str, xs)) + "\n" +# s = socket.socket() +# s.connect(("localhost", 9876)) +# f = s.makefile() +f = sys.stdout while True: line = sys.stdin.readline() if line == "": break - print sys.argv[1], time.time(), ":\t", line, + print >> f, sys.argv[1], time.time(), ":\t", line, ' $1 } check-remote() { - if [[ ${force:-asdf} != asdf && `hostname` == yang-xps410 ]] - then echo 'running a remote command on your pc!' 1>&2 && exit 1 + if [[ ! ${remote:-} ]] + then 'running a remote command on your pc!' fi } @@ -129,36 +134,11 @@ local host="$1" shift scp -q "$(dirname "$0")/$script" "$host:" - tagssh "$host" "./$script" "$@" + tagssh "$host" "remote=1 ./$script" "$@" } -hosts() { - if [[ ${host:-} ]] ; then - echo $host - elif [[ ${range:-} ]] ; then - seq $range | sed 's/^/farm/; s/$/.csail/' - else - cat << EOF -farm1.csail -farm2.csail -farm3.csail -farm4.csail -farm5.csail -farm6.csail -farm7.csail -farm8.csail -farm9.csail -farm10.csail -farm11.csail -farm12.csail -farm13.csail -farm14.csail -EOF - fi -} - parhosts() { - hosts | xargs ${xargs--P9} -I^ "$@" + echo -n $hosts | xargs ${xargs--P9} -d' ' -I^ "$@" } parssh() { @@ -170,6 +150,7 @@ } parremote() { + export hosts range parhosts "./$script" remote ^ "$@" } @@ -235,7 +216,7 @@ " } -hosttops() { +tops() { xargs= parssh " echo hostname @@ -245,22 +226,17 @@ } hostargs() { - if [[ $range ]] - then "$@" $(seq $range | sed 's/^/farm/; s/$/.csail/') - else "$@" ${hosts[@]} - fi + "$@" $hosts } scaling-helper() { local leader=$1 shift - tagssh $leader "ydb/src/ydb -l -n $#" & + tagssh $leader "ydb/src/ydb -l -n $# -X 100000" & sleep .1 for rep in "$@" do tagssh $rep "ydb/src/ydb -n $# -H $leader" & done - sleep ${wait1:-10} - tagssh $leader 'pkill -sigint ydb' 
wait } @@ -274,21 +250,23 @@ # TODO: fix this to work also with `hosts`; move into repeat-helper that's run # via hostargs, and change the range= to hosts= full-scaling() { - local base=$1 out=scaling-log-$(date +%Y-%m-%d-%H:%M:%S-%N) - shift + local out=scaling-log-$(date +%Y-%m-%d-%H:%M:%S-%N) + local orighosts="$hosts" maxn=$(( $(echo $hosts | wc -w) - 1 )) ln -sf $out scaling-log - for n in {1..5} ; do # configurations - export range="$base $((base + n))" + for n in `seq $maxn -1 1` ; do # configurations stop for i in {1..5} ; do # trials echo === n=$n i=$i === + echo === n=$n i=$i === > `tty` scaling sleep 1 stop sleep .1 echo done + hosts="${hosts% *}" done >& $out + hosts="$orighosts" } run-helper() { @@ -324,6 +302,7 @@ stop for i in {1..5} ; do # trials echo === seqno=$seqno i=$i === + echo === seqno=$seqno i=$i === > `tty` run sleep 1 stop @@ -342,15 +321,9 @@ full-yield() { local out=yield-log-$(date +%Y-%m-%d-%H:%M:%S) ln -sf $out yield-log - extraargs='--yield-catch-up' full-run >& $out + extraargs="--yield-catch-up ${extraargs:-}" full-run >& $out } -full() { - full-block - full-yield - full-scaling -} - stop-helper() { tagssh $1 'pkill -sigint ydb' } @@ -375,8 +348,32 @@ # Use mssh to log in with password as root to each machine. mssh-root() { - : "${hosts:="$(hosts)"}" mssh -l root "$@" } -"$@" +# Set up hosts. +confighosts() { + if [[ ! ${remote:-} ]] ; then + if [[ ! "${hosts:-}" && ! "${range:-}" ]] + then range='1 14'; echo "warning: running with farms 1..14" 1>&2 + fi + if [[ "${range:-}" ]] + then hosts="$( seq $range | sed 's/^/farm/' )" + fi + hosts="$( echo -n $hosts )" + fi +} + +# Set up logger. +configlogger() { + if [[ ! ${remote:-} ]] ; then + ( + flock -n /tmp/ydbtest.socket + ) > /tmp/y + fi +} + +confighosts +#configlogger + +eval "$@" This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
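The `tagssh` helper changed in the diff above pipes ssh output through an inline `python -u` filter that prefixes every line with a host tag and an epoch timestamp, so interleaved logs from many farm hosts can be attributed and ordered. A standalone Python 3 sketch of that filter (the `tag_lines` name is invented for illustration; the real version is the inline script inside test.bash):

```python
import sys
import time

def tag_lines(tag, src, dst):
    """Prefix each line read from src with a tag and an epoch timestamp,
    mirroring the inline filter that tagssh applies to ssh output."""
    for line in src:
        dst.write("%s %.3f :\t%s" % (tag, time.time(), line))
        dst.flush()  # emulate python -u: no buffering between lines

if __name__ == "__main__":
    # e.g.: ssh farm1 some-command 2>&1 | python tag.py farm1
    tag_lines(sys.argv[1] if len(sys.argv) > 1 else "host", sys.stdin, sys.stdout)
```

The flush after every line matters: without it, output from concurrent hosts would arrive in large buffered chunks and the timestamps would no longer reflect when each line was actually produced.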
From: <yan...@us...> - 2009-02-03 00:00:31
Revision: 1156 http://assorted.svn.sourceforge.net/assorted/?rev=1156&view=rev Author: yangzhang Date: 2009-02-03 00:00:26 +0000 (Tue, 03 Feb 2009) Log Message: ----------- - added simple "WAL" and leader-only mode - changed experiments from yield vs block to single vs multi host Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/Makefile ydb/trunk/src/main.lzz.clamp ydb/trunk/tools/analysis.py ydb/trunk/tools/test.bash Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-02 22:42:05 UTC (rev 1155) +++ ydb/trunk/README 2009-02-03 00:00:26 UTC (rev 1156) @@ -249,8 +249,13 @@ takes more around 50 ms - DONE start building infrastructure for disk IO -Period: 1/27- +Period: 1/27-2/3 +- DONE simple wal + +Period: 2/3- + +- DONE better wal - TODO fix up analysis of multihost recovery - TODO implement checkpointing disk-based scheme - TODO implement log-based recovery; show that it sucks Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2009-02-02 22:42:05 UTC (rev 1155) +++ ydb/trunk/src/Makefile 2009-02-03 00:00:26 UTC (rev 1156) @@ -26,7 +26,8 @@ GCOV := -fprofile-arcs -ftest-coverage endif LDFLAGS := -pthread -lstx -lst -lresolv -lprotobuf -lgtest \ - -lboost_program_options-gcc43-mt -lboost_thread-gcc43-mt $(GPROF) + -lboost_program_options-gcc43-mt -lboost_thread-gcc43-mt \ + -lboost_serialization-gcc43-mt $(GPROF) # The -Wno- warnings are for boost. 
CXXFLAGS := -g3 -pthread $(GPROF) -Wall -Werror -Wextra -Woverloaded-virtual \ -Wconversion -Wno-conversion -Wno-ignored-qualifiers \ Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-02-02 22:42:05 UTC (rev 1155) +++ ydb/trunk/src/main.lzz.clamp 2009-02-03 00:00:26 UTC (rev 1156) @@ -1,4 +1,6 @@ #hdr +#include <boost/archive/binary_iarchive.hpp> +#include <boost/archive/binary_oarchive.hpp> #include <boost/bind.hpp> #include <boost/foreach.hpp> #include <boost/program_options.hpp> @@ -14,7 +16,7 @@ #include <cstdio> #include <cstring> // strsignal #include <iostream> -#include <fstream> +#include <fstream> // ofstream #include <gtest/gtest.h> #include <malloc.h> #include <map> @@ -27,6 +29,7 @@ #include "ydb.pb.h" #define foreach BOOST_FOREACH using namespace boost; +using namespace boost::archive; using namespace commons; using namespace std; using namespace testing; @@ -42,7 +45,7 @@ size_t accept_joiner_size; bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates, count_updates, stop_on_recovery, general_txns, profile_threads, - debug_threads, multirecover, disk, debug_memory; + debug_threads, multirecover, disk, debug_memory, use_wal; long long timelim, read_thresh, write_thresh; // Control. @@ -355,6 +358,35 @@ } /** + * ARIES write-ahead log. No undo logging necessary (no steal). + */ +class wal +{ +public: + wal() : of("wal"), out(of) {} + void del(int key) { + int op = op_del; // TODO: is this really necessary? + out & op & key; + } + void write(int key, int val) { + int op = op_write; + out & op & key & val; + } + void commit() { + int op = op_commit; + out & op; + } +private: + enum { op_del, op_write, op_commit }; + ofstream of; + binary_oarchive out; +}; + +// Globals +map<int, int> g_map; +wal *g_wal; + +/** * Keep issuing transactions to the replicas. */ void @@ -388,7 +420,7 @@ // Generate a random transaction. 
Txn txn; - txn.set_seqno(seqno++); + txn.set_seqno(seqno); int count = randint(min_ops, max_ops + 1); for (int o = 0; o < count; o++) { Op *op = txn.add_op(); @@ -400,8 +432,13 @@ if (do_pause) do_pause.waitreset(); - // Broadcast. - bcastmsg(fds, txn); + // Process, or broadcast and increment seqno. + if (fds.empty()) { + int dummy_seqno = seqno - 1; + process_txn(nullptr, g_map, txn, dummy_seqno, true); + } else { + bcastmsg(fds, txn); + } // Checkpoint. if (txn.seqno() % chkpt == 0) { @@ -426,6 +463,8 @@ cout << "stopping on issue of seqno " << txn.seqno() << endl; stop_hub.set(); } + + ++seqno; } Txn txn; @@ -441,6 +480,7 @@ process_txn(st_netfd_t leader, map<int, int> &map, const Txn &txn, int &seqno, bool caught_up) { + wal &wal = *g_wal; checkeq(txn.seqno(), seqno + 1); Response res; res.set_seqno(txn.seqno()); @@ -449,27 +489,33 @@ for (int o = 0; o < txn.op_size(); o++) { const Op &op = txn.op(o); const int key = op.key(); + ::map<int, int>::iterator it = map.find(key); if (show_updates || count_updates) { - if (map.find(key) != map.end()) { + if (it != map.end()) { if (show_updates) cout << "existing key: " << key << endl; if (count_updates) updates++; } } switch (op.type()) { case Op::read: - res.add_result(map[key]); + if (it == map.end()) res.add_result(0); + else res.add_result(it->second); break; case Op::write: - map[key] = op.value(); + if (use_wal) wal.write(key, op.value()); + if (it == map.end()) map[key] = op.value(); + else it->second = op.value(); break; case Op::del: - map.erase(key); + if (it != map.end()) { + if (use_wal) wal.del(key); + map.erase(it); + } break; } } - if (caught_up) { - sendmsg(leader, res); - } + if (use_wal) wal.commit(); + if (caught_up && leader != nullptr) sendmsg(leader, res); } void @@ -531,6 +577,8 @@ * calculating the sub-range of the map for which this node is responsible. * * \param[in] nnodes The total number nodes in the Init message list. + * + * \param[in] wal The WAL. 
*/ void process_txns(st_netfd_t leader, map<int, int> &map, int &seqno, @@ -801,6 +849,9 @@ cout << "starting as leader" << endl; st_multichannel<long long> recover_signals; + scoped_ptr<wal> pwal(new wal); + g_wal = pwal.get(); + // Wait until all replicas have joined. st_netfd_t listener = st_tcp_listen(leader_port); st_closing close_listener(listener); @@ -1139,6 +1190,8 @@ "count operations that touch (update/read/delete) an existing key") ("general-txns,g", po::bool_switch(&general_txns), "issue read and delete transactions as well as the default of (only) insertion/update transactions (for leader only)") + ("wal", po::bool_switch(&use_wal), + "enable ARIES write-ahead logging") ("leader,l", po::bool_switch(&is_leader), "run the leader (run replica by default)") ("exit-on-recovery,x", po::bool_switch(&stop_on_recovery), Modified: ydb/trunk/tools/analysis.py =================================================================== --- ydb/trunk/tools/analysis.py 2009-02-02 22:42:05 UTC (rev 1155) +++ ydb/trunk/tools/analysis.py 2009-02-03 00:00:26 UTC (rev 1156) @@ -93,9 +93,9 @@ ylim(ymin = 0) savefig('scaling.png') -def run(blockpath, yieldpath): - for path, titlestr, name in [#(blockpath, 'blocking scheme', 'block'), - (yieldpath, 'yielding scheme', 'yield')]: +def run(singlepath, multipath): + for path, titlestr, name in [(singlepath, 'single recoverer', 'single'), + (multipath, 'multi recoverer', 'multi')]: print '===', titlestr, '===' print 'file:', getname(path) res = logextract(path, 'seqno', @@ -151,7 +151,7 @@ elif argv[1] == 'scaling': scaling(argv[2] if len(argv) > 2 else 'scaling-log') elif argv[1] == 'run': - run(*argv[2:] if len(argv) > 2 else ['block-log', 'yield-log']) + run(*argv[2:] if len(argv) > 2 else ['single-log', 'multi-log']) else: print >> sys.stderr, 'Unknown command:', argv[1] Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-02-02 22:42:05 UTC (rev 
1155) +++ ydb/trunk/tools/test.bash 2009-02-03 00:00:26 UTC (rev 1156) @@ -282,7 +282,7 @@ done sleep .1 # pexpect 'got all \d+ replicas' leader # Run joiner. - tagssh $1 "ydb/src/ydb -H $leader ${extraargs:-}" & # -v --debug-threads -t 200000" & + tagssh $1 "ydb/src/ydb -H $leader --yield-catch-up ${extraargs:-}" & # -v --debug-threads -t 200000" & if false ; then if [[ ${wait2:-} ]] then sleep $wait2 @@ -297,8 +297,9 @@ hostargs run-helper } -full-run() { - for seqno in 500000 400000 300000 200000 100000 ; do # 200000 300000 400000 500000 ; do # 700000 900000; do # configurations +# Recovery experient. +exp() { + for seqno in 500000 400000 300000 200000 100000 ; do # configurations stop for i in {1..5} ; do # trials echo === seqno=$seqno i=$i === @@ -312,16 +313,18 @@ done } -full-block() { - local out=block-log-$(date +%Y-%m-%d-%H:%M:%S) - ln -sf $out block-log - full-run >& $out +# Single-host recovery experiment. +exp-single() { + local out=single-log-$(date +%Y-%m-%d-%H:%M:%S) + ln -sf $out single-log + exp >& $out } -full-yield() { - local out=yield-log-$(date +%Y-%m-%d-%H:%M:%S) - ln -sf $out yield-log - extraargs="--yield-catch-up ${extraargs:-}" full-run >& $out +# Multi-host recovery experiment. +exp-multi() { + local out=multi-log-$(date +%Y-%m-%d-%H:%M:%S) + ln -sf $out multi-log + extraargs="-m ${extraargs:-}" exp >& $out } stop-helper() { This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
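The `wal` class added in the commit above is a redo-only log: because the store never flushes uncommitted state (no steal), no undo records are needed, so each transaction just appends its write/delete operations followed by a commit marker. A hedged Python sketch of the same record protocol (JSON lines instead of boost binary archives; the `replay` redo pass is illustrative — the commit itself still lists log-based recovery as a TODO):

```python
import json

# Op codes mirroring the C++ enum { op_del, op_write, op_commit }.
OP_DEL, OP_WRITE, OP_COMMIT = 0, 1, 2

class Wal:
    """Redo-only write-ahead log: append operations, then a commit marker."""
    def __init__(self, path="wal"):
        self.f = open(path, "a")

    def delete(self, key):
        self.f.write(json.dumps([OP_DEL, key]) + "\n")

    def write(self, key, val):
        self.f.write(json.dumps([OP_WRITE, key, val]) + "\n")

    def commit(self):
        self.f.write(json.dumps([OP_COMMIT]) + "\n")
        self.f.flush()  # push the commit record out before acknowledging

def replay(path, db):
    """Redo pass: re-apply only operations followed by a commit record."""
    pending = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec[0] == OP_COMMIT:
                for op in pending:
                    if op[0] == OP_WRITE:
                        db[op[1]] = op[2]
                    else:
                        db.pop(op[1], None)
                pending = []  # txn is durable; its staged ops are applied
            else:
                pending.append(rec)
    return db
```

Operations from a transaction that never reached its commit marker (e.g. because of a crash mid-transaction) are simply discarded by the redo pass, which is exactly why no undo logging is required under no-steal.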
From: <yan...@us...> - 2009-02-03 22:29:59
Revision: 1162 http://assorted.svn.sourceforge.net/assorted/?rev=1162&view=rev Author: yangzhang Date: 2009-02-03 22:29:55 +0000 (Tue, 03 Feb 2009) Log Message: ----------- - Print out the raw data tables. - Added default value lookups to deal with unreliably funneled output. - Fixed the parsing loop to understand the significance of === markers. - Graphs are named after the real filenames of the logs they're generated from. - Added WAL benchmark. - Updated the scaling analysis to include the WAL results. - Added a mtcp benchmark. - Renamed run to rec. - Added --yield-build-up to alleviate the large distortion in recv times (though this greatly inflates the build-up times). - Updated README/TODOs. Modified Paths: -------------- ydb/trunk/README ydb/trunk/tools/analysis.py ydb/trunk/tools/test.bash Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-03 22:24:59 UTC (rev 1161) +++ ydb/trunk/README 2009-02-03 22:29:55 UTC (rev 1162) @@ -251,12 +251,20 @@ Period: 1/27-2/3 +- DONE associative containers benchmark +- DONE parallel tcp benchmark - DONE simple wal +- issues: + - multi vs single + - WAL performs well + - what to do? limit parallelism? how? 
Period: 2/3- -- DONE better wal +- TODO better wal +- TODO better understand multihost recovery - TODO fix up analysis of multihost recovery +- TODO data structures benchmark - TODO implement checkpointing disk-based scheme - TODO implement log-based recovery; show that it sucks - TODO implement group (batch) commit for log-based recovery Modified: ydb/trunk/tools/analysis.py =================================================================== --- ydb/trunk/tools/analysis.py 2009-02-03 22:24:59 UTC (rev 1161) +++ ydb/trunk/tools/analysis.py 2009-02-03 22:29:55 UTC (rev 1162) @@ -2,6 +2,7 @@ from __future__ import with_statement import re, sys, itertools, colorsys +from path import path from os.path import basename, realpath from pylab import * @@ -28,9 +29,12 @@ def show_table1(dicts): keys = dicts[0].keys() - return show_table([(k, [d[k] for d in dicts]) for k in keys]) + # TODO: Remove the default arg once we have reliably funneled output. + return show_table([(k, [d.get(k, dicts[0][k]) for d in dicts]) + for k in keys]) -def logextract(path, indexkey, pats): +def logextract(path, indexkey, pats, xform = None): + if xform is None: xform = lambda x: x check(path) # Capture values from log using regex pats. def getcaps(): @@ -38,6 +42,7 @@ caps = {} # captures: name -> int/float sats = [ False for pat in pats ] for line in f: + if line.startswith('=== '): print line,; caps = {}; sats = [False for pat in pats] # if line == '\n': print '===', caps.keys(), ''.join('1' if s else '0' for s in sats) for i, pat in enumerate(pats): m = re.search(pat, line) @@ -51,17 +56,18 @@ if all(sats): sats = [ False for pat in pats ] # print '!!!' - yield caps.copy() # [ caps[k] for k in keys ] - caps.clear() + yield xform(caps) + caps = {} # Aggregate the captured values. 
caps = list(getcaps()) -# print show_table1(caps) + print show_table1(caps) + caps = sorted(caps, key = lambda d: d[indexkey]) keys = [indexkey] + filter(lambda x: x != indexkey, caps[0].keys()) def gen(): for index, ds in itertools.groupby(caps, lambda d: d[indexkey]): ds = list(ds) - print [d['len'] for d in ds] - yield [ [d[k] for k in keys] for d in ds ] + # TODO: Remove the default arg once we have reliably funneled output. + yield [ [d.get(k, ds[0][k]) for k in keys] for d in ds ] a = array(list(gen())) # raw results indexes = a[:,0,0] means = median(a,1) # or a.mean(1) @@ -78,33 +84,46 @@ print return res -def scaling(path): +def scaling(scalingpath, ariespath): print '=== scaling ===' - print 'file:', getname(path) - res = logextract(path, 'n', [ - r'=== n=(?P<n>\d+) ', + print 'file:', getname(scalingpath) + res = logextract(scalingpath, 'n', [ + r'=== n=(?P<n>-?\d+) ', r'issued .*\((?P<tps>[.\d]+) tps\)' ]) - errorbar(res['n'], res['tps mean'], res['tps sd']) + print 'file:', getname(ariespath) + res2 = logextract(ariespath, 'n', [ + r'=== n=(?P<n>-?\d+) ', + r'issued .*\((?P<tps>[.\d]+) tps\)' ]) + + errorbar(hstack([res2['n'], res['n']]), + hstack([res2['tps mean'], res['tps mean']]), + hstack([res2['tps sd'], res['tps sd']])) title('Scaling of baseline throughput with number of nodes') xlabel('Node count') ylabel('Mean TPS (stdev error bars)') - xlim(res['n'].min() - .5, res['n'].max() + .5) + xlim(hstack([res2['n'], res['n']]).min() - .5, + hstack([res2['n'], res['n']]).max() + .5) ylim(ymin = 0) savefig('scaling.png') def run(singlepath, multipath): - for path, titlestr, name in [(singlepath, 'single recoverer', 'single'), - (multipath, 'multi recoverer', 'multi')]: + singlepath, multipath = map(path, [singlepath, multipath]) + for datpath, titlestr, name in [(singlepath, 'single recoverer', 'single'), + (multipath, 'multi recoverer', 'multi')]: + def xform(d): + d['realdump'] = d['dump'] - d['recv'] - d['deser'] + return d print '===', titlestr, 
'===' - print 'file:', getname(path) - res = logextract(path, 'seqno', + print 'file:', getname(datpath) + res = logextract(datpath, 'seqno', [ r'=== seqno=(?P<seqno>\d+) ', r'got recovery message of (?P<len>\d+) bytes in (?P<dump>\d+) ms: xfer took (?P<recv>\d+) ms, deserialization took (?P<deser>\d+)', r'built up .* (?P<buildup>\d+) ms', r'generating recovery took (?P<gen>\d+) ms', r'replayer caught up; from backlog replayed \d+ txns .* in (?P<catchup>\d+) ms', - r'.*: recovering node caught up; took (?P<total>\d+) ?ms' ] ) + r'.*: recovering node caught up; took (?P<total>\d+) ?ms' ], + xform ) # Colors and positioning width = 5e4 @@ -123,7 +142,7 @@ ehues.next(), label = label, bottom = self.bottom) self.bottom += res[yskey] - mybar('dump mean', 'dump sd', 'State dump') + mybar('realdump mean', 'realdump sd', 'State dump etc.') mybar('recv mean', 'recv sd', 'State receive') mybar('deser mean', 'deser sd', 'State deserialization') mybar('buildup mean', 'buildup sd', 'Build-up') @@ -141,17 +160,33 @@ ax2.set_ylabel('Size of serialized state (KB)', color = col) ax2.set_ylim(ymin = 0) for tl in ax2.get_yticklabels(): tl.set_color(col) - xlim(xmin = min(res['seqno']) - width, xmax = max(res['seqno']) + width) - savefig(name + '.png') + pngpath = datpath.realpath() + '.png' + savefig(pngpath) + symlink = path(name + '.png') + if symlink.isfile(): symlink.remove() + pngpath.symlink(symlink) + +def mtcp(datpath): + res = logextract(datpath, 'n', + [ r'=== n=(?P<n>\d+)', + r'real\s+0m(?P<t>[0-9\.]+)s' ]) + errorbar(res['n'], res['t mean'], res['t sd']) + title('Time to send a large message (6888896 bytes)') + xlabel('Number of parallel senders') + ylabel('Time (ms)') + savefig('mtcp.png') + def main(argv): if len(argv) <= 1: print >> sys.stderr, 'Must specify a command' elif argv[1] == 'scaling': - scaling(argv[2] if len(argv) > 2 else 'scaling-log') + scaling(*argv[2:] if len(argv) > 2 else ['scaling-log', 'aries-log']) elif argv[1] == 'run': run(*argv[2:] if len(argv) 
> 2 else ['single-log', 'multi-log']) + elif argv[1] == 'mtcp': + mtcp('mtcp-log') else: print >> sys.stderr, 'Unknown command:', argv[1] Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-02-03 22:24:59 UTC (rev 1161) +++ ydb/trunk/tools/test.bash 2009-02-03 22:29:55 UTC (rev 1162) @@ -232,10 +232,10 @@ scaling-helper() { local leader=$1 shift - tagssh $leader "ydb/src/ydb -l -n $# -X 100000" & + tagssh $leader "ydb/src/ydb -l -n $# -X 100000 ${extraargs:-}" & sleep .1 for rep in "$@" - do tagssh $rep "ydb/src/ydb -n $# -H $leader" & + do tagssh $rep "ydb/src/ydb -n $# -H $leader ${extraargs:-}" & done wait } @@ -249,13 +249,13 @@ # configurations; e.g., "repeat scaling". # TODO: fix this to work also with `hosts`; move into repeat-helper that's run # via hostargs, and change the range= to hosts= -full-scaling() { +exp-scaling() { local out=scaling-log-$(date +%Y-%m-%d-%H:%M:%S-%N) local orighosts="$hosts" maxn=$(( $(echo $hosts | wc -w) - 1 )) ln -sf $out scaling-log - for n in `seq $maxn -1 1` ; do # configurations + for n in `seq $maxn -1 0` ; do # configurations stop - for i in {1..5} ; do # trials + for i in {1..3} ; do # trials echo === n=$n i=$i === echo === n=$n i=$i === > `tty` scaling @@ -269,7 +269,7 @@ hosts="$orighosts" } -run-helper() { +rec-helper() { local leader=$1 shift : ${seqno:=100000} @@ -282,7 +282,7 @@ done sleep .1 # pexpect 'got all \d+ replicas' leader # Run joiner. - tagssh $1 "ydb/src/ydb -H $leader --yield-catch-up ${extraargs:-}" & # -v --debug-threads -t 200000" & + tagssh $1 "ydb/src/ydb -H $leader --yield-build-up --yield-catch-up ${extraargs:-}" & # -v --debug-threads -t 200000" & if false ; then if [[ ${wait2:-} ]] then sleep $wait2 @@ -293,18 +293,18 @@ wait } -run() { - hostargs run-helper +rec() { + hostargs rec-helper } # Recovery experient. 
-exp() { +exp-rec() { for seqno in 500000 400000 300000 200000 100000 ; do # configurations stop - for i in {1..5} ; do # trials + for i in {1..3} ; do # trials echo === seqno=$seqno i=$i === echo === seqno=$seqno i=$i === > `tty` - run + rec sleep 1 stop sleep .1 @@ -314,19 +314,68 @@ } # Single-host recovery experiment. -exp-single() { +exp-rec-single() { local out=single-log-$(date +%Y-%m-%d-%H:%M:%S) ln -sf $out single-log - exp >& $out + exp-rec >& $out } # Multi-host recovery experiment. -exp-multi() { +exp-rec-multi() { local out=multi-log-$(date +%Y-%m-%d-%H:%M:%S) ln -sf $out multi-log - extraargs="-m ${extraargs:-}" exp >& $out + extraargs="-m ${extraargs:-}" exp-rec >& $out } +# WAL. +aries() { + extraargs='--wal' scaling ${hosts:-} +} + +exp-aries() { + local out=aries-log-$(date +%Y-%m-%d-%H:%M:%S) + ln -sf $out aries-log + for i in {1..3} ; do + echo === n=-1 i=$i === + echo === n=-1 i=$i === > `tty` + aries + echo + done >& $out +} + +mtcp-helper() { + local leader=$1 n=$(( $# - 1 )) + tagssh $leader 'pkill nc' + shift + while (( $# > 0 )) ; do + tagssh $1 "sleep .5 ; time seq $((1000000/n)) | nc $leader 9876" & + shift + done + tagssh $leader "nc -l 9876 > /dev/null" + wait +} + +mtcp() { + hostargs mtcp-helper +} + +exp-mtcp() { + local out=mtcp-log-$(date +%Y-%m-%d-%H:%M:%S-%N) + local orighosts="$hosts" maxn=$(( $(echo $hosts | wc -w) - 1 )) + ln -sf $out mtcp-log + for n in `seq $maxn -1 1` ; do # configurations + for i in {1..3} ; do # trials + echo === n=$n i=$i === + echo === n=$n i=$i === > `tty` + mtcp + sleep 1 + echo + done + hosts="${hosts% *}" + done >& $out + hosts="$orighosts" +} + stop-helper() { tagssh $1 'pkill -sigint ydb' } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
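The `logextract()` parsing fixed in the commit above ends by grouping the per-trial capture dicts by an index key (`n` for scaling runs, `seqno` for recovery runs) and reducing every other field to a central value plus a spread for the error-bar plots. A simplified pure-Python sketch of that aggregation step (the real code uses numpy arrays and currently takes the median as the central value; `aggregate` is an illustrative name):

```python
import math
from collections import defaultdict

def aggregate(caps, indexkey):
    """Group per-trial capture dicts by indexkey and reduce every other
    numeric field to a mean and a population standard deviation."""
    groups = defaultdict(list)
    for d in caps:
        groups[d[indexkey]].append(d)
    res = {}
    for idx in sorted(groups):
        ds = groups[idx]
        row = {}
        for k in ds[0]:
            if k == indexkey:
                continue
            vals = [d[k] for d in ds]
            m = sum(vals) / len(vals)
            sd = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            row[k + " mean"], row[k + " sd"] = m, sd
        res[idx] = row
    return res
```

With five (or three) trials per configuration, the resulting `<field> mean` / `<field> sd` pairs are what feed `errorbar()` and the stacked `bar()` calls in analysis.py.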
From: <yan...@us...> - 2009-02-06 00:36:33
Revision: 1167 http://assorted.svn.sourceforge.net/assorted/?rev=1167&view=rev Author: yangzhang Date: 2009-02-06 00:36:31 +0000 (Fri, 06 Feb 2009) Log Message: ----------- - Added epperf benchmark because I couldn't trust ST - Made stperf a bit more flexible - Made epperf a better system - Changed the `run` analysis to produce a bar chart comparing multi and single side by side - Added aggregation filter specifiers to the regex group names - Rearranged test.bash to clean it up - Added notes and more TODOs to the README Modified Paths: -------------- ydb/trunk/README ydb/trunk/tools/analysis.py ydb/trunk/tools/stperf.cc ydb/trunk/tools/test.bash Added Paths: ----------- ydb/trunk/tools/epperf.cc ydb/trunk/tools/epperf.mk Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-06 00:32:30 UTC (rev 1166) +++ ydb/trunk/README 2009-02-06 00:36:31 UTC (rev 1167) @@ -255,12 +255,40 @@ - DONE parallel tcp benchmark - DONE simple wal - issues: + - associative containers: hash is strong contender, array is unbeatable, but + most are close enough + - interested to see how the cache-friendly btree programming competition + pans out + - serialization benchmark + - protobufs very cheap for large messages, and always terse + - may want to look into batching up msgs for faster ser/deser - multi vs single + - some microbenchmarks demonstrate that in fact there is no speedup from + parallel transfers; network is already saturated + - used: socat (mtcp), epperf, stperf + - show scaling results + - 7MB / 300ms = 2.3MB/ms = 23MB/s = 184Mb/s + - slower than expected + - share network with incoming & outgoing leader comm + - still need to understand why the build-up is larger for multi + - scalability graphs: bottleneck is...? + - 4 + 4 + 8 * 5 = 48 B per txn + - one: 25,000txn/s * 48B/txn * 8b/B / 1e6b/Mb = 9Mb/s + - two: 9Mb/s * 2 - WAL performs well - - what to do? limit parallelism? how? 
+ - close to no replication in scalability graphs + - what to do? limit parallelism? how? + - include actual separate clients? -Period: 2/3- +Period: 2/5- +- TODO commit!!! +- TODO serialization bench (multiple layers, control batch sizes) +- TODO network throughput bench +- TODO associative container bench +- TODO combine the analyses of the above three; integrate with actual message + formats, etc. +- TODO batching, serialization, disk speed - TODO better wal - TODO better understand multihost recovery - TODO fix up analysis of multihost recovery Modified: ydb/trunk/tools/analysis.py =================================================================== --- ydb/trunk/tools/analysis.py 2009-02-06 00:32:30 UTC (rev 1166) +++ ydb/trunk/tools/analysis.py 2009-02-06 00:36:31 UTC (rev 1167) @@ -5,11 +5,15 @@ from path import path from os.path import basename, realpath from pylab import * +from matplotlib.font_manager import FontProperties class struct(object): pass def getname(path): return basename(realpath(path)) +def mean(xs): return array(xs).mean() +def last(xs): return xs[-1] + def check(path): with file(path) as f: if 'got timeout' in f.read(): @@ -36,6 +40,15 @@ def logextract(path, indexkey, pats, xform = None): if xform is None: xform = lambda x: x check(path) + # Prepare the regex patterns. + filts = {} + def repl(m): + name, filt = m.group(1), m.group(2) + # Duplicate check. + assert name not in filts, 'Capture %r exists more than once.' % name + filts[name] = filt + return '(?P<%s>' % name + pats = [ re.sub(r'\(\?P<(\w+)\|(\w+)>', repl, pat) for pat in pats ] # Capture values from log using regex pats. 
def getcaps(): with file(path) as f: @@ -47,15 +60,23 @@ for i, pat in enumerate(pats): m = re.search(pat, line) if m: - for k in m.groupdict(): + gd = dict( (k, float(v)) for (k,v) in m.groupdict().iteritems() ) + for k, v in gd.iteritems(): if k in caps: - caps[k + '0'] = caps[k] - caps.update((k, float(v)) for k,v in m.groupdict().iteritems()) + if k in filts: + if type(caps[k]) != list: caps[k] = [caps[k]] + caps[k].append(v) + else: + caps[k + '0'] = caps[k] + else: + caps[k] = v sats[i] = True break if all(sats): sats = [ False for pat in pats ] -# print '!!!' + caps = dict( (k, eval(filts.get(k, 'lambda x: x'))(v)) + for k,v in caps.iteritems() ) + assert all( type(v) != list for v in caps.itervalues() ) yield xform(caps) caps = {} # Aggregate the captured values. @@ -109,6 +130,7 @@ def run(singlepath, multipath): singlepath, multipath = map(path, [singlepath, multipath]) + ress = [] for datpath, titlestr, name in [(singlepath, 'single recoverer', 'single'), (multipath, 'multi recoverer', 'multi')]: def xform(d): @@ -118,55 +140,68 @@ print 'file:', getname(datpath) res = logextract(datpath, 'seqno', [ r'=== seqno=(?P<seqno>\d+) ', - r'got recovery message of (?P<len>\d+) bytes in (?P<dump>\d+) ms: xfer took (?P<recv>\d+) ms, deserialization took (?P<deser>\d+)', - r'built up .* (?P<buildup>\d+) ms', + r'got recovery message of (?P<len|mean>\d+) bytes in (?P<dump|mean>\d+) ms: xfer took (?P<recv|mean>\d+) ms, deserialization took (?P<deser|sum>\d+)', + r'built up .* (?P<buildup|mean>\d+) ms', r'generating recovery took (?P<gen>\d+) ms', r'replayer caught up; from backlog replayed \d+ txns .* in (?P<catchup>\d+) ms', r'.*: recovering node caught up; took (?P<total>\d+) ?ms' ], xform ) + ress.append((datpath, titlestr, name, res)) + seqnos = ress[0][-1]['seqno'] + interval = float(seqnos[1] - seqnos[0]) + xmin, xmax = seqnos.min() - interval / 2, seqnos.max() + interval / 2 + gap = interval / 10 # (xmax - xmin) / len(seqnos) + width = (interval - gap) / 
len(ress) + step = 1. / len(seqnos) # For color. + + for pos, (datpath, titlestr, name, res) in enumerate(ress): # Colors and positioning - width = 5e4 - step = 1.0 / 5 - hues = ( colorsys.hls_to_rgb(step * i, .7, .5) for i in itertools.count() ) + hues = ( colorsys.hls_to_rgb(step * i, .7 - pos * .2, .5) for i in itertools.count() ) ehues = ( colorsys.hls_to_rgb(step * i, .3, .5) for i in itertools.count() ) - widths = ( 2 * width - 2 * width / 5 * i for i in itertools.count() ) - offsets = ( width - 2 * width / 5 * i for i in itertools.count() ) + widths = ( width for i in itertools.count() ) + offsets = ( pos * width for i in itertools.count() ) self = struct() self.bottom = 0 - clf() def mybar(yskey, eskey, label): - bar(res['seqno'] - offsets.next(), res[yskey], yerr = res[eskey], width = - widths.next(), color = hues.next(), edgecolor = (1,1,1), ecolor = - ehues.next(), label = label, bottom = self.bottom) + bar(res['seqno'] - (interval - gap) / 2 + offsets.next(), res[yskey], yerr + = res[eskey], width = widths.next(), color = hues.next(), edgecolor = + (1,1,1), ecolor = ehues.next(), label = name + ' ' + label, bottom = + self.bottom) self.bottom += res[yskey] - mybar('realdump mean', 'realdump sd', 'State dump etc.') - mybar('recv mean', 'recv sd', 'State receive') - mybar('deser mean', 'deser sd', 'State deserialization') - mybar('buildup mean', 'buildup sd', 'Build-up') - mybar('catchup mean', 'catchup sd', 'Catch-up') + mybar('realdump mean', 'realdump sd', 'state dump etc.') + mybar('recv mean', 'recv sd', 'state receive') + mybar('deser mean', 'deser sd', 'state deserialization') + mybar('buildup mean', 'buildup sd', 'build-up') + mybar('catchup mean', 'catchup sd', 'catch-up') - title('Recovery time of ' + titlestr + ' over data size') - xlabel('Transaction count (corresponds roughly to data size)') - ylabel('Mean time in ms (SD error bars)') - legend(loc = 'upper left') + title('Recovery time of ' + titlestr + ' over data size') + 
xlabel('Transaction count (corresponds roughly to data size)') + ylabel('Mean time in ms (SD error bars)') + legend(loc = 'upper left', prop = FontProperties(size = 'small')) - ax2 = twinx() - col = colorsys.hls_to_rgb(.6, .4, .4) - ax2.errorbar(res['seqno'], res['len mean'] / 1024, res['len sd'] / 1024, marker = 'o', - color = col) - ax2.set_ylabel('Size of serialized state (KB)', color = col) - ax2.set_ylim(ymin = 0) - for tl in ax2.get_yticklabels(): tl.set_color(col) - xlim(xmin = min(res['seqno']) - width, xmax = max(res['seqno']) + width) + ax2 = twinx() + colors = ( colorsys.hls_to_rgb(.6, .6 - i * .2, .4) for i, _ in enumerate(ress) ) + for _, _, name, res in ress: + ax2.errorbar(res['seqno'], res['len mean'] / 1024, res['len sd'] / 1024, + marker = 'o', color = colors.next(), + label = name + ' msg size') + ax2.set_ylabel('Size of serialized state (KB)') + ax2.set_ylim(ymin = 0) +# for tl in ax2.get_yticklabels(): tl.set_color(col) + xlim(xmin = xmin, xmax = xmax) + legend(loc = 'upper center', prop = FontProperties(size = 'small')) + if False: pngpath = datpath.realpath() + '.png' savefig(pngpath) symlink = path(name + '.png') if symlink.isfile(): symlink.remove() pngpath.symlink(symlink) + else: + savefig('run.png') def mtcp(datpath): def xform(d): @@ -182,6 +217,7 @@ title('Scaling of sending large message using socat (6888896 bytes)') xlabel('Number of parallel senders') ylabel('Speedup') + ylim(ymin = 0) savefig('mtcp.png') def stperf(datpath): @@ -198,6 +234,7 @@ title('Scaling of sending large message using ST (6888896 bytes)') xlabel('Number of parallel senders') ylabel('Speedup') + ylim(ymin = 0) savefig('stperf.png') def main(argv): Added: ydb/trunk/tools/epperf.cc =================================================================== --- ydb/trunk/tools/epperf.cc (rev 0) +++ ydb/trunk/tools/epperf.cc 2009-02-06 00:36:31 UTC (rev 1167) @@ -0,0 +1,188 @@ +#include <fcntl.h> +#include <stdio.h> +#include <sys/epoll.h> +#include <unistd.h> + +#include 
<iostream> +#include <cstdlib> + +#include <commons/check.h> +#include <commons/closing.h> +#include <commons/pool.h> +#include <commons/sockets.h> +#include <boost/scoped_array.hpp> + +using namespace boost; +using namespace commons; +using namespace std; + +enum { size = 100000000 }; +int n, expected; + +class echoer { + public: + echoer() : buf(new char[4096]) {} + + /** + * \return true iff we are not done with the reading/would've blocked + * (EAGAIN), false iff we've gotten the full 40-byte packet or have hit + * EOF/an error. + */ + bool consume() { + while (true) { + int bytes = ::read(fd_, buf.get(), 4096); + if (bytes == -1) { + // We're going to block. + if (errno == EAGAIN) { + return true; + } else { + perror("read"); + return false; + } + } + if (bytes == 0) { + return false; + } + ss_ << string(buf.get(), bytes); + //if (ss_.tellp() >= expected) + //return false; + } + } + + /** + * Read the contents of the buffer as a string. + */ + string read() { return ss_.str(); } + + /** + * The socket file descriptor we're currently associated with. + */ + int & fd() { return fd_; } + int fd() const { return fd_; } + + private: + stringstream ss_; + int fd_; + scoped_array<char> buf; // (new char[4096]); +}; + +int +main(int argc, char* argv[]) { + if (argc < 2) { return 1; } + n = atoi(argv[1]); + expected = size / n; + + // Create a non-blocking server socket. + int server = tcp_listen(9876, true); + + // Make sure the fd is finally closed. + closingfd closer(server); + + // Create our epoll file descriptor. max_events is the maximum number of + // events to process at a time (max number of events that we want a call to + // epoll_wait() to "return"), while max_echoers is the max number of + // connections to make. + const int max_events = 16, max_echoers = 100; + + // This file descriptor isn't actually bound to any socket; it's a special fd + // that is really just used for manipulating the epoll (e.g., registering + // more sockets/connections with it). 
TODO: Figure out the rationale behind + // why this thing is an fd. + int epoll_fd = checknneg(epoll_create(max_events)); + + // Add our server fd to the epoll event loop. The event specifies: + // + // - what fd is + // - what operations we're interested in (connections, hangups, errors) + // (TODO: what are hangups?) + // - arbitrary data to be associated with this fd, in the form of a pointer + // (ptr) or number (u32/u64); this is more useful for connection fd's, of + // which there are multiple, and so it helps to have a direct pointer to + // (say) that connection's handler. + // + // The add operation actually makes a copy of the given epoll_event, so + // that's why we can reuse this `event` later. + struct epoll_event event; + event.events = EPOLLIN | EPOLLERR | EPOLLHUP | EPOLLET; + event.data.fd = server; + checknneg(epoll_ctl(epoll_fd, EPOLL_CTL_ADD, server, &event)); + + // Set up a bunch of echo server instances. + pool<echoer> echoers(max_echoers); + + // Execute the epoll event loop. + int ncons = 0; + while (ncons < n) { + struct epoll_event events[max_events]; + int num_fds = epoll_wait(epoll_fd, events, max_events, -1); + + for (int i = 0; i < num_fds; i++) { + // Case 1: Error condition. + if (events[i].events & (EPOLLHUP | EPOLLERR)) { + fputs("epoll: EPOLLERR", stderr); + // epoll will remove the fd from its set automatically when the fd is + // closed. + close(events[i].data.fd); + } else { + check(events[i].events & EPOLLIN); + + // Case 2: Our server is receiving a connection. + if (events[i].data.fd == server) { + struct sockaddr remote_addr; + socklen_t addr_size = sizeof remote_addr; + int connection = accept(server, &remote_addr, &addr_size); + if (connection == -1) { + if (errno != EAGAIN && errno != EWOULDBLOCK) { + perror("accept"); + } + continue; + } + + // Make the connection non-blocking. + checknneg(fcntl(connection, F_SETFL, + O_NONBLOCK | fcntl(connection, F_GETFL, 0))); + + // Add the connection to our epoll loop. 
Note we are reusing our + // epoll_event. Now we're actually using the ptr field to point to a + // free handler. event.data is a union of {ptr, fd, ...}, so we can + // only use one of these. event.data is entirely for the user; epoll + // doesn't actually look at this. Note that we're passing the fd + // (connection) separately into epoll_ctl(). + echoer *e = echoers.take(); + cout << "got a connection! " << + echoers.size() << " echoers remaining" << endl; + event.data.ptr = e; + e->fd() = connection; + checknneg(epoll_ctl(epoll_fd, EPOLL_CTL_ADD, connection, + &event)); + } + + // Case 3: One of our connections has read data. + else { + echoer &e = *((echoer*) events[i].data.ptr); + // If we have read the minimum amount (or encountered a dead-end + // situation), then echo the data back. + if (!e.consume()) { + cout << "done!" << endl; +// // Write back! +// string s = e.read(); +// check((size_t) checknneg(write(e.fd(), s.c_str(), s.size())) == s.size()); +// +// // epoll will remove the fd from its set automatically when the fd is +// // closed. + close(e.fd()); +// +// // Release the echoer. 
+// echoers.drop(&e); +// +// cout << "responded with '" << e.read() << "'; " << +// echoers.size() << " echoers remaining" << endl; + ++ncons; + } + } + } + } + } + + return 0; +} Added: ydb/trunk/tools/epperf.mk =================================================================== --- ydb/trunk/tools/epperf.mk (rev 0) +++ ydb/trunk/tools/epperf.mk 2009-02-06 00:36:31 UTC (rev 1167) @@ -0,0 +1,9 @@ +CXXFLAGS += -O3 -Wall +CXX := $(WTF) $(CXX) + +all: epperf + +clean: + rm -f epperf + +.PHONY: clean all Modified: ydb/trunk/tools/stperf.cc =================================================================== --- ydb/trunk/tools/stperf.cc 2009-02-06 00:32:30 UTC (rev 1166) +++ ydb/trunk/tools/stperf.cc 2009-02-06 00:36:31 UTC (rev 1167) @@ -12,8 +12,9 @@ using namespace commons; using namespace std; -enum { port = 9876, size = 100000000 }; // 100mb +enum { size = 100000000 }; // 100mb char *rbuf, *sbuf, *host; +short port; bool do_r, do_s; int n, my_i; @@ -28,6 +29,7 @@ if (do_r) { vector<st_thread_t> ts; st_netfd_t l = st_tcp_listen(port); + // XXX: bug: the ordering here (the value of i) should be specified by the connector. 
for (int i = 0; i < n; i++) { st_netfd_t c = st_accept(l, 0, 0, -1); ts.push_back(st_spawn(boost::bind(rr, i, c))); @@ -47,16 +49,18 @@ int main(int argc, char **argv) { host = strdup("localhost"); n = 1; + port = 9876; int opt; - while ((opt = getopt(argc, argv, "i:n:rs:")) != -1) { + while ((opt = getopt(argc, argv, "i:n:p:rs:")) != -1) { switch (opt) { case 'i': my_i = atoi(optarg); break; case 'n': n = atoi(optarg); break; + case 'p': port = atoi(optarg); break; case 'r': do_r = true; break; case 's': do_s = true; host = strdup(optarg); break; } } - cout << "n=" << n << " i=" << my_i + cout << "n=" << n << " i=" << my_i << " port=" << port << " start=" << bstart(my_i) << " end=" << bend(my_i) << endl; if (!(do_r || do_s)) do_r = do_s = true; rbuf = new char[size]; Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-02-06 00:32:30 UTC (rev 1166) +++ ydb/trunk/tools/test.bash 2009-02-06 00:36:31 UTC (rev 1167) @@ -7,6 +7,41 @@ script="$(basename "$0")" +# +# Configuration +# + +# Set up hosts. +confighosts() { + if [[ ! ${remote:-} ]] ; then + if [[ ! "${hosts:-}" && ! "${range:-}" ]] + then range='1 14'; echo "warning: running with farms 1..14" 1>&2 + fi + if [[ "${range:-}" ]] + then hosts="$( seq $range | sed 's/^/farm/' )" + fi + hosts="$( echo -n $hosts )" + fi +} + +# Set up logger. +configlogger() { + if [[ ! ${remote:-} ]] ; then + ( + flock -n /tmp/ydbtest.socket + ) > /tmp/y + fi +} + +# +# Utilities +# + +# Use mssh to log in with password as root to each machine. 
+mssh-root() { + mssh -l root "$@" +} + tagssh() { ssh "$@" 2>&1 | python -u -c ' import time, sys, socket @@ -28,6 +63,38 @@ fi } +remote() { + local host="$1" + shift + scp -q "$(dirname "$0")/$script" "$host:" + tagssh "$host" "remote=1 ./$script" "$@" +} + +parhosts() { + echo -n $hosts | xargs ${xargs--P9} -d' ' -I^ "$@" +} + +parssh() { + parhosts ssh ^ "set -o errexit -o nounset; $@" +} + +parscp() { + parhosts scp -q "$@" +} + +parremote() { + export hosts range + parhosts "./$script" remote ^ "$@" +} + +hostargs() { + "$@" $hosts +} + +# +# Setup +# + node-init-setup() { check-remote mkdir -p work @@ -130,30 +197,6 @@ make WTF= } -remote() { - local host="$1" - shift - scp -q "$(dirname "$0")/$script" "$host:" - tagssh "$host" "remote=1 ./$script" "$@" -} - -parhosts() { - echo -n $hosts | xargs ${xargs--P9} -d' ' -I^ "$@" -} - -parssh() { - parhosts ssh ^ "set -o errexit -o nounset; $@" -} - -parscp() { - parhosts scp -q "$@" -} - -parremote() { - export hosts range - parhosts "./$script" remote ^ "$@" -} - init-setup() { parremote node-init-setup } @@ -205,12 +248,21 @@ parssh make -C /tmp/ -f stperf.mk } +setup-epperf() { + parscp epperf.{cc,mk} ^:/tmp/ + parssh make -C /tmp/ -f epperf.mk +} + full-setup() { init-setup setup-deps setup-ydb } +# +# Status +# + hostinfos() { xargs= parssh " echo @@ -230,64 +282,28 @@ " } -hostargs() { - "$@" $hosts +times() { + parssh date +%s.%N } -scaling-helper() { - local leader=$1 - shift - tagssh $leader "ydb/src/ydb -l -n $# -X 100000 ${extraargs:-}" & - sleep .1 - for rep in "$@" - do tagssh $rep "ydb/src/ydb -n $# -H $leader ${extraargs:-}" & - done - wait -} +# +# Experiments involving ydb recovery (varying amount of data). +# -# This just tests how the system scales; no recovery involved. -scaling() { - hostargs scaling-helper -} - -# Repeat some experiment some number of trials and for some number of range -# configurations; e.g., "repeat scaling". 
-# TODO: fix this to work also with `hosts`; move into repeat-helper that's run -# via hostargs, and change the range= to hosts= -exp-scaling() { - local out=scaling-log-$(date +%Y-%m-%d-%H:%M:%S-%N) - local orighosts="$hosts" maxn=$(( $(echo $hosts | wc -w) - 1 )) - ln -sf $out scaling-log - for n in `seq $maxn -1 0` ; do # configurations - stop - for i in {1..3} ; do # trials - echo === n=$n i=$i === - echo === n=$n i=$i === > `tty` - scaling - sleep 1 - stop - sleep .1 - echo - done - hosts="${hosts% *}" - done >& $out - hosts="$orighosts" -} - rec-helper() { local leader=$1 shift : ${seqno:=100000} - tagssh $leader "ydb/src/ydb -l -x --accept-joiner-seqno $seqno -n $(( $# - 1 )) -o 1 -O 1 ${extraargs:-}" & # -v --debug-threads - sleep .1 # pexpect 'waiting for at least' + tagssh $leader "ydb/src/ydb -l -x --accept-joiner-seqno $seqno -n $(( $# - 1 )) -o 1 -O 1 ${extraargs:-}" & + sleep .1 # Run initial replicas. while (( $# > 1 )) ; do tagssh $1 "ydb/src/ydb -H $leader" & shift done - sleep .1 # pexpect 'got all \d+ replicas' leader + sleep .1 # Run joiner. - tagssh $1 "ydb/src/ydb -H $leader --yield-build-up --yield-catch-up ${extraargs:-}" & # -v --debug-threads -t 200000" & + tagssh $1 "ydb/src/ydb -H $leader --yield-build-up --yield-catch-up ${extraargs:-}" & if false ; then if [[ ${wait2:-} ]] then sleep $wait2 @@ -297,11 +313,8 @@ fi wait } +rec() { hostargs rec-helper ; } -rec() { - hostargs rec-helper -} - # Recovery experient. exp-rec() { for seqno in 500000 400000 300000 200000 100000 ; do # configurations @@ -332,62 +345,105 @@ extraargs="-m ${extraargs:-}" exp-rec >& $out } -# WAL. 
-aries() { - extraargs='--wal' scaling ${hosts:-} +stop-helper() { + tagssh $1 'pkill -sigint ydb' } -exp-aries() { - local out=aries-log-$(date +%Y-%m-%d-%H:%M:%S) - ln -sf $out aries-log - for i in {1..3} ; do - echo === n=-1 i=$i === - echo === n=-1 i=$i === > `tty` - aries - echo - done >& $out +stop() { + hostargs stop-helper } -mtcp-helper() { - local leader=$1 n=$(( $# - 1 )) - tagssh $leader 'pkill socat' - shift - for i in `seq $n` ; do - tagssh $1 " - sleep .2 - ( time seq $((1000000/n)) | socat - TCP4:$leader:$((9876+i)) ) 2>&1 | - fgrep real" & - shift +kill-helper() { + for i in "$@" + do tagssh $i 'pkill ydb' done - tagssh $leader " - for i in \`seq $n\` ; do - socat TCP4-LISTEN:\$((9876+i)),reuseaddr - > /dev/null & - done - wait" - wait } -mtcp() { - hostargs mtcp-helper +kill() { + hostargs kill-helper } -exp-mtcp() { - local out=mtcp-log-$(date +%Y-%m-%d-%H:%M:%S-%N) +# +# Experiments varying number of nodes (no recovery). +# + +# Repeat some experiment some number of trials and for varying numbers of hosts. +exp-var() { + local name=$1 cmd=$1 stop=${2:-} + local out=$name-log-$(date +%Y-%m-%d-%H:%M:%S-%N) local orighosts="$hosts" maxn=$(( $(echo $hosts | wc -w) - 1 )) - ln -sf $out mtcp-log - for n in `seq $maxn -1 1` ; do # configurations + ln -sf $out $name-log + for n in `seq $maxn -1 ${minn:-1}` ; do # configurations + $stop for i in {1..3} ; do # trials echo === n=$n i=$i === - echo === n=$n i=$i === > `tty` - mtcp + $cmd sleep 1 + if [[ $stop ]] + then $stop; sleep .1 + fi echo done hosts="${hosts% *}" - done >& $out + done 2>&1 | tee $out hosts="$orighosts" } +# ydb scalability test. 
+scaling-helper() { + local leader=$1 + shift + tagssh $leader "ydb/src/ydb -l -n $# -X 100000 ${extraargs:-}" & + sleep .1 + for rep in "$@" + do tagssh $rep "ydb/src/ydb -n $# -H $leader ${extraargs:-}" & + done + wait +} +scaling() { hostargs scaling-helper ; } +exp-scaling() { minn=0 exp-var scaling stop ; } + +# socat app +mtcp-helper() { + local leader=$1 n=$(( $# - 1 )) + tagssh $leader 'pkill socat || true' + shift + tagssh $leader " + for i in \`seq $n\` ; do + socat TCP4-LISTEN:\$((9876+i)),reuseaddr - > /dev/null & + done + wait" & + sleep .1 + { + time { + for i in `seq $n` ; do + tagssh $1 " + if false; then + time python -c ' +import socket, sys +sys.stdout.flush() +host, port, i, n = [sys.argv[1]] + map(int, sys.argv[2:]) +s = socket.socket() +s.connect((host, port)) +# For baseline benchmarking (how long it takes to generate the msg) +# sys.exit(1 == ord(chr(1)*(100000000)[-1])) +s.send(chr(1)*(100000000/n))' $leader $((9876+i)) $((i-1)) $n + elif true ; then + time dd bs=10000 if=/dev/zero count=$((100000000/10000/n)) | + socat - TCP4:$leader:$((9876+i)) + elif false ; then + time /tmp/stperf -s $leader -p $((9876+i)) -i $((i-1)) -n $n + fi" & + shift + done + wait + } + } 2>&1 | fgrep real +} +mtcp() { hostargs mtcp-helper ; } +exp-mtcp() { exp-var mtcp ; } + +# ST app stperf-helper() { local leader=$1 n=$(( $# - 1 )) shift @@ -402,78 +458,49 @@ done wait } +stperf() { hostargs stperf-helper ; } +exp-stperf() { exp-var stperf ; } -stperf() { - hostargs stperf-helper -} - -exp-stperf() { - local out=stperf-log-$(date +%Y-%m-%d-%H:%M:%S-%N) - local orighosts="$hosts" maxn=$(( $(echo $hosts | wc -w) - 1 )) - ln -sf $out stperf-log - for n in `seq $maxn` ; do # configurations - for i in {1..3} ; do # trials - echo === n=$n i=$i === - echo === n=$n i=$i === > `tty` - stperf - sleep 1 - echo - done - hosts="${hosts% *}" - done >& $out - hosts="$orighosts" -} - -stop-helper() { - tagssh $1 'pkill -sigint ydb' -} - -stop() { - hostargs stop-helper -} - 
-kill-helper() { - for i in "$@" - do tagssh $i 'pkill ydb' +# epoll app +epperf-helper() { + local leader=$1 n=$(( $# - 1 )) + shift + parssh "pkill epperf || true" + tagssh $leader "/tmp/epperf $n > /dev/null" & + sleep .1 + for i in `seq $n` ; do + tagssh $1 " + ( time /tmp/stperf -s $leader -i $((i-1)) -n $n ) 2>&1 | + fgrep real" & + shift done + wait } +epperf() { hostargs epperf-helper ; } +exp-epperf() { exp-var epperf ; } -kill() { - hostargs kill-helper -} +# +# WAL experiments. +# -times() { - parssh date +%s.%N +aries() { + extraargs='--wal' scaling ${hosts:-} } -# Use mssh to log in with password as root to each machine. -mssh-root() { - mssh -l root "$@" +exp-aries() { + local out=aries-log-$(date +%Y-%m-%d-%H:%M:%S) + ln -sf $out aries-log + for i in {1..3} ; do + echo === n=-1 i=$i === + echo === n=-1 i=$i === > `tty` + aries + echo + done >& $out } -# Set up hosts. -confighosts() { - if [[ ! ${remote:-} ]] ; then - if [[ ! "${hosts:-}" && ! "${range:-}" ]] - then range='1 14'; echo "warning: running with farms 1..14" 1>&2 - fi - if [[ "${range:-}" ]] - then hosts="$( seq $range | sed 's/^/farm/' )" - fi - hosts="$( echo -n $hosts )" - fi -} +# +# Main +# -# Set up logger. -configlogger() { - if [[ ! ${remote:-} ]] ; then - ( - flock -n /tmp/ydbtest.socket - ) > /tmp/y - fi -} - confighosts -#configlogger - eval "$@" This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2009-02-20 03:22:10
Revision: 1204
http://assorted.svn.sourceforge.net/assorted/?rev=1204&view=rev
Author: yangzhang
Date: 2009-02-20 01:41:23 +0000 (Fri, 20 Feb 2009)

Log Message:
-----------
- upgraded pb-switching from static to dynamic (using template hackery)
- added --use-pb
- added batches of types, etc.
- broke recovery; need to fix this/speed it up
- updated ser; this was a playground for moving to dynamic switching
- fixed writer::show()
- fixed start_op/txn and add_op/txn so that the start_ operations would
  initialize the counts to 0
- fixed the clamp preprocessing so as to separate the fwd decls from the defs
- updated README with new TODOs

Modified Paths:
--------------
    ydb/trunk/README
    ydb/trunk/src/Makefile
    ydb/trunk/src/main.lzz.clamp
    ydb/trunk/src/ser.cc
    ydb/trunk/src/ser.h

Modified: ydb/trunk/README
===================================================================
--- ydb/trunk/README	2009-02-19 23:02:34 UTC (rev 1203)
+++ ydb/trunk/README	2009-02-20 01:41:23 UTC (rev 1204)
@@ -391,9 +391,15 @@
 Period 2/17-
-- TODO dynamic switch between pb and zero-copy
+- DONE removed class outstream
+- TODO refactor st_reader, etc.
to be generic opportunistic buffered readers +- TODO see how streambuf read/write is actually implemented (whether it's too + slow) +- TODO try making a streambuf for st_write, then try it in conj with + struct-less pb +- DONE dynamic switch between pb and zero-copy - TODO async (threaded) wal -- TODO 0-node 0-copy (need to use threads) +- TODO 0-node 0-copy (don't need to use threads, just process each batch immed) - TODO google dense hash map - TODO show aries-write Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2009-02-19 23:02:34 UTC (rev 1203) +++ ydb/trunk/src/Makefile 2009-02-20 01:41:23 UTC (rev 1204) @@ -62,12 +62,10 @@ %.o: %.pb.cc %.pb.h $(CXX) -c $(PBCXXFLAGS) $(OUTPUT_OPTION) $< -%.cc: %.lzz +%.cc %.hh: %.lzz lzz -hx hh -sx cc -hl -sl -hd -sd $< + python -c 'pars = file("lambda_impl.clamp_h").read().split("\n\n"); hh = file("main.hh").read(); print >> file("main.cc", "a"), pars[-1]; print >> file("main.hh", "w"), "\n\n".join(pars[:-1] + [hh])' -%.hh: %.lzz - lzz -hx hh -sx cc -hl -sl -hd -sd $< - %.pb.cc: %.proto protoc --cpp_out=. $< @@ -75,7 +73,7 @@ protoc --cpp_out=. 
$< %.lzz: %.lzz.clamp - clamp < $< | sed "`echo -e '1i#src\n1a#end'`" > $@ + clamp < $< | sed '1d' > $@ main.o: ser.h Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-02-19 23:02:34 UTC (rev 1203) +++ ydb/trunk/src/main.lzz.clamp 2009-02-20 01:41:23 UTC (rev 1204) @@ -28,14 +28,12 @@ #include <unistd.h> // pipe, write #include <vector> #include "ydb.pb.h" -//#define USE_PB #include "ser.h" #define function boost::function #define foreach BOOST_FOREACH #define shared_ptr boost::shared_ptr #define ref boost::ref -#define REUSE_SER using namespace boost; using namespace boost::archive; @@ -55,11 +53,8 @@ using ydb::pb::Init; using ydb::pb::Join; using ydb::pb::SockAddr; -#ifdef USE_PB using namespace ydb::pb; -#else using namespace ydb::msg; -#endif #define GETMSG(buf) \ checkeqnneg(st_read_fully(src, buf, len, timeout), (int) len); \ @@ -82,7 +77,7 @@ size_t accept_joiner_size, buf_size; bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates, count_updates, stop_on_recovery, general_txns, profile_threads, - debug_threads, multirecover, disk, debug_memory, use_wal, + debug_threads, multirecover, disk, debug_memory, use_wal, use_pb, suppress_txn_msgs, use_bcast_async, fake_bcast, force_ser, fake_exec; long long timelim, read_thresh, write_thresh; @@ -534,10 +529,14 @@ /** * Keep issuing transactions to the replicas. 
*/ +template<typename Types> void issue_txns(st_channel<replica_info> &newreps, int &seqno, st_bool &accept_joiner) { + typedef typename Types::TxnBatch TxnBatch; + typedef typename Types::Txn Txn; + typedef typename Types::Op Op; Op_OpType types[] = {Op::read, Op::write, Op::del}; vector<st_netfd_t> fds; @@ -566,7 +565,8 @@ checkeqnneg(st_write(dst, buf, len, ST_UTIME_NO_TIMEOUT), ssize_t(len)); }, buf_size); stream s(r,w); - TxnBatch batch NPBONLY((s)); + scoped_ptr<TxnBatch> pbatch(new_TxnBatch<TxnBatch>(s)); + TxnBatch batch = *pbatch; for (int t = 0; t < batch_size; ++t) batch.add_txn(); while (!stop_hub) { @@ -589,12 +589,12 @@ } // Generate some random transactions. - NPBONLY(batch.start_txn()); + start_txn(batch); for (int t = 0; t < batch_size; ++t) { Txn &txn = *batch.add_txn(); txn.set_seqno(seqno); int count = randint(min_ops, max_ops + 1); - NPBONLY(txn.start_op()); + start_op(txn); for (int o = 0; o < count; ++o) { Op *op = txn.add_op(); int rtype = general_txns ? randint(3) : 1, @@ -604,12 +604,12 @@ op->set_key(rkey); op->set_value(rvalue); } - NPBONLY(txn.fin_op()); + fin_op(txn); // Process immediately if not bcasting. if (fds.empty()) { --seqno; - process_txn(g_map, txn, seqno, nullptr); + process_txn<Types>(g_map, txn, seqno, nullptr); } // Checkpoint. @@ -643,8 +643,8 @@ ++seqno; } - NPBONLY(batch.fin_txn()); - NPBONLY(if (batch.txn_size() == 0) w.reset()); + fin_txn(batch); + if (batch.txn_size() == 0) w.reset(); // Broadcast. #ifdef USE_PB @@ -669,13 +669,13 @@ // This means "The End." w.mark(); batch.Clear(); - NPBONLY(batch.start_txn()); + start_txn(batch); Txn &txn = *batch.add_txn(); txn.set_seqno(-1); - NPBONLY(txn.start_op()); - NPBONLY(txn.fin_op()); - NPBONLY(batch.fin_txn()); - PBONLY(bcastmsg(fds, batch)); + start_op(txn); + fin_op(txn); + fin_txn(batch); + if (Types::is_pb()) bcastmsg(fds, batch); w.mark(); w.flush(); } @@ -684,9 +684,12 @@ * Process a transaction: update DB state (incl. seqno) and send response to * leader. 
*/ +template<typename Types> void -process_txn(mii &map, const Txn &txn, int &seqno, Response *res) +process_txn(mii &map, const typename Types::Txn &txn, int &seqno, Response *res) { + typedef typename Types::Txn Txn; + typedef typename Types::Op Op; //wal &wal = *g_wal; checkeq(txn.seqno(), seqno + 1); seqno = txn.seqno(); @@ -770,6 +773,17 @@ } #end +template<typename Txn> shared_ptr<ydb::pb::Txn> to_pb_Txn(Txn txn); +template<> shared_ptr<ydb::pb::Txn> to_pb_Txn(ydb::pb::Txn txn) { + return shared_ptr<ydb::pb::Txn>(new ydb::pb::Txn(txn)); +} +template<> shared_ptr<ydb::pb::Txn> to_pb_Txn(ydb::msg::Txn txn) { + shared_ptr<ydb::pb::Txn> ptxn(new ydb::pb::Txn()); + ptxn->set_seqno(txn.seqno()); + // XXX FIXME + return ptxn; +} + /** * Actually do the work of executing a transaction and sending back the reply. * @@ -795,12 +809,17 @@ * * \param[in] wal The WAL. */ +template<typename Types> void process_txns(st_netfd_t leader, mii &map, int &seqno, st_channel<shared_ptr<Recovery> > &send_states, - st_channel<shared_ptr<Txn> > &backlog, int init_seqno, + st_channel<shared_ptr<ydb::pb::Txn> > &backlog, int init_seqno, int mypos, int nnodes) { + typedef typename Types::TxnBatch TxnBatch; + typedef typename Types::Txn Txn; + typedef typename Types::Op Op; + bool caught_up = init_seqno == 0; long long start_time = current_time_millis(), time_caught_up = caught_up ? 
start_time : -1; @@ -829,7 +848,8 @@ stream s(reader, w); try { - TxnBatch batch NPBONLY((s)); + scoped_ptr<TxnBatch> pbatch(new_TxnBatch<TxnBatch>(s)); + TxnBatch batch = *pbatch; ResponseBatch resbatch; while (true) { long long before_read = -1; @@ -838,8 +858,8 @@ } { st_intr intr(stop_hub); - PBONLY(readmsg(reader, batch)); - NPBONLY(batch.Clear()); + if (Types::is_pb()) readmsg(reader, batch); + else batch.Clear(); } if (read_thresh > 0) { long long read_time = current_time_millis() - before_read; @@ -866,13 +886,14 @@ caught_up = true; } Response *res = resbatch.add_res(); - process_txn(map, txn, seqno, res); + process_txn<Types>(map, txn, seqno, res); action = "processed"; } else { if (first_seqno == -1) first_seqno = txn.seqno(); // Queue up for later processing once a snapshot has been received. - backlog.push(shared_ptr<Txn>(new Txn(txn))); + // XXX + backlog.push(to_pb_Txn(txn)); action = "backlogged"; } @@ -1158,8 +1179,9 @@ st_bool accept_joiner; int seqno = 0; st_channel<replica_info> newreps; - const function0<void> f = bind(issue_txns, ref(newreps), ref(seqno), - ref(accept_joiner)); + const function<void()> f = use_pb ? + bind(issue_txns<pb_types>, ref(newreps), ref(seqno), ref(accept_joiner)) : + bind(issue_txns<rb_types>, ref(newreps), ref(seqno), ref(accept_joiner)); st_thread_t swallower = my_spawn(bind(swallow, f), "issue_txns"); foreach (const replica_info &r, replicas) newreps.push(r); st_joining join_swallower(swallower); @@ -1302,12 +1324,15 @@ } // Process txns. - st_channel<shared_ptr<Txn> > backlog; - st_joining join_proc(my_spawn(bind(process_txns, leader, ref(map), - ref(seqno), ref(send_states), - ref(backlog), init.txnseqno(), - mypos, init.node_size()), - "process_txns")); + st_channel<shared_ptr<ydb::pb::Txn> > backlog; + const function<void()> process_fn = use_pb ? 
+ bind(process_txns<pb_types>, leader, ref(map), ref(seqno), + ref(send_states), ref(backlog), init.txnseqno(), mypos, + init.node_size()) : + bind(process_txns<rb_types>, leader, ref(map), ref(seqno), + ref(send_states), ref(backlog), init.txnseqno(), mypos, + init.node_size()); + st_joining join_proc(my_spawn(process_fn, "process_txns")); st_joining join_rec(my_spawn(bind(recover_joiner, listener, ref(map), ref(seqno), ref(send_states)), "recover_joiner")); @@ -1361,8 +1386,9 @@ int mid_seqno = seqno; while (!backlog.empty()) { + using ydb::pb::Txn; shared_ptr<Txn> p = backlog.take(); - process_txn(map, *p, seqno, nullptr); + process_txn<pb_types>(map, *p, seqno, nullptr); if (p->seqno() % chkpt == 0) { if (verbose) cout << "processed txn " << p->seqno() << " off the backlog; " @@ -1488,6 +1514,8 @@ "count operations that touch (update/read/delete) an existing key") ("general-txns,g", po::bool_switch(&general_txns), "issue read and delete transactions as well as the default of (only) insertion/update transactions (for leader only)") + ("use-pb", po::bool_switch(&use_pb), + "use protocol buffers instead of raw buffers") ("wal", po::bool_switch(&use_wal), "enable ARIES write-ahead logging") ("force-ser", po::bool_switch(&force_ser), Modified: ydb/trunk/src/ser.cc =================================================================== --- ydb/trunk/src/ser.cc 2009-02-19 23:02:34 UTC (rev 1203) +++ ydb/trunk/src/ser.cc 2009-02-20 01:41:23 UTC (rev 1204) @@ -1,17 +1,11 @@ #include "ser.h" #include <commons/st/st.h> +#include <boost/scoped_ptr.hpp> -//#define USE_PB -using ydb::msg::reader; -using ydb::msg::writer; -using ydb::msg::stream; using namespace commons; using namespace std; -#ifdef USE_PB +using namespace ydb::msg; using namespace ydb::pb; -#else -using namespace ydb::msg; -#endif const int nreps = 2; @@ -22,13 +16,18 @@ public: outstream(const vector<st_netfd_t> &dsts) : dsts(dsts) {} void operator()(const void *buf, size_t len) { + cout << "writing " << len << 
endl; foreach (st_netfd_t dst, dsts) checkeqnneg(st_write(dst, buf, len, ST_UTIME_NO_TIMEOUT), ssize_t(len)); } }; +template<typename types> void producer(st_netfd_t dst) { + typedef typename types::TxnBatch TxnBatch; + typedef typename types::Txn Txn; + typedef typename types::Op Op; vector<st_netfd_t> dsts(1, dst); outstream os(dsts); writer w(os, 90); @@ -36,48 +35,56 @@ stream s(r,w); string str; const bool show = true; - TxnBatch batch NPBONLY((s)); + scoped_ptr<TxnBatch> p(new_TxnBatch<TxnBatch>(s)); + TxnBatch &batch = *p; for (int i = 0; i < nreps; ++i) { w.mark(); batch.Clear(); - NPBONLY(batch.start_txn()); + start_txn(batch); for (int t = 0; t < 2; ++t) { Txn &txn = *batch.add_txn(); txn.set_seqno(t + 5); - NPBONLY(txn.start_op()); + start_op(txn); for (int o = 0; o < 2; ++o) { Op &op = *txn.add_op(); op.set_type (Op::del); op.set_key (3 * (o+1)); op.set_value(4 * (o+1)); } - NPBONLY(txn.fin_op()); + fin_op(txn); } - NPBONLY(batch.fin_txn()); + fin_txn(batch); if (show) cout << w.pos() << '/' << w.size() << endl; - PBONLY(check(batch.SerializeToString(&str))); + ser(batch, str); } batch.Clear(); - NPBONLY(batch.start_txn()); - NPBONLY(batch.fin_txn()); + start_txn(batch); + fin_txn(batch); w.mark(); + w.show(); w.flush(); } +template<typename types> void consumer(st_netfd_t src) { + typedef typename types::TxnBatch TxnBatch; + typedef typename types::Txn Txn; + typedef typename types::Op Op; vector<st_netfd_t> v; outstream os(v); writer w(os, 90); reader r(src); stream s(r,w); + string str; // XXX const bool show = true; - TxnBatch batch NPBONLY((s)); - for (int i = 0; i < nreps; ++i) { + scoped_ptr<TxnBatch> p(new_TxnBatch<TxnBatch>(s)); + TxnBatch &batch = *p; + while (true) { batch.Clear(); - PBONLY(check(batch.ParseFromString(str))); + parse(batch, str); if (show) cout << "ntxn " << batch.txn_size() << endl; - //if (batch.txn_size() == 0) break; + if (batch.txn_size() == 0) break; for (int t = 0; t < batch.txn_size(); ++t) { const Txn &txn = 
batch.txn(t); if (show) cout << "txn seqno " << txn.seqno() << " " << txn.seqno() << endl; @@ -98,15 +105,22 @@ int main(int argc, char **argv) { st_init(); - bool is_leader = argc == 1; + bool use_pb = argc > 1 && string("-p") == argv[1]; + bool is_leader = argc == (use_pb ? 2 : 1); if (is_leader) { st_netfd_t listener = st_tcp_listen(7654); st_netfd_t dst = checkerr(st_accept(listener, nullptr, nullptr, ST_UTIME_NO_TIMEOUT)); - producer(dst); + if (use_pb) + producer<pb_types>(dst); + else + producer<rb_types>(dst); } else { st_netfd_t src = st_tcp_connect(argv[1], 7654, ST_UTIME_NO_TIMEOUT); - consumer(src); + if (use_pb) + consumer<pb_types>(src); + else + consumer<rb_types>(src); } return 0; } Modified: ydb/trunk/src/ser.h =================================================================== --- ydb/trunk/src/ser.h 2009-02-19 23:02:34 UTC (rev 1203) +++ ydb/trunk/src/ser.h 2009-02-20 01:41:23 UTC (rev 1204) @@ -7,16 +7,6 @@ #include <iostream> #include "ydb.pb.h" -#ifdef USE_PB -#define PBSWITCH(a,b) a -#define PBONLY(x) x -#define NPBONLY(x) -#else -#define PBSWITCH(a,b) b -#define PBONLY(x) -#define NPBONLY(x) x -#endif - #define BEGIN_NAMESPACE(ns) namespace ns { #define END_NAMESPACE } @@ -75,7 +65,7 @@ void show() { cout << (void*) p_; for (size_t i = 0; i < a_.size(); ++i) - cout << " " << hex << setfill('0') << setw(2) << int(mark_[i]); + cout << " " << hex << setfill('0') << setw(2) << (int)(unsigned char)(a_.get()[i]); cout << endl; cout << (void*) p_; for (size_t i = 0; i < a_.size(); ++i) @@ -136,8 +126,8 @@ void Clear() { w_.reserve(0*50); nop_ = unset; seqno_ = unset; off_ = w_.pos(); } void set_seqno(int x) { w_.write(x); } int seqno() const { return seqno_ == unset ? 
seqno_ = r_.read<int>() : seqno_; } - void start_op() { w_.skip<typeof(nop_)>(); } - Op *add_op() { if (nop_ == unset) nop_ = 0; ++nop_; return &op_; } + void start_op() { if (nop_ == unset) nop_ = 0; w_.skip<typeof(nop_)>(); } + Op *add_op() { ++nop_; return &op_; } void fin_op() { w_.write(nop_, off_ + sizeof(int)); } int op_size() const { if (nop_ == unset) nop_ = r_.read<typeof(nop_)>(); return nop_; } const Op &op(int o) const { return op_; } @@ -155,8 +145,8 @@ public: TxnBatch(stream &s) : s_(s), r_(s.get_reader()), w_(s.get_writer()), off_(w_.pos()), txn_(s), ntxn_(unset) {} void Clear() { w_.reserve(0*100); txn_.Clear(); ntxn_ = unset; off_ = w_.pos(); } - void start_txn() { w_.skip<typeof(ntxn_)>(); } - Txn *add_txn() { if (ntxn_ == unset) ntxn_ = 0; ++ntxn_; txn_.Clear(); return &txn_; } + void start_txn() { if (ntxn_ == unset) ntxn_ = 0; w_.skip<typeof(ntxn_)>(); } + Txn *add_txn() { ++ntxn_; txn_.Clear(); return &txn_; } void fin_txn() { w_.write(ntxn_, off_); } int txn_size() const { if (ntxn_ == unset) @@ -167,8 +157,52 @@ bool AppendToString(string *s) const { throw std::exception(); } bool SerializeToString(string *s) const { throw std::exception(); } bool SerializeToOstream(ostream *s) const { throw std::exception(); } + bool ParseFromArray(void *p, size_t len) { throw std::exception(); } }; +template<typename T> void parse(T &batch, const string &str); +template<> void parse(ydb::pb::TxnBatch &batch, const string &str) { check(batch.ParseFromString(str)); } +template<> void parse(ydb::msg::TxnBatch &batch, const string &str) {} + +template<typename T> void ser(T &batch, string &str); +template<> void ser(ydb::pb::TxnBatch &batch, string &str) { check(batch.SerializeToString(&str)); } +template<> void ser(ydb::msg::TxnBatch &batch, string &str) {} + +template<typename T> void start_txn(T &batch); +template<> void start_txn(ydb::pb::TxnBatch &batch) {} +template<> void start_txn(ydb::msg::TxnBatch &batch) { batch.start_txn(); } + +template<typename 
T> void fin_txn(T &batch); +template<> void fin_txn(ydb::pb::TxnBatch &batch) {} +template<> void fin_txn(ydb::msg::TxnBatch &batch) { batch.fin_txn(); } + +template<typename T> void start_op(T &txn); +template<> void start_op(ydb::pb::Txn &txn) {} +template<> void start_op(ydb::msg::Txn &txn) { txn.start_op(); } + +template<typename T> void fin_op(T &txn); +template<> void fin_op(ydb::pb::Txn &txn) {} +template<> void fin_op(ydb::msg::Txn &txn) { txn.fin_op(); } + +template<typename T> T *new_TxnBatch(stream &s); +template<> ydb::pb::TxnBatch *new_TxnBatch(stream &s) { return new ydb::pb::TxnBatch(); } +template<> ydb::msg::TxnBatch *new_TxnBatch(stream &s) { return new ydb::msg::TxnBatch(s); } + +struct pb_types { + typedef ydb::pb::TxnBatch TxnBatch; + typedef ydb::pb::Txn Txn; + typedef ydb::pb::Op Op; + static bool is_pb() { return true; } +}; + +// rb = raw buffer +struct rb_types { + typedef ydb::msg::TxnBatch TxnBatch; + typedef ydb::msg::Txn Txn; + typedef ydb::msg::Op Op; + static bool is_pb() { return false; } +}; + END_NAMESPACE END_NAMESPACE This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2009-02-20 06:22:10

Revision: 1206 http://assorted.svn.sourceforge.net/assorted/?rev=1206&view=rev Author: yangzhang Date: 2009-02-20 06:22:02 +0000 (Fri, 20 Feb 2009) Log Message: ----------- - removed the parse/ser calls - renamed ydb.o -> ydb.pb.o - got pb path working in ser - added notes to reuse serialization buffers for pb path in ydb - added some more notes/todos Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/Makefile ydb/trunk/src/main.lzz.clamp ydb/trunk/src/ser.cc ydb/trunk/src/ser.h Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-20 05:52:31 UTC (rev 1205) +++ ydb/trunk/README 2009-02-20 06:22:02 UTC (rev 1206) @@ -398,6 +398,8 @@ - TODO try making a streambuf for st_write, then try it in conj with struct-less pb - DONE dynamic switch between pb and zero-copy +- TODO fix pb recovery +- TODO implement new recovery (add buffer swapping, add buffers to a list) - TODO async (threaded) wal - TODO 0-node 0-copy (don't need to use threads, just process each batch immed) - TODO google dense hash map Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2009-02-20 05:52:31 UTC (rev 1205) +++ ydb/trunk/src/Makefile 2009-02-20 06:22:02 UTC (rev 1206) @@ -9,7 +9,7 @@ PBS := $(wildcard *.proto) PBHDRS := $(foreach pb,$(PBS),$(patsubst %.proto,%.pb.h,$(pb))) PBSRCS := $(foreach pb,$(PBS),$(patsubst %.proto,%.pb.cc,$(pb))) -PBOBJS := $(foreach pb,$(PBS),$(patsubst %.proto,%.o,$(pb))) +PBOBJS := $(foreach pb,$(PBS),$(patsubst %.proto,%.pb.o,$(pb))) GENHDRS := $(LZZHDRS) $(PBHDRS) GENSRCS := $(LZZSRCS) $(PBSRCS) @@ -56,12 +56,12 @@ $(TARGET): $(OBJS) $(LINK.o) $^ $(LOADLIBES) $(LDLIBS) -o $@ +%.pb.o: %.pb.cc %.pb.h + $(CXX) -c $(PBCXXFLAGS) $(OUTPUT_OPTION) $< + %.o: %.cc $(PBHDRS) $(COMPILE.cc) $(OUTPUT_OPTION) $< -%.o: %.pb.cc %.pb.h - $(CXX) -c $(PBCXXFLAGS) $(OUTPUT_OPTION) $< - %.cc %.hh: %.lzz lzz -hx hh -sx cc -hl -sl 
-hd -sd $< python -c 'pars = file("lambda_impl.clamp_h").read().split("\n\n"); hh = file("main.hh").read(); print >> file("main.cc", "a"), pars[-1]; print >> file("main.hh", "w"), "\n\n".join(pars[:-1] + [hh])' @@ -98,7 +98,7 @@ ### -serperf: serperf.o ydb.o +serperf: serperf.o ydb.pb.o $(LINK.o) $^ $(LOADLIBES) $(LDLIBS) $(OUTPUT_OPTION) # serperf.cc ydb.pb.h @@ -106,5 +106,5 @@ p2: p2.cc $(LINK.cc) $^ $(LOADLIBES) $(LDLIBS) $(OUTPUT_OPTION) -ser: ser.cc ser.h ydb.o +ser: ser.cc ser.h ydb.pb.o $(LINK.cc) $^ $(LOADLIBES) $(LDLIBS) $(OUTPUT_OPTION) Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-02-20 05:52:31 UTC (rev 1205) +++ ydb/trunk/src/main.lzz.clamp 2009-02-20 06:22:02 UTC (rev 1206) @@ -377,7 +377,7 @@ */ template<typename T> void -bcastmsg_sync(const vector<st_netfd_t> &dsts, const T &msg) +bcastmsg_sync(const vector<st_netfd_t> &dsts, const T &msg /*, ser_t &s */) { ser_t s; ser(s, msg); @@ -394,7 +394,7 @@ */ template<typename T> void -bcastmsg(const vector<st_netfd_t> &dsts, const T &msg) +bcastmsg(const vector<st_netfd_t> &dsts, const T &msg /* XXX optimize this , ser_t &s */) { if (use_bcast_async) bcastmsg_async(dsts, msg); else bcastmsg_sync(dsts, msg); @@ -407,6 +407,7 @@ void sendmsg(st_netfd_t dst, const T &msg) { + // XXX optimize this vector<st_netfd_t> dsts(1, dst); bcastmsg(dsts, msg); } Modified: ydb/trunk/src/ser.cc =================================================================== --- ydb/trunk/src/ser.cc 2009-02-20 05:52:31 UTC (rev 1205) +++ ydb/trunk/src/ser.cc 2009-02-20 06:22:02 UTC (rev 1206) @@ -22,6 +22,17 @@ } }; +template<typename TxnBatch> +void push(TxnBatch &batch, string &str, outstream &os) { + str.clear(); + uint32_t len = 0; + str.append(sizeof len, '\0'); + check(batch.AppendToString(&str)); + len = str.size() - sizeof len; + copy((char*) &len, (char*) &len + sizeof len, str.begin()); + os(str.data(), str.size()); +} + 
template<typename types> void producer(st_netfd_t dst) { @@ -55,7 +66,7 @@ } fin_txn(batch); if (show) cout << w.pos() << '/' << w.size() << endl; - ser(batch, str); + if (types::is_pb()) push(batch, str, os); } batch.Clear(); start_txn(batch); @@ -63,6 +74,7 @@ w.mark(); w.show(); w.flush(); + if (types::is_pb()) push(batch, str, os); } template<typename types> @@ -81,8 +93,13 @@ scoped_ptr<TxnBatch> p(new_TxnBatch<TxnBatch>(s)); TxnBatch &batch = *p; while (true) { - batch.Clear(); - parse(batch, str); + if (types::is_pb()) { + uint32_t len = r.read<uint32_t>(); + managed_array<char> a = r.read(len); + check(batch.ParseFromArray(a.get(), len)); + } else { + batch.Clear(); + } if (show) cout << "ntxn " << batch.txn_size() << endl; if (batch.txn_size() == 0) break; for (int t = 0; t < batch.txn_size(); ++t) { @@ -107,6 +124,7 @@ st_init(); bool use_pb = argc > 1 && string("-p") == argv[1]; bool is_leader = argc == (use_pb ? 2 : 1); + cout << "use_pb " << use_pb << " is_leader " << is_leader << endl; if (is_leader) { st_netfd_t listener = st_tcp_listen(7654); st_netfd_t dst = checkerr(st_accept(listener, nullptr, nullptr, @@ -116,7 +134,7 @@ else producer<rb_types>(dst); } else { - st_netfd_t src = st_tcp_connect(argv[1], 7654, ST_UTIME_NO_TIMEOUT); + st_netfd_t src = st_tcp_connect(argv[use_pb ? 
2 : 1], 7654, ST_UTIME_NO_TIMEOUT); if (use_pb) consumer<pb_types>(src); else Modified: ydb/trunk/src/ser.h =================================================================== --- ydb/trunk/src/ser.h 2009-02-20 05:52:31 UTC (rev 1205) +++ ydb/trunk/src/ser.h 2009-02-20 06:22:02 UTC (rev 1206) @@ -160,14 +160,6 @@ bool ParseFromArray(void *p, size_t len) { throw std::exception(); } }; -template<typename T> void parse(T &batch, const string &str); -template<> void parse(ydb::pb::TxnBatch &batch, const string &str) { check(batch.ParseFromString(str)); } -template<> void parse(ydb::msg::TxnBatch &batch, const string &str) {} - -template<typename T> void ser(T &batch, string &str); -template<> void ser(ydb::pb::TxnBatch &batch, string &str) { check(batch.SerializeToString(&str)); } -template<> void ser(ydb::msg::TxnBatch &batch, string &str) {} - template<typename T> void start_txn(T &batch); template<> void start_txn(ydb::pb::TxnBatch &batch) {} template<> void start_txn(ydb::msg::TxnBatch &batch) { batch.start_txn(); } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2009-02-20 19:44:26
Revision: 1211 http://assorted.svn.sourceforge.net/assorted/?rev=1211&view=rev Author: yangzhang Date: 2009-02-20 19:44:17 +0000 (Fri, 20 Feb 2009) Log Message: ----------- - falling back to pb for 0-node cases - removed swallower - using google dense_hash_map - optimized/simplified the function pointers - fixed the accidental total exclusion of local (0-node) processing - using operation_not_supported exceptions Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/main.lzz.clamp ydb/trunk/src/ser.h Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-20 19:37:33 UTC (rev 1210) +++ ydb/trunk/README 2009-02-20 19:44:17 UTC (rev 1211) @@ -388,10 +388,17 @@ - 3: 122K/122K/97K - DONE commit - DONE add zero-copy structs/(de-)serialization + - -1: 245K (same as before; actually using pb's) + - 0: 320K (same as before; actually using pb's) + - 1: 300K + - 2: 300K + - 3: 300K Period 2/17- - DONE removed class outstream +- TODO get raw-buffer working in wal, 0-node +- TODO add raw-buffer versions of the response message classes as well - TODO refactor st_reader, etc. 
to be generic opportunistic buffered readers - TODO see how streambuf read/write is actually implemented (whether it's too slow) @@ -402,8 +409,15 @@ - TODO implement new recovery (add buffer swapping, add buffers to a list) - TODO async (threaded) wal - TODO 0-node 0-copy (don't need to use threads, just process each batch immed) -- TODO google dense hash map +- DONE google dense hash map + - big improvement, again not in the direction we'd like + - 0: 550K + - 1: 490K + - 2: 485K + - 3: 475K +- TODO reuse the serialization buffer in the pb path of ydb + - TODO show aries-write - TODO checkpointing + replaying log from replicas (not from disk) - TODO scale-up on multicore Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-02-20 19:37:33 UTC (rev 1210) +++ ydb/trunk/src/main.lzz.clamp 2009-02-20 19:44:17 UTC (rev 1211) @@ -63,13 +63,16 @@ check(msg.ParseFromArray(buf, len)); #end -#define map_t unordered_map +//#define map_t unordered_map //#define map_t map -//#define map_t dense_hash_map +#define map_t dense_hash_map typedef pair<int, int> pii; typedef map_t<int, int> mii; typedef string ser_t; +template<typename T> void init_map(T &map) {} +template<> void init_map(dense_hash_map<int, int> &map) { map.set_empty_key(-1); map.set_deleted_key(-2); } + // Configuration. st_utime_t timeout; int chkpt, accept_joiner_seqno, issuing_interval, min_ops, max_ops, @@ -550,23 +553,17 @@ }); reader r(nullptr); - //function<void(const void*, size_t)> fn = use_wal ? - // lambda(const void *buf, size_t len) { g_wal->logbuf(buf, len); } : - // lambda(const void *buf, size_t len) { - // }; - //if (use_wal) fn = lambda(const void *buf, size_t len) {}; - //else fn = lambda(const void *buf, size_t len) { g_wal->logbuf(buf, len); }; - // TODO why doesn't this work? 
- // else fn = boost::bind(&wal::logbuf, g_wal); + function<void(const void*, size_t)> fn; + if (use_wal) + fn = boost::bind(&wal::logbuf, g_wal, _1, _2); + else + fn = lambda(const void *buf, size_t len) { + foreach (st_netfd_t dst, __ref(fds)) + checkeqnneg(st_write(dst, buf, len, ST_UTIME_NO_TIMEOUT), + static_cast<ssize_t>(len)); + }; - writer w(lambda(const void *buf, size_t len) { - if (__ref(use_wal)) - g_wal->logbuf(buf, len); - else - foreach (st_netfd_t dst, __ref(fds)) - checkeqnneg(st_write(dst, buf, len, ST_UTIME_NO_TIMEOUT), - static_cast<ssize_t>(len)); - }, buf_size); + writer w(fn, buf_size); stream s(r,w); scoped_ptr<TxnBatch> pbatch(new_TxnBatch<TxnBatch>(s)); TxnBatch batch = *pbatch; @@ -651,7 +648,7 @@ if (batch.txn_size() == 0) w.reset(); // Broadcast. - if (!fds.empty() && !suppress_txn_msgs) { + if (Types::is_pb() && !fds.empty() && !suppress_txn_msgs) { bcastmsg(fds, batch); } else if (use_wal) { g_wal->log(batch); @@ -846,7 +843,9 @@ st_reader reader(leader); vector<st_netfd_t> leader_v(1, leader); - writer w(lambda(const void*, size_t) { throw std::exception(); }, buf_size); + writer w(lambda(const void*, size_t) { + throw operation_not_supported("process_txns should not be writing"); + }, buf_size); stream s(reader, w); try { @@ -1184,9 +1183,9 @@ const function<void()> f = use_pb ? bind(issue_txns<pb_types>, ref(newreps), ref(seqno), ref(accept_joiner)) : bind(issue_txns<rb_types>, ref(newreps), ref(seqno), ref(accept_joiner)); - st_thread_t swallower = my_spawn(bind(swallow, f), "issue_txns"); + st_thread_t issue_txns_thread = my_spawn(f, "issue_txns"); foreach (const replica_info &r, replicas) newreps.push(r); - st_joining join_swallower(swallower); + st_joining join_issue_txns(issue_txns_thread); finally fin(lambda () { cout << "LEADER SUMMARY" << endl; @@ -1264,7 +1263,7 @@ } // Initialize database state. 
- mii map; + mii &map = g_map; int seqno = -1; finally f(lambda () { cout << "REPLICA SUMMARY" << endl; @@ -1586,7 +1585,7 @@ check(max_ops > 0); check(max_ops >= min_ops); - if (minreps == 0) use_pb = true; // XXX + if (minreps == 0 && !use_wal) use_pb = true; // XXX } catch (std::exception &ex) { cerr << ex.what() << endl << endl << desc << endl; return 1; @@ -1663,6 +1662,9 @@ } }); + // Initialize the map. + init_map(g_map); + // Which role are we? if (is_leader) { run_leader(minreps, leader_port); @@ -1682,5 +1684,4 @@ * Compile-time options: * * - map, unordered_map, dense_hash_map - * - SERIALIZATION METHOD */ Modified: ydb/trunk/src/ser.h =================================================================== --- ydb/trunk/src/ser.h 2009-02-20 19:37:33 UTC (rev 1210) +++ ydb/trunk/src/ser.h 2009-02-20 19:44:17 UTC (rev 1211) @@ -2,6 +2,7 @@ #define YDB_MSG_H #include <commons/array.h> +#include <commons/exceptions.h> #include <commons/st/st.h> #include <iomanip> #include <iostream> @@ -36,7 +37,7 @@ assert(size_t(p - mark_ + n) <= a_.size()); flush(); size_t diff = mark_ - a_.get(); - memmove(a_.get(), mark_, diff); + memmove(a_.get(), mark_, p_ - mark_); unsent_ = mark_ = a_.get(); p_ -= diff; p -= diff; @@ -154,12 +155,12 @@ return ntxn_; } const Txn &txn(int t) const { txn_.Clear(); return txn_; } - bool AppendToString(string *s) const { throw std::exception(); } - bool SerializeToString(string *s) const { throw std::exception(); } - bool SerializeToOstream(ostream *s) const { throw std::exception(); } - bool ParseFromArray(void *p, size_t len) { throw std::exception(); } - size_t GetCachedSize() const { throw std::exception(); } - size_t ByteSize() const { throw std::exception(); } + bool AppendToString(string *s) const { throw_operation_not_supported(); } + bool SerializeToString(string *s) const { throw_operation_not_supported(); } + bool SerializeToOstream(ostream *s) const { throw_operation_not_supported(); } + bool ParseFromArray(void *p, size_t len) { 
throw_operation_not_supported(); } + size_t GetCachedSize() const { throw_operation_not_supported(); } + size_t ByteSize() const { throw_operation_not_supported(); } }; template<typename T> void start_txn(T &batch); This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2009-02-23 04:49:57
Revision: 1226 http://assorted.svn.sourceforge.net/assorted/?rev=1226&view=rev Author: yangzhang Date: 2009-02-23 04:49:55 +0000 (Mon, 23 Feb 2009) Log Message: ----------- - removed USE_PB/PB - simplified setup-ydb - further tweaked warnings - added TODO Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/Makefile ydb/trunk/tools/test.bash Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-23 04:37:33 UTC (rev 1225) +++ ydb/trunk/README 2009-02-23 04:49:55 UTC (rev 1226) @@ -410,6 +410,8 @@ - TODO async (threaded) wal - TODO 0-node 0-copy (don't need to use threads, just process each batch immed) +- TODO see how p2 compares with ydb + - DONE google dense hash map - big improvement, again not in the direction we'd like - 0: 550K Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2009-02-23 04:37:33 UTC (rev 1225) +++ ydb/trunk/src/Makefile 2009-02-23 04:49:55 UTC (rev 1226) @@ -33,9 +33,6 @@ else OPT := -g3 endif -ifneq ($(PB),) - PB := -DUSE_PB -endif # CXX := $(WTF) ag++ -k --Xcompiler # $(CXX) CXX := $(WTF) $(CXX) LDFLAGS := -pthread $(GPROF) @@ -55,7 +52,6 @@ -Winit-self \ -Wswitch-enum \ -Wunused \ - -Wstrict-overflow \ -Wfloat-equal \ -Wundef \ -Wunsafe-loop-optimizations \ @@ -71,22 +67,23 @@ -Wmissing-format-attribute \ -Wpacked \ -Wredundant-decls \ - -Winline \ -Winvalid-pch \ -Wlong-long \ -Wvolatile-register-var \ -std=gnu++0x \ - $(PB) \ $(CXXFLAGS) - # -Wmissing-noreturn \ - # -Weffc++ \ - # -pedantic \ - # -Wshadow \ - # -Wswitch-default \ - # -Wpadded \ - # -Wunreachable-code \ - # -Wstack-protector \ + # \ + -Wmissing-noreturn \ + -Weffc++ \ + -pedantic \ + -Wshadow \ + -Wswitch-default \ + -Wpadded \ + -Wunreachable-code \ + -Wstack-protector \ + -Wstrict-overflow \ + -Winline \ PBCXXFLAGS := $(OPT) -Wall -Werror $(GPROF) Modified: ydb/trunk/tools/test.bash 
=================================================================== --- ydb/trunk/tools/test.bash 2009-02-23 04:37:33 UTC (rev 1225) +++ ydb/trunk/tools/test.bash 2009-02-23 04:49:55 UTC (rev 1226) @@ -181,26 +181,15 @@ toast --quiet arm 'http://google-sparsehash.googlecode.com/files/sparsehash-1.4.tar.gz' } -node-setup-ydb-1() { +node-setup-ydb() { check-remote - if [[ ! -L ~/ydb ]] - then ln -s ~/work/assorted/ydb/trunk ~/ydb - fi - if [[ ! -L ~/ccom ]] - then ln -s ~/work/assorted/cpp-commons/trunk ~/ccom - fi -} - -node-setup-ydb-2() { - check-remote cd ~/ccom/ ./setup.bash -d -p ~/.local/pkg/cpp-commons refresh-local cd ~/ydb/src make clean - # PB=1 PPROF=1 OPT=1 make WTF= PPROF=1 OPT=1 make WTF= - # PPROF=1 OPT=1 make WTF= p2 + PPROF=1 OPT=1 make WTF= p2 } init-setup() { @@ -240,13 +229,13 @@ } setup-ydb() { - parremote node-setup-ydb-1 - rm -rf /tmp/{ydb,ccom}-src/ - svn export ~/ydb/src /tmp/ydb-src/ - svn export ~/ccom/src /tmp/ccom-src/ - parscp -r /tmp/ydb-src/* ^:ydb/src/ - parscp -r /tmp/ccom-src/* ^:ccom/src/ - parremote node-setup-ydb-2 + parssh mkdir -p ydb/ ccom/ + rm -rf /tmp/{ydb,ccom}-export/ + svn export ~/work/assorted/ydb/trunk/ /tmp/ydb-export/ + svn export ~/work/assorted/cpp-commons/trunk/ /tmp/ccom-export/ + parscp -r /tmp/ydb-export/* ^:ydb/ + parscp -r /tmp/ccom-export/* ^:ccom/ + parremote node-setup-ydb } setup-stperf() { This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2009-02-23 23:55:33
Revision: 1228 http://assorted.svn.sourceforge.net/assorted/?rev=1228&view=rev Author: yangzhang Date: 2009-02-23 23:55:26 +0000 (Mon, 23 Feb 2009) Log Message: ----------- - fixed fake-exec bug with raw-buf (skipping op_size * Op_Size) - use dense_hash_map in p2 - added --fake-exec, --thresh to p2 (with thresh < 0 ==> no thresh) - added random keys/values to p2 to really build up a map - reintroduced -Wold-style-cast - added more notes/todos - added extraargs to p2() in test.bash Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/Makefile ydb/trunk/src/main.lzz.clamp ydb/trunk/src/p2.cc ydb/trunk/src/ser.h ydb/trunk/tools/test.bash Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-23 23:54:23 UTC (rev 1227) +++ ydb/trunk/README 2009-02-23 23:55:26 UTC (rev 1228) @@ -397,6 +397,36 @@ Period 2/17- - DONE removed class outstream +- DONE dynamic switch between pb and zero-copy + +- DONE google dense hash map + - big improvement, again not in the direction we'd like + - 0: 550K + - 1: 490K + - 2: 485K + - 3: 475K +- DONE try again fake-exec + - WHOA! major gains + - 0: 1.9M + - 1: 1.5M + - 2: 1M + - 3: 657K + +- DONE see how p2 compares with ydb + - as before, 2.6M +- DONE try adding dense hash map to p2 + - some benefit, 2.9M +- DONE try adding randint to p2 + - huge negative impact! down to 505K + - almost slow as ydb??? sign that i should stop trying to opt ydb :) +- DONE see whether the rand inefficiency is coming from rand or from the random + map manip + - definitely the random map manip + - randspeed in rand-dist shows we get 8.8M rand/s for commons::posix_rand or + 9.9M rand/s for random() +- DONE see how fast p2 runs with fake exec + - back up, 2.75M + - TODO get raw-buffer working in wal, 0-node - TODO add raw-buffer versions of the response message classes as well - TODO refactor st_reader, etc. 
to be generic opportunistic buffered readers @@ -404,20 +434,11 @@ slow) - TODO try making a streambuf for st_write, then try it in conj with struct-less pb -- DONE dynamic switch between pb and zero-copy - TODO fix pb recovery - TODO implement new recovery (add buffer swapping, add buffers to a list) - TODO async (threaded) wal - TODO 0-node 0-copy (don't need to use threads, just process each batch immed) -- TODO see how p2 compares with ydb - -- DONE google dense hash map - - big improvement, again not in the direction we'd like - - 0: 550K - - 1: 490K - - 2: 485K - - 3: 475K - TODO reuse the serialization buffer in the pb path of ydb - TODO show aries-write Modified: ydb/trunk/src/Makefile =================================================================== --- ydb/trunk/src/Makefile 2009-02-23 23:54:23 UTC (rev 1227) +++ ydb/trunk/src/Makefile 2009-02-23 23:55:26 UTC (rev 1228) @@ -45,7 +45,7 @@ -Werror \ -Wextra \ -Wstrict-null-sentinel \ - -Wno-old-style-cast \ + -Wold-style-cast \ -Woverloaded-virtual \ -Wsign-promo \ -Wformat=2 \ Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-02-23 23:54:23 UTC (rev 1227) +++ ydb/trunk/src/main.lzz.clamp 2009-02-23 23:55:26 UTC (rev 1228) @@ -889,6 +889,9 @@ } Response *res = resbatch.add_res(); process_txn<Types>(map, txn, seqno, res); + if (!Types::is_pb()) { + reader.skip(txn.op_size() * Op_Size); + } action = "processed"; } else { if (first_seqno == -1) Modified: ydb/trunk/src/p2.cc =================================================================== --- ydb/trunk/src/p2.cc 2009-02-23 23:54:23 UTC (rev 1227) +++ ydb/trunk/src/p2.cc 2009-02-23 23:55:26 UTC (rev 1228) @@ -7,6 +7,7 @@ #include <commons/sockets.h> #include <commons/time.h> #include <exception> +#include <google/dense_hash_map> #include <iostream> #include <set> #include <string> @@ -14,6 +15,7 @@ #include <tr1/unordered_map> #include <vector> using namespace 
commons; +using namespace google; using namespace std; using namespace tr1; #define foreach BOOST_FOREACH @@ -24,15 +26,16 @@ ++c; \ t += current_time_millis() - start_time; -int bufsize = 1e8, chkpt = 1e4, batch_size = 1e4, thresh = 1e6; -bool verbose = true; +int bufsize = 1e8, chkpt = 1e4, batch_size, thresh; +bool fake_exec, verbose; long long start = 0, seltime = 0, readtime = 0, writetime = 0; int selcnt = 0, readcnt = 0, writecnt = 0; typedef managed_array<char> arr; arr mkarr(char *p = nullptr) { return arr(p, false); } -typedef unordered_map<int, int> map_t; +//typedef unordered_map<int, int> map_t; +typedef dense_hash_map<int, int> map_t; fd_set rfds, wfds, efds; @@ -187,8 +190,8 @@ buf_ = a; if (buf_ == nullptr) return; for (uint32_t i = 0; i < npairs; ++i) { - writeint(1); - writeint(2); + writeint(randint()); + writeint(randint()); } w_.write(); } @@ -207,7 +210,9 @@ long long start_; public: - replica(int fd) : fd_(fd), r_(fd), counter_(0), readcount_(0), start_(current_time_millis()) {} + replica(int fd) : fd_(fd), r_(fd), counter_(0), readcount_(0), start_(current_time_millis()) { + map_.set_empty_key(-1); + } int fd() { return fd_; } uint32_t readint() { @@ -228,11 +233,11 @@ for (uint32_t i = 0; i < npairs; ++i) { uint32_t k = readint(); uint32_t v = readint(); - map_[k] = v; + if (!fake_exec) map_[k] = v; ++counter_; if (counter_ % chkpt == 0) { //if (verbose) cout << current_time_millis() << ": count " << counter_ << endl; - if (counter_ > thresh) { + if (counter_ > thresh || thresh < 0) { long long end = current_time_millis(); double rate = counter_ / double(end - start_) * 1000; cout << "rate " << rate << " pairs/s " << rate / 5 << " tps; readcount " << readcount_ << endl; @@ -261,10 +266,12 @@ ("help,h", "show this help message") ("leader,l", po::bool_switch(&is_leader), "leader") ("verbose,v",po::bool_switch(&verbose), "verbose") + ("fake-exec",po::bool_switch(&fake_exec), "fake-exec") ("host,H", 
po::value<string>(&host)->default_value(string("localhost")), "hostname or address of the leader") - ("batch,b", po::value<int>(&batch_size)->default_value(1e4), "batch size"); + ("batch,b", po::value<int>(&batch_size)->default_value(1e4), "batch size") + ("thresh,X", po::value<int>(&thresh)->default_value(1e7), "thresh"); po::variables_map vm; try { po::store(po::parse_command_line(argc, argv, desc), vm); @@ -287,8 +294,8 @@ FD_ZERO(&efds); int srv = is_leader ? tcp_listen(7654, true) : -1; - int cli = is_leader ? -1 : tcp_connect(host.c_str(), 7654); - if (cli >= 0) checknnegerr(fcntl(cli, F_SETFL, O_NONBLOCK | fcntl(cli, F_GETFL, 0))); + int cli = is_leader ? + -1 : set_non_blocking(tcp_connect(host.c_str(), 7654)); int nfds = max(srv, cli); if (srv >= 0) FD_SET(srv, &rfds); if (cli >= 0) FD_SET(cli, &rfds); @@ -305,10 +312,7 @@ if (srv >= 0 && FD_ISSET(srv, &rfds)) { if (start == 0) { start = current_time_millis(); seltime = 0; } cout << "accept" << endl; - int r = checknnegerr(accept(srv, nullptr, nullptr)); - cout << fcntl(r, F_GETFL, 0) << ' '; - checknnegerr(fcntl(r, F_SETFL, O_NONBLOCK | fcntl(r, F_GETFL, 0))); - cout << fcntl(r, F_GETFL, 0) << endl; + int r = set_non_blocking(checknnegerr(accept(srv, nullptr, nullptr))); rs.push_back(new replica_channel(r)); nfds = max(nfds, r); FD_SET(r, &wfds); Modified: ydb/trunk/src/ser.h =================================================================== --- ydb/trunk/src/ser.h 2009-02-23 23:54:23 UTC (rev 1227) +++ ydb/trunk/src/ser.h 2009-02-23 23:55:26 UTC (rev 1228) @@ -113,6 +113,8 @@ int value() const { return r_.read<int>(); } }; +const size_t Op_Size = sizeof(char) + sizeof(int) + sizeof(int); + class Txn { private: Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-02-23 23:54:23 UTC (rev 1227) +++ ydb/trunk/tools/test.bash 2009-02-23 23:55:26 UTC (rev 1228) @@ -512,15 +512,15 @@ p2-helper() { local leader="$1" shift - 
tagssh "$leader" "ydb/src/p2 -l | tail" & + tagssh "$leader" "ydb/src/p2 -l ${extraargs:-}" & sleep .1 { while (( $# > 0 )) ; do - tagssh "$1" "ydb/src/p2 -H $leader | tail" & + tagssh "$1" "ydb/src/p2 -H $leader ${extraargs:-}" & shift done time wait - } 2>&1 | fgrep real + } 2>&1 } p2() { This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <yan...@us...> - 2009-02-24 08:40:23
Revision: 1229 http://assorted.svn.sourceforge.net/assorted/?rev=1229&view=rev Author: yangzhang Date: 2009-02-24 08:40:14 +0000 (Tue, 24 Feb 2009) Log Message: ----------- - added raw-buf counterparts to the response msg types, with empty-only Responses - removed macros GETMSG, GETSA; cleanup - fixed always-fake-execing bug - fixed response sending bugs: sendmsg, marking, start_res/fin_res - refactored some more macros in ser.h - added --use-pb-res to distinguish between serialization methods of txns and responses - test.bash: just build ydb - added notes/todos Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/main.lzz.clamp ydb/trunk/src/ser.cc ydb/trunk/src/ser.h ydb/trunk/tools/test.bash Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-23 23:55:26 UTC (rev 1228) +++ ydb/trunk/README 2009-02-24 08:40:14 UTC (rev 1229) @@ -427,8 +427,17 @@ - DONE see how fast p2 runs with fake exec - back up, 2.75M +- DONE add raw-buffer versions of the response message classes as well + - slight increase in speed + - 1: 518K + - 2: 505K + - 3: 485K +- DONE recap + - n: rb-rb rb-pb pb-rb pb-pb + - 1: 518K 467K 359K 333K + - 2: 505K 470K 350K 333K + - 3: 485K 465K 335K 333K - TODO get raw-buffer working in wal, 0-node -- TODO add raw-buffer versions of the response message classes as well - TODO refactor st_reader, etc. 
to be generic opportunistic buffered readers - TODO see how streambuf read/write is actually implemented (whether it's too slow) Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-02-23 23:55:26 UTC (rev 1228) +++ ydb/trunk/src/main.lzz.clamp 2009-02-24 08:40:14 UTC (rev 1229) @@ -42,25 +42,9 @@ using namespace std; using namespace std::tr1; using namespace testing; - -using ydb::msg::reader; -using ydb::msg::writer; -using ydb::msg::stream; -using ydb::pb::ResponseBatch; -using ydb::pb::Response; -using ydb::pb::Recovery; -using ydb::pb::Recovery_Pair; -using ydb::pb::Init; -using ydb::pb::Join; -using ydb::pb::SockAddr; +using namespace ydb; using namespace ydb::pb; using namespace ydb::msg; - -#define GETMSG(buf) \ -checkeqnneg(st_read_fully(src, buf, len, timeout), int(len)); \ -if (stop_time != nullptr) \ - *stop_time = current_time_millis(); \ -check(msg.ParseFromArray(buf, len)); #end //#define map_t unordered_map @@ -71,7 +55,10 @@ typedef string ser_t; template<typename T> void init_map(T &map) {} -template<> void init_map(dense_hash_map<int, int> &map) { map.set_empty_key(-1); map.set_deleted_key(-2); } +template<> void init_map(dense_hash_map<int, int> &map) { + map.set_empty_key(-1); + map.set_deleted_key(-2); +} // Configuration. st_utime_t timeout; @@ -80,7 +67,7 @@ size_t accept_joiner_size, buf_size; bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates, count_updates, stop_on_recovery, general_txns, profile_threads, - debug_threads, multirecover, disk, debug_memory, use_wal, use_pb, + debug_threads, multirecover, disk, debug_memory, use_wal, use_pb, use_pb_res, suppress_txn_msgs, use_bcast_async, fake_bcast, force_ser, fake_exec; long long timelim, read_thresh, write_thresh; @@ -220,13 +207,10 @@ st_netfd_t fd() const { return fd_; } /** The port on which the replica is listening. 
*/ uint16_t port() const { return port_; } -#hdr -#define GETSA sockaddr_in sa; sockaddr(sa); return sa -#end /** The port on which the replica connected to us. */ - uint16_t local_port() const { GETSA.sin_port; } - uint32_t host() const { GETSA.sin_addr.s_addr; } - sockaddr_in sockaddr() const { GETSA; } + uint16_t local_port() const { return sockaddr().sin_port; } + uint32_t host() const { return sockaddr().sin_addr.s_addr; } + sockaddr_in sockaddr() const { sockaddr_in sa; sockaddr(sa); return sa; } void sockaddr(sockaddr_in &sa) const { socklen_t salen = sizeof sa; check0x(getpeername(st_netfd_fileno(fd_), @@ -398,7 +382,7 @@ * was chosen (sync or async). */ template<typename T> -void +inline void bcastmsg(const vector<st_netfd_t> &dsts, const T &msg /* XXX optimize this , ser_t &s */) { if (use_bcast_async) bcastmsg_async(dsts, msg); @@ -409,7 +393,7 @@ * Send a message to a single recipient. */ template<typename T> -void +inline void sendmsg(st_netfd_t dst, const T &msg) { // XXX optimize this @@ -456,15 +440,15 @@ *start_time = current_time_millis(); len = ntohl(len); - // Parse the message body. - if (len <= 4096) { - char buf[4096]; - GETMSG(buf); - } else { - //cout << "receiving large msg; heap-allocating " << len << " bytes" << endl; - scoped_array<char> buf(new char[len]); - GETMSG(buf.get()); - } + // Parse the message body. Try stack-allocation if possible. 
+ scoped_array<char> sbuf; + char *buf; + if (len <= 4096) buf = reinterpret_cast<char*>(alloca(len)); + else sbuf.reset(buf = new char[len]); + checkeqnneg(st_read_fully(src, buf, len, timeout), int(len)); + if (stop_time != nullptr) + *stop_time = current_time_millis(); + check(msg.ParseFromArray(buf, len)); return len; } @@ -556,7 +540,7 @@ reader r(nullptr); function<void(const void*, size_t)> fn; if (use_wal) - fn = boost::bind(&wal::logbuf, g_wal, _1, _2); + fn = bind(&wal::logbuf, g_wal, _1, _2); else fn = lambda(const void *buf, size_t len) { foreach (st_netfd_t dst, __ref(fds)) @@ -610,7 +594,7 @@ // Process immediately if not bcasting. if (fds.empty()) { --seqno; - process_txn<Types>(g_map, txn, seqno, nullptr); + process_txn<Types, pb_types>(g_map, txn, seqno, nullptr); w.reset(); } @@ -684,18 +668,19 @@ * Process a transaction: update DB state (incl. seqno) and send response to * leader. */ -template<typename Types> +template<typename Types, typename RTypes> void -process_txn(mii &map, const typename Types::Txn &txn, int &seqno, Response *res) +process_txn(mii&map, const typename Types::Txn &txn, int &seqno, + typename RTypes::Response *res) { typedef typename Types::Txn Txn; typedef typename Types::Op Op; - //wal &wal = *g_wal; checkeq(txn.seqno(), seqno + 1); seqno = txn.seqno(); if (res != nullptr) { res->set_seqno(seqno); res->set_caught_up(true); + start_result(*res); } if (!fake_exec) { for (int o = 0; o < txn.op_size(); ++o) { @@ -733,6 +718,8 @@ } } } + if (res != nullptr) + fin_result(*res); //if (use_wal) wal.commit(); } @@ -773,12 +760,12 @@ } #end -template<typename Txn> shared_ptr<ydb::pb::Txn> to_pb_Txn(Txn txn); -template<> shared_ptr<ydb::pb::Txn> to_pb_Txn(ydb::pb::Txn txn) { - return shared_ptr<ydb::pb::Txn>(new ydb::pb::Txn(txn)); +template<typename Txn> shared_ptr<pb::Txn> to_pb_Txn(Txn txn); +template<> shared_ptr<pb::Txn> to_pb_Txn(pb::Txn txn) { + return shared_ptr<pb::Txn>(new pb::Txn(txn)); } -template<> shared_ptr<ydb::pb::Txn> 
to_pb_Txn(ydb::msg::Txn txn) { - shared_ptr<ydb::pb::Txn> ptxn(new ydb::pb::Txn()); +template<> shared_ptr<pb::Txn> to_pb_Txn(msg::Txn txn) { + shared_ptr<pb::Txn> ptxn(new pb::Txn()); ptxn->set_seqno(txn.seqno()); // XXX FIXME return ptxn; @@ -809,16 +796,18 @@ * * \param[in] wal The WAL. */ -template<typename Types> +template<typename Types, typename RTypes> void process_txns(st_netfd_t leader, mii &map, int &seqno, st_channel<shared_ptr<Recovery> > &send_states, - st_channel<shared_ptr<ydb::pb::Txn> > &backlog, int init_seqno, + st_channel<shared_ptr<pb::Txn> > &backlog, int init_seqno, int mypos, int nnodes) { typedef typename Types::TxnBatch TxnBatch; typedef typename Types::Txn Txn; typedef typename Types::Op Op; + typedef typename RTypes::Response Response; + typedef typename RTypes::ResponseBatch ResponseBatch; bool caught_up = init_seqno == 0; long long start_time = current_time_millis(), @@ -844,15 +833,17 @@ st_reader reader(leader); vector<st_netfd_t> leader_v(1, leader); - writer w(lambda(const void*, size_t) { - throw operation_not_supported("process_txns should not be writing"); - }, buf_size); + writer w(lambda(const void *buf, size_t len) { + checkeqnneg(st_write(__ref(leader), buf, len, ST_UTIME_NO_TIMEOUT), + static_cast<ssize_t>(len)); + }, buf_size); stream s(reader, w); try { scoped_ptr<TxnBatch> pbatch(new_TxnBatch<TxnBatch>(s)); - TxnBatch batch = *pbatch; - ResponseBatch resbatch; + TxnBatch &batch = *pbatch; + scoped_ptr<ResponseBatch> presbatch(new_ResponseBatch<ResponseBatch>(s)); + ResponseBatch &resbatch = *presbatch; while (true) { long long before_read = -1; if (read_thresh > 0) { @@ -871,7 +862,9 @@ } } if (batch.txn_size() > 0) { + w.mark(); resbatch.Clear(); + start_res(resbatch); for (int t = 0; t < batch.txn_size(); ++t) { const Txn &txn = batch.txn(t); // Regular transaction. 
@@ -888,8 +881,8 @@ caught_up = true; } Response *res = resbatch.add_res(); - process_txn<Types>(map, txn, seqno, res); - if (!Types::is_pb()) { + process_txn<Types, RTypes>(map, txn, seqno, res); + if (fake_exec && !Types::is_pb()) { reader.skip(txn.op_size() * Op_Size); } action = "processed"; @@ -912,7 +905,8 @@ st_sleep(0); } } - if (resbatch.res_size() > 0) + fin_res(resbatch); + if (resbatch.res_size() > 0 && RTypes::is_pb()) sendmsg(leader, resbatch); } else { // Empty (default) TxnBatch means "generate a snapshot." @@ -967,14 +961,27 @@ last_seqno(-1) {} + template<typename Types> void run() { - finally f(boost::bind(&response_handler::cleanup, this)); + typedef typename Types::Response Response; + typedef typename Types::ResponseBatch ResponseBatch; + finally f(bind(&response_handler::cleanup, this)); + st_reader reader(replica); - ResponseBatch batch; + writer w(lambda(const void*, size_t) { + throw operation_not_supported("response handler should not be writing"); + }, buf_size); + stream s(reader,w); + scoped_ptr<ResponseBatch> pbatch(new_ResponseBatch<ResponseBatch>(s)); + ResponseBatch &batch = *pbatch; + + function<void()> loop_cleanup = + bind(&response_handler::loop_cleanup, this); + while (true) { - finally f(boost::bind(&response_handler::loop_cleanup, this)); + finally f(loop_cleanup); // Read the message, but correctly respond to interrupts so that we can // cleanly exit (slightly tricky). @@ -982,7 +989,8 @@ // Stop-interruptible in case we're already caught up. try { st_intr intr(stop_hub); - readmsg(reader, batch); + if (Types::is_pb()) readmsg(reader, batch); + else batch.Clear(); } catch (...) { // TODO: only catch interruptions // This check on seqnos is OK for termination since the seqno will // never grow again if stop_hub is set. @@ -999,7 +1007,8 @@ // Only kill-interruptible because we want a clean termination (want // to get all the acks back). 
st_intr intr(kill_hub); - readmsg(reader, batch); + if (Types::is_pb()) readmsg(reader, batch); + else batch.Clear(); } for (int i = 0; i < batch.res_size(); ++i) { @@ -1007,7 +1016,9 @@ // Determine if this response handler's host (the only joiner) has finished // catching up. If it has, then broadcast a signal so that all response // handlers will know about this event. - if (!caught_up && res.caught_up()) { + int rseqno = res.seqno(); + bool rcaught_up = res.caught_up(); + if (!caught_up && rcaught_up) { long long now = current_time_millis(), timediff = now - start_time; caught_up = true; recover_signals.push(now); @@ -1022,14 +1033,14 @@ stop_hub.set(); } } - if (res.seqno() % chkpt == 0) { + if (rseqno % chkpt == 0) { if (verbose) { cout << rid << ": "; - cout << "got response " << res.seqno() << " from " << replica << endl; + cout << "got response " << rseqno << " from " << replica << endl; } st_sleep(0); } - last_seqno = res.seqno(); + last_seqno = rseqno; } } } @@ -1079,12 +1090,13 @@ /** * Swallow replica responses. */ +template<typename Types> void handle_responses(st_netfd_t replica, const int &seqno, int rid, st_multichannel<long long> &recover_signals, bool caught_up) { response_handler h(replica, seqno, rid, recover_signals, caught_up); - h.run(); + h.run<Types>(); } /** @@ -1137,6 +1149,7 @@ /** * Run the leader. */ +template<typename Types, typename RTypes> void run_leader(int minreps, uint16_t leader_port) { @@ -1184,9 +1197,8 @@ st_bool accept_joiner; int seqno = 0; st_channel<replica_info> newreps; - const function<void()> f = use_pb ? 
- bind(issue_txns<pb_types>, ref(newreps), ref(seqno), ref(accept_joiner)) : - bind(issue_txns<rb_types>, ref(newreps), ref(seqno), ref(accept_joiner)); + const function<void()> f = + bind(issue_txns<Types>, ref(newreps), ref(seqno), ref(accept_joiner)); st_thread_t issue_txns_thread = my_spawn(f, "issue_txns"); foreach (const replica_info &r, replicas) newreps.push(r); st_joining join_issue_txns(issue_txns_thread); @@ -1212,7 +1224,7 @@ st_thread_group handlers; int rid = 0; foreach (replica_info r, replicas) { - handlers.insert(my_spawn(bind(handle_responses, r.fd(), ref(seqno), rid++, + handlers.insert(my_spawn(bind(handle_responses<RTypes>, r.fd(), ref(seqno), rid++, ref(recover_signals), true), "handle_responses")); } @@ -1242,7 +1254,7 @@ // Start streaming txns to joiner. cout << "start streaming txns to joiner" << endl; newreps.push(replicas.back()); - handlers.insert(my_spawn(bind(handle_responses, joiner, ref(seqno), rid++, + handlers.insert(my_spawn(bind(handle_responses<RTypes>, joiner, ref(seqno), rid++, ref(recover_signals), false), "handle_responses_joiner")); } catch (break_exception &ex) { @@ -1256,6 +1268,7 @@ /** * Run a replica. */ +template<typename Types, typename RTypes> void run_replica(string leader_host, uint16_t leader_port, uint16_t listen_port) { @@ -1329,13 +1342,10 @@ } // Process txns. - st_channel<shared_ptr<ydb::pb::Txn> > backlog; - const function<void()> process_fn = use_pb ? 
- bind(process_txns<pb_types>, leader, ref(map), ref(seqno), + st_channel<shared_ptr<pb::Txn> > backlog; + const function<void()> process_fn = + bind(process_txns<Types, RTypes>, leader, ref(map), ref(seqno), ref(send_states), ref(backlog), init.txnseqno(), mypos, - init.node_size()) : - bind(process_txns<rb_types>, leader, ref(map), ref(seqno), - ref(send_states), ref(backlog), init.txnseqno(), mypos, init.node_size()); st_joining join_proc(my_spawn(process_fn, "process_txns")); st_joining join_rec(my_spawn(bind(recover_joiner, listener, @@ -1391,9 +1401,9 @@ int mid_seqno = seqno; while (!backlog.empty()) { - using ydb::pb::Txn; + using pb::Txn; shared_ptr<Txn> p = backlog.take(); - process_txn<pb_types>(map, *p, seqno, nullptr); + process_txn<pb_types, pb_types>(map, *p, seqno, nullptr); if (p->seqno() % chkpt == 0) { if (verbose) cout << "processed txn " << p->seqno() << " off the backlog; " @@ -1520,7 +1530,9 @@ ("general-txns,g", po::bool_switch(&general_txns), "issue read and delete transactions as well as the default of (only) insertion/update transactions (for leader only)") ("use-pb", po::bool_switch(&use_pb), - "use protocol buffers instead of raw buffers") + "use protocol buffers instead of raw buffers for txns") + ("use-pb-res", po::bool_switch(&use_pb_res), + "use protocol buffers instead of raw buffers for responses") ("wal", po::bool_switch(&use_wal), "enable ARIES write-ahead logging") ("force-ser", po::bool_switch(&force_ser), @@ -1671,9 +1683,33 @@ // Which role are we? 
if (is_leader) { - run_leader(minreps, leader_port); + if (use_pb) { + if (use_pb_res) { + run_leader<pb_types, pb_types>(minreps, leader_port); + } else { + run_leader<pb_types, rb_types>(minreps, leader_port); + } + } else { + if (use_pb_res) { + run_leader<rb_types, pb_types>(minreps, leader_port); + } else { + run_leader<rb_types, rb_types>(minreps, leader_port); + } + } } else { - run_replica(leader_host, leader_port, listen_port); + if (use_pb) { + if (use_pb_res) { + run_replica<pb_types, pb_types>(leader_host, leader_port, listen_port); + } else { + run_replica<pb_types, rb_types>(leader_host, leader_port, listen_port); + } + } else { + if (use_pb_res) { + run_replica<rb_types, pb_types>(leader_host, leader_port, listen_port); + } else { + run_replica<rb_types, rb_types>(leader_host, leader_port, listen_port); + } + } } return 0; Modified: ydb/trunk/src/ser.cc =================================================================== --- ydb/trunk/src/ser.cc 2009-02-23 23:55:26 UTC (rev 1228) +++ ydb/trunk/src/ser.cc 2009-02-24 08:40:14 UTC (rev 1229) @@ -29,7 +29,8 @@ str.append(sizeof len, '\0'); check(batch.AppendToString(&str)); len = uint32_t(str.size() - sizeof len); - copy((char*) &len, (char*) &len + sizeof len, str.begin()); + char *p = reinterpret_cast<char*>(&len); + copy(p, p + sizeof len, str.begin()); os(str.data(), str.size()); } Modified: ydb/trunk/src/ser.h =================================================================== --- ydb/trunk/src/ser.h 2009-02-23 23:55:26 UTC (rev 1228) +++ ydb/trunk/src/ser.h 2009-02-24 08:40:14 UTC (rev 1229) @@ -12,6 +12,22 @@ #define BEGIN_NAMESPACE(ns) namespace ns { #define END_NAMESPACE } +#define MAKE_START_FIN_HELPER(MsgType, field, action) \ + template<typename T> void action##_##field(T &msg); \ + template<> void action##_##field(ydb::pb::MsgType&) {} \ + template<> void action##_##field(ydb::msg::MsgType& msg) { msg.action##_##field(); } +#define MAKE_START_FIN(MsgType, field) \ + 
MAKE_START_FIN_HELPER(MsgType, field, start) \ + MAKE_START_FIN_HELPER(MsgType, field, fin) + +#define EXPAND_PB \ + bool AppendToString(string*) const { throw_operation_not_supported(); } \ + bool SerializeToString(string*) const { throw_operation_not_supported(); } \ + bool SerializeToOstream(ostream*) const { throw_operation_not_supported(); } \ + bool ParseFromArray(void*, size_t) { throw_operation_not_supported(); } \ + size_t GetCachedSize() const { throw_operation_not_supported(); } \ + size_t ByteSize() const { throw_operation_not_supported(); } \ + BEGIN_NAMESPACE(ydb) BEGIN_NAMESPACE(msg) @@ -160,38 +176,77 @@ return ntxn_; } const Txn &txn(int) const { txn_.Clear(); return txn_; } - bool AppendToString(string*) const { throw_operation_not_supported(); } - bool SerializeToString(string*) const { throw_operation_not_supported(); } - bool SerializeToOstream(ostream*) const { throw_operation_not_supported(); } - bool ParseFromArray(void*, size_t) { throw_operation_not_supported(); } - size_t GetCachedSize() const { throw_operation_not_supported(); } - size_t ByteSize() const { throw_operation_not_supported(); } + EXPAND_PB }; -template<typename T> void start_txn(T &batch); -template<> void start_txn(ydb::pb::TxnBatch &) {} -template<> void start_txn(ydb::msg::TxnBatch &batch) { batch.start_txn(); } +template<typename T> T *new_TxnBatch(stream &s); +template<> ydb::pb::TxnBatch *new_TxnBatch(stream &) { return new ydb::pb::TxnBatch(); } +template<> ydb::msg::TxnBatch *new_TxnBatch(stream &s) { return new ydb::msg::TxnBatch(s); } -template<typename T> void fin_txn(T &batch); -template<> void fin_txn(ydb::pb::TxnBatch &) {} -template<> void fin_txn(ydb::msg::TxnBatch &batch) { batch.fin_txn(); } +MAKE_START_FIN(Txn, op) +MAKE_START_FIN(TxnBatch, txn) -template<typename T> void start_op(T &txn); -template<> void start_op(ydb::pb::Txn &) {} -template<> void start_op(ydb::msg::Txn &txn) { txn.start_op(); } +class Response +{ + stream &s_; + reader &r_; + writer 
&w_; + size_t off_; + mutable short nres_; +public: + Response(stream &s) : s_(s), r_(s.get_reader()), w_(s.get_writer()), off_(w_.pos()), nres_(unset) {} + void Clear() { nres_ = unset; off_ = w_.pos(); } + void set_seqno(int x) { w_.write(x); } + void set_caught_up(char x) { w_.write(x); } + int seqno() const { return r_.read<int>(); } + bool caught_up() const { return r_.read<int>(); } + void start_result() { if (nres_ == unset) nres_ = 0; w_.skip<typeof(nres_)>(); } + void add_result(int x) { w_.write(x); } + void fin_result() { w_.write(nres_, off_ + sizeof(int) + sizeof(char)); } + int result_size() const { + if (nres_ == unset) + nres_ = r_.read<typeof(nres_)>(); + return nres_; + } + int result(int) const { return r_.read<int>(); } +}; -template<typename T> void fin_op(T &txn); -template<> void fin_op(ydb::pb::Txn &) {} -template<> void fin_op(ydb::msg::Txn &txn) { txn.fin_op(); } +class ResponseBatch +{ + stream &s_; + reader &r_; + writer &w_; + size_t off_; + mutable Response res_; + mutable short nres_; +public: + ResponseBatch(stream &s) : s_(s), r_(s.get_reader()), w_(s.get_writer()), off_(w_.pos()), res_(s), nres_(unset) {} + void Clear() { res_.Clear(); nres_ = unset; off_ = w_.pos(); } + void start_res() { if (nres_ == unset) nres_ = 0; w_.skip<typeof(nres_)>(); } + Response *add_res() { ++nres_; return &res_; } + void fin_res() { w_.write(nres_, off_); } + int res_size() const { + if (nres_ == unset) + nres_ = r_.read<typeof(nres_)>(); + return nres_; + } + const Response &res(int) { res_.Clear(); return res_; } + EXPAND_PB +}; -template<typename T> T *new_TxnBatch(stream &s); -template<> ydb::pb::TxnBatch *new_TxnBatch(stream &) { return new ydb::pb::TxnBatch(); } -template<> ydb::msg::TxnBatch *new_TxnBatch(stream &s) { return new ydb::msg::TxnBatch(s); } +template<typename T> T *new_ResponseBatch(stream &s); +template<> ydb::pb::ResponseBatch *new_ResponseBatch(stream &) { return new ydb::pb::ResponseBatch(); } +template<> 
ydb::msg::ResponseBatch *new_ResponseBatch(stream &s) { return new ydb::msg::ResponseBatch(s); } +MAKE_START_FIN(Response, result) +MAKE_START_FIN(ResponseBatch, res) + struct pb_types { typedef ydb::pb::TxnBatch TxnBatch; typedef ydb::pb::Txn Txn; typedef ydb::pb::Op Op; + typedef ydb::pb::Response Response; + typedef ydb::pb::ResponseBatch ResponseBatch; static bool is_pb() { return true; } }; @@ -200,6 +255,8 @@ typedef ydb::msg::TxnBatch TxnBatch; typedef ydb::msg::Txn Txn; typedef ydb::msg::Op Op; + typedef ydb::msg::Response Response; + typedef ydb::msg::ResponseBatch ResponseBatch; static bool is_pb() { return false; } }; Modified: ydb/trunk/tools/test.bash =================================================================== --- ydb/trunk/tools/test.bash 2009-02-23 23:55:26 UTC (rev 1228) +++ ydb/trunk/tools/test.bash 2009-02-24 08:40:14 UTC (rev 1229) @@ -188,8 +188,7 @@ refresh-local cd ~/ydb/src make clean - PPROF=1 OPT=1 make WTF= - PPROF=1 OPT=1 make WTF= p2 + PPROF=1 OPT=1 make WTF= ydb } init-setup() { This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
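Among the r1229 changes, readmsg() now picks its parse buffer by message size: stack allocation (alloca) for bodies up to 4 KB, a scoped heap array otherwise, replacing the old GETMSG macro duplicated across two branches. A self-contained sketch of that pattern, with a read callback standing in for st_read_fully and a string copy standing in for ParseFromArray (both substitutions are assumptions, not the ydb API):

```cpp
#include <alloca.h>
#include <cassert>
#include <cstring>
#include <functional>
#include <memory>
#include <string>

// Read a length-prefixed message body, using the stack for small
// payloads and the heap for large ones -- the same shape as the
// r1229 readmsg() change.
std::string read_body(const std::function<void(char*, size_t)> &read_fully,
                      size_t len) {
    std::unique_ptr<char[]> heap;   // engaged only for large messages
    char *buf;
    if (len <= 4096) buf = static_cast<char*>(alloca(len));
    else { heap.reset(new char[len]); buf = heap.get(); }
    read_fully(buf, len);
    return std::string(buf, len);   // stand-in for msg.ParseFromArray
}
```

Either way the parse sees one contiguous buffer; only the allocation strategy differs.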
From: <yan...@us...> - 2009-02-25 21:13:36
Revision: 1234 http://assorted.svn.sourceforge.net/assorted/?rev=1234&view=rev Author: yangzhang Date: 2009-02-25 21:13:24 +0000 (Wed, 25 Feb 2009) Log Message: ----------- - removed (commented out) async sending - sendmsg no longer uses bcastmsg - reuse serialized msg in issue_txns - reuse serialization buffer in issue_txns - cleaned up sending code in issue_txns - added more notes Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/main.lzz.clamp Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-25 19:25:53 UTC (rev 1233) +++ ydb/trunk/README 2009-02-25 21:13:24 UTC (rev 1234) @@ -437,13 +437,23 @@ - 1: 518K 467K 359K 333K - 2: 505K 470K 350K 333K - 3: 485K 465K 335K 333K -- TODO get raw-buffer working in wal, 0-node +- DONE get raw-buffer working in wal, 0-node + - 0: 520K (vs 550K using pb) + - rb is a bit slower than pb + - -1: 495K (vs 350K using pb) + +- DONE reuse serialization buffer for pb + - almost no diff + - 1: 362K + - 2: 350K +- TODO use arrays instead of strings for pb and avoid dyn alloc +- TODO fix pb recovery + - TODO refactor st_reader, etc. 
to be generic opportunistic buffered readers - TODO see how streambuf read/write is actually implemented (whether it's too slow) - TODO try making a streambuf for st_write, then try it in conj with struct-less pb -- TODO fix pb recovery - TODO implement new recovery (add buffer swapping, add buffers to a list) - TODO async (threaded) wal - TODO 0-node 0-copy (don't need to use threads, just process each batch immed) @@ -491,6 +501,8 @@ Longer term +- Dynamically switch between 0-node and n-node modes + - Testing - unit/regression/mock - performance tests Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-02-25 19:25:53 UTC (rev 1233) +++ ydb/trunk/src/main.lzz.clamp 2009-02-25 21:13:24 UTC (rev 1234) @@ -68,7 +68,7 @@ bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates, count_updates, stop_on_recovery, general_txns, profile_threads, debug_threads, multirecover, disk, debug_memory, use_wal, use_pb, use_pb_res, - suppress_txn_msgs, use_bcast_async, fake_bcast, force_ser, fake_exec; + suppress_txn_msgs, fake_bcast, force_ser, fake_exec; long long timelim, read_thresh, write_thresh; // Control. @@ -304,6 +304,7 @@ check(msg.SerializeToOstream(&s)); } +#if 0 /** * The worker that performs the actual broadcasting. */ @@ -329,14 +330,14 @@ /** * Asynchronous version of the broadcaster. */ -template<typename T> void -bcastmsg_async(const vector<st_netfd_t> &dsts, const T &msg) +bcastbuf_async(const vector<st_netfd_t> &dsts, const ser_t &msg) { shared_ptr<string> p(new string); ser(*p.get(), msg); foreach (st_netfd_t dst, dsts) msgs.push(make_pair(dst, p)); } +#endif /** * Perform an st_write but warn if it took over write_thresh ms. @@ -362,17 +363,14 @@ } /** - * Send a message to some destinations (sequentially). + * Send a message to some destinations. 
*/ -template<typename T> -void -bcastmsg_sync(const vector<st_netfd_t> &dsts, const T &msg /*, ser_t &s */) +inline void +bcastbuf(const vector<st_netfd_t> &dsts, const ser_t &msg) { - ser_t s; - ser(s, msg); if (!fake_bcast) { foreach (st_netfd_t dst, dsts) { - st_timed_write(dst, s.data(), s.size()); + st_timed_write(dst, msg.data(), msg.size()); } } } @@ -383,22 +381,33 @@ */ template<typename T> inline void -bcastmsg(const vector<st_netfd_t> &dsts, const T &msg /* XXX optimize this , ser_t &s */) +bcastmsg(const vector<st_netfd_t> &dsts, const T &msg) { - if (use_bcast_async) bcastmsg_async(dsts, msg); - else bcastmsg_sync(dsts, msg); + ser_t s; + ser(s, msg); + bcastbuf(dsts, s); } /** * Send a message to a single recipient. */ +inline void +sendbuf(st_netfd_t dst, const ser_t &msg) +{ + if (!fake_bcast) + st_timed_write(dst, msg.data(), msg.size()); +} + +/** + * Send a message to a single recipient. + */ template<typename T> inline void sendmsg(st_netfd_t dst, const T &msg) { - // XXX optimize this - vector<st_netfd_t> dsts(1, dst); - bcastmsg(dsts, msg); + ser_t s; + ser(s, msg); + sendbuf(dst, s); } /** @@ -489,6 +498,7 @@ wal() : of("wal"), out(of) {} template <typename T> void log(const T &msg) { ser(of, msg); } + void logbuf(const ser_t &s) { logbuf(s.data(), s.size()); } void logbuf(const void *buf, size_t len) { of.write(reinterpret_cast<const char*>(buf), len); } @@ -542,16 +552,6 @@ function<void(const void*, size_t)> fn; if (use_wal) fn = bind(&wal::logbuf, g_wal, _1, _2); - //else if (newreps.empty()) - // fn = lambda(const void *buf, size_t len) { - // // Prepare a new buffer to swap with the writer's current working buffer. - // new buffer; - // // Copy data past the end of the current buffer into the new buffer, so - // // that it's not lost. - // copy(); - // // Swap the current buffer with the new buffer. 
- // swap(); - // }; else fn = lambda(const void *buf, size_t len) { foreach (st_netfd_t dst, __ref(fds)) @@ -569,6 +569,7 @@ for (int t = 0; t < batch_size; ++t) batch.add_txn(); + ser_t serbuf; while (!stop_hub) { w.mark(); batch.Clear(); @@ -649,25 +650,20 @@ } fin_txn(batch); + bool do_bcast = !fds.empty() && !suppress_txn_msgs; if (Types::is_pb()) { // Broadcast/log/serialize. - // TODO optimize: reuse serialization (have these functions take - // serialized buffers instead of message structures) - if (!fds.empty() && !suppress_txn_msgs) { - bcastmsg(fds, batch); + if (force_ser || do_bcast || use_wal) { + serbuf.clear(); + ser(serbuf, batch); + if (do_bcast) bcastbuf(fds, serbuf); + if (use_wal) g_wal->logbuf(serbuf); } - if (use_wal) { - g_wal->log(batch); - } - if (fds.empty() && suppress_txn_msgs && !use_wal && force_ser) { - string s; - ser(s, batch); - } } else { // Reset if we have nobody to send to (incl. disk) or if we actually have // no txns (possible due to loop structure; want to avoid to avoid // confusing with the 0-txn message signifying "prepare a recovery msg"). - if ((fds.empty() && !use_wal) || batch.txn_size() == 0) { + if (!do_bcast && !use_wal) { w.reset(); } } @@ -1554,8 +1550,6 @@ "when using --bcast-async, don't actually perform the socket write") ("show-updates,U", po::bool_switch(&show_updates), "log operations that touch (update/read/delete) an existing key") - ("bcast-async", po::bool_switch(&use_bcast_async), - "broadcast messages asynchronously") ("count-updates,u",po::bool_switch(&count_updates), "count operations that touch (update/read/delete) an existing key") ("general-txns,g", po::bool_switch(&general_txns), @@ -1673,20 +1667,11 @@ my_spawn(memmon, "memmon"); } - // Start the message broadcaster thread, if requested. - st_thread_t bcaster_thread = use_bcast_async ? 
- my_spawn(bcaster, "bcaster") : nullptr; - long long start = thread_start_time = current_time_millis(); // At the end, cleanly stop the bcaster thread and print thread profiling // information. finally f(lambda() { - if (use_bcast_async) { - msgs.push(make_pair(nullptr, shared_ptr<string>())); - st_join(__ref(bcaster_thread)); - } - if (profile_threads) { long long end = current_time_millis(); long long all = end - __ref(start); This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
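The central change in r1234 is splitting serialization from transmission: bcastmsg serializes once into a buffer and bcastbuf writes that same buffer to every destination, so an N-way broadcast no longer serializes N times (and sendmsg no longer detours through the broadcast path). A sketch under assumed stand-ins, a write callback instead of st_timed_write and a hypothetical serialize() method instead of ydb's free ser() function:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

using write_fn = std::function<void(int dst, const std::string&)>;

// One serialization, N socket writes.
void bcastbuf(const std::vector<int> &dsts, const std::string &buf,
              const write_fn &wr) {
    for (int dst : dsts) wr(dst, buf);
}

template<typename Msg>
void bcastmsg(const std::vector<int> &dsts, const Msg &msg,
              const write_fn &wr) {
    std::string s = msg.serialize();   // hypothetical hook; ydb uses ser(s, msg)
    bcastbuf(dsts, s, wr);
}
```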
From: <yan...@us...> - 2009-02-26 03:17:23
Revision: 1236 http://assorted.svn.sourceforge.net/assorted/?rev=1236&view=rev Author: yangzhang Date: 2009-02-26 03:17:16 +0000 (Thu, 26 Feb 2009) Log Message: ----------- - use ser_array adapter for arrays to stand in as the serialization type (instead of strings) - fixed the types on the rb pb-expander methods - added ser for ser_arrays - removed pb_size; unreliable - added some more notes Modified Paths: -------------- ydb/trunk/README ydb/trunk/src/main.lzz.clamp ydb/trunk/src/ser.h Modified: ydb/trunk/README =================================================================== --- ydb/trunk/README 2009-02-26 03:15:58 UTC (rev 1235) +++ ydb/trunk/README 2009-02-26 03:17:16 UTC (rev 1236) @@ -446,7 +446,11 @@ - almost no diff - 1: 362K - 2: 350K -- TODO use arrays instead of strings for pb and avoid dyn alloc +- DONE use arrays instead of strings for pb and avoid dyn alloc + - almost no diff + - 1: 362K + - 2: 355K +- TODO make same changes for sending responses - TODO fix pb recovery - TODO refactor st_reader, etc. to be generic opportunistic buffered readers Modified: ydb/trunk/src/main.lzz.clamp =================================================================== --- ydb/trunk/src/main.lzz.clamp 2009-02-26 03:15:58 UTC (rev 1235) +++ ydb/trunk/src/main.lzz.clamp 2009-02-26 03:17:16 UTC (rev 1236) @@ -52,7 +52,6 @@ #define map_t dense_hash_map typedef pair<int, int> pii; typedef map_t<int, int> mii; -typedef string ser_t; template<typename T> void init_map(T &map) {} template<> void init_map(dense_hash_map<int, int> &map) { @@ -256,6 +255,28 @@ st_channel<pair<st_netfd_t, shared_ptr<string> > > msgs; /** + * Adapter for arrays to look like strings (for PB serialization). 
+ */ +class ser_array +{ + commons::array<char> a_; + size_t size_; +public: + ser_array(size_t size = buf_size) : a_(size), size_(0) {} + char *data() const { return a_.get(); } + size_t size() const { return size_; } + void clear() { size_ = 0; } + void stretch(size_t size) { + if (size > a_.size()) + a_.reset(new char[size], size); + size_ = size; + } +}; + +//typedef string ser_t; +typedef ser_array ser_t; + +/** * Serialization. * * TODO: experiment with which method is the fastest: using a string as shown @@ -281,15 +302,18 @@ copy(plen, plen + sizeof len, s.begin()); } -/** - * Helper for getting the cached ByteSize of a message. - */ -template <typename T> -size_t -pb_size(const T &msg) { - // GetCachedSize returns 0 if no cached size. - size_t len = msg.GetCachedSize(); - return len == 0 ? msg.ByteSize() : len; +template<typename T> +void +ser(ser_array &s, const T &msg) +{ + int len = msg.ByteSize(); + + // Grow the array as needed. + s.stretch(len + sizeof(uint32_t)); + + // Serialize message to a buffer with four-byte length prefix. 
+ check(msg.SerializeToArray(s.data() + sizeof(uint32_t), len)); + *reinterpret_cast<uint32_t*>(s.data()) = htonl(uint32_t(len)); } /** @@ -299,7 +323,7 @@ void ser(ostream &s, const T &msg) { - uint32_t len = htonl(uint32_t(pb_size(msg))); + uint32_t len = htonl(uint32_t(msg.ByteSize())); s.write(reinterpret_cast<const char*>(&len), sizeof len); check(msg.SerializeToOstream(&s)); } Modified: ydb/trunk/src/ser.h =================================================================== --- ydb/trunk/src/ser.h 2009-02-26 03:15:58 UTC (rev 1235) +++ ydb/trunk/src/ser.h 2009-02-26 03:17:16 UTC (rev 1236) @@ -34,11 +34,12 @@ #define EXPAND_PB \ bool AppendToString(string*) const { throw_operation_not_supported(); } \ + bool SerializeToArray(void*, int) const { throw_operation_not_supported(); } \ bool SerializeToString(string*) const { throw_operation_not_supported(); } \ bool SerializeToOstream(ostream*) const { throw_operation_not_supported(); } \ - bool ParseFromArray(void*, size_t) { throw_operation_not_supported(); } \ - size_t GetCachedSize() const { throw_operation_not_supported(); } \ - size_t ByteSize() const { throw_operation_not_supported(); } \ + bool ParseFromArray(void*, int) { throw_operation_not_supported(); } \ + int GetCachedSize() const { throw_operation_not_supported(); } \ + int ByteSize() const { throw_operation_not_supported(); } \ #define MAKE_TYPE_BATCH(name, ns, b) \ struct name##_types { \ This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
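The ser_array adapter introduced in r1236 lets a plain char array take over the role the std::string ser_t played before: it grows only when a message outgrows the current capacity, and the four-byte network-order length prefix is written in place at the front. A cut-down sketch (commons::array and protobuf's SerializeToArray are replaced here by unique_ptr and memcpy purely for self-containment):

```cpp
#include <arpa/inet.h>   // htonl
#include <cassert>
#include <cstdint>
#include <cstring>
#include <memory>

// Reusable serialization buffer: reallocates only when a message is
// larger than anything seen so far. Like the r1236 original, stretch()
// does not preserve old contents -- the buffer is always overwritten.
class ser_array {
    std::unique_ptr<char[]> a_;
    size_t cap_, size_;
public:
    explicit ser_array(size_t cap = 64) : a_(new char[cap]), cap_(cap), size_(0) {}
    char *data() const { return a_.get(); }
    size_t size() const { return size_; }
    void clear() { size_ = 0; }
    void stretch(size_t n) {                 // grow capacity, never shrink
        if (n > cap_) { a_.reset(new char[n]); cap_ = n; }
        size_ = n;
    }
};

// Frame a payload with a four-byte length prefix, as ser() does for a
// protobuf message (memcpy stands in for SerializeToArray).
void ser(ser_array &s, const char *payload, uint32_t len) {
    s.stretch(len + sizeof(uint32_t));
    std::memcpy(s.data() + sizeof(uint32_t), payload, len);
    uint32_t be = htonl(len);
    std::memcpy(s.data(), &be, sizeof be);   // prefix goes in front
}
```

Since stretch() never shrinks, a long-lived ser_array settles at the largest batch size and stops allocating, which is the point of the change.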
From: <yan...@us...> - 2009-02-26 06:41:33
|
Revision: 1237
          http://assorted.svn.sourceforge.net/assorted/?rev=1237&view=rev
Author:   yangzhang
Date:     2009-02-26 06:41:25 +0000 (Thu, 26 Feb 2009)

Log Message:
-----------
- responses now reuse serialization buffer as well
- added --disp-interval for response_handler
- fixed int overflow in showtput
- increased default -X param
- added -pipe -march=native

Modified Paths:
--------------
    ydb/trunk/README
    ydb/trunk/src/Makefile
    ydb/trunk/src/main.lzz.clamp
    ydb/trunk/tools/test.bash

Modified: ydb/trunk/README
===================================================================
--- ydb/trunk/README	2009-02-26 03:17:16 UTC (rev 1236)
+++ ydb/trunk/README	2009-02-26 06:41:25 UTC (rev 1237)
@@ -450,9 +450,26 @@
   - almost no diff
     - 1: 362K
     - 2: 355K
-- TODO make same changes for sending responses
+- DONE make same changes for sending responses
+  - tiny improvement
+    - 1: 366K
+    - 2: 360K
 
 - TODO fix pb recovery
 
+- DONE figure out why there's such a dramatic slowdown as the DB grows
+  - ydb
+    - 1e5: 530K
+    - 1e6: 428K
+    - 2e6: 417K
+    - 3e6: before bug fix: -200K! this is because 3e6*1000 > INT_MAX
+    - 1e7: after bug fix: 412K
+  - p2
+    - 1e5: 700K
+    - 1e6: 655K
+    - 1e7: 501K
+    - 5e7: 495K
+  - there was an int overflow bug
+
 - TODO refactor st_reader, etc.
   to be generic opportunistic buffered readers
 - TODO see how streambuf read/write is actually implemented (whether it's
   too slow)
@@ -462,7 +479,7 @@
 - TODO async (threaded) wal
 - TODO 0-node 0-copy (don't need to use threads, just process each batch
   immed)
-- TODO reuse the serialization buffer in the pb path of ydb
+- TODO batch up the responses until they make large-enough buffer in pb mode
 - TODO show aries-write
 - TODO checkpointing + replaying log from replicas (not from disk)

Modified: ydb/trunk/src/Makefile
===================================================================
--- ydb/trunk/src/Makefile	2009-02-26 03:17:16 UTC (rev 1236)
+++ ydb/trunk/src/Makefile	2009-02-26 06:41:25 UTC (rev 1237)
@@ -34,7 +34,7 @@
   OPT := -g3
 endif
 # CXX := $(WTF) ag++ -k --Xcompiler # $(CXX)
-CXX := $(WTF) $(CXX)
+CXX := $(WTF) $(CXX) -pipe
 LDFLAGS := -pthread $(GPROF)
 LDLIBS := -lstx -lst -lresolv -lprotobuf -lgtest \
   -lboost_program_options-gcc43-mt -lboost_thread-gcc43-mt \
@@ -71,6 +71,7 @@
   -Wlong-long \
   -Wvolatile-register-var \
   -std=gnu++0x \
+  -march=native \
   $(CXXFLAGS)
 # \

Modified: ydb/trunk/src/main.lzz.clamp
===================================================================
--- ydb/trunk/src/main.lzz.clamp	2009-02-26 03:17:16 UTC (rev 1236)
+++ ydb/trunk/src/main.lzz.clamp	2009-02-26 06:41:25 UTC (rev 1237)
@@ -62,7 +62,7 @@
 // Configuration.
 st_utime_t timeout;
 int chkpt, accept_joiner_seqno, issuing_interval, min_ops, max_ops,
-    stop_on_seqno, batch_size;
+    stop_on_seqno, batch_size, display_interval;
 size_t accept_joiner_size, buf_size, read_buf_size;
 bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates,
   count_updates, stop_on_recovery, general_txns, profile_threads,
@@ -778,7 +778,7 @@
 {
   long long time_diff = stop_time - start_time;
   int count_diff = stop_count - start_count;
-  double rate = count_diff * 1000 / double(time_diff);
+  double rate = double(count_diff) * 1000. / double(time_diff);
   cout << action << " " << count_diff << " txns ["
        << start_count << ".." << stop_count << "] in "
        << time_diff << " ms ["
@@ -894,6 +894,7 @@
   TxnBatch &batch = *pbatch;
   scoped_ptr<ResponseBatch> presbatch(new_ResponseBatch<ResponseBatch>(s));
   ResponseBatch &resbatch = *presbatch;
+  ser_t serbuf;
   while (true) {
     long long before_read = -1;
     if (read_thresh > 0) {
@@ -956,8 +957,11 @@
       }
     }
     fin_res(resbatch);
-    if (resbatch.res_size() > 0 && RTypes::is_pb())
-      sendmsg(leader, resbatch);
+    if (RTypes::is_pb() && resbatch.res_size() > 0) {
+      serbuf.clear();
+      ser(serbuf, batch);
+      sendbuf(leader, serbuf);
+    }
   } else {
     // Empty (default) TxnBatch means "generate a snapshot."
     // TODO make this faster
@@ -1028,6 +1032,8 @@
   scoped_ptr<ResponseBatch> pbatch(new_ResponseBatch<ResponseBatch>(s));
   ResponseBatch &batch = *pbatch;
 
+  long long last_display_time = current_time_millis();
+
   function<void()> loop_cleanup =
     bind(&response_handler::loop_cleanup, this);
@@ -1084,11 +1090,13 @@
         stop_hub.set();
       }
     }
-    if (rseqno % chkpt == 0) {
-      if (verbose) {
-        cout << rid << ": ";
-        cout << "got response " << rseqno << " from " << replica << endl;
-      }
+    if (display_interval > 0 && rseqno % display_interval == 0 && rseqno > 0) {
+      cout << rid << ": " << "got response " << rseqno << " from "
+           << replica << "; ";
+      long long display_time = current_time_millis();
+      showtput("handling", display_time, last_display_time, rseqno,
+               rseqno - display_interval);
+      last_display_time = display_time;
       st_sleep(0);
     }
     last_seqno = rseqno;
@@ -1598,6 +1606,8 @@
      po::value<size_t>(&accept_joiner_size)->default_value(0),
      "accept recovering joiner (start recovery) after DB grows to this size "
      "(for leader only)")
+    ("disp-interval", po::value<int>(&display_interval)->default_value(0),
+     "after this many txns, print current handling rate")
     ("issuing-interval,i",
      po::value<int>(&issuing_interval)->default_value(0),
      "seconds to sleep between issuing txns (for leader only)")

Modified: ydb/trunk/tools/test.bash
===================================================================
--- ydb/trunk/tools/test.bash	2009-02-26 03:17:16 UTC (rev 1236)
+++ ydb/trunk/tools/test.bash	2009-02-26 06:41:25 UTC (rev 1237)
@@ -387,7 +387,7 @@
 scaling-helper() {
   local leader=$1
   shift
-  tagssh $leader "CPUPROFILE=ydb.prof ydb/src/ydb -q -l -n $# -X 100000 ${extraargs:-}" &
+  tagssh $leader "CPUPROFILE=ydb.prof ydb/src/ydb -q -l -n $# -X 10000000 ${extraargs:-}" &
   sleep .1
   for rep in "$@"
   do tagssh $rep "CPUPROFILE=ydb.prof ydb/src/ydb -q -n $# -H $leader ${extraargs:-}" &

|
From: <yan...@us...> - 2009-02-26 18:32:56
|
Revision: 1239
          http://assorted.svn.sourceforge.net/assorted/?rev=1239&view=rev
Author:   yangzhang
Date:     2009-02-26 18:32:43 +0000 (Thu, 26 Feb 2009)

Log Message:
-----------
- fixed (single-node) recovery
- removed verbose and added separate --*-display flags
- refactored interval-checking
- updated notes/todos

Modified Paths:
--------------
    ydb/trunk/README
    ydb/trunk/src/main.lzz.clamp

Modified: ydb/trunk/README
===================================================================
--- ydb/trunk/README	2009-02-26 18:28:10 UTC (rev 1238)
+++ ydb/trunk/README	2009-02-26 18:32:43 UTC (rev 1239)
@@ -454,7 +454,6 @@
   - tiny improvement
     - 1: 366K
     - 2: 360K
-- TODO fix pb recovery
 
 - DONE figure out why there's such a dramatic slowdown as the DB grows
   - ydb
@@ -470,6 +469,16 @@
     - 5e7: 495K
   - there was an int overflow bug
 
+- DONE fix pb recovery
+  - abysmal perf; long wait at the map dump + almost never catch up, but at
+    least it works
+
+- TODO speed up backlogging; don't create pb objects, just take buffers
+
+- TODO fix multi-recovery if necessary
+
+- TODO speed up map dump; don't use range partitioning, but hash partitioning
+
 - TODO refactor st_reader, etc.
   to be generic opportunistic buffered readers
 - TODO see how streambuf read/write is actually implemented (whether it's
   too slow)

Modified: ydb/trunk/src/main.lzz.clamp
===================================================================
--- ydb/trunk/src/main.lzz.clamp	2009-02-26 18:28:10 UTC (rev 1238)
+++ ydb/trunk/src/main.lzz.clamp	2009-02-26 18:32:43 UTC (rev 1239)
@@ -61,10 +61,12 @@
 // Configuration.
 st_utime_t timeout;
-int chkpt, accept_joiner_seqno, issuing_interval, min_ops, max_ops,
-    stop_on_seqno, batch_size, display_interval;
+int yield_interval, accept_joiner_seqno, issuing_interval, min_ops, max_ops,
+    stop_on_seqno, batch_size, handle_responses_display,
+    catch_up_display, issue_display,
+    process_display;
 size_t accept_joiner_size, buf_size, read_buf_size;
-bool verbose, yield_during_build_up, yield_during_catch_up, dump, show_updates,
+bool yield_during_build_up, yield_during_catch_up, dump, show_updates,
   count_updates, stop_on_recovery, general_txns, profile_threads,
   debug_threads, multirecover, disk, debug_memory, use_wal, use_pb,
   use_pb_res, suppress_txn_msgs, fake_bcast, force_ser, fake_exec;
@@ -81,7 +83,10 @@
  * Convenience function for calculating percentages.
  */
 template<typename T>
-double pct(T sub, T tot) { return 100 * double(sub) / double(tot); }
+inline double pct(T sub, T tot)
+{
+  return 100 * double(sub) / double(tot);
+}
 
 /**
  * Convenience class for performing long-jumping break.
@@ -119,7 +124,7 @@
 /**
  * Look up thread name, or just show thread ID.
  */
-string
+inline string
 threadname(st_thread_t t = st_thread_self())
 {
   if (threadnames.find(t) != threadnames.end()) {
     return threadnames[t];
@@ -131,7 +136,7 @@
 /**
  * Debug function for thread names.  Remember what we're switching from.
  */
-void
+inline void
 switch_out_cb()
 {
   if (debug_threads)
     last_thread = st_thread_self();
@@ -142,7 +147,7 @@
 /**
  * Debug function for thread names.  Show what we're switching from/to.
  */
-void switch_in_cb()
+inline void switch_in_cb()
 {
   if (debug_threads && last_thread != st_thread_self()) {
     cout << "switching";
@@ -303,7 +308,7 @@
 }
 
 template<typename T>
-void
+inline void
 ser(ser_array &s, const T &msg)
 {
   int len = msg.ByteSize();
@@ -320,7 +325,7 @@
  * Serialization.
  */
 template<typename T>
-void
+inline void
 ser(ostream &s, const T &msg)
 {
   uint32_t len = htonl(uint32_t(msg.ByteSize()));
@@ -492,7 +497,7 @@
  * for avoiding unnecessary copies.
  */
 template <typename T>
-T
+inline T
 readmsg(st_netfd_t src, st_utime_t timeout = ST_UTIME_NO_TIMEOUT)
 {
   T msg;
@@ -505,7 +510,7 @@
  * st_netfd_t.
 */
 template <typename T>
-void
+inline void
 readmsg(st_reader &src, T & msg)
 {
   managed_array<char> a = src.read(sizeof(uint32_t));
@@ -602,11 +607,14 @@
       // one) to prepare to send recovery information (by sending an
       // empty/default Txn).
       if (!newreps.empty() && seqno > 0) {
-        if (multirecover) {
-          bcastmsg(fds, batch);
-        } else {
-          sendmsg(fds[0], batch);
+        start_txn(batch);
+        fin_txn(batch);
+        w.mark();
+        if (Types::is_pb()) {
+          if (multirecover) bcastmsg(fds, batch);
+          else sendmsg(fds[0], batch);
         }
+        batch.Clear();
       }
       // Bring in any new members.
       // TODO more efficient: copy/extend/append
@@ -642,15 +650,14 @@
       }
 
       // Checkpoint.
-      if (seqno % chkpt == 0) {
-        if (verbose)
-          cout << "issued txn " << seqno << endl;
+      if (check_interval(seqno, yield_interval)) st_sleep(0);
+      if (check_interval(seqno, issue_display)) {
+        cout << "issued txn " << seqno << endl;
         if (timelim > 0 && current_time_millis() - start_time > timelim) {
           cout << "time's up; issued " << seqno << " txns in " << timelim
                << " ms" << endl;
           stop_hub.set();
         }
-        st_sleep(0);
       }
 
       // For debugging purposes.
@@ -809,14 +816,20 @@
 }
 #end
 
-template<typename Txn> shared_ptr<pb::Txn> to_pb_Txn(Txn txn);
-template<> shared_ptr<pb::Txn> to_pb_Txn(pb::Txn txn) {
+template<typename Txn> inline shared_ptr<pb::Txn> to_pb_Txn(Txn txn);
+template<> inline shared_ptr<pb::Txn> to_pb_Txn(pb::Txn txn) {
   return shared_ptr<pb::Txn>(new pb::Txn(txn));
 }
-template<> shared_ptr<pb::Txn> to_pb_Txn(msg::Txn txn) {
+template<> inline shared_ptr<pb::Txn> to_pb_Txn(msg::Txn txn) {
   shared_ptr<pb::Txn> ptxn(new pb::Txn());
   ptxn->set_seqno(txn.seqno());
-  // XXX FIXME
+  for (int o = 0; o < txn.op_size(); ++o) {
+    pb::Op *pop = ptxn->add_op();
+    const msg::Op &op = txn.op(o);
+    pop->set_type(static_cast<Op_OpType>(op.type()));
+    pop->set_key(op.key());
+    pop->set_value(op.value());
+  }
   return ptxn;
 }
 
@@ -946,14 +959,12 @@
         action = "backlogged";
       }
 
-      if (txn.seqno() % chkpt == 0) {
-        if (verbose) {
-          cout << action << " txn " << txn.seqno()
-               << "; db size = " << map.size()
-               << "; seqno = " << seqno
-               << "; backlog.size = " << backlog.queue().size() << endl;
-        }
-        st_sleep(0);
+      if (check_interval(txn.seqno(), yield_interval)) st_sleep(0);
+      if (check_interval(txn.seqno(), process_display)) {
+        cout << action << " txn " << txn.seqno()
+             << "; db size = " << map.size()
+             << "; seqno = " << seqno
+             << "; backlog.size = " << backlog.queue().size() << endl;
       }
     }
     fin_res(resbatch);
@@ -962,9 +973,10 @@
       ser(serbuf, batch);
       sendbuf(leader, serbuf);
     }
-  } else {
+  } else if (multirecover || mypos == 0) {
     // Empty (default) TxnBatch means "generate a snapshot."
     // TODO make this faster
+    cout << "generating recovery..." << endl;
     shared_ptr<Recovery> recovery(new Recovery);
     typedef ::map<int, int> mii_;
     mii_ map_(map.begin(), map.end());
@@ -1090,13 +1102,15 @@
         stop_hub.set();
       }
     }
-    if (display_interval > 0 && rseqno % display_interval == 0 && rseqno > 0) {
+    if (check_interval(rseqno, handle_responses_display)) {
       cout << rid << ": " << "got response " << rseqno << " from "
            << replica << "; ";
       long long display_time = current_time_millis();
       showtput("handling", display_time, last_display_time, rseqno,
-               rseqno - display_interval);
+               rseqno - handle_responses_display);
       last_display_time = display_time;
+    }
+    if (check_interval(rseqno, yield_interval)) {
       st_sleep(0);
     }
     last_seqno = rseqno;
@@ -1439,7 +1453,7 @@
   for (int i = 0; i < recovery.pair_size(); ++i) {
     const Recovery_Pair &p = recovery.pair(i);
     __ref(map)[p.key()] = p.value();
-    if (i % chkpt == 0) {
+    if (i % yield_interval == 0) {
       if (yield_during_build_up) st_sleep(0);
     }
   }
@@ -1462,14 +1476,17 @@
   while (!backlog.empty()) {
     using pb::Txn;
     shared_ptr<Txn> p = backlog.take();
-    process_txn<pb_types, pb_types>(map, *p, seqno, nullptr);
-    if (p->seqno() % chkpt == 0) {
-      if (verbose)
+    if (p->seqno() > seqno) {
+      process_txn<pb_types, pb_types>(map, *p, seqno, nullptr);
+      if (check_interval(p->seqno(), catch_up_display)) {
        cout << "processed txn " << p->seqno() << " off the backlog; "
             << "backlog.size = " << backlog.queue().size() << endl;
-      // Explicitly yield.  (Note that yielding does still effectively
-      // happen anyway because process_txn is a yield point.)
-      st_sleep(0);
+      }
+      if (check_interval(p->seqno(), yield_interval)) {
+        // Explicitly yield.  (Note that yielding does still effectively
+        // happen anyway because process_txn is a yield point.)
+        st_sleep(0);
+      }
     }
   }
   showtput("replayer caught up; from backlog replayed",
@@ -1483,6 +1500,12 @@
   stop_hub.insert(st_thread_self());
 }
 
+inline bool
+check_interval(int seqno, int interval)
+{
+  return interval > 0 && seqno % interval == interval - 1;
+}
+
 int sig_pipe[2];
 
 /**
@@ -1559,16 +1582,14 @@
      "enable context switch debug outputs")
     ("profile-threads,q",po::bool_switch(&profile_threads),
      "enable profiling of threads")
-    ("verbose,v", po::bool_switch(&verbose),
-     "enable periodic printing of txn processing progress")
     ("epoll,e", po::bool_switch(&use_epoll),
      "use epoll (select is used by default)")
     ("yield-build-up", po::bool_switch(&yield_during_build_up),
-     "yield periodically during build-up phase of recovery (for recoverer only)")
+     "yield periodically during build-up phase of recovery (for recoverer)")
     ("yield-catch-up", po::bool_switch(&yield_during_catch_up),
-     "yield periodically during catch-up phase of recovery (for recoverer only)")
+     "yield periodically during catch-up phase of recovery (for recoverer)")
     ("multirecover,m", po::bool_switch(&multirecover),
-     "recover from multiple hosts, instead of just one (specified via leader only)")
+     "recover from multiple hosts, instead of just one (specified via leader)")
     ("disk,k", po::bool_switch(&disk),
      "use disk-based recovery")
     ("dump,D", po::bool_switch(&dump),
@@ -1585,7 +1606,7 @@
     ("count-updates,u",po::bool_switch(&count_updates),
      "count operations that touch (update/read/delete) an existing key")
     ("general-txns,g", po::bool_switch(&general_txns),
-     "issue read and delete transactions as well as the default of (only) insertion/update transactions (for leader only)")
+     "issue read and delete transactions as well as the default of (only) insertion/update transactions (for leader)")
     ("use-pb", po::bool_switch(&use_pb),
      "use protocol buffers instead of raw buffers for txns")
     ("use-pb-res", po::bool_switch(&use_pb_res),
@@ -1597,26 +1618,36 @@
     ("leader,l", po::bool_switch(&is_leader),
      "run the leader (run replica by default)")
     ("exit-on-recovery,x", po::bool_switch(&stop_on_recovery),
-     "exit after the joiner fully recovers (for leader only)")
+     "exit after the joiner fully recovers (for leader)")
     ("batch-size,b", po::value<int>(&batch_size)->default_value(100),
-     "number of txns to batch up in each msg (for leader only)")
+     "number of txns to batch up in each msg (for leader)")
    ("exit-on-seqno,X",po::value<int>(&stop_on_seqno)->default_value(-1),
-     "exit after txn seqno is issued (for leader only)")
+     "exit after txn seqno is issued (for leader)")
     ("accept-joiner-size,s",
      po::value<size_t>(&accept_joiner_size)->default_value(0),
      "accept recovering joiner (start recovery) after DB grows to this size "
-     "(for leader only)")
-    ("disp-interval", po::value<int>(&display_interval)->default_value(0),
-     "after this many txns, print current handling rate")
-    ("issuing-interval,i",
+     "(for leader)")
+    ("handle-responses-display",
+     po::value<int>(&handle_responses_display)->default_value(0),
+     "number of responses before printing current handling rate (for leader)")
+    ("catch-up-display",
+     po::value<int>(&catch_up_display)->default_value(0),
+     "number of catch-up txns before printing current recovery rate and queue length (for recoverer)")
+    ("issue-display",
+     po::value<int>(&issue_display)->default_value(0),
+     "number of txns before showing the current issue rate (for leader)")
+    ("process-display",
+     po::value<int>(&process_display)->default_value(0),
+     "number of txns before showing the current issue rate (for worker)")
+    ("issuing-interval",
      po::value<int>(&issuing_interval)->default_value(0),
-     "seconds to sleep between issuing txns (for leader only)")
+     "seconds to sleep between issuing txns (for leader)")
     ("min-ops,o", po::value<int>(&min_ops)->default_value(5),
-     "lower bound on randomly generated number of operations per txn (for leader only)")
+     "lower bound on randomly generated number of operations per txn (for leader)")
     ("max-ops,O", po::value<int>(&max_ops)->default_value(5),
-     "upper bound on randomly generated number of operations per txn (for leader only)")
+     "upper bound on randomly generated number of operations per txn (for leader)")
     ("accept-joiner-seqno,j",
      po::value<int>(&accept_joiner_seqno)->default_value(0),
      "accept recovering joiner (start recovery) after this seqno (for leader "
@@ -1631,18 +1662,18 @@
      "size of the incoming (read) buffer in bytes")
     ("write-buf", po::value<size_t>(&buf_size)->default_value(1e5),
      "size of the outgoing (write) buffer in bytes")
-    ("chkpt,c", po::value<int>(&chkpt)->default_value(1000),
-     "number of txns before yielding/verbose printing")
+    ("yield_interval,y", po::value<int>(&yield_interval)->default_value(10000),
+     "number of txns before yielding")
     ("timelim,T", po::value<long long>(&timelim)->default_value(0),
      "general network IO time limit in milliseconds, or 0 for none")
     ("write-thresh,w", po::value<long long>(&write_thresh)->default_value(200),
-     "if positive and any txn write exceeds this, then print a message (for replicas only)")
+     "if positive and any txn write exceeds this, then print a message")
     ("read-thresh,r", po::value<long long>(&read_thresh)->default_value(0),
-     "if positive and any txn read exceeds this, then print a message (for replicas only)")
+     "if positive and any txn read exceeds this, then print a message")
     ("listen-port,p", po::value<uint16_t>(&listen_port)->default_value(7654),
-     "port to listen on (replicas only)")
+     "port to listen on (for worker)")
     ("timeout,t", po::value<st_utime_t>(&timeout)->default_value(200000),
-     "timeout for IO operations (in microseconds)")
+     "timeout for some IO operations that should actually time out (in microseconds)")
     ("test", "execute unit tests instead of running the normal system")
     ("minreps,n", po::value<int>(&minreps)->default_value(2),
      "minimum number of replicas the system is willing to process txns on");

|
From: <yan...@us...> - 2009-03-08 09:39:18
|
Revision: 1269
          http://assorted.svn.sourceforge.net/assorted/?rev=1269&view=rev
Author:   yangzhang
Date:     2009-03-08 09:39:11 +0000 (Sun, 08 Mar 2009)

Log Message:
-----------
- cleanup
- using fast_map
- added -DNDEBUG to optimized build
- specialized recovery message generation
- using unique_ptr instead of shared_ptr for Recovery channel
- added some more notes/todos
- added ghash to setup

Modified Paths:
--------------
    ydb/trunk/README
    ydb/trunk/src/Makefile
    ydb/trunk/src/main.lzz.clamp
    ydb/trunk/src/ser.h
    ydb/trunk/tools/test.bash

Modified: ydb/trunk/README
===================================================================
--- ydb/trunk/README	2009-03-07 21:30:30 UTC (rev 1268)
+++ ydb/trunk/README	2009-03-08 09:39:11 UTC (rev 1269)
@@ -473,8 +473,58 @@
   - abysmal perf; long wait at the map dump + almost never catch up, but at
     least it works
 
-- TODO speed up backlogging; don't create pb objects, just take buffers
+- report for sam
+  - got the speed up to as fast as it'll go before 1000
+  - added disk logging for workers; still need to grab numbers for the
+    single-node ('no replica') case
+  - added physical logging: slower
+  - adding log transfer vs. state transfer
 
+- DONE added byte length prefixes for faster backlogging
+- DONE speed up backlogging; don't create pb objects, just take buffers
+
+  pseudocode (out of date/buggy)
+    r.setanchor
+    first_start = r.start
+    while true
+      start = r.start
+      headerlen = sizeof([prefix, ntxns, seqno])
+      if r.unread + r.rem < headerlen
+        buf = new buf
+        buf.write([r.start..r.end])
+        swap(r.buf, buf)
+        backlog.push(buf, first_start, start)
+        r.reset
+        first_start = r.start
+      prefix = r.read
+      ntxns = r.read
+      seqno = r.read
+      if ...seqno...
+        if r.rem < prefix - headerlen
+          buf = new buf
+          buf.write([prefix, ntxns, seqno] + [r.start..r.end])
+          swap(r.buf, buf)
+          backlog.push(buf, first_start, start)
+          r.reset
+          first_start = r.start
+      assert r.rem >= prefix - headerlen
+      check0 r.accum(prefix - headerlen)
+
+- DONE notify process_txns to "flush" to backlog (caught up)
+
+- DONE pb_types -> pb_traits, etc.
+
+- DONE try building and using your own map type; compare against other
+  containers
+  - built something really fast, faster than even google dense_hash_map
+
+- TODO experiment with large pages
+
+- TODO use rb instead of pb for recovery state
+
+- TODO test out recovery mode more thoroughly, make sure progress is being
+  made, see how fast it is
+
 - TODO fix multi-recovery if necessary
 
 - TODO speed up map dump; don't use range partitioning, but hash partitioning

Modified: ydb/trunk/src/Makefile
===================================================================
--- ydb/trunk/src/Makefile	2009-03-07 21:30:30 UTC (rev 1268)
+++ ydb/trunk/src/Makefile	2009-03-08 09:39:11 UTC (rev 1269)
@@ -29,7 +29,7 @@
   PPROF := -lprofiler
 endif
 ifneq ($(OPT),)
-  OPT := -O3 -Wdisabled-optimization
+  OPT := -O3 -Wdisabled-optimization -DNDEBUG
 else
   OPT := -g3
 endif

Modified: ydb/trunk/src/main.lzz.clamp
===================================================================
--- ydb/trunk/src/main.lzz.clamp	2009-03-07 21:30:30 UTC (rev 1268)
+++ ydb/trunk/src/main.lzz.clamp	2009-03-08 09:39:11 UTC (rev 1269)
@@ -8,6 +8,8 @@
 #include <boost/scoped_array.hpp>
 #include <boost/shared_ptr.hpp>
 #include <boost/tuple/tuple.hpp>
+#include <boost/unique_ptr.hpp>
+#include <commons/fast_map.h>
 #include <commons/nullptr.h>
 #include <commons/rand.h>
 #include <commons/st/st.h>
@@ -52,9 +54,10 @@
 //#define map_t unordered_map
 //#define map_t map
-#define map_t dense_hash_map
-typedef pair<int, int> pii;
+//#define map_t dense_hash_map
+#define map_t fast_map
 typedef map_t<int, int> mii;
+typedef pair<int, int> pii;
 typedef tuple<sized_array<char>, char*, char*> chunk;
@@ -63,6 +66,10 @@
   map.set_empty_key(-1);
   map.set_deleted_key(-2);
 }
+template<> void init_map(fast_map<int, int> &map) {
+  map.set_empty_key(-1);
+  map.set_deleted_key(-2);
+}
 
 // Configuration.
 st_utime_t timeout;
@@ -731,7 +738,7 @@
  */
 template<typename Types, typename RTypes>
 void
-process_txn(mii&map, const typename Types::Txn &txn, int &seqno,
+process_txn(mii &map, const typename Types::Txn &txn, int &seqno,
             typename RTypes::Response *res)
 {
   typedef typename Types::Txn Txn;
@@ -821,23 +828,6 @@
 }
 #end
 
-template<typename Txn> inline shared_ptr<pb::Txn> to_pb_Txn(Txn txn);
-template<> inline shared_ptr<pb::Txn> to_pb_Txn(pb::Txn txn) {
-  return shared_ptr<pb::Txn>(new pb::Txn(txn));
-}
-template<> inline shared_ptr<pb::Txn> to_pb_Txn(msg::Txn txn) {
-  shared_ptr<pb::Txn> ptxn(new pb::Txn());
-  ptxn->set_seqno(txn.seqno());
-  for (int o = 0; o < txn.op_size(); ++o) {
-    pb::Op *pop = ptxn->add_op();
-    const msg::Op &op = txn.op(o);
-    pop->set_type(static_cast<Op_OpType>(op.type()));
-    pop->set_key(op.key());
-    pop->set_value(op.value());
-  }
-  return ptxn;
-}
-
 /**
  * Actually do the work of executing a transaction and sending back the reply.
  *
@@ -866,7 +856,7 @@
 template<typename Types, typename RTypes>
 void
 process_txns(st_netfd_t leader, mii &map, int &seqno,
-             st_channel<shared_ptr<Recovery> > &send_states,
+             st_channel<unique_ptr<Recovery> > &send_states,
              /* XXX st_channel<shared_ptr<pb::Txn> > &backlog */
              st_channel<chunk> &backlog,
             int init_seqno, int mypos, int nnodes)
@@ -906,7 +896,7 @@
       showtput("live-processed", now, __ref(time_caught_up),
                __ref(seqno), __ref(seqno_caught_up));
     }
-    __ref(send_states).push(shared_ptr<Recovery>());
+    __ref(send_states).push(unique_ptr<Recovery>());
     __ref(w).mark_and_flush();
     st_sleep(1);
   });
@@ -1038,30 +1028,9 @@
     }
   } else if (multirecover || mypos == 0) {
     // Empty (default) TxnBatch means "generate a snapshot."
-    // TODO make this faster
-    cout << "generating recovery..." << endl;
-    shared_ptr<Recovery> recovery(new Recovery);
-    typedef ::map<int, int> mii_;
-    mii_ map_(map.begin(), map.end());
-    mii_::const_iterator begin =
-      map_.lower_bound(multirecover ? interp(RAND_MAX, mypos, nnodes) : 0);
-    mii_::const_iterator end = multirecover && mypos < nnodes - 1 ?
-      map_.lower_bound(interp(RAND_MAX, mypos + 1, nnodes)) : map_.end();
-    cout << "generating recovery over " << begin->first << ".."
-         << (end == map_.end() ? "end" : lexical_cast<string>(end->first));
-    if (multirecover)
-      cout << " (node " << mypos << " of " << nnodes << ")";
-    cout << endl;
-    long long start_snap = current_time_millis();
-    foreach (const pii &p, make_iterator_range(begin, end)) {
-      Recovery_Pair *pair = recovery->add_pair();
-      pair->set_key(p.first);
-      pair->set_value(p.second);
-    }
-    cout << "generating recovery took "
-         << current_time_millis() - start_snap << " ms" << endl;
+    unique_ptr<Recovery> recovery(make_recovery(map, mypos, nnodes));
     recovery->set_seqno(seqno);
-    send_states.push(recovery);
+    send_states.push(boost::move(recovery));
   }
 } catch (break_exception &ex) {
@@ -1069,6 +1038,33 @@
 }
 
+template<typename mii>
+unique_ptr<Recovery> make_recovery(const mii &map, int mypos, int nnodes) {
+  // TODO make this faster
+  cout << "generating recovery..." << endl;
+  unique_ptr<Recovery> recovery(new Recovery);
+  typedef ::map<int, int> mii_;
+  mii_ map_(map.begin(), map.end());
+  mii_::const_iterator begin =
+    map_.lower_bound(multirecover ? interp(RAND_MAX, mypos, nnodes) : 0);
+  mii_::const_iterator end = multirecover && mypos < nnodes - 1 ?
+    map_.lower_bound(interp(RAND_MAX, mypos + 1, nnodes)) : map_.end();
+  cout << "generating recovery over " << begin->first << ".."
+       << (end == map_.end() ? "end" : lexical_cast<string>(end->first));
+  if (multirecover)
+    cout << " (node " << mypos << " of " << nnodes << ")";
+  cout << endl;
+  long long start_snap = current_time_millis();
+  foreach (const pii &p, make_iterator_range(begin, end)) {
+    Recovery_Pair *pair = recovery->add_pair();
+    pair->set_key(p.first);
+    pair->set_value(p.second);
+  }
+  cout << "generating recovery took "
+       << current_time_millis() - start_snap << " ms" << endl;
+  return boost::move(recovery);
+}
+
 class response_handler
 {
 public:
@@ -1100,7 +1096,7 @@
   commons::array<char> rbuf(read_buf_size), wbuf(buf_size);
   st_reader reader(replica, rbuf.get(), rbuf.size());
   writer w(lambda(const void*, size_t) {
-    throw operation_not_supported("response handler should not be writing");
+    throw not_supported_exception("response handler should not be writing");
   }, wbuf.get(), wbuf.size());
   stream s(reader,w);
@@ -1258,10 +1254,10 @@
  */
 void
 recover_joiner(st_netfd_t listener,
-               st_channel<shared_ptr<Recovery> > &send_states)
+               st_channel<unique_ptr<Recovery> > &send_states)
 {
   st_netfd_t joiner;
-  shared_ptr<Recovery> recovery;
+  unique_ptr<Recovery> recovery;
   {
     st_intr intr(stop_hub);
     // Wait for the snapshot.
@@ -1441,7 +1437,7 @@
       }
     }
   });
-  st_channel<shared_ptr<Recovery> > send_states;
+  st_channel<unique_ptr<Recovery> > send_states;
 
   cout << "starting as replica on port " << listen_port << endl;
@@ -1551,7 +1547,7 @@
   commons::array<char> rbuf(0), wbuf(buf_size);
   reader reader(nullptr, rbuf.get(), rbuf.size());
   writer writer(lambda(const void*, size_t) {
-    throw operation_not_supported("should not be writing responses during catch-up phase");
+    throw not_supported_exception("should not be writing responses during catch-up phase");
   }, wbuf.get(), wbuf.size());
   stream s(reader, writer);
   TxnBatch batch(s);

Modified: ydb/trunk/src/ser.h
===================================================================
--- ydb/trunk/src/ser.h	2009-03-07 21:30:30 UTC (rev 1268)
+++ ydb/trunk/src/ser.h	2009-03-08 09:39:11 UTC (rev 1269)
@@ -34,13 +34,13 @@
 }
 
 #define EXPAND_PB \
-  bool AppendToString(string*) const { throw_operation_not_supported(); } \
-  bool SerializeToArray(void*, int) const { throw_operation_not_supported(); } \
-  bool SerializeToString(string*) const { throw_operation_not_supported(); } \
-  bool SerializeToOstream(ostream*) const { throw_operation_not_supported(); } \
-  bool ParseFromArray(void*, int) { throw_operation_not_supported(); } \
-  int GetCachedSize() const { throw_operation_not_supported(); } \
-  int ByteSize() const { throw_operation_not_supported(); } \
+  bool AppendToString(string*) const { throw_not_supported(); } \
+  bool SerializeToArray(void*, int) const { throw_not_supported(); } \
+  bool SerializeToString(string*) const { throw_not_supported(); } \
+  bool SerializeToOstream(ostream*) const { throw_not_supported(); } \
+  bool ParseFromArray(void*, int) { throw_not_supported(); } \
+  int GetCachedSize() const { throw_not_supported(); } \
+  int ByteSize() const { throw_not_supported(); } \
 
 #define MAKE_TYPE_BATCH(name, ns, b) \
   struct name##_traits { \

Modified: ydb/trunk/tools/test.bash
===================================================================
--- ydb/trunk/tools/test.bash	2009-03-07 21:30:30 UTC (rev 1268)
+++ ydb/trunk/tools/test.bash	2009-03-08 09:39:11 UTC (rev 1269)
@@ -225,6 +225,7 @@
   parremote node-setup-bison
   parremote node-setup-clamp
   parremote node-setup-gtest
+  parremote node-setup-ghash
 }
 
 setup-ydb() {

|