[Assorted-commits] SF.net SVN: assorted:[1180] ydb/trunk/README
From: <yan...@us...> - 2009-02-13 20:57:52
Revision: 1180
http://assorted.svn.sourceforge.net/assorted/?rev=1180&view=rev
Author: yangzhang
Date: 2009-02-13 20:57:48 +0000 (Fri, 13 Feb 2009)
Log Message:
-----------
added notes
Modified Paths:
--------------
ydb/trunk/README
Modified: ydb/trunk/README
===================================================================
--- ydb/trunk/README 2009-02-13 20:57:32 UTC (rev 1179)
+++ ydb/trunk/README 2009-02-13 20:57:48 UTC (rev 1180)
@@ -282,12 +282,79 @@
Period: 2/5-
-- TODO commit!!!
+- DONE commit!!!
+- DONE google profiling
+ - doesn't work well with this app since no samples are generated while in a
+ blocking syscall
+ - top: randint, readmsg, ... lots of ties, hard to tell
+- DONE thread profiling
+ - leader:
+ - 60% in compute: 40% handle_responses, 60% issue_txns
+ - 40% unaccounted
+ - replica:
+ - 90-100% in process_txns thread
+ - perftools showed a lot of samples in map operations and in process_txn,
+ but not sure how to trim down process_txn any more
+- DONE replace the map with unordered_map, start using -O3
+ - this gave a noticeable performance boost, from around 45 Ktps to 65 Ktps
+ - now performing better than the unoptimized, simple-disk variant
+ - issues ~72 Ktps
+ - the optimized, simple-disk variant does much better as well, unfortunately
+ (nearly 200 Ktps)
+ - this is close to the 70 Ktps from Abadi's H-Store paper
+- DONE asynchronously bcast txns
+ - per replica count: issuing rate, handle_responses %, issuing net
+ throughput, processing rate
+ - 0: 260 Ktps, 0%, N/A, 260 Ktps
+ - 1: <250 Ktps, 70%, 21 MB/s, 65 Ktps
+ - 2: 75 Ktps, 38% 38%, 6.4 MB/s, 65 Ktps
+ - 3: 53 Ktps, 27% 27% 27%, 4.5 MB/s, 50 Ktps
+ - from 0 to 1: it seems that Response serialization is taking up a lot of
+ time?
+ - tried removing Response construction, but that barely changed anything
+ - with fewer than 3 nodes, we are bottlenecked by the CPU throughput of
+ process_txn (65 Ktps)
+- DONE asynchronously bcast responses
+ - async (handling rate, by replica count):
+ - 0: 260 Ktps
+ - 1: 90 Ktps
+ - 2: 66 Ktps
+ - 3: 46 Ktps
+ - sync (handling rate, by replica count):
+ - 0: 260 Ktps
+ - 1: 63 Ktps: this shows that async does make a big difference
+ - 2: 60 Ktps: this and the next are closer to the async numbers because of
+ the readmsg bottleneck
+ - 3: 43 Ktps
+ - at 3, we are bottlenecked by the leader in handle_responses (apparently) to
+ 50 Ktps
+ - we are definitely not bottlenecked by IO throughput
+- DONE make readmsg perform fewer syscalls (buffer opportunistically; see the
+ sketch below)
+ - like magic: now can sustain 90 Ktps all the way up through 3 xacts!
+
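+The buffering idea, roughly (an illustrative sketch; `BufferedReader` and
+`readn` are made-up names, not ydb's actual readmsg code): pull as much as the
+kernel will hand over in one read() and serve later requests from user space.
+
+```cpp
+// Sketch: serve readmsg-style requests from a user-space buffer so that most
+// calls never hit the kernel; each read() grabs a large chunk.
+// Assumes a blocking socket; a non-blocking one would also need EAGAIN handling.
+#include <unistd.h>
+#include <cstddef>
+#include <string>
+
+class BufferedReader {
+ public:
+  explicit BufferedReader(int fd) : fd_(fd), pos_(0) {}
+
+  // Copy exactly n bytes into out, reading from the fd only when the buffer
+  // runs dry; returns false on EOF or error.
+  bool readn(std::size_t n, std::string &out) {
+    while (buf_.size() - pos_ < n) {
+      char chunk[64 * 1024];
+      ssize_t got = read(fd_, chunk, sizeof chunk);  // one big syscall
+      if (got <= 0) return false;
+      buf_.append(chunk, static_cast<std::size_t>(got));
+    }
+    out.assign(buf_, pos_, n);
+    pos_ += n;
+    if (pos_ == buf_.size()) { buf_.clear(); pos_ = 0; }  // recycle the buffer
+    return true;
+  }
+
+ private:
+  int fd_;
+  std::string buf_;
+  std::size_t pos_;
+};
+```
+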
+Period
+
+- DONE p2 prototype
+ - some interesting performance bugs
+ - forgot to make some sockets non-blocking, e.g. the accepted client socket
+ and the client's socket to the server; everything still works with select
+ (see the non-blocking sketch after this list)
+ - I was indeed forgetting to set this in ydb as well
+ - was inadvertently issuing a read() call every time some number of bytes
+ was requested
+ - made a big difference in leveling the field between smaller and larger
+ msg sizes
+ - this was hurting me only slightly in ydb, it seems
+ - was not aggressively consuming as many msgs as possible, only 1 at a time
+ (per return from select)
+- DONE batch responses
+ - made a marked difference; ~100 Ktps -> ~140 Ktps (for 1-4 replicas)
+- TODO flushing
+- TODO make the logger a "single replica"
+- TODO oprofile
+ - not giving much info either for things that are stalled on IO
- TODO serialization bench (multiple layers, control batch sizes)
- TODO network throughput bench
- TODO associative container bench
- TODO combine the analyses of the above three; integrate with actual message
- formats, etc.
+ formats, etc.; updated README
- TODO batching, serialization, disk speed
- TODO better wal
- TODO better understand multihost recovery
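+
+For reference, the non-blocking fix mentioned under the p2 prototype notes
+boils down to roughly this (a generic POSIX sketch; `set_nonblocking` is an
+illustrative name, not ydb's helper):
+
+```cpp
+// Every socket that select() will hand back, including ones returned by
+// accept(), must itself be marked O_NONBLOCK, or a stray read/write can
+// stall the event loop.
+#include <fcntl.h>
+
+bool set_nonblocking(int fd) {
+  int flags = fcntl(fd, F_GETFL, 0);
+  if (flags < 0) return false;
+  return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
+}
+
+// e.g. right after accept():
+//   int client = accept(listen_fd, 0, 0);
+//   if (client >= 0 && !set_nonblocking(client)) { /* handle error */ }
+```
+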
@@ -295,7 +362,7 @@
- TODO data structures benchmark
- TODO implement checkpointing disk-based scheme
- TODO implement log-based recovery; show that it sucks
-- TODO implement group (batch) commit for log-based recovery
+- TODO implement group (batch) commit (sync) for log-based recovery
- TODO try scaling up
- TODO serialize outputs from the various clients to a single merger to (1)
have ordering over the (timestamped) messages, and (2) avoid interleaved
@@ -359,3 +426,62 @@
- TODO differences from: harbor, harp, aries
- TODO understand 2pc, paxos, etc.
+
+Notes
+-----
+
+### IO limits
+
+Theoretically, with a GigE network connection, you can max out at roughly 800
+Mb/s or 100 MB/s. Assuming each transaction can be encoded in about 50 bytes,
+you can push 100e6 B/s / 50 B/txn = 2e6 txn/s.
+
+Imagine that this is the only network traffic we need to worry about, so we
+don't have any other replicas to dispatch transactions to or any responses to
+receive (though GigE is full-duplex).
+
+As of 2008, a typical 7200rpm desktop hard drive has a sustained
+"disk-to-buffer" data transfer rate of about 70 MB/s[^1]. This is slightly
+less than the max rate of a GigE network, so we do expect to do better with the
+network---but not substantially.
+
+[^1]: <http://en.wikipedia.org/wiki/Hard_disk_drive>
+
+### Compute limits
+
+To be able to process 2e6 txn/s on a 1GHz CPU, we must spend at most 1e9
+cycle/s / 2e6 txn/s = 500 cycle/txn. Assuming:
+
+- 100 ns for main memory
+- 10 ns for L2 cache
+- 1 ns for L1 cache
+
+At 1 ns/cycle, a single cache miss will take up to a fifth of the allotted
+processing time per txn. (TODO: what's the baseline for syscalls?)
+
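+A throwaway sanity check of the arithmetic above and in the IO-limits section
+(the constants are the stated assumptions, not measurements):
+
+```cpp
+#include <cstdio>
+
+int main() {
+  const double net_Bps      = 100e6;  // ~GigE payload rate
+  const double disk_Bps     = 70e6;   // 7200rpm sustained transfer rate
+  const double B_per_txn    = 50;     // assumed encoded txn size
+  const double cycles_per_s = 1e9;    // 1 GHz CPU, 1 ns/cycle
+  const double miss_cycles  = 100;    // one main-memory access
+
+  double net_tps  = net_Bps / B_per_txn;     // 2e6 txn/s
+  double disk_tps = disk_Bps / B_per_txn;    // 1.4e6 txn/s
+  double budget   = cycles_per_s / net_tps;  // 500 cycles/txn
+  std::printf("network-bound %.2g txn/s, disk-bound %.2g txn/s\n",
+              net_tps, disk_tps);
+  std::printf("budget %.0f cycles/txn; one cache miss eats %.0f%% of it\n",
+              budget, 100 * miss_cycles / budget);
+}
+```
+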
+In a tight loop, a `std::map` can process 1M sequential insertions in ~500 ms,
+or ~2M/s (note that these are *not* random keys, but keys incrementing from 0).
+If a txn inserts 5 records, this equates to 400 Ktps. An `stx::btree_map` takes
+~250 ms (4M/s or 800 Ktps), and a `tr1::unordered_map` (hash table) takes
+~200 ms (5M/s or 1 Mtps). For reference, we can populate a raw array
+sequentially in ~5 ms (200M/s or 40 Mtps). (These microbenchmarks come from
+container-bench.)
+
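+The container-bench loop is roughly of the following shape (a reconstruction
+for illustration, using C++11 names; `stx::btree_map` is omitted to keep the
+sketch dependency-free):
+
+```cpp
+// Time 1M sequential-key insertions into each container, as in the numbers
+// quoted above. std::unordered_map stands in for tr1::unordered_map.
+#include <chrono>
+#include <cstddef>
+#include <cstdio>
+#include <map>
+#include <unordered_map>
+
+template <typename Map>
+static double insert_ms(std::size_t n) {
+  Map m;
+  auto start = std::chrono::steady_clock::now();
+  for (std::size_t i = 0; i < n; ++i)
+    m[i] = i;  // sequential keys, not random ones
+  std::chrono::duration<double, std::milli> elapsed =
+      std::chrono::steady_clock::now() - start;
+  return elapsed.count();
+}
+
+int main() {
+  const std::size_t n = 1000000;
+  std::printf("std::map           %.0f ms\n",
+              insert_ms<std::map<std::size_t, std::size_t> >(n));
+  std::printf("std::unordered_map %.0f ms\n",
+              insert_ms<std::unordered_map<std::size_t, std::size_t> >(n));
+}
+```
+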
+These results suggest that we can expect to be bounded by the CPU/memory, and
+not IO throughput. With hash tables, we can come close to the target of 2e6
+txn/s, but can't expect to exceed it.
+
+In the H-Store paper "The End of an Architectural Era", the prototype achieves
+70,416 TPC-C txn/s, 82 times faster than the 850 txn/s achieved on the
+commercial system.
+
+### Workloads
+
+eBay processes about 26 billion txn/day, or roughly 300 Ktps.
+
+Discussion
+----------
+
+Disk vs. network: practical considerations
+
+- Network is simpler, can be stateless
+- Disk may quickly grow stale during downtime