[Assorted-commits] SF.net SVN: assorted:[1180] ydb/trunk/README
From: <yan...@us...> - 2009-02-13 20:57:52
Revision: 1180
http://assorted.svn.sourceforge.net/assorted/?rev=1180&view=rev
Author: yangzhang
Date: 2009-02-13 20:57:48 +0000 (Fri, 13 Feb 2009)
Log Message:
-----------
added notes
Modified Paths:
--------------
ydb/trunk/README
Modified: ydb/trunk/README
===================================================================
--- ydb/trunk/README 2009-02-13 20:57:32 UTC (rev 1179)
+++ ydb/trunk/README 2009-02-13 20:57:48 UTC (rev 1180)
@@ -282,12 +282,79 @@
Period: 2/5-
-- TODO commit!!!
+- DONE commit!!!
+- DONE google profiling
+ - doesn't work well with this app since no samples are generated while in a
+ blocking syscall
+ - top: randint, readmsg, ... lots of ties, hard to tell
+- DONE thread profiling
+ - leader:
+ - 60% in compute: 40% handle_responses, 60% issue_txns
+ - 40% unaccounted
+ - replica:
+ - 90-100% in process_txns thread
+ - perftools showed a lot of samples in map operations and in process_txn,
+ but not sure how to trim down process_txn any more
+- DONE replace the map with unordered_map, start using -O3
+ - this gave a noticeable performance boost, from around 45 Ktps to 65 Ktps
+ - now performing better than the unoptimized, simple-disk variant
+ - issues ~72 Ktps
+ - the optimized, simple-disk variant does much better as well, unfortunately
+ (nearly 200 Ktps)
+ - this is close to the 70 Ktps from Abadi's H-Store paper
+- DONE asynchronously bcast txns
+ - per replica count: issuing rate, handle_responses %, issuing net
+ throughput, processing rate
+ - 0: 260 Ktps, 0%, N/A, 260 Ktps
+ - 1: <250 Ktps, 70%, 21 MB/s, 65 Ktps
+ - 2: 75 Ktps, 38% 38%, 6.4 MB/s, 65 Ktps
+ - 3: 53 Ktps, 27% 27% 27%, 4.5 MB/s, 50 Ktps
+ - from 0 to 1: it seems that Response serialization is taking up a lot of
+ time?
+ - tried removing Response construction, but that barely changed anything
+ - with fewer than 3 nodes, we are bottlenecked by the CPU throughput of
+ process_txn (65 Ktps)
+- DONE asynchronously bcast responses
+ - async (handling rate, by replica count):
+ - 0: 260 Ktps
+ - 1: 90 Ktps
+ - 2: 66 Ktps
+ - 3: 46 Ktps
+ - sync (handling rate, by replica count):
+ - 0: 260 Ktps
+ - 1: 63 Ktps: this shows that async does make a big difference
+ - 2: 60 Ktps: this and the next are closer to the async numbers because of
+ the readmsg bottleneck
+ - 3: 43 Ktps
+ - at 3, we are bottlenecked by the leader in handle_responses (apparently) to
+ 50 Ktps
+ - we are definitely not bottlenecked by IO throughput
+- DONE make readmsg perform fewer syscalls (buffer opportunistically; see the
+ sketch below)
+ - like magic: now can sustain 90 Ktps all the way up through 3 xacts!
+
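+The buffering idea, roughly (an illustrative sketch; `BufferedReader` and
+`readn` are made-up names, not ydb's actual readmsg code): pull as much as the
+kernel will hand over in one read() and serve later requests from user space.
+
+```cpp
+// Sketch: serve readmsg-style requests from a user-space buffer so that most
+// calls never hit the kernel; each read() grabs a large chunk.
+// Assumes a blocking socket; a non-blocking one would also need EAGAIN handling.
+#include <unistd.h>
+#include <cstddef>
+#include <string>
+
+class BufferedReader {
+ public:
+  explicit BufferedReader(int fd) : fd_(fd), pos_(0) {}
+
+  // Copy exactly n bytes into out, reading from the fd only when the buffer
+  // runs dry; returns false on EOF or error.
+  bool readn(std::size_t n, std::string &out) {
+    while (buf_.size() - pos_ < n) {
+      char chunk[64 * 1024];
+      ssize_t got = read(fd_, chunk, sizeof chunk);  // one big syscall
+      if (got <= 0) return false;
+      buf_.append(chunk, static_cast<std::size_t>(got));
+    }
+    out.assign(buf_, pos_, n);
+    pos_ += n;
+    if (pos_ == buf_.size()) { buf_.clear(); pos_ = 0; }  // recycle the buffer
+    return true;
+  }
+
+ private:
+  int fd_;
+  std::string buf_;
+  std::size_t pos_;
+};
+```
+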
+Period
+
+- DONE p2 prototype
+ - some interesting performance bugs
+ - forgot to make some sockets non-blocking, e.g. the accepted client socket
+ and the client's socket to the server; everything still works with select
+ (see the non-blocking sketch after this list)
+ - I was indeed forgetting to set this in ydb as well
+ - was inadvertently issuing a read() call every time some number of bytes
+ was requested
+ - made a big difference in leveling the field between smaller and larger
+ msg sizes
+ - this was hurting me only slightly in ydb, it seems
+ - was not aggressively consuming as many msgs as possible, only 1 at a time
+ (per return from select)
+- DONE batch responses
+ - made a marked difference; ~100 Ktps -> ~140 Ktps (for 1-4 replicas)
+- TODO flushing
+- TODO make the logger a "single replica"
+- TODO oprofile
+ - not giving much info either for things that are stalled on IO
- TODO serialization bench (multiple layers, control batch sizes)
- TODO network throughput bench
- TODO associative container bench
- TODO combine the analyses of the above three; integrate with actual message
- formats, etc.
+ formats, etc.; updated README
- TODO batching, serialization, disk speed
- TODO better wal
- TODO better understand multihost recovery
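+
+For reference, the non-blocking fix mentioned under the p2 prototype notes
+boils down to roughly this (a generic POSIX sketch; `set_nonblocking` is an
+illustrative name, not ydb's helper):
+
+```cpp
+// Every socket that select() will hand back, including ones returned by
+// accept(), must itself be marked O_NONBLOCK, or a stray read/write can
+// stall the event loop.
+#include <fcntl.h>
+
+bool set_nonblocking(int fd) {
+  int flags = fcntl(fd, F_GETFL, 0);
+  if (flags < 0) return false;
+  return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
+}
+
+// e.g. right after accept():
+//   int client = accept(listen_fd, 0, 0);
+//   if (client >= 0 && !set_nonblocking(client)) { /* handle error */ }
+```
+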
@@ -295,7 +362,7 @@
- TODO data structures benchmark
- TODO implement checkpointing disk-based scheme
- TODO implement log-based recovery; show that it sucks
-- TODO implement group (batch) commit for log-based recovery
+- TODO implement group (batch) commit (sync) for log-based recovery
- TODO try scaling up
- TODO serialize outputs from the various clients to a single merger to (1)
have ordering over the (timestamped) messages, and (2) avoid interleaved
@@ -359,3 +426,62 @@
- TODO differences from: harbor, harp, aries
- TODO understand 2pc, paxos, etc.
+
+Notes
+-----
+
+### IO limits
+
+Theoretically, with a GigE network connection, you can max out at roughly 800
+Mb/s or 100 MB/s. Assuming each transaction can be encoded in about 50 bytes,
+you can push 100e6 B/s / 50 B/txn = 2e6 txn/s.
+
+Imagine that this is the only network traffic we need to worry about, so we
+don't have any other replicas to dispatch transactions to or any responses to
+receive (though GigE is full-duplex).
+
+As of 2008, a typical 7200rpm desktop hard drive has a sustained
+"disk-to-buffer" data transfer rate of about 70 MB/s[^1]. This is slightly
+less than the max rate of a GigE network, so we do expect to do better with the
+network---but not substantially.
+
+[^1]: <http://en.wikipedia.org/wiki/Hard_disk_drive>
+
+### Compute limits
+
+To be able to process 2e6 txn/s on a 1GHz CPU, we must spend at most 1e9
+cycle/s / 2e6 txn/s = 500 cycle/txn. Assuming:
+
+- 100 ns for main memory
+- 10 ns for L2 cache
+- 1 ns for L1 cache
+
+At 1 ns/cycle, a single cache miss will take up to a fifth of the allotted
+processing time per txn. (TODO: what's the baseline for syscalls?)
+
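+A throwaway sanity check of the arithmetic above and in the IO-limits section
+(the constants are the stated assumptions, not measurements):
+
+```cpp
+#include <cstdio>
+
+int main() {
+  const double net_Bps      = 100e6;  // ~GigE payload rate
+  const double disk_Bps     = 70e6;   // 7200rpm sustained transfer rate
+  const double B_per_txn    = 50;     // assumed encoded txn size
+  const double cycles_per_s = 1e9;    // 1 GHz CPU, 1 ns/cycle
+  const double miss_cycles  = 100;    // one main-memory access
+
+  double net_tps  = net_Bps / B_per_txn;     // 2e6 txn/s
+  double disk_tps = disk_Bps / B_per_txn;    // 1.4e6 txn/s
+  double budget   = cycles_per_s / net_tps;  // 500 cycles/txn
+  std::printf("network-bound %.2g txn/s, disk-bound %.2g txn/s\n",
+              net_tps, disk_tps);
+  std::printf("budget %.0f cycles/txn; one cache miss eats %.0f%% of it\n",
+              budget, 100 * miss_cycles / budget);
+}
+```
+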
+In a tight loop, a `std::map` can process 1M sequential insertions in ~500 ms,
+or ~2M/s (note that these are *not* random keys, but keys incrementing from 0).
+If a txn inserts 5 records, this equates to 400 Ktps. An `stx::btree_map` takes
+~250 ms (4M/s or 800 Ktps), and a `tr1::unordered_map` (hash table) takes
+~200 ms (5M/s or 1 Mtps). For reference, we can populate a raw array
+sequentially in ~5 ms (200M/s or 40 Mtps). (These microbenchmarks come from
+container-bench.)
+
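+The container-bench loop is roughly of the following shape (a reconstruction
+for illustration, using C++11 names; `stx::btree_map` is omitted to keep the
+sketch dependency-free):
+
+```cpp
+// Time 1M sequential-key insertions into each container, as in the numbers
+// quoted above. std::unordered_map stands in for tr1::unordered_map.
+#include <chrono>
+#include <cstddef>
+#include <cstdio>
+#include <map>
+#include <unordered_map>
+
+template <typename Map>
+static double insert_ms(std::size_t n) {
+  Map m;
+  auto start = std::chrono::steady_clock::now();
+  for (std::size_t i = 0; i < n; ++i)
+    m[i] = i;  // sequential keys, not random ones
+  std::chrono::duration<double, std::milli> elapsed =
+      std::chrono::steady_clock::now() - start;
+  return elapsed.count();
+}
+
+int main() {
+  const std::size_t n = 1000000;
+  std::printf("std::map           %.0f ms\n",
+              insert_ms<std::map<std::size_t, std::size_t> >(n));
+  std::printf("std::unordered_map %.0f ms\n",
+              insert_ms<std::unordered_map<std::size_t, std::size_t> >(n));
+}
+```
+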
+These results suggest that we can expect to be bounded by the CPU/memory, and
+not IO throughput. With hash tables, we can come close to the target of 2e6
+txn/s, but can't expect to exceed it.
+
+In the H-Store paper "The End of an Architectural Era", the prototype achieves
+70,416 TPC-C txn/s, 82 times faster than the 850 txn/s achieved on the
+commercial system.
+
+### Workloads
+
+eBay processes about 26 billion txn/day, or roughly 300 Ktps.
+
+Discussion
+----------
+
+Disk vs. network: practical considerations
+
+- Network is simpler, can be stateless
+- Disk may quickly grow stale during downtime