[Assorted-commits] SF.net SVN: assorted:[1283] ydb/trunk/README
From: <yan...@us...> - 2009-03-12 02:13:52
Revision: 1283
          http://assorted.svn.sourceforge.net/assorted/?rev=1283&view=rev
Author:   yangzhang
Date:     2009-03-12 02:13:29 +0000 (Thu, 12 Mar 2009)

Log Message:
-----------
added a bunch of notes/todos

Modified Paths:
--------------
    ydb/trunk/README

Modified: ydb/trunk/README
===================================================================
--- ydb/trunk/README    2009-03-12 02:13:21 UTC (rev 1282)
+++ ydb/trunk/README    2009-03-12 02:13:29 UTC (rev 1283)
@@ -545,6 +545,106 @@
 - decided to leave this as-is; getting an accurate number is too much effort
 - relabeled the messages to be "roughly"
 
+Wed Mar 11 00:46:35 EDT 2009
+
+- DONE experiment on the new set of machines, farm1-4
+  - scaling
+    - 3: 401Ktps
+    - 2: 418Ktps
+    - 1: 425Ktps
+    - 0: 401Ktps
+    - twal: 360Ktps
+    - pwal: 264Ktps
+    - pwal on rep: 240Ktps
+  - recovery
+    - 1 rep
+      - before: 413Ktps
+      - during: 220-250Ktps
+      - serializing: 203ms
+      - recv: 1955ms
+      - catchup: 5133ms 300Ktps
+    - 2 reps
+      - before: 419Ktps
+      - during: 236Ktps
+      - serializing: 106ms
+      - recv: 1500ms
+      - catchup: 5023ms 311Ktps
+  - problem: noticed that there's a lot of long blocking during the copy
+    - blocking to replica 0: why? serialization only takes 200ms...
+    - blocking to replica 1: why? we should not be blocked on anything....
+
+- current state of the system:
+  - replicas can do pwal
+  - solo can do twal; may also implement for replicas
+  - strange slowdown
+  - replicas can recover over network
+  - cannot recover from disk
+
+- DONE get to the bottom of the blocking issues
+  - turns out this really is due to the CPU load
+  - when copying, the CPU is distracted from process_txns by the recover_joiner
+    thread, which is repeatedly making write syscalls to send a large message
+  - i can add a backlog, but not sure what good that will do, really
+  - the problem is prioritizing the live txn processing over the sending, but
+    then nothing will ever be sent, because even in normal (non-recovery)
+    operation, the issuer is faster than the processor
+  - need to apply *some* back-pressure, and TCP's buffer limits are a natural
+    and easy way to achieve that
+
+- DONE implement recovery (replay) from disk
+  - 2
+    - before: 419Ktps
+    - during: 191Ktps
+    - replay: 3260ms (19MB/s)
+    - catchup: 4908ms (319Ktps)
+
+Wed Mar 11 16:38:24 EDT 2009
+
+- meeting with sam
+- got the numbers
+  - fixed up/understood various issues with earlier implementation
+    - pwal, twal, on replicas
+    - new machines
+    - proper recovery/backlogging
+    - custom map storage structure
+    - the blocking problem
+  - got new numbers
+  - disk is slow; still some things i can do to speed up
+  - try multiprocessing?
+- outline
+  - two approaches to paper:
+    - "here are some recovery methods"
+    - "network based recovery is the new method, better than (or at least as
+      good as) disk"
+  - intro
+    - main-memory distributed databases for oltp [h-store]
+    - replication for high availability and durability [harbor]
+    - recovery methods for main-memory database
+      - disk logging vs network replication
+      - network bandwidth > disk bandwidth
+      - flexibility: bringing back orig node
+      - new system is cpu bound
+      - snapshots
+  - mechanisms/design
+    - network recovery
+    - pwal recovery
+      - no redo (no steal)
+    - twal recovery
+    - hybrid recovery
+      - prelim experiments show: all pages become dirty, fast
+      - realistic workload?
+  - evaluation
+    - each of the above mechanisms
+  - related work
+    - aries
+    - harbor
+    - harp
+- new goals
+  - implement the chunking/snapshotting
+  - try to get tpcc working
+
+- TODO faster disk logging using separate threads
+
 - TODO show aries-write
 - TODO checkpointing + replaying log from replicas (not from disk)
 - TODO scale-up on multicore
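
A note on the back-pressure point under "DONE get to the bottom of the
blocking issues" above: relying on TCP's buffer limits usually means capping
the kernel send buffer and sticking to blocking writes, so the faster side
stalls as soon as the peer stops draining.  The C++ sketch below only
illustrates that idea; it is not ydb code, and the socket fd and the helper
names cap_send_buffer and send_all are invented for the example.

    // Illustrative sketch (not ydb code): let TCP's buffer limits throttle a
    // sender that would otherwise run ahead of the receiver.
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstddef>
    #include <stdexcept>

    // Keep the kernel send buffer small so the kernel, not an unbounded
    // user-space backlog, absorbs any producer/consumer speed mismatch.
    void cap_send_buffer(int sockfd, int bytes) {
        if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof bytes) != 0)
            throw std::runtime_error("setsockopt(SO_SNDBUF) failed");
    }

    // Blocking send loop: once the small send buffer fills up, write() blocks
    // and the sending thread stalls until the peer drains data -- TCP itself
    // provides the back-pressure, with no extra queueing logic.
    void send_all(int sockfd, const char* buf, size_t len) {
        while (len > 0) {
            ssize_t n = write(sockfd, buf, len);
            if (n < 0) {
                if (errno == EINTR) continue;  // interrupted by a signal; retry
                throw std::runtime_error("write failed");
            }
            buf += n;
            len -= static_cast<size_t>(n);
        }
    }

Whether it is the recovery sender pushing a large snapshot or an issuer
outrunning the processor, the throttle is the same: the faster side blocks in
write() once the receiver stops reading.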
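
The closing "TODO faster disk logging using separate threads" could be
prototyped roughly as below: transaction threads hand log records to a
bounded in-memory queue, and a dedicated writer thread batches them to disk
so write latency stays off the transaction path.  This is a hedged C++
sketch, not the project's implementation; AsyncLogger and every name in it
are made up for illustration.

    // Illustrative sketch (not ydb code): disk logging moved to a separate
    // writer thread behind a bounded queue.
    #include <condition_variable>
    #include <cstddef>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    class AsyncLogger {
    public:
        explicit AsyncLogger(const std::string& path,
                             std::size_t max_queued = 1 << 16)
            : out_(path, std::ios::binary | std::ios::app),
              max_queued_(max_queued),
              writer_([this] { run(); }) {}

        ~AsyncLogger() {
            { std::lock_guard<std::mutex> lk(mu_); done_ = true; }
            cv_.notify_all();
            writer_.join();
        }

        // Called from transaction threads.  Blocks only when the queue is
        // full, which bounds memory use and back-pressures the log producers.
        void append(std::string record) {
            std::unique_lock<std::mutex> lk(mu_);
            not_full_.wait(lk, [&] { return queue_.size() < max_queued_; });
            queue_.push(std::move(record));
            cv_.notify_one();
        }

    private:
        void run() {
            std::unique_lock<std::mutex> lk(mu_);
            for (;;) {
                cv_.wait(lk, [&] { return done_ || !queue_.empty(); });
                if (queue_.empty() && done_) break;
                std::queue<std::string> batch;
                batch.swap(queue_);       // grab everything queued so far
                not_full_.notify_all();   // wake any blocked producers
                lk.unlock();              // write the batch without the lock
                while (!batch.empty()) {
                    const std::string& r = batch.front();
                    out_.write(r.data(), static_cast<std::streamsize>(r.size()));
                    batch.pop();
                }
                out_.flush();  // one flush per batch; a real WAL would fsync here
                lk.lock();
            }
        }

        std::ofstream out_;
        const std::size_t max_queued_;
        std::mutex mu_;
        std::condition_variable cv_;        // wakes the writer thread
        std::condition_variable not_full_;  // wakes producers blocked in append()
        std::queue<std::string> queue_;
        bool done_ = false;
        std::thread writer_;  // declared last so it starts only after the
                              // members it uses are constructed
    };

The bounded queue doubles as back-pressure in the same spirit as the TCP note
above: if the disk cannot keep up, append() blocks the producers instead of
letting the backlog grow without bound.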