This project began when I tried to run vblade from aoetools on a D-Link DNS-325 device together with the WinAoE initiator. It worked, but was very slow, even compared to SMB and FTP access to the same device. Enabling jumbo frames helped, but the transfer rate was still low. I tried another virtual target, ggaoed, but could not compile it for that platform. After a quick look at the AoE protocol, I found it very simple, but with some obvious shortcomings:
My first concern was data integrity: the current AoE specification gives the target no clear way to detect duplicated write requests. In theory this can corrupt data if some network hardware delivers a stale duplicate of a write request that was already sent and processed in the past.
My second concern was the inefficient way requests that carry no data are sent: for example, every small read request travels in its own small Ethernet frame, which creates a lot of traffic overhead compared with other protocols transferring the same amount of data.
So I decided to make some changes to vblade, and since they affected too much code, the vblade authors reasonably recommended forking it as a separate project. So here it is.
Here are the changes compared with the 'normal' vblade project:
- Added support for receiving network packets through a memory-mapped ring buffer (the PACKET_RX_RING option) instead of plain reads from the socket.
- Used Linux AIO to read and write files. Compiling with AIO enabled (the default on Linux) also implies the O_DIRECT and O_DSYNC options, so running AoEde has minimal effect on the system-wide file cache.
- Added write-behind/read-ahead buffering so that O_DIRECT doesn't hurt I/O performance (in my case it even improved it).
- Added an 'Extensions' command that lets the initiator negotiate protocol extensions supported by both it and the target. The request packet contains a plain table of NUL-terminated strings. Each string is either a requested feature or an argument for the previously specified feature, if that feature takes parameters. The table is terminated by one additional NUL character, so a valid Aoeextensions request must end with two NUL characters. The target must reply with the same string table, but with the strings that correspond to supported and activated features converted to upper case, which lets the initiator check exactly which features are in use. The pseudo-feature 'reset' returns all extension features to the 'normal' state. The same happens when a config command is processed.
- Added tracking of incrementing tags: if the client guarantees that the tag field of every new unique request is an incrementing 32-bit integer, it can specify the 'tag_inc_le' or 'tag_inc_be' extension, depending on whether the tag must be treated as a little-endian or big-endian incrementing value. This enables a sliding window of 32K tags that prevents duplicated write requests from being processed. It also optimizes the handling of duplicated packets within the current ring buffer. The initiator can instead specify the 'tag_random' extension, which enables only that optimization, not the skipping of duplicated write commands.
- Added read request coalescing. To use it, the initiator should first negotiate it by specifying the 'coalesced_read' extension and check that the target supports it by verifying that the name comes back in upper case in the reply. If the target supports (and has activated) it, the initiator should then use slightly modified Read commands: after the normal read command, represented by struct Ata, the initiator can append multiple additional read requests, so the final coalesced-read request should look like this:
struct Ata normal_read_request;
unsigned char coalesced_count;
struct AtaCoalescedRead coalesced_read_requests[coalesced_count];
This allows multiple read requests to be aggregated in a single Ethernet frame, which noticeably decreases network usage and the target's CPU usage during bulk read operations.
I'm also thinking about how to do the same trick for write replies, but have not implemented it yet. Most probably this will be possible with a TX ring buffer, but the Linux kernel on the D-Link DNS-325 doesn't support the TX ring.
- Added congestion threshold detection: the initiator sends multiple 'congestion' extension requests, each marked with a common id number as its argument. On receiving the first 'congestion' request with a new id, the target delays packet processing for 0.5 seconds, and then replies to all other congestion requests with the same id without delay. As a result, the target processes only as many congestion requests as fit into the network buffers; all others are discarded. The initiator should then count the received replies and use that number to estimate the best threshold for outstanding requests, minimizing drops and resends while keeping the best throughput.
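To make the 'Extensions' string table from the list above more concrete, here is a minimal sketch in C. It is not code taken from AoEde: the function names (ext_append, ext_finish, ext_reply, ext_is_active, ext_demo) and the feature set in ext_supported are assumptions made up for this example. Feature arguments and the 'reset' pseudo-feature are left out, and the sketch covers only the table payload, not the surrounding AoE header.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Append one NUL-terminated string to the table.
 * Returns the new table length, or 0 if the string does not fit. */
static size_t ext_append(char *buf, size_t len, size_t cap, const char *s)
{
    size_t n = strlen(s) + 1;          /* the string plus its own NUL */
    if (len + n + 1 > cap)             /* keep one byte for the final NUL */
        return 0;
    memcpy(buf + len, s, n);
    return len + n;
}

/* Close the table with the additional final NUL; returns the payload length. */
static size_t ext_finish(char *buf, size_t len)
{
    buf[len] = '\0';
    return len + 1;
}

/* Hypothetical target policy: the features this example target accepts. */
static int ext_supported(const char *s)
{
    return strcmp(s, "tag_inc_le") == 0 || strcmp(s, "coalesced_read") == 0;
}

/* Target side: upper-case every supported entry in place.
 * Returns 0 on success, -1 if the table is not properly terminated. */
static int ext_reply(char *buf, size_t len)
{
    size_t i = 0;
    if (len == 0 || buf[len - 1] != '\0')
        return -1;
    while (i < len && buf[i] != '\0') {
        char *s = buf + i;
        char *end = memchr(s, '\0', len - i);  /* found: buf[len-1] is NUL */
        if (ext_supported(s))
            for (char *p = s; p < end; p++)
                *p = (char)toupper((unsigned char)*p);
        i = (size_t)(end - buf) + 1;
    }
    return i < len ? 0 : -1;           /* the final NUL must still be there */
}

/* Initiator side: a feature is active if it came back in upper case. */
static int ext_is_active(const char *buf, size_t len, const char *feature)
{
    size_t flen = strlen(feature);
    size_t i = 0;
    while (i < len && buf[i] != '\0') {
        const char *s = buf + i;
        const char *end = memchr(s, '\0', len - i);
        if (end == NULL)
            return 0;                  /* malformed reply */
        size_t n = (size_t)(end - s);
        if (n == flen) {
            size_t k = 0;
            while (k < n && s[k] == (char)toupper((unsigned char)feature[k]))
                k++;
            if (k == n)
                return 1;
        }
        i += n + 1;
    }
    return 0;
}

/* Round trip: request two features, let the "target" answer, check the result.
 * Returns 1 if only the supported feature was activated. */
int ext_demo(void)
{
    char frame[64];
    size_t len = 0;
    len = ext_append(frame, len, sizeof frame, "tag_inc_le");
    assert(len != 0);
    len = ext_append(frame, len, sizeof frame, "made_up_feature");
    assert(len != 0);
    len = ext_finish(frame, len);
    if (ext_reply(frame, len) != 0)
        return 0;
    return ext_is_active(frame, len, "tag_inc_le")
        && !ext_is_active(frame, len, "made_up_feature");
}
```

In this sketch the reply reuses the request buffer, which mirrors the rule that the target answers with the same table and only changes the case of accepted entries. A table consisting of nothing but the final NUL is accepted as an empty request here; that is a choice of the sketch, not something the description above states.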
Also see the blog for project news: https://sourceforge.net/p/aoede/blog/
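As an illustration of the incrementing-tag extensions ('tag_inc_le' / 'tag_inc_be') described in the change list, the sketch below shows one common way to build a sliding window of 32K tags: a bitmap indexed by tag modulo the window size, similar to anti-replay windows used elsewhere. This is an assumption about how such a window can work, not AoEde's actual code. All names (tag_window, tag_decode, tag_accept, tag_demo) are hypothetical, and the sketch simply skips tags that are older than the window, which may differ from what the real target does.

```c
#include <stdint.h>
#include <string.h>

#define TAG_WINDOW 32768u              /* "32K tags", as in the description */

struct tag_window {
    uint32_t highest;                  /* newest tag accepted so far */
    int      primed;                   /* 0 until the first tag arrives */
    uint8_t  seen[TAG_WINDOW / 8];     /* one bit per tag inside the window */
};

/* Read the 4-byte tag field as a big-endian or little-endian integer. */
static uint32_t tag_decode(const uint8_t b[4], int big_endian)
{
    return big_endian
        ? ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) | ((uint32_t)b[2] << 8) | b[3]
        : ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) | ((uint32_t)b[1] << 8) | b[0];
}

static int bit_get(const struct tag_window *w, uint32_t t)
{
    uint32_t i = t % TAG_WINDOW;
    return (w->seen[i / 8] >> (i % 8)) & 1;
}

static void bit_set(struct tag_window *w, uint32_t t)
{
    uint32_t i = t % TAG_WINDOW;
    w->seen[i / 8] |= (uint8_t)(1u << (i % 8));
}

static void bit_clr(struct tag_window *w, uint32_t t)
{
    uint32_t i = t % TAG_WINDOW;
    w->seen[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* Returns 1 if the request is new and should be processed,
 * 0 if it is a duplicate or too old to be checked. */
static int tag_accept(struct tag_window *w, uint32_t tag)
{
    if (!w->primed) {
        memset(w->seen, 0, sizeof w->seen);
        w->highest = tag;
        w->primed = 1;
        bit_set(w, tag);
        return 1;
    }
    uint32_t ahead = tag - w->highest;             /* arithmetic modulo 2^32 */
    if (ahead != 0 && ahead < 0x80000000u) {       /* tag is newer */
        if (ahead >= TAG_WINDOW)
            memset(w->seen, 0, sizeof w->seen);    /* whole window is stale */
        else
            for (uint32_t t = w->highest + 1; t != tag; t++)
                bit_clr(w, t);                     /* skipped tags: not seen */
        w->highest = tag;
        bit_set(w, tag);
        return 1;
    }
    uint32_t behind = w->highest - tag;
    if (behind >= TAG_WINDOW)
        return 0;                                  /* older than the window */
    if (bit_get(w, tag))
        return 0;                                  /* duplicate */
    bit_set(w, tag);                               /* late but new */
    return 1;
}

/* Feed a little-endian tag twice; the second copy must be rejected.
 * Returns 1 when the window behaves as expected. */
int tag_demo(void)
{
    static struct tag_window w;                    /* zero-initialized */
    const uint8_t wire[4] = { 0x64, 0x00, 0x00, 0x00 };
    uint32_t tag = tag_decode(wire, 0);
    return tag_accept(&w, tag) == 1 && tag_accept(&w, tag) == 0;
}
```

The unsigned subtraction keeps the comparison correct when the 32-bit tag wraps around, and requests that arrive out of order but inside the window are still accepted once.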