Thread: [Aoetools-discuss] [PATCH] Experimental async disk IO version of vblade
From: Chris W. <ch...@ar...> - 2008-04-16 23:17:29
I've been experimenting with an AIO-enabled version of vblade this evening. A proof-of-concept patch against vblade-15 is below; this includes the changes from o_direct.diff bundled in vblade-15/contrib/. My changes are basic and rather rough at the moment and need quite a bit of tidying up. Also, I've changed linux.c without making the corresponding changes to freebsd.c. (However, FreeBSD also has AIO so changing freebsd.c to match should be straightforward.)

Any comments, feedback or test results would be very interesting. I've tested briefly that data doesn't get corrupted and observed performance vaguely similar to standard vblade with O_DIRECT on my laptop, but haven't had a chance to test it with anything faster than 100Mbps ethernet or a 2.5" drive.

(For what it's worth, the reason I'm looking at AIO in vblade is that one of my projects needs an efficient 'vshelf', exporting multiple block devices from a single host over AoE. Running many copies of vblade doesn't work well: every process wakes and separately examines every incoming network packet. If I export thirty small block devices with thirty vblade processes on a low-powered machine, it can't even manage a data rate of 5MB/s from any one of the vblades with the others idle, whereas a single vblade on its own easily manages 10MB/s without heavily loading the box. A single-threaded vshelf with blocking IO would solve this problem, but at the expense of one block device blocking another. The only viable solutions appear to be a multithreaded vshelf, or an AIO-driven vshelf, and the latter seemed like a more elegant solution. However, I don't know whether it will work out remotely reasonable in terms of performance.)

Best wishes,

Chris.
diff -uprN vblade-15/aoe.c vblade-15-aio/aoe.c
--- vblade-15/aoe.c	2008-03-07 20:22:16.000000000 +0000
+++ vblade-15-aio/aoe.c	2008-04-17 00:11:07.000000000 +0100
@@ -8,6 +8,8 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <netinet/in.h>
+#include <errno.h>
+#include <aio.h>
 
 #include "dat.h"
 #include "fns.h"
@@ -78,32 +80,50 @@ getlba(uchar *p)
 }
 
 int
-aoeata(Ata *p, int pktlen)	// do ATA reqeust
+aoeata(struct Place *place)	// do ATA reqeust
 {
-	Ataregs r;
-	int len = 60;
 	int n;
+	int len = 60;
 
-	r.lba = getlba(p->lba);
-	r.sectors = p->sectors;
-	r.feature = p->err;
-	r.cmd = p->cmd;
-	if (atacmd(&r, (uchar *)(p+1), maxscnt*512, pktlen - sizeof(*p)) < 0) {
-		p->h.flags |= Error;
-		p->h.error = BadArg;
+	place->r.lba = getlba(place->p->lba);
+	place->r.sectors = place->p->sectors;
+	place->r.feature = place->p->err;
+	place->r.cmd = place->p->cmd;
+	if (atacmd(&place->r, (uchar *)(place->p + 1), maxscnt*512,
+			place->pktlen - sizeof(Ata), &place->aiocb) < 0) {
+		place->p->h.flags |= Error;
+		place->p->h.error = BadArg;
 		return len;
 	}
-	if (!(p->aflag & Write))
-	if ((n = p->sectors)) {
-		n -= r.sectors;
+	if (place->aiocb.aio_buf)
+		return 0;
+	if (!(place->p->aflag & Write) && (n = place->p->sectors)) {
+		n -= place->r.sectors;
+		len = sizeof (Ata) + (n*512);
+	}
+	place->p->sectors = place->r.sectors;
+	place->p->err = place->r.err;
+	place->p->cmd = place->r.status;
+	return len;
+}
+
+int aoeatacomplete(struct Place *place, int pktlen)
+{
+	int n;
+	int len = 60;
+	atacmdcomplete(&place->r, &place->aiocb);
+	if (!(place->p->aflag & Write) && (n = place->p->sectors)) {
+		n -= place->r.sectors;
 		len = sizeof (Ata) + (n*512);
 	}
-	p->sectors = r.sectors;
-	p->err = r.err;
-	p->cmd = r.status;
+	place->p->sectors = place->r.sectors;
+	place->p->err = place->r.err;
+	place->p->cmd = place->r.status;
+	bzero((char *) &place->aiocb, sizeof(struct aiocb));
 	return len;
 }
+
 
 #define QCMD(x) ((x)->vercmd & 0xf)
 
 // yes, this makes unnecessary copies.
@@ -156,8 +176,9 @@ confcmd(Conf *p, int payload)	// process
 }
 
 void
-doaoe(Aoehdr *p, int n)
+doaoe(struct Place *place)
 {
+	Aoehdr *p = (Aoehdr *) place->p;
 	int len;
 	enum {	// config query header size
 		CHDR_SIZ = sizeof(Conf) - sizeof(((Conf *)0)->data),
@@ -165,14 +186,16 @@ doaoe(Aoehdr *p, int n)
 
 	switch (p->cmd) {
 	case ATAcmd:
-		if (n < sizeof(Ata))
+		if (place->pktlen < sizeof(Ata))
+			return;
+		len = aoeata(place);
+		if (len == 0)
 			return;
-		len = aoeata((Ata*)p, n);
 		break;
 	case Config:
-		if (n < CHDR_SIZ)
+		if (place->pktlen < CHDR_SIZ)
 			return;
-		len = confcmd((Conf *)p, n - CHDR_SIZ);
+		len = confcmd((Conf *)p, place->pktlen - CHDR_SIZ);
 		if (len == 0)
 			return;
 		break;
@@ -193,25 +216,104 @@ doaoe(Aoehdr *p, int n)
 }
 
 void
+doaoecomplete(struct Place *place)
+{
+	Aoehdr *p = (Aoehdr *) place->p;
+	int len = aoeatacomplete(place, place->pktlen);
+
+	memmove(p->dst, p->src, 6);
+	memmove(p->src, mac, 6);
+	p->maj = htons(shelf);
+	p->min = slot;
+	p->flags |= Resp;
+	if (putpkt(sfd, (uchar *) p, len) == -1) {
+		perror("write to network");
+		exit(1);
+	}
+
+}
+
+// allocate the buffer so that the ata data area
+// is page aligned for o_direct on linux
+
+void *
+bufalloc(uchar **buf, long len)
+{
+	long psize;
+	unsigned long n;
+
+	psize = sysconf(_SC_PAGESIZE);
+	if (psize == -1) {
+		perror("sysconf");
+		exit(EXIT_FAILURE);
+	}
+	n = len/psize + 3;
+	*buf = malloc(psize * n);
+	if (!*buf) {
+		perror("malloc");
+		exit(EXIT_FAILURE);
+	}
+	n = (unsigned long) *buf;
+	n += psize * 2;
+	n &= ~(psize - 1);
+	return (void *) (n - sizeof (Ata));
+}
+
+void
+sigio(int sig)
+{
+	int n;
+	for (n = 0; n < Nplaces; n++) {
+		if (places[n].aiocb.aio_buf == NULL)
+			continue;
+		if (aio_error(&places[n].aiocb) == EINPROGRESS)
+			continue;
+		doaoecomplete(places + n);
+	}
+}
+
+void
 aoe(void)
 {
 	Aoehdr *p;
-	uchar *buf;
 	int n, sh;
 	enum { bufsz = 1<<16, };
+	sigset_t mask, oldmask;
+	struct sigaction sigact;
 
-	buf = malloc(bufsz);
+	for (n = 0; n < Nplaces; n++) {
+		places[n].p = bufalloc(&places[n].freeme, bufsz);
+		bzero((char *) &places[n].aiocb, sizeof(struct aiocb));
+	}
 	aoead(sfd);
 
+	sigemptyset(&sigact.sa_mask);
+	sigact.sa_flags = 0;
+	sigact.sa_sigaction = sigio;
+	sigaction(SIGIO, &sigact, NULL);
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGIO);
+	sigprocmask(SIG_BLOCK, &mask, &oldmask);
+
 	for (;;) {
-		n = getpkt(sfd, buf, bufsz);
-		if (n < 0) {
+		for (n = 0; places[n].aiocb.aio_buf && n < Nplaces; n++);
+		if (n >= Nplaces) {
+			sigsuspend(&oldmask);
+			continue;
+		}
+		sigprocmask(SIG_SETMASK, &oldmask, NULL);
+		places[n].pktlen = getpkt(sfd, (uchar *) places[n].p, bufsz);
+		sigprocmask(SIG_BLOCK, &mask, NULL);
+		if (places[n].pktlen < 0) {
+			if (errno == EINTR)
+				continue;
 			perror("read network");
 			exit(1);
 		}
-		if (n < sizeof(Aoehdr))
+		if (places[n].pktlen < sizeof(Aoehdr))
 			continue;
-		p = (Aoehdr *) buf;
+		p = (Aoehdr *) places[n].p;
 		if (ntohs(p->type) != 0x88a2)
 			continue;
 		if (p->flags & Resp)
@@ -223,9 +325,10 @@ aoe(void)
 			continue;
 		if (nmasks && !maskok(p->src))
 			continue;
-		doaoe(p, n);
+		doaoe(&places[n]);
 	}
-	free(buf);
+	for (n = 0; n < Nplaces; n++)
+		free(places[n].freeme);
 }
 
 void
@@ -317,7 +420,7 @@ main(int argc, char **argv)
 	}
 	if (s.st_mode & (S_IWUSR|S_IWGRP|S_IWOTH))
 		omode = O_RDWR;
-	bfd = open(argv[3], omode);
+	bfd = opendisk(argv[3], omode);
 	if (bfd == -1) {
 		perror("open");
 		exit(1);
diff -uprN vblade-15/ata.c vblade-15-aio/ata.c
--- vblade-15/ata.c	2008-03-07 20:22:16.000000000 +0000
+++ vblade-15-aio/ata.c	2008-04-17 00:11:16.000000000 +0100
@@ -3,6 +3,8 @@
 #include <string.h>
 #include <stdio.h>
 #include <sys/types.h>
+#include <errno.h>
+#include <aio.h>
 
 #include "dat.h"
 #include "fns.h"
@@ -98,7 +100,7 @@ atainit(void)
 * check for that.
 */
 int
-atacmd(Ataregs *p, uchar *dp, int ndp, int payload) // do the ata cmd
+atacmd(Ataregs *p, uchar *dp, int ndp, int payload, struct aiocb *aiocb) // do the ata cmd
 {
 	vlong lba;
 	ushort *ip;
@@ -155,14 +157,31 @@ atacmd(Ataregs *p, uchar *dp, int ndp, i
 		return 0;
 	}
 	if (p->cmd == 0x20 || p->cmd == 0x24)
-		n = getsec(bfd, dp, lba, p->sectors);
+		n = getsec(bfd, dp, lba, p->sectors, aiocb);
 	else {
 		// packet should be big enough to contain the data
 		if (payload < 512 * p->sectors)
 			return -1;
-		n = putsec(bfd, dp, lba, p->sectors);
+		n = putsec(bfd, dp, lba, p->sectors, aiocb);
 	}
-	n /= 512;
+	if (n < 0) {
+		p->err = ABRT;
+		p->status = ERR|DRDY;
+		p->lba += n;
+		p->sectors -= n;
+		// no callback, so ensure aiocb->aio_buf is NULL
+		bzero((char *) aiocb, sizeof(struct aiocb));
+		return 0;
+	}
+	return 0;
+}
+
+
+int
+atacmdcomplete(Ataregs *p, struct aiocb *aiocb) // complete the ata cmd
+{
+	int n;
+	n = aio_return(aiocb) / 512;
 	if (n != p->sectors) {
 		p->err = ABRT;
 		p->status = ERR;
@@ -171,6 +190,7 @@ atacmd(Ataregs *p, uchar *dp, int ndp, i
 	p->status |= DRDY;
 	p->lba += n;
 	p->sectors -= n;
+	// don't get called back again: ensure aiocb->aio_buf is NULL
+	bzero((char *) aiocb, sizeof(struct aiocb));
 	return 0;
 }
-
diff -uprN vblade-15/dat.h vblade-15-aio/dat.h
--- vblade-15/dat.h	2008-03-07 20:22:16.000000000 +0000
+++ vblade-15-aio/dat.h	2008-04-17 00:11:28.000000000 +0100
@@ -111,6 +111,8 @@ enum {
 	Nconfig = 1024,
 
 	Bufcount = 16,
+
+	Nplaces = 32,
 };
 
 int shelf, slot;
@@ -120,3 +122,13 @@ int bfd;	// block file descriptor
 int sfd;	// socket file descriptor
 vlong size;	// size of vblade
 char *progname;
+
+struct Place
+{
+	int pktlen;
+	Ata *p;
+	Ataregs r;
+	struct aiocb aiocb;
+	uchar *freeme;
+};
+struct Place places[Nplaces];
diff -uprN vblade-15/fns.h vblade-15-aio/fns.h
--- vblade-15/fns.h	2008-03-07 20:22:16.000000000 +0000
+++ vblade-15-aio/fns.h	2008-04-17 00:11:36.000000000 +0100
@@ -15,14 +15,16 @@ int maskok(uchar *);
 
 // ata.c
 void atainit(void);
-int atacmd(Ataregs *, uchar *, int, int);
+int atacmd(Ataregs *, uchar *, int, int, struct aiocb *);
+int atacmdcomplete(Ataregs *, struct aiocb *);
 
 // os specific
 int dial(char *);
 int getea(int, char *, uchar *);
-int putsec(int, uchar *, vlong, int);
-int getsec(int, uchar *, vlong, int);
+int opendisk(const char *, int);
+int putsec(int, uchar *, vlong, int, struct aiocb *);
+int getsec(int, uchar *, vlong, int, struct aiocb *);
 int putpkt(int, uchar *, int);
 int getpkt(int, uchar *, int);
 vlong getsize(int);
diff -uprN vblade-15/freebsd.c vblade-15-aio/freebsd.c
--- vblade-15/freebsd.c	2008-03-07 20:22:16.000000000 +0000
+++ vblade-15-aio/freebsd.c	2008-04-16 18:43:05.000000000 +0100
@@ -241,6 +241,11 @@ getea(int s, char *eth, uchar *ea)
 	return(0);
 }
 
+int
+opendisk(const char *disk, int omode)
+{
+	return open(disk, omode);
+}
 
 int
 getsec(int fd, uchar *place, vlong lba, int nsec)
diff -uprN vblade-15/linux.c vblade-15-aio/linux.c
--- vblade-15/linux.c	2008-03-07 20:22:16.000000000 +0000
+++ vblade-15-aio/linux.c	2008-04-17 00:10:54.000000000 +0100
@@ -1,5 +1,6 @@
 // linux.c: low level access routines for Linux
 #include "config.h"
+#define _GNU_SOURCE
 #include <sys/socket.h>
 #include <stdio.h>
 #include <string.h>
@@ -22,6 +23,9 @@
 #include <netinet/in.h>
 #include <linux/fs.h>
 #include <sys/stat.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <aio.h>
 
 #include "dat.h"
 #include "fns.h"
@@ -29,8 +33,6 @@
 int getindx(int, char *);
 int getea(int, char *, uchar *);
 
-
-
 int
 dial(char *eth)		// get us a raw connection to an interface
 {
@@ -76,7 +78,7 @@ getea(int s, char *name, uchar *ea)
 	struct ifreq xx;
 	int n;
 
-	strcpy(xx.ifr_name, name);
+	strcpy(xx.ifr_name, name);
 	n = ioctl(s, SIOCGIFHWADDR, &xx);
 	if (n == -1) {
 		perror("Can't get hw addr");
@@ -102,17 +104,33 @@ getmtu(int s, char *name)
 }
 
 int
-getsec(int fd, uchar *place, vlong lba, int nsec)
+opendisk(const char *disk, int omode)
+{
+	return open(disk, omode|O_DIRECT);
+}
+
+int
+getsec(int fd, uchar *place, vlong lba, int nsec, struct aiocb *aiocb)
 {
-	lseek(fd, lba * 512, 0);
-	return read(fd, place, nsec * 512);
+	aiocb->aio_fildes = fd;
+	aiocb->aio_buf = place;
+	aiocb->aio_nbytes = nsec * 512;
+	aiocb->aio_offset = lba * 512;
+	aiocb->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
+	aiocb->aio_sigevent.sigev_signo = SIGIO;
+	return aio_read(aiocb);
 }
 
 int
-putsec(int fd, uchar *place, vlong lba, int nsec)
+putsec(int fd, uchar *place, vlong lba, int nsec, struct aiocb *aiocb)
 {
-	lseek(fd, lba * 512, 0);
-	return write(fd, place, nsec * 512);
+	aiocb->aio_fildes = fd;
+	aiocb->aio_buf = place;
+	aiocb->aio_nbytes = nsec * 512;
+	aiocb->aio_offset = lba * 512;
+	aiocb->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
+	aiocb->aio_sigevent.sigev_signo = SIGIO;
+	return aio_write(aiocb);
 }
 
 int
diff -uprN vblade-15/linux.h vblade-15-aio/linux.h
--- vblade-15/linux.h	2008-03-07 20:22:16.000000000 +0000
+++ vblade-15-aio/linux.h	2008-04-16 23:03:07.000000000 +0100
@@ -6,6 +6,6 @@ typedef long long vlong;
 int dial(char *);
 int getindx(int, char *);
 int getea(int, char *, uchar *);
-int getsec(int, uchar *, vlong, int);
-int putsec(int, uchar *, vlong, int);
+int getsec(int, uchar *, vlong, int, struct aiocb *);
+int putsec(int, uchar *, vlong, int, struct aiocb *);
 vlong getsize(int);
diff -uprN vblade-15/makefile vblade-15-aio/makefile
--- vblade-15/makefile	2008-03-07 20:22:16.000000000 +0000
+++ vblade-15-aio/makefile	2008-04-16 19:09:46.000000000 +0100
@@ -13,7 +13,7 @@ CFLAGS += -Wall -g -O2
 CC = gcc
 
 vblade: $O
-	${CC} -o vblade $O
+	${CC} -lrt -o vblade $O
 
 aoe.o : aoe.c config.h dat.h fns.h makefile
 	${CC} ${CFLAGS} -c $<
From: Tracy R R. <tr...@ul...> - 2008-04-17 03:49:20
Chris Webb wrote:
> (For what it's worth, the reason I'm looking at AIO in vblade is that one of
> my projects needs an efficient 'vshelf', exporting multiple block devices
> from a single host over AoE. Running many copies of vblade doesn't work
> well: every process wakes and separately examines every incoming network

Have you looked at qaoed? I use it and get FAR better performance than I did with vblade.

-- 
Tracy R Reed
Read my blog at http://ultraviolet.org
Key fingerprint = D4A8 4860 535C ABF8 BA97 25A6 F4F2 1829 9615 02AD
Non-GPG signed mail gets read only if I can find it among the spam.
From: Chris W. <ch...@ar...> - 2008-04-17 08:52:00
Tracy R Reed <tr...@ul...> writes:
> Have you looked at qaoed? I use it and get FAR better performance than I
> did with vblade.

Hi. Thanks for responding. I did take a look at qaoed, yes. I think it takes the same approach as vblade, but with multiple children (one per block device) and a front-end parent listening on the socket and queueing the requests for the children. I think the requests still have to be satisfied by a particular block device in the order they came in, with each device blocking on each IO operation until it's complete. That makes it a solution to the problem of multiple vblades waking up on each network packet, but not to that of multiple IO operations being queued for a single block device.

I presume you're only seeing greater performance when exporting multiple nodes? I can't see it doing anything different (e.g. multithreaded access) to speed up access when just a single node is exported. If it is faster there too, then I'm missing something in my skim-reading of the qaoed code. Other than the queueing, do they differ in any substantial way?

Because the single-threaded, asynchronous IO approach I was trying is so different to qaoed's multi-threaded queue approach, I decided it would be easier to modify vblade (which is small, simple and clear) than the larger qaoed. (If it turns out that a pool of threads is a better model for high performance, qaoed would be the natural base: its pool could be changed so that multiple threads can work on the same block device.)

Best wishes,

Chris.
From: Ed L. C. <ec...@co...> - 2008-04-18 13:32:47
On Thu, Apr 17, 2008 at 12:17:25AM +0100, Chris Webb wrote:
...
> block device blocking another. The only viable solutions appear to be a
> multithreaded vshelf, or an AIO-driven vshelf, and the latter seemed like a
> more elegant solution.

I haven't thought about the applicability of the idea to the vblade much, but it sounds like you might be skipping over the strategy of using non-blocking I/O and select(2), like thttpd does.

-- 
Ed L Cashin <ec...@co...>
From: Chris W. <ch...@ar...> - 2008-04-18 14:19:19
"Ed L. Cashin" <ec...@co...> writes: > I haven't thought about the applicability of the idea to the vblade > much, but it sounds like you might be skipping over the strategy of > using non-blocking I/O and select(2), like thttpd does. I wondered about that when I started. On a pipe or socket, O_NONBLOCK and poll()/select() is clearly the canonical way of doing this sort of thing, and that sort of non-blocking IO is something I'm much more comfortable with than posix AIO. However, are you sure O_NONBLOCK behaves in the required way with regular files (and block devices) where the only thing blocking the read is disk IO? I was under the impression that a read() from a file with O_NONBLOCK set will only block if there's a resource issue, mandatory lock or other strange situation, not just because the data hasn't arrived from disk yet. A few brief experiments suggest that this is the case. Best wishes, Chris. |
From: Ed L. C. <ec...@co...> - 2008-04-18 15:32:42
On Fri, Apr 18, 2008 at 03:17:59PM +0100, Chris Webb wrote:
...
> However, are you sure O_NONBLOCK behaves in the required way with regular
> files (and block devices) where the only thing blocking the read is disk IO?
> I was under the impression that a read() from a file with O_NONBLOCK set
> will only block if there's a resource issue, mandatory lock or other strange
> situation, not just because the data hasn't arrived from disk yet. A few
> brief experiments suggest that this is the case.

No, I'm not sure. I would expect it to work, but I'd like to hear about it if it doesn't.

-- 
Ed L Cashin <ec...@co...>
From: Chris W. <ch...@ar...> - 2008-04-19 09:32:12
"Ed L. Cashin" <ec...@co...> writes: > No, I'm not sure. I would expect it to work, but I'd like to hear > about it if it doesn't. I've checked up on this. POSIX specifies that regular files always poll() TRUE: http://www.opengroup.org/onlinepubs/000095399/functions/poll.html and select() as ready to read and write: http://www.opengroup.org/onlinepubs/000095399/functions/select.html I've also tested out the behaviour of reading from a plain file opened as O_NONBLOCK on Linux. It doesn't return EAGAIN but just fetches the block and returns, regardless of how long this process takes. (However, if the same test program O_NONBLOCK opens a named pipe or socket instead of the file, read() returns EAGAIN immediately unless there's at least one byte available.) Best wishes, Chris. |
From: Chris W. <ch...@ar...> - 2008-04-20 13:46:47
Attachments:
vblade-15-aio.2.patch
I've turned my initial experimental patch to vblade into something a bit more presentable. It now fits in better with the existing code style of vblade, doesn't do network IO or serious work in a signal handler, and includes a stab at correct FreeBSD support. I haven't been able to test the latter, but it looks like FreeBSD supports the same POSIX AIO as Linux and allows the same signal notification mechanism. Any test results would be gratefully received.

Running oprofile on a box with a heavily loaded loopback vblade-aio suggests that it spends an inordinate amount of time in the signal handler, despite the fact that this is now just used to notify the main event loop that an AIO event has completed. Some method of poll()ing directly on the AIO events at the same time as the socket fd would cut this overhead out completely. (I need to investigate the options here further.)

More generally, experimenting with O_DIRECT vblade and O_DIRECT vblade-aio on a loopback interface with mtu 9000 suggests that performance differences on a single RAID1-backed block device are fairly small, swamped by measurement error and by the performance of the network device and underlying block device. As a result, I think the additional complexity introduced by this patch makes it a poor fit for vblade itself; I am sending it to the list for interest value only.

However, a multiplexing 'vshelf' exporting a number of block devices through a single socket would need either a pool of worker threads or some mechanism along these lines to multiplex the IO requests to different devices, rather than blocking one device while another is working. I hope to post such a multiplexing AoE server later this week. It will be interesting to compare it with the thread-per-block-device solution implemented by qaoed.

Best wishes,

Chris.