From: Mike K. <mkr...@se...> - 2001-01-18 23:53:19
I just posted an updated version of the multi-queue scheduler for the 2.4.0 kernel. This version also contains support for realtime tasks. The patch can be found at:

  http://lse.sourceforge.net/scheduling/

Here are some very preliminary numbers from sched_test_yield (which was previously posted to this (lse-tech) list by Bill Hartner). Tests were run on a system with 8 700 MHz Pentium III processors.

                      microseconds/yield
  # threads   2.2.16-22        2.4    2.4-multi-queue
 -----------  ---------    --------   ---------------
       16       18.740       4.603        1.455
       32       17.702       5.134        1.456
       64       23.300       5.586        1.466
      128       47.273      18.812        1.480
      256      105.701      71.147        1.517
      512          FRC     143.500        1.661
     1024          FRC     196.425        6.166
     2048          FRC         FRC       23.291
     4096          FRC         FRC       47.117

  *FRC = failed to reach confidence level

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center
15450 SW Koll Parkway
Beaverton, OR 97006-6063              (503)578-3494

From: Andrea A. <an...@su...> - 2001-01-19 00:25:41
On Thu, Jan 18, 2001 at 03:53:11PM -0800, Mike Kravetz wrote:
> Here are some very preliminary numbers from sched_test_yield
> (which was previously posted to this (lse-tech) list by Bill
> Hartner). Tests were run on a system with 8 700 MHz Pentium
> III processors.
>
>                       microseconds/yield
>   # threads   2.2.16-22        2.4    2.4-multi-queue
>  -----------  ---------    --------   ---------------
>        16       18.740       4.603        1.455

I remember the O(1) scheduler from Davide Libenzi was beating the mainline O(N) scheduler with over 7 tasks in the runqueue (actually I'm not sure if the number was 7, but certainly it was under 10). So if you use an O(1) scheduler too, as I guess (since you have a chance to run fast in the lots-of-tasks-running case), the most interesting thing is how you score with 2/4/8 tasks in the runqueue (I think the tests on the O(1) scheduler patch were done at most on a 2-way SMP, btw). (The argument for not including Davide's patch was that most machines have fewer than 4/5 tasks in the runqueue at the same time.)

Andrea

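For context on the O(N) behavior Andrea refers to: each call to schedule() in the stock 2.4 kernel walks the whole runqueue to pick the next task, so the cost of every scheduling decision grows with the number of runnable tasks. A paraphrased sketch of that selection loop, modeled on 2.4's kernel/sched.c rather than quoted verbatim:

    /* schedule() computes a "goodness" weight for every runnable task
     * and keeps the best one -- O(N) in the runqueue length. */
    list_for_each(tmp, &runqueue_head) {
            p = list_entry(tmp, struct task_struct, run_list);
            if (can_schedule(p, this_cpu)) {
                    int weight = goodness(p, this_cpu, prev->active_mm);
                    if (weight > c)
                            c = weight, next = p;
            }
    }

An O(1) design like Davide's priority scheduler, or per-CPU runqueues like Mike's multi-queue patch, avoid this full scan, which is why the gap widens as the runqueue grows.
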
From: Mike K. <mkr...@se...> - 2001-01-19 00:52:36
On Fri, Jan 19, 2001 at 01:26:16AM +0100, Andrea Arcangeli wrote:
> I remember the O(1) scheduler from Davide Libenzi was beating the
> mainline O(N) scheduler with over 7 tasks in the runqueue (actually
> I'm not sure if the number was 7, but certainly it was under 10).
> So if you use an O(1) scheduler too, as I guess, the most interesting
> thing is how you score with 2/4/8 tasks in the runqueue.

Thanks for the suggestion. The only reason I hesitated to test with a small number of threads is that I was under the assumption that this particular benchmark may have problems if the number of threads was less than the number of processors. I'll give the tests a try with a smaller number of threads. I'm also open to suggestions for what benchmarks/test methods I could use for scheduler testing. If you remember what people have used in the past, please let me know.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center

From: Andrea A. <an...@su...> - 2001-01-19 01:30:04
On Thu, Jan 18, 2001 at 04:52:25PM -0800, Mike Kravetz wrote:
> was less than the number of processors.  I'll give the tests a try
> with a smaller number of threads.  I'm also open to suggestions for

OK!

> what benchmarks/test methods I could use for scheduler testing.  If
> you remember what people have used in the past, please let me know.

It was this one IIRC (it spawns threads calling sched_yield() in a loop).

/*
   Tester for the kernel's speed in scheduling.
   (C) 1999 / Willy Tarreau <wi...@me...>

   Modified by Davide Libenzi <da...@ma...>

   You can do whatever you want with this program, but I'm not
   responsible for any misuse. Be aware that it can heavily load
   a host. As it is multithreaded, it might take advantages of SMP.

   It basically creates a growing amount of threads and measures
   their cumulative work (i.e. loop iterations/second). The output
   is easily useable by gnuplot.

   To compile, you need libpthread :

      gcc -O2 -fomit-frame-pointer -o threads threads.c -lpthread

   Output on stdout is :
     <nb_threads> <average_work> <zero_work_threads> <std_deviation>
*/

#include <stdio.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <time.h>

#define MAXTHREADS      450
#define MEASURE_TIME    60

pthread_t       thr[MAXTHREADS];
int             nbthreads = MAXTHREADS;
int             measure_time = MEASURE_TIME;
volatile        actthreads = 0;

long long int   totalwork[MAXTHREADS];
volatile int    stop = 0,
                start = 0,
                count = 0;

void oneatwork(int thr)
{
    int i;
    while (!start)              /* don't disturb pthread_create() */
        usleep(10000);

    actthreads++;
    while (!stop)
    {
        if (count)
            totalwork[thr]++;

        syscall(158);           /* sys_sched_yield() */
    }
    actthreads--;
    pthread_exit(0);
}

main(int argc, char **argv)
{
    int             i, err, avgwork, thrzero;
    long long int   value, avgvalue;
    double          sqrdev;
    time_t          ts, te;

    if (argc < 3)
    {
        printf("usage: %s threads time\n", argv[0]);
        exit(1);
    }

    nbthreads = atoi(argv[1]);
    measure_time = atoi(argv[2]);

    start = 0;
    count = 0;
    stop = 0;
    actthreads = 0;
    thrzero = 0;
    value = 0;
    sqrdev = 0.0;

    fprintf(stderr, "\nCreating %d threads ...", nbthreads);
    for (i = 0; i < nbthreads; i++)
    {
        if ((err = pthread_create(&thr[i], NULL, (void *) &oneatwork, (void *) i)) != 0)
        {
            fprintf(stderr, "thread %d pthread_create=%d -> ", i, err);
            perror("");
            exit(1);
        }
        pthread_detach(thr[i]);
    }

    for (i = 0; i < nbthreads; i++)
        totalwork[i] = 0;

    fprintf(stderr, " OK !\nWaiting for all threads to start ...");

    start = 1;
    while (actthreads != nbthreads)
        usleep(10000);          /* waiting for a bit of stability */

    fprintf(stderr, "Go !\n");

    count = 1;
    time(&ts);

    sleep(measure_time);

    count = 0;
    stop = 1;
    time(&te);

    for (i = 0; i < nbthreads; i++)
    {
        value += totalwork[i];
        if (totalwork[i] == 0)
            ++thrzero;
    }
    avgvalue = value / nbthreads;
    value /= (int) difftime(te, ts);
    avgwork = (int) (value / nbthreads);

    for (i = 0; i < nbthreads; i++)
    {
        double difvv = (double) (totalwork[i] - avgvalue);

        sqrdev += difvv * difvv;
    }

    while (actthreads > 0)
        usleep(10000);

    printf("%d\t\t%lld\t\t%d\t\t%d\t\t%f\n", nbthreads, value, avgwork,
           thrzero, sqrdev / ((double) nbthreads * avgvalue * avgvalue));

    exit(0);
}

Andrea

From: Mike K. <mkr...@se...> - 2001-01-19 01:34:49
On Fri, Jan 19, 2001 at 02:30:41AM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 18, 2001 at 04:52:25PM -0800, Mike Kravetz wrote:
> > was less than the number of processors.  I'll give the tests a try
> > with a smaller number of threads.  I'm also open to suggestions for
>
> OK!
>
> > what benchmarks/test methods I could use for scheduler testing.  If
> > you remember what people have used in the past, please let me know.
>
> It was this one IIRC (it spawns threads calling sched_yield() in a loop).

Thanks! At first glance this looks to be the same type of test/benchmark I have been using.

- Mike

From: Davide L. <da...@xm...> - 2001-01-19 01:39:42
On Thursday 18 January 2001 17:39, Andrea Arcangeli wrote:
> On Thu, Jan 18, 2001 at 04:52:25PM -0800, Mike Kravetz wrote:
> > was less than the number of processors.  I'll give the tests a try
> > with a smaller number of threads.  I'm also open to suggestions for
>
> OK!
>
> > what benchmarks/test methods I could use for scheduler testing.  If
> > you remember what people have used in the past, please let me know.
>
> It was this one IIRC (it spawns threads calling sched_yield() in a loop).
>
> <quoted benchmark code snipped>

Andrea found it before me :)

- Davide

From: David L. <dl...@di...> - 2001-01-19 16:07:33
Another thing that would be interesting is the overhead on UP or small (2-4 way) SMP machines.

David Lang

On Thu, 18 Jan 2001, Mike Kravetz wrote:
> Thanks for the suggestion.  The only reason I hesitated to test with
> a small number of threads is that I was under the assumption that
> this particular benchmark may have problems if the number of threads
> was less than the number of processors.  I'll give the tests a try
> with a smaller number of threads.
<snip>

From: Andi K. <ak...@su...> - 2001-01-19 01:03:27
On Fri, Jan 19, 2001 at 01:26:16AM +0100, Andrea Arcangeli wrote:
> I remember the O(1) scheduler from Davide Libenzi was beating the
> mainline O(N) scheduler with over 7 tasks in the runqueue (...) the
> most interesting thing is how you score with 2/4/8 tasks in the
> runqueue.

They seem to have tried that in a separate patch:

  http://lse.sourceforge.net/scheduling/PrioScheduler.html

Very nice literate programming style btw @-)

-Andi

From: Mike K. <mkr...@se...> - 2001-01-19 23:35:32
As promised, here are some numbers for low thread counts from the benchmark Andrea and Davide provided. I ran the benchmark for 1, 2, 4 and 8 threads. I ran the test 5 times for each thread count and used 60 seconds as the measure time in each case.

2.4.0
-----
1       1785408         1785408         0       0.000000
1       1786130         1786130         0       0.000000
1       1786156         1786156         0       0.000000
1       1781575         1781575         0       0.000000
1       1780079         1780079         0       0.000000
2       1873405          936702         0       0.000000
2       2006473         1003236         0       0.000001
2       1953842          976921         0       0.000004
2       1951338          975669         0       0.000000
2       1887887          943943         0       0.000004
4       1936350          484087         0       0.000055
4       1814430          453607         0       0.000087
4       1972681          493170         0       0.000055
4       1951748          487937         0       0.000206
4       1862182          465545         0       0.000283
8       2917216          364652         0       0.000008
8       2655834          331979         0       0.000018
8       3026734          378341         0       0.000005
8       3010204          376275         0       0.000004
8       2569647          321205         0       0.000014

2.4.0-multi-queue
-----------------
1       1295498         1295498         0       0.000000
1       1295011         1295011         0       0.000000
1       1296768         1296768         0       0.000000
1       1296053         1296053         0       0.000000
1       1296472         1296472         0       0.000000
2       1999043          999521         0       0.000000
2       1410636          705318         0       0.000000
2       1414476          707238         0       0.000000
2       2014664         1007332         0       0.000001
2       1414509          707254         0       0.000000
4       2046182          511545         0       0.000232
4       2101535          525383         0       0.000115
4       2094828          523707         0       0.000155
4       2097406          524351         0       0.000144
4       2057331          514332         0       0.000132
8       3795829          474478         0       0.000185
8       4058329          507291         0       0.001871
8       3845934          480741         0       0.000248
8       3715243          464405         0       0.000084
8       3777303          472162         0       0.000194

(Columns, per the benchmark's printf: threads, total iterations/sec, iterations/sec per thread, zero-work threads, normalized variance.)

As expected, the single-thread numbers for the multi-queue scheduler are not as good as those of the existing scheduler. However, at 2 threads it is getting pretty close, and from 4 threads up the multi-queue scheduler does better. In this multi-queue implementation, the amount of overhead is related to the number of processors in the system. Therefore, I would expect the numbers to 'be better' for low thread counts on systems with lower (<8) processor counts. It would be interesting to see if the point at which the multi-queue does better stays at approximately CPUs/2 as we change system configurations.

Hopefully we will have some more extensive benchmark results in the not too distant future. Until then, we'll be looking into optimizations to help out the multi-queue scheduler at low thread counts.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center

From: Gerhard M. <gm...@in...> - 2001-01-19 00:43:04
What effect does this scheduler have on 1 - 5 tasks?

Gerhard

On Thu, 18 Jan 2001, Mike Kravetz wrote:
> I just posted an updated version of the multi-queue scheduler
> for the 2.4.0 kernel.  This version also contains support for
> realtime tasks.  The patch can be found at:
>
>   http://lse.sourceforge.net/scheduling/
<snip>

--
Gerhard Mack                          gm...@in...

<>< As a computer I find your faith in technology amusing.

From: Mike K. <mkr...@se...> - 2001-01-19 20:49:31
On Thu, Jan 18, 2001 at 05:34:35PM -0800, Mike Kravetz wrote:
> On Fri, Jan 19, 2001 at 02:30:41AM +0100, Andrea Arcangeli wrote:
> > It was this one IIRC (it spawns threads calling sched_yield() in a loop).
>
> Thanks!

It was my intention to post IIRC numbers for small thread counts today. However, the benchmark (not the system) seems to hang on occasion. This occurs on both the unmodified 2.4.0 kernel and the one which contains my multi-queue patch. Therefore, I'm pretty sure it is not something I did. :)

Anyone else see anything like this before? I'll look into the reason for the hang, but it will delay my posting of these numbers.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center

From: Mike K. <mkr...@se...> - 2001-01-19 21:51:44
On Fri, Jan 19, 2001 at 12:49:21PM -0800, Mike Kravetz showed his lack of internet slang understanding and wrote:
>
> It was my intention to post IIRC numbers for small thread counts today.
> However, the benchmark (not the system) seems to hang on occasion.  This
> occurs on both the unmodified 2.4.0 kernel and the one which contains
> my multi-queue patch.  Therefore, I'm pretty sure it is not something
> I did. :)
>
> Anyone else see anything like this before?  I'll look into the reason
> for the hang, but it will delay my posting of these numbers.

I think I have found the problem. Here is a code snippet from the benchmark Andrea posted.

void oneatwork(int thr)
{
    int i;
    while (!start)              /* don't disturb pthread_create() */
        usleep(10000);

    actthreads++;
    while (!stop)
    {
        if (count)
            totalwork[thr]++;

        syscall(158);           /* sys_sched_yield() */
    }
    actthreads--;
    pthread_exit(0);
}

Note that actthreads is a global variable which is being updated by multiple threads without any form of synchronization. Because of this, actthreads sometimes never goes to zero after all worker threads have finished.

I changed actthreads to be an atomic and used atomic operations to manipulate it. With this change, I was able to complete one round of testing, which I had not been able to do in the past.

Does anyone maintain this benchmark code? The changes I indicate above should be made. If you need more specifics I can provide them.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center

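Mike's actual change isn't shown in the thread. A minimal sketch of one way to do it, guarding the counter with a pthread mutex; the names act_lock, act_add and act_read are made up for illustration:

    #include <pthread.h>

    static pthread_mutex_t act_lock = PTHREAD_MUTEX_INITIALIZER;
    static int actthreads = 0;

    /* All updates and reads go through the mutex, so concurrent
     * increments/decrements can no longer be lost. */
    static void act_add(int delta)
    {
        pthread_mutex_lock(&act_lock);
        actthreads += delta;
        pthread_mutex_unlock(&act_lock);
    }

    static int act_read(void)
    {
        int v;
        pthread_mutex_lock(&act_lock);
        v = actthreads;
        pthread_mutex_unlock(&act_lock);
        return v;
    }

oneatwork() would then call act_add(1) on entry and act_add(-1) on exit, and main() would poll act_read() instead of reading actthreads directly.
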
From: Davide L. <da...@xm...> - 2001-01-19 22:03:08
On Friday 19 January 2001 13:59, Mike Kravetz wrote:
> I think I have found the problem.  Here is a code snippet from the
> benchmark Andrea posted.
>
> <benchmark code snipped>
>
> Note that actthreads is a global variable which is being updated
> by multiple threads without any form of synchronization.  Because
> of this actthreads sometimes never goes to zero after all worker
> threads have finished.

If all threads complete successfully actthreads has to be zero. If some thread dies, this won't be true.

- Davide

From: Mike K. <mkr...@se...> - 2001-01-19 22:18:49
On Fri, Jan 19, 2001 at 02:03:06PM -0800, Davide Libenzi wrote:
<stuff deleted>
> > Note that actthreads is a global variable which is being updated
> > by multiple threads without any form of synchronization.  Because
> > of this actthreads sometimes never goes to zero after all worker
> > threads have finished.
>
> If all threads complete successfully actthreads has to be zero.

Not as currently coded. If two threads try to decrement actthreads at the same time, there is no guarantee that it will be decremented twice. That is why you need to put some type of synchronization in place.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center

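Spelled out, the lost update Mike describes: "actthreads--" compiles to a separate load, decrement and store, so with actthreads == 2 and two threads exiting at once:

    /*   thread A: load  actthreads      -> reads 2
     *   thread B: load  actthreads      -> reads 2
     *   thread A: store actthreads = 1
     *   thread B: store actthreads = 1  -> one decrement is lost
     *
     * actthreads never reaches 0, so main() spins forever in
     * "while (actthreads > 0) usleep(10000);" -- the observed hang. */
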
From: Davide L. <da...@xm...> - 2001-01-19 23:24:33
On Friday 19 January 2001 15:23, Mike Kravetz wrote:
> On Fri, Jan 19, 2001 at 02:03:06PM -0800, Davide Libenzi wrote:
> <stuff deleted>
> > > Note that actthreads is a global variable which is being updated
> > > by multiple threads without any form of synchronization.  Because
> > > of this actthreads sometimes never goes to zero after all worker
> > > threads have finished.
> >
> > If all threads complete successfully actthreads has to be zero.
>
> Not as currently coded.  If two threads try to decrement actthreads
> at the same time, there is no guarantee that it will be decremented
> twice.  That is why you need to put some type of synchronization in
> place.

Right, inc & dec are not atomic w/o #LOCK.

- Davide

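For reference, the #LOCK Davide mentions is the x86 LOCK prefix. A sketch of an atomic decrement modeled on the kernel's atomic_dec() in asm/atomic.h (not a verbatim excerpt):

    /* The LOCK prefix makes the whole read-modify-write indivisible
     * across CPUs, so concurrent decrements cannot be lost. */
    static __inline__ void atomic_dec_sketch(volatile int *v)
    {
        __asm__ __volatile__(
                "lock; decl %0"
                : "+m" (*v));
    }
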
From: Jun N. <ju...@sc...> - 2001-01-23 16:54:35
I tried to run SDET (Software Development Environment Throughput), which basically is a system-level, throughput-oriented benchmark, on the 2.4.0 kernel and on the 2.4.0 kernel with this patch. I guess many (old?) Unix guys are familiar with it, and it is (was?) sometimes used to check some aspects of the scalability of a system. The details of this benchmark are not so important in this mail (available upon request).

The following are very preliminary numbers from the benchmark. Tests were run on a system with 8 550 MHz Pentium III processors. I think those results are encouraging.

 # of Scripts    Throughput    Throughput
                    2.4        2.4-multi-queue
 ------------    ----------    ---------------
       1           2057.1          1978.0
       2           4114.3          4067.8
       4           7700.5          7700.5
       6          10746.3         10746.3
       8          12973.0         12576.4
      10          13186.8         13235.3
      15          13138.7         13235.3
      20          12996.4         13043.5
      25          13005.8         13005.8
      30          12811.4         13059.3
      40          12676.1         12732.1
      50          12121.2         12676.1
      60          12314.7         12442.4
      70          12051.6         11954.5
      80          11871.4         11985.0
      90          11608.7         11777.5
     100          10849.9         11523.7
     125          10678.7         10940.9
     150          10416.7         10503.8
     175          10187.6         10314.3
     200           9749.5         10106.7
     250           8343.4          8787.3

I also checked hot-spots with the 2.4.0 kernel (not with multi-queue) with lockmeter (http://oss.sgi.com/projects/lockmeter/). The data were sampled when the number of scripts is 175.

SPINLOCKS            HOLD               WAIT
  UTIL   CON    MEAN (  MAX  )    MEAN (  MAX  )    TOTAL  NAME
...
 10.56% 26.89%  7.4us(  175us)   3.4us(  692us)   1569304  runqueue_lock
  2.23% 29.75%  4.5us(   20us)   4.4us(  646us)    550505   __wake_up+0x7c
  0.01% 11.62%  6.6us(   15us)   1.0us(   65us)      2056   __wake_up+0x128
  0.00% 14.29%  0.4us(  2.6us)   3.0us(  332us)      1393   deliver_signal+0x58
  0.00%  9.94%  7.2us(   16us)   1.2us(   56us)       332   process_timeout+0x14
  0.01% 26.70%  4.7us(   16us)   5.0us(  296us)      1457   schedule_tail+0x58
  7.53% 23.28%   11us(  175us)   3.0us(  692us)    781676   schedule+0xd0
  0.66% 35.42%  3.5us(   23us)   2.8us(  486us)    206008   schedule+0x458
  0.00% 11.79%  4.2us(   78us)   1.1us(   56us)       560   schedule+0x504
  0.11%  9.42%  5.0us(   21us)   2.3us(  420us)     25317   wake_up_process+0x14

The above result basically tells us that utilization of runqueue_lock is about 10% of all spinlocks held during the benchmark, and nearly 27% of the requests for this lock need to spin and wait for the lock (the NAMEs below the lock are the locations where that lock is used). This might explain the throughput improvements gained by the multi-queue scheduler.

Now who has the largest utilization? Of course it's kernel_flag.

SPINLOCKS            HOLD               WAIT
  UTIL   CON    MEAN (  MAX  )    MEAN (  MAX  )    TOTAL  NAME
...
 43.15% 33.08%   13us(95971us)    12us(95997us)   3558789  kernel_flag
  0.02% 38.26%  0.7us(   29us)    34us(94975us)     23788   acct_process+0x1c
  0.02% 44.63%  8.3us(   43us)    23us(  675us)      2012   chrdev_open+0x4c
  0.00% 22.26%  0.9us(  2.5us)    16us(  525us)       283   de_put+0x28
  5.26% 38.34%  244us( 1184us)    21us(53127us)     23788   do_exit+0xf8
  0.99% 36.22%   11us(  840us)    12us(53195us)     96205   ext2_delete_inode+0x20
  0.46% 29.64%  1.2us(  159us)   9.1us(53249us)    430421   ext2_discard_prealloc+0x20
  1.28% 40.60%  9.7us(  152us)    22us(43404us)    146014   ext2_get_block+0x54
  0.00% 40.00%  0.4us(  0.7us)   8.6us(   34us)         5   locks_remove_flock+0x34
  0.00% 40.00%  0.6us(  1.2us)   4.5us(   14us)         5   locks_remove_posix+0x38
  0.92% 40.80%   12us(  572us)    16us(47804us)     84618   lookup_hash+0x84
  0.16% 37.35%  1.0us(  178us)    13us(53173us)    175002   notify_change+0x68
  7.78% 15.00%   46us( 2523us)   3.1us(27213us)    188485   permission+0x38
 20.34% 32.99%   12us( 1981us)    12us(95997us)   1927065   real_lookup+0x64
  0.05% 47.31%  595us(51910us)    22us(  270us)        93   schedule+0x490
  0.56% 42.11% 32861us(95971us)   41us(  405us)        19   sync_old_buffers+0x20
  0.83% 40.22%   19us( 1473us)    19us(41614us)     48081   sys_fcntl64+0x44
  0.01% 38.05%  1.3us(   37us)    22us(49506us)     12422   sys_ioctl+0x4c
  0.06% 33.12%  0.5us(   62us)    15us(49778us)    132230   sys_llseek+0x88
  0.00% 39.64%  0.9us(  4.9us)    19us(  849us)      5401   sys_lseek+0x6c
  0.00% 37.50%   28us(   48us)    12us(  222us)       200   sys_rename+0x1a0
  0.02% 42.29%  6.2us(   22us)    81us(93181us)      3802   sys_sysctl+0x4c
  0.00% 52.27%  6.4us(   29us)    13us(  156us)       132   tty_read+0xbc
  0.01% 41.36%   13us(   37us)    16us(  434us)       810   tty_release+0x1c
  0.00% 48.12%   17us(  143us)    22us(  497us)       133   tty_write+0x1bc
  2.08% 41.32%   25us(  309us)    18us(29470us)     92009   vfs_create+0x98
  0.52% 38.57%   85us(  227us)    12us(  698us)      6800   vfs_mkdir+0x90
  1.10% 38.40%   20us(  317us)    14us( 1100us)     60359   vfs_readdir+0x68
  0.07% 41.66%   12us(   78us)    18us( 1120us)      6800   vfs_rmdir+0x188
  0.00% 100.00%  24us(   24us)    21us(   27us)         2   vfs_statfs+0x4c
  0.60% 36.52%  7.2us(  104us)   9.4us(  904us)     91805   vfs_unlink+0x110

This tells many things, but:
- utilization of kernel_flag is about 43%, and nearly half of that utilization is done by real_lookup.
- its average hold-time is not relatively significant, but its max wait-time is.
- The location sync_old_buffers+0x20 looks responsible for the longest wait-time (95997us).
- sync_old_buffers is responsible for only 0.83% of lock utilization, but it has the largest average (32861us) and max (95971us) hold-time.

So if we replace the big kernel lock with a fine-grained lock in the real_lookup function, we would see more throughput improvements, at least for this benchmark. But I guess the reason for holding the big kernel lock in real_lookup() is that not all filesystems implement an MP-safe lookup routine. Is that a correct assumption? For sync_old_buffers, we could hold the big kernel lock per filesystem, for example.

static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, int flags)
{
        ...
        result = d_lookup(parent, name);
        if (!result) {
                struct dentry * dentry = d_alloc(parent, name);
                result = ERR_PTR(-ENOMEM);
                if (dentry) {
                        lock_kernel();
                        result = dir->i_op->lookup(dir, dentry);
                        unlock_kernel();
                        if (result)
                                dput(dentry);
                        else
                                result = dentry;
                }
                up(&dir->i_sem);
                return result;
        }
        ...
}

static int sync_old_buffers(void)
{
        lock_kernel();
        sync_supers(0);
        sync_inodes(0);
        unlock_kernel();

        flush_dirty_buffers(1);
        /* must really sync all the active I/O request to disk here */
        run_task_queue(&tq_disk);
        return 0;
}

Mike Kravetz wrote:
> I just posted an updated version of the multi-queue scheduler
> for the 2.4.0 kernel.  This version also contains support for
> realtime tasks.  The patch can be found at:
>
>   http://lse.sourceforge.net/scheduling/
<snip>

--
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352  Fax: 908-790-2426

From: Marcelo T. <ma...@co...> - 2001-01-23 21:44:00
On Tue, 23 Jan 2001, Jun Nakajima wrote:

<snip>

> This tells many things, but:
> - utilization of kernel_flag is about 43%, and nearly half of that
>   utilization is done by real_lookup.
> - its average hold-time is not relatively significant, but its max
>   wait-time is.
> - The location sync_old_buffers+0x20 looks responsible for the longest
>   wait-time (95997us).
> - sync_old_buffers is responsible for only 0.83% of lock utilization,
>   but it has the largest average (32861us) and max (95971us) hold-time.

Hi Jun,

I suppose one of the problems related to the kernel lock with this benchmark (and probably a lot of other stuff) is superblock lock contention.

If the amount of dirty buffers on the system is higher than a certain watermark, processes which are creating dirty buffers have to block and write older dirty buffers to disk before continuing.

The problem is that ext2 is doing this in a lot of places with the superblock lock held. And most of them (if not all) are holding the kernel lock at the same time.

sync_old_buffers() walks the list of superblocks, and each dirty sb found is locked and its buffer is marked dirty, potentially blocking again. The same is done for each page of each dirty inode.

The right thing is to make the ext2 functions flush dirty buffers, if needed, after unlocking the sb lock.

Are you interested in testing these modifications?

From: Jun N. <ju...@sc...> - 2001-01-24 19:09:24
Hi Marcelo,

Yes, I'm interested in testing those modifications. Let me explore the code a little bit. If some patch is available, that would be appreciated.

In the meantime, if somebody else is interested in it and has cycles, please let me know. I have other stuff to do right now.

Thanks,

Marcelo Tosatti wrote:
>
> I suppose one of the problems related to the kernel lock with this
> benchmark (and probably a lot of other stuff) is superblock lock
> contention.
<snip>
>
> The right thing is to make the ext2 functions flush dirty buffers, if
> needed, after unlocking the sb lock.
>
> Are you interested in testing these modifications?

--
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352  Fax: 908-790-2426

From: Marcelo T. <ma...@co...> - 2001-01-26 02:18:27
On Wed, 24 Jan 2001, Jun Nakajima wrote:
> Hi Marcelo,
>
> Yes, I'm interested in testing those modifications. Let me explore the
> code a little bit. If some patch is available, that would be
> appreciated.
>
> In the meantime, if somebody else is interested in it and has cycles,
> please let me know. I have other stuff to do right now.
>
> Thanks,

I have a part of the code done. I'll finish it and send it to you.

From: Andi K. <ak...@su...> - 2001-01-26 14:46:01
On Tue, Jan 23, 2001 at 05:53:55PM -0200, Marcelo Tosatti wrote:
> The right thing is to make the ext2 functions flush dirty buffers, if
> needed, after unlocking the sb lock.

Hopefully you can do that without additional memory allocation.

This means that some random process can get blocked when you run out of elevator entries, even when it doesn't do any IO anymore.

A lot of this clever stuff IMHO needs a truly asynchronous replacement for ll_rw_block().

-Andi

From: Marcelo T. <ma...@co...> - 2001-01-26 19:20:20
On Fri, 26 Jan 2001, Andi Kleen wrote:
> On Tue, Jan 23, 2001 at 05:53:55PM -0200, Marcelo Tosatti wrote:
> > The right thing is to make the ext2 functions flush dirty buffers, if
> > needed, after unlocking the sb lock.
>
> Hopefully you can do that without additional memory allocation.
>
> This means that some random process can get blocked when you run out of
> elevator entries, even when it doesn't do any IO anymore.

You mean some other process will have to block because I'm "bypassing" the buffer dirty watermark?

From: Andi K. <ak...@su...> - 2001-01-26 19:54:01
On Fri, Jan 26, 2001 at 03:29:43PM -0200, Marcelo Tosatti wrote:
> You mean some other process will have to block because I'm "bypassing"
> the buffer dirty watermark?

Maybe I misunderstood your suggestion. As I understand it, you basically want to create a construct like the Linux socket lock -- when a try_lock() fails, queue the work item somehow and have the original lock holder process it in its context after it frees the lock.

Flushing would probably end up doing an ll_rw_block(), right? The problem is that ll_rw_block is not 100% asynchronous; when you run out of elevator entries it'll block. This means that the lock freer could block on some foreign IO. Not nice.

-Andi

From: Marcelo T. <ma...@co...> - 2001-01-26 20:06:17
On Fri, 26 Jan 2001, Andi Kleen wrote:
> Maybe I misunderstood your suggestion. As I understand it, you basically
> want to create a construct like the Linux socket lock -- when a
> try_lock() fails, queue the work item somehow and have the original
> lock holder process it in its context after it frees the lock.

Not exactly.

> Flushing would probably end up doing an ll_rw_block(), right?
> The problem is that ll_rw_block is not 100% asynchronous; when you run
> out of elevator entries it'll block. This means that the lock freer
> could block on some foreign IO. Not nice.

Right now we potentially call ll_rw_block with the superblock lock held. Example (I've removed a lot of code):

void ext2_free_inode (struct inode * inode)
{
        struct super_block * sb = inode->i_sb;
        int is_directory;

        lock_super (sb);

        mark_buffer_dirty(bh2);
        mark_buffer_dirty(sb->u.ext2_sb.s_sbh);
        mark_buffer_dirty(bh);

error_return:
        unlock_super (sb);
}

Each "mark_buffer_dirty" will call balance_dirty() if we're really dirtying a buffer (if it was previously clean), and from there we potentially reach ll_rw_block(). Instead of sleeping with the superblock lock held, we can call balance_dirty() after "unlock_super()" if we actually dirtied any buffer, so we avoid sleeping with the superblock lock held.

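A hypothetical sketch of the reordering Marcelo describes -- his actual patch isn't shown in the thread. It assumes the mark_buffer_dirty() calls no longer trigger balance_dirty() internally, and it elides the same declarations his snippet does:

/* Hypothetical: mark buffers dirty under the superblock lock without
 * throttling, then call balance_dirty() -- which may sleep in
 * ll_rw_block() -- only after unlock_super(). */
void ext2_free_inode (struct inode * inode)
{
        struct super_block * sb = inode->i_sb;
        int dirtied = 0;

        lock_super (sb);
        /* ... bitmap updates ... */
        mark_buffer_dirty(bh2);
        mark_buffer_dirty(sb->u.ext2_sb.s_sbh);
        mark_buffer_dirty(bh);
        dirtied = 1;
        unlock_super (sb);

        if (dirtied)
                balance_dirty(sb->s_dev);   /* throttle outside the lock */
}
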
From: Mike K. <mkr...@se...> - 2001-01-23 17:08:48
On Tue, Jan 23, 2001 at 11:49:27AM -0500, Jun Nakajima wrote:
> I tried to run SDET (Software Development Environment Throughput), which
> basically is a system-level, throughput-oriented benchmark, on the 2.4.0
> kernel and on the 2.4.0 kernel with this patch.

Thanks for running this. I too remember SDET, but I won't claim to be old. :)

We were doing some more analysis on the multi-queue scheduler and noticed that performance has regressed since posting preliminary numbers with the 2.4.0-test10 kernel. After comparing the code, it looks like I have over-engineered for the worst case of lock contention. This was done at the expense of the normal case. I'm currently working on this situation and expect to have a new patch out in the not too distant future. I expect the numbers will get better.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center
