From: James M. L. <llm...@gm...> - 2012-11-14 12:59:02
|
The attached patch replaces the doubly-linked list implementation of sb-concurrency:queue with a simpler singly-linked one. The new queue conses 1/3 as much and is 2X-5X faster in my benchmarks. It passes the sb-concurrency tests, as well as other tests for libraries that use sb-concurrency:queue.

The paper mentioned in sb-concurrency/queue.lisp appears to assume, along with one of its citations, that a dequeue operation for a singly-linked list implementation requires two CAS operations. Garbage-collected languages can get by with one CAS, however.

In addition, the head and tail references stored need not be precise. They move in one direction only. A thread is able to discover the actual head and tail nodes on its own, without having to read a stored reference twice. Barriers are not required -- if a thread gets an old head or tail reference, it merely has a slightly longer journey to the actual head or tail node.

A possible objection to this implementation is that it does not clear the CDRs of old nodes (though the CARs are cleared). This could possibly be done, but I wonder if it's necessary considering that CL:POP is commonly used and has the same behavior.

------- old queue:

CL-USER> (bench-queue:run 8 5000000)

=== SERIAL
Evaluation took:
  0.794 seconds of real time
  0.792049 seconds of total run time (0.668041 user, 0.124008 system)
  [ Run times consist of 0.548 seconds GC time, and 0.245 seconds non-GC time. ]
  99.75% CPU
  2,695,490,254 processor cycles
  120,008,240 bytes consed

=== SINGLE-PRODUCER-MULTIPLE-CONSUMER
Evaluation took:
  1.012 seconds of real time
  5.372336 seconds of total run time (5.216326 user, 0.156010 system)
  [ Run times consist of 0.348 seconds GC time, and 5.025 seconds non-GC time. ]
  530.83% CPU
  3,435,016,666 processor cycles
  120,013,328 bytes consed

=== MULTIPLE-PRODUCER-SINGLE-CONSUMER
Evaluation took:
  1.657 seconds of real time
  5.296331 seconds of total run time (5.080318 user, 0.216013 system)
  [ Run times consist of 1.008 seconds GC time, and 4.289 seconds non-GC time. ]
  319.61% CPU
  5,619,372,273 processor cycles
  119,967,432 bytes consed

------- new queue:

CL-USER> (bench-queue:run 8 5000000)

=== SERIAL
Evaluation took:
  0.170 seconds of real time
  0.168011 seconds of total run time (0.140009 user, 0.028002 system)
  [ Run times consist of 0.044 seconds GC time, and 0.125 seconds non-GC time. ]
  98.82% CPU
  577,716,716 processor cycles
  40,001,536 bytes consed

=== SINGLE-PRODUCER-MULTIPLE-CONSUMER
Evaluation took:
  0.563 seconds of real time
  3.300205 seconds of total run time (3.132195 user, 0.168010 system)
  [ Run times consist of 0.088 seconds GC time, and 3.213 seconds non-GC time. ]
  586.15% CPU
  1,908,838,348 processor cycles
  40,014,064 bytes consed

=== MULTIPLE-PRODUCER-SINGLE-CONSUMER
Evaluation took:
  0.516 seconds of real time
  2.812176 seconds of total run time (2.808176 user, 0.004000 system)
  [ Run times consist of 0.144 seconds GC time, and 2.669 seconds non-GC time. ]
  544.96% CPU
  1,751,178,119 processor cycles
  40,013,808 bytes consed |
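The dequeue scheme described above -- a single CAS that claims a node by storing a sentinel into its CAR, with the stored head allowed to lag behind -- might be sketched roughly as follows. This is an illustrative sketch only, with hypothetical struct and accessor names, not the actual patch:

```lisp
;; Hypothetical sketch, not the patch itself. A queue node is a cons;
;; DEQUEUE claims a node with a single CAS on its CAR. The stored
;; head may lag behind the true head; a reader simply follows CDRs.
(defconstant +dummy+ '+dummy+)   ; marks an already-claimed node

(defstruct queue head tail)

(defun dequeue (queue)
  (loop
    (let* ((head (queue-head queue))  ; possibly stale; that's fine
           (value (car head)))
      (if (eq value +dummy+)
          ;; Node already claimed by some thread: step toward the
          ;; true head, or report an empty queue when out of nodes.
          (let ((next (cdr head)))
            (if next
                (setf (queue-head queue) next)   ; lazily advance
                (return (values nil nil))))
          ;; Try to claim this node; on CAS failure, loop and retry.
          (when (eq value (sb-ext:compare-and-swap (car head)
                                                   value +dummy+))
            (return (values value t)))))))
```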
From: Paul K. <pv...@pv...> - 2012-11-14 15:27:14
|
On 2012-11-14, at 7:58 AM, "James M. Lawrence" <llm...@gm...> wrote:

> The attached patch replaces the doubly-linked list implementation of
> sb-concurrency:queue with a simpler singly-linked one. The new queue
> conses 1/3 as much and is 2X-5X faster in my benchmarks. It passes the
> sb-concurrency tests, as well as other tests for libraries that use
> sb-concurrency:queue.

[...]

> In addition, the head and tail references stored need not be precise.
> They move in one direction only. A thread is able to discover the
> actual head and tail nodes on its own, without having to read a stored
> reference twice. Barriers are not required -- if a thread gets an old
> head or tail reference, it merely has a slightly longer journey to the
> actual head or tail node.

I don't see why it's necessary to acquire conses by CASing +dummy+ in, rather than popping the head with a CAS?

> A possible objection to this implementation is that it does not clear
> the CDRs of old nodes (though the CARs are cleared). This could
> possibly be done, but I wonder if it's necessary considering that
> CL:POP is commonly used and has the same behavior.

The difference is that leaving the CDR as is on CL:POP doesn't leave a trail of references to *all* the subsequently-enqueued conses, but only to the ones in the stack at that time. As the tail-hunting loop shows, it's possible to find the current tail from any cons that's once been enqueued. Thus, any stray reference to a cons from the queue (e.g. due to conservative stack scanning) will prevent GCing not only the conses in the queue at the time the cons was dequeued, but also everything that was enqueued later. This isn't a theoretical issue, and has severely affected users in the past.

It's easy to clear cdrs (though not with NIL, to avoid ABA issues) on dequeue, but the tail-hunting loop isn't as nicely adjusted. Let's see if there's something clever.

> [SPSC/MPSC/SPMC micro-benchmark results]

Actually, I believe the most important thing to benchmark is MPMC performance. We obviously don't want disastrous performance in general, but sb-queue attempts to cover a lot of use cases; if performance is an issue, people should probably use a specialised implementation when they can.

8 cores sounds like a recent single-socket machine, is that right? I currently don't have a lot of free time, but I'll run tests and benchmarks on 2- and 4-socket machines as soon as I can. Reports from a (multiple-socket, ideally) PPC might also be useful.

Paul Khuong |
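The "tail-hunting loop" Paul refers to is the crux of the GC concern: from any cons that was ever enqueued, the current tail remains reachable through the CDR chain. A minimal illustration (hypothetical helper, for exposition only):

```lisp
;; From any node that was ever enqueued, following CDRs reaches the
;; current tail. This is why an uncleared CDR trail lets a single
;; stray reference (e.g. from conservative stack scanning) retain
;; every cons enqueued after that node, indefinitely.
(defun hunt-tail (node)
  (loop (let ((next (cdr node)))
          (if next
              (setf node next)   ; keep chasing the moving tail
              (return node)))))  ; CDR is NIL: this is the tail
```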
From: James M. L. <llm...@gm...> - 2012-11-14 23:09:55
Attachments:
bench-queue.lisp
|
On Wed, Nov 14, 2012 at 10:19 AM, Paul Khuong <pv...@pv...> wrote:

> I don't see why it's necessary to acquire conses by CASing +dummy+
> in, rather than popping the head with a CAS?

I don't understand the question -- I am popping the head with CAS. +dummy+ is a symbol, not a node. DEQUEUE CASes +dummy+ into the head node. There may exist more than one node with +dummy+ because the stored reference to head is not immediately updated.

> The difference is that leaving the CDR as is on CL:POP doesn't leave
> a trail of references to *all* the subsequently-enqueued conses, but
> only to the ones in the stack at that time.

Yes, that was a bad analogy. POP and DEQUEUE are similar, but PUSH conses to head while ENQUEUE conses to tail. I should make some adjustments.

> Actually, I believe the most important thing to benchmark is MPMC
> performance.

Right, I was being lazy. I translated an existing function into SBCL-specific code and stopped there. Below is a general MPMC benchmark. Arguments are: message-count producer-count consumer-count.

> 8 cores sounds like a recent single-socket machine, is that right?

This is a 4-core i7 -- I had shown 8 threads because I was about to show a range of threads, but picked a single medium size instead.

------- old queue:

=== (BENCH-QUEUE:SERIAL 10000000)
Evaluation took:
  1.192 seconds of real time
  1.188075 seconds of total run time (1.032065 user, 0.156010 system)
  [ Run times consist of 0.772 seconds GC time, and 0.417 seconds non-GC time. ]
  99.66% CPU
  4,044,003,488 processor cycles
  240,002,144 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 1 3)
Evaluation took:
  1.342 seconds of real time
  4.988312 seconds of total run time (4.940309 user, 0.048003 system)
  [ Run times consist of 0.128 seconds GC time, and 4.861 seconds non-GC time. ]
  371.68% CPU
  4,553,442,536 processor cycles
  315,922,304 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 3 1)
Evaluation took:
  2.449 seconds of real time
  4.444277 seconds of total run time (4.088255 user, 0.356022 system)
  [ Run times consist of 1.564 seconds GC time, and 2.881 seconds non-GC time. ]
  181.46% CPU
  8,306,514,625 processor cycles
  239,979,280 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 2 2)
Evaluation took:
  2.337 seconds of real time
  4.512282 seconds of total run time (4.184262 user, 0.328020 system)
  [ Run times consist of 1.356 seconds GC time, and 3.157 seconds non-GC time. ]
  193.07% CPU
  7,925,570,060 processor cycles
  240,075,008 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 1 15)
Evaluation took:
  2.267 seconds of real time
  17.141072 seconds of total run time (17.117070 user, 0.024002 system)
  [ Run times consist of 0.120 seconds GC time, and 17.022 seconds non-GC time. ]
  756.11% CPU
  7,691,566,363 processor cycles
  241,826,120 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 15 1)
Evaluation took:
  4.405 seconds of real time
  11.784736 seconds of total run time (11.288705 user, 0.496031 system)
  [ Run times consist of 3.064 seconds GC time, and 8.721 seconds non-GC time. ]
  267.54% CPU
  14,945,190,670 processor cycles
  239,931,448 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 8 8)
Evaluation took:
  5.169 seconds of real time
  16.485030 seconds of total run time (14.656916 user, 1.828114 system)
  [ Run times consist of 3.225 seconds GC time, and 13.261 seconds non-GC time. ]
  318.92% CPU
  17,534,303,287 processor cycles
  239,954,376 bytes consed

------- new queue:

=== (BENCH-QUEUE:SERIAL 10000000)
Evaluation took:
  0.342 seconds of real time
  0.340021 seconds of total run time (0.312019 user, 0.028002 system)
  [ Run times consist of 0.088 seconds GC time, and 0.253 seconds non-GC time. ]
  99.42% CPU
  1,160,856,090 processor cycles
  80,001,192 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 1 3)
Evaluation took:
  1.026 seconds of real time
  2.796175 seconds of total run time (2.684168 user, 0.112007 system)
  [ Run times consist of 0.320 seconds GC time, and 2.477 seconds non-GC time. ]
  272.51% CPU
  3,479,107,673 processor cycles
  80,001,320 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 3 1)
Evaluation took:
  0.682 seconds of real time
  2.224139 seconds of total run time (2.224139 user, 0.000000 system)
  [ Run times consist of 0.124 seconds GC time, and 2.101 seconds non-GC time. ]
  326.10% CPU
  2,313,474,512 processor cycles
  80,001,416 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 2 2)
Evaluation took:
  1.836 seconds of real time
  3.572224 seconds of total run time (3.352210 user, 0.220014 system)
  [ Run times consist of 1.201 seconds GC time, and 2.372 seconds non-GC time. ]
  194.55% CPU
  6,228,006,490 processor cycles
  80,001,216 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 1 15)
Evaluation took:
  1.208 seconds of real time
  8.840552 seconds of total run time (8.776548 user, 0.064004 system)
  [ Run times consist of 0.100 seconds GC time, and 8.741 seconds non-GC time. ]
  731.87% CPU
  4,097,924,610 processor cycles
  80,010,248 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 15 1)
Evaluation took:
  1.158 seconds of real time
  5.940371 seconds of total run time (5.828364 user, 0.112007 system)
  [ Run times consist of 0.444 seconds GC time, and 5.497 seconds non-GC time. ]
  512.95% CPU
  3,927,897,633 processor cycles
  79,974,432 bytes consed

=== (BENCH-QUEUE:CONCURRENT 10000000 8 8)
Evaluation took:
  0.866 seconds of real time
  6.056379 seconds of total run time (6.036378 user, 0.020001 system)
  [ Run times consist of 0.092 seconds GC time, and 5.965 seconds non-GC time. ]
  699.31% CPU
  2,935,998,863 processor cycles
  80,007,568 bytes consed |
From: James M. L. <llm...@gm...> - 2012-11-16 00:04:38
|
Attached is a version which clears CDRs. I was attracted to the idea of threads reading a head or tail slot once and then seeking independently, but that entails the bad idea of leaving CDRs intact. In any case, CDR-clearing and the added checks for the clear flag do not appear to adversely affect the benchmark numbers; indeed they look a little better now.

I don't know what to make of the Ladan-Mozes/Shavit paper cited in queue.lisp. I guess they didn't realize that a singly-linked queue implementation does not need two CAS operations for enqueue, as the purpose of their doubly-linked implementation is to avoid a second CAS. It's possible that I'm missing something. |
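CDR-clearing along the lines discussed -- using a distinguished sentinel rather than NIL, since a NIL CDR already means "current tail" and would reintroduce ABA problems -- might look like this. The names are assumptions for illustration (+DEAD-END+ echoes the constant referenced later in the thread, but this is not the attached patch):

```lisp
;; Illustrative sketch under assumed names. Once the stored head has
;; moved past a dequeued node, cut its CDR so that a stray reference
;; to the node can no longer retain everything enqueued afterwards.
;; NIL cannot serve as the mark because it also identifies the
;; current tail, hence a dedicated sentinel.
(defconstant +dead-end+ '+dead-end+)

(defun retire-node (node)
  (setf (car node) nil            ; drop the payload for the GC
        (cdr node) +dead-end+))   ; cut the trail to later nodes
```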
From: Nikodemus S. <nik...@ra...> - 2012-12-16 18:49:50
|
On 16 November 2012 02:04, James M. Lawrence <llm...@gm...> wrote:

> I don't know what to make of the Ladan-Mozes/Shavit paper cited in
> queue.lisp. I guess they didn't realize that a singly-linked queue
> implementation does not need two CAS operations for enqueue, as the
> purpose of their doubly-linked implementation is to avoid a second
> CAS. It's possible that I'm missing something.

(Catching up on email.)

I certainly didn't when I wrote the current version. :) Embarrassingly, even though it was obvious to me that their tagged pointers were not needed with GC, I didn't follow that train of thought to its proper conclusion.

At a quick read your patch looks good to me. I could niggle about style in a few places, but nothing major. The only real complaint I have is that it would be nice to have a discursive comment explaining the structure of the updates.

Cheers,

-- Nikodemus |
From: James M. L. <llm...@gm...> - 2012-12-17 03:17:08
|
On Sun, Dec 16, 2012 at 1:49 PM, Nikodemus Siivola <nik...@ra...> wrote:

> On 16 November 2012 02:04, James M. Lawrence <llm...@gm...> wrote:
>
>> I don't know what to make of the Ladan-Mozes/Shavit paper cited in
>> queue.lisp. I guess they didn't realize that a singly-linked queue
>> implementation does not need two CAS operations for enqueue, as the
>> purpose of their doubly-linked implementation is to avoid a second
>> CAS. It's possible that I'm missing something.
>
> (Catching up on email.)
>
> I certainly didn't when I wrote the current version. :) Embarrassingly,
> even though it was obvious to me that their tagged pointers were not
> needed with GC, I didn't follow that train of thought to its proper
> conclusion.
>
> At a quick read your patch looks good to me. I could niggle about
> style in a few places, but nothing major. The only real complaint I
> have is that it would be nice to have a discursive comment
> explaining the structure of the updates.

Here's a new patch which might address some niggles.

- added overview
- slots are typed
- ENQUEUE and DEQUEUE make use of inferencing
- visiting all nodes is less stringent -- quit if +DEAD-END+ is encountered, don't bother restarting

The read barrier in the old DEQUEUE is a curiosity to me. Does it reduce spinning? I didn't see much difference when I added a similar barrier to the new DEQUEUE -- it was maybe a little slower. |
From: Nikodemus S. <nik...@ra...> - 2013-01-20 14:37:42
|
On 17 December 2012 05:16, James M. Lawrence <llm...@gm...> wrote:

> Here's a new patch which might address some niggles.
>
> - added overview
> - slots are typed
> - ENQUEUE and DEQUEUE make use of inferencing

Very nice!

> - visiting all nodes is less stringent -- quit if +DEAD-END+ is
> encountered, don't bother restarting

If I read correctly, this means that even if there are N live nodes in the queue, we might visit zero nodes because the first one was dequeued at an inopportune moment? That doesn't seem right. Suggested changes attached.

> The read barrier in the old DEQUEUE is a curiosity to me. Does it
> reduce spinning? I didn't see much difference when I added a similar
> barrier to the new DEQUEUE -- it was maybe a little slower.

Read barriers are nops on x86oid arches; the guarantee they provide always holds on x86oids. (FWIW, http://www.kernel.org/doc/Documentation/memory-barriers.txt is a good summary of the different barriers.)

That particular barrier was introduced by Alastair at 713bb89f472457ec6654732b6b248b17b971f0ff for PPC threading -- which needs barriers. I'm not sure if that barrier was ever strictly speaking necessary, and I don't see anything that would require a barrier in your implementation.

If the patch attached looks good to you, I'll commit it in conjunction with yours.

Cheers,

-- Nikodemus |
From: James M. L. <llm...@gm...> - 2013-01-22 00:21:22
|
Hi,

LIST-QUEUE-CONTENTS doesn't have a clear meaning, of course. It's akin to sending a runner down a magical track in which the finish line is receding away from him while the starting line is chasing him. He can't tell us how long the track was at any given time, he can only tell us how many steps he took and what he saw.

My first implementation of LIST-QUEUE-CONTENTS (basically the same as yours) was to make a dogged runner that doesn't give up when the starting line catches up to him. But there is a bothersome point here: LIST-QUEUE-CONTENTS is not guaranteed to terminate! The runner may be perpetually shoved along by the starting line (corresponding to LIST-QUEUE-CONTENTS spinning), or may remain tantalizingly close to the finish line like Achilles and the Tortoise, or anywhere in between. In practice we know the runner is faster, and perhaps we really can assume that for all present and future hardware, though it still seems bothersome.

The way we both implemented LIST-QUEUE-CONTENTS is to afflict the runner with amnesia when the starting line catches him. But whether or not it catches him is arbitrary -- so on the roll of a die, the runner reports seventeen steps instead of a thousand.

When I changed LIST-QUEUE-CONTENTS, my thought at the time was that LIST-QUEUE-CONTENTS is already a lie: the list it returns is a montage rather than a snapshot. We can't assume that it corresponds to a state that existed in the past or a state that presently exists. What is the point of spinning -- effectively contributing to contention -- for a lie?

However, the proper way to address excessive spinning (instead of how I did it) would be to offer a TRY-LIST-QUEUE-CONTENTS function which is like LIST-QUEUE-CONTENTS except that it may return a failure flag along with the partial contents accumulated. I'm not actually proposing this because the issue is only theoretical. In practice I don't think the spinning matters, as you indicate. The runner might trip a few times at the outset, but once he gets into stride then ZOOM he's outta there.

I have a couple of additional changes. ENQUEUE shouldn't be pessimistic; removing the (cdr tail) check yields a small improvement. And stylistically I should use CONDs, as in the original queue.lisp. Since your LIST-QUEUE-CONTENTS isn't much different from my previous one, do you mind if I re-diff your change against that?

For good measure I should mention that DEQUEUE may be implemented in two ways: one CASes with CAR and the other CASes with QUEUE-HEAD. I found the former to be slightly faster on my machine, and that's the only reason I chose it.

On Sun, Jan 20, 2013 at 9:11 AM, Nikodemus Siivola <nik...@ra...> wrote:

> On 17 December 2012 05:16, James M. Lawrence <llm...@gm...> wrote:
>
>> Here's a new patch which might address some niggles.
>>
>> - added overview
>> - slots are typed
>> - ENQUEUE and DEQUEUE make use of inferencing
>
> Very nice!
>
>> - visiting all nodes is less stringent -- quit if +DEAD-END+ is
>> encountered, don't bother restarting
>
> If I read correctly this means that even if there are N live nodes in
> the queue, we might visit zero nodes because the first one was
> dequeued at an inopportune moment?
>
> That doesn't seem right. Suggested changes attached.
>
>> The read barrier in the old DEQUEUE is a curiosity to me. Does it
>> reduce spinning? I didn't see much difference when I added a similar
>> barrier to the new DEQUEUE -- it was maybe a little slower.
>
> Read barriers are nops on x86oid arches; the guarantee they provide
> holds always on x86oids. (FWIW,
> http://www.kernel.org/doc/Documentation/memory-barriers.txt is a good
> summary of the different barriers.)
>
> That particular barrier was introduced by Alastair at
> 713bb89f472457ec6654732b6b248b17b971f0ff for PPC threading -- which
> needs barriers.
>
> I'm not sure if that barrier was ever strictly speaking necessary, and
> I don't see anything that would require a barrier in your
> implementation.
>
> If the patch attached looks good to you, I'll commit it in conjunction
> with yours.
>
> Cheers,
>
> -- Nikodemus |
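The "dogged runner" being debated -- a contents walker that restarts from the new head whenever the dequeuers overtake it -- can be sketched roughly as follows. All names here are assumptions for illustration (+DUMMY+ marking a claimed node, +DEAD-END+ a cleared CDR); this restarting variant is precisely the one that is not guaranteed to terminate under sustained contention:

```lisp
;; Rough sketch, not the committed code. Sentinels as discussed in
;; the thread: +DUMMY+ marks a claimed node's CAR, +DEAD-END+ a
;; cleared CDR.
(defconstant +dummy+ '+dummy+)
(defconstant +dead-end+ '+dead-end+)

(defun list-queue-contents (queue)
  (loop
    (let ((contents '())
          (node (queue-head queue))
          (overtaken nil))
      ;; Walk toward the tail until the CDR chain ends (NIL) or we
      ;; step onto a node whose trail was cut under us (+DEAD-END+).
      (loop while node
            do (cond ((eq node +dead-end+)
                      (setf overtaken t node nil))  ; start over
                     (t
                      (let ((value (car node)))
                        (unless (eq value +dummy+)  ; skip claimed nodes
                          (push value contents)))
                      (setf node (cdr node)))))
      (unless overtaken
        (return (nreverse contents))))))
```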
From: Nikodemus S. <nik...@ra...> - 2013-01-22 05:38:06
|
On 22 January 2013 02:21, James M. Lawrence <llm...@gm...> wrote:

> LIST-QUEUE-CONTENTS doesn't have a clear meaning, of course. It's akin
> to sending a runner down a magical track in which the finish line is
> receding away from him while the starting line is chasing him. He
> can't tell us how long the track was at any given time, he can only
> tell us how many steps he took and what he saw.

Given that all operations that need to traverse the queue are there primarily for debugging, I would /really/ hate to try to see what was in the queue and be told it was empty when it wasn't.

If these operations don't retry till they succeed, they must fail in a way that allows the user to know what happened. (An error with restarts to retry once or "as long as necessary" would work for me.)

Cheers,

-- Nikodemus |
From: James M. L. <llm...@gm...> - 2013-01-22 12:36:32
|
On Tue, Jan 22, 2013 at 12:38 AM, Nikodemus Siivola <nik...@ra...> wrote:

> On 22 January 2013 02:21, James M. Lawrence <llm...@gm...> wrote:
>
>> LIST-QUEUE-CONTENTS doesn't have a clear meaning, of course. It's akin
>> to sending a runner down a magical track in which the finish line is
>> receding away from him while the starting line is chasing him. He
>> can't tell us how long the track was at any given time, he can only
>> tell us how many steps he took and what he saw.
>
> Given that all operations that need to traverse the queue are there
> primarily for debugging, I would /really/ hate to try to see what was
> in the queue and be told it was empty when it wasn't.
>
> If these operations don't retry till they succeed, they must fail in a
> way that allows the user to know what happened. (An error with
> restarts to retry once or "as long as necessary" would work for me.)

We are already on the same page -- I said the way to prevent unwanted spinning is to return a fail flag along with the partial contents (otherwise we throw away information). But I said I wasn't actually proposing that since spinning isn't a concern in practice. I should have written a shorter email. |
From: Nikodemus S. <nik...@ra...> - 2013-01-22 13:29:06
|
Or I should have read it completely before replying. :)

Sent from my phone - apologies for substandard spelling and formatting.

On Jan 22, 2013 2:36 PM, "James M. Lawrence" <llm...@gm...> wrote:

> On Tue, Jan 22, 2013 at 12:38 AM, Nikodemus Siivola
> <nik...@ra...> wrote:
>
>> On 22 January 2013 02:21, James M. Lawrence <llm...@gm...> wrote:
>>
>>> LIST-QUEUE-CONTENTS doesn't have a clear meaning, of course. It's akin
>>> to sending a runner down a magical track in which the finish line is
>>> receding away from him while the starting line is chasing him. He
>>> can't tell us how long the track was at any given time, he can only
>>> tell us how many steps he took and what he saw.
>>
>> Given that all operations that need to traverse the queue are there
>> primarily for debugging, I would /really/ hate to try to see what was
>> in the queue and be told it was empty when it wasn't.
>>
>> If these operations don't retry till they succeed, they must fail in a
>> way that allows the user to know what happened. (An error with
>> restarts to retry once or "as long as necessary" would work for me.)
>
> We are already on the same page -- I said the way to prevent unwanted
> spinning is to return a fail flag along with the partial contents
> (otherwise we throw away information). But I said I wasn't actually
> proposing that since spinning isn't a concern in practice. I should
> have written a shorter email. |
From: James M. L. <llm...@gm...> - 2013-01-25 01:34:00
|
Here's a final tweak -- a slightly faster ENQUEUE along with some comment/style cleanup, and Nikodemus' rebased change on top. |
From: Nikodemus S. <nik...@ra...> - 2013-01-25 14:42:50
|
Thank you! I'll merge this after the freeze.

Sent from my phone - apologies for substandard spelling and formatting.

On Jan 25, 2013 3:36 AM, "James M. Lawrence" <llm...@gm...> wrote:

> Here's a final tweak -- a slightly faster ENQUEUE along with some
> comment/style cleanup, and Nikodemus' rebased change on top.
>
> _______________________________________________
> Sbcl-devel mailing list
> Sbc...@li...
> https://lists.sourceforge.net/lists/listinfo/sbcl-devel |
From: James M. L. <llm...@gm...> - 2013-02-08 20:04:18
|
At the risk of making this even more complicated than it should have been...

On Mon, Jan 21, 2013 at 7:21 PM, James M. Lawrence <llm...@gm...> wrote:

> For good measure I should mention that DEQUEUE may be implemented in
> two ways: one CASes with CAR and the other CASes with QUEUE-HEAD. I
> found the former to be slightly faster on my machine, and that's the
> only reason I chose it.

This result is a little unexpected, so I ran the benchmarks again with (optimize speed) added. This caused CASing with QUEUE-HEAD to slightly win when producers = consumers, while otherwise it's a tradeoff, with CAS CAR winning by 12% when consumers > producers and CAS QUEUE-HEAD winning by 12% when producers > consumers. (Thanks to Martin Simmons for causing me to re-evaluate this.)

The general rule seems to be that "pessimistic" checking -- testing whether an operation will fail before performing it -- is worse except perhaps during high contention. For example, ENQUEUE became faster when I removed the (null (cdr tail)) check.

Considering that CAS QUEUE-HEAD now wins when producers = consumers, and that it makes DEQUEUE simpler, I think it should be used instead.

I also noticed the new try-walk-queue adds a pessimistic check. The simpler version (sans the funny GO in a lambda) is faster, which appears to be a better defense against restarting. I've taken the liberty of restoring it and rebasing Nikodemus' comments on top.

# 50000 messages

## CAS QUEUE-HEAD

### 2p2c
3028.752 3022.7446 3025.45

### 3p1c
3130.399 3134.422 3149.8293

### 1p3c
3698.7356 3724.058 3668.6968

## CAS CAR

### 2p2c
3295.852 3269.6978 3292.238

### 3p1c
3535.445 3552.5916 3537.678

### 1p3c
3308.7092 3295.0532 3289.5344

## old queue

### 2p2c
4463.223 4440.3125 4424.1255

### 3p1c
4301.402 4279.555 4280.9595

### 1p3c
5060.3525 5047.053 4975.4663 |
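The CAS QUEUE-HEAD alternative being compared might take roughly the following shape. This is a speculative sketch under assumed names -- it models the classic style where the head slot points one node before the first value, which is not necessarily how the committed code arranges things:

```lisp
;; Speculative sketch, not the committed DEQUEUE. Assumes QUEUE-HEAD
;; points at a node whose successor holds the next value, so a
;; single CAS on the QUEUE-HEAD slot both claims the value node and
;; advances the head.
(defun dequeue/cas-head (queue)
  (loop
    (let* ((head (queue-head queue))
           (next (cdr head)))
      (cond ((null next)
             (return (values nil nil)))             ; queue empty
            ((eq head (sb-ext:compare-and-swap (queue-head queue)
                                               head next))
             ;; The CAS succeeded, so NEXT is exclusively ours.
             (let ((value (car next)))
               (setf (car next) nil)                ; help the GC
               (return (values value t))))))))
```

Only the CAS target differs from the CAS-CAR variant; the contention profile shifts because producers and consumers stop competing for the same cons cells and instead compete on the queue structure's own slot.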
From: Nikodemus S. <nik...@ra...> - 2013-02-11 19:33:04
|
On 8 February 2013 22:04, James M. Lawrence <llm...@gm...> wrote:

> At the risk of making this even more complicated than it should have
> been...

Pushed, thank you!

-- nikodemus |
From: James M. L. <llm...@gm...> - 2013-02-22 05:24:50
|
(defun enqueue (value queue)
  (let ((new (cons value nil)))
    (loop (when (eq nil (sb-ext:compare-and-swap (cdr (queue-tail queue))
                                                 nil new))
            (setf (queue-tail queue) new)
            (return value)))))

The CCL version of ENQUEUE fails intermittently. After filing a bug report, I was told that a memory barrier is required. That seems unlikely, but if true then it might apply to SBCL, since both implementations use LOCK CMPXCHG. The rebuttal I gave on the CCL bugtracker follows -- is it correct?

===

It seems to me that the single CAS in ENQUEUE provides necessary and sufficient ordering.

After a thread writes to the tail slot, another thread may read the old tail for some length of time. If so then that's fine -- CAS will fail for some number of iterations, and no harm is done.

Therefore a write to the tail slot is always preceded by a valid read. And a valid read must have been preceded by a completed write. Therefore writes to the tail slot are ordered as a consequence of CAS writes being ordered.

=== |
From: Nikodemus S. <nik...@ra...> - 2013-02-22 13:47:53
|
On 22 February 2013 07:24, James M. Lawrence <llm...@gm...> wrote:

> It seems to me that the single CAS in ENQUEUE provides necessary and
> sufficient ordering.
>
> After a thread writes to the tail slot, another thread may read the
> old tail for some length of time. If so then that's fine -- CAS will
> fail for some number of iterations, and no harm is done.
>
> Therefore a write to the tail slot is always preceded by a valid read.
> And a valid read must have been preceded by a completed write.
> Therefore writes to the tail slot are ordered as a consequence of CAS
> writes being ordered.

Seems correct to me. Comparing disassembly could be instructive. (If CMPXCHG is used without a LOCK prefix, that would do the trick, for example.)

Cheers,

-- Nikodemus |
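Checking the generated code as Nikodemus suggests is a one-liner at the REPL:

```lisp
;; Print the machine code SBCL compiled for ENQUEUE; on x86-64 the
;; compare-and-swap should show up as LOCK CMPXCHG. A missing LOCK
;; prefix would explain intermittent failures on multiprocessors.
(disassemble 'enqueue)
```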
From: Paul K. <pv...@pv...> - 2013-02-25 05:11:56
|
Nikodemus Siivola wrote:

> On 22 February 2013 07:24, James M. Lawrence <llm...@gm...> wrote:
>
>> It seems to me that the single CAS in ENQUEUE provides necessary and
>> sufficient ordering.

[...]

> Seems correct to me. Comparing disassembly could be instructive. (If
> CMPXCHG is used without a LOCK prefix, that would do the trick, for
> example.)

Also, SBCL guarantees that even on non-TSO platforms CAS is a full barrier, right? I don't know if one can expect every other implementation to be this nice.

For anyone else following, the issue was a tiny bug in CCL's PC Lusering logic. It was fixed yesterday: http://trac.clozure.com/ccl/ticket/1058 .

Paul Khuong |