Thread: [GD-General] cross platform 64bit lock free FIFO?
From: Andras B. <and...@gm...> - 2006-08-05 16:02:47
I've found info on how to implement a lock-free FIFO on 32-bit systems, but I've also read that making it work on 64-bit systems with the current instruction sets is very difficult and extremely error prone. So I would prefer to use a proven solution instead of trying to roll my own. On Windows it's easy, because the Win32 API provides an interlocked SList that is supposed to work on 64-bit platforms too. But what about other operating systems?

Also, AFAIK these new 64-bit systems don't really have a 64-bit address space, but 40-44 bits? Does this mean that all these 64-bit implementations are doing is using the top 20 bits for ABA resolution?? I'm confused :)

Andras
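For reference, the Win32 interlocked SList mentioned above is used roughly like this -- a sketch only (payload type and names picked arbitrarily); note that the SList is actually a LIFO, not a FIFO, and its entries must be aligned to MEMORY_ALLOCATION_ALIGNMENT:

#include <windows.h>
#include <malloc.h>    /* _aligned_malloc / _aligned_free */

struct Item {
    SLIST_ENTRY entry;   /* must be the first member */
    int payload;         /* arbitrary payload for illustration */
};

static SLIST_HEADER g_stack;

void slist_example()
{
    InitializeSListHead(&g_stack);

    /* entries must be aligned to MEMORY_ALLOCATION_ALIGNMENT */
    Item *item = (Item *)_aligned_malloc(sizeof(Item), MEMORY_ALLOCATION_ALIGNMENT);
    item->payload = 42;
    InterlockedPushEntrySList(&g_stack, &item->entry);

    PSLIST_ENTRY e = InterlockedPopEntrySList(&g_stack);
    if (e) {
        Item *popped = (Item *)e;   /* valid because entry is the first member */
        /* ... use popped->payload ... */
        _aligned_free(popped);
    }
}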
From: Jon W. <hp...@mi...> - 2006-08-07 18:29:20
More than one question in the same message.

Typically, a lock-free FIFO supports only a single writer and a single reader. If that's all you need, then you don't even need the special instructions, because the writer will always write the same variable, and the reader will always read the other same variable. If your cache is not 100% read-after-write coherent, then you may have a temporary situation where there's something written that the reader doesn't yet see (or vice versa), but on Intel, that can never happen.

If you need multiple readers, or multiple writers, then correct lock-free implementation suddenly becomes a *lot* harder.

Cheers,

/ h+
From: Daniel V. <Dan...@ep...> - 2006-08-07 18:35:21
> see (or vice versa), but on Intel, that can never happen.

FWIW, it can on next-gen consoles, though at least there are intrinsics for inserting lightweight memory barriers.

--
Daniel, Epic Games Inc.
From: Andras B. <and...@gm...> - 2006-08-09 05:01:21
> Typically, a lock-free FIFO supports only a single writer and a single
> reader. If that's all you need, then you don't even need the special
> instructions [...]

Yes, I only need one writer and one reader thread, but I'm not sure that I understand the solution you propose. Could you elaborate a bit? Thanks,

Andras
From: Jon W. <hp...@mi...> - 2006-08-09 16:26:31
Andras Balogh wrote:
> Yes, I only need one writer and one reader thread, but I'm not sure that I
> understand the solution you propose. Could you elaborate a bit? Thanks,

Here is the implementation, in all its simplicity:

// Lock-free FIFO by Jon Watte.
// Size must be power of 2.
// This FIFO is correct on systems with strongly ordered caches (such as Intel
// and desktop PPC). It may require write barriers on systems with weakly
// ordered caches, at the points marked /* 1 */ and /* 2 */.
template<typename Stored, size_t Size>
class LockFreeFifo {
public:
    LockFreeFifo() { head = 0; tail = 0; }

    /* get an item from the FIFO, return true if an item was gotten, else return false */
    bool get(Stored & out) {
        if (head - tail > 0) {
            out = fifo_[tail & (Size-1)];
            ++tail; /* 1 */
            return true;
        }
        return false;
    }

    /* put an item into the FIFO, return true if it fit, false if it's full */
    bool put(Stored & in) {
        if (head - tail < Size) {
            fifo_[head & (Size-1)] = in;
            ++head; /* 2 */
            return true;
        }
        return false;
    }

private:
    unsigned int head;
    unsigned int tail;
    Stored fifo_[Size];
};

Because it's non-blocking, it returns FALSE when there's nothing to do (FIFO full or empty). Because there's a single reader and a single writer, there is no thread contention on "head" vs "tail". Because it's using unsigned math, and the Size is a power of two (important!), the two's complement wrapping arithmetic makes for no special cases. (Note the comparison of head - tail > 0 !)

On Intel, and other machines with strongly ordered caches, this doesn't need any kind of memory barrier or interlocked operation at all.

On some more exotic systems with weakly ordered caches (where you may get "old" data when reading), there is a chance of get() returning false when there is actually one item in the FIFO; similarly, there's a chance of put() returning false when there is actually space. When the "head" and "tail" share a cache line, the maximum difference can be one item. However, for absolute correctness in these cases, you want to insert a cheap memory barrier at the points marked /* 1 */ and /* 2 */ -- or, alternately, use the platform's atomic intrinsics (atomic_add(), InterlockedIncrement(), etc). On PPC, the memory barrier is traditionally the EIEIO instruction; on Intel (where it's not needed, unless you're writing to un-cached memory) it's the SFENCE instruction. I don't know if the new console PPC cores have a cheaper instruction than EIEIO.

Cheers,

/ h+
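A rough usage sketch, assuming POSIX threads and a strongly ordered x86-style cache (the element type, queue size, and item count are picked arbitrarily, and the compiler-reordering caveats discussed later in the thread still apply):

#include <pthread.h>
#include <stdio.h>

static LockFreeFifo<int, 256> g_fifo;

static void *producer(void *)
{
    for (int i = 0; i < 1000; ++i) {
        while (!g_fifo.put(i)) {
            /* FIFO full -- spin (or sleep/yield) until the consumer catches up */
        }
    }
    return 0;
}

static void *consumer(void *)
{
    int value, received = 0;
    while (received < 1000) {
        if (g_fifo.get(value)) {
            ++received;   /* process 'value' here */
        }
    }
    printf("consumer received %d items\n", received);
    return 0;
}

int main()
{
    pthread_t p, c;
    pthread_create(&p, 0, producer, 0);
    pthread_create(&c, 0, consumer, 0);
    pthread_join(p, 0);
    pthread_join(c, 0);
    return 0;
}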
From: Ola O. <ola...@gm...> - 2006-08-09 16:52:18
Hello there.

Just a question here: in the weakly ordered cache scenario (guess the platform :), isn't the real danger that you may end up with the

    ++head

arriving at the cache before the (or part of the)

    fifo_[head & (Size-1)] = in

leading to an opportunity for the reader to get totally bogus data? Thus the barrier is needed to ensure the data gets to cache before the increment, i.e. it needs to go before the increment on head (your comment seems to indicate they should go after)?

Having just started looking into this myself, this kind of discussion is great.

Another thing that was bothering me is whether it's "common knowledge" or universally guaranteed that the whole of any given 32-bit write ends up arriving at the cache together -- I guess more generally whether any store operation is kept together, although 32 bits is plenty for a FIFO length I imagine.

cheers
.ola
From: Jon W. <hp...@mi...> - 2006-08-09 19:53:22
Yes, you would need that as well -- I forgot that (which makes the class dangerous on weakly consistent caches). You should put the barrier before the ++head, then.

Note that the problem is with weakly _consistent_ caches, not just weakly ordered -- i.e., if cache line A can flush before cache line B, it doesn't matter, as long as the multiple CPUs implement cache snooping to stay consistent (which is the case on Intel). On PPC, you may end up wanting to use the cache line reservations with a spin loop.

Any naturally aligned write up to the CPU word size will end up in the cache together on pretty much any CPU. Note the alignment restriction, and the fact that this does not necessarily apply to 64-bit or 128-bit quantities on 32-bit CPUs!

Also, the original question said "64-bit FIFO" -- I failed to ask the question "what does 64-bit really mean?" -- I thought it meant communicating 64-bit quantities, but it can mean many other things.

Cheers,

/ h+
From: Jon W. <hp...@mi...> - 2006-08-09 20:10:56
And, to confuse things even more, if you have strict aliasing options turned on, the compiler may even decide to re-order the stores. Thus, the Totally Safe lock-free FIFO for arbitrary-sized items looks like this:

/* Lock-free FIFO by Jon Watte.
 * Stored can be a type of any size.
 * Size must be power of 2.
 * This FIFO is correct on systems with strongly consistent
 * caches (such as Intel). It may require write barriers on
 * systems with weakly consistent caches. Insert such a
 * barrier in the inline function barrier() if that's the case.
 */
template<typename Stored, size_t Size>
class LockFreeFifo {
public:
    LockFreeFifo() { head = 0; tail = 0; }

    /* Get an item from the FIFO.
     * Return true if an item was there and was copied,
     * false if the FIFO is empty.
     */
    bool get(Stored & out) {
        if (head - tail > 0) {
            out = fifo_[tail & (Size-1)];
            ++tail;
            barrier(); /* make sure writer knows FIFO isn't full */
            return true;
        }
        return false;
    }

    /* Put an item into the FIFO.
     * Return true if it fit and was copied,
     * false if the FIFO is full.
     */
    bool put(Stored & in) {
        if (head - tail < Size) {
            fifo_[head & (Size-1)] = in;
            barrier(); /* make sure FIFO is updated before head */
            ++head;
            barrier(); /* make sure reader can find the new item */
            return true;
        }
        return false;
    }

private:

    inline void barrier() {
        /* If your platform has a weakly consistent cache (i.e.,
         * not Intel) then you need to insert a barrier instruction
         * (such as EIEIO on PPC) here.
         */
#warning "Use the intrinsic specific to your platform, if needed."
    }

    /* Make these volatile, to keep strict aliasing analysis from
     * letting the compiler re-order the stores.
     */
    volatile unsigned int head;
    volatile unsigned int tail;
    volatile Stored fifo_[Size];
};
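One possible way of filling in barrier() -- a sketch assuming GCC-style inline assembly and builtins; other toolchains have their own intrinsics (e.g. _mm_sfence() or MemoryBarrier() under MSVC):

inline void barrier() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* Intel: caches are strongly consistent, so a compiler-only barrier
     * (stops the compiler from re-ordering the stores) is enough here. */
    __asm__ __volatile__("" ::: "memory");
#elif defined(__GNUC__) && (defined(__powerpc__) || defined(__ppc__))
    /* PPC: enforce store ordering with eieio (lwsync is an alternative
     * on cores that support it). */
    __asm__ __volatile__("eieio" ::: "memory");
#else
    /* Fallback: full memory barrier builtin (GCC 4.1+). */
    __sync_synchronize();
#endif
}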
From: Jon W. <hp...@mi...> - 2006-08-21 02:44:51
Andras Balogh wrote:
> Hmm, I have one more question regarding your implementation: You put
> barrier() after the increments [...] I believe that putting in these memory
> barriers won't solve the problem, because what if the threads get switched
> after the increment, but before executing the barrier()?

If you get switched out before the barrier, that's no different (really) from being switched out before the put operation -- you'll be switched back in and perform the barrier in the next time-slice.

The barriers are not necessary on Intel machines, because the data caches are guaranteed to be consistent. They use snooping to make read-after-write always return the consistent value.

Some more esoteric hardware may have caches that don't have this nice property, and a read from a word that's already in cache may return the old cache value, rather than the updated value living in some other cache.

The barriers may also be needed on machines with out-of-order speculative memory, where the write to update the FIFO contents may be re-ordered to happen after the write to update the queue size. There may be some chance of this happening under certain circumstances on certain Intel chips, AFAICT, so an SFENCE wouldn't hurt (although I believe the "common, plain" cases don't actually have this problem).

Cheers,

/ h+
From: Martin D. <ak...@sk...> - 2006-08-08 16:44:00
LL/SC primitives don't suffer from the ABA problem; however, they don't seem to be widely supported. They can be emulated using CAS or RSC/RLL though -- Google gave me these:

http://www.cs.dartmouth.edu/~spetrovic/papers/icdcs_multiword.pdf
http://portal.acm.org/citation.cfm?id=872078&dl=GUIDE&coll=&CFID=15151515&CFTOKEN=6184618

You are right though, most 64-bit platforms right now aren't using the full address space, so you should be able to find enough unused bits in a 64-bit pointer for CAS ABA resolution.
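A rough sketch of the "unused pointer bits" idea, assuming user-space pointers fit in the low 48 bits (true of today's x86-64 parts, but not architecturally guaranteed); the names are made up for illustration:

#include <stdint.h>

/* Pack a pointer and a 16-bit ABA counter into one 64-bit word, so a
 * plain 64-bit compare-and-swap can update both atomically. */
typedef uint64_t tagged_ptr;

static const uint64_t TP_PTR_MASK = ((uint64_t)1 << 48) - 1;

inline tagged_ptr tp_make(void *p, uint16_t tag)
{
    return ((uint64_t)tag << 48) | ((uint64_t)(uintptr_t)p & TP_PTR_MASK);
}

inline void *tp_ptr(tagged_ptr t)
{
    return (void *)(uintptr_t)(t & TP_PTR_MASK);
}

inline uint16_t tp_tag(tagged_ptr t)
{
    return (uint16_t)(t >> 48);
}

/* A CAS loop would then try to install tp_make(new_node, tp_tag(old_word) + 1),
 * so a node that is freed and re-allocated at the same address no longer
 * compares equal to a stale copy of the head. */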
From: Andras B. <and...@gm...> - 2006-08-20 05:34:34
Jon, I very much appreciate all the explanation and sharing of the implementation details! It is extremely helpful!!

To answer your question, when I asked for a 64-bit FIFO, I was thinking of the fully generic FIFOs on 64-bit systems, with unlimited queue length, and with multiple producer and consumer threads. In this case, all the solutions that I've found used linked lists with pointers, and required using an atomic compare-and-swap primitive that works on double the machine word size (so that you can use another machine word for ABA resolution), which is non-existent for 64-bit systems (there's no cmpxchg16b AFAIK), hence my question.

I admit that I didn't even think that having only one producer and one consumer thread and a fixed maximum queue length could simplify things this much! Turns out that this is all I need.

Thanks again,

Andras
From: Jon W. <hp...@mi...> - 2006-08-21 02:40:26
Andras Balogh wrote:
> In this case, all the solutions that I've found used linked lists with
> pointers, and required using an atomic compare-and-swap primitive that
> works on double the machine word size [...]

Yes, if you want that, you'll need locking. Sucks but true. Which is why I much prefer the single-writer, single-reader FIFO, which solves 95% of the actual cases :-)

Cheers,

/ h+
From: Andras B. <and...@gm...> - 2006-08-20 17:21:03
Hmm, I have one more question regarding your implementation: You put barrier() after the increments, because otherwise get() might think that the FIFO is empty when it's not, and put() might think that the FIFO is full when it's not. I believe that putting in these memory barriers won't solve the problem, because what if the threads get switched after the increment, but before executing the barrier()?? Even using interlocked_increment won't help, because only one thread is changing the value, and again, what if the thread switch occurs after the store and before the increment?

Of course, none of this really matters, because it should not result in corrupt data in any way; I just think that those barriers are unnecessary.

Andras
From: Andras B. <and...@gm...> - 2006-08-21 05:35:37
I think that we are talking about different barriers; here's a snippet from your code:

    if (head - tail < Size) {
        fifo_[head & (Size-1)] = in;
        barrier(); /* make sure FIFO is updated before head */
        ++head;
        barrier(); /* make sure reader can find the new item */
        return true;

I agree that the first barrier is absolutely necessary on those esoteric systems, because it forces the data to be actually written out (and I mean written in a sense that anyone else reading it will get the intended value) before the head is incremented.

Now, the other barrier, after the increment, is the one that is not needed, because it doesn't really matter whether the head has been incremented or not. In the worst case, the receiver won't know that there's a new element in the queue until later, but it will by no means result in data loss or corruption, no matter what architecture you are running on.

Andras
From: Jon W. <hp...@mi...> - 2006-08-21 17:25:58
Andras Balogh wrote:
> Now, the other barrier after the increment is the one that is not needed,
> because it doesn't really matter whether the head has been incremented or not.

There may be architectures where the other side wouldn't learn about the change until the following happens:

1) the cache line with the update gets evicted
2) the other side attempts to modify the cache line that was evicted (so it's re-loaded)

This is more likely in "NUMA" or "multiple bus master shared memory" architectures than in traditional SMP. Call it AMP, for asymmetric multi-processing :-)

Without the second barrier, such an architecture could get into live-lock. The barrier in that case serves as an explicit notification.

Cheers,

/ h+