Thread: [Algorithms] lock-free stack with hazard list performance
From: Stephan R. <ke...@so...> - 2009-04-21 21:05:01
Currently working on lock-free stack and queue implementations with hazard lists.
From a technical standpoint, they are working nicely. I'm currently doing my testing with the stack, as it's the simpler of the two implementation-wise, so there's less room for bugs to sneak in.
However, performance-wise, things aren't so happy.
Ran a few tests:
Common to all tests:
Values pushed/popped: 2^24 per push thread
Thread count: 2 pushing, 2 popping
Processor: quad core, so each thread is running on its own core
Stack with spinlock: ~33 seconds
Stack with hazard pointers, properly deleting memory: ~45 seconds
Stack with hazard pointers, leaking memory by simply marking the pointer as no longer referenced: ~17 seconds
The third test does indicate a significant gain over the spinlock, nearly cutting the time in half.
However, the scan across the other threads for pointers that are still being referenced is completely dragging performance down. In pseudocode, it looks roughly like this:
if deleteCount < threshold return; // Tried various thresholds here; values up to around 128 seem to slightly improve performance, while larger values like 1024 totally annihilate it.
foreach node
    if node->status == Delete
        if !scan(node->pointer) // scan returns false if no other threads hold a ref
            delete the node and mark the node as free
        endif
    endif
endforeach
Then the scan function:
foreach thread
    if thread != thisthread and thread->isrunning
        if thread->hasreference(pointer)
            return true;
        endif
    endif
endforeach
The hasreference function then has to cycle through every single entry of that thread's hazard list to see if it holds an active reference to that pointer.
It works. Slowly. But a loop that on every iteration calls another loop, which on every iteration calls yet another loop (try to say that 10 times really fast...), is bound to be like trying to tow a garbage truck with a Honda Civic, especially as the number of items in the lists goes up.
Currently these lists are all singly linked lists. I could try using something like a hashtable and see if I do better than that. But all the information I've found online about hazard pointers seems to use plain singly linked lists as well, so I'm not quite sure where my issue is.
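For comparison, here is a minimal sketch of the scan strategy from Michael's hazard-pointer paper, which avoids the triple loop: snapshot every thread's hazard slots once into a sorted local array, then test each retired node against that snapshot with a binary search instead of re-walking every thread per node. It uses std::atomic for brevity, and the names (HazardSlot, ThreadRec, g_threads, Scan) are illustrative rather than taken from the code above.

#include <algorithm>
#include <atomic>
#include <vector>

struct HazardSlot
{
    std::atomic<void*> ptr;                       // one published pointer per slot
    HazardSlot() : ptr((void*)0) {}
};

struct ThreadRec
{
    HazardSlot slots[4];                          // a few hazard slots per thread
    std::vector<void*> retired;                   // nodes this thread wants to delete
};

std::vector<ThreadRec*> g_threads;                // filled in at thread registration (illustrative)

// Called once retired.size() exceeds the threshold.
void Scan(ThreadRec& self, void (*deleter)(void*))
{
    // 1. Snapshot every thread's hazard slots once: O(threads * slots).
    std::vector<void*> hazards;
    for (size_t t = 0; t < g_threads.size(); ++t)
        for (int s = 0; s < 4; ++s)
            if (void* p = g_threads[t]->slots[s].ptr.load())
                hazards.push_back(p);
    std::sort(hazards.begin(), hazards.end());

    // 2. Test each retired node against the snapshot: O(retired * log(hazards)).
    std::vector<void*> keep;
    for (size_t i = 0; i < self.retired.size(); ++i)
    {
        void* p = self.retired[i];
        if (std::binary_search(hazards.begin(), hazards.end(), p))
            keep.push_back(p);                    // still referenced somewhere, retry later
        else
            deleter(p);                           // no thread holds it, safe to delete
    }
    self.retired.swap(keep);
}

With only a handful of hazards per thread the snapshot stays tiny, and the paper also only runs the scan once the retired count exceeds a threshold proportional to the total number of hazard slots, which keeps the amortized cost per retired node roughly constant.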
Any ideas?
Thanks,
Stephan
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-21 22:33:51
I'm not familiar with hazard lists, but from reading the pseudocode it seems that you get additional functionality at the cost of performance. Unless I am missing something (which is likely; googling "hazard lists" doesn't seem to be very helpful), it seems it would be slower on principle, and only losing about 25% performance over a spinlock doesn't seem too bad.

Nicholas "Indy" Ray
From: Stephan R. <ke...@so...> - 2009-04-21 23:28:52
On Tue, 2009-04-21 at 15:33 -0700, Nicholas "Indy" Ray wrote:
> I'm not familiar with hazard lists, but from reading the pseudocode it
> seems that you get additional functionality at the cost of performance.
> Unless I am missing something (which is likely; googling "hazard lists"
> doesn't seem to be very helpful), it seems it would be slower on
> principle, and only losing about 25% performance over a spinlock
> doesn't seem too bad.

Not really gaining any additional functionality; the whole point of going lock-free is a performance gain over spinlocks. However, there are two key problems with lock-free that really have their root in the same place (apologies if I'm simply repeating stuff you already know). In between thread 1 obtaining the pointer to what I'll call Node A and then performing a CAS to change it, other threads could do the following:

1. Remove Node A and put it right back by recycling it. ABA problem. This will cause the first thread's CAS to succeed, but with the wrong data, causing data corruption.

2. Remove Node A and delete it. Major crash right there when thread 1
now tries to do its CAS.

Both problems can be solved by ensuring that Node A is neither recycled nor
deleted until all threads release their reference to it. All the lock-free
implementations I've found so far have been in managed languages such as
C# or Java for that reason, as the GC ensures exactly that. In C++, without a GC, this becomes a bit more problematic.

Using a free-list to recycle nodes will solve problem 2, but it does not solve ABA; it actually begs for it, rolls out the red carpet, and even pays for the limo. That can be countered a bit by using tag bits in the pointer, a common solution to the issue. When working with 32-bit code this is easy: 32-bit pointer, 32-bit tag field, 64-bit CAS. Everything is happy. Each time the node is recycled, the tag field is incremented, ensuring that any still-referencing threads will fail on their CAS. This becomes problematic when working with 64-bit code and 64-bit pointers, though: no more spare 32 bits to use for tags. I could use the lower bits, but I'm not really comfortable with that, as without enough bits the tag will roll over and cause ABA anyway.

Which, last but not least, brings me to what I'm using right now: hazard pointers. Basically, the principle is this: each thread contains its own list of 'hazards'. A hazard is simply a pointer that one or more threads are currently referencing. So in my stack code, for instance, when I access the top node, here's what happens:

1. Obtain the pointer to the node.
2. Put the pointer into my hazard list.
3. Do whatever I want with the node.
4. When done, remove the pointer from my hazard list.

In principle, this works identically to a spinlock or mutex when you look at the code, except that every single thread has its own privately owned list. So each thread has its own private 'lock', so to speak, allowing multiple threads access to the same resource without blocking each other. However, no thread is allowed to delete or recycle a node as long as any other thread still has that pointer in its hazard list. That's what the pseudocode in the previous e-mail does. It solves ABA and the deletion problem. However, all that work is relatively pointless if it doesn't perform any quicker than a simple spinlock.
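To make steps 1-4 above concrete, here is a minimal sketch of acquire/release using std::atomic. HazardSlot, AcquireTop and Release are illustrative names, not the actual code from this thread. One subtlety the hazard-pointer papers insist on: after publishing the pointer in step 2 you must re-check that it is still the current top, otherwise the node may already have been handed to the reclaimer before your hazard became visible.

#include <atomic>

template <typename T> struct Node { T data; Node* next; };

// One published hazard slot per thread (the real implementation keeps a list).
template <typename T>
struct HazardSlot
{
    std::atomic<Node<T>*> hp;
    HazardSlot() : hp((Node<T>*)0) {}
};

// Steps 1 and 2, plus the re-validation that makes step 2 safe.
template <typename T>
Node<T>* AcquireTop(std::atomic<Node<T>*>& top, HazardSlot<T>& slot)
{
    Node<T>* node;
    do {
        node = top.load();
        slot.hp.store(node);              // step 2: publish the hazard
    } while (node != top.load());         // re-validate; retry if top moved meanwhile
    return node;                          // step 3 (using the node) happens in the caller
}

// Step 4: drop the hazard once we're done with the node.
template <typename T>
void Release(HazardSlot<T>& slot)
{
    slot.hp.store((Node<T>*)0);
}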
I was just thinking, there may be one thing I can try to speed things up a bit. Currently, the hazard list is implemented as follows: one singly linked list that contains both the pointers that are actively locked and the ones to be deleted. Each record in the list has a status field that is either 'Active' (currently referenced), 'Delete' (to be deleted whenever possible) or 'Dead' (either the reference has been released or the node has been deleted). Dead records can then be recycled to avoid needing to grow the list, which works quite well: roughly 15-20 hazard records in varying states per thread while pumping 2^24 values on/off a stack. This also simplifies the Acquire / Release / Free code, as I don't have to move data from one list to another; I simply find my record, change the status and I'm done.

However, it may end up paying off to maintain two linked lists and dedicate one just to objects to be deleted. That way, when I do my scans across other threads' lists, I avoid scanning records that the other thread wants to have deleted. Come to think of it, maybe even three doubly linked lists that let me move records between lists, so that instead of a status field in the record, the list a record sits in determines its status. With a doubly linked list I could do that fairly easily. I'll have to play with that and see what happens...

Stephan
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-22 00:23:21
On Tue, Apr 21, 2009 at 4:28 PM, Stephan Rose <ke...@so...> wrote:
> 2. Remove Node A and delete it. Major crash right there when thread 1
> now tries to do its CAS.

Here is where I am confused: shouldn't the CAS be on the head pointer
during a dequeue, so that if the CAS is successful (assuming that
dequeue is the only way to get a reference to an item on the list,
which is how lock-free queues should be anyway) you know that you are
the only thread that has access to the node?

> Both problems can be solved by ensuring that Node A is neither recycled nor
> deleted until all threads release their reference to it. All the lock-free
> implementations I've found so far have been in managed languages such as
> C# or Java for that reason, as the GC ensures exactly that.

Again, not sure why multiple threads would have references to nodes still on the queue.

Nicholas "Indy" Ray
From: Stephan R. <ke...@so...> - 2009-04-22 01:15:40
On Tue, 2009-04-21 at 17:23 -0700, Nicholas "Indy" Ray wrote:
> On Tue, Apr 21, 2009 at 4:28 PM, Stephan Rose <ke...@so...> wrote:
> > 2. Remove Node A and delete it. Major crash right there when thread 1
> > now tries to do its CAS.
>
> Here is where I am confused: shouldn't the CAS be on the head pointer
> during a dequeue, so that if the CAS is successful (assuming that
> dequeue is the only way to get a reference to an item on the list,
> which is how lock-free queues should be anyway) you know that you are
> the only thread that has access to the node?
To get a reference to an item on the list, you're correct: you have to
dequeue it, and then it solely belongs to the thread that dequeued it and
there are no issues. That isn't the problem.
The problem is the nodes that make up the list and the references to
them. See below for an example.
>
> > Both problems can be solved by ensuring that Node A is neither recycled nor
> > deleted until all threads release their reference to it. All the lock-free
> > implementations I've found so far have been in managed languages such as
> > C# or Java for that reason, as the GC ensures exactly that.
>
> Again, not sure why multiple threads would have references to nodes still on the queue.
Well first off, the node declaration itself is simple:
template <typename T>
struct Node
{
    T data;
    Node* next;
};
Simple enough, nothing fancy.
Now the stack is initialized with a dummy head node that contains no
data and initially next = 0.
Now here is what happens. I'll prefix each line with the thread that's
executing it, such as T1, T2, etc. Code is simplified for clarity.
T1 wants to pop a node off the stack.
T1: Node* node;
T1: do {
T1: node = head->next;
Next, T1 gets preempted and it's T2's turn; it also wants to pop a node
off the stack.
T2: Node* node;
T2: do {
T2: node = head->next;
Both T1 and T2 are now referencing the same node and are both trying to
pop the same node off the stack.
T2: } while(!cas(head->next, node, node->next));
T2 is not preempted and succeeds with its CAS and now proceeds to delete
the node.
T2: delete node;
T2 is now done; control returns to T1, which is still pointing at the
very node that was just deleted.
T1: } while(!cas(head->next, node, node->next));
T1: CRASH!
And it doesn't matter whether this is running on one core or multiple cores;
the above will happen whenever multiple threads try to pop a node from the
same stack if access to the current node is not guarded to prevent
too-early deletion.
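For illustration, here is roughly how a published hazard pointer closes exactly this window in pop. This is a sketch using std::atomic, not the actual implementation: top corresponds to head->next in the walkthrough, my_hazard stands in for this thread's hazard-list entry, and RetireNode() for adding the node to the deferred-deletion list that the scan code processes later.

#include <atomic>   // plus the Node<T> struct from above

template <typename T>
bool Pop(std::atomic<Node<T>*>& top, std::atomic<Node<T>*>& my_hazard, T& out)
{
    Node<T>* node;
    do {
        do {
            node = top.load();
            if (!node) return false;            // stack is empty
            my_hazard.store(node);              // publish: "I'm looking at this node"
        } while (node != top.load());           // re-check top after publishing
    } while (!top.compare_exchange_weak(node, node->next));

    out = node->data;
    my_hazard.store((Node<T>*)0);               // done with the node
    RetireNode(node);                           // assumed helper: defer the delete until no hazard references it
    return true;
}

In the T1/T2 interleaving above, either T1 gets its hazard published and re-validated before T2 removes the node (in which case T2's cleanup sees the hazard and defers the delete), or the re-validation fails and T1 simply retries against the new top; either way T1 never runs its CAS through freed memory.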
Stephan
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-22 01:53:26
So, excuse me if I'm missing a lot; I haven't created any lock-free data
structures in about two years. But I suspect something like the
following might work. It holds contention a bit longer and will have to
spin more often, but I suspect there are also better ways to achieve
the same thing.
do
{
    node = head;
    if (!node) continue;                  // stack empty, retry
}
while (!node || !cas(head, node, 0));     // claim the head by swapping in 0
head = node->next;
retData = node->Data;
delete node;
This will also require at least one dummy node, but I don't
suppose that would be a huge problem.
Anyways, when I get access to my main computer tonight or tomorrow, I
will check to see how my current queue manages memory, which I think
is more optimal.
Nicholas "Indy" Ray
From: Stephan R. <ke...@so...> - 2009-04-22 03:12:16
On Tue, 2009-04-21 at 18:53 -0700, Nicholas "Indy" Ray wrote:
> So, excuse me if I'm missing a lot; I haven't created any lock-free data
> structures in about two years. But I suspect something like the
> following might work. It holds contention a bit longer and will have to
> spin more often, but I suspect there are also better ways to achieve
> the same thing.
>
> do
> {
>     node = head;
>     if (!node) continue;                  // stack empty, retry
> }
> while (!node || !cas(head, node, 0));     // claim the head by swapping in 0
>
> head = node->next;
>
> retData = node->Data;
>
> delete node;
Nope, it definitely works, however...
1. It's not lock-free; your CAS on the head to set it to 0 forms a
lock. :)
2. It prevents other threads from peeking at the top of the stack
without potentially blocking, whereas my version is not only lock-free but
also wait-free there. Though honestly I'm not sure how much of an issue that
is, as I'm not really sure how useful it is to peek at a stack/queue
being accessed by multiple threads; there's no guarantee that a following
pop/dequeue will return the peeked item.
3. One advantage that locking algorithms do have over lock-free is that
with the right kind of spinlock, thread starvation can be avoided by
guaranteeing that threads obtain their lock in the order they
requested it. The above lock makes no such guarantee, however.
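For reference, the kind of spinlock point 3 has in mind is usually a ticket lock. A minimal sketch, using GCC's __sync builtins and leaning on x86-ish memory ordering, so this is the idea rather than a portable implementation:

struct TicketLock
{
    volatile unsigned next_ticket;   // next ticket to hand out
    volatile unsigned now_serving;   // ticket currently allowed to hold the lock

    TicketLock() : next_ticket(0), now_serving(0) {}

    void lock()
    {
        // atomically take a ticket; __sync_fetch_and_add returns the previous value
        unsigned my_ticket = __sync_fetch_and_add(&next_ticket, 1);
        while (now_serving != my_ticket)
            ;                        // spin until it is our turn
    }

    void unlock()
    {
        __sync_fetch_and_add(&now_serving, 1);   // admit the next ticket in FIFO order
    }
};

Each thread is admitted strictly in the order it called lock(), which is the no-starvation guarantee; a plain test-and-set spinlock makes no such promise.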
>
> This will also require at least one dummy node, but I don't
> suppose that would be a huge problem.
>
> Anyways, when I get access to my main computer tonight or tomorrow, I
> will check to see how my current queue manages memory, which I think
> is more optimal.
That would be awesome; I'd be most interested in seeing that.
Stephan
From: <asy...@gm...> - 2009-04-22 05:42:48
You can look into the Win32 API functions InterlockedPopEntrySList and
InterlockedPushEntrySList.
They solve the ABA problem, and the deletion problem is partially fixed (they
can still do a read access to a node that has already been popped from the list,
so if you free the page containing that node you have a problem).
The x86 code is pretty simple. I didn't look into the x64 code yet.
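A minimal usage sketch (the Item struct and its contents are illustrative): entries must sit at MEMORY_ALLOCATION_ALIGNMENT, and putting the SLIST_ENTRY first lets the returned PSLIST_ENTRY be cast straight back to the item.

#include <windows.h>
#include <malloc.h>     // _aligned_malloc / _aligned_free

struct Item
{
    SLIST_ENTRY entry;  // must be the first member for the cast below
    int         value;
};

int main()
{
    SLIST_HEADER head;
    InitializeSListHead(&head);

    Item* it = (Item*)_aligned_malloc(sizeof(Item), MEMORY_ALLOCATION_ALIGNMENT);
    it->value = 42;
    InterlockedPushEntrySList(&head, &it->entry);

    PSLIST_ENTRY popped = InterlockedPopEntrySList(&head);  // NULL if the list is empty
    if (popped)
    {
        Item* item = (Item*)popped;   // valid because entry is the first member
        // use item->value ...
        _aligned_free(item);          // per the caveat above: only safe if no other
                                      // thread can still be touching this memory
    }
    return 0;
}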
Alexander.
--
Regards,
Alexander Karnakov
From: Marc B. R. <mar...@or...> - 2009-04-22 06:21:28
On 64-bit, another option would be to do pointer compression and then do a 64-bit CAS as per 32-bit.
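A sketch of that idea, assuming all nodes come from a single pre-allocated pool so a 32-bit index can stand in for the pointer while the other 32 bits hold the ABA tag. Names are illustrative and the CAS uses GCC's __sync builtin; nodes are assumed to be recycled through a free list rather than returned to the OS, so dereferencing a stale node can't fault and the bumped tag just makes the CAS fail.

#include <stdint.h>

template <typename T> struct Node { T data; Node* next; };

// 64-bit head word: 32-bit node index in the low half, 32-bit ABA tag in the high half.
template <typename T>
struct CompressedHead
{
    Node<T>*          pool;   // all nodes live in this array; index 0 is reserved as "null"
    volatile uint64_t head;   // (tag << 32) | index

    static uint64_t pack(uint32_t index, uint32_t tag) { return ((uint64_t)tag << 32) | index; }
    static uint32_t index_of(uint64_t h) { return (uint32_t)h; }
    static uint32_t tag_of(uint64_t h)   { return (uint32_t)(h >> 32); }

    Node<T>* pop()
    {
        uint64_t old_head, new_head;
        Node<T>* node;
        do {
            old_head = head;
            uint32_t idx = index_of(old_head);
            if (idx == 0) return 0;                            // empty
            node = &pool[idx];
            uint32_t next_idx = node->next ? (uint32_t)(node->next - pool) : 0;
            new_head = pack(next_idx, tag_of(old_head) + 1);   // bump the tag on every change
        } while (!__sync_bool_compare_and_swap(&head, old_head, new_head));
        return node;                                           // caller recycles it via the free list
    }
};

Push is symmetric: read head, point the node's next at the old index, CAS in pack(my_index, tag + 1). The tag bump is what defeats ABA even though only 32 bits of "pointer" are stored.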