Thread: Re: [Algorithms] General purpose task parallel threading approach (Page 12)
|
From: Tony C. <to...@mi...> - 2009-04-13 18:37:25
|
" that any parallel loop with a static call graph could be turned into a 16 wide vector version mechanically." "Vector complete" is also a more or less irrelevant property for real world problems, since the "static call graph" constraint is typically only satisfied by trivial code. Simple shader type things meet the requirement, complex ones with conditionals don't, and good luck with anything resembling game logic. Don't get me wrong, I think LRB is doing some really interesting things, but it's by no means anything like a silver bullet. - Tony From: Sebastian Sylvan [mailto:seb...@gm...] Sent: Monday, April 13, 2009 10:38 AM To: Game Development Algorithms Subject: Re: [Algorithms] General purpose task parallel threading approach The cool thing about LRB is that it's what they call "vector complete", meaning it has all of the machinery (scatter/gather etc.) that any parallel loop with a static call graph could be turned into a 16 wide vector version mechanically. This should hopefully allow the compiler to do most of this work for us if we just mark up the loops we think ought to be data parallel using some pragma. On Mon, Apr 13, 2009 at 6:28 PM, Adrian Bentley <ad...@gm...<mailto:ad...@gm...>> wrote: Daunting to be sure. You know... this would be a great place for a good macro system. Commence language war! Cheers, Adrian |
|
From: Sebastian S. <seb...@gm...> - 2009-04-13 19:00:36
|
On Mon, Apr 13, 2009 at 7:37 PM, Tony Cox <to...@mi...> wrote:
> "Vector complete" is also a more or less irrelevant property for real world
> problems, since the "static call graph" constraint is typically only
> satisfied by trivial code. Simple shader type things meet the requirement,
> complex ones with conditionals don't, and good luck with anything resembling
> game logic.
>
> Don't get me wrong, I think LRB is doing some really interesting things,
> but it's by no means anything like a silver bullet.

Yeah, definitely. You'll probably only find suitable data parallelism very close to the leaves of the call graph, not in the big bulky middle, so you clearly need an efficient general-purpose CPU to run most of the code. It is a lot more flexible than we're used to at dealing with those rare data-parallel bits, though. Of course, it remains unknown how it will stack up against whatever Nvidia/AMD will have out at the time of its release, as they're both getting more and more general purpose too. Whatever happens, I'm sure it'll be an interesting couple of years ahead!

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862 |
|
From: Jarkko L. <al...@gm...> - 2009-04-14 10:25:32
|
Personally, I try to do allocations from the stack whenever possible to avoid memory fragmentation issues, so a 2-4 KB stack sounds quite small for my purposes at least.

Cheers,
Jarkko

From: asy...@gm... [mailto:asy...@gm...]
Sent: Tuesday, April 14, 2009 1:41 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

So far the experience I (and a few others) have says that 2 to 4 kilobytes is a size which is enough to handle 95% of all cases. You could use the same memory pool for tasks of different sizes (a bigger size will result in fewer tasks in flight); for example, 2 contiguous pieces of 4k memory could be used to create an 8k task.

Alexander

2009/4/14 Nicholas "Indy" Ray <ar...@gm...>
On Mon, Apr 13, 2009 at 2:49 PM, Adrian Bentley <ad...@gm...> wrote:
> 4 MB feels like a lot given console memory limits, but allowing 2000
> tasks in flight sounds pedantic. This mechanism can't support SPUs,
> right?

Also keep in mind that a 2 KB stack is likely to get overflowed REALLY quickly with any real operations going on.

Nicholas "Indy" Ray

------------------------------------------------------------------------------
This SF.net email is sponsored by:
High Quality Requirements in a Collaborative Environment.
Download a free trial of Rational Requirements Composer Now!
http://p.sf.net/sfu/www-ibm-com
_______________________________________________
GDAlgorithms-list mailing list
GDA...@li...
https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
Archives: http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list

--
Regards,
Alexander Karnakov |
|
From: <asy...@gm...> - 2009-04-14 12:51:52
|
> I try to do allocations from the stack whenever possible to avoid memory
> fragmentation issues

You can temporarily convert a task to be executed in the thread's context. I could also provide a function to access the worker thread's stack for temporary computations.

Alexander.

2009/4/14 Jarkko Lempiainen <al...@gm...>
> Personally, I try to do allocations from the stack whenever possible to
> avoid memory fragmentation issues, so a 2-4kb stack sounds quite small for
> my purposes at least.
>
> Cheers, Jarkko

--
Regards,
Alexander Karnakov |
|
From: Jon W. <jw...@gm...> - 2009-04-14 19:41:09
|
Sebastian Sylvan wrote:
> That's not the same as saying that flexibility isn't desirable,
> though. What I said was that if you evaluate a certain system you
> would put "flexible" and "general" in the "pros" column, and
> conversely you would put "rigid" and "specialized" in the "cons"
> column. There may be a whole host of other things going into the two
> columns which guide you to your final conclusion, but that doesn't
> mean that flexibility in itself isn't a desirable trait, nor that
> restrictions/rigidity is.

That's the "guns don't kill people" argument. It turns out that, in real life, you never get "generality" or "flexibility" without a cost, and often, those costs and benefits are evaluated by someone who has a very different set of goals than you do.

For example, the C++ STL is a very flexible, very general set of containers and algorithms. However, a number of assumptions made in the construction of that library make it, or its implementations, very poor in many circumstances. For example, various STL implementations use static buffers for intermediary storage, so you can't even use the containers in threaded code! As another example, the allocators involved often make different assumptions about memory than what will be true for applications that manage their own memory, such as games, embedded systems, kernels and database servers (among other examples).

It turns out that the standard C library, with malloc(), fread() and memcpy(), is actually a lot easier to re-use across a wider range of applications, even though it is, by pretty much all measures of the words, less flexible, and less general, than the STL.

To tie it back to asynclib: asynclib is written by a transaction server programmer, who solved a transaction server problem, and is now trying to expand it to solve more kinds of problems. However, there are some classes of problems (such as, in my argument, micro-tasks for games) where the fundamental assumptions are so different that it doesn't make sense to apply a "one size fits all" component to solve that problem. Smaller, more well-defined components are almost always a better match, as long as you can find those smaller, more well-defined components that match your needs.

Sincerely,

jw |
|
From: Sebastian S. <seb...@gm...> - 2009-04-14 20:04:30
|
On Tue, Apr 14, 2009 at 8:40 PM, Jon Watte <jw...@gm...> wrote:
> That's the "guns don't kill people" argument. It turns out that, in real
> life, you never get "generality" or "flexibility" without a cost, and
> often, those costs and benefits are evaluated by someone who has a very
> different set of goals than you do.

I didn't say that; I said they were data points worth considering. Also, "never" is a very strong word to use: there are plenty of cases where you can arbitrarily restrict an API for no real benefit because you just didn't think of a way of doing it better when you first wrote it. There are other cases where you restrict an API because you think it will buy you something, but then it turns out that the rigidity actually hurts you because people have to work around the restrictions in ways which end up costing more than the imagined benefits.

> To tie it back to asynclib: Asynclib is written by a transaction server
> programmer, who solved a transaction server problem, and is now trying
> to expand it to solve more kinds of problems. However, there are some
> classes of problems (such as, in my argument, micro-tasks for games)
> where the fundamental assumptions are so different that it doesn't make
> sense to apply a "one size fits all" component to solve that problem.
> Smaller, more well defined components are almost always a better match,
> as long as you can find those smaller, more well defined components that
> match your needs.

Well, I certainly disagree with that, and I don't think you've demonstrated the win, but I think we've had that argument to death already so there's no point going over it again.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862 |
|
From: <asy...@gm...> - 2009-04-14 20:20:19
|
> and is now trying to expand it to solve more kinds of problems. However,
> there are some classes of problems (such as, in my argument, micro-tasks
> for games) where the fundamental assumptions are so different that it
> doesn't make sense to apply a "one size fits all" component

If I understood you correctly, you're saying that it is not really going to work for games (because it works for client-server apps?), but I don't really see why it can't succeed. I think I've explained pretty clearly how to achieve the specific details people were wondering about, but at this point, after 170 messages posted under this thread, I still don't have a clear picture of what I need to change to make asynclib an attractive threading library for games. Hopefully I didn't miss anything, but if I did, please point me at it again.

Alexander.

2009/4/14 Jon Watte <jw...@gm...>
> That's the "guns don't kill people" argument. ...
> Smaller, more well defined components are almost always a better match,
> as long as you can find those smaller, more well defined components that
> match your needs.
>
> Sincerely,
>
> jw

--
Regards,
Alexander Karnakov |
|
From: Jon W. <jw...@gm...> - 2009-04-14 20:37:20
|
asy...@gm... wrote:
> I still didn't get a clear picture of what I need to change to make
> asynclib an attractive threading library for games.

As seen from my point of view, the two top-level changes would be:

1) Don't allocate a stack per task. Instead, allocate a stack per worker thread. (The bonus is that these stacks can be large enough to be useful, without too much overhead.)

2) Let each task have a dependency tree, where a task will automatically become runnable when all its dependencies are complete (and a typical dependency will be the completion of some other task).

Now, if you really want to support arbitrary blocking within the execution of a micro-task (which I feel is a mistake), then you can make it so that you allocate a new stack for the next task only when some stack is "used" because there is a blocking task.

Sincerely,

jw |
|
From: <asy...@gm...> - 2009-04-14 21:42:45
|
Ok, regarding point 2, doesn't the following pseudo-code do what you want?

void TaskWhichWaitsForDepsToComplete()
{
    async::Task *TaskA, *TaskB, *TaskC;
    // ... do some work here
    async::Execute(TaskAProc, &TaskA, ...);
    // ... do some more work here
    async::Execute(TaskBProc, &TaskB, ...);
    // ... do some more work here
    async::Execute(TaskCProc, &TaskC, ...);
    // ... do even more work here
    // This call makes the task wait for tasks A, B and C. Optionally it can
    // either help out worker threads while waiting, or it can simply block.
    async::WaitAll(*TaskA, *TaskB, *TaskC);
    TaskA->Release(); TaskB->Release(); TaskC->Release();
    // ... finally do some work at the end of the task
    return; // done!
}

void main()
{
    async::Task* TheTask;
    async::Execute(&TaskWhichWaitsForDepsToComplete, &TheTask, ...);
    async::Wait(*TheTask);
    TheTask->Release();
}

You can additionally put any other primitive into async::Wait, and you can put as many of those as you want. As an example, you can even achieve point 2 in a different way:

void DependeeTask(void* pSemaphore)
{
    // ... do work
    ((async::Semaphore*) pSemaphore)->Release();
}

void MainTask()
{
    int nDependeesToSpawn = xxx;
    // Create a semaphore on the stack; it is actually just 8 bytes,
    // zero-initialized.
    async::Semaphore DependeesSemaphore;
    for (int i = 0; i < nDependeesToSpawn; ++i)
        async::Execute(&DependeeTask, &DependeesSemaphore, ...);
    // Waits until all tasks release the semaphore, which happens at the
    // end of each task.
    DependeesSemaphore.Acquire(nDependeesToSpawn);
}

Now regarding point 1:

void ThisTaskWillBeExecutedInsideWorkerThread()
{
    printf("hello");
}

void TaskWhichExecutesCodeInsideTheWorkerThreadStack(void* pUserData)
{
    // Assuming that pUserData holds a pointer to what actually needs to be
    // executed. ExecuteAsThread runs a function on the stack of the worker
    // thread -- this is a veeery lightweight function (check with the
    // debugger if you don't believe me).
    async::ExecuteAsThread((void* (*)(void*)) pUserData);
}

void main()
{
    async::Task* TheTask;
    async::Execute(&TaskWhichExecutesCodeInsideTheWorkerThreadStack,
                   &ThisTaskWillBeExecutedInsideWorkerThread, // user data pointer
                   &TheTask,
                   ...,
                   async::MINIMAL_TASK_SIZE); // 32 bytes for the task structure
    async::Wait(*TheTask);
    TheTask->Release();
}

This code does exactly what you require under point 1. I can create an option which will not require you to create a wrapper function (which is TaskWhichExecutesCodeInsideTheWorkerThreadStack in this example).

Would that solution suit you for both points 1 and 2?
Alexander.
2009/4/14 Jon Watte <jw...@gm...>
> asy...@gm... wrote:
> > I still didn't get a clear picture of what I need to change to make
> > asynclib an attractive threading library for games.
>
> As seen from my point of view, the two top-level changes would be:
>
> 1) Don't allocate a stack per task. Instead, allocate a stack per worker
> thread. (The bonus is that these stacks can be large enough to be
> useful, without too much overhead)
>
> 2) Let each task have a dependency tree, where a task will automatically
> become runnable when all its dependencies are complete (and a typical
> dependency will be the completion of some other task).
>
> Now, if you really want to support arbitrary blocking within the
> execution of a micro-task (which I feel is a mistake), then you can make
> it so that you allocate a new stack for the next task only when some
> stack is "used" because there is a blocking task.
>
> Sincerely,
>
> jw
>
--
Regards,
Alexander Karnakov
|
|
From: Nicholas \Indy\ R. <ar...@gm...> - 2009-04-14 22:08:52
|
On Tue, Apr 14, 2009 at 2:42 PM, <asy...@gm...> wrote:
> Now regarding point 1:

The problem, Alexander, is that once you address point one, you are disposing of the vast majority of the functionality that your library provides, and at that point, we have a lot of extra runtime sitting there being useless. I think this falls into the case of a component being too general for the job.

Indy |
|
From: <asy...@gm...> - 2009-04-14 22:26:45
|
> The problem, Alexander, is that once you address point one, you are
> disposing of the vast majority of the functionality that your
> library provides

That's not really true; you can still execute other tasks the same way in a recursive manner (but keep away from dependencies on anything up the worker thread's stack). Besides, even if it were fully true, then "the vast majority of the functionality" is actually ~2k of code which you never execute. Is this a "generalization" overhead you can't afford?

Alexander.

2009/4/15 Nicholas "Indy" Ray <ar...@gm...>
> The problem, Alexander, is that once you address point one, you are
> disposing of the vast majority of the functionality that your library
> provides, and at that point, we have a lot of extra runtime sitting
> there being useless. I think this falls into the case of a component
> being too general for the job.
>
> Indy

--
Regards,
Alexander Karnakov |
|
From: Nicholas \Indy\ R. <ar...@gm...> - 2009-04-15 00:37:06
|
On Tue, Apr 14, 2009 at 3:26 PM, <asy...@gm...> wrote:
> That's not really true, you can still execute other tasks the same way in a
> recursive manner (but keep away from dependencies on something up the worker
> thread's stack).

I'm sorry, that statement makes little sense to me.

> Besides, even if it were fully true, then "the vast majority of the
> functionality" is actually ~2k of code which you never execute; is this a
> "generalization" overhead you can't afford?

~2k LOC has a tendency to muck up performance, among other things. But alas, when I have to include 2k useless LOC for functionality better provided by a specific library that takes only a few hours to write/debug/test, I'd always go for the latter, so yes, that is *certainly* a generalization overhead I cannot afford.

Indy |
|
From: <asy...@gm...> - 2009-04-15 05:50:20
|
>> That's not really true, you can still execute other tasks the same way in a
>> recursive manner (but keep away from dependencies on something up the worker
>> thread's stack).
>
> I'm sorry, that statement makes little sense to me.

When a task gets blocked by a dependency, the worker thread can start executing other tasks recursively.

> ~2k LOC has a tendency to muck up performance, among other things.
> But alas, when I have to include 2k useless LOC for functionality
> better provided by a specific library that takes only a few hours to
> write/debug/test, I'd always go for the latter, so yes, that is
> *certainly* a generalization overhead I cannot afford.

I could always provide a special build without the stuff you are not going to use. But if you think that the rest of the functionality would really take you only a couple of hours to write/debug/test, then of course you'd better go with your own solution.

Alexander.

--
Regards,
Alexander Karnakov |
|
From: Jon W. <jw...@gm...> - 2009-04-15 15:17:22
|
asy...@gm... wrote:
> When a task gets blocked by a dependency, worker thread can start
> executing other tasks recursively.

So, in my world, I know what dependencies a task has before that task is started. Thus, I can prove that the task will not block, and thus no recursion (or additional stacks) is necessary. The trade-off is that new dependencies can be introduced, but not synchronized, until after the current task invocation ends. The benefit is that scheduling can be a lot smarter, and the requirements for running a task are known and can be bounded (very important in embedded and otherwise limited environments).

And, because tasks do not block while executing, this also means that there are no extra context switches -- each task gets switched in; that's the extent of switching. Thus, slightly less pressure on the context switcher (but optimizing switching is still worthwhile, up to a point).

Sincerely,

jw |
|
From: <asy...@gm...> - 2009-04-15 18:08:47
|
> So, in my world, I know what dependencies a task has before that task is
> started. ... The trade-off is that new dependencies can be introduced,
> but not synchronized

That's an ok tradeoff for your project, but for someone else these restrictions may seem too exotic because a problem might be more dynamic in nature. What I try to do is to provide enough possibilities for you guys to achieve what you want with less effort. What you say, Jon, could be achieved if you'd simply call async::Wait not inside the task but outside it. You could wait not directly on a task, but on a proxy synchronization object. You could make a lot fewer synchronization points (relative to a task) very far outside the tasks you work with, and resolve the dependency for a bigger number of tasks, etc. As I've shown with point 2, you can do the same thing in many ways, where each way gives you slightly different possibilities. Of course, you'll probably want to wrap some repetitive actions into a set of further primitives or functions to ease programming for your version of the multithreading model which you view as the best approach to your specific problem.

I actually have a question regarding point 1 (where you say a task needs to be executed inside the thread's stack and nowhere else). I think I know what you want to say here:
a. you need to have a growable stack
b. you need the cpu to access the same piece of memory from task to task
As I can see now, the only thing I can't provide is 'a.'. I'll work on 'b.' together with a new optimized scheduler which will reuse the same memory for tasks as much as possible to prevent cache thrashing on a particular cpu. Point a. is an OS feature which I can't provide in usermode, though I do provide a way to handle 3rd party library calls for which you don't know the stack requirements, and in the future I'll try to provide some automation for code running in a small-stack environment.

The whole purpose of the asynclib idea was to simplify programming. Anything asynclib does, you can do manually; for example, a sync point in the middle of a task can be achieved without asynclib by splitting the task into 2 functions - one does the work before the sync point, another does it after - and registering the 2nd part in some interlocked queue which will be traversed by some other code. Task suspension can be achieved by saving the data you need to restore in some structure and restoring it later, and so on. Can you squeeze out a bit more performance? Of course you can; since you have all the required information, you could save an instruction here and there. You can do all that, but for that you need to write code and spend programmer's time on it, and horrify the code by introducing a quadrillion callbacks, completion functions, etc., whereas I try to provide a good-enough solution which handles all that automatically. It seems like the only thing I can't handle is stack growing; any other aspect of behaviour is achievable one way or another with very acceptable performance imho (yet I've noted the performance advice you mentioned and will work on it).

Alexander.

2009/4/15 Jon Watte <jw...@gm...>
> So, in my world, I know what dependencies a task has before that task is
> started. Thus, I can prove that the task will not block, and thus no
> recursion (or additional stacks) is necessary. ...
>
> Sincerely,
>
> jw

--
Regards,
Alexander Karnakov |
|
From: Nicholas \Indy\ R. <ar...@gm...> - 2009-04-15 06:24:23
|
On 4/14/09, asy...@gm... <asy...@gm...> wrote:
> When a task gets blocked by a dependency, worker thread can start executing
> other tasks recursively.

But if you are doing a single stack per worker thread, you can't actually swap tasks, as the stack is already being used.

Indy |
|
From: <asy...@gm...> - 2009-04-15 06:46:11
|
A task is not swapped in this case; it just stays on the worker thread's stack, and the worker thread recursively starts doing the next task. When the dependee (is this the right word?) task is done, the dependent task will get resumed once the worker thread finishes the tasks it started working on during the wait period. This works fine for dependencies which are structured as a tree.

Alexander.

2009/4/15 Nicholas "Indy" Ray <ar...@gm...>
> But if you are doing a single stack per worker thread, you can't
> actually swap tasks, as the stack is already being used.
>
> Indy

--
Regards,
Alexander Karnakov |
|
From: Gregory J. <gj...@da...> - 2009-04-15 18:46:47
|
Honestly, Alexander, at this point you are largely re-implementing TBB. Not that this by itself should give you pause, but (and maybe this is lost on me over the past 180 or so posts), what are you offering that TBB doesn't? Given that a library such as asynclib will be considered by those starting a new project, and not likely by those already using a particular solution (whether commercial or in-house), what is(are) asynclib's differentiating factor(s)?

On a side note, if you want to know what would make asynclib more appealing in general, I'd recommend spending more time in the TBB forums, where plenty of feedback on this topic is available.

Regards
Greg

From: asy...@gm... [mailto:asy...@gm...] Sent: Wednesday, April 15, 2009 11:09 AM To: Game Development Algorithms Subject: Re: [Algorithms] General purpose task parallel threading approach

>So, in my world, I know what dependencies a task has before that task is started.
>The trade-off is that new dependencies can be introduced, but not synchronized

That's an OK tradeoff for your project, but for someone else these restrictions may seem too exotic, because a problem might be more dynamic in nature. What I try to do is to provide enough possibilities for you guys to achieve what you want with less effort. What you say, Jon, could be achieved if you'd simply call async::Wait not inside the task but outside it. You could wait not directly on a task, but on a proxy synchronization object. You could place far fewer synchronization points (relative to a task) well outside the tasks you work with, and resolve the dependency for a larger number of tasks, etc. As I said with point 2, you can do the same thing in many ways, where each way gives you slightly different possibilities. Of course you'll probably want to wrap some repetitive actions into a set of other primitives or functions to ease programming for your version of the multithreading model, which you view as the best approach to your specific problem.

I actually have a question regarding point 1 (where you say a task needs to be executed inside the thread's stack and nowhere else). I think I know what you want to say here:
a. - you need to have a growable stack
b. - you need the CPU to access the same piece of memory from task to task.

As I can see now, the only thing I can't provide is 'a'. I'll work on 'b' together with a new, optimized scheduler which will reuse the same memory for tasks as much as possible to prevent cache thrashing on a particular CPU. Point 'a' is an OS feature which I can't provide in usermode, though I do provide a way to handle 3rd-party library calls whose stack requirements you don't know, and in the future I'll try to provide some automation for code running in a small-stack environment.

The whole purpose of the asynclib idea was to simplify programming. Anything asynclib does, you can do manually. For example, a sync point in the middle of a task can be achieved without asynclib by splitting the task into 2 functions - one does the work before the sync point, the other does it after - and registering the 2nd part in some interlocked queue which will be traversed by some other code. Task suspension can be achieved by saving the data you need to restore in some structure and restoring it later, and so on. Can you squeeze out a bit more performance? Of course you can; since you have all the required information, you could save an instruction here and there. You can do all that, but then you need to write code and spend programmer time on it, and horrify the code by introducing a quadrillion callbacks, completion functions, etc., whereas I try to provide a good-enough solution which handles all that automatically. It seems the only thing I can't handle is stack growing; any other aspect of behaviour is achievable one way or another with very acceptable performance imho (and I've noted the performance advice you mentioned and will work on it).

Alexander.

2009/4/15 Jon Watte <jw...@gm...> asy...@gm... wrote: > When a task gets blocked by a dependency, worker thread can start > executing other tasks recursively. > So, in my world, I know what dependencies a task has before that task is started. Thus, I can prove that the task will not block, and thus no recursion (or additional stacks) is necessary. The trade-off is that new dependencies can be introduced, but not synchronized, until after the current task invocation ends. The benefit is that scheduling can be a lot smarter, and the requirements for running a task are known and can be bounded (very important in embedded and otherwise limited environments). And, because tasks do not block while executing, this also means that there are no extra context switches -- each task gets switched in; that's the extent of switching. Thus, slightly less pressure on the context switcher (but optimizing switching is still worthwhile, up to a point). Sincerely, jw -- Regards, Alexander Karnakov |
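The manual alternative Alexander describes — splitting a task at a sync point into two functions and registering the second half in a queue to run once the dependency completes — looks roughly like this. A single-threaded sketch: the plain deque standing in for an "interlocked queue", and all the names, are illustrative only.

```python
from collections import deque

continuations = deque()   # stand-in for the interlocked continuation queue
results = []

def task_part1():
    # Work done before the sync point.
    results.append("before sync point")
    # Instead of suspending here, register the second half as a
    # continuation to be run after the dependency is satisfied.
    continuations.append(task_part2)

def task_part2():
    # Work done after the sync point.
    results.append("after sync point")

task_part1()
# ...later, once the dependency completes, some other code drains the queue:
while continuations:
    continuations.popleft()()

print(results)  # ['before sync point', 'after sync point']
```

This is the "quadrillion of callbacks" style he is arguing against: it avoids any stack-switching machinery, but every sync point forces a function split and moves local state out of the stack frame into explicit structures.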
|
From: <asy...@gm...> - 2009-04-15 19:37:51
|
>what are you offering that TBB doesn't?

A task suspension mechanism. Afaik TBB doesn't provide that.

Alexander.

2009/4/15 Gregory Junker <gj...@da...>
> Honestly, Alexander, at this point you largely are re-implementing TBB.
> Not that this by itself should give you pause, but (and maybe this is lost
> on me over the past 180 or so posts), what are you offering that TBB
> doesn't? Given that a library such as asynclib will be considered by those
> starting a new project, and not likely by those already using a particular
> solution (whether commercial or in-house), what is(are) asynclib's
> differentiating factor(s)?
>
> On a side note, if you want to know what would make asynclib more appealing
> in general, I'd recommend spending more time in the TBB forums, where plenty
> of feedback on this topic is available.
>
> Regards
> Greg
-- Regards, Alexander Karnakov |