Re: [Algorithms] General purpose task parallel threading approach
From: Jon W. <jw...@gm...> - 2009-04-04 23:10:39
Nicholas "Indy" Ray wrote:
> I can't speak on UltraSparc, but neither Larrabee nor GPUs use
> hyperthreading as the main form of threading (or AFAIK at all). They
> are both primarily data-parallel architectures.

The latest UltraSparc has, I think, eight cores, each with eight hyperthreads. When one core blocks on a cache miss, it automatically schedules the next hyperthread. Larrabee has almost exactly the same set-up: many hyperthreads that it switches between to hide data-access latencies. NVIDIA calls the same thing warps, I think (one warp is 32 threads -- get it? :-)

However, in software you don't get the same fine-grained benefit; not by a long shot. When you wait for something, you either wait on a different task using a user-mode synchronization primitive, or you wait on something that has to come from the kernel (I/O, an interrupt, etc). Those are the only two options, and neither lets you switch tasks quickly enough, or with little enough overhead, to be compared to hyperthreading IMO.

> Additionally, hyperthreading still requires threads to be designed in
> a way that is low on data contention. And if you already have worker
> tasks that can do that, it's generally best to create OS threads and
> just let them run on separate cores, hyperthreaded or not. But still
> ensure that the main thread does no waiting. In a well-designed
> system, thread switches are a non-issue, as they just shouldn't
> happen very often (or at all, in the case of consoles), so I don't
> think it's worthwhile worrying too much about the amount of time a
> thread switch takes.

I think the idea of a thread per subsystem (particles, skinning, collision, simulation, audio, scene graph issue, etc.) will only scale so far. Once you have 64-way CPUs (like the latest Sparcs), and even more with Larrabee (if you include the hyperthreads), making your game (or any software) go fast will mean a task-oriented workload rather than a subsystem-oriented workload.
As long as your tasks are significantly heavier than a context switch, you're doing fine. That's what the fibers-within-threads approach tries to optimize. There probably exists a real-world workload where the efficiency of fibers yields measurable throughput gains even though the overhead of fibers or threads is not dramatically large compared to the work itself, but I think that slice is pretty thin. It all comes down to Amdahl's Law in the end.

Sincerely,

jw