Well, the shader cores are scalar units, not vector units, so calling it a scalar architecture makes perfect sense, although VLIW would also be a valid description of it, if you prefer that. Personally I’m no big fan of the “VLIW” nomenclature, since I suppose it’s rather subjective at which point an instruction word becomes “very long”, and it says little about the flexibility of the underlying hardware. I wouldn’t be surprised if the G80’s instruction words are “very long” too.




From: gdalgorithms-list-bounces@lists.sourceforge.net [mailto:gdalgorithms-list-bounces@lists.sourceforge.net] On Behalf Of Peter-Pike Sloan
Sent: Wednesday, February 13, 2008 12:22 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] Dummie Matrix math questions


I wouldn’t call it (recent ATI) a scalar architecture; it’s more like a VLIW architecture, where you explicitly schedule the various slots.




From: gdalgorithms-list-bounces@lists.sourceforge.net [mailto:gdalgorithms-list-bounces@lists.sourceforge.net] On Behalf Of Emil Persson
Sent: Tuesday, February 12, 2008 1:37 PM
To: 'Game Development Algorithms'
Subject: Re: [Algorithms] Dummie Matrix math questions


I wrote that document, and either you’re reading it wrong or you misunderstood what Jon and Marco said.

The HD 2000 series is a scalar architecture, so it’s not necessary to vectorize code like you would on R520 or G70. However, it’s parallel rather than serial: the important thing is to parallelize the code so that each scalar instruction can be issued into a separate slot. Since vectorized code is by nature also parallel, it will run fast, so vectorized code is not a bad thing. But unlike on earlier generations, parallel code that’s not vectorized also runs fast, since all independent scalars can be computed in parallel.

What doesn’t run fast is a chain of serial scalar dependencies, since these lead to poor utilization of the shader cores. In average shaders this is not much of a problem, but you could construct pathological cases where utilization would be 1/5 of the maximum throughput.

To parallelize your code, it’s recommended that you use parentheses to break up long expressions, since HLSL evaluates expressions from left to right, just like C/C++. So A+B+C+D is computed as ((A+B)+C)+D, whereas (A+B)+(C+D) could run in one instruction less (assuming all are scalars), because A+B and C+D don’t depend on each other and can be issued together.
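
As a rough sketch of the A+B+C+D point, here is the same pair of expressions written in plain C rather than HLSL. The function names are made up for illustration, and a CPU compiler obviously won’t schedule these into GPU slots; the only thing to look at is the shape of the dependency chain in each version.

```c
#include <assert.h>

/* Hypothetical helper names; plain C stands in for HLSL here. */

/* a + b + c + d parses as ((a + b) + c) + d. Each add needs the result
   of the previous one, so the dependency chain is three adds deep. */
float sum_serial(float a, float b, float c, float d)
{
    return a + b + c + d;
}

/* (a + b) and (c + d) do not depend on each other, so hardware with
   several scalar slots could issue them together. The chain is then
   only two adds deep: the two partial sums, followed by the final add. */
float sum_paired(float a, float b, float c, float d)
{
    return (a + b) + (c + d);
}

/* Sanity check using small integers, which float represents exactly,
   so both groupings give the same answer for these inputs. */
void check_groupings_agree(void)
{
    assert(sum_serial(1.0f, 2.0f, 3.0f, 4.0f) ==
           sum_paired(1.0f, 2.0f, 3.0f, 4.0f));
}
```

Note that regrouping floating-point additions can change the rounding of the result for arbitrary inputs, which is presumably one reason to write the parentheses yourself rather than expect the compiler to regroup for you.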





From: gdalgorithms-list-bounces@lists.sourceforge.net [mailto:gdalgorithms-list-bounces@lists.sourceforge.net] On Behalf Of Jesús de Santos García
Sent: Tuesday, February 12, 2008 2:47 PM
To: Game Development Algorithms
Subject: Re: [Algorithms] Dummie Matrix math questions


The same is true for ATI too, as described in the ATI HD2000 Programming Guide. Non-vectorized code is optimized, and it is recommended for cases where vector instructions are not needed.


Good to know that NVIDIA and Intel are using the same architecture.

On Feb 8, 2008 7:34 PM, Marco Salvi <marcotti@gmail.com> wrote:


On Feb 8, 2008 9:42 AM, Jon Watte <hplus@mindcontrol.org> wrote:


I'm told that the implementation of modern 4-way SIMD on graphics cards
is now done as serial instructions, so there's little benefit to be had
compared to just coding it out.

This is true for NVIDIA (G8x architecture) and the latest Intel GPUs, not for AMD/ATI GPUs.
But yes, I guess everyone will sooner or later move to the same model.

