I wouldn’t call it (recent ATI) a scalar architecture, it’s more like a VLIW architecture, where you explicitly schedule the various slots.
I wrote that document, and either you’re reading it wrong or you misunderstood what Jon and Marco said.
The HD 2000 series is a scalar architecture, so it’s not necessary to vectorize code like you would on R520 or G70. However, it’s parallel rather than serial. The important thing is to parallelize the code so that each scalar instruction can be issued into a separate slot. Since vectorized code is by nature also parallel, it will run fast, so vectorized code is not a bad thing. But unlike on earlier generations, parallel code that’s not vectorized also runs fast, since all independent scalars can be computed in parallel.

What doesn’t run fast is serial scalar dependencies, since these lead to poor utilization of the shader cores. In average shaders this is not much of a problem, but you could construct pathological cases where utilization would be 1/5 of the maximum throughput.

To parallelize your code, it’s recommended that you use parentheses to break up long expressions, since HLSL evaluates expressions from left to right, just like C/C++. So A+B+C+D is computed as ((A+B)+C)+D, which is three dependent adds, whereas in (A+B)+(C+D) the two inner adds are independent and can issue together, so it could run in one instruction less (assuming all are scalars).
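To make the counting concrete, here is a rough toy model in Python (not HLSL, and not how the real shader compiler schedules anything). It only counts adds and the longest chain of dependent adds for the two groupings. The assumptions are mine: every add costs one slot, an instruction has 5 slots (inferred from the 1/5 figure), and the names `depth`, `count_adds` and `utilization` are made up for illustration.

```python
# Toy model: a leaf is a string (a scalar input), an add is a tuple (left, right).
# Assumptions (not from any real compiler): each add uses one slot, and one
# instruction can hold up to SLOTS independent scalar operations.
SLOTS = 5


def depth(expr):
    """Longest chain of dependent adds; a lower bound on instructions needed."""
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))


def count_adds(expr):
    """Total number of adds in the expression."""
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + count_adds(left) + count_adds(right)


def utilization(expr):
    """Fraction of slots doing work if the expression is all there is to run."""
    return count_adds(expr) / (depth(expr) * SLOTS)


serial = ((("A", "B"), "C"), "D")   # A+B+C+D, i.e. ((A+B)+C)+D
paired = (("A", "B"), ("C", "D"))   # (A+B)+(C+D)

for name, expr in (("serial", serial), ("paired", paired)):
    print(name, "adds:", count_adds(expr),
          "dependent steps:", depth(expr),
          "utilization:", utilization(expr))
```

Both forms do three adds, but the serial form needs three dependent steps and the paired form only two. In this model a purely serial chain fills one slot out of five per instruction, which is where the 1/5 worst case comes from. In a real shader the other slots would usually be filled by unrelated work, so treat the utilization number as the pathological bound, not a prediction.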