From: Charles Bloom <cbloom@cb...>  2000-11-19 21:57:36

I've been doing some experiments on triangle rate and thought I'd report them. These numbers were all taken on a GF2 with DX8. For calibration, NVidia's BenMark5 reports 20 MThz (million triangles per second) on my machine (more on this later).

I made a rectangle of 64x64 tiles. That's 64*64*2 = 8k tris and 65*65 = 4225 verts. I rendered it 64 times per frame, for 512k tris per frame. Details like screen resolution, z-buffer, whether a texture was used, whether a directional light was used, etc. made no difference, because I wasn't pushing the rasterizer heavily at all.

I organized the triangles of the rectangle in six ways. I made three versions with different triangle orders:

A) random triangle order
B) row-stripped triangle order
C) "optimal" stripped order

For each of those I made one version using triangle strips and one using triangle lists.

Now for an important point: I'm counting *actual* triangle rate; that is, not counting the degenerate triangles in a strip that are needed to join up runs. Counting actual triangles, I saw *ZERO* difference between strips and lists (within the error of the sample, which is about 0.1 MThz). The speeds I saw for actual triangles were:

A) 4.5 MThz
B) 8.7 MThz
C) 17.4 MThz

Now, this top speed seems a bit low compared to the BenMark5 result, but actually it's not. Why? Because NVidia's BenMark5 *counts degenerate triangles*. If I count the degenerate triangles too (that is, by counting num triangles = num strip indices - 2), then my triangle rate for C is 23 MThz (with strips, of course), with a claimed 686k tris per frame (actually 512k).

I believe this is why strips are (falsely) believed to be superior on GeForce2. In fact, the only difference appears to be the possibility of reducing the memory use and bus transfer of indices, but of course that's minuscule compared to the vertices (though still worthwhile if no other factors influence your lists vs. strips decision).

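To make the "claimed vs. actual" distinction concrete, here is a minimal Python sketch of the two ways of counting triangles in a strip. The function name and the small example strip are hypothetical illustrations, not taken from BenMark5 or from the test program described above; the only figures borrowed from the post are the 17.4 MThz rate and the 686k/512k per-frame counts.

```python
def strip_triangle_counts(indices):
    """Return (claimed, actual) triangle counts for one triangle strip.

    claimed: every 3-index window counts, i.e. len(indices) - 2.
    actual:  only windows whose three indices are all distinct.
    """
    claimed = max(len(indices) - 2, 0)
    actual = sum(
        1
        for a, b, c in zip(indices, indices[1:], indices[2:])
        if a != b and b != c and a != c
    )
    return claimed, actual


# Hypothetical example: two rows of a 2-tile-wide grid (3x3 verts, numbered
# row by row), joined by repeating the last index of the first run and the
# first index of the second run.
strip = [0, 3, 1, 4, 2, 5,   5, 3,   3, 6, 4, 7, 5, 8]
claimed, actual = strip_triangle_counts(strip)
print(claimed, actual)

# Rescaling the measured case C rate by the per-frame counts quoted above.
actual_rate = 17.4  # MThz
print(actual_rate * 686 / 512)
```

In this toy strip the two joining indices produce four degenerate windows, so 12 triangles are claimed for 8 real ones. Rescaling 17.4 MThz by 686/512 gives roughly 23.3, in line with the 23 MThz figure above.
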
I believe that 17 MThz is the best "non-degenerate triangle" rate of the GF2. Note that if I run BenMark5 with triangle lists, I get 17 MThz.

We can see how the triangle rates are determined just by vertex cache behaviour. Case "B" is simple strip-order caching; that is, you must send one vert per triangle. This gives us our vertex-rate baseline, which is 8.5 MVhz (million verts per second). Case "C" is optimal caching. In optimal caching you send 0.5 verts per triangle (0.5 is the theoretical minimum; that is, you're only sending each vert once). Thus C should have double the triangle rate of B, and indeed it does. Finally, A is just a nasty triangle order. I was somewhat surprised that the triangle rate wasn't even lower, but apparently about half the time the triangles get "strip quality" caching (just send one vert) and the other half of the time we send all three verts, so we average about two verts transmitted per triangle, hence about half the triangle rate of case B.

Now a brief note on how the "optimal" version was made. Case B was just listing the triangles in row order, that is, row 1, row 2, etc. Case C was made by listing rows of columnar chunks. That is, I took a column 6 tiles wide and made a row-row-row strip of it, then moved to the next column (the last column was then just 4 tiles wide). I chose 6 because it's the widest column you can use without overflowing the cache on the GF2. When you lay down each row of a column, you add 7 verts and 12 tris, and use 16 indices, counting the two needed to connect to the next little row. This means you've got an indices-to-triangles ratio of 16/12, which is 1.3333, roughly equal to 686/512 (about 1.34), our triangle overcount. This is very similar to how BenMark5 makes its ribbons.

It's amusing to me that this cache-optimized stripping is the same old trick that Dave Stafford and Terje Mathiesen (sp?) came up with for software rasterizers around the era of Doom, to improve CPU cache performance when fetching texels from rotated textures.

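The row-order (B) and column-chunk (C) cases discussed above can be modelled roughly as follows. This is a sketch under loud assumptions: the cache is treated as a plain 16-entry FIFO, which is a stand-in and not a measured GF2 figure, and all function names are hypothetical. It only counts how many verts miss the cache per triangle for each ordering; it does not try to reproduce the hardware limit that made 6 the widest usable column.

```python
from collections import deque

SIDE = 64          # tiles per side, as in the post
CACHE_SIZE = 16    # ASSUMED FIFO size; not a measured GF2 figure


def tile_tris(x, y):
    """Two triangles for tile (x, y), as vertex indices into a 65x65 grid."""
    stride = SIDE + 1
    t0, t1 = y * stride + x, y * stride + x + 1   # top-left, top-right
    b0, b1 = t0 + stride, t1 + stride             # bottom-left, bottom-right
    return [(t0, t1, b0), (t1, b1, b0)]


def row_order():
    """Case B: tiles listed row by row across the full width."""
    return [t for y in range(SIDE) for x in range(SIDE) for t in tile_tris(x, y)]


def column_order(width=6):
    """Case C: columns `width` tiles wide, each walked row by row."""
    tris = []
    for x0 in range(0, SIDE, width):
        for y in range(SIDE):
            for x in range(x0, min(x0 + width, SIDE)):
                tris.extend(tile_tris(x, y))
    return tris


def verts_per_tri(tris, cache_size=CACHE_SIZE):
    """Cache misses per triangle under a simple FIFO vertex cache."""
    cache = deque(maxlen=cache_size)  # oldest entry drops off when full
    misses = 0
    for tri in tris:
        for v in tri:
            if v not in cache:
                misses += 1
                cache.append(v)
    return misses / len(tris)


def strip_indices_per_row(width=6):
    """Strip indices for one row of a column, plus 2 to join the next row."""
    return 2 * (width + 1) + 2


print(verts_per_tri(row_order()))       # full-width rows
print(verts_per_tri(column_order()))    # 6-wide columns
print(strip_indices_per_row(6), 2 * 6)  # indices vs. real tris per row
```

Under these assumptions the row order comes out at about one vert per triangle, while the 6-wide column order lands at roughly 0.6, a little above the 0.5 ideal because verts on a column boundary are fetched once for each column that touches them. The last line prints the 16 indices and 12 triangles per row behind the 16/12 overcount.
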
Charles Bloom cb@... http://www.cbloom.com 