What does PipeLen output mean?

2011-12-12
2013-04-24

As the title says, can anyone tell me?

• Igor Pavlov - 2011-12-12

It runs different tests (with branch misprediction) and tries to calculate the length of the CPU pipeline.

Thanks so much, but I have no idea what the output means.

Test # 2 ( 3): if (c & mask) { REP-N(c^=v) } REP-2(c^=v)   <--------what does that mean?
Timer frequency  =    1000000 Hz
CPU frequency    =    2772.47 MHz
Pipeline length v.1 =  38.01 stages - TEST   <-------- why do stages sometimes become ns?
Pipeline length v.2 =  11.56 stages - TEST
Pipeline length     =  24.78 stages - TEST

#Branch   #B/P       0       1     0-1  Random    Len1    Len2   <-----what does that mean?

32      4   28.30    8.18   64.30   63.24   90.01   -2.10
64      8    2.21  100.28   67.28   69.41   36.33    4.26
128     16    2.29   28.12   67.31    3.47  -23.46 -127.67

• Igor Pavlov - 2011-12-12

To get more stable results you must:
Close all programs.
And then run pipelen.

If it is a Pentium 4, pipelen is probably not a good way to measure pipeline length.
The replay logic of the Pentium 4 makes a good measurement difficult.

In fact, I'm porting PipeLen and MemLat to my embedded system,
so I want to know what the output of PipeLen means, such as #Branch,
#B/P, 0, 1, 0-1, Random, Len1, Len2, "if (c & mask) { REP-N(c^=v) } REP-2(c^=v)",
and when the nsmode value becomes true or false.

• Igor Pavlov - 2011-12-13

pipelen runs two kinds of tests:
1) tests with predictable branches
2) tests with unpredictable branches (branches that depend on random data)
The difference in time between them is caused by mispredictions.
pipeline_length = time_difference * 2 (since the probability of incorrect prediction is 0.5 for unpredictable branches, only half of them pay the misprediction penalty).
There are two patterns for predictable branches:
0-0-0-0-0-…
0-1-0-1-0-…
So there are two results: Len1 and Len2.

If your embedded CPU has no branch prediction engine, that test can't calculate pipeline length.

This is part of the PipeLen output of my embedded system.

#Branch   #B/P       0       1     0-1  Random    Len1    Len2

32      4   56.03  158.55  107.29  103.71   -7.15   -7.15
64      8   54.84  157.36  106.10  110.86    9.54    9.54
128     16   54.84  157.35  106.09   98.94  -14.30  -14.30
256     32   54.83  156.16  104.90  101.32   -8.34   -7.15
512     64   53.64  156.15  104.90  109.67    9.54    9.54
1-K    128   53.64  156.15  104.89  109.66    9.54    9.54
2-K    256   53.63  156.13  104.88  109.65    9.53    9.53
4-K    512   53.62  156.09  104.85  109.62    9.53    9.53
8-K    1-K   53.59  156.01  104.80  108.37    7.15    7.15
16-K    2-K   54.73  155.86  104.70  109.46    8.33    9.52
32-K    4-K   53.44  155.56  104.50  108.06    7.12    7.12
64-K    8-K   54.41  156.14  105.27  108.82    7.10    7.10
128-K   16-K   53.99  156.11  105.64  109.16    8.22    7.04
256-K   32-K   54.33  156.06  105.19  108.66    6.94    6.94
512-K   64-K   53.85  155.95  105.47  108.83    7.85    6.73

Test # 2 ( 4): if (c & mask) { REP-N(c^=v) } REP-2(c^=v)
Timer frequency  =    1000000 Hz
CPU frequency    =       1.00 MHz
Pipeline length v.1 =   7.85 ns - TEST
Pipeline length v.2 =   6.73 ns - TEST
Pipeline length     =   7.29 ns - TEST

I still have some questions.

1. Would you tell me what #Branch and #B/P mean? Does #Branch mean the number
of branches?
2. In column 3 (4), "0" ("1"): is that the latency when the branch is predicted
not to happen (to happen)?
3. In column 5, "0-1": is that the latency when the prediction alternates?
4. In column 6, "Random": is that the latency when the prediction is random?

• Igor Pavlov - 2011-12-14

Your CPU probably has no branch prediction engine.
Run pipelen on some Intel or AMD CPU.
Then you will see some real numbers.
#Branch - number of branches.
#B/P - number of bytes with random data (8 random bits per byte / so 8 branches per byte).
0 - if (false) - branch always
1 - if (true) - branch never
0-1 if (false) {…} … if (true) {…}… if (false) {…} … if (true) {…}
random - if (random) {…}

I still have some questions.
1. How are Len1 and Len2 computed?
2. How are Pipeline length v.1 and Pipeline length v.2 computed?
3. Why is the Pipeline length result in ns?

• Igor Pavlov - 2011-12-14
1. Len 1 = (Random_time - (0_time + 1_time) / 2) * 2
Len 2 = (Random_time - 0-1_time) * 2

2. Pipeline length v.*  = results from last line

3. That program doesn't know how to measure CPU ticks for your embedded CPU.
So it shows misprediction penalty (length of pipeline) in ns.
The x86/Windows version of pipelen shows results in CPU cycles.
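
The two formulas above can be checked against the table posted earlier in the thread. Below is a minimal Python sketch, fed with the last row (512-K) of the embedded-system output. The function names are made up for illustration and are not taken from the pipelen source.

```python
def len1(t0, t1, t_random):
    """Len1 = (Random_time - (0_time + 1_time) / 2) * 2"""
    return (t_random - (t0 + t1) / 2) * 2


def len2(t01, t_random):
    """Len2 = (Random_time - 0-1_time) * 2"""
    return (t_random - t01) * 2


# Columns "0", "1", "0-1" and "Random" from the 512-K row quoted in the thread.
t0, t1, t01, t_random = 53.85, 155.95, 105.47, 108.83

print(f"Len1 = {len1(t0, t1, t_random):.2f}")
print(f"Len2 = {len2(t01, t_random):.2f}")
```

This gives about 7.86 and 6.72, against the 7.85 and 6.73 that pipelen printed; a gap of that size is what you would expect from starting with inputs already rounded to two decimals. In both outputs quoted in this thread the final "Pipeline length" line also matches the mean of v.1 and v.2, but that is only an observation from the numbers, not something stated in the answer above.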

Does each result in columns such as 0, 1 and 0-1
mean the average penalty time per instruction?

• Anonymous - 2011-12-21

Dear sir

I am confused about the result of MemLat; here is the result from my system.
The L1 cache in my system is 16 KB and the L2 cache is 256 KB.
My question is why the latency becomes smaller when the line size
becomes bigger, even when the data size is bigger than the cache size.

/media/sdb1 # ./Mem_MIPS 50 p
MemLat 11.00 : Igor Pavlov : Public domain : 2011-05-12
Size     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22

4-K  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
5-K  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
6-K  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
7-K  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
8-K  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
10-K  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
12-K  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
14-K  2.59  2.58  2.59  2.61  2.54  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
16-K  2.66  2.68  2.71  2.77  2.56  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
20-K  9.04 10.06 13.66 30.98  3.40  2.54  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
24-K 13.29 14.61 18.26 32.21 10.19  2.55  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
28-K 15.53 16.56 20.31 32.07 11.64  3.78  2.53  2.53  2.53  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
32-K 17.55 19.18 22.86 32.00 15.96  3.74  2.53  2.53  2.53  3.00  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00
40-K 21.74 22.94 25.67 32.36 25.61  6.94  2.89  3.15  2.97  3.35  4.50  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00
48-K 23.07 24.33 27.27 32.33 29.09 10.08  4.05  3.28  3.31  3.17  4.50  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00
56-K 25.46 26.16 28.04 32.50 31.33 15.95  4.02  3.77  3.49  3.94  4.50  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00
64-K 26.13 26.95 28.83 32.66 31.89 20.62  6.91  3.52  3.76  3.94  4.50  2.53  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00
80-K 27.53 28.28 29.78 32.88 32.69 25.68  9.39  3.79  5.59  4.16  4.50  4.50  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00
96-K 29.41 29.81 30.38 33.18 33.23 30.15 12.09  4.00  4.11  4.09  4.50  4.50  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00
112-K 29.76 30.47 31.77 33.57 33.51 31.66 13.92  6.47  4.11  4.18  4.50  4.50  2.53  2.53  2.53  0.00  0.00  0.00  0.00  0.00  0.00
128-K 31.37 31.03 33.67 32.68 31.79 31.07 18.69  7.19  4.15  4.07  4.51  4.51  2.54  2.54  2.54  2.54  0.00  0.00  0.00  0.00  0.00
160-K 36.21 37.34 38.12 37.49 38.89 42.35 26.61 10.52  5.99  4.41  4.50  4.50  4.50  2.54  2.54  2.54  0.00  0.00  0.00  0.00  0.00
192-K 39.99 40.29 41.45 41.71 43.34 46.95 33.19 14.73  5.81  4.34  4.57  4.51  4.50  2.54  2.54  2.54  0.00  0.00  0.00  0.00  0.00
224-K 43.53 43.70 46.01 47.58 51.36 58.53 37.75 16.12  7.87  4.68  4.70  4.55  4.52  2.54  2.54  2.54  0.00  0.00  0.00  0.00  0.00
256-K 49.24 51.45 51.26 53.15 56.20 63.48 45.55 22.01 10.05  6.40  6.96  4.63  4.51  2.54  2.54  2.54  2.54  0.00  0.00  0.00  0.00
320-K 58.63 59.15 60.72 62.41 68.16 75.67 49.39 32.27 17.86 11.34 11.61  4.75  4.52  4.50  2.54  2.54  2.54  0.00  0.00  0.00  0.00
384-K 68.95 69.55 70.89 72.82 77.09 88.84 66.16 41.50 24.19 18.99 15.18  8.83  4.52  4.50  2.54  2.54  2.54  0.00  0.00  0.00  0.00
448-K 74.79 75.03 76.60 79.50 83.57 93.17 75.99 58.63 29.59 18.68 17.21 13.57  4.63  4.51  2.54  2.54  2.54  0.00  0.00  0.00  0.00
512-K 79.21 79.90 81.39 82.84 86.77 97.78 80.16 62.25 32.87 21.51 20.79 15.92  4.91  4.51  2.54  2.54  2.54  2.54  0.00  0.00  0.00
640-K 86.64 87.49 88.13 89.94 93.00 98.77 91.74 68.80 44.54 23.96 22.66 18.08 12.41  4.54  4.50  2.54  2.54  2.54  0.00  0.00  0.00
768-K 91.87 92.58 93.55 94.60 98.24 104.96 96.87 78.20 61.20 33.94 22.51 21.99 16.26  4.56  4.50  2.54  2.54  2.54  0.00  0.00  0.00
896-K 95.83 96.13 96.59 98.27 101.14 106.21 102.18 84.27 63.95 37.93 25.72 21.32 19.06  4.66  4.50  2.54  2.54  2.54  0.00  0.00  0.00
1024-K 99.09 99.62 100.43 101.85 105.35 112.29 104.76 87.94 67.52 40.09 28.68 22.73 20.83  6.14  4.51  2.54  2.54  2.54  2.54  0.00  0.00
1280-K 103.70 104.03 104.95 105.98 108.72 113.03 108.16 97.17 76.06 54.64 31.75 23.21 22.48 13.21  4.50  4.50  2.54  2.54  2.54  0.00  0.00
1536-K 106.30 106.52 107.14 108.26 110.79 115.14 111.71 104.62 82.52 63.59 37.91 26.16 24.01 17.38  4.55  4.50  2.54  2.54  2.54  0.00  0.00
1792-K 108.33 108.47 108.80 110.02 111.96 115.84 111.83 106.42 90.91 71.06 39.42 28.20 24.95 19.90  4.59  4.50  2.54  2.54  2.54  0.00  0.00
2048-K 110.57 110.90 111.17 112.35 114.36 117.86 114.39 107.83 93.92 75.66 45.83 29.96 25.72 21.22  6.60  4.51  2.54  2.54  2.54  2.54  0.00
2560-K 113.07 113.30 113.71 114.30 115.78 118.67 117.69 111.40 100.69 78.19 54.34 35.78 26.91 25.19 13.33  4.55  4.51  2.55  2.54  2.54  0.00
3072-K 114.61 114.76 115.22 115.66 116.39 119.02 118.22 114.34 104.36 86.29 65.61 40.57 29.29 24.73 17.55  4.58  4.50  2.54  2.54  2.54  0.00
3584-K 116.56 116.74 116.81 117.35 118.35 120.83 120.95 116.29 108.54 90.64 74.98 45.59 33.35 28.51 19.97  4.65  4.50  2.54  2.54  2.54  0.00
4-M 117.42 117.53 117.75 118.35 119.06 121.03 120.09 119.31 110.63 95.83 78.71 49.53 34.37 27.71 21.56  6.27  4.52  2.54  2.54  2.55  2.54
5-M 120.39 120.61 120.66 121.26 121.88 123.57 122.87 120.41 115.16 105.96 85.32 66.70 45.98 27.85 23.05 13.28  4.55  4.50  2.54  2.54  2.54
6-M 122.71 122.86 122.87 123.10 123.90 125.18 124.09 123.52 121.54 112.43 94.56 78.03 51.01 33.75 24.87 16.85  4.59  4.50  2.54  2.54  2.54
7-M 125.19 125.24 125.32 125.71 126.52 127.50 126.52 126.85 124.28 115.84 99.98 81.05 62.03 34.40 27.85 19.82  4.64  4.51  2.54  2.54  2.54
8-M 128.14 128.05 128.03 128.67 128.84 129.53 129.37 129.25 127.24 118.78 100.20 82.62 73.17 46.10 28.60 22.94  6.47  4.54  2.54  2.54  2.54
10-M 131.95 132.10 132.35 132.42 132.35 133.06 132.98 132.79 132.91 130.09 112.36 88.95 77.89 61.34 39.20 24.70 13.32  4.51  4.50  2.54  2.54
12-M 134.57 134.59 134.63 134.56 135.14 135.52 135.30 135.68 134.31 133.14 122.18 96.34 85.34 67.19 39.93 26.09 17.62  4.58  4.50  2.54  2.54
14-M 136.12 136.22 136.40 136.49 136.35 137.13 137.11 136.70 136.72 133.37 126.96 107.19 93.42 80.06 46.10 25.96 20.27  4.68  4.50  2.54  2.54
16-M 137.38 137.37 137.39 137.38 137.76 138.19 138.03 137.81 138.18 137.91 130.55 115.47 100.50 84.11 58.28 31.14 27.16  6.38  4.51  2.54  2.54
20-M 139.56 139.55 139.58 139.74 139.83 140.24 139.93 139.88 140.07 139.53 135.67 126.83 107.15 89.91 74.12 38.07 26.35 13.58  4.53  4.51  2.54
24-M 141.01 141.03 141.09 141.11 141.26 141.46 141.48 141.40 141.19 140.87 138.43 131.48 116.60 100.55 83.21 44.62 27.87 18.59  4.60  4.50  2.54
28-M 142.94 143.00 143.09 143.09 143.15 143.32 143.53 143.45 143.47 142.73 141.15 137.50 125.28 106.91 87.73 54.32 31.33 20.22  4.71  4.50  2.54
32-M 144.51 144.51 144.49 144.61 144.71 144.88 144.71 144.57 145.23 143.98 142.26 139.91 130.36 113.19 90.04 58.11 28.54 21.77  6.28  4.51  2.54
40-M 146.45 146.47 146.48 146.52 146.54 146.82 146.51 147.03 146.34 146.61 144.46 143.82 139.43 121.03 107.00 73.68 38.84 24.94 13.44  4.52  4.50
48-M 147.90 147.89 147.88 147.95 147.88 148.09 147.98 148.13 147.94 148.20 146.47 144.98 143.96 131.39 110.47 83.53 42.52 26.49 17.49  4.57  4.50

BW- 32 B     65     65     65     65     65     65     65     65     65     65     66     67     67     73     87    116    228    366    555   2125   2161
BW- 64 B    131    131    131    131    131    131    131    131    131    131    132    134    134    147    175    232    457    733   1110   4250   4322
BW-128 B    262    262    262    262    262    262    262    262    262    262    265    268    269    295    351    465    914   1466   2221   8501   8645

Cache latency    =      8.343 ns     =       2.53 cycles
Memory latency   =    487.107 ns     =     147.90 cycles

Timer frequency  =    1000000 Hz
CPU frequency    =     303.63 MHz

• Igor Pavlov - 2011-12-21

column 5 - one access to each 32-byte block. So it shows the real size of the L1 cache, if the cache line size is 32 bytes.
column 6 - one access to each 64 bytes. So the cache will use only one line (32 bytes) from each 64 bytes. It will then show good latency even beyond 16 KB.
column 12 - one access per 4 KB. It shows the size of the TLB and the penalty for a TLB miss.

Why is the L1 latency 2.53 cycles?
It must be an integer.
Is it really 300 MHz?

Try:
memlat 50 p d

• Anonymous - 2011-12-22

Yes, it is really 301.5 MHz, and I have no idea why the L1 latency is 2.53 cycles.

1. Should column x refer to a test with page size 2^x bytes instead of the cache line size? I printed out the cache line size for each test, and they are all 4096.
2. Is the BW- 32 B result in megabytes or megabits?
3. Why does column 12 show the size of the TLB and the penalty of a TLB miss?

• Igor Pavlov - 2011-12-22

What is your CPU (exact name and specification)?
You can try tests:
memlat 64 p d
memlat 64 p5 d
memlat 64 p6 d
memlat 64 p7 d
memlat 64 c1020h
memlat 2 b
memlat 2 b w
2) Megabytes.
3) The TLB contains information about page translation (page size = 4 KB = 2^12).
So if you read with a 4 KB step, you have one access per TLB entry.
Note also that MIPS CPUs probably use 2 entries (for 2 pages) per one TLB entry.

• Anonymous - 2011-12-22

My CPU is a MIPS CPU.

1) How does it work when the cache line size is bigger than the cache size?
2) When the block size is larger than the CPU cache size, it needs to read from memory
or an external cache, as in column 22 with block size 48-M. But why is the latency
there smaller than in the other columns with block size 48-M?

• Igor Pavlov - 2011-12-22

corrected message:

2) 2^22 = 4 MB.
It splits 48 MB into 4 MB chunks.
48 MB / 4 MB = 12 chunks.
Then it selects a RANDOM offset in each chunk (so each offset is lower than 4 MB).
So it has 12 random addresses (one random address per 4 MB chunk).
Then it reads from these addresses (12 accesses in a loop, then again 12 accesses to the same addresses, then again, …).
Since your L1 data cache supports 512 32-byte lines, all these 12 accesses will hit in the L1 data cache.

To get the L1 latency you must look at column 5: 2^5 = 32 bytes (L1 cache line size).
To get the L2 latency you probably must look at column 7: 2^7 = 128 bytes. Your CPU probably loads 128 bytes into the L2 cache from RAM for each RAM access.
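
As a rough illustration of the access pattern described above, here is a minimal Python sketch. It is not memlat's code: the function name, the fixed seed, the 4-byte alignment and the 32-byte line size are all assumptions made for the example.

```python
import random


def sample_addresses(block_size, column, align=4, seed=0):
    """Pick one random, align-byte-aligned address in every 2**column-byte chunk.

    Hypothetical helper for illustration; requires 2**column >= align.
    """
    rng = random.Random(seed)
    chunk = 1 << column
    n_chunks = block_size // chunk
    slots = chunk // align  # number of aligned positions inside one chunk
    return [i * chunk + rng.randrange(slots) * align for i in range(n_chunks)]


MB = 1 << 20
addrs = sample_addresses(48 * MB, 22)   # column 22, block size 48-M
lines = {a // 32 for a in addrs}        # distinct 32-byte cache lines touched
print(len(addrs), "addresses,", len(lines), "cache lines")
```

The loop only ever revisits these 12 addresses, so the working set is 12 cache lines, comfortably below the 512 lines of a 16 KB L1 with 32-byte lines, no matter how large the block is.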

• Anonymous - 2011-12-23

1) For example, in column 2 with block size 24-K, it splits 24 KB into 2 B chunks.
24 KB / 2 B = 12 K
Then it selects a RANDOM offset in each chunk (what is the size of each offset?).

• Anonymous - 2011-12-23

Sorry, the question should be:
1) For example, in column 2 with block size 24-K, it splits 24 KB into 4 B chunks.
24 KB / 4 B = 6 K
Then it selects a RANDOM offset in each chunk (what is the size of each offset?).

• Igor Pavlov - 2011-12-23

The random offset is always aligned to a 4-byte boundary.
So for 4 B chunks, the offset is always 0.
For 8 B chunks, the offset can be 0 or 4.
For 16 B chunks, the offset can be 0, 4, 8 or 12.
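
The offsets listed above follow directly from the 4-byte alignment. A very small Python sketch (the function name is hypothetical):

```python
def possible_offsets(chunk_size, align=4):
    """All offsets inside a chunk that are aligned to `align` bytes."""
    return list(range(0, chunk_size, align))


for size in (4, 8, 16):
    print(size, possible_offsets(size))
```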

• Anonymous - 2011-12-23

1) column 2 - one access to each 4-byte block. So the cache will use 4 bytes from each 4 bytes?
2) column 3 - one access to each 8-byte block. So the cache will use 4 bytes from each 8 bytes?
3) column 4 - one access to each 16-byte block. So the cache will use 4 bytes from each 16 bytes?

column 5 - one access to each 32-byte block. So it shows the real size of the L1 cache, if the cache line size is 32 bytes.
column 6 - one access to each 64 bytes. So the cache will use only one line (32 bytes) from each 64 bytes.

4) column 7 - one access to each 128-byte block. So the cache will use 128 bytes from each 128 bytes?
So it shows the real size of the L2 cache, if the cache line size is 128 bytes?

• Igor Pavlov - 2011-12-23

columns 2, 3, 4, 5 - the L1 cache always loads a full line (32 bytes) from L2 or from RAM, even if you read only 4 bytes from the line.

I don't know the L2 cache line size of your CPU.
From the test results it looks like a 128-byte cache line. But it could also be 64-byte lines, if the CPU always loads two 64-byte lines from RAM (or four 32-byte lines).

Since for columns 2, 3, 4, 5 the L1 cache always loads a full line (32 bytes) from L2 or from RAM,
for column 2 with block size 16-K, it splits 16 KB into 2 B chunks,
16 KB / 2 B = 8 K chunks
8 K * 32 bytes = 256 KB
Then 256 KB > 16 KB, which means the latency should be bigger.

• Igor Pavlov - 2011-12-26

columns 2, 3, 4, 5 - the L1 cache loads a full line (32 bytes) from L2 or RAM when you have a big block (> 16 KB).
column 2: it splits into 4 B chunks. So there are 4 K random addresses inside the 16 KB block. So only the L1 cache is involved in that case.
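
The same counting can be done for any column. Below is a minimal Python sketch, assuming a 32-byte L1 line and power-of-two chunk sizes; the function name is made up for illustration and nothing here comes from the memlat source.

```python
def l1_footprint(block_size, column, line_size=32):
    """Return (addresses, distinct lines touched, bytes of L1 needed)."""
    chunk = 1 << column
    n_addresses = block_size // chunk
    # Chunks smaller than a line: every line in the block is touched.
    # Chunks of a line or more: one line per chunk.
    lines = min(n_addresses, block_size // line_size)
    return n_addresses, lines, lines * line_size


KB = 1024
print(l1_footprint(16 * KB, 2))   # column 2, block size 16-K
print(l1_footprint(20 * KB, 5))   # column 5, block size 20-K
print(l1_footprint(20 * KB, 6))   # column 6, block size 20-K
```

For column 2 at 16-K this gives 4 K addresses spread over 512 lines, exactly 16 KB, so everything still fits in L1. At 20-K, column 5 needs 20 KB worth of lines and no longer fits a 16 KB L1, while column 6 needs only 10 KB, which is consistent with the much lower latency column 6 shows at that size.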