This is the jacobi benchmark. A TCU executes a loop. At each iteration, 4 load requests are sent; the loaded values are added, then one store is sent.
In this particular version, loop prefetching is enabled, and thus 4 prefetch instructions are executed at each iteration as well.
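For readers without the benchmark source at hand, the loop body has roughly the following shape. This is only a sketch in plain C, not the actual benchmark code: the function name jacobi_row, the parameter names, and the assumption that the 4 loads are the north/south/west/east neighbours of a row-major grid are all hypothetical. The prefetch instructions are inserted by the compiler and do not appear here.

```c
#include <stddef.h>

/* Hypothetical sketch of one row of the jacobi loop (not the benchmark source).
 * Assumes a row-major grid of ints that is `width` elements wide, and that
 * `row` is an interior row so that the rows above and below exist. */
void jacobi_row(const int *in, int *out, size_t width, size_t row)
{
    for (size_t col = 1; col + 1 < width; col++) {
        size_t i = row * width + col;

        /* 4 load requests per iteration */
        int north = in[i - width];
        int south = in[i + width];
        int west  = in[i - 1];
        int east  = in[i + 1];

        /* the values are added, then one store is sent */
        out[i] = north + south + west + east;
    }
}
```

With loop prefetching enabled, each of the 4 loads gets a matching prefetch instruction, which would account for the 4 prefetches per iteration.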
I am looking in particular at the execution of TCU 50. In the generated tracedump (tcu50_memoperations.tracedump.txt), a packet for instr 111:pref is put in CLSTR_2_LS_FRONT_END_FAN_IN at clock cycle 7180. The packet then stays there until cycle 11878, when it reaches INW_SEND_INPUT_PORT_2 -- it is stuck for 4698 cycles!
However, in the meantime, other instructions from the same TCU go through the LS unit without any wait: 113:pref (packet created @7182), 114:pref (packet created @7183), 111:pref (packet created @7217).
How to reproduce:
$ xmtsim -v
Simulator Version: 0.81.98.r6216
Java Version: 1.6.0_10
(I used the devel simulator)
$ xmtsim -cycle -count -timer 30 -conf ./fpga8 -trace=directives -traceout jacobi_int_byrow.drampref.tracedump jacobi_int_byrow.drampref.sim -binload jacobi_int_byrow.drampref.b