Implement GPU loop scheduling inside a team of threads. For instance:
nuteam {
    // ordinary loop, executed by all threads in a team
    nfor(k in n) {
        // team loop: iterations are scheduled across the threads of a team;
        // has an implicit team-wide barrier after it
        nusched nfor((i, j) in (n, n)) {
            p[i, j] = min(p[i, j], p[i, k] + p[k, j]);
        }
    }
}
The nusched() macro is somewhat similar to OpenMP's "parallel for" construct. It will most likely have similar, or even some of the same, options:
1) whether there is a barrier at the end of the loop
2) scheduling chunk size (most likely, a multiple of warp size)
3) scheduling type (block/cyclic, static/dynamic/implementation-defined)
4) whether the loop is a reduction (with reduction variable and operation specified)
...