From: Wolfgang W. <ww...@gm...> - 2005-05-22 14:41:38
Hello everybody!

Recently, I've been thinking about the RT core model, i.e. how the actual
calculation should be split across different boxes. The idea I would like to
present below is based on a discussion with Andrew about stream processing and
the "Kilauea paper", and follows my proposal from some time back to use as
many worker threads as there are CPUs in the system and additionally do
userspace (coroutine) switching.

Overview
========

The core consists of the following main parts:

- a request PQ (priority queue)
- processing kernels; one for each request type
- a list of saved states (coroutines)
- a response queue
- an (optional) network interface

Every request has a priority, a return path (i.e. which computer and coroutine
generated the request and needs to receive the corresponding response) and a
type. The type can be something like "calculate ray-shape intersection",
"shade this ray", "completely trace this ray", ... The system below is
independent of these types; there just needs to be a kernel which can process
requests of each type.

Operation
=========

1. Requests are queued in the request PQ according to their priority (higher
   priority first). Inputs for the queue are the network (received requests)
   or internal sources.

2. There are several worker threads (as many as there are CPUs). As soon as a
   request is available (and there are no responses to be handled locally) and
   a thread is "free" (i.e. needs more work to do), it [creates a new
   coroutine context and] dequeues the request and then processes it using the
   appropriate kernel routines.

3. The kernel then computes a response ("result") which is fed into the
   response queue to be "sent" back according to the request's return path
   (either via network or internally, see below).

4. The kernel itself may need to make further requests in order to process the
   current request. E.g. for shading a surface, tracing several rays (e.g.
   shadow rays) may be needed.
   The kernel will then enqueue these requests in the request PQ with a
   priority larger than that of the current request. The request PQ will
   decide whether to keep them in the local queue or send them over the
   network (thereby removing them from the local queue). When all the
   sub-requests are queued, the thread must "wait" for the responses. Instead
   of doing nothing, the current coroutine state is saved and the thread
   continues as "free" (see 2. and 5.).

5. If a worker thread is "free" and there is a response (in the response Q)
   which needs to be processed locally, the thread takes this response and
   continues with the saved coroutine state which corresponds to the
   response.

It is important that 5. takes precedence over 2., so that responses are
handled before new requests.

Here is a figure showing the flow of execution for the threads:
http://www.cip.physik.uni-muenchen.de/~wwieser/tmp/files/ray-struct.png

Portability
===========

In order for this concept to be portable, at least 3 threads are required:

1. There is one network receive thread [could be more as well] which simply
   poll(2)s/listens for incoming requests/responses over the network and
   queues them in the request PQ.

2. There is a network send thread which is either blocked by the request/
   response queues waiting for something to send, or blocked in poll(2) [or
   the analogous Windows function] trying to send something. [This cannot be
   done by the thread from 1., because that thread would then have to do a
   poll() and monitor the two queues simultaneously.]

3. At least one worker thread with several coroutine states.

The design above requires coroutine switching between threads, i.e. that one
thread can execute a coroutine created by another thread. According to my
tests, this causes no trouble under Linux and FreeBSD but is currently not
possible under Windows (src/lib/threads/test-switch.cc).
This means that [as long as we find no solution] the Windows version will

- use only one worker thread (i.e. suitable for uniprocessor boxes), or
- use one thread per coroutine context, or
- use several threads but keep coroutine contexts tied to their threads
  [which means that instead of processing responses in the order they arrive,
  the next "free" thread processes the first response with a compatible
  coroutine context].

Performance
===========

The most critical part is performance, i.e. how much overhead is introduced
by the above approach. I made some tests under Linux (ix86), FreeBSD (AMD64)
and Windows (ix86). For a more detailed view of the timings, see my next
email. For now, the basic work cycle is:

- Take a request from a queue.
- Switch to the appropriate coroutine.
- [Process the request. This is a no-op here.]
- Put a response into a queue.
- Switch the coroutine back.

[See src/lib/threads/test-switch.cc.]

One such work cycle takes 370 nsec (0.37 microseconds) on an AthlonXP at
1.47 GHz running Linux-2.6.11, using gcc-4.1 and a glibc with NPTL support.
This is about 15 vector transformations (matrix * vector) or 3 matrix
inversions.

OTOH, on an AMD64 (Athlon64) at 1.8 GHz running FreeBSD-5.4 with gcc-3.4, one
such work cycle takes 5100 nsec. (Yes, 5.1 usec!) This sucks because it is
something like 200 vector transformations on that box.

There are no good timings for Windows because the code fails to run properly
when compiled with optimization enabled (use -O0). Without optimization, I
get 4.5 usec on a P4-M at 1.18 GHz. (Note that optimization (-O2) improves
the runtime by a factor of 4.5 on the Linux box.)

Conclusion
==========

Threads & coroutines can be used to implement a request-response, stream-like
system. However, the requests should be chosen in such a way that the pure
kernel calculation time of each request consumes an order of magnitude more
CPU time than the overhead introduced by the queues and the switching.

If anyone (Andrew?)
has estimates for the ray-surface intersection times on current hardware for
"typical" scenes (e.g. a large triangle mesh), please post them for
comparison.

Wolfgang