From: Wolfgang W. <ww...@gm...> - 2005-05-22 14:41:38
Hello everybody!

Recently, I've been thinking about the RT core model, i.e. how the actual
calculation should be split across different boxes. The idea I would like to
present below is based on a discussion with Andrew about stream processing and
the "Kilauea paper", and follows my proposal from some time back to use as
many worker threads as there are CPUs in the system and additionally do
userspace (coroutine) switching.

Overview
========

The core consists of the following main parts:

- a request PQ (priority queue)
- processing kernels; one for each request type
- a list of saved states (coroutines)
- a response queue
- an (optional) network interface

Every request has a priority, a return path (i.e. which computer and coroutine
generated the request and needs to receive the corresponding response) and a
type. The type can be something like "calculate ray-shape intersection",
"shade this ray", "completely trace this ray", ... The system below is
independent of these types; there just needs to be a kernel which can process
requests of each type.

Operation
=========

1. Requests are queued in the request PQ according to their priority (higher
   priority first). Inputs for the queue are the network (received requests)
   or internal sources.

2. There are several worker threads (as many as there are CPUs). As soon as a
   request is available (and there are no responses to be handled locally) and
   a thread is "free" (i.e. needs more work to do), it [creates a new
   coroutine context and] dequeues the request and then processes it using the
   appropriate kernel routines.

3. The kernel then computes a response ("result") which is fed into the
   response queue to be "sent" back according to the request's return path
   (either via network or internally, see below).

4. The kernel itself may need to make further requests in order to process the
   current request. E.g. for shading a surface, tracing several rays (e.g.
   shadow rays) may be needed.
   The kernel will then enqueue these requests in the request PQ with a
   priority larger than that of the current request. The request PQ will
   decide whether to keep them in the local queue or send them over the
   network (thereby removing them from the local queue). When all the
   sub-requests are queued, the thread must "wait" for the responses. Instead
   of doing nothing, the current coroutine state is saved and the thread
   continues as "free" (see 2. and 5.).

5. If a worker thread is "free" and there is a response (in the response Q)
   which needs to be processed locally, the thread takes this response and
   continues with the saved coroutine state which corresponds to the
   response.

It is important that 5. takes precedence over 2., so that responses are
handled before new requests.

Here is a figure showing the flow of execution for the threads:
http://www.cip.physik.uni-muenchen.de/~wwieser/tmp/files/ray-struct.png

Portability
===========

In order for this concept to be portable, at least 3 threads are required:

1. There is one network receive thread [could be more as well] which simply
   poll(2)s/listens for incoming requests/responses over the network and
   queues them in the request PQ.

2. There is a network send thread which is either blocked by the request/
   response queues waiting for something to send, or blocked in poll(2) [or
   the analogous Windows function] trying to send something. [This cannot be
   done by the thread from 1., because that thread would then have to do a
   poll() and monitor the two queues simultaneously.]

3. At least one worker thread with several coroutine states.

The design above requires coroutine switching between threads, i.e. that one
thread can execute a coroutine created by another thread. According to my
tests, this causes no trouble under Linux and FreeBSD but is currently not
possible under Windows (src/lib/threads/test-switch.cc).
This means that [as long as we find no solution] the Windows version will

- use only one worker thread (i.e. suitable for uniprocessor boxes), or
- use one thread per coroutine context, or
- use several threads but keep coroutine contexts tied to their threads
  [which means that instead of processing responses in the order they arrive,
  the next "free" thread processes the first response with a compatible
  coroutine context].

Performance
===========

The most critical part is performance, i.e. how much overhead is introduced
by the above approach. I made some tests under Linux (ix86), FreeBSD (AMD64)
and Windows (ix86). For a more detailed view of the timings, see my next
email. For now, the basic work cycle is:

- Take a request from a queue.
- Switch to the appropriate coroutine.
- [Process the request. This is a no-op here.]
- Put a response into a queue.
- Switch the coroutine back.

[See src/lib/threads/test-switch.cc.]

One such work cycle takes 370 nsec (0.37 microseconds) on an AthlonXP at
1.47 GHz running Linux-2.6.11, using gcc-4.1 and a glibc with NPTL support.
This is about 15 vector transformations (matrix * vector) or 3 matrix
inversions.

OTOH, on an AMD64 (Athlon64) at 1.8 GHz running FreeBSD-5.4 with gcc-3.4, one
such work cycle takes 5100 nsec. (Yes, 5.1 usec!) This sucks because it is
something like 200 vector transformations on that box.

There are no good timings for Windows because the code fails to run properly
when compiled with optimization enabled (use -O0). Without optimization, I
get 4.5 usec on a P4-M at 1.18 GHz. (Note that optimization (-O2) improves
the runtime by a factor of 4.5 on the Linux box.)

Conclusion
==========

Threads & coroutines can be used to implement a request-response, stream-like
system. However, the requests should be chosen in such a way that the pure
kernel calculation time of each request consumes an order of magnitude more
CPU time than the overhead introduced by the queues and the switching.

If anyone (Andrew?)
has estimates for the ray-surface intersection times on current hardware for
"typical" scenes (e.g. a large triangle mesh), please post them for
comparison.

Wolfgang