tiny-cuda-nn now comes with a just-in-time (JIT) compilation mode that fuses encodings, neural networks, loss functions, and even backpropagation into single CUDA kernels. This leads to 1.5x-2.5x faster inference and training out of the box, and it can be enabled with a single line of code; see the "Automatic JIT" section below.
Even larger speed-ups are possible when applications integrate tightly with tiny-cuda-nn's new JIT compiler. For example, Instant NGP achieves a 5x speedup by fusing the entire NeRF ray marcher into a single kernel. See the "Direct JIT integration" section for details on how to accomplish this.
Automatic JIT
To enable JIT compilation mode, set the `jit_fusion` property of your model to `true`. All future uses of the model, whether for inference or training, will then use JIT mode. Note that if an error occurs during JIT compilation, a warning is emitted and JIT compilation mode is automatically turned off; your code will still run using the tiny-cuda-nn 1.X code path.
:::cpp
auto model = tcnn::create_from_config(...);
model->set_jit_fusion(tcnn::supports_jit_fusion()); // Enable JIT if the system supports it
:::python
import tinycudann as tcnn
model = tcnn.NetworkWithInputEncoding(...) # Or any other tcnn model
model.jit_fusion = tcnn.supports_jit_fusion() # Enable JIT if the system supports it
Direct JIT integration
tiny-cuda-nn 2.0's JIT compiler works by converting a given tiny-cuda-nn model to a CUDA device function and then compiling it into a kernel using CUDA's runtime compilation (RTC) feature.
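To get a feel for this, you can print the generated CUDA source. A minimal sketch, assuming `model` was created as in the examples above; `generate_device_function` is the same call used in the integration example below:

:::cpp
#include <iostream>
#include <tiny-cuda-nn/rtc_kernel.h>

// Sketch: print the CUDA source of the model's device function, i.e. the code
// that gets prepended to your kernel string in the integration steps below.
// Assumes `model` was created via tcnn::create_from_config as elsewhere in
// this document.
std::cout << model->generate_device_function("model_fun") << std::endl;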
To integrate a tiny-cuda-nn model with a larger kernel in your app, you need to:

1. turn your kernel into a string,
2. prepend the tiny-cuda-nn model's device function, and
3. pass the result to tiny-cuda-nn's runtime compilation API.
Here is an example that implements a minimal kernel using a tiny-cuda-nn model with 32 input dimensions and 16 output dimensions:
:::cpp
#include <tiny-cuda-nn/rtc_kernel.h>

auto model = tcnn::create_from_config(32 /* input dims */, 16 /* output dims */, ...);
auto fused_kernel = tcnn::CudaRtcKernel(
    "your_kernel",
    fmt::format(R"(
        {MODEL_DEVICE_FUNCTION}

        __global__ void your_kernel(...) {{
            // Get input to model from either registers or memory.
            tcnn::hvec<32> input = ...;

            // Call tiny-cuda-nn model. All 32 threads of the warp must be active here.
            tcnn::hvec<16> output = model_fun(input, params);

            // Do something with the model output.
        }}
    )",
        fmt::arg("MODEL_DEVICE_FUNCTION", model->generate_device_function("model_fun"))
    )
);
uint32_t blocks = 1;
uint32_t threads = 128; // Must be multiple of 32 for neural networks to work.
uint32_t shmem_size = 0; // Can be any size that your_kernel needs.
cudaStream_t stream = nullptr; // Can be any stream.
fused_kernel.launch(blocks, threads, shmem_size, stream, ... /* params of your_kernel */);
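For concreteness, here is a hypothetical launch with the trailing `...` filled in. It assumes `your_kernel` was declared to take the model's parameter buffer and an output array, and it assumes a parameter accessor named `inference_params()`; both are illustrative assumptions, so check the tiny-cuda-nn headers for the actual kernel signature and accessor name:

:::cpp
// Hypothetical: assumes `your_kernel(const tcnn::network_precision_t* params, float* out)`
// and an `inference_params()` accessor on the model. Both are illustrative
// assumptions, not confirmed tiny-cuda-nn API.
float* d_out;
cudaMalloc(&d_out, 16 * sizeof(float)); // One 16-dimensional output in this toy example.
fused_kernel.launch(blocks, threads, shmem_size, stream, model->inference_params(), d_out);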
And here is Instant NGP's NeRF integration with the JIT compiler for reference:

- src/testbed_nerf.cu
- include/neural-graphics-primitives/fused_kernels/render_nerf.cuh
Other additions and changes since last release
- Added unit tests to ensure the new JIT's output matches that of tiny-cuda-nn with JIT disabled (a sketch of this kind of check follows below)
- Fixed miscellaneous bugs in the build system and in tiny-cuda-nn itself
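As a point of reference, the equivalence check in those tests boils down to something like the following sketch. It uses the `inference(stream, input, output)` call from tiny-cuda-nn's C++ sample code and assumes `model` was created as in the examples above; the actual test harness in the repository may be structured differently:

:::cpp
#include <tiny-cuda-nn/common.h>
#include <tiny-cuda-nn/gpu_matrix.h>

// Sketch: run identical inputs through the model with JIT fusion off and on,
// then compare the outputs. Assumes `model` was created as in the examples above.
const uint32_t batch_size = 256;
cudaStream_t stream = nullptr;

tcnn::GPUMatrix<float> input(32, batch_size);   // Filled with test data elsewhere.
tcnn::GPUMatrix<float> output_ref(16, batch_size);
tcnn::GPUMatrix<float> output_jit(16, batch_size);

model->set_jit_fusion(false);
model->inference(stream, input, output_ref);

model->set_jit_fusion(tcnn::supports_jit_fusion());
model->inference(stream, input, output_jit);

// Copy both outputs to the host and compare, e.g. via max absolute difference.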