
tiny-cuda-nn now comes with a just-in-time (JIT) compilation mode that fuses encodings, neural networks, loss functions, and even backpropagation into single CUDA kernels. This leads to 1.5x-2.5x faster inference and training out of the box and can be enabled with a single line of code; see the "Automatic JIT" section below.

Even larger speed-ups are possible when applications integrate tightly with tiny-cuda-nn's new JIT compiler. For example, Instant NGP achieves a 5x speedup by fusing the entire NeRF ray marcher into a single kernel. See the "Direct JIT integration" section for details on how to accomplish this.

Automatic JIT

To enable JIT compilation mode, set the jit_fusion property of your model to true. All future uses of the model, whether inference or training, will then use JIT mode. If an error occurs during JIT compilation, a warning is emitted and JIT compilation mode is turned off automatically; your code will still run, using the tiny-cuda-nn 1.X code path.

:::cpp
auto model = tcnn::create_from_config(...);
model->set_jit_fusion(tcnn::supports_jit_fusion()); // Enable JIT if the system supports it

:::python
import tinycudann as tcnn

model = tcnn.NetworkWithInputEncoding(...) # Or any other tcnn model
model.jit_fusion = tcnn.supports_jit_fusion() # Enable JIT if the system supports it
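
Once JIT fusion is enabled, the model is used exactly as before; no further code changes are needed. As a minimal C++ sketch (assuming the usual tiny-cuda-nn GPUMatrix/inference API, and with n_input_dims, n_output_dims, batch_size, and stream defined by your application):

:::cpp
// Illustrative only: inference goes through the JIT-fused kernel when jit_fusion is enabled.
tcnn::GPUMatrix<float> inputs(n_input_dims, batch_size);
tcnn::GPUMatrix<float> outputs(n_output_dims, batch_size);
model->inference(stream, inputs, outputs);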

Direct JIT integration

tiny-cuda-nn 2.0's JIT compiler works by converting a given tiny-cuda-nn model to a CUDA device function and then compiling it into a kernel using CUDA's runtime compilation (RTC) feature.
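
Because the device function is generated as a plain string, you can also obtain and inspect it before embedding it in a kernel. A small sketch (the printing is purely illustrative):

:::cpp
// generate_device_function emits the model as CUDA source for a device function
// with the given name; here it is only printed for inspection.
auto model_src = model->generate_device_function("model_fun");
std::cout << model_src << std::endl;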

To integrate a tiny-cuda-nn model with a larger kernel in your app, you need to (1) turn your kernel into a string, (2) prepend the tiny-cuda-nn model's device function, and (3) pass the result to tiny-cuda-nn's runtime compilation API.

Here is an example that implements a minimal kernel using a tiny-cuda-nn model with 32 input dimensions and 16 output dimensions:

:::cpp
#include <tiny-cuda-nn/rtc_kernel.h>

auto model = tcnn::create_from_config(32 /* input dims */, 16 /* output dims */, ...);
auto fused_kernel = tcnn::CudaRtcKernel(
    "your_kernel",
    // Literal braces in the kernel source are doubled ({{ and }}) so that
    // fmt::format passes them through unchanged.
    fmt::format(R"(
        {MODEL_DEVICE_FUNCTION}
        __global__ void your_kernel(...) {{
            // Get the input to the model from either registers or memory.
            tcnn::hvec<32> input = ...;
            // Call the tiny-cuda-nn model. All 32 threads of the warp must be active here.
            tcnn::hvec<16> output = model_fun(input, params);
            // Do something with the model output.
        }})",
        fmt::arg("MODEL_DEVICE_FUNCTION", model->generate_device_function("model_fun"))
    )
);

uint32_t blocks = 1;
uint32_t threads = 128; // Must be a multiple of 32 for neural networks to work.
uint32_t shmem_size = 0; // Can be any size that your_kernel needs.
cudaStream_t stream = nullptr; // Can be any stream.
fused_kernel.launch(blocks, threads, shmem_size, stream, ... /* params of your_kernel */);
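
Unlike the automatic JIT mode, direct integration has no built-in fallback when runtime compilation fails, so your application should handle errors itself. A minimal sketch, assuming compilation failures surface as C++ exceptions (an assumption, not documented behavior), with kernel_source standing in for the fmt::format string built above:

:::cpp
try {
    auto fused_kernel = tcnn::CudaRtcKernel("your_kernel", kernel_source);
    fused_kernel.launch(blocks, threads, shmem_size, stream, ... /* params of your_kernel */);
} catch (const std::exception& e) {
    // Report the failure and fall back to the regular (non-fused) tiny-cuda-nn path.
    std::cerr << "JIT compilation failed: " << e.what() << std::endl;
}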

And here is Instant NGP's NeRF integration with the JIT compiler for reference:

  • src/testbed_nerf.cu
  • include/neural-graphics-primitives/fused_kernels/render_nerf.cuh

Other additions and changes since last release

  • Added unit tests to ensure the new JIT's output matches tiny-cuda-nn with JIT disabled
  • Fixed miscellaneous bugs in the build system and in tiny-cuda-nn itself