

The CCCL team is excited to announce the 3.2 release of the CUDA Core Compute Libraries (CCCL). Highlights include new modern CUDA C++ runtime APIs and new speed-of-light algorithms, including Top-K selection.

Modern CUDA C++ Runtime

CCCL 3.2 broadly introduces new, idiomatic C++ interfaces for core CUDA runtime and driver functionality.

If you’ve written CUDA C++ for a while, you’ve likely built (or adopted) some form of convenience wrapper around today’s C-style APIs such as cudaMalloc or cudaStreamCreate.

The new APIs added in CCCL 3.2 are meant to provide the productivity and safety benefits of C++ for core CUDA constructs so you can spend less time reinventing wrappers and more time writing kernels and algorithms.
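As a point of reference, here is the sort of hand-rolled RAII wrapper the paragraph above alludes to, sketched with the standard CUDA runtime API (the class name is illustrative, not from CCCL):

```cpp
// A typical hand-rolled RAII wrapper around cudaStream_t -- the kind of
// boilerplate that a vocabulary type like cuda::stream makes unnecessary.
#include <cuda_runtime.h>
#include <stdexcept>

class stream_wrapper {
    cudaStream_t stream_{};
public:
    stream_wrapper() {
        if (cudaStreamCreate(&stream_) != cudaSuccess)
            throw std::runtime_error("cudaStreamCreate failed");
    }
    ~stream_wrapper() { cudaStreamDestroy(stream_); }
    // Non-copyable: the wrapper uniquely owns the stream handle.
    stream_wrapper(const stream_wrapper&) = delete;
    stream_wrapper& operator=(const stream_wrapper&) = delete;
    cudaStream_t get() const { return stream_; }
};
```

Every team tends to write some variant of this; the new vocabulary types aim to make that per-project boilerplate unnecessary.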

Highlights:

  • New convenient vocabulary types for core CUDA concepts (cuda::stream, cuda::event, cuda::arch_traits)
  • Easier memory management with Memory Resources and cuda::buffer
  • More powerful and convenient kernel launch with cuda::launch

Example (vector add, revisited):

:::cpp
cuda::device_ref device = cuda::devices[0];
cuda::stream stream{device};
auto pool = cuda::device_default_memory_pool(device);

int num_elements = 1000;
auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0);
auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0);
auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

constexpr int threads_per_block = 256;
auto config = cuda::distribute<threads_per_block>(num_elements);
auto kernel = [] __device__ (auto config, cuda::std::span<const float> A, 
                                            cuda::std::span<const float> B, 
                                            cuda::std::span<float> C){
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size())
        C[tid] = A[tid] + B[tid];
};
cuda::launch(stream, config, kernel, config, A, B, C);

(Try this example live on Compiler Explorer!)

A forthcoming blog post will go deeper into the details, the design goals, intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.

New Algorithms

Top-K Selection

CCCL 3.2 introduces cub::DeviceTopK (for example, cub::DeviceTopK::MaxKeys) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5x speedups over a full radix sort, and can reduce memory consumption when you don’t need sorted results.
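For reference, the selection that Top-K computes (here, the K largest keys) matches this sequential C++ sketch; the device algorithm produces the same selection on the GPU without sorting the whole input (and, unlike this sketch, does not guarantee the results are sorted):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Sequential reference for Top-K selection: return the k largest elements.
std::vector<float> top_k_max(std::vector<float> keys, std::size_t k) {
    k = std::min(k, keys.size());
    // Partially sort so the k largest elements land (descending) at the front.
    std::partial_sort(keys.begin(), keys.begin() + k, keys.end(),
                      std::greater<float>());
    keys.resize(k);
    return keys;
}
```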

Top‑K is an active area of ongoing work for CCCL: our roadmap includes planned segmented Top‑K as well as block‑scope and warp‑scope Top‑K variants. See what’s planned and tell us what Top‑K use cases matter most in CCCL GitHub issue [#5673].


Fixed-size Segmented Reduction

CCCL 3.2 provides a new cub::DeviceSegmentedReduce overload that accepts a uniform segment_size, eliminating offset-iterator overhead in the common case where all segments have the same size. This enables speedups of up to 66x for small segments and up to 14x for large segments.

:::cpp
// New overload accepts a fixed segment_size instead of per-segment
// begin/end offsets. As with other CUB device algorithms, call once with
// d_temp == nullptr to query temp_bytes, allocate, then call again to run.
cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,
                                num_segments, segment_size);
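The semantics of a fixed-size segmented sum can be stated as a short sequential C++ reference (segment i reduces input[i*segment_size, (i+1)*segment_size)):

```cpp
#include <numeric>
#include <vector>

// Sequential reference for a fixed-size segmented sum: each segment has
// exactly segment_size consecutive elements, so no offset array is needed.
std::vector<int> segmented_sum(const std::vector<int>& input,
                               int num_segments, int segment_size) {
    std::vector<int> out(num_segments);
    for (int i = 0; i < num_segments; ++i) {
        auto first = input.begin() + i * segment_size;
        out[i] = std::accumulate(first, first + segment_size, 0);
    }
    return out;
}
```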


Additional New Algorithms in CCCL 3.2

Segmented Scan - cub::DeviceSegmentedScan efficiently computes a scan operation independently over each of multiple segments of the input.
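The per-segment semantics can be sketched sequentially (offsets here follow the usual CUB begin/end-offset convention, where offsets has num_segments + 1 entries):

```cpp
#include <numeric>
#include <vector>

// Sequential reference for a segmented inclusive sum-scan: each segment is
// scanned independently; running sums never carry across segment boundaries.
std::vector<int> segmented_inclusive_scan(const std::vector<int>& input,
                                          const std::vector<int>& offsets) {
    std::vector<int> out(input.size());
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        std::partial_sum(input.begin() + offsets[s],
                         input.begin() + offsets[s + 1],
                         out.begin() + offsets[s]);
    }
    return out;
}
```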

Binary Search - cub::DeviceFind::[Upper/LowerBound] performs a parallel search for multiple values in an ordered sequence.
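For each query value, LowerBound returns where that value would be inserted in the ordered sequence; a sequential C++ reference (using std::lower_bound) looks like this:

```cpp
#include <algorithm>
#include <vector>

// Sequential reference for LowerBound semantics: for each query, the index
// of the first element in the sorted haystack that is not less than it.
std::vector<std::size_t> lower_bound_indices(const std::vector<int>& sorted,
                                             const std::vector<int>& queries) {
    std::vector<std::size_t> out;
    out.reserve(queries.size());
    for (int q : queries)
        out.push_back(static_cast<std::size_t>(
            std::lower_bound(sorted.begin(), sorted.end(), q) - sorted.begin()));
    return out;
}
```

The device algorithm performs many such searches in parallel, one per query.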

Search - cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.
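FindIf computes the same answer as a sequential predicate search; the difference is that the device algorithm can stop launching work once a match is found, which is where the early-exit speedup comes from. A sequential reference:

```cpp
#include <algorithm>
#include <vector>

// Sequential reference for FindIf semantics: index of the first element
// satisfying the predicate, or the input size if no element does.
std::size_t find_if_index(const std::vector<int>& data, int threshold) {
    auto it = std::find_if(data.begin(), data.end(),
                           [threshold](int x) { return x > threshold; });
    return static_cast<std::size_t>(it - data.begin());
}
```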

Full Changelog: https://github.com/NVIDIA/cccl/compare/v3.1.4...v3.2.0

What's Changed

🚀 Thrust / CUB

libcu++

  • Added cuda::barrier and cuda::memcpy_async_tx examples using TMA by @bernhardmgruber in [#6231]
  • Waiting on a cuda::barrier on SM90+ is now faster and produces less code by @bernhardmgruber in [#6007]
  • Improved cuda::memcpy_async codegen by @bernhardmgruber in [#5996]
  • Improved TMA codegen on sm120 in cuda::memcpy_async, cuda::device::memcpy_async_tx, and cub::DeviceTransform by @bernhardmgruber in [#6362]

🤝 cuda.coop

🔄 Other Changes

New Contributors


Source: README.md, updated 2026-01-23