

The CCCL team is excited to announce the 3.2 release of the CUDA Core Compute Libraries (CCCL). Highlights include new modern CUDA C++ runtime APIs and new speed-of-light algorithms, including Top-K selection.

Modern CUDA C++ Runtime

CCCL 3.2 broadly introduces new, idiomatic C++ interfaces for core CUDA runtime and driver functionality.

If you’ve written CUDA C++ for a while, you’ve likely built (or adopted) some form of convenience wrapper around today’s C-style APIs such as cudaMalloc or cudaStreamCreate.

The new APIs added in CCCL 3.2 are meant to provide the productivity and safety benefits of C++ for core CUDA constructs so you can spend less time reinventing wrappers and more time writing kernels and algorithms.
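As a point of reference, here is the sort of hand-rolled RAII wrapper the paragraph above alludes to, sketched with the standard CUDA runtime API (the class name is illustrative, not from CCCL):

```cpp
// A typical hand-rolled RAII wrapper around cudaStream_t -- the kind of
// boilerplate that a vocabulary type like cuda::stream makes unnecessary.
#include <cuda_runtime.h>
#include <stdexcept>

class stream_wrapper {
    cudaStream_t stream_{};
public:
    stream_wrapper() {
        if (cudaStreamCreate(&stream_) != cudaSuccess)
            throw std::runtime_error("cudaStreamCreate failed");
    }
    ~stream_wrapper() { cudaStreamDestroy(stream_); }
    // Non-copyable: the wrapper uniquely owns the stream handle.
    stream_wrapper(const stream_wrapper&) = delete;
    stream_wrapper& operator=(const stream_wrapper&) = delete;
    cudaStream_t get() const { return stream_; }
};
```

Every team tends to write some variant of this; the new vocabulary types aim to make that per-project boilerplate unnecessary.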

Highlights:

  • New convenient vocabulary types for core CUDA concepts (cuda::stream, cuda::event, cuda::arch_traits)
  • Easier memory management with Memory Resources and cuda::buffer
  • More powerful and convenient kernel launch with cuda::launch

Example (vector add, revisited):

:::cpp
cuda::device_ref device = cuda::devices[0];
cuda::stream stream{device};
auto pool = cuda::device_default_memory_pool(device);

int num_elements = 1000;
auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0);
auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0);
auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

constexpr int threads_per_block = 256;
auto config = cuda::distribute<threads_per_block>(num_elements);
auto kernel = [] __device__ (auto config, cuda::std::span<const float> A, 
                                            cuda::std::span<const float> B, 
                                            cuda::std::span<float> C){
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size())
        C[tid] = A[tid] + B[tid];
};
cuda::launch(stream, config, kernel, config, A, B, C);

(Try this example live on Compiler Explorer!)

A forthcoming blog post will go deeper into the details, the design goals, intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.

New Algorithms

Top-K Selection

CCCL 3.2 introduces cub::DeviceTopK (for example, cub::DeviceTopK::MaxKeys) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5x speedups over a full radix sort, and can reduce memory consumption when you don’t need sorted results.
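For reference, the selection that Top-K computes (here, the K largest keys) matches this sequential C++ sketch; the device algorithm produces the same selection on the GPU without sorting the whole input (and, unlike this sketch, does not guarantee the results are sorted):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Sequential reference for Top-K selection: return the k largest elements.
std::vector<float> top_k_max(std::vector<float> keys, std::size_t k) {
    k = std::min(k, keys.size());
    // Partially sort so the k largest elements land (descending) at the front.
    std::partial_sort(keys.begin(), keys.begin() + k, keys.end(),
                      std::greater<float>());
    keys.resize(k);
    return keys;
}
```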

Top‑K is an active area of ongoing work for CCCL: our roadmap includes planned segmented Top‑K as well as block‑scope and warp‑scope Top‑K variants. See what’s planned and tell us what Top‑K use cases matter most in CCCL GitHub issue [#5673].


Fixed-size Segmented Reduction

CCCL 3.2 provides a new cub::DeviceSegmentedReduce overload that accepts a uniform segment_size, eliminating offset-iterator overhead in the common case where all segments have the same size. This enables speedups of up to 66x for small segments and up to 14x for large segments.

:::cpp
// New overload accepts a fixed segment_size instead of per-segment
// begin/end offsets. As with other CUB device algorithms, call once with
// d_temp == nullptr to query temp_bytes, allocate, then call again to run.
cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,
                                num_segments, segment_size);
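The semantics of a fixed-size segmented sum can be stated as a short sequential C++ reference (segment i reduces input[i*segment_size, (i+1)*segment_size)):

```cpp
#include <numeric>
#include <vector>

// Sequential reference for a fixed-size segmented sum: each segment has
// exactly segment_size consecutive elements, so no offset array is needed.
std::vector<int> segmented_sum(const std::vector<int>& input,
                               int num_segments, int segment_size) {
    std::vector<int> out(num_segments);
    for (int i = 0; i < num_segments; ++i) {
        auto first = input.begin() + i * segment_size;
        out[i] = std::accumulate(first, first + segment_size, 0);
    }
    return out;
}
```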


Additional New Algorithms in CCCL 3.2

Segmented Scan - cub::DeviceSegmentedScan efficiently computes a scan operation independently over each of multiple segments of the input.
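The per-segment semantics can be sketched sequentially (offsets here follow the usual CUB begin/end-offset convention, where offsets has num_segments + 1 entries):

```cpp
#include <numeric>
#include <vector>

// Sequential reference for a segmented inclusive sum-scan: each segment is
// scanned independently; running sums never carry across segment boundaries.
std::vector<int> segmented_inclusive_scan(const std::vector<int>& input,
                                          const std::vector<int>& offsets) {
    std::vector<int> out(input.size());
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        std::partial_sum(input.begin() + offsets[s],
                         input.begin() + offsets[s + 1],
                         out.begin() + offsets[s]);
    }
    return out;
}
```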

Binary Search - cub::DeviceFind::[Upper/LowerBound] performs a parallel search for multiple values in an ordered sequence.
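For each query value, LowerBound returns where that value would be inserted in the ordered sequence; a sequential C++ reference (using std::lower_bound) looks like this:

```cpp
#include <algorithm>
#include <vector>

// Sequential reference for LowerBound semantics: for each query, the index
// of the first element in the sorted haystack that is not less than it.
std::vector<std::size_t> lower_bound_indices(const std::vector<int>& sorted,
                                             const std::vector<int>& queries) {
    std::vector<std::size_t> out;
    out.reserve(queries.size());
    for (int q : queries)
        out.push_back(static_cast<std::size_t>(
            std::lower_bound(sorted.begin(), sorted.end(), q) - sorted.begin()));
    return out;
}
```

The device algorithm performs many such searches in parallel, one per query.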

Search - cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.
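FindIf computes the same answer as a sequential predicate search; the difference is that the device algorithm can stop launching work once a match is found, which is where the early-exit speedup comes from. A sequential reference:

```cpp
#include <algorithm>
#include <vector>

// Sequential reference for FindIf semantics: index of the first element
// satisfying the predicate, or the input size if no element does.
std::size_t find_if_index(const std::vector<int>& data, int threshold) {
    auto it = std::find_if(data.begin(), data.end(),
                           [threshold](int x) { return x > threshold; });
    return static_cast<std::size_t>(it - data.begin());
}
```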

Full Changelog: https://github.com/NVIDIA/cccl/compare/v3.1.4...v3.2.0

What's Changed

🚀 Thrust / CUB

libcu++

  • Added cuda::barrier and cuda::memcpy_async_tx examples using TMA by @bernhardmgruber in [#6231]
  • Waiting on a cuda::barrier on SM90+ is now faster and produces less code by @bernhardmgruber in [#6007]
  • Improved cuda::memcpy_async codegen by @bernhardmgruber in [#5996]
  • Improved TMA codegen on sm120 in cuda::memcpy_async, cuda::device::memcpy_async_tx, and cub::DeviceTransform by @bernhardmgruber in [#6362]

🤝 cuda.coop

🔄 Other Changes

New Contributors


Source: README.md, updated 2026-01-23