Performance Optimizations
Intel Architecture Processors
- Introduced initial support for future Intel Xeon processors with the Intel AVX 10.2 and Intel AMX instruction sets. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (see the sketch after this list).
- Introduced initial support for future Intel Core processors with the Intel AVX 10.2 instruction set. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
. - Improved initialization time for convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
- Improved performance of `fp8` convolution primitive with scales and `bf16` output.
- Improved performance of matmul primitive with post-ops on processors with Intel AMX support.
- Improved performance of RNN primitive for LBR_GRU and VANILLA_LSTM cell types on processors with Intel AVX2 instruction set support
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with implicit causal mask.
  - Grouped Query Attention (GQA) flavor specific to GEMMA models.
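A minimal sketch of the AVX10.2 opt-in mentioned above. The `ONEDNN_MAX_CPU_ISA` value comes from these notes; the `dnnl::cpu_isa` enumerator name in the commented alternative is an assumption mirroring that value, so check the v3.9 headers for the exact spelling.

```cpp
// Opting into the AVX10.2 code paths (not dispatched by default).
#include <cstdlib> // setenv (POSIX)
#include "dnnl.hpp"

int main() {
    // Documented opt-in: must be set before oneDNN dispatches its
    // first kernel in the process.
    setenv("ONEDNN_MAX_CPU_ISA", "AVX10_2_512_AMX_2", /*overwrite=*/1);

    // Programmatic alternative via the existing dispatcher control API
    // (enumerator name assumed for v3.9):
    // dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx10_2_512_amx_2);

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    return 0;
}
```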
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved matmul performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved RNN primitive performance with LBR_GRU cell type.
- Improved `int8` convolution performance with plain weights and trivial filter.
- Improved convolution performance with `NCHW` activations with 1x1 filter and unit strides.
- Improved `fp32` softmax performance.
- Improved performance of reorder when used with USM host memory.
- Improved performance of the following subgraphs with Graph API:
  - `fp32` SDPA with implicit causal mask.
  - `fp16` SDPA on Intel GPUs without Intel XMX cores.
AArch64-based Processors
- Improved `int8` convolution performance.
- Improved `bf16` depthwise convolution performance.
- Improved `f16` matmul performance with Arm Compute Library (ACL).
Functionality
Functional API
- Introduced Root Mean Square Normalization (RMSNorm) mode for the layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs (see the sketches after this list).
- Sparse memory objects and sparse matmul are promoted to production status.
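A minimal sketch of the RMSNorm mode, assuming the new mode is exposed as a `normalization_flags::rms_norm` flag on the existing layer normalization primitive; the flag name is an assumption, so verify it against the v3.9 headers.

```cpp
// Layer normalization in (assumed) RMSNorm mode: no mean subtraction,
// per-channel scale applied after normalization.
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    memory::desc md({2, 16, 64}, memory::data_type::f32,
            memory::format_tag::tnc); // {time, batch, channels}
    memory::desc scale_md({64}, memory::data_type::f32,
            memory::format_tag::a);

    // normalization_flags::rms_norm is the assumed name of the new mode.
    auto pd = layer_normalization_forward::primitive_desc(eng,
            prop_kind::forward_inference, md, md, 1e-5f,
            normalization_flags::rms_norm | normalization_flags::use_scale);

    memory src(md, eng), dst(md, eng), scale(scale_md, eng);
    layer_normalization_forward(pd).execute(s,
            {{DNNL_ARG_SRC, src}, {DNNL_ARG_SCALE, scale},
             {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}
```

For the now-production sparse functionality, a CSR matmul sketch under the assumption that the API kept the shape of the former experimental sparse API (`memory::desc::csr` and multi-buffer sparse memory objects):

```cpp
// CSR-encoded source times dense weights; the descriptor helper and
// its signature follow the former experimental API and are assumptions.
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    const memory::dim M = 4, K = 8, N = 16, nnz = 6;
    auto a_md = memory::desc::csr({M, K}, memory::data_type::f32, nnz,
            memory::data_type::s32, memory::data_type::s32);
    auto b_md = memory::desc({K, N}, memory::data_type::f32,
            memory::format_tag::ab);
    auto c_md = memory::desc({M, N}, memory::data_type::f32,
            memory::format_tag::ab);

    auto pd = matmul::primitive_desc(eng, a_md, b_md, c_md);
    memory a(a_md, eng); // holds values, indices, and pointers buffers
    memory b(b_md, eng), c(c_md, eng);

    matmul(pd).execute(s, {{DNNL_ARG_SRC, a}, {DNNL_ARG_WEIGHTS, b},
            {DNNL_ARG_DST, c}});
    s.wait();
    return 0;
}
```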
Graph API
- Introduced support for tanh approximation in the `GELU` operation (see the sketch after this list).
- Extended the Graph API `Softmax` operation to support optional `stats` output.
- Introduced fusion support for SDPA training forward and backward propagation.
- Introduced fusion support for SDPA with bottom-right implicit causal mask.
- Introduced `make_scalar_tensor()` API for engine-agnostic scalar tensor creation.
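A hedged sketch of requesting the tanh approximation through the Graph API. The graph-building calls are the standard ones; the attribute name (`op::attr::mode`) as applied to `GELU` and the value `"gelu_tanh"` are assumptions, so verify them against the v3.9 operation spec.

```cpp
// Building a one-op graph with GELU in (assumed) tanh-approximation mode.
#include "oneapi/dnnl/dnnl_graph.hpp"

int main() {
    using namespace dnnl::graph;

    logical_tensor in {0, logical_tensor::data_type::f32, {8, 1024},
            logical_tensor::layout_type::strided};
    logical_tensor out {1, logical_tensor::data_type::f32, {8, 1024},
            logical_tensor::layout_type::strided};

    op gelu(0, op::kind::GELU, {in}, {out}, "gelu");
    // New in v3.9: request the tanh approximation (attribute name and
    // value are assumptions).
    gelu.set_attr<std::string>(op::attr::mode, "gelu_tanh");

    graph g(dnnl::engine::kind::cpu);
    g.add_op(gelu);
    g.finalize();
    auto partitions = g.get_partitions();
    (void)partitions;
    return 0;
}
```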
Microkernel API
- Introduced support for the `fp8` data type.
Intel Architecture Processors
- Introduced support for select algorithm in binary post-op.
- Introduced source, destination, and weight scales support in `fp8` convolution and deconvolution primitives (see the sketch after this list).
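A minimal sketch of attaching the new scales to an `fp8` convolution via `primitive_attr`. The tensor shapes and mask choices below are illustrative only (per-tensor for source and destination, per-output-channel for weights).

```cpp
// fp8 convolution with source, weight, and destination scales.
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    auto src_md = memory::desc({1, 32, 14, 14}, memory::data_type::f8_e4m3,
            memory::format_tag::any);
    auto wei_md = memory::desc({64, 32, 3, 3}, memory::data_type::f8_e4m3,
            memory::format_tag::any);
    auto dst_md = memory::desc({1, 64, 12, 12}, memory::data_type::f8_e4m3,
            memory::format_tag::any);

    primitive_attr attr;
    attr.set_scales_mask(DNNL_ARG_SRC, 0);          // per-tensor
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 0); // per-output-channel
    attr.set_scales_mask(DNNL_ARG_DST, 0);          // per-tensor

    auto pd = convolution_forward::primitive_desc(eng,
            prop_kind::forward_inference, algorithm::convolution_direct,
            src_md, wei_md, dst_md,
            /*strides=*/{1, 1}, /*padding_l=*/{0, 0}, /*padding_r=*/{0, 0},
            attr);
    (void)pd; // scale values are passed as DNNL_ARG_ATTR_SCALES at execute.
    return 0;
}
```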
Intel Graphics Products
- Introduced support for select algorithm in binary primitive.
Generic GPU Vendor
- Introduced support for RNN Vanilla backward propagation.
Usability
- Enabled build with the `-Wundef` compiler flag.
- [Experimental] Introduced support for kernel compilation with the SYCL kernel compiler extension.
Validation
- Improved benchdnn performance by optimizing the input data filling and results comparison steps.
- Improved benchdnn graph driver performance mode by adding a CPU memory pool to the allocator.
Known Limitations
- Group normalization with `normalization_flags::use_scale` specified produces incorrect results for backward propagation kind in oneDNN v3.9 and earlier.
- Binary primitive with certain shapes and Graph API SDPA with bottom-right causal mask may hang with the SYCL debug runtime on Windows.
- `fp8` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
- `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max Series.
- `bf16` pooling with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max Series.
- `bf16`/`fp16` matmul with large inner dimension has a performance regression on Intel Data Center GPU Max Series.
- `bf16`/`fp16` convolution with `NCHW` activations has a performance regression on Intel Data Center GPU Max Series.
- Softmax with non-trivial strides and blocked format may produce incorrect results.
- `bf16` layer normalization backpropagation may produce incorrect results on Intel Data Center GPU Max Series.
Deprecated Functionality
- BLAS-like API including the `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive (see the sketch below).
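A minimal migration sketch from the deprecated `dnnl::sgemm` to the matmul primitive for a row-major `f32` GEMM, `C = A * B`:

```cpp
// Replacing dnnl::sgemm with the matmul primitive.
#include "dnnl.hpp"
#include <vector>

int main() {
    using namespace dnnl;
    const memory::dim M = 64, N = 32, K = 128;
    std::vector<float> A(M * K), B(K * N), C(M * N);

    // Deprecated path:
    // dnnl::sgemm('N', 'N', M, N, K, 1.f, A.data(), K, B.data(), N,
    //         0.f, C.data(), N);

    // matmul replacement:
    engine eng(engine::kind::cpu, 0);
    stream s(eng);
    auto a_md = memory::desc({M, K}, memory::data_type::f32,
            memory::format_tag::ab);
    auto b_md = memory::desc({K, N}, memory::data_type::f32,
            memory::format_tag::ab);
    auto c_md = memory::desc({M, N}, memory::data_type::f32,
            memory::format_tag::ab);

    auto pd = matmul::primitive_desc(eng, a_md, b_md, c_md);
    memory a_mem(a_md, eng, A.data());
    memory b_mem(b_md, eng, B.data());
    memory c_mem(c_md, eng, C.data());

    matmul(pd).execute(s, {{DNNL_ARG_SRC, a_mem},
            {DNNL_ARG_WEIGHTS, b_mem}, {DNNL_ARG_DST, c_mem}});
    s.wait();
    return 0;
}
```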
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.