| Name | Modified | Size |
|------|----------|------|
| mnn_3.2.0_windows_x64_cpu_opencl.zip | 2025-06-06 | 202.7 MB |
| mnn_3.2.0_macos_x64_arm82_cpu_opencl_metal.zip | 2025-06-06 | 9.2 MB |
| mnn_3.2.0_linux_x64_cpu_opencl.zip | 2025-06-06 | 207.8 MB |
| mnn_3.2.0_android_armv7_armv8_cpu_opencl_vulkan.zip | 2025-06-06 | 4.4 MB |
| mnn_3.2.0_ios_armv82_cpu_metal_coreml.zip | 2025-06-06 | 2.1 MB |
| 3.2.0 (MoE architecture and Omni support / speculative decoding / new template engine / KleidiAI feature updates / startup performance optimization) source code.tar.gz | 2025-06-06 | 85.5 MB |
| 3.2.0 (MoE architecture and Omni support / speculative decoding / new template engine / KleidiAI feature updates / startup performance optimization) source code.zip | 2025-06-06 | 90.8 MB |
| README.md | 2025-06-06 | 7.6 kB |

Totals: 8 items, 602.5 MB


3.2.0 Release Note

Thanks to the efforts of the community and the development team, MNN 3.2 has been officially released! This version brings improvements in performance optimization, feature expansion, and bug fixes, with particular emphasis on Large Language Model (LLM) support and inference acceleration. The key updates follow.

🚀 New Feature Support

1. LLM-related enhancements
   a. Added support for running Qwen Omni models.
   b. Added support for Qwen3 MoE models.
   c. Added support for exporting Gemma3 (4B/1B), InternVL2.5-1B, and DeepSeek-VL models, with GGUF conversion.
   d. Added support for FastVLM and SmolVLM models.
   e. The LLM export tool now supports additional low-bit quantization widths (2, 3, 5, 6, and 7 bits).
   f. Added a jinja2 template engine for managing LLM conversation templates (see the template sketch after this section).
   g. Changed the LLM export tool's default quantization block size to 64; testing shows the same model size as Q4_1 with higher accuracy (see the quantization sketch after this section).
   h. Added support for converting the GGUF model format to MNN.
2. KleidiAI integration updates
   a. Added an asymmetric Q4 quantization kernel implementation.
   b. Added FP16/FP32 SME kernel implementations.
3. LLM inference acceleration
   a. Added speculative decoding (a lookahead algorithm) driven by the user's input, improving decoding efficiency by 2-3x in typical scenarios such as code completion (see the decoding sketch after this section).
4. Model conversion enhancements
   a. Added ONNX metadata conversion and reading, preserving input order (see the metadata sketch after this section).
5. Hardware acceleration support
   a. Added a Qualcomm QNN backend; it currently supports CV models only.
6. Library size reduction
   a. Added the MNN_REDUCE_SIZE macro; when enabled, it turns off certain optimization branches to reduce library size.
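MNN now manages conversation templates with a jinja2-style engine. As a minimal illustration of what such a template does, the sketch below renders a simplified ChatML-style template with the standalone `jinja2` Python package; the template text is illustrative, not MNN's bundled template.

```python
from jinja2 import Template

# A simplified ChatML-style conversation template (illustrative only;
# real chat templates ship with each model's configuration).
CHAT_TEMPLATE = (
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "<|im_start|>assistant\n"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is MNN?"},
]

# Render the conversation into the flat prompt string the model consumes.
prompt = Template(CHAT_TEMPLATE).render(messages=messages)
print(prompt)
```

Keeping the template as data rather than code means a new model's conversation format can be supported without rebuilding the runtime.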
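For intuition on the block-size change: in block-wise quantization, each block of weights shares one scale and one zero point, so the block size trades metadata overhead against accuracy. The numpy sketch below is a generic asymmetric 4-bit block-wise quantizer with block size 64; MNN's actual storage layout and rounding details may differ.

```python
import numpy as np

BLOCK = 64  # MNN's new default quantization block size

def quantize_q4_blockwise(w: np.ndarray):
    """Asymmetric 4-bit block-wise quantization (minimal sketch).

    Each BLOCK-sized group gets its own scale and zero point, so an
    outlier only degrades accuracy within its own block.
    """
    w = w.reshape(-1, BLOCK)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0          # 4 bits -> 16 levels (0..15)
    scale[scale == 0] = 1.0           # guard against constant blocks
    q = np.clip(np.round((w - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

w = np.random.randn(4096).astype(np.float32)
q, scale, lo = quantize_q4_blockwise(w)
w_hat = dequantize(q, scale, lo)
print("max abs error:", np.abs(w - w_hat.reshape(-1)).max())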
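The release note does not spell out the lookahead algorithm. The sketch below illustrates the general idea of input-driven speculative decoding, often called prompt lookup: draft tokens are copied from a matching span of the user's input, then verified against the model, so verbatim continuations (common in code completion) land several tokens per step. The `next_token_argmax` callback is a hypothetical stand-in for a real model forward pass.

```python
def propose_drafts(context, prompt_tokens, ngram=3, k=4):
    """Propose draft tokens by matching the last `ngram` generated
    tokens against the user's input and copying what followed there."""
    tail = context[-ngram:]
    if len(tail) < ngram:
        return []
    for i in range(len(prompt_tokens) - ngram):
        if prompt_tokens[i:i + ngram] == tail:
            return prompt_tokens[i + ngram:i + ngram + k]
    return []

def decode_step(context, prompt_tokens, next_token_argmax):
    """One decoding step: verify drafts greedily, keeping tokens up to
    and including the first mismatch. A real implementation scores all
    draft positions in a single batched forward pass, which is why the
    small-batch GEMV case ([e, l, h] with 1 < e < 16, see the
    performance section below) matters."""
    drafts = propose_drafts(context, prompt_tokens)
    accepted = []
    for draft in drafts or [None]:       # no drafts -> normal decoding
        target = next_token_argmax(context + accepted)
        accepted.append(target)
        if target != draft:              # first mismatch stops the step
            break
    return accepted

# Toy usage: the "model" deterministically continues the prompt, so
# drafts are always accepted and four tokens land in one step.
prompt = list("def add(a, b): return a + b")
oracle = lambda ctx: prompt[len(ctx)] if len(ctx) < len(prompt) else "<eos>"
print(decode_step(prompt[:10], prompt, oracle))
```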
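The metadata being carried over corresponds to ONNX's `metadata_props` field. The sketch below uses the official `onnx` Python package to show the kind of information the converter now reads and preserves; `model.onnx` is a placeholder path.

```python
import onnx

# Inspect the metadata_props that the converter can now carry over.
model = onnx.load("model.onnx")
for prop in model.metadata_props:        # StringStringEntryProto entries
    print(f"{prop.key} = {prop.value}")

# Input order is preserved during conversion; graph.input lists it.
print([inp.name for inp in model.graph.input])
```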

⚙️ Performance & Accuracy Optimization 1. Startup Optimization a. CPU backend supports mmap-based cache file generation with automatic deletion of original weight files to save storage space b. Optimized weight rearrangement and loading time for CPU Int8/Int4 modes, reducing LLM loading time to 1/3 of original (faster than filling 200 tokens) c. OpenCL backend further optimizes loading time via zero-copy memory, reducing to 50% of original (faster than filling 200 tokens) d. Metal backend reduces Int8/Int4 weight rearrangement/loading time through zero-copy memory and kernel rewrites, achieving 1/7 of original LLM loading time 2. CPU Operator Optimization a. Redesigned weight memory layout for block-wise quantization scenarios with rewritten kernels to reduce GEMV cache misses during multi-threading, achieving over 1x improvement in LLM multi-threaded decoding performance on PC CPUs b. Added input block-wise quantization support; existing weight block-wise quantization now supports non-1x1 convolutions, significantly improving quantization accuracy for CV models (e.g., enhanced image generation quality in Diffusion scenarios) 3. GPU Operator Optimization a. Metal backend accelerated CV/Diffusion models using simdgroup instructions, improving efficiency by ~1x b. Added ScatterND operator implementation in Metal backend (supports non-overlapping cases only) c. OpenCL backend optimized GEMV expansion scenarios ([e,l,h] matrix multiplication parameters with large l/h values but small e [1<e<16]), improving performance by ~1x to support speculative decoding acceleration 🐞 Bug FixesIncluding but not limited to: 1. Fixed model conversion tool crashes on Windows under certain scenarios 2. Fixed BatchMatMul execution errors in CUDA backend 3. Fixed CPU-related errors in multimodal models during LLM execution on OpenCL backend 4. Fixed M-RoPE support issues for Qwen2/Qwen2.5 5. Fixed long audio processing errors in Qwen2-7B-Audio 6. Fixed incorrect assertion in Space2Depth operator requiring blockSize > 1 7. Fixed LLM execution errors on Armv7a architecture 8. Fixed ONNX quantized model conversion failures 9. Corrected HIAI operator implementation errors

Source: README.md, updated 2025-06-06