The interactive file manager requires Javascript. Please enable it or use sftp or scp.
You may still browse the files here.

Name	Modified	Size	InfoDownloads / Week
Parent folder
README.md	2026-04-08	131.0 kB	0
v2.5.0 source code.tar.gz	2026-04-08	7.4 MB	0
v2.5.0 source code.zip	2026-04-08	9.3 MB	1
Totals: 3 Items		16.9 MB	1

FastDeploy Release 2.5 Release Note

新增功能

新模型支持

新增Qwen3-VL模型部署支持 [#5763]
新增Qwen3-VL MoE模型部署支持 [#5913]
新增Qwen3-VL和Qwen3-VL MoE CUDA Graph支持 [#5962]
新增GLM模型TP+DP+EP支持 [#6317]

新量化方法支持

新增W4AFP8量化方法支持(v1_loader和v0_loader，支持TP>1) [#5757]
新增NVFP4 MoE在SM100上的支持 [#6003]
新增FusedMoE在Blackwell上的支持 [#5325]
新增统一量化算子 [#5991]
新增FP8量化环境变量FD_USE_PHI_FP8_QUANT支持 [#6320]
新增Weight Only量化方法对QKVGate_proj的支持 [#6669]

PD分离相关功能

新增多模态模型P/D分离支持 [#5323]
新增PD分离部署配置简化和端口管理重构 [#5415]
新增PD分离支持动态C8 IPC [#5750]
新增PD分离RDMA动态C8支持 [#5788]

CUDA Graph相关功能

新增Qwen3-VL和Qwen3-VL MoE CUDA Graph支持 [#5962]
新增reorder ids以分离prefill和decode请求的支持 [#5779]
新增full_cuda_graph控制子图切分 [#6027]
新增max_capture_shape_prefill和cudagraph_capture_sizes_prefill配置 [#6148]
支持CUDAGraph用于P/PD混合Batch，采用SOT子图切分模式 [#6196]
Cuda graph模式下跳过ATTN padding部分计算 [#5985]

RL训练相关功能

新增Rollout Routing Replay支持 [#5405]
新增V1 update/clear API for RL支持 [#6974]
新增Thinking Pattern框架优化 [#4302]
新增限制thinking内容长度的CUDA算子统一，支持回复长度限制与注入序列 [#6511]
R3支持RDMA Store [#5467]
支持通过load_weights函数加载权重 [#5549]
新增pause、update_weights、resume异步RL接口 [#6052]
支持GLM MTP RL Model [#6223] [#6267]
R3支持全层路由Fused Put [#6099]
支持SM100 FP8量化 [#6602]
支持moe_topk_select Paddle原生算子及FP8 MoE量化 [#6935]

KV Cache相关功能

新增KV Cache存储支持 [#5571]
新增attention_store KV Cache后端支持 [#5823]
新增file_store KV Cache后端支持 [#6188]
新增通过attention store上报token index支持 [#6285]
新增RDMACommunicator发送key和value scale支持 [#5737]
新增get_output_kv_signal阻塞读取模式和send_first_token支持 [#5836]

新API/接口支持

新增stop_token_ids支持 [#5399]
新增logprobs/prompt_logprobs token解码开关 [#5463]
新增请求级投机解码指标监控支持 [#5518]
新增健康检查功能 [#5534]
新增请求级延迟细粒度追踪(Tracing Part1) [#5458]
新增Entropy计算支持 [#5692] [#5730]
新增输出缓存默认启用 [#5987]
新增tag phase token enforce生成支持 [#6034]
新增SWA基于appendattn的支持 [#6594]
plugin模型支持mm_processor_kwargs [#6491]
新增多模态模型dummy run支持 [#6045]
新增Norm before Rope支持 [#6332]
新增使用phi permute/unpermute并移除swiglu [#6808]

Engine与架构优化

新增基于ZMQ通信的EngineService跨进程async_llm重构 [#4868]
新增Golang Router用于请求调度和负载均衡 [#5882] [#5966]
新增ZMQ-based FMQ实现和benchmark工具 [#5418]
新增Pool模型prefill batch推理支持 [#5436]
新增Paddle启动版本检查机制 [#5769]
新增可配置worker健康检查超时(FD_WORKER_ALIVE_TIMEOUT) [#5865]
新增FD统计信息上报 [#5646]
新增统一请求完成日志格式并增强统计信息 [#6405]
新增控制台打印统计指标 [#6339] [#6413]
新增断开连接后停止在线服务中对应请求推理的支持 [#5320]

Loader相关功能

新增V1 Loader加载静态C8 scale JSON支持 [#5909]
新增V1 Loader按自然key顺序加载safetensors权重 [#6006]
新增TP+EP 下v1_loader支持 [#5465]
新增Loader dummy load weight支持 [#6169]
新增Loader wint2后端支持 [#6139]
新增Loader处理GPU内存碎片支持 [#6790]

模型层优化

新增所有模型VocabParallelEmbedding的forward_meta支持 [#5524]
对expert_dispatch算子支持更多参数配置 [#5748]
新增FA3对GLM-RoPE的支持 [#5586]
新增EPLB冗余专家支持 [#5918]
新增normalization层参数重命名 [#6133]
新增tracelogger stacklevel支持 [#5766]
支持qkv和gate linear融合 [#6552]

性能优化

算子性能优化

优化gather_logprob算子性能 [#5817]
优化Qwen3 QK RMSNorm算子，通过融合Triton Kernel加速 [#5880]
优化mask_quant和swiglu算子性能 [#6222]
W4AFp8量化场景下gemm算子采用自适应N参数优化 [#5853]
支持FA2/FA3/FA4算子配合attn_mask_q使用 [#6354]

显存优化

MoE prefill阶段添加del操作降低峰值显存 [#5863]
Qwen模型支持动态block_wise_fp8缓存 [#5486]
移除decoder_num_blocks_device的memset操作 [#5982]

调度优化

优化engine-worker-queue任务检查性能 [#5376] [#5580]
减少blocks不足时的preemption发生频率 [#5696]
优化preemption发生时的同步状态处理 [#5796]
优化EP模式下的TTFT延迟 [#6098]
简化available_blocks分配逻辑 [#6874]
支持多模态prefill batch [#5313]

量化相关优化

支持W4AFp8 MTP量化 [#5429]
支持W4AFp8 MoE权重离线permute和加载 [#5613]
支持W4AFp8 DeepEP低延迟两阶段模式 [#5608]

图优化

PaddleOCR-VL ViT部分使用CINN优化 [#5223]
封装deep gemm和triton为python op [#5673]
为per_token_quant等算子添加infershape和dtype支持 [#5762]
封装m_grouped_gemm_fp8_fp8_bf16_nt_contiguous为自定义pyop [#5847]
从cudagraph中移除static_op_get_block_shape_and_split_kv_block [#6081]

其他性能优化

批量计算real max_logprobs优化 [#5430]
支持logprob异步拷贝 [#6362]
避免不必要的penalty计算 [#6078]
前后处理流水线不再执行dict转换 [#5494]
Qwen2.5-VL vision模型采用合并线性层和统一处理优化 [#6037]
支持在自定义allreduce中设置通信组以及解码阶段的all-to-all/transpose融合算子 [#5917]
重构chat_handler和completion_handler，提取基类并使用AsyncLLM [#5195]
更新prompt和prompt_token_ids处理逻辑 [#6334]
在不安装torch时跳过compat guard [#6926]
使用Paddle时为Triton使用独立的driver [#6983]

多硬件支持

昆仑芯XPU

新功能支持

新增 speculate_step_system_cache 支持 [#5397]
支持 get hidden state for mix 功能 [#5513]
新增 speculate_get_logits 功能 [#5497]
支持 PD Disaggregation 场景下 update_inputs_v1 算子 [#5550]
支持 EP+MTP [#5605]
支持 token num = 0 场景 [#5635]
支持 EP4TP4 配置 [#5773]
支持 EP4TP1 配置 (PD disaggregation) [#5860]
支持 Speculative Decoding with PD [#5856]
支持 mm prefill batch [#6072]
支持 plugin model [#6092]
支持 CudaGraph (block attn cuda_graph 支持) [#6116], [#6152], [#6162]
支持从 XPU EP 接口从 xDeepEP 切换到 paddle [#5706]
支持 recover batch sequence [#6142]
支持 noaux_tc [#6326]

性能优化

重构 moe ffn 优化性能 [#5501]
默认设置 top_p=0.0 优化性能 [#5686]
优化 logprob 性能 [#5626], [#5628]
重构 block_attn 参数 'pos_emb_type' [#5511]

Bug修复

修复 mtp multi batch 问题 [#5521]
修复 dp4 问题 [#5946]
修复 moe num_expert 问题 [#6014]
修复 multi-batch bug in VL model [#6015]
修复 text_image_gather_scatter 在 cudagraph 模式下的问题 [#6049]
修复 PD splitwise 模式下 seq_lens_encoder 重置问题 [#6048]
修复 MAX_BSZ 对齐 GPU 设置及 OCR VL 禁用 prefix cache [#5831]

沐曦Metax

新功能支持

新增 CI yaml 配置 [#5520]
支持 cudagraph [#5547]
支持 prefix caching & cpu swap [#5844]
适配不同版本 maca 的 gemm 接口 [#5905]
支持 V1_KVCACHE_SCHEDULER 和 paddleocr-vl rope mode [#5555]

性能优化

优化 MLA backend [#5258]
重构 cutlass moe 并优化 flash attention [#5361]
优化 flash attention backend [#5876]
修改 warpSize 为 WARP_SIZE [#5442]

Bug修复

修复 GetStopFlagsMulti kernel crash 问题 [#5556]
修复 metax runner 问题 [#5629]
修复大图推理时 shape 错误和输出乱码问题 [#5965]
修复 self.share_inputs['preempted_idx']=[] 使用错误 [#6038]
修复 'get_token_penalty_multi_scores' 输入错误 [#6266]
修复 issues based [#6259] [#6338]

Intel HPU

新模型支持

支持 ERNIE-4.5-21B-A3B-Thinking 模型 [#5891]

新功能支持

支持 tensor_wise_fp8 [#5324]
支持 KV cache scheduler v1 [#5648]
支持 chunked prefill [#5903]
支持 MoE EP [#5855]
支持单一 PaddleCustomDevice 发布包 [#5910]

其他

新增 HPU tensorwise_fp8 文档 [#6091]

天数Iluvatar

新功能支持

支持 V1_KVCACHE_SCHEDULER 和 paddleocr-vl rope mode [#5555]

Bug修复

修复 CUDA_VISIBLE_DEVICE 指定时的 FD 启动错误 [#5735]
修复多平台兼容性问题 (使用 paddle.device.get_device_properties) [#6400]

Bug修复

PD分离相关Bug修复

修复PD分离模式下MTP cache attaching问题 [#5884]
修复resource_manager_v1在PD模式下的锁问题 [#5616]
修复PD分离部署时cache int8的问题 [#6571]
修复mix splitwise模式下的pickle加载错误 [#5488]
修复多模态splitwise调度器的bug [#5604]
修复PD重排序问题并添加单元测试 [#6375]
修复MTP场景下PD重排序问题 [#6917]

多模态相关Bug修复

修复PaddleOCR-VL模型参数放置在CPU的问题 [#5413]
修复多模态CUDA Graph问题 [#5266]
修复音频处理结束时的bug [#5464]
修复视频处理bug [#5557]
修复encoder cache bug [#5528]
修复eb5多模态前缀缓存bug [#5638]
修复eb5多模态跳过前缀缓存问题 [#5838]
修复多模态revert bug [#5848], [#6061]
修复eb5前缀bug [#5879]
修复fa3 qwen-vl rope支持问题 [#5869]
修复PaddleOCR-VL非法内存访问问题 [#6042]
修复多模态fetch feature问题 [#6095]
限制prefill batch中多模态请求为1 [#5901]
修复SiglipEncoder中reversed_window_indices的条件判断 [#5795]
修复FlashMask在开源模型上的运行问题 [#6520]
修复MM MTP中不正确的rope embedding [#6586], [#6650]

CUDA Graph相关Bug修复

修复无法进入CUDA Graph的问题 [#5422]
修复0不进入CUDA Graph以节省内存 [#5426]
修复sm89编译错误 [#5809]
修复BatchMLAWithPagedKVCacheKernel的O_tmp问题 [#5895]
更新权重dummy run时重置shared inputs [#6418]

EP并行相关Bug修复

修复custom_all_reduce溢出问题 [#5662]
修复wint4 EP空运行导致的问题 [#5870]
修复ep_moe_expert_combine op返回值不一致问题 [#5812]
修复300B FP8 EP并行测试用例的模型加载错误 [#6436]
修复DP+EP下的NaN错误（添加进程间锁） [#6769]

MTP相关Bug修复

修复MTP在enable_logprob时无logprobs的问题 [#5499]
修复speculative decoding中的attention bug [#5460]
修复speculative decoding中write qknorm cache bug [#5491]
修复splitewise-prefill模式下multistep MTP问题 [#5723]
修复mixed和PD-split模式下multi-step MTP的attn_mask_offset问题 [#5738]
修复MTP权重加载bug [#5744]
修复MTP split kv attention问题 [#5920]
修复MTP logprob在include stop_seq时的hang问题 [#5927]
修复MTP forward meta问题 [#5976]
修复MTP logprob因max_num_logprobs导致的问题 [#6084]
修复GLM MTP中logits计算bug [#6093]
修复MTP acceptance rate下降问题 [#6471]
修复MTP在dummy run时跳过empty_input_forward [#6654]
修复MTP config在RL中的问题 [#6596]
支持suffix decoding [#6967]

Cache相关Bug修复

修复cpu prefix cache bug [#5544]
修复抢占时缓存输出问题 [#5502]
修复抢占后exist_prefill_flag问题 [#6630]
修复dynamic c8在v1 loader中的问题 [#5562]
修复dynamic c8 cache bug [#5958], [#6692]
修复cache manager在MTP或blockwise fp8时未启动的问题 [#5840]
优化cpu和storage cache的准备 [#5777]
修复cache transfer manager updating/clearing问题 [#5930]
将cache创建移回cache transfer process [#6144]
修复cache cleared后cache transfer tasks失败问题 [#6202]
修复storage_backend_type比较bug [#6522]
修复使用block_wise_fp8且无storage backend时cache transfer manager初始化失败 [#6517], [#6564]
修复recycle_gpu_blocks中的安全检查 [#6530]
修复metrics cache tokens问题 [#6001]
修复请求抢占后cache命中率和TTFT不准确问题 [#6626]
修复prefix tree updating超时问题 [#6616]
修复num_cpu_blocks计算问题 [#6473]

量化相关Bug修复

修复W4AFP8数值溢出问题 [#5634]
修复w4afp8 tp=8问题 [#5868]
增加w4afp8 gemm的shape [#5957]
适配hadamard_block_size [#5888]
修复wint2问题 [#6109]
修复weight quant op问题 [#6137]
修复fused_mask_swiglu_fp8_quant bug [#6316]
修复moe activation quant问题 [#5830]

调度相关Bug修复

修复解码时sleep bug [#5461]
修复n>1且enable-logprob时的hung问题 [#5492]
修复Chunked Prefill在max_tokens=1时的问题 [#5736]
修复抢占时超出real_bsz问题 [#5805]
修复enable output caching的bug [#6226]
设置enable_cache_output默认为false [#5751]
修复can_schedule_block_num_threshold计算问题 [#6542]

API/接口相关Bug修复

修复init RequestOutput问题 [#5419]
修复limit_thinking在CUDA kernels中的early return逻辑 [#5471]
修复speculate_limit_thinking_content_length [#5590]
修复process_response_dict支持async in serving_completion [#5758]
修复streaming response中return_token_ids启用时的冗余prompt_logprobs问题 [#5829]
修复console log metrics中waiting queue count [#6453]
支持control socket禁用选项 [#6551]
修复请求中断和推理终止功能的多个bug [#6890]

RL相关Bug修复

移除RL的shutdown_process_group/restart_process_group [#5433]
修复RL weight loading在moe layer的问题 [#5503]
修复RL load_weights [#5642]
修复rl model_weights_signal以支持tp>1 [#5639]
修复rl signal问题 [#5681]
修复RL中MTP config问题 [#6596]
支持Fully Async和PrefixCache [#6727]
支持chunked part files加载并修复IPC snapshot strategy中的model path格式 [#6910]
修复RL中update param问题 [#6722]
添加decoder rope的instantiations [#7010]

其他Bug修复

修复bf16 deepseek loader问题 [#5379]
修复deepseek torch loading [#5410]
修复clearing weight后的不稳定性 [#5493]
修复model executing在clearing/updating完成后跳过 [#5527]
修复Intel HPU平台构建脚本问题 [#5455]
修复Graph Optimization中0size bug [#5495]
修复eplb weight updating [#5529]
移除重复的PaddleOCRVLProcessor初始化代码 [#5526]
修复count_tokens_per_expert_func声明 [#5794]
修复shm在set_data_ipc中打开但未关闭的问题 [#5826]
修复entropy bugs [#5818], [#5941]
修复TP中entropy计算问题 [#5997]
只在CUDA平台运行Triton count_greater_kernel [#5846]
修复logprob因max_num_logprobs导致的问题 [#6067]
修复token_penalty kernel [#6069]
修复mask attention问题 [#6216], [#6214]
修复qk_norm optimization [#6080]
修复shared experts和dense mlp layer不需要TP split的问题 [#6180]
修复tokenizer OOM问题 [#6287]
修复heartbeat signal的sleeptime错误 [#6241]
修复zmq在sampled_token_id=0时hung问题 [#6398]
修复noaux_tc_redundant op的4个返回值处理 [#6384]
修复cutlass的lazy enable_torch_proxy [#6585]
修复reshard错误 [#6537]
修复MC_TCP_BIND_ADDRESS for mooncake store [#6783]
修复grpc在tracing init before workers forked时的失败 [#6744]
修复get_save_output_v1 socket name冲突 [#6759]
用custom_ftok替换ftok [#6824]
设置FD_USE_PHI_MOE_PERMUTE默认为0 [#6888]
修复ErrorInfo code type [#6952]
修复_disable_sequence_parallel_moe_if_needed [#5740]
修复port相关错误 [#6309]
修复download feature bug [#5669]
修复insert_zmq_task_to_scheduler break bug [#5960]
修复rebuild padding bug [#6425]
修复deepgemm import [#6452]
修复assert message [#6310]
修复double shutdown of comm group [#5715]
重命名need_block_num_signal修复shm name冲突 [#5623]
修复更新权重时启用cache storage的问题 [#6720]
修复多api server的rdma script和port check [#5935]
修复worker_process中request counting的误导性日志 [#5939]
修复v0_loader在python3.12的问题 [#6132]
修复tool_calls skipped问题 [#6166]
修复image gen问题 [#6175]
修复get_padding_offset in empty run [#6460]

其它

Benchmark

更新benchmark工具 [#5496] [#5625] [#6335]
更新backend_request_func.py [#5631] [#5633]
支持Completions接口 [#5700]
修复aiohttp streaming返回Chunk too big问题 [#5771]
更新benchmark_serving.py [#5861]
支持SGLang/VLLM获取cached tokens [#6240]
新增Qwen3 VL CE测试 [#6288]
更新README文档 [#6343]

文档

新增text/vl cinn ce配置文档 [#5532]
更新环境变量文档同步最新代码 [#5713]
更新GPU版本至2.3.2 [#5894]
更新FastDeploy版本至2.3.3 [#6010]
更新Docker镜像至2.4.0 [#6168]
新增/v1/pause、/v1/resume、/v1/is_paused接口文档 [#6192]
新增在线量化文档 [#6399]
新增环境变量文档 [#6385]

CI/测试

单元测试补充（Hackathon活动）

新增ernie4_5_vl_processor模块单测 [#5264] [#5265] [#5263]
新增spec_decode/mtp.py单测 [#5533]
新增rollout_model.py单测 [#5552]
新增openai/api_server.py单测 [#5567]
新增scheduler/local_scheduler.py单测 [#5050]
新增guided_decoding模块单测 [#5047] [#5042]
新增entrypoints/engine_client.py单测 [#5807]
新增llm.py单测 [#6108]
新增engine_worker_queue.py单测 [#6102]
新增serving_completion.py单测 [#6227]
新增resource_manager_v1.py单测 [#6243]
新增fused_moe_wint2_backend.py单测 [#6286]
新增zmq_server.py单测 [#6210]
其他功能模块单测补充 [#5057] [#5058] [#5063] [#5060] [#5059] [#5718] [#5717] [#5726] [#5609] [#5328]

CI基础设施优化

新增commit级别RL构建任务 [#5857]
新增CUDA 12.9每日构建任务 [#5936]
新增fd-router构建任务 [#5967]
新增4-GPU端到端测试任务 [#6082]
新增ep4_mtp端到端测试 [#6153]
新增GLM E2E测试（MTP及非MTP）#6163
新增attention TP单元测试 [#5887]
新增attention单元测试用例 [#5931]
新增fused_moe EP TP测试 [#5989]
新增swap_layout单元测试 [#6250]
新增SM100 FP8推理mock测试 [#6273]
新增update weights测试 [#6242]
重构RL测试复用stable_test [#5516]
重构RL测试复用test_metrics.py [#5741]
重构iluvatar_ci [#5588]
修复approve配置 [#5443]
优化stable_test资源调度 [#6235]
切换nightly构建使用FD_UNIFY_BUILD [#6246]
移除--ipc=host和--pid=host配置 [#6270]
更新build_linux_rl.yml [#6274]
更新stable test工作流 [#6352]
更新check-bypass.yml [#6360]
禁用GPU清理 [#5781]
启用custom_device_check重试 [#5786]
暂时禁用fp8测试用例 [#5963]
添加重试和清理机制 [#5725]
减少test_mtp超时时间 [#5512]
适配vl_model基线变更 [#5576] [#6033]
移除不兼容的test_metrics.py [#5578]
添加MTP accept ratio CI用例 [#5570]
添加ERNIE45T 21B sot测试 [#5538]
移除不稳定的IPC测试 [#6190]
支持异步R3精度测试 [#5937]
固定gunicorn版本至25.0.3 [#6499]
禁用test_batch_invariance_op_mm.py [#6549]
切换2.5分支使用Paddle release/3.3 [#6621]
同步CI优化到release/2.5分支 [#6684] [#6964]

代码重构/清理

Speculative Decoding（投机解码）

支持投机解码不同inferseed [#5568]
支持multi-step mtp with cudagraph [#5624] [#5886]
优化draft logprob [#5842]
返回每个head的accepted tokens [#5947]
支持GLM-4.5-Air MTP [#6047]
支持enable_thinking为false时的约束解码 [#6248]
重构MTP pre_process [#6358]

MoE优化

支持GPT-OSS MXFP4量化 [#5435]
支持FP8权重加载 [#5565]
使用max_tokens_per_expert优化MoE网格维度 [#6007]
移除permute_x_fp8_kernel模板NUM_EXPERTS_PER_RANK [#5620]
ep_moe_expert_dispatch支持num_experts_per_rank=5 [#5890]

代码清理与重构

移除无用代码，支持mixed FA3 [#5404]
FA3支持qwen3 [#5441]
支持0-dim tensor进入ar [#5451]
移除add_bias选项 [#5425]
新增cuda_graph断言并只统计实际负载 [#5445]
更新tbo相关代码 [#5485] [#6281]
清理代码 [#5543] [#5548] [#5691]
移除无效paddleocr processor分支 [#5821]
移除stop_nums [#6182]
移除flash_mask_attention未使用参数 [#6218]
移除speculate_get_padding_offset op [#6308]
移除MTP rebuild_padding无用代码 [#6336]
MLA代码清理 [#5979]
添加PADDLE_ENFORCE [#6321]

其他优化

控制台日志重定向到llm日志 [#5680]
防止Paddle版本检查时的core dump [#5657]
插件错误信息抛出 [#5675]
reschedule preempt任务支持可选函数 [#5649]
升级paddleformers至0.4.0 [#5599]
重命名tensor_parallel_degree为tensor_model_parallel_size [#5727]
flash_mask attention pybind [#5783]
禁用ernie5中的chunked_mm_input [#5774]
启用PFCC deep_ep [#5822]
新增flashinfer-python-paddle依赖 [#5912]
TSP last_norm allgather移至model.py [#5924] [#5961] [#5972]
KVCache仅在启用hierarchical cache或kv cache storage时启动传输进程 [#5871]
metrics_port参数传递 [#6056]
新增exist_prefill_flag [#6172]
新增scale_wrapper for per_block_cast_to_fp8 [#6183]
改进deep_ep导入处理 [#6207]
支持多SM架构构建到单一whl包 [#6173]
新增token生成速率监控指标 [#6236]
增强deep_ep导入，支持mixed mode flash_mask_attn [#6238]
重构execute_model支持GPU异步调度 [#6176]
移除cuda_check（多次回滚）#5883 [#5915]
新增data_processor及tool parser插件 [#6096]
新增paddleocr配置yaml [#6097]
新增目录导航到mkdocs配置 [#6121]
支持overlap schedule [#6259]
跳过paddle.to_tensor如果is_not_swapped [#6342]
KVCache Storage支持c8模型 [#6298]
支持cpu-cache-block-num监控 [#6390]
支持从fleet ops导入deepgemm/deepep [#6351]
apiserver和engine启动失败时退出 [#6322]
退出时确保无残留进程 [#6377]
懒写入日志 [#6323]

Cherry-Pick

新增reasoning effort及tool string参数支持 [#6706]
新增qwen3vl prompt_token_ids支持 [#6786]

What's Changed

[New][RL] Support Rollout Routing Replay by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5405
[loader]fix bf16 deepseek by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5379
[Loader]fix deepseek torch loading by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5410
[Loader][BugFix] Fix some parameters place on CPU in PaddleOCR-VL by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5413
[PD Disaggregation] Add timestamp for analyzing splitwise deployment by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5317
[Others] Remove useless code and support FA3 in mixed by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5404
[BugFix] fix can not enter into cuda graph by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5422
[CI]【Hackathon 9th Sprint No.16】功能模块 fastdeploy/input/ernie4_5_vl_processor/process.py 单测补充 by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5264
[BugFix] 0 not into cuda graph to save memory by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5426
[Quantization] Support w4afp8 mtp by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5429
[Feature] [Benchmark]: add ZMQ-based FMQ implementation and benchmark tools by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/5418
[PD Disaggregation] FD registers to the Router only once. by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5431
[BugFix] fix init RequestOutput by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5419
[Feature] Multimodal Model P / D Separation by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5323
[Engine] [Feature] Refactor async_llm:cross-process with EngineService，based on zmq communication by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4868
[Metax] optimize mla backend by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/5258
[BugFix] fix mm cudagraph by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5266
[Optimization] compulte real max_logprobs in batch by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5430
Remove CUDA ERROR 9 of inputs of get_padding_offset kernel by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/5440
[Graph Optimization][CINN] Use CINN in PaddleOCR-VL ViT part by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5223
[XPU] add speculate_step_system_cache by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/5397
[PD Disaggregation] Unify the disaggregation info and the pd communication by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5438
FA3 support qwen3 by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5441
[Feature] Support prefill batch inference for pooling models. by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5436
[CI]Modify approve by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/5443
allow 0-dim tensor into ar by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5451
[Others] remove add_bias option by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/5425
[Metax] modify warpSize to WARP_SIZE by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/5442
[Feature] support stop_token_ids by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5399
[Others] Maintain the mtp branch temporarily. by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5446
[CI] Add unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5328
[BugFix] [RL] remove shutdown_process_group/restart_process_group for RL by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5433
[Speculative Decoding][BugFix]Fix attention bug in spec decoding by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5460
[BugFix] Fix limit_thinking early return logic in CUDA kernels by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5471
[Others] add assert and only count the actual load in cuda_graph by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5445
[BugFix] fix audio end bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5464
[BugFix] fix decode time sleep bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5461
[PD Disaggregation] Decode does not cache requests for preallocating … by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5453
[Feature]Optimization of Thinking Pattern Framework by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4302
[Metax] refactor cutlass moe and optimize flash attention by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/5361
[BugFix] fix mix splitwise pickle load error by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5488
[CI] [XPU]ep+prefix cache+chunk prefill by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5489
[Feature]Add a switch for logprobs/prompt_logprobs token decoding. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5463
[BugFix] fix instability after clearing weight by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5493
[Optim] Improve task-checking performance in engine-worker-queue by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5376
[BugFix] fix hung when n>1 and --enable-logprob by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5492
[Benchmark] Update benchmark by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5496
[Others] update tbo related code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5485
[Docs] Fix nvidia_gpu.md, add sm80 in precompiled by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5462
[Graph Optimization][BugFix][CI] Fix 0size bug && add unitest by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5495
[BugFix] Fixed build script issue on Intel HPU platforms by @FocusLuo in https://github.com/PaddlePaddle/FastDeploy/pull/5455
[RL]Fix RL weight loading issue in moe layer by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5503
[BugFix] Fix MTP no logprobs when enable_logprob by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5499
[CI] Reduce timeout of send_request in test_mtp by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5512
[Optimization] support mm prefill batch by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5313
[XPU] support get hidden state for mix by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5513
[Feature] Support for request-level speculative decoding metrics monitoring. by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5518
[CI]【Hackathon 9th Sprint No.25】功能模块 fastdeploy/input/ernie4_5_vl_processor/image_preprocessor/image_preprocessor_adaptive.py 单测补充 by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5265
[Metax] add ci yaml by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5520
[PD Disaggregation] Distinguish the pipelines for sending kv signal in different prefill by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5514
[XPU] fix mtp multi batch by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5521
[Models] Add forward_meta to VocabParallelEmbedding of all models by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5524
[XPU] refactor of block_attn param 'pos_emb_type' by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/5511
[XPU] add speculate_get_logits by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/5497
[Doc]add text/vl cinn ce config by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/5532
[Feature][Optimization] Qwen Support Dynamic block_wise_fp8 cache by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5486
[BugFix] reschedule_preempt_task append waiting & PREEMPTED blocksize by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5506
[CI][XPU] add mtp case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5537
[BugFix] fix encoder cache bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5528
[Graph Optimization][CI] Add ERNIE45T 21B sot test by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5538
[Others] Clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5543
[CI]【Hackathon 9th Sprint No.22】功能模块 fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py 单测补充 by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5263
[Metax] adapt to the latest develop and support cudagraph by @zhang-chenyi in https://github.com/PaddlePaddle/FastDeploy/pull/5547
[Feature] Add check health in FD by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5534
[CE]add pd router and wint4 tp4 config by @xiegegege in https://github.com/PaddlePaddle/FastDeploy/pull/5554
[PD Disaggregation][XPU] update_inputs_v1 operator supports PD by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5550
[Bug Fix] Fix bug for caching output when preempted by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5502
[Metax] fix GetStopFlagsMulti kernel crash issue by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5556
[BugFix]Speculative DecodingFix write qknorm cache bug in speculative decoding by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5491
[BugFix] fix dynamic c8 in v1 loader by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5562
[CI] 【Hackathon 9th Sprint No.34】NO.34 功能模块单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5057
[Feature] Use paddle.compat.enable_torch_proxy in fastdeploy/__init__.py by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/5211
Revert "[BugFix] reschedule_preempt_task append waiting & PREEMPTED blocksize" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5575
【NewFeature】support load fp8 weight by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5565
[CI] Adapt vl_model baseline changes due to Paddle update by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5576
[Feature] Support fusedmoe on Blackwell by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5325
Revert "[Feature] Use paddle.compat.enable_torch_proxy in fastdeploy/__init__.py" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5579
[CI] Remove test_metrics.py due to incompatible forced merge by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5578
[BugFix] fix cpu prefix cache bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5544
[Feature] Tracing: Fine-Grained Tracing for Request Latency Part1 by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/5458
[RL] R3 Support RDMA Store by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5467
[BugFix] skip model executing after clearing/updating is done by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5527
[Feature] add ue8m0 for per_token_quant_fp8 by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/5563
[CI] 【Hackathon 9th Sprint No.36】NO.36 功能模块单测补充 -part by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5058
[Optim] Optimize costtime in checking tasks in engine-worker-queue by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5580
[Feature] FA3 support GLM-RoPE by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5586
[BugFix] fix video bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5557
[BugFix] fix speculate_limit_thinking_content_length by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5590
[Others] Clean code && remove GPU sync code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5548
Revert "[Feature] add ue8m0 for per_token_quant_fp8" by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5611
[Feature] [PD Disaggregation] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5415
[XPU][CI] xpu add ci test for pd by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5610
[Speculative Decoding]Support different inferseed in speculate decoding by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5568
[Intel HPU] enable tensor_wise_fp8 by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5324
[BugFix] 移除重复的 PaddleOCRVLProcessor 初始化代码 by @megemini in https://github.com/PaddlePaddle/FastDeploy/pull/5526
[CI]【Hackathon 9th Sprint No.12】功能模块 fastdeploy/spec_decode/mtp.py 单测补充 by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5533
[CI] Add CI case for MTP accept ratio by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5570
[Benchmark] Update benchmark tools by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5625
[CI]【Hackathon 9th Sprint No.14】功能模块 fastdeploy/rl/rollout_model.py 单测补充 by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5552
[XPU]logprob bug by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5626
[Metax] fix metax runner issue by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5629
[XPU] refactor moe ffn by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/5501
[benchmark] Update backend_request_func.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5631
[Model] tp+ep support v1_loader by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5465
[XPU] support for EP+MTP by @lizan1999 in https://github.com/PaddlePaddle/FastDeploy/pull/5605
[CI] 【Hackathon 9th Sprint No.36】NO.36 功能模块单测补充（修复） by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5609
[benchmark] Update backend_request_func.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5633
[Iluvatar] Support V1_KVCACHE_SCHEDULER and paddleocr-vl rope mode by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/5555
[RL]Support loading weights via the load_weights function for RL by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5549
Revert "[XPU][CI] xpu add ci test for pd" by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5645
[BugFix] fix rl model_weights_signal to support tp>1 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5639
[CI] 【Hackathon 9th Sprint No.19】NO.19 功能模块单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5063
[XPU] support token num = 0 by @lizan1999 in https://github.com/PaddlePaddle/FastDeploy/pull/5635
[RL]Fix RL load_weights by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5642
[Intel HPU] enable kv cache scheduler v1 for hpu by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5648
[RL] Update worker_process.py by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5651
[BugFix] Fix eplb weight updating by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/5529
[BugFix] fix eb5 mm prefix cache bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5638
[CI] 【Hackathon 9th Sprint No.38】NO.38 功能模块单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5060
[Metax] update ci test by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5652
[BugFix] Fix custom_all_reduce overflow by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5662
[CI] Fix unit_test error of unstable execution by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5660
[RL] provide options for whether shutdown comm group after weights cleared by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5663
[Speculative Decoding]Support multi-step mtp with cudagraph by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5624
[BugFix] Fix the W4AFP8 numerical overflow issue. by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5634
[BugFix] fix download feature bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5669
[Quantization] Support w4afp8 moe weight offline permute & load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5613
[Quantization] Support w4afp8 DeepEP low latency two stage by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5608
[Metax] update ci yaml by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5674
[BugFix] fix rl signal by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5681
[XPU][CI] xpu add ci test for pd + TP2 by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5653
Revert "Revert "[Feature] Use paddle.compat.enable_torch_proxy in fastdeploy/__init__.py"" by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5606
[log]console log to llm log by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/5680
[XPU][CI] Xpu ci update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5690
[CI] 【Hackathon 9th Sprint No.37】NO.37 功能模块单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5059
[XPU]Set top_p=0.0 by default on XPU to optimize performance by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5686
[Optim] Remove limitation of number of kvcache blocks by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5612
[Others]Prevent core dumps during Paddle version check by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5657
[Metax] update ci name by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5679
[XPU] modify speculate_verify by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/5522
[Metax]Update run_ci_metax.sh by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5698
Revert "[Optim] Remove limitation of number of kvcache blocks" by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/5702
[Docs] Update parameters documentation with latest code defaults and new parameters by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5709
[CI]【Hackathon 9th Sprint No.40】功能模块 fastdeploy/entrypoints/openai/api_server.py 单测补充 by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5567
[Others] plugin raise error msg by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5675
[Benchmark]支持Completions接口 by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/5700
[Docs] 更新环境变量文档以同步最新代码 by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5713
[Others] reschedule preempt task support optional func by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5649
[Others]upgrade paddleformer to 0.4.0 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5599
[Feature] Entropy calculation support by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5692
[Others] clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5691
Revert "[CI] Adapt vl_model baseline changes due to Paddle update" by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5732
[CI] 【Hackathon 9th Sprint No.41】NO.41 功能模块单测补充 -new by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5718
[Feature] Add entropy calculation script by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5730
[Others] Rename tensor_parallel_degree to tensor_model_parallel_size for paddleformers 0.4.1 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5727
[GraphOptimization] Wrap deep gemm and triton as python op by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5673
[BugFix] Fix Chunked Prefill when max_tokens=1 by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5736
[CI] Refactor RL tests to reuse test_metrics.py by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5741
[Speculative Decoding]Fix multistep MTP in splitewise-prefill mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5723
[CI] Refactor RL tests to reuse stable_test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5516
[BugFix] Set enable_cache_output as false by default by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5751
[XPU] refine moe_expert_ffn ut by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/5743
[Loader]Fix bug in MTP weight loading by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5744
[Metax] update ci bash by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5760
[BugFix] Fix _disable_sequence_parallel_moe_if_needed by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5740
[XPU]ZMQ logprob by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5628
[iluvatar][CI] refactor iluvatar_ci by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5588
[Optimization] refactor(chat_handler,completion_handler): extract base classes and use AsyncLLM by @memoryCoderC in https://github.com/PaddlePaddle/FastDeploy/pull/5195
[Feature] Support KV Cache Storage by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5571
[CI] Add retry and robust cleanup for removal by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5725
[Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5738
[Feature] Add startup version check mechanism for Paddle by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5769
[Benchmark]调大aiohttp 默认读 buffer size至10M，解决streaming 返回块过大报Chunk too big问题 by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/5771
[BugFix] fix mm splitwise scheduler bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5604
[Feature] pd support dy-c8 ipc by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5750
[BugFix] fix double shutdown of comm group when rank0 clears weights slower than other ranks by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5715
[CI] Disable GPU cleanup due to CI machine limitations by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5781
[BugFix] Rename need_block_num_signal to fix shm name conflict by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/5623
[Iluvatar] Fix FD launch error when specifing CUDA_VISBLE_DEVICE by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/5735
[CI] Enable custom_device_check in CI rerun by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5786
make flash_mask attention pybind by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5783
[FDConfig] disable chunked_mm_input in ernie5 by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5774
[Graph Optimization] Add infershape&dtype to per_token_quant/ep_moe_expert_combine/ep_moe_expert_dispatch_fp8/count_tokens_per_expert_func by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5762
[BugFix] Fix process_response_dict to support async in serving_completion by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5758
[Feature] tracelogger stacklevel by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5766
[BugFix] Change count_tokens_per_expert_func declaration(Tensor -> vector<Tensor>) by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5794
[BugFix] Correct condition for reversed_window_indices in SiglipEncoder by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5795
[CI] Fix path error and port conflict by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5803
[BugFix] Fix preemption out of real_bsz by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5805
[XPU] xpu support ep4tp4 by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5773
[CI]【Hackathon 9th Sprint No.55】NO.55 功能模块 fastdeploy/scheduler/local_scheduler.py 单测补充 by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5050
[Model] support more config for expert_dispatch by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5748
[CI] 【Hackathon 9th Sprint No.33】NO.33 功能模块单测补充 -new by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5726
[CI]【Hackathon 9th Sprint No.52】NO.52 功能模块 fastdeploy/model_executor/guided_decoding/ernie_tokenizer.py 单测补充 by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5047
[BugFix] Fix return value inconsistency for ep_moe_expert_combine op by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5812
[BugFix] fix compile error in sm89 by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5809
[Models] Add Qwen3-VL Model Support by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5763
[Others] remove invalid paddleocr processor elif branch by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5821
[BugFix] fix shm opened but not closed in set_data_ipc by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5826
[PD Disaggregation]remove unsed para in RDMACommManager by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5814
[RL] add lm_head_fp32 in RolloutModelConfig by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5825
[BugFix] Fix entropy bugs by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5818
[Feature] support w4afp8 v1_loader and v0_loader(tp>1) by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5757
[CI] Fix archive URL injection in tag image build by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5828
[Optimization] Optimization for gather_logprob by 10GB by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5817
[CI]【Hackathon 9th Sprint No.46】NO.46 功能模块 fastdeploy/model_executor/guided_decoding/xgrammar_backend.py 单测补充 by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5042
[BugFix] Fix moe activation quant by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5830
[CI case]Prompt logprob by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/5835
[BugFix] eb5 mm skip prefix cache by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5838
[XPU]MAX_BSZ aligns gpu settings and disable prefix cache in OCR VL by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5831
[Speculative Decoding] Optimize draft logprob by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5842
[BugFix] Only Run Triton count_greater_kernel on CUDA platform by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5846
[Metax] adapt prefix caching & cpu swap by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5844
[Others] remove template NUM_EXPERTS_PER_RANK in permute_x_fp8_kernel by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5620
[BugFix] skip mm revert by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5848
[benchmark] Update benchmark_serving.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5861
[CI] Add commit-level Linux build task for RL by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5857
[BugFix] fix cache manager not launched in case of mtp or blockwise fp8 by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5840
[APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5865
[Feature] RDMACommunicator send key and value scale by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5737
[BugFix] Refine the preparation of cpu and storage cache by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5777
[Optimization] add del to decrease peak memory in MoE prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5863
[BugFix] Fix wint4 ep issue caused by empty run by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5870
[BugFix]support fa3 qwen-vl rope by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5869
[XPU] Speculative Decoding with PD by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5856
[PD Disaggregation] Update usage of pd disaggregation and data parallel by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5742
[Others] enable use PFCC deep_ep by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5822
[RL] Change 'model' to the instance variable 'tmp_model' by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5872
[BugFix] fix w4afp8 tp=8 by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5868
[XPU][CI] Add XPU logprobs case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5874
revert cuda_check by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5883
[Metax] optimize flash attention backend by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/5876
[OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5890
[Docs] Update GPU version from 2.3.0 to 2.3.2 in installation documentation by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5894
[UT]support attention test tp by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5887
[Speculative Decoding]Support multi-step mtp with cudagraph by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5886
[BugFix] Fix redundant prompt_logprobs in the second chunk of streaming response when return_token_ids is enabled for v1/completions and fix trace file name by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5829
【BugFixfix】Adapt to hadamard_block_size by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5888
[BugFix] Storage backend gets env params by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5892
[BugFix] fix mtp cache attaching for pd disaggregation by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5884
[KVCache] launch cache transfer processes only if hierarchical cache or kv cache storage is enabled by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5871
[Bugfix]fix model weight signal tensor num by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5900
[XPU]xpu support ep4tp1 in pd disaggregation by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5860
[BugFix] fix BatchMLAWithPagedKVCacheKernel O_tmp by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5895
[Intel HPU] enable chunked prefill by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5903
[Metax] adapt to gemm interface on different versions of maca by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5905
[V1 Loader] Support loading static C8 scale JSON by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5909
[Optim] The gemm of w4afp8 adopts an adaptive N by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/5853
[Iluvatar] remove CUDA_VISIBLE_DEVICE in run_ci_iluvatar.sh by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/5916
Revert cuda check by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5915
[Feature] support rdma pd dy-c8 by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5788
[BugFix] fix eb5 prefix bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5879
[Graph Optimization] Wrap m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as custom pyop by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5847
[CI] 【Hackathon 9th Sprint No.18】NO.18 功能模块单测补充 -new by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5717
[Optimization] Reduce preemption occurrence when blocks not enough by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5696
[BugFix] fix mtp split kv attetion by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5920
[Bug fix] Limit multi-modal request for prefill batch to 1 by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5901
[Feature] Add Golang-based Router for Request Scheduling and Load Balancing by @mouxinqq in https://github.com/PaddlePaddle/FastDeploy/pull/5882
[INTEL_HPU] supported ERNIE-4.5-21B-A3B-Thinking mold by @FocusLuo in https://github.com/PaddlePaddle/FastDeploy/pull/5891
[CI] Add daily build_linux jobs for CUDA 12.9 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5936
[BugFix] resource_manager_v1 lock PD by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5616
[Models] Add Qwen3-VL Moe Model Support by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5913
[INTEL HPU] support only one release package of PaddleCustomDevice by @FocusLuo in https://github.com/PaddlePaddle/FastDeploy/pull/5910
[XPU] [CI]Update CI workflow to include all file types by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5943
[Bugfix] Fix mtp logprob hang problem when include stop_seq by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5927
[FDConfig] add flashinfer-python-paddle depend by @BingooYang in https://github.com/PaddlePaddle/FastDeploy/pull/5912
[TSP] last_norm allgather move to model.py by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5924
[BugFix] Fix misleading logging in worker_process for request counting by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5939
[Bugfix] Fix entropy calculation bugs by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5941
[CI] Temporarily disable fp8_cases in base_tests by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5963
[XPU] [CI] Lock PaddlePaddle version in run_xpu_ci_pytest.sh by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5964
[BugFix] fix dyc8 cache bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5958
[Metax] fix shape error & output garbled code when reasoning big pict… by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5965
[Bugfix] Increase the shape of w4afp8 gemm by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/5957
[Speculative Decoding] Return accepted tokens per head in response by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5947
[CI]【Hackathon 9th Sprint No.50】NO.50 功能模块 fastdeploy/entrypoints/engine_client.py 单测补充 -part [#5045] by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5807
Revert "[TSP] last_norm allgather move to model.py" by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5961
Revert "Revert "[TSP] last_norm allgather move to model.py"" by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5972
[XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5878
[Models] Qwen3VL and Qwen3VL-Moe CUDA graph Support by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5962
[Feature] Support redundant expert for eplb by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/5918
[Metax] add ci test file & update run_ci_metax.sh by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5975
[XPU] fix dp4 by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/5946
dev_fix_mtp_forward_meta by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5976
MLA clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5979
[Optimization] Remove decoder_num_blocks_device memset by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5982
[BugFix] [MultiAPIServer] fix rdma script and port check for multi api server by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5935
[Optimization] Support setting communication groups in custom_allreduce and the all-to-all\transpose fused operator during the decoding phase. by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5917
[XPU]add ci test cast for P_EP4TP4 D_EP4TP1 by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5988
[ci case]Check the chunking of the chat interface by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/5981
[Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5880
[CI] Add fd-router build_task by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5967
[Docs] Update FastDeploy version to 2.3.3 in NVIDIA GPU installation documentation by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6010
[BugFix] Fix entropy calculation issue in TP by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5997
[BugFix] Fix insert_zmq_task_to_scheduler break bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5960
[Optimization] Do not compute ATTN padding part in In Cuda graph mode by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5985
[V1 Loader] Load safetensors weights in natural keyorder by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6006
[BugFix] fix metrics cache tokens by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6001
[XPU][CI] Update XVLLM_PATH setup in run_xpu_ci_pytest.sh by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6018
[Metax][CI] update test_ernie_28b_vl.py by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6019
[Metax][CI] update test_ernie_28b_vl.py image result keywords by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6022
[Featue] Enable output caching by default by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5987
[BugFix] fix cache transfer manager updating/clearing by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5930
[Graph Optimization] Add full_cuda_graph to control subgraph split by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/6027
[Optim] Robust sync status when preempted happens by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5796
[XPU][CI] Cache queue port bug fix by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6030
[Metax][CI] remove 28B VL model test sampling randomness by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6032
[CI] Adapt vl_model baseline changes due to Paddle update_2 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6033
[Metax][Doc] update metax gpu 'get_started' doc by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6035
[Metax][Fix] fix self.share_inputs['preempted_idx']=[] incorrect use by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6038
[Feature]Report FD statistical information by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/5646
[Optimize] Qwen2.5-VL vision model with merged linear layers and unif… by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/6037
[XPU] fix multi-batch bug in VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/6015
[RL][CI] Support Async R3 And Add Accuracy Test by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5937
[XPU] fix moe num_expert by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/6014
[BugFix] fix PaddleOCR-VL illegal memory by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/6042
[Intel HPU] enable MoE EP for hpu by @yanfeich in https://github.com/PaddlePaddle/FastDeploy/pull/5855
[Feature] get_output_kv_signal blocking read mode & send_first_token by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5836
[XPU][CI] update paddle version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6044
【Optim】Optimize grid dimensions using max_tokens_per_expert for MoE models by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/6007
[XPU] Speculate Decoding + PD, benchmark fix by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/6036
[CI]Add more cases for attention unit test by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5931
[Feature]Support tag phase token enforce generation by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/6034
[XPU] fix(xpu_model_runner): reset seq_lens_encoder to 0 for decode role in PD splitwise mode by @zhuangzhuang12 in https://github.com/PaddlePaddle/FastDeploy/pull/6048
[UNITEST] make EP TP test_fused_moe CI by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5989
[Feature] Unify quant ops by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/5991
[CI] fix port conflict by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6054
[Feature] Support stopping the inference for the corresponding request in the online service after a disconnection request. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5320
[Speculative Decoding] Support MTP for GLM-4.5-Air by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6047
[Metax][CI] update jenkins github action version by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6065
only update self.exist_prefill_task_signal in v0 by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6064
[Bugfix] Fix logprob issues caused by max_num_logprobs by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6067
[XPU][CI] XPU CI refactor by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6053
[BugFix] fix mm revert bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6061
[CI] Add 4-GPU e2e test job by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6082
[BugFix] Fix qk_norm optimization by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6080
[XPU] xpu support mm prefill batch by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/6072
[Speculative Decoding][Bugfix] Fix MTP logprob issues caused by max_num_logprobs by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6084
[Optimization] Avoid unnecessary penalty computation by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6078
[CI]Fix test cases failing under Python 3.12 by @ChowMingSing in https://github.com/PaddlePaddle/FastDeploy/pull/6059
[XPU][CI] Xpu ci update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6089
[XPU] add pd+mtp ci by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/6090
[FDConfig] transfer metrics_port by @CyanScholar in https://github.com/PaddlePaddle/FastDeploy/pull/6056
[CE]add paddleocr config yaml by @xiegegege in https://github.com/PaddlePaddle/FastDeploy/pull/6097
[Speculative Decoding][Bugfix] Fix logits computation bug in GLM MTP by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6093
[Feature] Add PaddleFormers fallback backend by @jackyYang6 in https://github.com/PaddlePaddle/FastDeploy/pull/5999
[XPU] support plugin model by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/6092
[Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/6081
[CI]Fix test case by @ChowMingSing in https://github.com/PaddlePaddle/FastDeploy/pull/6111
[Docs]: fix pre-commit error of markdown by @jackyYang6 in https://github.com/PaddlePaddle/FastDeploy/pull/6100
[XPU] Support CudaGraph(add block attn cuda_graph support) by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6116
[Docs]fix doc by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6119
[WIP] Add directory guide to mkdocs configuration by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6121
[XPU]XPU FD Release/2.4 Note by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6125
[BugFix] fix wint2 by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/6109
[CI] Enable 4-GPU e2e test in nightly and fix docker_tag_build by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6128
[Metax][CI] restore 'moe_expert_dispatch' outputs by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6130
[Intel HPU] add HPU tensorwise_fp8 readme by @yanfeich in https://github.com/PaddlePaddle/FastDeploy/pull/6091
[XPU] Update Dummy Run To Suppport Mutil-Batch Execution by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6123
[Iluvartar][CI] Fix the error max_tokens_per_expert referenced before… by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/6083
[BugFix] fix python3.12 v0_loader by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/6132
[Models]Rename params of normalization layer. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/6133
[XPU] change XPU EP interface from xDeepEP to paddle by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5706
[XPU]Update Release Note For Release2.4 by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6143
[docx] by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/6145
[Optimization] The pre- and post-processing pipeline do not perform dict conversion by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5494
[BugFix] fix weight quant op by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6137
[RL] router supports divided rollout by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6150
Support MXFP4 for GPT-OSS by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/5435
[XPU][Graph Optimization] XPU Support CUDAGraph by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6152
[CI] Add ep4_mtp e2e test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6153
[RL][R3] Fix typo by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/6046
【Optimization】update data_processor & add tool parser plugins by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6096
[RL] [APIServer] add more status codes for update/clear api by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6141
[XPU] Enable CudaGraph by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6162
[Feature] [KVCache] support attention_store kv cache backend by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5823
[Graph Optimization] Add max_capture_shape_prefill && cudagraph_capture_sizes_prefill by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/6148
[Docs] Update FastDeploy Docker image to 2.4.0 for Nvidia GPU installation by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6168
[RL] add pause, update_weights, resume interface for async RL by @mitu626 in https://github.com/PaddlePaddle/FastDeploy/pull/6052
[UT] Add GLM E2E tests for non-MTP and MTP by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6163
[BugFix] skipped tool_calls by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6166
[BugFix]fix image gen by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6175
[Model Runner] Add exist_prefill_flag by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6172
[XPU] support recover batch sequence by @lizan1999 in https://github.com/PaddlePaddle/FastDeploy/pull/6142
add scale_wrapper for per_block_cast_to_fp8 by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6183
[Docs] add docs of /v1/pause、/v1/resume、/v1/is_paused by @mitu626 in https://github.com/PaddlePaddle/FastDeploy/pull/6192
[XPU] [CI] add xpu logprobs case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6187
[XPU] fix text_image_gather_scatter in cudagraph mode by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/6049
[BugFix] move cache creation back to cache transfer process and adapt clear/update by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6144
Improve deep_ep import handling with logging by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/6207
[CI][BugFIx] Remove flaky IPC-related test by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/6190
[build] support build sm 80,86,89,90 to one whl package by @mitu626 in https://github.com/PaddlePaddle/FastDeploy/pull/6173
[Others] remove stop_nums by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6182
[Model Runner] Prepare token count and move FA3 initialization into the graph by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6170
[Loader] support dummy load weight by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6169
[XPU][CI]Add Cuda Graph CI Case by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6229
[Bug Fix] fix mask attention by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/6216
[BugFix] fix mask_attn by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/6214
[RL] add version to the key of cache storage && refine raising error by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6160
[XPU] [CI] XPU CI Updata by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6211
remove unneeded para from flash_mask_attention by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6218
[benchmark]支持SGLang/VLLM获取cached tokens by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/6240
[Metrics] Added metrics for monitoring token generation rate per request. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/6236
[CI] add update weights tests by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/6242
[CI] adjust resource scheduling of _stable_test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6235
[CI] Switch nightly build to use FD_UNIFY_BUILD by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6246
[Others] enhance deep_ep import and support mixed mode flash_mask_attn by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/6238
[BugFix] Fix token_penalty kernel by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/6069
[Model Runner] Refactor execute_model for GPU async scheduling by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6176
[Others] Support constrained decoding when enable_thinking is false by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6248
[Bug fix] Fix multi modal fetch feature by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6095
[Models][BugFix] shared experts and dense mlp layer do not require TP split by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/6180
[CI] Add unit test for swap_layout && remove unit test of splitwise_scheduler by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6250
[CI] Fix nightly cu129 build_outputs upload failure by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6264
[RL] Support GLM MTP RL Model by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6223
[Feature] Support Ernie FP8 on sm100 by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/5593
[Feature] Support NVFP4 MoE on SM100 by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/6003
[BugFix] allow return code 250 in tests/distributed/test_fusedmoe_ep_entry.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6269
[Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph spliting mode by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/6196
[CI] Remove --ipc=host and --pid=host from _stable_test.yml by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6270
[CI] Update _build_linux_rl.yml by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/6274
[Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (… by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6266
[Feature] Enhance Router with /v1/completions, docs, scripts, and version info by @mouxinqq in https://github.com/PaddlePaddle/FastDeploy/pull/5966
[BugFix] Fix bug for enable output caching by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6226
Revert "[Feature] Support Ernie FP8 on sm100" by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6275
[CI] 【Hackathon 10th Spring No.41】功能模块 fastdeploy/entrypoints/llm.py 单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/6108
[Metax] adapt to the latest develop by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/6282
[CI] 【Hackathon 10th Spring No.30】功能模块 fastdeploy/inter_communicator/engine_worker_queue.py单测补充 by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/6102
[Feature] Support report token index by attention store by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6285
[CI] 【Hackathon 10th Spring No.38】功能模块 fastdeploy/entrypoints/openai/serving_completion.py单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/6227
[Optimize] optimize mask_quant & swiglu by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6222
Revert "[RL] Support GLM MTP RL Model" by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6301
[Bug Fix] fix tokenizer oom by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/6287
[BugFix] fix fused_mask_swiglu_fp8_quant bug by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6316
[Benchmark] Ce qwen3 vl by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/6288
[BugFix] Fix heartbeat signal's sleeptime error by @CyanScholar in https://github.com/PaddlePaddle/FastDeploy/pull/6241
[config] fix assert message by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/6310
[Feature] [KVCache] support file_store kv cache backend by @Moonchild1227 in https://github.com/PaddlePaddle/FastDeploy/pull/6188
remove speculate_get_padding_offset op by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6308
cp 1131 tbo to develop by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6281
[Feature]Support reorder ids to split prefill and decodes by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5779
[CI] 【Hackathon 10th Spring No.21】功能模块 fastdeploy/engine/sched/resource_manager_v1.py 单测补充 by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/6243
[Feature] Support Ernie FP8 on sm100 ( the fixed version) by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6304
[BugFix] Fix port-releated errors in mix mode when FD_ENABLE_INTERNAL_ADAPTER is enabled by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/6309
[CE]add 21b cpu cache ,glm mtp,glm for rl config by @xiegegege in https://github.com/PaddlePaddle/FastDeploy/pull/6328
[Others] lazy write log when writing by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6323
[RL] R3 Support Fused Put the Routing of All Layers by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/6099
[benchmark] update benchmark tools by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/6335
[CI]【Hackathon 10th Spring No.37】功能模块 fastdeploy/model_executor/layers/moe/fused_moe_wint2_backend.py单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/6286
add PADDLE_ENFORCE by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6321
[Model Runner] Support overlap schedule by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6259
[Metax][CI] update ci test files by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6340
[Feature] FD_USE_PHI_FP8_QUANT by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6320
[Feature] Fix counter release logic & update go-router download URL by @mouxinqq in https://github.com/PaddlePaddle/FastDeploy/pull/6280
[Optimize] Optimize ttft for ep by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6098
[Metax][Fix] fix issues based [#6259] by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6338
[Metax][CI] fix run_ci_metax.sh error by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6341
[benchmark] Update README.md by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/6343
[Others] Skip paddle.to_tensor if is_not_swapped by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6342
[Others] add mock unittest for sm100 FP8 inference by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6273
[Optimization]update prompt & prompt_token_ids by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6334
[RL] Support GLM MTP RL Model by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6267
[CI] Update stable test workflow and run.sh script by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6352
[XPU] support noaux_tc by @lizan1999 in https://github.com/PaddlePaddle/FastDeploy/pull/6326
[Optimization] Support FA2/FA3/FA4 with attn_mask_q by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6354
[Feature] Support Norm before Rope. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/6332
[CI] Update check-bypass.yml by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6360
Remove MTP rebuil_padding useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6336
[Metax][CI] update metax ci files by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6364
[Metax][CI] restore 21b/28b ci test file by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6368
[Feature] console print statistical metrics by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6339
[Feature] support glm tp+dp+ep by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6317
[KVCache] Storage cache supports c8 model by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6298
[BugFix] fix cache transfer tasks failure after cache cleared by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6202
[Metax][CI] add paddleocr ci test by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6379
[Metrics] Support cpu-cache-block-num by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/6390
[MTP] refactor MTP pre_process by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6358
[ci case]Prompt logprobs precision by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/6381
[Others] support import deepgemm/deepep from fleet ops by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6351
[BugFix]fix handle 4 return values from noaux_tc_redundant op by @mattheliu in https://github.com/PaddlePaddle/FastDeploy/pull/6384
[CI] 【Hackathon 10th Spring No.25】功能模块 fastdeploy/inter_communicator/zmq_server.py 单测补充 by @0Ayachi0 in https://github.com/PaddlePaddle/FastDeploy/pull/6210
[Docs] Add Doc for Online quantification by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6399
[BugFix] fix zmq hung when sampled_token_id=0 by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6398
[loader]supoort wint2 backend by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6139
[Engine] apiserver&engine exit when work failed to start by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6322
[Docs]add environment_variables by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6385
[Metax][CI] e2e ci tests enable cuda graph by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6401
[Optimization] Support logprob async copy by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6362
[Feature] consider multimodal model when dummy run by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6045
Revert "[Optimize] Optimize ttft for ep" by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6402
[BugFix][Iluvatar] Use paddle.device.get_device_properties for multi-platform compatibility by @mattheliu in https://github.com/PaddlePaddle/FastDeploy/pull/6400
[Others] Exit to ensure no residual processes (cpu cache & dp) by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6377
[BugFix] PD reorder fix and add ut by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6375
[Feature] 统一请求完成日志格式并增强统计信息 by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/6405
[XPU] change base XPU docker image by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/6411
[CI] Fix stable_test and add cherry-pick automation by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6415
[Feature] console print metrics add env by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6413
[BugFix][Cherry-Pick] add reset shared inputs when update weight dummy run(#6331) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6418
[Cherry-Pick][BugFix] Fix rebuild padding bug (#6422) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6425
[Cherry-Pick][CI]fix fa4 test (#6408)(#6424) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6421
[Cherry-Pick][BugFix] Fix model loading error for 300B FP8 EP parallel test case (#6382) by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6436
[Cherry-Pick] Revert "[XPU] change base XPU docker image"(6427) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6459
[Cherry-Pick 2.5][BugFix] Fix get_padding_offset in empty run by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6460
[Cherry-Pick][BugFix]fix deepgemm import (#6451) by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6452
[BugFix][Cherry-Pick] fix mtp acceptance rate decline cp (#6470) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6471
[Cherry-Pick] [BugFix] fix num_cpu_blocks computation (#6438) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6473
[Cherry-Pick][CI] Pin gunicorn version to 25.0.3(#6497) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6499
[Cherry-Pick][CI] Optimize unittest and fix title format(#6464) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6468
[Cherry-Pick][BugFix]fix console log metrics waitting queue count from [#6432] by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6453
[Feature] support mm_processor_kwargs for flexible model by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/6491
[Cherry-Pick][OP][Feature] 统一 limit_thinking_content_length CUDA 算子，支持回复长度限制与注入序列 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/6511
[Cherry-Pick][BugFix] Fix storage_backend_type comparison bug in cache_transfer_manager.py (#6514) by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6522
[Cherry-Pick][BugFix] FlashAttnBackend Supports OpenSource Model run FlashMask(#6518) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6520
[Cherry-Pick][Bugfix] cherry-pick [#6466] and [#6528] to release/2.5 by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6529
[BugFix][Cherry-Pick] Fix reshard error（#6536） by @DrownFish19 in https://github.com/PaddlePaddle/FastDeploy/pull/6537
[Cherry-Pick][CI] disable test_batch_invariance_op_mm.py in unit_test(#6548) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6549
[Cherry-Pick] [BugFix] fix cache transfer manager init failed when using block_wise_fp8 and no storage backend (#6516) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6517
[Cherry-Pick][BugFix][APIServer] Enable control socket disable option in API server (#6545) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6551
[Cherry-Pick][BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation(#6541) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6542
[Cherry-Pick][BugFix][RL] Set GPU flags for paddle in cache transfer manager (#6534) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6550
[Cherry-Pick] [BugFix] fixup for cache transfer manager init failed when using block_wise_fp8 and no storage backend (#6516) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6564
[Cherry-Pick] [BugFix] fix cache int8 for pd disaggregated deployment (#6563) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6571
[Cherry-Pick][BugFix] lazy enable_torch_proxy for cutlass (#6523) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6585
[BugFix][Cherry-Pick] Add safety checks in recycle_gpu_blocks to prevent block allocation errors(#6531) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6530
[Cherry-Pick][CI] Fix tests to resolve failure(#6557,#6572) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6590
[Cherry-Pick][Feature]Supports SWA based on appendattn [#6547] by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/6594
[Cherry-pick] support qkv&gate linear fusion [#6455] by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/6552
[Cherry-Pick][BugFix] fix mtp_config in rl (#6595) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6596
[Cherry-Pick] [BugFix] fix prefix tree updating timeout (#6615) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6616
[CI] Switch 2.5 branch to use Paddle release/3.3 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6621
[Cherry-Pick] [Bug Fix] Fix MM mtp incorrect rope emb(#6581) by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/6586
[Cherry-Pick][BugFix] Fix exist_prefill_flag when preempted task exist (#6629) by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6630
Revert "[Cherry-Pick] [Bug Fix] Fix MM mtp incorrect rope emb(#6581)" by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/6633
[XPU][CI]Update _build_xpu.yml by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6640
[Cherry-Pick] [RL] Support SM100 FP8 quantization in RL [#6601] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6602
[Cherry-Pick][BugFix] fix flash attn mtp rope emb bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/6650
[BugFix][MTP] Skip empty_input_forward during dummy run by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/6654
[Cherry-Pick] [BugFix] Fix inaccurate cache hit rate and TTFT after request preemption by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6626
[Cherry-Pick][Feature]weight only quant method support QKVGate_proj (#6612) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6669
[Cherry-Pick][XPU][CI] Fix XPU CI Bug(#6658) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6679
[Cherry-Pick][CI] Sync CI optimizations from develop to release/2.5(#6645 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6684
[Cherry-Pick][BugFix] Fix error in dynamic c8 cache (#6544) by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6692
[Cherry-Pick]add reasoning effort & string arguments in tool#6704#6656 by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6706
[Cherry-Pick][RL]fix update param [#6723] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6722
[Cherry-Pick][BugFix] Fix updating weight when enable cache storage (#6719) by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6720
[Cherry-Pick] [BugFix] fix grpc failure when tracing init before workers forked (#6732) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6744
[Cherry-Pick][BugFix][KVCache] Add inter-process lock to fix NaN error under DP+EP(#6724) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6769
[XPU][CI]Cherry-Pick PR and Update CI Case by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6619
[Cherry-Pick] [BugFix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6759
[Cherry-Pick][BuFix]Set MC_TCP_BIND_ADDRESS for mooncake store(#6782) by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6783
[Cherry-Pick] [Processor]add qwen3vl prompt_token_ids support (#6764) by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6786
[RL][Cherry-Pick] add stream guard (#6814) by @liufengwei0103 in https://github.com/PaddlePaddle/FastDeploy/pull/6823
[Cherry-Pick][Feature] use phi permute/unpermute & rm swiglu (#6361) by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6808
[Cherry-Pick] [BugFix] replace ftok with custom_ftok in get_output/save_output ops (#6822) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6824
[Cherry-Pick][Loader]Add support for handling GPU memory fragmentation. by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6790
[Cherry-Pick][Optim] Simplify available_blocks assignment logic (#6819) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6874
[Cherry-Pick][BugFix] Fix several bugs in the request interruption and inference termination functionality(#6743) by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/6890
[Cherry-Pick][BugFix] Set FD_USE_PHI_MOE_PERMUTE = 0 Default(#6886) by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6888
[Cherry-Pick][Optimization] Skip compat guard when torch is not installed(#6913) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6926
[Cherry-Pick][RL] cherry-pick [#6862] support qkrmsnorm use proxy-norm by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6859
[RL][Cherry-Pick] Support Fully Async and PrefixCache by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/6727
[Cherry-Pick][BugFix] Fix ErrorInfo code type(#6951) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6952
[Cherry-Pick][CI] Sync develop optimizations to 2.5(#6745) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6964
[Cherry-Pick][RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy(#6852) by @wikilsh in https://github.com/PaddlePaddle/FastDeploy/pull/6910
[Cherry-Pick][RL] add worker_process no grad (#6971) by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/6972
[Cherry-Pick][Speculative Decoding] Support suffix decoding (#6403) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6967
[Cherry-Pick][RL] Support moe_topk_select using Paddle native operators and Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and swiglu-fp8-quant op for DeepGemmFusedMoE for training alignment (#6850) by @DanielSun11 in https://github.com/PaddlePaddle/FastDeploy/pull/6935
[Cherry-Pick][Optimization] Use a separate driver when using Triton with Paddle (#6897) by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/6983
[Cherry-Pick][Others] Fix PD reorder for MTP [#6792] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6917
[Cherry-Pick][CI] Sync develop fix and optimizations to 2.5(#6975) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6987
[RL][Cherry-Pick] RoPE without fmad opt (#6901) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6902
[Cherry-Pick] [Feature] support v1 update/clear api for RL (#6761) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6974
[Cherry-Pick][BugFix][RL]add instantiations for decoder rope enfore_fmul_rn=true(#7009) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/7010

New Contributors

@RuohengMa made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5397
@zhang-chenyi made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5547
@BingooYang made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5912
@yanfeich made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5855
@ChowMingSing made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/6059
@CyanScholar made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/6056
@Moonchild1227 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/6188

Full Changelog: https://github.com/PaddlePaddle/FastDeploy/compare/v2.4.0...v2.5.0

Source: README.md, updated 2026-04-08

FastDeploy Files

High-performance Inference and Deployment Toolkit for LLMs and VLMs

Get an email when there's a new version of FastDeploy

FastDeploy Release 2.5 Release Note

新增功能

新模型支持

新量化方法支持

PD分离相关功能

CUDA Graph相关功能

RL训练相关功能

KV Cache相关功能

新API/接口支持

Engine与架构优化

Loader相关功能

模型层优化

性能优化

算子性能优化

显存优化

调度优化

量化相关优化

图优化

其他性能优化

多硬件支持

昆仑芯XPU

新功能支持

性能优化

Bug修复

沐曦Metax

新功能支持

性能优化

Bug修复

Intel HPU

新模型支持

新功能支持

其他

天数Iluvatar

新功能支持

Bug修复

Bug修复

PD分离相关Bug修复

多模态相关Bug修复

CUDA Graph相关Bug修复

EP并行相关Bug修复

MTP相关Bug修复

Cache相关Bug修复

量化相关Bug修复

调度相关Bug修复

API/接口相关Bug修复

RL相关Bug修复

其他Bug修复

其它

Benchmark

文档

CI/测试

代码重构/清理

Cherry-Pick

What's Changed

New Contributors