FastDeploy 2.5 Release Notes
New Features
New Model Support
- Added deployment support for Qwen3-VL [#5763]
- Added deployment support for Qwen3-VL MoE [#5913]
- Added CUDA Graph support for Qwen3-VL and Qwen3-VL MoE [#5962]
- Added TP+DP+EP support for GLM models [#6317]
New Quantization Method Support
- Added the W4AFP8 quantization method (v1_loader and v0_loader, supports TP>1) [#5757]
- Added NVFP4 MoE support on SM100 [#6003]
- Added FusedMoE support on Blackwell [#5325]
- Added a unified quantization operator [#5991]
- Added the FP8 quantization environment variable FD_USE_PHI_FP8_QUANT [#6320]
- Added Weight Only quantization support for QKVGate_proj [#6669]
PD Disaggregation Features
- Added P/D disaggregation support for multimodal models [#5323]
- Simplified PD-disaggregated deployment configuration and refactored port management [#5415]
- Added dynamic C8 IPC support for PD disaggregation [#5750]
- Added dynamic C8 support over RDMA for PD disaggregation [#5788]
CUDA Graph Features
- Added CUDA Graph support for Qwen3-VL and Qwen3-VL MoE [#5962]
- Added reorder ids to separate prefill and decode requests [#5779]
- Added full_cuda_graph to control subgraph splitting [#6027]
- Added the max_capture_shape_prefill and cudagraph_capture_sizes_prefill options [#6148]
- Enabled CUDA Graph for P/PD mixed batches using SOT subgraph splitting [#6196]
- Skipped computation on the attention padding portion in CUDA Graph mode [#5985]
RL Training Features
- Added Rollout Routing Replay support [#5405]
- Added the V1 update/clear API for RL [#6974]
- Optimized the Thinking Pattern framework [#4302]
- Unified the CUDA operators that limit thinking-content length, supporting response-length limits and injected sequences [#6511]
- R3 supports RDMA Store [#5467]
- Supported loading weights via the load_weights function [#5549]
- Added the asynchronous RL interfaces pause, update_weights, and resume [#6052]
- Supported the GLM MTP RL model [#6223] [#6267]
- R3 supports Fused Put for all-layer routing [#6099]
- Supported SM100 FP8 quantization [#6602]
- Supported the moe_topk_select Paddle native operator and FP8 MoE quantization [#6935]
KV Cache Features
- Added KV Cache storage support [#5571]
- Added the attention_store KV Cache backend [#5823]
- Added the file_store KV Cache backend [#6188]
- Supported reporting token indices via the attention store [#6285]
- RDMACommunicator now sends key and value scales [#5737]
- Added a blocking read mode for get_output_kv_signal and send_first_token support [#5836]
New API/Interface Support
- Added stop_token_ids support [#5399]
- Added a switch for logprobs/prompt_logprobs token decoding [#5463]
- Added request-level speculative decoding metrics monitoring [#5518]
- Added a health check [#5534]
- Added fine-grained request latency tracing (Tracing Part 1) [#5458]
- Added entropy computation [#5692] [#5730]
- Enabled output caching by default [#5987]
- Added tag phase token enforce generation support [#6034]
- Added SWA support based on appendattn [#6594]
- Plugin models support mm_processor_kwargs [#6491]
- Added dummy run support for multimodal models [#6045]
- Added Norm before RoPE support [#6332]
- Switched to phi permute/unpermute and removed swiglu [#6808]
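Parameters such as stop_token_ids and the logprobs switch travel in the request body of the OpenAI-compatible serving endpoint. A sketch of such a payload; the `stop_token_ids` values here are placeholders, and exact field spellings should be checked against the serving docs:

```python
import json

# Request body for an OpenAI-compatible /v1/chat/completions endpoint.
# logprobs/top_logprobs follow the OpenAI schema; stop_token_ids is the
# extension added in this release (example token ids are made up).
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
    "logprobs": True,
    "top_logprobs": 5,
    "stop_token_ids": [2, 100007],  # generation halts when either id is sampled
}
body = json.dumps(payload)
print("stop_token_ids" in body)  # True
```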
Engine and Architecture Improvements
- Refactored async_llm into a cross-process EngineService based on ZMQ communication [#4868]
- Added a Golang Router for request scheduling and load balancing [#5882] [#5966]
- Added a ZMQ-based FMQ implementation and benchmark tools [#5418]
- Supported prefill batch inference for pooling models [#5436]
- Added a Paddle version check at startup [#5769]
- Added a configurable worker health-check timeout (FD_WORKER_ALIVE_TIMEOUT) [#5865]
- Added FD statistics reporting [#5646]
- Unified the request-completion log format and enriched the statistics [#6405]
- Printed statistics metrics to the console [#6339] [#6413]
- Stopped inference of the corresponding request when a client disconnects from the online service [#5320]
Loader Features
- V1 Loader can load static C8 scale JSON [#5909]
- V1 Loader loads safetensors weights in natural key order [#6006]
- v1_loader supports TP+EP [#5465]
- Added dummy weight loading to the Loader [#6169]
- Added a wint2 backend to the Loader [#6139]
- Loader now handles GPU memory fragmentation [#6790]
Model Layer Improvements
- Added forward_meta to VocabParallelEmbedding of all models [#5524]
- The expert_dispatch operator supports more parameter configurations [#5748]
- FA3 supports GLM-RoPE [#5586]
- Added EPLB redundant expert support [#5918]
- Renamed normalization layer parameters [#6133]
- Added tracelogger stacklevel support [#5766]
- Supported fusing the qkv and gate linear layers [#6552]
Performance Optimizations
Operator Performance
- Optimized the gather_logprob operator [#5817]
- Accelerated the Qwen3 QK RMSNorm operator with a fused Triton kernel [#5880]
- Optimized the mask_quant and swiglu operators [#6222]
- The GEMM operator uses adaptive N-parameter tuning under W4AFP8 quantization [#5853]
- FA2/FA3/FA4 operators work with attn_mask_q [#6354]
GPU Memory
- Added del in the MoE prefill stage to reduce peak memory [#5863]
- Qwen models support a dynamic block_wise_fp8 cache [#5486]
- Removed the memset on decoder_num_blocks_device [#5982]
Scheduling
- Improved task-checking performance in the engine-worker queue [#5376] [#5580]
- Reduced preemption frequency when blocks are insufficient [#5696]
- Improved synchronization-state handling when preemption occurs [#5796]
- Reduced TTFT latency in EP mode [#6098]
- Simplified the available_blocks allocation logic [#6874]
- Supported multimodal prefill batching [#5313]
Quantization
- Supported W4AFP8 MTP quantization [#5429]
- Supported offline permute and loading of W4AFP8 MoE weights [#5613]
- Supported the W4AFP8 DeepEP low-latency two-stage mode [#5608]
Graph Optimizations
- Used CINN for the ViT part of PaddleOCR-VL [#5223]
- Wrapped deep gemm and triton as Python ops [#5673]
- Added infershape and dtype support for operators such as per_token_quant [#5762]
- Wrapped m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as a custom pyop [#5847]
- Removed static_op_get_block_shape_and_split_kv_block from CUDA Graph [#6081]
Other Performance Improvements
- Computed the real max_logprobs in batch [#5430]
- Supported asynchronous logprob copies [#6362]
- Avoided unnecessary penalty computation [#6078]
- Removed dict conversion from the pre/post-processing pipeline [#5494]
- Optimized the Qwen2.5-VL vision model with merged linear layers and unified processing [#6037]
- Supported setting the communication group in custom allreduce, and fused all-to-all/transpose operators in the decode stage [#5917]
- Refactored chat_handler and completion_handler: extracted a base class and used AsyncLLM [#5195]
- Updated the prompt and prompt_token_ids handling logic [#6334]
- Skipped the compat guard when torch is not installed [#6926]
- Used a separate driver for Triton when running with Paddle [#6983]
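Avoiding unnecessary penalty computation usually means checking, before launching the penalty kernel, whether every penalty is at its neutral value; if so, the whole pass over the logits can be skipped. A sketch under that assumption, with a hypothetical helper name:

```python
def needs_penalty(repetition_penalty: float,
                  frequency_penalty: float,
                  presence_penalty: float) -> bool:
    """Return True only when some penalty is non-neutral; otherwise the
    penalty kernel can be skipped entirely, saving a pass over the logits."""
    return (repetition_penalty != 1.0
            or frequency_penalty != 0.0
            or presence_penalty != 0.0)

print(needs_penalty(1.0, 0.0, 0.0))   # all neutral -> kernel skipped
print(needs_penalty(1.05, 0.0, 0.0))  # repetition penalty active -> run kernel
```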
Multi-Hardware Support
Kunlunxin XPU
New Features
- Added speculate_step_system_cache [#5397]
- Supported getting hidden states for mixed mode [#5513]
- Added speculate_get_logits [#5497]
- The update_inputs_v1 operator supports PD disaggregation [#5550]
- Supported EP+MTP [#5605]
- Supported the token num = 0 case [#5635]
- Supported the EP4TP4 configuration [#5773]
- Supported the EP4TP1 configuration (PD disaggregation) [#5860]
- Supported speculative decoding with PD [#5856]
- Supported multimodal prefill batching [#6072]
- Supported plugin models [#6092]
- Supported CUDA Graph (block attn cuda_graph support) [#6116], [#6152], [#6162]
- Switched the XPU EP interface from xDeepEP to paddle [#5706]
- Supported recovering batch sequences [#6142]
- Supported noaux_tc [#6326]
Performance
- Refactored moe ffn for better performance [#5501]
- Defaulted top_p=0.0 for better performance [#5686]
- Optimized logprob performance [#5626], [#5628]
- Refactored the block_attn parameter 'pos_emb_type' [#5511]
Bug Fixes
- Fixed an MTP multi-batch issue [#5521]
- Fixed a dp4 issue [#5946]
- Fixed a moe num_expert issue [#6014]
- Fixed a multi-batch bug in the VL model [#6015]
- Fixed text_image_gather_scatter in CUDA Graph mode [#6049]
- Fixed seq_lens_encoder resetting in PD splitwise mode [#6048]
- Aligned MAX_BSZ with the GPU setting and disabled the prefix cache for OCR VL [#5831]
MetaX
New Features
- Added a CI yaml config [#5520]
- Supported CUDA Graph [#5547]
- Supported prefix caching and CPU swap [#5844]
- Adapted the gemm interface for different maca versions [#5905]
- Supported V1_KVCACHE_SCHEDULER and the paddleocr-vl rope mode [#5555]
Performance
- Optimized the MLA backend [#5258]
- Refactored cutlass moe and optimized flash attention [#5361]
- Optimized the flash attention backend [#5876]
- Changed warpSize to WARP_SIZE [#5442]
Bug Fixes
- Fixed a GetStopFlagsMulti kernel crash [#5556]
- Fixed a metax runner issue [#5629]
- Fixed shape errors and garbled output when running inference on large images [#5965]
- Fixed misuse of self.share_inputs['preempted_idx']=[] [#6038]
- Fixed wrong inputs to 'get_token_penalty_multi_scores' [#6266]
- Fixed issues related to [#6259] [#6338]
Intel HPU
New Model Support
- Supported the ERNIE-4.5-21B-A3B-Thinking model [#5891]
New Features
- Supported tensor_wise_fp8 [#5324]
- Supported KV cache scheduler v1 [#5648]
- Supported chunked prefill [#5903]
- Supported MoE EP [#5855]
- Supported a single PaddleCustomDevice release package [#5910]
Others
- Added HPU tensorwise_fp8 documentation [#6091]
Iluvatar
New Features
- Supported V1_KVCACHE_SCHEDULER and the paddleocr-vl rope mode [#5555]
Bug Fixes
- Fixed an FD startup error when CUDA_VISIBLE_DEVICE is specified [#5735]
- Fixed multi-platform compatibility (use paddle.device.get_device_properties) [#6400]
Bug Fixes
PD Disaggregation Bug Fixes
- Fixed MTP cache attaching in PD disaggregation mode [#5884]
- Fixed a resource_manager_v1 lock issue in PD mode [#5616]
- Fixed cache int8 in PD-disaggregated deployment [#6571]
- Fixed a pickle load error in mix splitwise mode [#5488]
- Fixed a bug in the multimodal splitwise scheduler [#5604]
- Fixed PD reordering and added unit tests [#6375]
- Fixed PD reordering in MTP scenarios [#6917]
Multimodal Bug Fixes
- Fixed some PaddleOCR-VL parameters being placed on the CPU [#5413]
- Fixed a multimodal CUDA Graph issue [#5266]
- Fixed a bug at the end of audio processing [#5464]
- Fixed a video processing bug [#5557]
- Fixed an encoder cache bug [#5528]
- Fixed an eb5 multimodal prefix cache bug [#5638]
- Fixed eb5 multimodal skipping the prefix cache [#5838]
- Fixed a multimodal revert bug [#5848], [#6061]
- Fixed an eb5 prefix bug [#5879]
- Fixed fa3 qwen-vl rope support [#5869]
- Fixed illegal memory access in PaddleOCR-VL [#6042]
- Fixed a multimodal fetch feature issue [#6095]
- Limited multimodal requests per prefill batch to 1 [#5901]
- Fixed the conditional check on reversed_window_indices in SiglipEncoder [#5795]
- Fixed FlashMask on open-source models [#6520]
- Fixed incorrect rope embedding in MM MTP [#6586], [#6650]
CUDA Graph Bug Fixes
- Fixed being unable to enter CUDA Graph [#5422]
- Kept 0 out of CUDA Graph to save memory [#5426]
- Fixed sm89 compilation errors [#5809]
- Fixed the O_tmp issue in BatchMLAWithPagedKVCacheKernel [#5895]
- Reset shared inputs during the weight-update dummy run [#6418]
EP Parallelism Bug Fixes
- Fixed a custom_all_reduce overflow [#5662]
- Fixed an issue caused by wint4 EP empty runs [#5870]
- Fixed inconsistent return values of the ep_moe_expert_combine op [#5812]
- Fixed a model-loading error in the 300B FP8 EP parallel test case [#6436]
- Fixed NaN errors under DP+EP (added an inter-process lock) [#6769]
MTP Bug Fixes
- Fixed MTP producing no logprobs when enable_logprob is set [#5499]
- Fixed an attention bug in speculative decoding [#5460]
- Fixed a write qknorm cache bug in speculative decoding [#5491]
- Fixed multistep MTP in splitwise-prefill mode [#5723]
- Fixed attn_mask_offset for multi-step MTP in mixed and PD-split modes [#5738]
- Fixed an MTP weight loading bug [#5744]
- Fixed MTP split kv attention [#5920]
- Fixed an MTP logprob hang when stop_seq is included [#5927]
- Fixed an MTP forward meta issue [#5976]
- Fixed an MTP logprob issue caused by max_num_logprobs [#6084]
- Fixed a logits computation bug in GLM MTP [#6093]
- Fixed a drop in the MTP acceptance rate [#6471]
- Skipped empty_input_forward during MTP dummy runs [#6654]
- Fixed an MTP config issue in RL [#6596]
- Supported suffix decoding [#6967]
Cache Bug Fixes
- Fixed a cpu prefix cache bug [#5544]
- Fixed caching output on preemption [#5502]
- Fixed exist_prefill_flag after preemption [#6630]
- Fixed dynamic c8 in the v1 loader [#5562]
- Fixed dynamic c8 cache bugs [#5958], [#6692]
- Fixed the cache manager not starting with MTP or blockwise fp8 [#5840]
- Improved preparation of the cpu and storage caches [#5777]
- Fixed cache transfer manager updating/clearing [#5930]
- Moved cache creation back to the cache transfer process [#6144]
- Fixed cache transfer tasks failing after the cache is cleared [#6202]
- Fixed a storage_backend_type comparison bug [#6522]
- Fixed cache transfer manager initialization failing with block_wise_fp8 and no storage backend [#6517], [#6564]
- Fixed the safety check in recycle_gpu_blocks [#6530]
- Fixed the metrics cache tokens issue [#6001]
- Fixed inaccurate cache hit rate and TTFT after request preemption [#6626]
- Fixed a prefix tree updating timeout [#6616]
- Fixed the num_cpu_blocks calculation [#6473]
Quantization Bug Fixes
- Fixed W4AFP8 numerical overflow [#5634]
- Fixed w4afp8 with tp=8 [#5868]
- Added shapes for the w4afp8 gemm [#5957]
- Adapted hadamard_block_size [#5888]
- Fixed a wint2 issue [#6109]
- Fixed a weight quant op issue [#6137]
- Fixed a fused_mask_swiglu_fp8_quant bug [#6316]
- Fixed a moe activation quant issue [#5830]
Scheduling Bug Fixes
- Fixed a sleep bug during decoding [#5461]
- Fixed a hang when n>1 with enable-logprob [#5492]
- Fixed Chunked Prefill with max_tokens=1 [#5736]
- Fixed exceeding real_bsz on preemption [#5805]
- Fixed a bug with output caching enabled [#6226]
- Defaulted enable_cache_output to false [#5751]
- Fixed the can_schedule_block_num_threshold calculation [#6542]
API/Interface Bug Fixes
- Fixed RequestOutput initialization [#5419]
- Fixed the limit_thinking early-return logic in CUDA kernels [#5471]
- Fixed speculate_limit_thinking_content_length [#5590]
- Made process_response_dict support async in serving_completion [#5758]
- Fixed redundant prompt_logprobs in streaming responses when return_token_ids is enabled [#5829]
- Fixed the waiting queue count in console log metrics [#6453]
- Supported disabling the control socket [#6551]
- Fixed multiple bugs in request interruption and inference termination [#6890]
RL Bug Fixes
- Removed shutdown_process_group/restart_process_group for RL [#5433]
- Fixed RL weight loading in the moe layer [#5503]
- Fixed RL load_weights [#5642]
- Fixed rl model_weights_signal to support tp>1 [#5639]
- Fixed an rl signal issue [#5681]
- Fixed an MTP config issue in RL [#6596]
- Supported Fully Async with PrefixCache [#6727]
- Supported loading chunked part files and fixed the model path format in the IPC snapshot strategy [#6910]
- Fixed a param update issue in RL [#6722]
- Added instantiations for decoder rope [#7010]
Other Bug Fixes
- Fixed a bf16 deepseek loader issue [#5379]
- Fixed deepseek torch loading [#5410]
- Fixed instability after clearing weights [#5493]
- Skipped model execution after clearing/updating is done [#5527]
- Fixed a build script issue on Intel HPU platforms [#5455]
- Fixed a 0size bug in Graph Optimization [#5495]
- Fixed eplb weight updating [#5529]
- Removed duplicated PaddleOCRVLProcessor initialization code [#5526]
- Fixed the count_tokens_per_expert_func declaration [#5794]
- Fixed shm opened but not closed in set_data_ipc [#5826]
- Fixed entropy bugs [#5818], [#5941]
- Fixed entropy computation under TP [#5997]
- Run the Triton count_greater_kernel only on CUDA platforms [#5846]
- Fixed a logprob issue caused by max_num_logprobs [#6067]
- Fixed the token_penalty kernel [#6069]
- Fixed mask attention issues [#6216], [#6214]
- Fixed the qk_norm optimization [#6080]
- Fixed shared experts and dense mlp layers that do not need TP splits [#6180]
- Fixed a tokenizer OOM [#6287]
- Fixed a sleeptime error in the heartbeat signal [#6241]
- Fixed zmq hanging when sampled_token_id=0 [#6398]
- Fixed handling of the 4 return values of the noaux_tc_redundant op [#6384]
- Fixed lazy enable_torch_proxy for cutlass [#6585]
- Fixed a reshard error [#6537]
- Fixed MC_TCP_BIND_ADDRESS for the mooncake store [#6783]
- Fixed grpc failing when tracing initializes before workers are forked [#6744]
- Fixed a get_save_output_v1 socket name conflict [#6759]
- Replaced ftok with custom_ftok [#6824]
- Defaulted FD_USE_PHI_MOE_PERMUTE to 0 [#6888]
- Fixed the ErrorInfo code type [#6952]
- Fixed _disable_sequence_parallel_moe_if_needed [#5740]
- Fixed port-related errors [#6309]
- Fixed a download feature bug [#5669]
- Fixed an insert_zmq_task_to_scheduler break bug [#5960]
- Fixed a rebuild padding bug [#6425]
- Fixed the deepgemm import [#6452]
- Fixed an assert message [#6310]
- Fixed double shutdown of the comm group [#5715]
- Renamed need_block_num_signal to fix an shm name conflict [#5623]
- Fixed enabling cache storage while updating weights [#6720]
- Fixed the rdma script and port check for multiple api servers [#5935]
- Fixed misleading request-counting logs in worker_process [#5939]
- Fixed v0_loader on python3.12 [#6132]
- Fixed tool_calls being skipped [#6166]
- Fixed an image gen issue [#6175]
- Fixed get_padding_offset in empty runs [#6460]
Others
Benchmark
- Updated the benchmark tools [#5496] [#5625] [#6335]
- Updated backend_request_func.py [#5631] [#5633]
- Supported the Completions API [#5700]
- Fixed aiohttp streaming returning "Chunk too big" [#5771]
- Updated benchmark_serving.py [#5861]
- Supported fetching cached tokens from SGLang/VLLM [#6240]
- Added a Qwen3 VL CE test [#6288]
- Updated the README [#6343]
Documentation
- Added text/vl cinn ce config docs [#5532]
- Synced the environment variable docs with the latest code [#5713]
- Updated the GPU version to 2.3.2 [#5894]
- Updated the FastDeploy version to 2.3.3 [#6010]
- Updated the Docker image to 2.4.0 [#6168]
- Added docs for the /v1/pause, /v1/resume, and /v1/is_paused endpoints [#6192]
- Added online quantization docs [#6399]
- Added environment variable docs [#6385]
CI/Testing
Unit Test Additions (Hackathon)
- Added unit tests for the ernie4_5_vl_processor module [#5264] [#5265] [#5263]
- Added unit tests for spec_decode/mtp.py [#5533]
- Added unit tests for rollout_model.py [#5552]
- Added unit tests for openai/api_server.py [#5567]
- Added unit tests for scheduler/local_scheduler.py [#5050]
- Added unit tests for the guided_decoding module [#5047] [#5042]
- Added unit tests for entrypoints/engine_client.py [#5807]
- Added unit tests for llm.py [#6108]
- Added unit tests for engine_worker_queue.py [#6102]
- Added unit tests for serving_completion.py [#6227]
- Added unit tests for resource_manager_v1.py [#6243]
- Added unit tests for fused_moe_wint2_backend.py [#6286]
- Added unit tests for zmq_server.py [#6210]
- Additional unit tests for other modules [#5057] [#5058] [#5063] [#5060] [#5059] [#5718] [#5717] [#5726] [#5609] [#5328]
CI Infrastructure
- Added a commit-level RL build task [#5857]
- Added a CUDA 12.9 daily build task [#5936]
- Added an fd-router build task [#5967]
- Added a 4-GPU end-to-end test task [#6082]
- Added an ep4_mtp end-to-end test [#6153]
- Added GLM E2E tests (MTP and non-MTP) [#6163]
- Added attention TP unit tests [#5887]
- Added attention unit test cases [#5931]
- Added fused_moe EP TP tests [#5989]
- Added swap_layout unit tests [#6250]
- Added an SM100 FP8 inference mock test [#6273]
- Added update weights tests [#6242]
- Refactored RL tests to reuse stable_test [#5516]
- Refactored RL tests to reuse test_metrics.py [#5741]
- Refactored iluvatar_ci [#5588]
- Fixed the approve config [#5443]
- Improved stable_test resource scheduling [#6235]
- Switched nightly builds to FD_UNIFY_BUILD [#6246]
- Removed the --ipc=host and --pid=host settings [#6270]
- Updated build_linux_rl.yml [#6274]
- Updated the stable test workflow [#6352]
- Updated check-bypass.yml [#6360]
- Disabled GPU cleanup [#5781]
- Enabled retries for custom_device_check [#5786]
- Temporarily disabled fp8 test cases [#5963]
- Added retry and cleanup mechanisms [#5725]
- Reduced the test_mtp timeout [#5512]
- Adapted to vl_model baseline changes [#5576] [#6033]
- Removed the incompatible test_metrics.py [#5578]
- Added an MTP accept ratio CI case [#5570]
- Added an ERNIE45T 21B sot test [#5538]
- Removed unstable IPC tests [#6190]
- Supported asynchronous R3 precision tests [#5937]
- Pinned gunicorn to 25.0.3 [#6499]
- Disabled test_batch_invariance_op_mm.py [#6549]
- Switched the 2.5 branch to Paddle release/3.3 [#6621]
- Synced CI optimizations to the release/2.5 branch [#6684] [#6964]
Code Refactoring/Cleanup
Speculative Decoding
- Supported different inferseeds in speculative decoding [#5568]
- Supported multi-step mtp with CUDA Graph [#5624] [#5886]
- Optimized draft logprob [#5842]
- Returned the accepted tokens for each head [#5947]
- Supported GLM-4.5-Air MTP [#6047]
- Supported constrained decoding when enable_thinking is false [#6248]
- Refactored MTP pre_process [#6358]
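Returning accepted tokens per head makes the draft-head acceptance rate directly observable: accepted draft tokens divided by proposed draft tokens, per head. A sketch of that bookkeeping, with a function name invented for illustration:

```python
def acceptance_rate_per_head(accepted: list[int], proposed: list[int]) -> list[float]:
    """Per-head acceptance rate: accepted draft tokens / proposed draft tokens.
    A falling rate is the kind of regression a fix like #6471 targets."""
    return [a / p if p else 0.0 for a, p in zip(accepted, proposed)]

# Two draft heads: head 0 accepted 80 of 100 proposals, head 1 accepted 55 of 100.
print(acceptance_rate_per_head([80, 55], [100, 100]))  # [0.8, 0.55]
```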
MoE Improvements
- Supported GPT-OSS MXFP4 quantization [#5435]
- Supported FP8 weight loading [#5565]
- Optimized the MoE grid dimensions using max_tokens_per_expert [#6007]
- Removed the NUM_EXPERTS_PER_RANK template parameter from permute_x_fp8_kernel [#5620]
- ep_moe_expert_dispatch supports num_experts_per_rank=5 [#5890]
Code Cleanup and Refactoring
- Removed useless code and supported mixed FA3 [#5404]
- FA3 supports qwen3 [#5441]
- Allowed 0-dim tensors into ar [#5451]
- Removed the add_bias option [#5425]
- Added a cuda_graph assertion and only counted the actual load [#5445]
- Updated tbo-related code [#5485] [#6281]
- Cleaned up code [#5543] [#5548] [#5691]
- Removed the dead paddleocr processor branch [#5821]
- Removed stop_nums [#6182]
- Removed unused flash_mask_attention parameters [#6218]
- Removed the speculate_get_padding_offset op [#6308]
- Removed dead code in MTP rebuild_padding [#6336]
- Cleaned up MLA code [#5979]
- Added PADDLE_ENFORCE [#6321]
Other Improvements
- Redirected console logs to the llm log [#5680]
- Prevented a core dump during the Paddle version check [#5657]
- Raised plugin error messages [#5675]
- reschedule preempt tasks support an optional function [#5649]
- Upgraded paddleformers to 0.4.0 [#5599]
- Renamed tensor_parallel_degree to tensor_model_parallel_size [#5727]
- Added a flash_mask attention pybind [#5783]
- Disabled chunked_mm_input in ernie5 [#5774]
- Enabled PFCC deep_ep [#5822]
- Added the flashinfer-python-paddle dependency [#5912]
- Moved the TSP last_norm allgather into model.py [#5924] [#5961] [#5972]
- KVCache starts the transfer process only when the hierarchical cache or kv cache storage is enabled [#5871]
- Passed the metrics_port argument through [#6056]
- Added exist_prefill_flag [#6172]
- Added scale_wrapper for per_block_cast_to_fp8 [#6183]
- Improved deep_ep import handling [#6207]
- Supported building multiple SM architectures into a single whl package [#6173]
- Added a token generation rate metric [#6236]
- Enhanced the deep_ep import and supported mixed-mode flash_mask_attn [#6238]
- Refactored execute_model to support asynchronous GPU scheduling [#6176]
- Removed cuda_check (reverted several times) [#5883] [#5915]
- Added data_processor and tool parser plugins [#6096]
- Added a paddleocr config yaml [#6097]
- Added directory navigation to the mkdocs config [#6121]
- Supported overlap schedule [#6259]
- Skipped paddle.to_tensor if is_not_swapped [#6342]
- KVCache Storage supports c8 models [#6298]
- Supported cpu-cache-block-num monitoring [#6390]
- Supported importing deepgemm/deepep from fleet ops [#6351]
- Exit when the apiserver or engine fails to start [#6322]
- Ensured no leftover processes on exit [#6377]
- Lazy log writing [#6323]
Cherry-Pick
- Added reasoning effort and tool string parameter support [#6706]
- Added qwen3vl prompt_token_ids support [#6786]
What's Changed
- [New][RL] Support Rollout Routing Replay by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5405
- [loader]fix bf16 deepseek by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5379
- [Loader]fix deepseek torch loading by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5410
- [Loader][BugFix] Fix some parameters place on CPU in PaddleOCR-VL by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5413
- [PD Disaggregation] Add timestamp for analyzing splitwise deployment by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5317
- [Others] Remove useless code and support FA3 in mixed by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5404
- [BugFix] fix can not enter into cuda graph by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5422
- [CI] [Hackathon 9th Sprint No.16] Add unit tests for fastdeploy/input/ernie4_5_vl_processor/process.py by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5264
- [BugFix] 0 not into cuda graph to save memory by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5426
- [Quantization] Support w4afp8 mtp by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5429
- [Feature] [Benchmark]: add ZMQ-based FMQ implementation and benchmark tools by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/5418
- [PD Disaggregation] FD registers to the Router only once. by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5431
- [BugFix] fix init RequestOutput by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5419
- [Feature] Multimodal Model P / D Separation by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5323
- [Engine] [Feature] Refactor async_llm:cross-process with EngineService,based on zmq communication by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4868
- [Metax] optimize mla backend by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/5258
- [BugFix] fix mm cudagraph by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5266
- [Optimization] compute real max_logprobs in batch by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5430
- Remove CUDA ERROR 9 of inputs of get_padding_offset kernel by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/5440
- [Graph Optimization][CINN] Use CINN in PaddleOCR-VL ViT part by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5223
- [XPU] add speculate_step_system_cache by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/5397
- [PD Disaggregation] Unify the disaggregation info and the pd communication by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5438
- FA3 support qwen3 by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5441
- [Feature] Support prefill batch inference for pooling models. by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5436
- [CI]Modify approve by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/5443
- allow 0-dim tensor into ar by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5451
- [Others] remove add_bias option by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/5425
- [Metax] modify warpSize to WARP_SIZE by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/5442
- [Feature] support stop_token_ids by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5399
- [Others] Maintain the mtp branch temporarily. by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5446
- [CI] Add unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5328
- [BugFix] [RL] remove shutdown_process_group/restart_process_group for RL by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5433
- [Speculative Decoding][BugFix]Fix attention bug in spec decoding by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5460
- [BugFix] Fix limit_thinking early return logic in CUDA kernels by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5471
- [Others] add assert and only count the actual load in cuda_graph by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5445
- [BugFix] fix audio end bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5464
- [BugFix] fix decode time sleep bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5461
- [PD Disaggregation] Decode does not cache requests for preallocating … by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5453
- [Feature]Optimization of Thinking Pattern Framework by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4302
- [Metax] refactor cutlass moe and optimize flash attention by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/5361
- [BugFix] fix mix splitwise pickle load error by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5488
- [CI] [XPU]ep+prefix cache+chunk prefill by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5489
- [Feature]Add a switch for logprobs/prompt_logprobs token decoding. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5463
- [BugFix] fix instability after clearing weight by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5493
- [Optim] Improve task-checking performance in engine-worker-queue by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5376
- [BugFix] fix hung when n>1 and --enable-logprob by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5492
- [Benchmark] Update benchmark by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5496
- [Others] update tbo related code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5485
- [Docs] Fix nvidia_gpu.md, add sm80 in precompiled by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5462
- [Graph Optimization][BugFix][CI] Fix 0size bug && add unitest by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5495
- [BugFix] Fixed build script issue on Intel HPU platforms by @FocusLuo in https://github.com/PaddlePaddle/FastDeploy/pull/5455
- [RL]Fix RL weight loading issue in moe layer by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5503
- [BugFix] Fix MTP no logprobs when enable_logprob by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5499
- [CI] Reduce timeout of send_request in test_mtp by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5512
- [Optimization] support mm prefill batch by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5313
- [XPU] support get hidden state for mix by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5513
- [Feature] Support for request-level speculative decoding metrics monitoring. by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5518
- [CI] [Hackathon 9th Sprint No.25] Add unit tests for fastdeploy/input/ernie4_5_vl_processor/image_preprocessor/image_preprocessor_adaptive.py by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5265
- [Metax] add ci yaml by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5520
- [PD Disaggregation] Distinguish the pipelines for sending kv signal in different prefill by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5514
- [XPU] fix mtp multi batch by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5521
- [Models] Add forward_meta to VocabParallelEmbedding of all models by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5524
- [XPU] refactor of block_attn param 'pos_emb_type' by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/5511
- [XPU] add speculate_get_logits by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/5497
- [Doc]add text/vl cinn ce config by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/5532
- [Feature][Optimization] Qwen Support Dynamic block_wise_fp8 cache by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5486
- [BugFix] reschedule_preempt_task append waiting & PREEMPTED blocksize by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5506
- [CI][XPU] add mtp case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5537
- [BugFix] fix encoder cache bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5528
- [Graph Optimization][CI] Add ERNIE45T 21B sot test by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5538
- [Others] Clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5543
- [CI] [Hackathon 9th Sprint No.22] Add unit tests for fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5263
- [Metax] adapt to the latest develop and support cudagraph by @zhang-chenyi in https://github.com/PaddlePaddle/FastDeploy/pull/5547
- [Feature] Add check health in FD by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5534
- [CE]add pd router and wint4 tp4 config by @xiegegege in https://github.com/PaddlePaddle/FastDeploy/pull/5554
- [PD Disaggregation][XPU] update_inputs_v1 operator supports PD by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5550
- [Bug Fix] Fix bug for caching output when preempted by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5502
- [Metax] fix GetStopFlagsMulti kernel crash issue by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5556
- [BugFix][Speculative Decoding] Fix write qknorm cache bug in speculative decoding by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5491
- [BugFix] fix dynamic c8 in v1 loader by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5562
- [CI] [Hackathon 9th Sprint No.34] Add unit tests for module No.34 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5057
- [Feature] Use `paddle.compat.enable_torch_proxy` in `fastdeploy/__init__.py` by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/5211
- Revert "[BugFix] reschedule_preempt_task append waiting & PREEMPTED blocksize" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5575
- 【NewFeature】support load fp8 weight by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5565
- [CI] Adapt vl_model baseline changes due to Paddle update by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5576
- [Feature] Support fusedmoe on Blackwell by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5325
- Revert "[Feature] Use
paddle.compat.enable_torch_proxyinfastdeploy/__init__.py" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5579 - [CI] Remove test_metrics.py due to incompatible forced merge by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5578
- [BugFix] fix cpu prefix cache bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5544
- [Feature] Tracing: Fine-Grained Tracing for Request Latency Part1 by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/5458
- [RL] R3 Support RDMA Store by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5467
- [BugFix] skip model executing after clearing/updating is done by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5527
- [Feature] add ue8m0 for per_token_quant_fp8 by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/5563
- [CI] [Hackathon 9th Sprint No.36] Add unit tests for module No.36 (part) by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5058
- [Optim] Optimize costtime in checking tasks in engine-worker-queue by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5580
- [Feature] FA3 support GLM-RoPE by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5586
- [BugFix] fix video bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5557
- [BugFix] fix speculate_limit_thinking_content_length by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5590
- [Others] Clean code && remove GPU sync code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5548
- Revert "[Feature] add ue8m0 for per_token_quant_fp8" by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5611
- [Feature] [PD Disaggregation] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5415
- [XPU][CI] xpu add ci test for pd by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5610
- [Speculative Decoding]Support different inferseed in speculate decoding by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5568
- [Intel HPU] enable tensor_wise_fp8 by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5324
- [BugFix] Remove duplicated PaddleOCRVLProcessor initialization code by @megemini in https://github.com/PaddlePaddle/FastDeploy/pull/5526
- [CI] [Hackathon 9th Sprint No.12] Add unit tests for fastdeploy/spec_decode/mtp.py by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5533
- [CI] Add CI case for MTP accept ratio by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5570
- [Benchmark] Update benchmark tools by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5625
- [CI] [Hackathon 9th Sprint No.14] Add unit tests for fastdeploy/rl/rollout_model.py by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5552
- [XPU]logprob bug by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5626
- [Metax] fix metax runner issue by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5629
- [XPU] refactor moe ffn by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/5501
- [benchmark] Update backend_request_func.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5631
- [Model] tp+ep support v1_loader by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5465
- [XPU] support for EP+MTP by @lizan1999 in https://github.com/PaddlePaddle/FastDeploy/pull/5605
- [CI] [Hackathon 9th Sprint No.36] Add unit tests for module No.36 (fix) by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5609
- [benchmark] Update backend_request_func.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5633
- [Iluvatar] Support V1_KVCACHE_SCHEDULER and paddleocr-vl rope mode by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/5555
- [RL]Support loading weights via the load_weights function for RL by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5549
- Revert "[XPU][CI] xpu add ci test for pd" by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5645
- [BugFix] fix rl model_weights_signal to support tp>1 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5639
- [CI] [Hackathon 9th Sprint No.19] Add unit tests for module No.19 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5063
- [XPU] support token num = 0 by @lizan1999 in https://github.com/PaddlePaddle/FastDeploy/pull/5635
- [RL]Fix RL load_weights by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5642
- [Intel HPU] enable kv cache scheduler v1 for hpu by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5648
- [RL] Update worker_process.py by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5651
- [BugFix] Fix eplb weight updating by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/5529
- [BugFix] fix eb5 mm prefix cache bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5638
- [CI] [Hackathon 9th Sprint No.38] Add unit tests for functional modules by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5060
- [Metax] update ci test by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5652
- [BugFix] Fix custom_all_reduce overflow by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5662
- [CI] Fix unit_test error of unstable execution by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5660
- [RL] provide options for whether shutdown comm group after weights cleared by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5663
- [Speculative Decoding]Support multi-step mtp with cudagraph by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5624
- [BugFix] Fix the W4AFP8 numerical overflow issue. by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5634
- [BugFix] fix download feature bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5669
- [Quantization] Support w4afp8 moe weight offline permute & load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5613
- [Quantization] Support w4afp8 DeepEP low latency two stage by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5608
- [Metax] update ci yaml by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5674
- [BugFix] fix rl signal by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5681
- [XPU][CI] xpu add ci test for pd + TP2 by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5653
- Revert "Revert "[Feature] Use `paddle.compat.enable_torch_proxy` in `fastdeploy/__init__.py`"" by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5606
- [log] console log to llm log by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/5680
- [XPU][CI] Xpu ci update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5690
- [CI] [Hackathon 9th Sprint No.37] Add unit tests for functional modules by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5059
- [XPU]Set top_p=0.0 by default on XPU to optimize performance by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5686
- [Optim] Remove limitation of number of kvcache blocks by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5612
- [Others]Prevent core dumps during Paddle version check by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5657
- [Metax] update ci name by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5679
- [XPU] modify speculate_verify by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/5522
- [Metax]Update run_ci_metax.sh by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5698
- Revert "[Optim] Remove limitation of number of kvcache blocks" by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/5702
- [Docs] Update parameters documentation with latest code defaults and new parameters by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5709
- [CI] [Hackathon 9th Sprint No.40] Add unit tests for fastdeploy/entrypoints/openai/api_server.py by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/5567
- [Others] plugin raise error msg by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5675
- [Benchmark] Support the Completions API by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/5700
- [Docs] Update the environment variable documentation to match the latest code by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5713
- [Others] reschedule preempt task support optional func by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5649
- [Others]upgrade paddleformer to 0.4.0 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5599
- [Feature] Entropy calculation support by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5692
- [Others] clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5691
- Revert "[CI] Adapt vl_model baseline changes due to Paddle update" by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5732
- [CI] [Hackathon 9th Sprint No.41] Add unit tests for functional modules (new) by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5718
- [Feature] Add entropy calculation script by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5730
- [Others] Rename tensor_parallel_degree to tensor_model_parallel_size for paddleformers 0.4.1 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5727
- [GraphOptimization] Wrap deep gemm and triton as python op by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5673
- [BugFix] Fix Chunked Prefill when max_tokens=1 by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5736
- [CI] Refactor RL tests to reuse test_metrics.py by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5741
- [Speculative Decoding]Fix multistep MTP in splitewise-prefill mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5723
- [CI] Refactor RL tests to reuse stable_test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5516
- [BugFix] Set enable_cache_output as false by default by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5751
- [XPU] refine moe_expert_ffn ut by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/5743
- [Loader]Fix bug in MTP weight loading by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5744
- [Metax] update ci bash by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5760
- [BugFix] Fix _disable_sequence_parallel_moe_if_needed by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5740
- [XPU]ZMQ logprob by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5628
- [iluvatar][CI] refactor iluvatar_ci by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5588
- [Optimization] refactor(chat_handler,completion_handler): extract base classes and use AsyncLLM by @memoryCoderC in https://github.com/PaddlePaddle/FastDeploy/pull/5195
- [Feature] Support KV Cache Storage by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5571
- [CI] Add retry and robust cleanup for removal by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5725
- [Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5738
- [Feature] Add startup version check mechanism for Paddle by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5769
- [Benchmark] Increase the default aiohttp read buffer size to 10 MB to fix "Chunk too big" errors on large streaming response chunks by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/5771
- [BugFix] fix mm splitwise scheduler bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5604
- [Feature] pd support dy-c8 ipc by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5750
- [BugFix] fix double shutdown of comm group when rank0 clears weights slower than other ranks by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5715
- [CI] Disable GPU cleanup due to CI machine limitations by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5781
- [BugFix] Rename `need_block_num_signal` to fix shm name conflict by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/5623
- [Iluvatar] Fix FD launch error when specifying CUDA_VISIBLE_DEVICES by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/5735
- [CI] Enable custom_device_check in CI rerun by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5786
- make flash_mask attention pybind by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5783
- [FDConfig] disable chunked_mm_input in ernie5 by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5774
- [Graph Optimization] Add infershape&dtype to `per_token_quant`/`ep_moe_expert_combine`/`ep_moe_expert_dispatch_fp8`/`count_tokens_per_expert_func` by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5762
- [BugFix] Fix process_response_dict to support async in serving_completion by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5758
- [Feature] tracelogger stacklevel by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5766
- [BugFix] Change `count_tokens_per_expert_func` declaration (Tensor -> vector&lt;Tensor&gt;) by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5794
- [BugFix] Correct condition for `reversed_window_indices` in `SiglipEncoder` by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5795
- [CI] Fix path error and port conflict by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5803
- [BugFix] Fix preemption out of real_bsz by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5805
- [XPU] xpu support ep4tp4 by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5773
- [CI] [Hackathon 9th Sprint No.55] Add unit tests for fastdeploy/scheduler/local_scheduler.py by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5050
- [Model] support more config for expert_dispatch by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5748
- [CI] [Hackathon 9th Sprint No.33] Add unit tests for functional modules (new) by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5726
- [CI] [Hackathon 9th Sprint No.52] Add unit tests for fastdeploy/model_executor/guided_decoding/ernie_tokenizer.py by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5047
- [BugFix] Fix return value inconsistency for `ep_moe_expert_combine` op by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5812
- [BugFix] fix compile error in sm89 by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5809
- [Models] Add Qwen3-VL Model Support by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5763
- [Others] remove invalid paddleocr processor elif branch by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5821
- [BugFix] fix shm opened but not closed in set_data_ipc by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5826
- [PD Disaggregation]remove unsed para in RDMACommManager by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5814
- [RL] add lm_head_fp32 in RolloutModelConfig by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5825
- [BugFix] Fix entropy bugs by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5818
- [Feature] support w4afp8 v1_loader and v0_loader(tp>1) by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5757
- [CI] Fix archive URL injection in tag image build by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5828
- [Optimization] Optimization for gather_logprob by 10GB by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5817
- [CI] [Hackathon 9th Sprint No.46] Add unit tests for fastdeploy/model_executor/guided_decoding/xgrammar_backend.py by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5042
- [BugFix] Fix moe activation quant by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5830
- [CI case]Prompt logprob by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/5835
- [BugFix] eb5 mm skip prefix cache by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5838
- [XPU]MAX_BSZ aligns gpu settings and disable prefix cache in OCR VL by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5831
- [Speculative Decoding] Optimize draft logprob by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5842
- [BugFix] Only Run Triton count_greater_kernel on CUDA platform by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5846
- [Metax] adapt prefix caching & cpu swap by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5844
- [Others] remove template NUM_EXPERTS_PER_RANK in permute_x_fp8_kernel by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5620
- [BugFix] skip mm revert by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5848
- [benchmark] Update benchmark_serving.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5861
- [CI] Add commit-level Linux build task for RL by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5857
- [BugFix] fix cache manager not launched in case of mtp or blockwise fp8 by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5840
- [APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5865
- [Feature] RDMACommunicator send key and value scale by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5737
- [BugFix] Refine the preparation of cpu and storage cache by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5777
- [Optimization] add del to decrease peak memory in MoE prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5863
- [BugFix] Fix wint4 ep issue caused by empty run by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5870
- [BugFix]support fa3 qwen-vl rope by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5869
- [XPU] Speculative Decoding with PD by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5856
- [PD Disaggregation] Update usage of pd disaggregation and data parallel by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5742
- [Others] enable use PFCC deep_ep by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5822
- [RL] Change 'model' to the instance variable 'tmp_model' by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5872
- [BugFix] fix w4afp8 tp=8 by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5868
- [XPU][CI] Add XPU logprobs case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5874
- revert cuda_check by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5883
- [Metax] optimize flash attention backend by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/5876
- [OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5890
- [Docs] Update GPU version from 2.3.0 to 2.3.2 in installation documentation by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5894
- [UT]support attention test tp by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5887
- [Speculative Decoding]Support multi-step mtp with cudagraph by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5886
- [BugFix] Fix redundant prompt_logprobs in the second chunk of streaming response when return_token_ids is enabled for v1/completions and fix trace file name by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5829
- [BugFix] Adapt to hadamard_block_size by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5888
- [BugFix] Storage backend gets env params by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5892
- [BugFix] fix mtp cache attaching for pd disaggregation by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5884
- [KVCache] launch cache transfer processes only if hierarchical cache or kv cache storage is enabled by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5871
- [Bugfix]fix model weight signal tensor num by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5900
- [XPU]xpu support ep4tp1 in pd disaggregation by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5860
- [BugFix] fix BatchMLAWithPagedKVCacheKernel O_tmp by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5895
- [Intel HPU] enable chunked prefill by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5903
- [Metax] adapt to gemm interface on different versions of maca by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5905
- [V1 Loader] Support loading static C8 scale JSON by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5909
- [Optim] The gemm of w4afp8 adopts an adaptive N by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/5853
- [Iluvatar] remove CUDA_VISIBLE_DEVICE in run_ci_iluvatar.sh by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/5916
- Revert cuda check by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5915
- [Feature] support rdma pd dy-c8 by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5788
- [BugFix] fix eb5 prefix bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5879
- [Graph Optimization] Wrap `m_grouped_gemm_fp8_fp8_bf16_nt_contiguous` as custom pyop by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5847
- [CI] [Hackathon 9th Sprint No.18] Add unit tests for functional modules (new) by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5717
- [Optimization] Reduce preemption occurrence when blocks not enough by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5696
- [BugFix] fix mtp split kv attention by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5920
- [BugFix] Limit multi-modal request for prefill batch to 1 by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5901
- [Feature] Add Golang-based Router for Request Scheduling and Load Balancing by @mouxinqq in https://github.com/PaddlePaddle/FastDeploy/pull/5882
- [INTEL_HPU] support the ERNIE-4.5-21B-A3B-Thinking model by @FocusLuo in https://github.com/PaddlePaddle/FastDeploy/pull/5891
- [CI] Add daily build_linux jobs for CUDA 12.9 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5936
- [BugFix] resource_manager_v1 lock PD by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5616
- [Models] Add Qwen3-VL Moe Model Support by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5913
- [INTEL HPU] support only one release package of PaddleCustomDevice by @FocusLuo in https://github.com/PaddlePaddle/FastDeploy/pull/5910
- [XPU] [CI]Update CI workflow to include all file types by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5943
- [Bugfix] Fix mtp logprob hang problem when include stop_seq by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5927
- [FDConfig] add flashinfer-python-paddle depend by @BingooYang in https://github.com/PaddlePaddle/FastDeploy/pull/5912
- [TSP] last_norm allgather move to model.py by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5924
- [BugFix] Fix misleading logging in worker_process for request counting by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5939
- [Bugfix] Fix entropy calculation bugs by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5941
- [CI] Temporarily disable fp8_cases in base_tests by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5963
- [XPU] [CI] Lock PaddlePaddle version in run_xpu_ci_pytest.sh by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5964
- [BugFix] fix dyc8 cache bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5958
- [Metax] fix shape error & output garbled code when reasoning big pict… by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5965
- [Bugfix] Increase the shape of w4afp8 gemm by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/5957
- [Speculative Decoding] Return accepted tokens per head in response by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5947
- [CI] [Hackathon 9th Sprint No.50] Add unit tests for fastdeploy/entrypoints/engine_client.py (part) [#5045] by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5807
- Revert "[TSP] last_norm allgather move to model.py" by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5961
- Revert "Revert "[TSP] last_norm allgather move to model.py"" by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5972
- [XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5878
- [Models] Qwen3VL and Qwen3VL-Moe CUDA graph Support by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5962
- [Feature] Support redundant expert for eplb by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/5918
- [Metax] add ci test file & update run_ci_metax.sh by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5975
- [XPU] fix dp4 by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/5946
- dev_fix_mtp_forward_meta by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5976
- MLA clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5979
- [Optimization] Remove decoder_num_blocks_device memset by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5982
- [BugFix] [MultiAPIServer] fix rdma script and port check for multi api server by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5935
- [Optimization] Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5917
- [XPU]add ci test cast for P_EP4TP4 D_EP4TP1 by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5988
- [ci case]Check the chunking of the chat interface by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/5981
- [Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5880
- [CI] Add fd-router build_task by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5967
- [Docs] Update FastDeploy version to 2.3.3 in NVIDIA GPU installation documentation by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6010
- [BugFix] Fix entropy calculation issue in TP by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5997
- [BugFix] Fix insert_zmq_task_to_scheduler break bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5960
- [Optimization] Do not compute the ATTN padding part in CUDA graph mode by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5985
- [V1 Loader] Load safetensors weights in natural key order by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6006
- [BugFix] fix metrics cache tokens by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6001
- [XPU][CI] Update XVLLM_PATH setup in run_xpu_ci_pytest.sh by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6018
- [Metax][CI] update test_ernie_28b_vl.py by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6019
- [Metax][CI] update test_ernie_28b_vl.py image result keywords by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6022
- [Feature] Enable output caching by default by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5987
- [BugFix] fix cache transfer manager updating/clearing by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5930
- [Graph Optimization] Add `full_cuda_graph` to control subgraph split by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/6027
- [Optim] Robust sync status when preemption happens by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5796
- [XPU][CI] Cache queue port bug fix by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6030
- [Metax][CI] remove 28B VL model test sampling randomness by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6032
- [CI] Adapt vl_model baseline changes due to Paddle update_2 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6033
- [Metax][Doc] update metax gpu 'get_started' doc by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6035
- [Metax][Fix] fix self.share_inputs['preempted_idx']=[] incorrect use by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6038
- [Feature]Report FD statistical information by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/5646
- [Optimize] Qwen2.5-VL vision model with merged linear layers and unif… by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/6037
- [XPU] fix multi-batch bug in VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/6015
- [RL][CI] Support Async R3 And Add Accuracy Test by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5937
- [XPU] fix moe num_expert by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/6014
- [BugFix] fix PaddleOCR-VL illegal memory by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/6042
- [Intel HPU] enable MoE EP for hpu by @yanfeich in https://github.com/PaddlePaddle/FastDeploy/pull/5855
- [Feature] get_output_kv_signal blocking read mode & send_first_token by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5836
- [XPU][CI] update paddle version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6044
- [Optim] Optimize grid dimensions using max_tokens_per_expert for MoE models by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/6007
- [XPU] Speculate Decoding + PD, benchmark fix by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/6036
- [CI]Add more cases for attention unit test by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5931
- [Feature]Support tag phase token enforce generation by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/6034
- [XPU] fix(xpu_model_runner): reset seq_lens_encoder to 0 for decode role in PD splitwise mode by @zhuangzhuang12 in https://github.com/PaddlePaddle/FastDeploy/pull/6048
- [UNITEST] make EP TP test_fused_moe CI by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5989
- [Feature] Unify quant ops by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/5991
- [CI] fix port conflict by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6054
- [Feature] Support stopping the inference for the corresponding request in the online service after a disconnection request. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5320
- [Speculative Decoding] Support MTP for GLM-4.5-Air by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6047
- [Metax][CI] update jenkins github action version by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6065
- only update self.exist_prefill_task_signal in v0 by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6064
- [Bugfix] Fix logprob issues caused by max_num_logprobs by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6067
- [XPU][CI] XPU CI refactor by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6053
- [BugFix] fix mm revert bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6061
- [CI] Add 4-GPU e2e test job by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6082
- [BugFix] Fix qk_norm optimization by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6080
- [XPU] xpu support mm prefill batch by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/6072
- [Speculative Decoding][Bugfix] Fix MTP logprob issues caused by max_num_logprobs by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6084
- [Optimization] Avoid unnecessary penalty computation by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6078
- [CI]Fix test cases failing under Python 3.12 by @ChowMingSing in https://github.com/PaddlePaddle/FastDeploy/pull/6059
- [XPU][CI] Xpu ci update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6089
- [XPU] add pd+mtp ci by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/6090
- [FDConfig] transfer metrics_port by @CyanScholar in https://github.com/PaddlePaddle/FastDeploy/pull/6056
- [CE]add paddleocr config yaml by @xiegegege in https://github.com/PaddlePaddle/FastDeploy/pull/6097
- [Speculative Decoding][Bugfix] Fix logits computation bug in GLM MTP by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6093
- [Feature] Add PaddleFormers fallback backend by @jackyYang6 in https://github.com/PaddlePaddle/FastDeploy/pull/5999
- [XPU] support plugin model by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/6092
- [Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/6081
- [CI]Fix test case by @ChowMingSing in https://github.com/PaddlePaddle/FastDeploy/pull/6111
- [Docs]: fix pre-commit error of markdown by @jackyYang6 in https://github.com/PaddlePaddle/FastDeploy/pull/6100
- [XPU] Support CudaGraph(add block attn cuda_graph support) by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6116
- [Docs]fix doc by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6119
- [WIP] Add directory guide to mkdocs configuration by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6121
- [XPU]XPU FD Release/2.4 Note by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6125
- [BugFix] fix wint2 by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/6109
- [CI] Enable 4-GPU e2e test in nightly and fix docker_tag_build by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6128
- [Metax][CI] restore 'moe_expert_dispatch' outputs by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6130
- [Intel HPU] add HPU tensorwise_fp8 readme by @yanfeich in https://github.com/PaddlePaddle/FastDeploy/pull/6091
- [XPU] Update Dummy Run To Support Multi-Batch Execution by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6123
- [Iluvatar][CI] Fix the error max_tokens_per_expert referenced before… by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/6083
- [BugFix] fix python3.12 v0_loader by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/6132
- [Models]Rename params of normalization layer. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/6133
- [XPU] change XPU EP interface from xDeepEP to paddle by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5706
- [XPU]Update Release Note For Release2.4 by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6143
- [docx] by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/6145
- [Optimization] The pre- and post-processing pipeline does not perform dict conversion by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5494
- [BugFix] fix weight quant op by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6137
- [RL] router supports divided rollout by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6150
- Support MXFP4 for GPT-OSS by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/5435
- [XPU][Graph Optimization] XPU Support CUDAGraph by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6152
- [CI] Add ep4_mtp e2e test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6153
- [RL][R3] Fix typo by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/6046
- [Optimization] Update data_processor & add tool parser plugins by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6096
- [RL] [APIServer] add more status codes for update/clear api by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6141
- [XPU] Enable CudaGraph by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6162
- [Feature] [KVCache] support attention_store kv cache backend by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5823
- [Graph Optimization] Add `max_capture_shape_prefill` && `cudagraph_capture_sizes_prefill` by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/6148
- [Docs] Update FastDeploy Docker image to 2.4.0 for Nvidia GPU installation by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6168
- [RL] add pause, update_weights, resume interface for async RL by @mitu626 in https://github.com/PaddlePaddle/FastDeploy/pull/6052
- [UT] Add GLM E2E tests for non-MTP and MTP by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6163
- [BugFix] skipped tool_calls by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6166
- [BugFix]fix image gen by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6175
- [Model Runner] Add exist_prefill_flag by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6172
- [XPU] support recover batch sequence by @lizan1999 in https://github.com/PaddlePaddle/FastDeploy/pull/6142
- add scale_wrapper for per_block_cast_to_fp8 by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6183
- [Docs] add docs of /v1/pause, /v1/resume and /v1/is_paused by @mitu626 in https://github.com/PaddlePaddle/FastDeploy/pull/6192
- [XPU] [CI] add xpu logprobs case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6187
- [XPU] fix text_image_gather_scatter in cudagraph mode by @RuohengMa in https://github.com/PaddlePaddle/FastDeploy/pull/6049
- [BugFix] move cache creation back to cache transfer process and adapt clear/update by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6144
- Improve deep_ep import handling with logging by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/6207
- [CI][BugFIx] Remove flaky IPC-related test by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/6190
- [build] support build sm 80,86,89,90 to one whl package by @mitu626 in https://github.com/PaddlePaddle/FastDeploy/pull/6173
- [Others] remove stop_nums by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6182
- [Model Runner] Prepare token count and move FA3 initialization into the graph by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6170
- [Loader] support dummy load weight by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6169
- [XPU][CI]Add Cuda Graph CI Case by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6229
- [Bug Fix] fix mask attention by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/6216
- [BugFix] fix mask_attn by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/6214
- [RL] add version to the key of cache storage && refine raising error by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6160
- [XPU] [CI] XPU CI Update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6211
- remove unneeded para from flash_mask_attention by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6218
- [benchmark] Support retrieving cached tokens from SGLang/vLLM by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/6240
- [Metrics] Added metrics for monitoring token generation rate per request. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/6236
- [CI] add update weights tests by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/6242
- [CI] adjust resource scheduling of _stable_test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6235
- [CI] Switch nightly build to use FD_UNIFY_BUILD by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6246
- [Others] enhance deep_ep import and support mixed mode flash_mask_attn by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/6238
- [BugFix] Fix token_penalty kernel by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/6069
- [Model Runner] Refactor execute_model for GPU async scheduling by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6176
- [Others] Support constrained decoding when enable_thinking is false by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6248
- [Bug fix] Fix multi modal fetch feature by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6095
- [Models][BugFix] shared experts and dense mlp layer do not require TP split by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/6180
- [CI] Add unit test for swap_layout && remove unit test of splitwise_scheduler by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6250
- [CI] Fix nightly cu129 build_outputs upload failure by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6264
- [RL] Support GLM MTP RL Model by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6223
- [Feature] Support Ernie FP8 on sm100 by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/5593
- [Feature] Support NVFP4 MoE on SM100 by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/6003
- [BugFix] allow return code 250 in tests/distributed/test_fusedmoe_ep_entry.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6269
- [Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph splitting mode by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/6196
- [CI] Remove --ipc=host and --pid=host from _stable_test.yml by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6270
- [CI] Update _build_linux_rl.yml by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/6274
- [Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (… by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6266
- [Feature] Enhance Router with /v1/completions, docs, scripts, and version info by @mouxinqq in https://github.com/PaddlePaddle/FastDeploy/pull/5966
- [BugFix] Fix bug for enable output caching by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6226
- Revert "[Feature] Support Ernie FP8 on sm100" by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6275
- [CI] [Hackathon 10th Spring No.41] Add unit tests for fastdeploy/entrypoints/llm.py by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/6108
- [Metax] adapt to the latest develop by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/6282
- [CI] [Hackathon 10th Spring No.30] Add unit tests for fastdeploy/inter_communicator/engine_worker_queue.py by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/6102
- [Feature] Support report token index by attention store by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6285
- [CI] [Hackathon 10th Spring No.38] Add unit tests for fastdeploy/entrypoints/openai/serving_completion.py by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/6227
- [Optimize] optimize mask_quant & swiglu by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6222
- Revert "[RL] Support GLM MTP RL Model" by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6301
- [Bug Fix] fix tokenizer oom by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/6287
- [BugFix] fix fused_mask_swiglu_fp8_quant bug by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6316
- [Benchmark] Ce qwen3 vl by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/6288
- [BugFix] Fix heartbeat signal's sleeptime error by @CyanScholar in https://github.com/PaddlePaddle/FastDeploy/pull/6241
- [config] fix assert message by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/6310
- [Feature] [KVCache] support file_store kv cache backend by @Moonchild1227 in https://github.com/PaddlePaddle/FastDeploy/pull/6188
- remove speculate_get_padding_offset op by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6308
- cp 1131 tbo to develop by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6281
- [Feature]Support reorder ids to split prefill and decodes by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5779
- [CI] [Hackathon 10th Spring No.21] Add unit tests for fastdeploy/engine/sched/resource_manager_v1.py by @kesmeey in https://github.com/PaddlePaddle/FastDeploy/pull/6243
- [Feature] Support Ernie FP8 on sm100 ( the fixed version) by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6304
- [BugFix] Fix port-related errors in mix mode when FD_ENABLE_INTERNAL_ADAPTER is enabled by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/6309
- [CE] add 21b cpu cache, glm mtp, glm for rl config by @xiegegege in https://github.com/PaddlePaddle/FastDeploy/pull/6328
- [Others] lazy write log when writing by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6323
- [RL] R3 Support Fused Put the Routing of All Layers by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/6099
- [benchmark] update benchmark tools by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/6335
- [CI] [Hackathon 10th Spring No.37] Add unit tests for fastdeploy/model_executor/layers/moe/fused_moe_wint2_backend.py by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/6286
- add PADDLE_ENFORCE by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6321
- [Model Runner] Support overlap schedule by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6259
- [Metax][CI] update ci test files by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6340
- [Feature] FD_USE_PHI_FP8_QUANT by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6320
- [Feature] Fix counter release logic & update go-router download URL by @mouxinqq in https://github.com/PaddlePaddle/FastDeploy/pull/6280
- [Optimize] Optimize ttft for ep by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6098
- [Metax][Fix] fix issues based [#6259] by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6338
- [Metax][CI] fix run_ci_metax.sh error by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6341
- [benchmark] Update README.md by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/6343
- [Others] Skip paddle.to_tensor if is_not_swapped by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6342
- [Others] add mock unittest for sm100 FP8 inference by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6273
- [Optimization]update prompt & prompt_token_ids by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6334
- [RL] Support GLM MTP RL Model by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6267
- [CI] Update stable test workflow and run.sh script by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6352
- [XPU] support noaux_tc by @lizan1999 in https://github.com/PaddlePaddle/FastDeploy/pull/6326
- [Optimization] Support FA2/FA3/FA4 with attn_mask_q by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6354
- [Feature] Support Norm before Rope. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/6332
- [CI] Update check-bypass.yml by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6360
- Remove useless MTP rebuil_padding code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6336
- [Metax][CI] update metax ci files by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6364
- [Metax][CI] restore 21b/28b ci test file by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6368
- [Feature] console print statistical metrics by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6339
- [Feature] support glm tp+dp+ep by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6317
- [KVCache] Storage cache supports c8 model by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6298
- [BugFix] fix cache transfer tasks failure after cache cleared by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6202
- [Metax][CI] add paddleocr ci test by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6379
- [Metrics] Support cpu-cache-block-num by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/6390
- [MTP] refactor MTP pre_process by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/6358
- [ci case]Prompt logprobs precision by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/6381
- [Others] support import deepgemm/deepep from fleet ops by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6351
- [BugFix]fix handle 4 return values from noaux_tc_redundant op by @mattheliu in https://github.com/PaddlePaddle/FastDeploy/pull/6384
- [CI] [Hackathon 10th Spring No.25] Add unit tests for fastdeploy/inter_communicator/zmq_server.py by @0Ayachi0 in https://github.com/PaddlePaddle/FastDeploy/pull/6210
- [Docs] Add Doc for Online quantification by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6399
- [BugFix] fix zmq hung when sampled_token_id=0 by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6398
- [loader] support wint2 backend by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6139
- [Engine] apiserver&engine exit when work failed to start by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6322
- [Docs]add environment_variables by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6385
- [Metax][CI] e2e ci tests enable cuda graph by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/6401
- [Optimization] Support logprob async copy by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6362
- [Feature] consider multimodal model when dummy run by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6045
- Revert "[Optimize] Optimize ttft for ep" by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/6402
- [BugFix][Iluvatar] Use paddle.device.get_device_properties for multi-platform compatibility by @mattheliu in https://github.com/PaddlePaddle/FastDeploy/pull/6400
- [Others] Exit to ensure no residual processes (cpu cache & dp) by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6377
- [BugFix] PD reorder fix and add ut by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6375
- [Feature] Unify the request-completion log format and enhance statistics by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/6405
- [XPU] change base XPU docker image by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/6411
- [CI] Fix stable_test and add cherry-pick automation by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6415
- [Feature] console print metrics add env by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6413
- [BugFix][Cherry-Pick] add reset shared inputs when update weight dummy run(#6331) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6418
- [Cherry-Pick][BugFix] Fix rebuild padding bug (#6422) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6425
- [Cherry-Pick][CI]fix fa4 test (#6408)(#6424) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6421
- [Cherry-Pick][BugFix] Fix model loading error for 300B FP8 EP parallel test case (#6382) by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6436
- [Cherry-Pick] Revert "[XPU] change base XPU docker image"(6427) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6459
- [Cherry-Pick 2.5][BugFix] Fix get_padding_offset in empty run by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6460
- [Cherry-Pick][BugFix]fix deepgemm import (#6451) by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6452
- [BugFix][Cherry-Pick] fix mtp acceptance rate decline cp (#6470) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6471
- [Cherry-Pick] [BugFix] fix num_cpu_blocks computation (#6438) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6473
- [Cherry-Pick][CI] Pin gunicorn version to 25.0.3(#6497) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6499
- [Cherry-Pick][CI] Optimize unittest and fix title format(#6464) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6468
- [Cherry-Pick][BugFix] fix console log metrics waiting queue count from [#6432] by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6453
- [Feature] support mm_processor_kwargs for flexible model by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/6491
- [Cherry-Pick][OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response-length limits and injected sequences by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/6511
- [Cherry-Pick][BugFix] Fix storage_backend_type comparison bug in cache_transfer_manager.py (#6514) by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6522
- [Cherry-Pick][BugFix] FlashAttnBackend Supports OpenSource Model run FlashMask(#6518) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6520
- [Cherry-Pick][Bugfix] cherry-pick [#6466] and [#6528] to release/2.5 by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6529
- [BugFix][Cherry-Pick] Fix reshard error(#6536) by @DrownFish19 in https://github.com/PaddlePaddle/FastDeploy/pull/6537
- [Cherry-Pick][CI] disable test_batch_invariance_op_mm.py in unit_test(#6548) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6549
- [Cherry-Pick] [BugFix] fix cache transfer manager init failed when using block_wise_fp8 and no storage backend (#6516) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6517
- [Cherry-Pick][BugFix][APIServer] Enable control socket disable option in API server (#6545) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6551
- [Cherry-Pick][BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation(#6541) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6542
- [Cherry-Pick][BugFix][RL] Set GPU flags for paddle in cache transfer manager (#6534) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6550
- [Cherry-Pick] [BugFix] fixup for cache transfer manager init failed when using block_wise_fp8 and no storage backend (#6516) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6564
- [Cherry-Pick] [BugFix] fix cache int8 for pd disaggregated deployment (#6563) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6571
- [Cherry-Pick][BugFix] lazy enable_torch_proxy for cutlass (#6523) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6585
- [BugFix][Cherry-Pick] Add safety checks in recycle_gpu_blocks to prevent block allocation errors(#6531) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/6530
- [Cherry-Pick][CI] Fix tests to resolve failure(#6557,#6572) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6590
- [Cherry-Pick][Feature]Supports SWA based on appendattn [#6547] by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/6594
- [Cherry-pick] support qkv&gate linear fusion [#6455] by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/6552
- [Cherry-Pick][BugFix] fix mtp_config in rl (#6595) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6596
- [Cherry-Pick] [BugFix] fix prefix tree updating timeout (#6615) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6616
- [CI] Switch 2.5 branch to use Paddle release/3.3 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6621
- [Cherry-Pick] [Bug Fix] Fix MM mtp incorrect rope emb(#6581) by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/6586
- [Cherry-Pick][BugFix] Fix exist_prefill_flag when preempted task exist (#6629) by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/6630
- Revert "[Cherry-Pick] [Bug Fix] Fix MM mtp incorrect rope emb(#6581)" by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/6633
- [XPU][CI]Update _build_xpu.yml by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/6640
- [Cherry-Pick] [RL] Support SM100 FP8 quantization in RL [#6601] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6602
- [Cherry-Pick][BugFix] fix flash attn mtp rope emb bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/6650
- [BugFix][MTP] Skip empty_input_forward during dummy run by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/6654
- [Cherry-Pick] [BugFix] Fix inaccurate cache hit rate and TTFT after request preemption by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6626
- [Cherry-Pick][Feature]weight only quant method support QKVGate_proj (#6612) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6669
- [Cherry-Pick][XPU][CI] Fix XPU CI Bug(#6658) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6679
- [Cherry-Pick][CI] Sync CI optimizations from develop to release/2.5 (#6645) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6684
- [Cherry-Pick][BugFix] Fix error in dynamic c8 cache (#6544) by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6692
- [Cherry-Pick] add reasoning effort & string arguments in tool (#6704, #6656) by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/6706
- [Cherry-Pick][RL]fix update param [#6723] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6722
- [Cherry-Pick][BugFix] Fix updating weight when enable cache storage (#6719) by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6720
- [Cherry-Pick] [BugFix] fix grpc failure when tracing init before workers forked (#6732) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6744
- [Cherry-Pick][BugFix][KVCache] Add inter-process lock to fix NaN error under DP+EP(#6724) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6769
- [XPU][CI]Cherry-Pick PR and Update CI Case by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/6619
- [Cherry-Pick] [BugFix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6759
- [Cherry-Pick][BugFix] Set MC_TCP_BIND_ADDRESS for mooncake store (#6782) by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/6783
- [Cherry-Pick] [Processor]add qwen3vl prompt_token_ids support (#6764) by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/6786
- [RL][Cherry-Pick] add stream guard (#6814) by @liufengwei0103 in https://github.com/PaddlePaddle/FastDeploy/pull/6823
- [Cherry-Pick][Feature] use phi permute/unpermute & rm swiglu (#6361) by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6808
- [Cherry-Pick] [BugFix] replace ftok with custom_ftok in get_output/save_output ops (#6822) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6824
- [Cherry-Pick][Loader]Add support for handling GPU memory fragmentation. by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6790
- [Cherry-Pick][Optim] Simplify available_blocks assignment logic (#6819) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/6874
- [Cherry-Pick][BugFix] Fix several bugs in the request interruption and inference termination functionality(#6743) by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/6890
- [Cherry-Pick][BugFix] Set FD_USE_PHI_MOE_PERMUTE = 0 Default(#6886) by @fxyfxy777 in https://github.com/PaddlePaddle/FastDeploy/pull/6888
- [Cherry-Pick][Optimization] Skip compat guard when torch is not installed(#6913) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6926
- [Cherry-Pick][RL] cherry-pick [#6862] support qkrmsnorm use proxy-norm by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/6859
- [RL][Cherry-Pick] Support Fully Async and PrefixCache by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/6727
- [Cherry-Pick][BugFix] Fix ErrorInfo code type(#6951) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6952
- [Cherry-Pick][CI] Sync develop optimizations to 2.5(#6745) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6964
- [Cherry-Pick][RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy(#6852) by @wikilsh in https://github.com/PaddlePaddle/FastDeploy/pull/6910
- [Cherry-Pick][RL] add worker_process no grad (#6971) by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/6972
- [Cherry-Pick][Speculative Decoding] Support suffix decoding (#6403) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/6967
- [Cherry-Pick][RL] Support moe_topk_select using Paddle native operators; add a fused stack-transpose-quant kernel for BlockWiseFP8 MoE weight quantization and a swiglu-fp8-quant op for DeepGemmFusedMoE training alignment (#6850) by @DanielSun11 in https://github.com/PaddlePaddle/FastDeploy/pull/6935
- [Cherry-Pick][Optimization] Use a separate driver when using Triton with Paddle (#6897) by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/6983
- [Cherry-Pick][Others] Fix PD reorder for MTP [#6792] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/6917
- [Cherry-Pick][CI] Sync develop fix and optimizations to 2.5(#6975) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/6987
- [RL][Cherry-Pick] RoPE without fmad opt (#6901) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/6902
- [Cherry-Pick] [Feature] support v1 update/clear api for RL (#6761) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/6974
- [Cherry-Pick][BugFix][RL]add instantiations for decoder rope enfore_fmul_rn=true(#7009) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/7010
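Several entries above concern the new runtime lifecycle endpoints for async RL (/v1/pause, /v1/resume, /v1/is_paused, documented in #6192 and added in #6052). As a rough illustration of the intended flow only — the host, port, and response shapes below are assumptions for the sketch, not taken from FastDeploy documentation:

```python
# Hypothetical sketch of driving the pause/resume lifecycle endpoints.
# Only the endpoint paths (/v1/pause, /v1/resume, /v1/is_paused) come
# from the release note; BASE and the JSON response shape are assumed.
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed local API server address


def endpoint(name: str) -> str:
    """Build the full URL for a lifecycle endpoint under /v1/."""
    return f"{BASE}/v1/{name}"


def call(name: str):
    """Invoke an endpoint and decode its (assumed) JSON body."""
    with urllib.request.urlopen(endpoint(name)) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    call("pause")       # stop scheduling new requests
    call("is_paused")   # poll pause state before weight updates
    call("resume")      # resume serving after the update completes
```

The network calls are guarded under `__main__` so the module can be imported without a running server; in an actual pause/update_weights/resume cycle the weight update would happen between the `pause` and `resume` calls.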
New Contributors
- @RuohengMa made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5397
- @zhang-chenyi made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5547
- @BingooYang made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5912
- @yanfeich made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5855
- @ChowMingSing made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/6059
- @CyanScholar made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/6056
- @Moonchild1227 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/6188
Full Changelog: https://github.com/PaddlePaddle/FastDeploy/compare/v2.4.0...v2.5.0