FastDeploy - Browse /v2.4.0 at SourceForge.net

The interactive file manager requires Javascript. Please enable it or use sftp or scp.
You may still browse the files here.

Name	Modified	Size	InfoDownloads / Week
Parent folder
README.md	2026-01-20	69.0 kB	0
v2.4.0 source code.tar.gz	2026-01-20	6.0 MB	0
v2.4.0 source code.zip	2026-01-20	7.6 MB	1
Totals: 3 Items		13.7 MB	1

核心推理能力与模型支持增强

支持文本 prompt_logprob 及全量 logprob 能力 [#4769]
支持离线推理中基于 ZMQ 的 logprobs / prompt_logprobs，并引入 max_logprobs 参数 [#4897]
支持在线推理中基于 ZMQ 的 logprobs / prompt_logprobs，并优化通信方式 [#5089]
新增 logprobs / prompt_logprobs 的 token_id 解码控制开关 [#5463]
受限解码新增 llguidance 后端 [#5124]
CUDAGraph 支持投机解码 Draft Model 加速(默认关闭)
[Speculative Decoding] 解耦 draft_tokens 后处理流程 [#5205]
支持 Pooling 模型 Runner
支持 Reward 模型
Pooling 模型通用 embedding 接口 [#4344]
Pooling 模型定制 reward 接口 [#4518]
新增开源模型 Ernie-4.5-VL-28B-A3B-Thinking 的 reasoning_parser，兼容 - / _ 命名规则 [#4571] [#4668]
支持通过 chat_template_kwargs.options.thinking_mode 控制思考开关
支持多模模型传入 prompt_token_ids 请求，并通过 messages 输入多模数据，实现 tokens-in / tokens-out 能力

并行架构、调度与 MoE 能力演进

GLM / Qwen 模型消除 EP 空跑时的通信开销 [#5254]
支持 MoE 分 chunk 执行 [#4575]
支持 EPLB（Expert Load Balancing）#4782
支持 EPLB 重排与冗余专家策略 [#5142] [#5143] [#5178] [#5239] [#5918]
支持路由重放机制
PD 分离支持 Deepseek V3 模型 EP 并行部署 [#5251]
PD 分离支持 Qwen3-MoE 模型 EP 并行部署 [#4691]
PD 分离支持 Prefill 与 Decode 使用不同 TP Size [#5296]
新增 Python 版本 Router，支持集中式与分离式部署调度 [#4709]
支持多步 MTP + CUDAGraph + PD 分离
支持 MTP 无损验证
支持 MTP 分 chunk [#5343]

多模态、缓存与量化能力增强

支持多模单 batch、纯文本多 batch 混合 Prefill 调度 [#4611]
支持多模 Prefix Cache [#4803]
动态量化支持 Prefix Cache [#5125]
修复并支持多模 Prefix Cache 与 CUDAGraph 同时开启 [#4679]
支持 W4AFP8 动态量化 [#5282]
支持静态 C8 scale 单独加载 [#4624]
完善 Machete 对不同量化 group size 的支持 [#4911]
支持 Flash Mask Attention Backend 接入 [#5104] [#5134] [#5387]
v1 Loader 加载性能优化 [#4532]
支持预编译包功能 [#4729]

多硬件平台支持扩展

P800

支持多模 Prefix Cache [#5356]
支持 PD 分离 [#5179]
支持思考模型思考强度限制 [#4761]
支持 TP + EP 并行 [#4688] [#4836]

Intel HPU

新增 Prefix Caching 支持 [#4971]
新增 Chunked Prefill 支持 [#5289]

Iluvatar GPU

支持 ERNIE-4.5-21B-A3B 与 ERNIE-4.5-VL-28B-A3B-Thinking [#4774] [#4995]
修复多项 CI 问题 [#4972] [#5012] [#5100]

MetaX

支持 ERNIE-4.5-VL-28B [#4820]
新增 Cutlass MoE [#4602] [#4685] [#5128]
支持 default_v1 loader [#4956] [#5001]
优化 Flash MLA 性能 [#4915]
新增 Triton MoE 的 default_v1 loader 与 quant_config [#5030]
支持 ENABLE_V1_KVCACHE_SCHEDULER [#5163]

性能优化、可观测性与稳定性修复

性能与通信优化

AppendAttn 算子支持 CUDA-PDL [#5072]
DeepGemm H2D 消除 [#5262]
优化集中式 EP 通信逻辑 [#5145]
移除 CUDA Graph 下 Append Attention 的 DtoH 同步开销
支持两阶段低时延通信 [#4162]
支持 TP + EP 混合并行 [#4615] [#5315] [#5353]
默认编译 RDMA，降低多模 CUDAGraph 开销

可观测性与安全

支持基于请求级别的细粒度链路追踪 [#5458]
添加 trace_id / span_id 自动注入与开关 [#4692] [#5765]
新增 --api-key 权限校验参数 [#4806]

稳定性与 Bug 修复

修复 logprob / prompt_logprob 计算、序列化及通信相关问题 [#4681] [#4884] [#5237] [#5335]
修复 EP、PD 分离、MTP、Prefix Cache、量化、多模态等多类推理场景下的稳定性问题
修复多硬件（XPU / MetaX / Luvatar / P800）算子与参数校验问题

What's Changed

[BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4553
[BugFix] Fix graph opt test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4634
[Feature] add mm token usage by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4570
[XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4636
[Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4658
[XPU] fix pos_emb_type bug by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4638
[Docs] add Qwen25vl yaml by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/4662
[Feature] add a new reasoning parser by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4571
[XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4665
add noaux_tc to unitest fused_moe by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4656
[EP] fix several bugs in data parallel by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4657
[OP] Add InferShape&InferDtype for per_token_quant_padding by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4667
【Hackathon 9th No.86】autogen MoeFastHardamardImplWrapper template_instantiation by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4592
[UT] Add ut for speculative sampler by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4650
[Doc] update docs by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4675
[Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4601
[CI] Add test for paddleocr_vl by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4627
[unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4676
[noauxtc_kernel] remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4643
[BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4686
[BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4687
[BugFix] fix --logprobs-mode raw_logits by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4681
[XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4695
[XPU] [CI] Add Vl case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4649
[BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4582
[Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4668
[BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/4701
[XPU] [CI] Lock xvllm version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4715
[Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4645
[Feature] support mtp distribution equivalence verification by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4699
[KVCache] Support kv cache scale load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4624
add flops and bandwidth to test_ffn.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4704
benchmark工具支持受限解码场景指定response_format by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/4718
[CI] add missing unit tests for tokenizer_cli by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4620
[Scheduler] update v1 prefill batch by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4611
[BugFix] Fix profile run in pd-disaggregated deployment by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4584
[BugFix] fix mm prefix_cache cuda error bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4679
[Feature] Check bos url by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4711
[BugFix] fix wint2 config by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4721
[FDConfig] [PD Disaggregation] [Graph Optimization] Close Cudagraph for P node when PD Disaggregation by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/4632
[XPU] xpu support neox style ROPE by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4719
[BugFix] Skip building native architecture when specifying arch list by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4727
fix noaux by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4731
[BugFix] fix thinking bug by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4710
[CI] Fix rollout_model test logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4730
[Feature] support pooling model runner by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/4590
format code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4720
[CI] fix some ci yaml by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4747
[Docs]Update XPU document version to 2.3.0 by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4741
[Speculative Decoding][MTP]Support mtp in splitewise and scheduler_v1 mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4743
[Speculative Decoding][MTP]Support attn mask offset by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4641
[Docs]Add parameter to the start service command by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4753
[Docs]Add parameter by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4755
[Docs] fix PaddleOCR-VL docs bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4702
[Feature] Support eplb for fd by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4599
[XPU] add v1 support for bf16 by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4744
【DataProcessor】add options thinking_mode by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4735
[Optimize] Support and robust for tpN for PD by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4595
[Docs] fix error by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4768
[CI]test common model by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4697
[Metax] adapt cutlass moe for ernie-vl by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/4685
fix dynamic Cfp8 for RL load by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/4144
[Docs] PaddleOCR-VL add RTX3060 server param by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4765
[BugFix] fix deepseek cuda error by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4739
[XPU][CI] fix ci base value bug by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4783
[OP]Fix attn_params by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4787
[CI]delete test_common_model by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4794
[XPU] fix thinking bug where output only contains reasoning_content by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4761
[XPU] add deployment doc for PaddleOCR-VL in XPU by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4784
[BugFix] Fix ernie4_5_vl_processor.py and qwen_vl_processor.py can not disable thinking by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4762
supports internode_ll_two_stage by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4162
supports pd partn by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4615
[Docs] Add new support models by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4801
[CI] Refactor CE wheel upload for multiple target paths by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4790
[Docx] update mkdocs.yml by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4804
[BugFix] Fix step_shm_value in PD disaggregated deployment by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4780
Update Unit Test for PaddleOCR-VL by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4802
[Metax] adapt cutlass moe and fix mla attention for DeepSeek by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4602
[Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4769
[get_padding_offset.] clean get_padding_offset.cu by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4777
support ep+tp at op layer by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4688
[BugFix] fix reasoning parser register name by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4795
remove input_ids from ForwardMeta by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4793
[Feature] Add timestamp for profiler by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4726
[XPU]Support V1 loader in weight_only Model by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4808
[Bug Fix] process transparent image by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4807
add paddleocr_vl benchmark by @zhang-prog in https://github.com/PaddlePaddle/FastDeploy/pull/4833
[Doc] Update docs for v2.3.0rc0 by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4828
[BugFix] fix messages being inplace modified in offline chat api by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4831
【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4272
[CI] fix docker_build error and add tag-base by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4810
[PD Disaggregation] Support Qwen3-MoE use PD + EP inference. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4691
remove seq_lens_this_time by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4821
[BugFix] Fix ernie_vl_reasoning_parsers.py 'end_token' to 'think_end_token' by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4805
Fix: ci port conflict by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4840
[CI] Add unittest for activation, native_paddle_backend, w4a8, w4afp8, platforms/utils by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4812
[XPU][CI]Change ci vl model to 28 b by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4764
[Fix] fix ernie4_5_vl model torch format loadding by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/4447
[Feature] [PD] add simple router and refine splitwise deployment by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/4709
[Docs] fix: correct typo in nvidia_gpu.md by @playaswd in https://github.com/PaddlePaddle/FastDeploy/pull/4848
[BugFix] Fix list to List by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4818
[BugFix] Del get_act_fn, _load_st_projector by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4824
[Benchmark] Enhance benchmark output logging by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4682
[XPU] ep+tp all2all by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4836
[CI] Add Check PR Template by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4481
Revert "【New Feature】W4afp8 supports per group quantization" by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4854
[CI] Update deploy.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4850
[CI] Optimize port cleanup logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4860
[Bug Fix] fix ernie4_5_vl_moe by @LokeZhou in https://github.com/PaddlePaddle/FastDeploy/pull/4843
Revert "[Bug Fix] fix ernie4_5_vl_moe" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4863
[Feature] support mm disable_chunked by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4803
[CI] Update ERNIE-4.5-VL baseline to adapt to MoE changes by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4867
[CI] Refactor check-bypass logic in run_tests_with_coverage by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4655
[Others] Delete PaddleOCR Useless Function by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4815
[Feature] Optim PaddleOCR-VL by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4873
[XPU] fix ep_tp all2all ci by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4876
[XPU] modify 424B model deployment parameter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4888
[XPU][CI] Ci bug fix by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4889
[BugFix] fix token_processor zmq by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4827
[CI] fix docker_build error of ciuse by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4886
[Metax] support ERNIE-4.5-VL-28B by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/4820
[BugFix] max_lgprobes=-1 maps to ori_vocab_size by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4884
[Feature] Enable FastDeploy to support adding the “--api-key” authentication parameter. by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4806
[Docs]Supplement the English and Chinese user documentation for Tool calling by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4895
[XPU][CI]Update test assertion and base response value by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4907
[BugFix] When the value of "temperature" is 0, adjust it to 1e-06 by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4900
[Docs] add api-key usage instructions by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4902
[CI] Add four unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4906
[Bug Fix] fix bug for PD EP by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4823
[DeepEP] support async prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4899
[XPU]Update documentation by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/4917
[Docs] Improve reasoning_out docs by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4901
[BugFix] Fix inference_start_time by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4922
[BugFix] Add support for weight shape constraints and group size selection in Machete by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4911
[XPU] [CI]Change CI to multi-concurrency by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4866
[Docs] add doc for glm by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4933
[Opti] Unlimit zmq message lens limit by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4465
[TSP] Support qwen3 moe tsp + cudagraph by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4871
Update docs for v2.3.0 by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4938
[Docs] add ERNIE-4.5-VL-28B-A3B-Thinking instruction by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4937
[BugFix][Models] Add tie_word_embeddings for lmhead by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4916
[Iluvatar] add vl into ci and support v1 loader by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4774
[Docs] add ERNIE-4.5-VL-28B-A3B-Thinking instruction by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4944
[XPU][Doc]Update XPU release2.3 note by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4939
[XPU] fix xpu deployment md by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4941
[CI][XPU]Update run_ci_xpu.sh to lock paddlepaddle-xpu version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4949
[Perf] Support tensor transmission between work and engine with zero-copy to improve efficiency by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4839
[PD Disaggregation]Replace paddle.max by numpy to avoid useless error log by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4893
[CI] Update test_api_key.py by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4948
[Others] Add Tests for GPU Model Runner and Logprobs Output by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4913
[Iluvatar][Doc] Add ERNIE-4.5-VL-28B-A3B-Thinking doc by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4955
[ATTENTION] by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4945
[CI][XPU]Update health check endpoint to use port variable by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4965
[CI] fix apt_sources error of focal in docker_build by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4961
[Loader] Refactor PT model loading by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4532
[CI][XPU] Change Paddle Version to Nightly by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4973
[CI] Add five unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4958
[Docs] Add License in Unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4957
[CI] remove useless tests in docker_build by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4974
[CI] Update PORT range to avoid conflict with system ports by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4953
[Benchmark] Add GEMM & MoE kernel bench by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4809
[Iluvatar][CI] fix safetensors_rust.SafetensorError: framework paddle… by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4972
[KVCache] support unified cache backend by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4903
[loader]Update requirements and xpu ci by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4969
[CI][XPU] Fix EP Case Bug by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4976
[BugFix] Avoid loading training file by @BossPi in https://github.com/PaddlePaddle/FastDeploy/pull/4966
[Metax] support default_v1 loader & thinking model by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/4956
[Metax] optimize flash mla by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4915
[Docs] remove load default_v1 since already been as default by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/4980
[XPU] fix text_image_gather_scatter op by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4882
[Logprobs]Support prompt_logprobs and max_logprobs by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4897
[CI] fix test_model_cache by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4982
[BugFix] fix VL fp8 bug when moe token_num is 0 by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4928
[BugFix] Fix mtp tsp by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4990
[CI] set DG_NVCC_OVERRIDE_CPP_STANDARD in test_quantized_linear by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4995
[FDConfig] add block number verfied by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4983
[Optimization] Skip memcpy(DtoH) capture in get_block_shape_and_split_kv_block by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4988
[BugFix] fix num_requests_running after clear_data by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4927
[worker_process.py]modify some var name by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4749
[Loader]Fix and complete the MTP loader by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4985
[XPU] [CI] Change CI ep test from offline to online by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/4885
[BugFix][Metax] Fix metax compile issue in get_block_shape_and_split_kv_block by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5000
[Feature] Enhance build script, add pre_wheel logic by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4729
【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4987
optimize dy_cfp8's performance by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4126
[BugFix] adjust max_tokens and min_tokens when continue to generate tokens by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5010
[PD Disaggregation] remove splitwise deployment on single node and refine the code by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/4891
[CI]【Hackathon 9th Sprint No.56】NO.56 功能模块 fastdeploy/multimodal/utils.py 单测补充 by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/4954
[Docs] Fix broken commitID by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5008
[CI] Temporarily lock paddlepaddle-gpu as of 20251112 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5017
[ATTENTION] unitest by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4962
[Executor]move batch_id_per_token by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4853
[BugFix] Revert skip capture by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5023
[Others] check args max_logprobs by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5018
[CI]【Hackathon 9th Sprint No.32】NO.32 功能模块 fastdeploy/input/ernie4_5_vl_processor/process_video.py 单测补充 by @WintersMontagne10335 in https://github.com/PaddlePaddle/FastDeploy/pull/5011
[Optimization] xgrammar async compile, multi thread, speed up by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/4835
[CI][XPU] Optimize CI logs and variable names by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5025
[Intel HPU] enable level 1 prefix caching and fix some bugs by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/4971
[Iluvatar][CI] Fix moe_expert_dispatch cannot support dequant_scale by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/5012
【Fix】fix deepep dispatch by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/5036
[Metax] support default_v1 loader and quant_config is None for triton… by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/5030
[APIServer] metrics use port the same as api_port by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/5016
[Log] Add trace log and add loggingInstrumentor tool by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4692
[CI]【Hackathon 9th Sprint No.13】NO.13 功能模块 fastdeploy/model_executor/ops/triton_ops/triton_utils.py 单测补充 by @WintersMontagne10335 in https://github.com/PaddlePaddle/FastDeploy/pull/5035
【Hackathon 9th No.109】[CppExtension] Support build Custom OP in setuptools 80+ -part by @megemini in https://github.com/PaddlePaddle/FastDeploy/pull/4977
[CI]【Hackathon 9th Sprint No.28】NO.28 功能模块 fastdeploy/model_executor/ops/triton_ops/triton_utils_v2.py 单测补充 by @WintersMontagne10335 in https://github.com/PaddlePaddle/FastDeploy/pull/5073
[BugFix] rollback max_tokens and min_tokens when continue to infer by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/5052
[Intel HPU] fix bugs caused by other commits by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5074
[XPU][CI] fix ci case bug by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5084
Revert "[BugFix] Revert skip capture" by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5080
[Fix] Fix block allocation issue when MTP and logprobs are enabled by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/5077
revert group size 3 by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5079
[INTEL_HPU] enabled fastdeploy PR testing by @FocusLuo in https://github.com/PaddlePaddle/FastDeploy/pull/4596
[Feature][OP] Append Attn Support CUDA-PDL by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5072
【Hackathon 9th No.76】supplementary unit test for XGrammarChecker by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4075
[CI] Enable check_pr_template in CI rerun by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5093
[Metax] support default_v1 loader based [#4988] by @StareAtYou in https://github.com/PaddlePaddle/FastDeploy/pull/5001
[Iluvatar][CI] disable compiling cudaLaunch API by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/5100
Revert "[CI] Temporarily lock paddlepaddle-gpu as of 20251112" by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5098
[OP] format flash_mask_attn by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5104
[unitest]clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5094
[Docs]fix_cli_docs by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/5109
[BugFix] unify max_tokens by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4968
[HPU][CI]Update Docker image in CI workflow by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5108
[PD Disaggregation]Fix dummy run when use PD Disaggregation with EP inference. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/5112
[Feature] ThreadPoolExecutor async fill_token_bitmask by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5083
[XPU][Docs]Update document by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5091
[CI]【Hackathon 9th Sprint No.31】NO.31 功能模块 fastdeploy/input/ernie4_5_processor.py 单测补充 by @WintersMontagne10335 in https://github.com/PaddlePaddle/FastDeploy/pull/5097
[RL]Resolve shape mismatch problems in RL-related modules by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5032
[CI]Exclude abstract methods and irrelevant backend files by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5031
[CI] add metrics case by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5115
【Hackathon 9th No.109】[CppExtension] [XPU] Support build Custom OP in setuptools 80+ -part by @megemini in https://github.com/PaddlePaddle/FastDeploy/pull/5106
[Docs] add ebvlthinking yaml by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/5120
[Metax][BugFix] Fix METAX_GPU OPs Compile Error by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5114
[Feature] Add an unquantized option for MoE and Dense quant type by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4813
[BugFix] rollback max_tokens and min_tokens when continue to infer by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/5082
[CI] Add workflow to auto-remove skip-ci labels after new commits by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5129
[BugFix] Support skipping activation scale loading for w4afp8 by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5117
[Feature] support async download features by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5003
[CI] Temporarily lock paddlepaddle-gpu as of 20251118 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5136
[HPU][CI]Hpu ci update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5116
[Speculative Decoding][MTP]Support stop_seqs and pd-split mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5029
[Metax] optimize cutlass moe and flash attention backend by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/5128
[Scheduler] Support chunk prefill for video input by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/5107
[Others]get_block_shape_and_split_kv_block clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5123
[Optimization] default compile rdma, reduce cudagraph buffer size in mm, fix some config bug by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5121
[Others] clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5133
[CI][XPU] Add XPU chunked_prefill && prefix_caching case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5139
[Graph Optimization][SOT] Eliminate BreakGraph by move import stmt to top by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/5146
[BugFix] Fix zero workspace returned by CUB size query under CUDA Graph in MoE dispatch by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/5087
[BugFix] [PD Disaggregation] Fix schedule error in splitwise deployment by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5149
[BugFix] [PD Disaggregation] fix v1 scheduler prefill node profile run & ipc transfer protocol by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5132
[Feature] support bos download retry by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5137
[CI] Unified diff coverage upload logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5127
[CI]【Hackathon 9th Sprint No.51】NO.51 功能模块 fastdeploy/scheduler/dp_scheduler.py 单测补充 by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5046
[PD Disaggregation][XPU] Add XPU support for PD disaggregation by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5113
[Feature] Support noaux for eplb by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/5143
[RL]Fix missing is_distributed attribute by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5150
[ENV] support AK SK ENCPOINT while get the multi_modal's feature by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5159
[Speculative Decoding][MTP] Support static CacheKV C8 quantization and optimize memory usage by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5155
[PD Disaggregation] [Refine] Refine splitwise deployment by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5151
[Fix] Fix noaux ep test by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/5161
[Polish] Simplify repr method in Request class by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5153
[BugFix] fix num of rdma_comm_ports check by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5168
[Optimization] Improve perf for fd response token with internal adapter by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4992
[BugFix] fix reschedule with mtp + logprob by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5165
[Feature] dyc8 support prefixcache by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5125
[Feature] remove to_numpy by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5162
【Hackathon 9th No.109】[CppExtension] 添加 fastdeploy_ops 目录到 package_data 以支持现代打包方式 - part by @megemini in https://github.com/PaddlePaddle/FastDeploy/pull/5156
[CI] fix coverage_report in daily test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5175
[Others] unitest tests/layers/test_attention_layer.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5174
[CI] Ignore new custom ops stub file in coveragerc by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5177
[CI] add output for last_token in test_streaming_with_stop_str by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5170
[XPU]Update documentation by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5180
[Fix] Fix eplb bug and support fp8 load weight by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/5178
[CI] 【Hackathon 9th Sprint No.18】NO.18 功能模块单测补充 -part by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5064
[BugFix] fix release block ids by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5184
[XPU][CI] change VL model to 28B-VL-thinking by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5169
[Feature] Supports separate loading of offline quantization for moe. by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/5142
[Metax] support ENABLE_V1_KVCACHE_SCHEDULER by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/5163
[Feature] support eplb in api_server by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4782
[BugFix] dummy import some ops by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5192
[CI] Update redis download source for docker_build failure fix by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5198
[Bug fix] Send first token in D instance by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5199
[BugFix] [OP] Fix the error in MoeExpertFFN operator when valid_token_num=0 by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5196
[CI] Add Unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5187
[CI] 【Hackathon 9th Sprint No.17】NO.17 功能模块单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5054
[CI] 【Hackathon 9th Sprint No.24】NO.24 功能模块单测补充 by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5055
[Speculative Decoding][MTP]Update extract_mtp_weight script and optimize config by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5183
[XPU] [CI] Xpu ci lock PaddlePaddle Version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5218
[BugFix] fix work metrics not returned by metrics api by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4912
[BugFix] fix mm_positions type error by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5182
[Benchmark]add qwen3-235b pd+ep yaml by @xiegegege in https://github.com/PaddlePaddle/FastDeploy/pull/5225
[CI] Add Cherry-Pick PR check logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5191
[FDConfig] disable use_sequence_parallel_moe default by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5222
[Feature] The 45VL supports prompt_token_ids + messages input. by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5148
[Feature] enable guided decoding ENABLE_V1_KVCACHE_SCHEDULER = 1 by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5140
[Docs] add docs of base64 or local file mm inputs by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/5193
[Metrics] Update time_to_first_token to include tokenization & queue time, and remove redundant metrics by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4993
[Docs] add request params by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/5207
[Speculative Decoding]Fix attention mask offset by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5208
【BugFix】Fix logprob.slice_row inplace Error by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5237
[BugFix] fix prompt_token_ids is None in request dict in llm.generate by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5241
[Fix] fix eplb noaux by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/5239
[BugFix]Fix attention mask bug in D-Node of PD-split mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5245
[BugFix] BF16 MoE Cutlass Backend Support EP by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5242
[BugFix] fix vl performance bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5181
[Optimization] Refine row parallel bias and nranks and moe all_reduce by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5247
[CI] 【Hackathon 9th Sprint No.33】NO.33 功能模块单测补充 -part by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5056
[Speculative Decoding] split draft_tokens into standalone post-processing path by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/5205
[BugFix] fix mtp logprob bugs in chunk prefill by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5244
[CI]【Hackathon 9th Sprint No.50】NO.50 功能模块 fastdeploy/entrypoints/engine_client.py 单测补充 -part by @essos-bot in https://github.com/PaddlePaddle/FastDeploy/pull/5045
[BugFix] fix cuda-python requirement by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5261
[CI] 【Hackathon 9th Sprint No.41】NO.41 功能模块单测补充 -part by @xunyoyo in https://github.com/PaddlePaddle/FastDeploy/pull/5062
[PD Disaggregation] Add unittest for splitwise deployment with using rdma by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5189
[BugFix][Metrics] Fix Prometheus Multiprocess Metrics Issues and Add ZMQ Communication Metrics by @fl0w2o48 in https://github.com/PaddlePaddle/FastDeploy/pull/5185
[XPU] support kernel for mtp(base) by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/4748
[Docs] add qwen25-vl docs by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/5243
[CI] disable test_engine_client.py unit test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5272
[CI] fix run batch unit test by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4628
[BugFix]fix v1 loader lm head fp32 by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5270
[CI] Fix test streaming with stop str by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5275
[XPU][CI] Set pip index URL to Tsinghua mirror by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5277
[Feature] support flash_mask_attention backend by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5134
[CI][XPU] add pd disaggregation by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5179
Revert "[CI] 【Hackathon 9th Sprint No.33】NO.33 功能模块单测补充" -part by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5286
[BugFix] fix tsp o_proj bias add by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5284
[BugFix] race condition [is_fetching] causing multiple fetch requests by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5238
[BugFix]Set default OMP_NUM_THREADS=3 and fix extra GPU memory usage in DeepSeek by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5219
[Others] add PADDLE_ENFORCE by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5288
[OP]Remove extra H2D in DeepGemm. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/5262
[Feature] add bos config check by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5273
[Others] clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5235
[FDConfig] remove engine client args, use fd_config instead by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5217
[Benchmark] Support random input by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/5298
[Intel HPU] change MoE weights and scales from list to tensor and add… by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5289
[APIServer] add_prompt_ids_test by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/5283
[BugFix] fix aksk check bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5295
[BugFix] fix mm to_dict bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5300
[xpu] support mtp for xpu(mix) by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5274
[Features] add audio request & fix embedding bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5201
[Deterministic] Move paddle version batch invariant pkg to Fastdeploy by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/4763
[Feature] support chunked moe by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/4575
[XPU][CI]Change W4A8 Case Base Value by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5309
[CI] Update build_docker to paddle_manylinux by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5226
[CI] Remove need approve by yuanlehome by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5310
[PD Disaggregation] support different tp_size for prefill and decode by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5296
[XPU] fix gather_next_token by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5311
[XPU][CI] Change XPU CI Base Value by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5318
[Optimization] EP empty_input_forward Remove Communication by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5254
[CI]add clear to run-batch ci by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/5307
[CI] disable test_chunked_moe.py in unit_test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5322
Revert "[CI] 【Hackathon 9th Sprint No.41】NO.41 功能模块单测补充 -part" by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/5291
Revert "[CI] 【Hackathon 9th Sprint No.18】NO.18 功能模块单测补充 -part" by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/5290
[LogProbs]Enable prompt logprobs output and modify data transmission method for the online interface. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5089
[PD Disaggregation] Support PD deployment of DeepSeekv3. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/5251
[Feature] support reward model by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5301
[XPU]add enable_logprob by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5279
[CI] Fix return_code check in test_chunked_moe.py by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5326
[CI] Update test_docker to paddle_dev by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5278
[XPU] [CI] Xpu Ci Refactor by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5252
[UNITEST] add test by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5305
[Intel HPU] add example benchmark scripts for hpu by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5304
[Quantization] Support w4afp8 MoE dynamic quantization by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5282
[CI] Disable queue state assertion temporarily by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5329
[CI] Add env ci by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5331
[CI] Allow occasional distributed worker exit_code by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5341
[Optimization] supports mtp split_kv_attn, unified to append scenarios by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5343
[CI] Add RD in env CI. by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5345
[Optimization]1.fix tp+ep moe_forward; 2.set max_prefill_batch=env.MAX_PREFILL_NUM by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5315
[BugFix] Fix EP issue in the CUTLASS MoE backend by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/5337
[CE]add wint4 ep by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/5355
[Optimization]1.fix tp+ep moe_forward; 2.set max_prefill_batch=env.MAX_PREFILL_NUM by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5353
[bugfix]remove metrics middleware by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/5332
[XPU] xpu support mm prefix cache by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5356
[Feature] Guided Decoding add LLguidance backend by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/5124
[Feature] support audio tts by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5333
[FIX BUG] fix bug in TP in permute_x_fp8_kernel by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5350
[BugFix] dynamic cache kv block_wise_fp8 not need create layer.cache_k_scale by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5362
[Optimization] Requirements remove version for setuptools, uvicorn, triton and safetensors, del fastsafetensors by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/5330
[BugFix] Fix issues related to data retrieval logic, parameter validation, and result serialization in both online and offline interfaces. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5335
[Bug fix] fix pooling models by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5358
[Intel HPU] fix memory fragmentation issue and fix moe all_reduce issue by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5357
[BugFix] Reduce timeout in unittest by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5366
[Models] Add forward_meta to moe models' forward function by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5138
[PD Disaggregation] support DP via v1 router and decouple DP and EP by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5197
[Docs] update FAQ with logprobs MQ limits and deprecation by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/5368
[BugFix] Exit if neither modern nor legacy wheel dir not found by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5367
[FUCK] remove fastsafetensors by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5371
[RL] [BugFix] update check_model_weights_status loop by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5249
[Fearture] Support cache kv cache for output tokens by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4535
[BugFix] fix get_request from scheduler by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/5369
[CI] disable test_schedule_output.py in unit_test by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5377
[Loader]Adapting DeepSeek weights for PyTorch loading. by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5373
[XPU] [Optimization] [EP] EP communication optimization. by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5145
[BugFix] Compatible with asynchronous functions by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/5378
[XPU] support XDNN downloading function by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/5365
[Intel HPU] fix bug about RP 5138 by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/5380
[XPU] [CI] Change Paddle Version to Nightly by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5346
[XPU] bug fix block attn in mix mtp by @cmcamdy in https://github.com/PaddlePaddle/FastDeploy/pull/5384
[BugFix] Fix flash_attn_backend by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5387
[BugFix] Fix the issue of redundant logging for certain events in the trace_logger by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5386
[Feature] support Two batch overlap, mainly used in Prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5078
[XPU] redirect xvllm/xtdk/xhpc downloading log by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/5388
[XPU] support moe_expert_ffn TGEMM selection by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/5375
[Optimization] Qwen2.5-VL support multi-batch prefill by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/5269
[BugFix] fix scheduler hang when input length is very close to max_model_len by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5393
[XPU] support ep4tp1+v1 loader by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5398
[BugFix] fix async download bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5349
[BugFix] fix mtp prefix_cache dy-c8 bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5390
[BugFix]Fix plugin loading logic and logging messages by @wangyuwen1999 in https://github.com/PaddlePaddle/FastDeploy/pull/4909
[BugFix] fix top_p_candidates by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5400
[Reverted][RL] Support Rollout Routing Replay by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5321
[Bug fix] Fix the multi-input accuracy issue in the pooling model. by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/5374
[Others]remove _execute_empty_input by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5396
Revert "[RL] Support Rollout Routing Replay" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/5402
[Cherry-Pick][Loader]fix deepseek torch loading [#5410] [loader]fix bf16 deepseek [#5379] [Loader]Adapting DeepSeek weights for PyTorch loading by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5411
[Cherry-Pick][New][RL] Support Rollout Routing Replay (#5405) by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5408
[Cherry-Pick][Loader][BugFix] Fix some parameters place on CPU in PaddleOCR-VL (#5413) by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/5414
[BugFix][Cherry-Pick] fix can not enter into cuda graph by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5423
[Cherry-Pick] [BugFix] [RL] remove shutdown_process_group/restart_process_group for RL (#5433) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5434
[Cherry-Pick][BugFix] 0 not into cuda graph to save memory (#5426) by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5432
[NewFeature]support dynamic load for normal by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5437
[Cherry-Pick][Optimization] compulte real max_logprobs in batch (#5430) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5448
[Cherry-Pick] allow 0-dim tensor into ar by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5452
[BugFix] fix limit_thinking bug by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5469
[Cherry-Pick][CI] Fix attention bug in spec decoding(#5460) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5481
[Cherry-Pick][CI] ep+prefix cache+chunk prefill(#5489) by @zccjjj in https://github.com/PaddlePaddle/FastDeploy/pull/5490
[Cherry-Pick] [BugFix] fix instability after clearing weight (#5493) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5487
[Cherry-Pick][RL]Fix RL weight loading issue in moe layer [#5503] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5505
[[Cherry-Pick][BugFix] fix hung when n>1 and --enable-logprob (#5492)(#5499) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5498
[Cherry-Pick] [BugFix] [RL] skip model executing after clearing/updating is done (#5527) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5523
[Cherry-Pick][Feature][Optimization] Qwen Dynamic C8(#5486) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5536
[Bug Fix][Cherry-pick] Fix bug for caching output when preempted(#5502) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5510
[Cherry-Pick][BugFix] fix dynamic c8 in v1 loader(#5562) by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5519
【NewFeature】support load fp8 weight by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5566
[Cherry-Pick][CI] Adape unit_test due to incompatibility change(#5578) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5583
[Cherry-Pick][RL] R3 Support RDMA Store(#5467) by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/5468
[Cherry-Pick][CI]Support different inferseed in speculate decoding(#5568) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5597
[Cherry-Pick][Feature]Add a switch for logprobs/prompt_logprobs token decoding.(#5463) by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5572
[Cherry-Pick][CI]Fix write qknorm cache bug in speculative decoding(#5491) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5617
[Cherry-Pick] Support for request-level speculative decoding metrics monitoring.(#5518) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5614
[Cherry-Pick][Others] Maintain the mtp branch temporarily. (#5446) by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/5621
[Model] tp+ep support v1_loader by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5600
[Cherry-Pick][BugFix] fix speculate_limit_thinking_content_length [#5590] by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5615
[Cherry-Pick][RL]Support loading weights via the load_weights function for RL [#5549] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5602
[Cherry-Pick][BugFix] fix rl model_weights_signal to support tp>1 [#5639] by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5637
[Cherry-Pick][RL]Fix RL load_weights [#5642] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5643
[Cherry-Pick][BugFix] cp fix_cpu_cache_bugs(#5544) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5577
[Cherry-Pick][BugFix] fix rl model_weights_signal to support tp>1 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5650
[Cherry-Pick][XPU] logprob bug [#5626] by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/5636
[Cherry-Pick][BugFix] Cp fix eb5 prefix cache(#5638) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5644
[Cherry-Pick][Others]Prevent core dumps during Paddle version check [#5657] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5659
[Cherry-Pick][BugFix] Fix custom_all_reduce overflow (#5662) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5667
[Cherry-Pick] [RL] provide options for whether shutdown comm group after weights cleared (#5663) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5664
[Cherry-Pick][BugFix] fix rl signal [#5681] by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5678
[Cherry-Pick][XPU]Set top_p=0.0 by default on XPU to optimize performance(#5686) by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5688
[Cherry-Pick][CI] Support multi-step mtp with cudagraph (#5624) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5670
[Cherry-Pick] [BugFix] fix double shutdown of comm group when rank0 clears weights slower than other ranks (#5715) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5710
[Cherry-Pick][CI] Revert adapt vl_model baseline changes due to Paddle update(#5732) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5733
[Cherry-Pick][Feature] Entropy calculation support [#5692] by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5731
[Cherry-Pick][BugFix] Fix Chunked Prefill when max_tokens=1(#5736) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5747
[Cherry-Pick][CI] Refactor RL tests to reuse upload_clear(#5741) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5755
[BugFix][Cherry-pick] Set enable_cache_output as false by default(#5751) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5752
[Cherry-Pick][Others]upgrade paddleformer to 0.4.0 [#5599] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5716
[Cherry-Pick][Loader]Fix bug in MTP weight loading [#5744] by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/5745
[cherry-pick] support FA3 in mixed mode and support Qwen3 rope by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/5655
[BugFix][Cherry-Pick] cp fix logprob bug(#5604) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5770
[FDConfig][Cherry-Pick] Cp disable mm chunked(#5774) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5775
[BugFix][Cherry-pick] Fix preemption out of real_bsz(#5805) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5806
[Cherry-Pick] Fix process_response_dict to support async in serving_completion (#5758) by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/5802
[Cherry-Pick] Support flexible model by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/5749
[Cherry-Pick][BugFix] Fix _disable_sequence_parallel_moe_if_needed#5740 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5811
[Cherry-Pick][Feature] support glm fa3 (#5586) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5810
[Cherry-Pick] [BugFix] fix shm opened but not closed in set_data_ipc (#5826) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5827
[Cherry-Pick][RL] add lm_head_fp32 in RolloutModelConfig(#5825) by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5824
[Cherry-Pick][BugFix] Fix entropy bugs (#5818) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5819
[BugFix][Cherry-Pick] eb5 mm skip prefix cache(#5838) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5839
[Cherry-Pick][Speculative Decoding] Optimize draft logprob (#5842) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5843
[Cherry-Pick] [BugFix] fix cache manager not launched in case of mtp or blockwise fp8 (#5840) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5841
[Cherry-Pick][BugFix] cp skip_mm_revert(#5848) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5849
[Cherry-Pick][Optimization] Optimization for gather_logprob by 10GB (#5817)(#5846) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5834
[Cherry-Pick][XPU]MAX_BSZ aligns gpu settings and disable prefix cache in OCR VL (#5831) by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/5845
[XPU][CI]Release ci update by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5687
[Cherry-Pick][CI] Fix archive URL injection and add retry(#5725,#5828) by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5832
[Cherry-Pick][APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT(#5865) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5867
[Cherry-Pick][RL] Change 'model' to the instance variable 'tmp_model'(#5872) by @tianhaodongbd in https://github.com/PaddlePaddle/FastDeploy/pull/5873
[Cherry-Pick][BugFix]support fa3 qwen-vl rope (#5869) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/5877
[BugFix] Fix speculate metrics bug by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5875
[Cherry-Pick][CI] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes(#5738) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5793
[Cherry-Pick][OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/5889
[Cherry-Pick] [KVCache] launch cache transfer processes only if hierarchical cache or kv cache storage is enabled (#5871) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5859
[Cherry-Pick] [BugFix] fix mtp cache attaching for pd disaggregation (#5884) by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5885
[Bugfix]fix model weight signal tensor num by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/5899
[Cherry-Pick] [XPU]Cherry-pick Support ZMQ logprobs(#5628) by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/5852
[Feature] Add a global toggle for automatic injection of trace_id and span_id in logs by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/5765
[BugFix][Cherry-Pick] Cp fix eb5 prefix cache(#5879) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5881
[Cherry-Pick][CI]Support multi-step mtp with cudagraph(#5886) by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/5898
[Cherry Pick][XPU][CI] Add logprobs Case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/5907
[Cherry-Pick] [BugFix] fix mtp split kv attetion by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/5921
[Optim][Cherry-pick] Reduce preemption occurrence when blocks not enough(#5696) by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/5808
[Cherry-Pick][Bugfix] Fix mtp logprob hang problem when include stop_seq (#5927) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5928
[CI] Lock paddlepaddle-gpu==3.3.0 in release/2.4 by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5944
[BugFix] fix xpu import set_data_ipc by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/5945
[Cherry-Pick][Bugfix] Fix entropy calculation bugs (#5941) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5942
[Cherry-Pick][BugFix] Fix misleading logging in worker_process for request counting (#5939) by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/5953
[BugFix][Cherry-Pick] cp fix dyc8 cache bug(#5958) by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/5959
support_lastnorm_gather_split_r2.4 by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5925
[Cherry-Pick][Speculative Decoding] Return accepted tokens per head in response (#5947) by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/5952
[CI] Align PaddlePaddle version to latest due to tag change by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/5971
2.4_fix_mtp_forward_meta by @xiaoluomi in https://github.com/PaddlePaddle/FastDeploy/pull/5977

New Contributors

@playaswd made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4848
@WintersMontagne10335 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5011
@fl0w2o48 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/5185
@wangyuwen1999 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4909

Full Changelog: https://github.com/PaddlePaddle/FastDeploy/compare/v2.3.3...v2.4.0

Source: README.md, updated 2026-01-20

FastDeploy Files

High-performance Inference and Deployment Toolkit for LLMs and VLMs

核心推理能力与模型支持增强

并行架构、调度与 MoE 能力演进

多模态、缓存与量化能力增强

多硬件平台支持扩展