FastDeploy - Browse /v2.3.0 at SourceForge.net

The interactive file manager requires Javascript. Please enable it or use sftp or scp.
You may still browse the files here.

Name	Modified	Size	InfoDownloads / Week
Parent folder
README.md	2025-11-10	79.0 kB	0
v2.3.0 source code.tar.gz	2025-11-10	5.7 MB	0
v2.3.0 source code.zip	2025-11-10	7.1 MB	1
Totals: 3 Items		12.9 MB	1

新增功能

新增GLM 4.5文本类模型部署支持 [#3928]
新增GPT-OSS-BF16文本类模型部署支持 [#4240]
新增ERNIE-4.5-VL-28B-A3B-Thinking多模态思考模型部署支持，详见文档
新增PaddleOCR-VL多模态模型部署支持 [#4936]
多模态模型和思考模型增加受限解码StructredOutput支持 [#2749]
多模态模型增加Prefix Caching与Encoder Caching支持 [#4134]
新增Wfp8Afp8在线量化推理支持 [#4051] [#4238]
新增静态Cfp8量化推理支持 [#4568]
LogProb功能
- 支持EP并行下开启logprob #4151
- 支持MTP场景下开启logprob #4464 #4467
- 新增logprobs_mode参数指定返回结果的类型 #4567
HuggingFace Safetensors模型升级为默认能力
- Qwen2.5-VL系列支持 #3921
- ERNIE-4.5-VL系列模型支持 #4042
- 新增EP并行与Cache量化场景下支持 #3801
- 新增动态量化缓存机制，二次加载可使用缓存进行加载 #3857
Nvidia GPU下CUDA Graphs功能的完善
- CUDA Graphs默认在Decode阶段开启 #3594
- 使用统一内存池，降低显存开销 #4230
- 支持投机解码 #3769 #4545 #4617 #4669
- 支持TP、DP、EP混合并行 #4456 #4589
- 支持 PD 分离式部署 #4530
- 支持权重清理与动态加载下的重捕获 #3781 #3594
- 支持CustomAllReduce下开启CUDA Graphs重捕获 #4305
- 增加ERNIE-4.5-VL-MOE模型的支持 #3226
新增终端命令行CLI工具集
- chat：执行对话生成任务 #4037
- complete：执行文本补全任务 #4037
- serve：启动与OpenAI协议兼容的推理服务 #4226
- bench：对推理服务进行性能（延迟、吞吐）或精度评测
  - bench serve \ bench latency 精度评测工具 #4160 #4239
  - bench throughtput \ bench eval 性能评测工具 #4239
- collect-env：收集并打印系统、GPU、依赖等运行环境信息 #4044 #4558 #4159
- run-batch：批量执行推理任务，支持文件/URL输入输出 # 4237
- tokenizer：执行文本与 token 的编码、解码及词表导出 #4278
新增engine-worker-queue-port与cache-queue-port的匿名端口支持 [#4597]
新增```LogitsProcessors````后处理参数支持 [#4515]
新增ERNIE-45-VL-Thinking模型的ReasoningParser与ToolParser [#4571]
usage字段返回新增多模态输入与输出Token、思考Token的统计 [#4648] [#4520]
新增n参数支持单请求返回多个生成结果 [#4273]
离线推理chat接口新增tool参数支持工具调用 [#4415]
多模态数据预处理增加对url数据的下载增加重试 [#3838]

性能优化

优化per_token_quant_fp8算子性能，提升50% [#4238]
MTP支持Chunked Prefill与V1 KVCache调度 [#3659] [#4366]
V1 KVCache调度增加对上下文缓存的支持，并作为默认配置 [#3807] [#3814]
优化MLA kernel性能，支持auto chunk + graph下的高性能MLA kernel [#3886]
优化Qwen-VL中ViT模块的CPU同步耗时 [#4442]
Machete GEMM支持WINT4/WINT8以及group scale，并作为默认dense GEMM后端，优化模型性能与精度 [#4451] [#4295] [#4121] [#3999] [#3905]
优化append attention前处理算子性能 [#4443] [#4369] [#4367]
思考长度裁剪功能自定义算子化，实现更鲁棒更规范 [#4279] [#4736]
INTEL HPU优化多卡场景下sampling [#4445]
新增MergedReplicatedLinear方法，支持DeepSeek，qkv_a_proj融合 [#3673]
优化DeepEP buffer显存；支持EP场景下DeepEP buffer的creat/delete功能 [#4039]
优化集中式EP场景下DeepEP clear buffer带来的降速 [#4039]
spec decode适配qk norm [#3637]
优化MLA Kernel性能，支持auto chunk + CUDA Graphs [#3886]
解决KV Cache容量分配偏小问题 [#4355]
Engine与Worker跨进程通信支持零拷贝方式传输多模态张量数据 [#4531]
APIServer支持gunicore+uvicorn优化前处理耗时 [#4496] [#4364]

多硬件

昆仑芯P800
- 新增ERNIE-4.5-VL系列模型的支持 #4030
- 新增PaddleOCR-VL 0.9B模型的支持 #4529
- BlockAttention算子支持neos版本rope #4723
- 新增W4A8精度支持 #4068
- 适配V1 KVCache调度 #4573
沐曦C550
- 优化Attention、MoE、RotaryEmbedding算子实现 #3688
- 新增DeepSeek-R1、DeepSeek-V3.1-BF16部署支持 #4498
天数CoreX
- 新增ERNIE-4.5-VL-28B-A3B部署支持 #4313
- ERNIE-4.5-300B-A47B推理性能优化 #3651
- 修复rebuild_padding错误问题 #4504

文档

新增终端命令行工具CLI命令使用说明 [#4569]
新增优雅退出方案 [#3785]
更新模型支持文档 [#4754]
新增2Bit量化方式和最佳实践 [#3819] [#3968]
新增DP并行部署文档 [#3883]
新增昆仑芯ERNIE-4.5-VL模型部署文档 [#4586]
新增XPU PaddleOCR-VL模型部署文档 [#4792]
更新模型最佳实践文档 [#3969]
新增ERNIE-4.5-21B-A3B-Thinking最佳实践文档 [#3994]
更新metrics指标说明文档 [#4061]
更新接口参数文档，增加completion_tokens、rompt_tokens、tool_calls说明 [#4421]

Bug修复

修复DP并行场景下Prefix Caching无法部署问题 [#4359] [#4370]
修复集中式EP并行部署下长输入KVCache调度Hang住问题 [#4275]
修复开启CUDA Graphs时noaux_tc算子报错CUDA 700问题 [#4174]
修复V1 Loader下TritonMoEBlockWiseFP8权重shape错误 [#4384]
修复EP场景下MoE前处理问题，增加num_experts_per_rank合法值 [#4102]
修复CustomAllReduce输出不稳定问题 [#4437]
修复昆仑芯下思考长度限制，只有思考无回复内容问题 [#4539] [#4760]
修复推理异常退出场景下KVCache管理进程残留问题 [#4410]
修复部分场景默认开启ChunkedPrefill报错问题 [#3759]
修复调度方法导致DeepSeek模型CudaError问题 [#4757]
修复XPU多模下默认开启上下文缓存bug [#4694]
修复MTP与C8场景下模型加载问题 [#4077]
修复MLA默认开启TensorCore的bug [#4354]
修复APIServer连接重复初始化的问题 [#3901]
修复MultiAPIServer日志地址混乱问题 [#3967]
修复多机张量并行无法部署问题 [#4377]
修复Qwen-VL系列模型无法关闭思考问题 [#3808] [#4762]
修复APIServer的对话接口非流式返回场景下finish_reason不正确问题 [#4582]
修复ERNIE-4.5-VL模型ReasoningPaserser中思考结束符错误问题 [#4686]
修复离线接口enable_thinking强制False的不符合预期错误 [#4248]
修复ERNIE-4.5-VL对PNG格式透明背景图像的处理问题 [#4847]
修复rope3d开启FA3下的报错问题 [#3791]
修复部分硬件平台上算子导入出错问题 [#4559]
修复PD分离EP并行场景下启动推理服务的多个问题 # 4311 [#4420] [#4542] [#4693] [#4781]
修复Metrics中num_requests_running, num_requests_waiting, available_gpu_block_num统计不准确的问题 [#4404]
修复Trace日志在流式输出场景下trace span过多问题 [#4375]
修复动态C8计算错误问题 [#4119]
修复AppendAttention作为自定义算子注册下的Bug导致动静不统一问题 [#4340]
修复Qwen-VL系列模型预处理中视频与图片数据的占位符处理错误 [#4065]
修复模型组网存在的无用显存浪费问题 [#3854]
修复思考长度限制在并发场景下的Bug [#4296]
修复PD分离下IPC信号读取错误问题 [#4309]
修复metrics指标的共享目录命名冲突问题 [#4007]
修复昆仑芯barrier随机精度问题 [#4181]
修复思考长度限制超过上限时的异常问题 [#4086]

其它

修复沐曦硬件上的单测报错问题 [#4027]
修复沐曦硬件上的单测报错问题test_get_save_output_v1单测偶发挂的问题 [#4732]
昆仑芯增加W4A8单测用例 [#4501]
Config配置代码优化，去除冗余字段 [#4147] [#4362] [#4400]
第三方库采用submodule管理 [#4033]
新增DeepSeek-V3-0324端到端监控 [#4360]
ERNIE-4.5-VL模型续推字段generated_token_ids改为completion_token_ids [#4086]
后面进程异常退出时，APIServer进程自动退出提在终端输出提示 [#3271]
Metrics增加若干可观测性指标 [#3868]
新增Attention层的性能单测 [#4494]
DP+EP并行场景下支持模型权重的热更新 [#3765] [#3803] [#3898]
支持在训练场景下强制停止推理请求 [#3601] [#4402]
修复在训练场景下Qwen3模型命名映射异常问题 [#4338] [#4322]
修复流式请求max_streaming_response_token参数不起作用问题 [#3789]
增加基于ZMQ回传worker推理结果至Engine的通信方式 [#3521]

What's Changed

Add more runtime information to resource manager by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/3706
Add CI cases by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/3714
Add loader test for mtp by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/3724
fix typos by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3684
add ci images build job by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3749
[DOC] fix Document by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/3782
Update test_ernie_21b_mtp.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/3783
fix test_load_mtp by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3780
[BugFix] Fix chunked prefill by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/3759
[BugFix] fix max streaming tokens invalid by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3789
[Feature] Setting number of apiserver workers automatically by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/3790
[Feature] mm and thinking model support structred output by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/2749
[Feature] support model weight update in ep by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3765
[BugFix] fix error of import paddle.base.core.Config by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/3761
[Executor] Fix bug of import paddle with RLHF by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3781
rename speculate_stop_generation_multi_stop_seqs by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3743
Modify mask_offset‘s format by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/3525
rename speculate_token_penalty_multi_scores.cu by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3735
fix ce compile job by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3768
[v1loader]Reduce EB300B model loading time by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/3700
【Fix bug] w4afp8 的nblock固定为256，并且fa3的append attn 增加mask参数 by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3771
【Hackathon 9th No.64】add test_draft_model_set_value_by_flags by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3741
[Feat] Support streaming transfer data using ZMQ by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/3521
[BugFix] fix scheduler invalid by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3803
rename fused_get_rope.cu by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3752
【Hackathon 9th No.84】Supplementary Unit Test for fastdeploy/reasoning by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3570
fix w8a8.py by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3733
fix dcu_worker.py by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3734
【Hackathon 9th No.73】add unit tests for graph_opt_backend by @ooooo-create in https://github.com/PaddlePaddle/FastDeploy/pull/3609
[XPU] FIX XPU CI BUG by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3829
[Doc] update wint2 doc by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/3819
fix test_append_attention_with_output.py by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/3831
[XPU] Update XPU CI case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3837
qk norm for speculate decode C16 by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/3637
[V1 Loader]V1 loader support EP by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/3801
[Code Simplification] delete cum_offsets_out by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/3815
[Feature] ernie4_5_vl_moe support huggingface safetensor loading by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/3750
add reasoning parser plugin by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/3811
reopen ut by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3795
Automatically configure workers based on max-num-seqs by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/3846
【Hackathon 9th No.43、45】add speculate_get_padding_offset by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3730
【Hackathon 9th No.42】add test_speculate_get_output_padding_offset by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3740
[XPU] Update XPU stable xvllm and xtdk version for 2.2 by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3853
【BUG FIX】Fixed moba single test port conflict by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3863
fix typo EngineSevice EngineService by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3841
【Hackathon 9th No.27】add test_get_padding_offset by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3708
【Hackathon 9th No.54、57】 add unit tests for per_token_quant and per_token_quant_padding by @ooooo-create in https://github.com/PaddlePaddle/FastDeploy/pull/3746
[BugFix]add rollout config dp by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/3822
Support extend block tables by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3824
【Hackathon 9th No.34】add test_get_position_ids_and_mask_encoder_batch by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3739
【Hackathon 9th No.63】add test_draft_model_postprocess.py by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3757
[Feature] Set v1 scheduler as default in develop by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/3807
fix response processsors by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3826
support mtp rope_3d by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/3791
[Feature][MTP]support mtp in v1_scheduler mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/3695
Graceful shut down by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/3785
Support for async processor added. by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/3869
[CI] update paddleformers==0.2 in develop by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/3878
Update test_ernie_21b_mtp.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/3885
[BugFix] fix qwen vl processor by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3808
[Docs] add data parallel by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3883
【Hackathon 9th No.35】add test_moe_redundant_topk_select by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3867
【BugFix】fix gpu mem oom by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/3854
【Hackathon 9th No.32】add unit tests for group_swiglu_with_masked by @ooooo-create in https://github.com/PaddlePaddle/FastDeploy/pull/3748
【Inference Optimize】Update MergedReplicatedLinear for DSK qkv_a_proj_with_mqa. by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/3673
[fix]load hadamard_block_size from config by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/3797
[Feature] support controller port in multi api server by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3898
Compatible with EB 0.3B torch model arch by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/3913
[Attention]clean_code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/3917
[Fix] mv connection_manager init by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3901
add cache queue port by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/3904
rename eagle_get_base_model_hidden_states.cu by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3753
[feature]Support model loading from cache by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/3857
ignore ci by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/3950
[Feature] add HTTP GET retry by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/3838
[XPU]Fixed the issue of performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/3897
[Bug fix] Fix prompt token ids dtype in v1 by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/3860
supports dynamic Cfp8 by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/3767
Update sparse attn documentation by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3954
[Excutor] Experiment Feature-Support Prefill in cudagraph by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/3459
[metrics] Add serveral observability metrics by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/3868
[Docs] Update env docs for Machete by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/3959
rename ep_moe_prefill_func ep_moe_expert_dispatch by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3938
fix typos by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3951
[Optimize]Error messages about Model api. by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/3839
【Doc】Update WINT2 Doc Pic by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/3968
Modify markdown by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/3896
[docs] update docs by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3975
【Hackathon 9th No.22】add unit tests for share_external_data by @ooooo-create in https://github.com/PaddlePaddle/FastDeploy/pull/3744
【Hackathon 9th No.68】supplementary unit test for ngram_match by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3732
【Hackathon 9th No.44】add test_speculate_get_token_penalty_multi_scores.py by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3742
【Hackathon 9th No.69】add test_draft_model_preprocess by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3832
【Hackathon 9th No.60、62】add eagle_get_hidden_states by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3876
【Hackathon 9th No.66】add test_speculate_set_stop_value_multi_seqs by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3941
【Hackathon 9th No.36】add test_extract_text_token_output by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3862
[docs] update best practice docs by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/3969
[XPU]Release2.2 update release note by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/3986
【Doc】update dsk doc by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/3989
update doc by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/3990
del batch id per token by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/3963
[Docs] update VL best_practices for release/2.2 by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/3965
[CI] update ci by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/3962
[docs] add a3b-thinking doc by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/3994
【docs】update index.html and dockfile by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3998
【FIX】Change the name of sparse attn from moba to plas by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3845
【Fix】Change the name of sparse attn from moba to plas by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3993
【docs】 update readme by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4000
Revert "【Fix】Change the name of sparse attn from moba to plas" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4002
Revert "【FIX】Change the name of sparse attn from moba to plas" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4001
get org_vocab_size from args by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/3983
[V1 Loader]Ernie kv cache quant support v1 loader by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/3899
[V1 Loader] Support V1 Loader for Machete by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/3999
metrics shared folder naming by @zhuangzhuang12 in https://github.com/PaddlePaddle/FastDeploy/pull/4007
[MoE] clean code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4020
[BugFix] Fix the abnormal memory usage caused by shape errors in the triton moe backend by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4026
[xpu] add ep custom ops by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/3911
[Feat] ernie4_5_vl_moe support CudaGraph by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/3226
[Executor] Adjust signal sending order in RL training by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3773
【Hackathon 9th No.28】add test_cutlass_fp8_fp8_fp8_dual_gemm_fused by @WanRui37 in https://github.com/PaddlePaddle/FastDeploy/pull/3935
[Fix] fix multi api server log dir by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3967
[MTP]support rope_3d in spec mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4034
[Feature] Support zai-org/GLM-4.5-Air BF16 model by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/3928
【Inference Optimize】DeepSeek-V3-model MLA Optimize by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/3886
【Hackathon 9th No.55】add test_update_inputs_v1.py by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3992
[docs] Update environment variables documentation by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/3957
[BugFix] qwen2.5vl enable_thinking=true and image_patch_id bug fix by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/3921
fix import tests.utils error in tests/model_loader/test_load_mtp.py by @handsomecoderyang in https://github.com/PaddlePaddle/FastDeploy/pull/4027
[setup optimize]Support git submodule by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4033
[CI] skip test_structured_outputs* temporarily by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4055
update ci by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4064
[BugFix] mm_post_fix by @xiaoxiaohehe001 in https://github.com/PaddlePaddle/FastDeploy/pull/4005
[Echo] Support more types of prompt echo by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4022
[Feature] add cli command chat,complete by @memoryCoderC in https://github.com/PaddlePaddle/FastDeploy/pull/4037
[bug fix] Fix the placeholder in qwen prompt and add some unittests by @lddfym in https://github.com/PaddlePaddle/FastDeploy/pull/4065
[Feature] GLM-45-AIR Support Mix Quantization(Dense wfp8afp8 and wint8 triton_moe_backend) by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4051
[Optimize] optimize prefix cache in develop by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/3890
Add token processor plugin support by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4059
fix typos by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3840
[metrics] update metrics markdown file by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4061
[CI] add multi api server test by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4049
[Feature] refactor metax_gpu attention and moe and remove some useles… by @handsomecoderyang in https://github.com/PaddlePaddle/FastDeploy/pull/3688
【Hackathon 9th No.25】add test_fused_get_rotary_embedding by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3892
【Hackathon 9th No.78】add test_chat.py by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3958
[BugFix]Fix load kv cache quant scale by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4077
[format] Valid para format error info by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4035
[BugFix] Fix image_feature 0-Size causing insert failed by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/4042
fix(CE): update concurrency to stop CE tasks from canceling each other by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/4083
Support offline inference with streaming output by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4071
【FastDeploy CLI】collect-env subcommand by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4044
[Bug Fix]fix the bug for cache_messager signal loss by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/3879
【Hackathon 9th No.61、65、41】add test_draft_model_update by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/3940
【Hackathon 9th No.49】add test_pre_cache_len_concat by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3847
[Optimize] Support WINT8 and group scale for Machete by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/3905
[v1 loader]qwen Offline fp8 by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4036
[xpu] support ep by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4067
[CUDAGraph] Support multi output buffers and merge some fixes from feature/exp_0908 by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4062
[MTP]update hybrid-mtp-with-ngram by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4047
[MTP]Develop mtp reshard by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4099
[BugFix]Fix Ernie bf16 model loading bug and add comments by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4106
fix typos by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/4093
[BugFix]Fix key mismatch when load mtp by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4105
[submodule] add ignore=all for deepgemm by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4118
[BugFix] Fix EP MoE expert dispatch function by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4102
【Hackathon 9th No.37】add test_top_k_renorm_probs by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3755
【Hackathon 9th No.52】add test_dynamic_per_token_scaled_fp8_quant by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/4015
[CE]add plas attention config by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4128
Addcase by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/4112
[benchmark]add lite-vl and x1 yaml by @xiegegege in https://github.com/PaddlePaddle/FastDeploy/pull/4130
[Doc][CE]x1_a3b server config by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4135
ci: Increase compilation task time limit by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/4098
fix dynamic Cfp8 computing error by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/4119
[Feature] Set prefix caching as default by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/3814
Update test_w4a8_model.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4125
mv test to tests by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/4129
[FDConfig]Remove max_num_batched_tokens/max_num_seqs in parallel config by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4116
[BugFix] Forbiden FD_DISABLED_RECOVER while ENABLE_V1_KVCACHE_SCHEDULER by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4142
Reconstruct streaming data transfer with zmq by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3836
Print KV Cache available memory and block memory usage in GB format by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/4148
[BugFix]Fix test_prefix_cache by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4155
[NewFeture]add ep rollout model init and update/clear ep buffer by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/4039
[CI] enhance clean port strategy by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4152
[Feature] Support mixed deployment with yiyan adapter in develop by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/3976
Add param valid log by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4113
[FastDeploy CLI]collect-env unitest bug fix by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4159
[Optimize] Machete uses group scale by default by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4121
Bugfix test exception by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4171
Each module should have its own plugins_loaded by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4164
[Logprob] EP support logprob by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4151
[fix]Modify follow-up push parameters and Modify the verification method for thinking length by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4086
[FDConfig]Remove splitwise_role and engine_worker_queue_port in FDConfig by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4147
【Hackathon 9th No.46】add test_fused_rotary_position_encoding by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3848
[Bug fix] fix request assign by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4163
[TEST] init first commit by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4192
fix nul by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/4191
[BugFix]fix glm all_reduce tp group by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4187
[Feature] support pool by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/3827
fix typos by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/4176
【Hackathon 9th No.30】add test_tritonmoe_preprocess by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/3891
[Feature] Support pd ep deployment with yiyan adapter by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4029
【Hackathon 9th No.40】add test_top_p_candidates by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/4046
【Hackathon 9th No.26】add test_set_value_by_flags_and_idx.py by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4186
[FD CLI] Add bench cli by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4160
[Iluvatar GPU] Optimize attention performance and fix moe load ckpt e… by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/3651
Remove useless code by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4195
[Feature] support clear data by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3601
[XPU]change xpu ci model by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4117
【FIX】Change the name of sparse attn from moba to plas (#4006) by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4076
[XPU] update XPU CI by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4209
[XPU] Update run_ci_xpu.sh to lock xvllm version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4210
[xpu] use cpu barrier by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4181
[Feature] support qwen3-embedding model load by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/4202
Fix noaux_tc cuda Error 700 in CUDAGraph by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4174
[v1 loader]code style by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4204
[Test]add glm45_air logprob test and rollout model by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4175
[XPU] Enable XPU V1 mode based on environment variable by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4213
register_model_class compatible with plugins by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4236
【Hackathon 9th No.24】add rebuild_padding by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/4107
[Intel HPU] Support intel hpu platform by @fmiao2372 in https://github.com/PaddlePaddle/FastDeploy/pull/4161
[CUDAGraph] [FIX] Fix CUDA error(700): 'cudaErrorIllegalAddress' in CascadeAppendW… by @YuhanXu in https://github.com/PaddlePaddle/FastDeploy/pull/4218
[BugFix]fix v1 loader moe bf16, and supoort dynamic_load_weight create quant param by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4229
[BugFix] fix qwen3-embedding model tp>1 by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/4223
[Bug Fix] disable prefix caching in mm model by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4167
[Feature] add cli command serve by @memoryCoderC in https://github.com/PaddlePaddle/FastDeploy/pull/4226
[OPs] MoE support wfp8afp8(channelwise) and improve per_token_quant_fp8 by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4238
[fix]update apply_chat_template by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4137
[Model] Qwen2.5VL support --use-cudagraph and unit testing by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/4087
[CUDAGraph]CUDA Graph support unique memory pool by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4230
[BugFix]fix the bug for prefilled_step_idx signal of cache_messager in cudagraph and PD by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/4235
【Hackathon 9th No.21、23】add unit tests for fused_hadamard_quant_fp8, moe_fused_hadamard_quant_fp8 by @ooooo-create in https://github.com/PaddlePaddle/FastDeploy/pull/4094
[XPU] support XPU VL model inference by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4030
delete moe_phase in parallel_config（Moved to model_config） by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4264
Support limit thinking lengths. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4069
[Docs]Add ENABLE_V1_KVCACHE_SCHEDULER=0 to docs by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4268
【fix】Remove the logic that assigns the default value of 80% to reasoning_max_tokens in the offline component of FastDeploy. by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4248
[Feature] add config api by @memoryCoderC in https://github.com/PaddlePaddle/FastDeploy/pull/4254
[CI] fix base_test error temporarily by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4283
[Supplements and upgrades]Improvement of X1 parsers by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4172
fix ernie vl distributed attr. by @ZHUI in https://github.com/PaddlePaddle/FastDeploy/pull/4215
[Doc]add glm benchmark yaml by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4289
Add cli run batch by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4237
Add speculative decoding approval check by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4284
Set approve checking for config.py, worker, model and cudagraph by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/4276
[Docs]The XPU model loader uses the default version by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4292
increase ccache size by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/4255
[Feature] deepgemm pre-compile tool support mixed parallel by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4282
fix typos by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4274
[fix]remove reasoning_max_tokens=max_toksns*0.8 in sampling_params by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4277
[Bug fix] Fix bug for running ep by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4245
Fix wrong batch size of thinking_mask by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4296
[BugFix] Increase the conditions for the use of a Machete: not pre-quant by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4295
[XPU] fix VL thinking mode by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4266
[feat] support prefix cache clearing when /clear_load_weight is called by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4008
add_cli_tokenizer by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4278
[fix] fix gpu_cache_kvs key by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4311
【Feature】ResourceManagerV1 support need block num notifying by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4220
Fix bugs of splitwise_complete_prefilled_step IPCsignal clear by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4309
[Metax] support cutlass moe & optimize flash attention & fix triton moe by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4208
[NewFeature]custom_allreduce support cudagraph recapture by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4305
[FIx] CI Approve fix by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/4316
[BugFix]remove redundant includes by @fangfangssj in https://github.com/PaddlePaddle/FastDeploy/pull/4312
[Bug fix]revert worker process ipc signal suffix to fix ep by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4323
【Inference Optimize】Support MLA_CACHE & Fix V1_Schedule Bug by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4318
【Fix】updata docs by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4339
[Doc] Update xpu fastdeploy version to 2.2.1 by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4338
【Bug-Fix】schedule_bugfix by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4336
[Executor]CUDAGraph support Speculate Decode by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3769
Remove redundant inplace outputs for append_attention by @SigureMo in https://github.com/PaddlePaddle/FastDeploy/pull/4340
【Inference Optimize】MLA Tensor-Core is enabled by default by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4335
【Hackathon 9th No.86】autogen MultiQueryAppendC8Attention template_instantiation -part by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4330
[XPU] Support W4A8C8-TP4-300B Model by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4068
supports spec dynamic cfp8 by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4290
[FastDeploy Cli] Bench Command eval and throughput by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4239
add release images build job by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/4265
【Bug Fix】mla enables tensorcore by default by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4354
【Inference Optimize】Calculate paddle_peak_increase using paddle_allocated_mem_after_run by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4355
【Add CI】Add DeepSeek model end-to-end CI by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4360
[Feature] support prefix cache in DP by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4359
【BugFix】fix qwen3moe name_mapping config by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/4348
[MTP]support more branchs in topp kernel by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4352
[FDConfig]Remove max_model_len in FDConfig by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4350
[XPU] fix XPU CI bug by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4358
[Doc] fix document navigation link paths by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4368
【Hackathon 9th No.20】add unit tests for masked_per_token_quant by @ooooo-create in https://github.com/PaddlePaddle/FastDeploy/pull/4111
[Doc] fix the port conflict issue in the usage example by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4379
[Optimization] Fuse get_max_len and get_kv_max_len by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4369
[CI] fix diff_error temporarily by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4390
【Hackathon 9th No.67】add speculate_verify by @co63oc in https://github.com/PaddlePaddle/FastDeploy/pull/4326
[Doc]add x1 a3b quantization yaml by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4397
[Doc] fix offline inference doc by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4412
[Docx] add PaddlePaddle nightly build address for GPU by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4414
[CI] Fix partial instability issues by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4418
[benchmark] Update benchmark tools by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4416
[Optimization] Optimize split_q_block kernel by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4367
[XPU] fix ep by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4393
[BugFix] fix multinode bugs by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4377
[fix] Fixed the issue of excessive/redundant spans being returned for streaming requests. by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4375
[fix] fix requests & block metrics by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4404
[MTP]support mtp chunk_prefill_v1 by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4366
【BugFix】fix block_wise_fp8_v1_loader_moe_shape by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4384
【Fix CI Bug】Fix ci bug by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4413
Disable gcu ci by @tianshuo78520a in https://github.com/PaddlePaddle/FastDeploy/pull/4427
V1 loader default by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4251
[BugFix] fix workers=1 by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4364
[XPU] fix VL multi-batch accuracy issue by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4394
[BUGFIX] clear request [#4286] by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4402
fix param by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4419
Feature：Add support for Pooling Model Embedding and provide an OpenAI-compatible API. by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4344
[benchmark] Update benchmark_serving.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4438
[BugFix] fix config bugs by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4370
[XPU] support prefix cache by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4423
[Bug fix] Fix pd for x1 thinking by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4433
[Bugfix]fix ep clear buffer perf by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/4389
[XPU] moe support VL 0-size input by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4408
[benchmark] Fix benchmark duration calculation logic by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4446
[benchmark] Add filtering for failed requests in benchmark outputs by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4448
perf: optimize ZMQ communication with async queue and single-threaded… by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4444
[BUG] fix ep bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4275
【Hackathon 9th No.86】autogen MultiQueryDecoderAttention template_instantiation -part by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4383
[benchmark ] Update benchmark tools by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4454
【Fix】 remove text_after_process & raw_prediction by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4421
[Intel HPU] Enable dist sampler on intel hpu platform by @JianyuLi01 in https://github.com/PaddlePaddle/FastDeploy/pull/4445
[xpu] refine fused_moe by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4219
[SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4386
[FDConfig]Remove total_block_num/dtype/block_size/enc_dec_block_num in ParallelConfig by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4400
[bugfix] kill cache_transfer_manager process by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4401
[BugFix]Fix wfp8afp8 triton moe group_topk renormalized=True by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4449
[Comments] Modify comments to English by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4460
[FDConfig]Remove reasoning_parser/guided_decoding_backend/disable_any_whitespace/device_ids in FDConfig by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4362
[SOT][Cudagraph] Remove BreakGraph of [#3302] && update CustomOp by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/3694
[Optimization] Put get_block_shape_and_split_kv_block in cuda graph for append attention backend by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4443
[Others] add PR Template by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/4452
[Optimize] Set preempted schedule log as info level by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4453
[SOT] Change warnings to errors and remove fallback operations by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4378
[BugFix]Dev fix custom ar unstable result by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4437
[CINN] Remove the restriction of automatically falling back to SOT after enabling CINN by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4411
[Feature] support pooling model dummy_run by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/4345
[XPU] abstract a hardware-agnostic operator wrapper for prefix cache and specify xpu device id definition by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4455
[CI] Fix partial instability issues by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4461
[Docs]add environment variables to environment_variables.md by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4466
[Perf][Qwen25_VL] Avoid triggering CPU synchronization in ViT's attn forward by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/4442
[Docx] add language (en/cn) switch links by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4470
[Iluvatar GPU] Adapt VL model by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4313
[Loader]check paddle version for v1 loader by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4473
[XPU]Fix w4a8 precision bug && rollback moe algo by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4463
[Docx] fix the broken link by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4479
LLM.chat add "tools" param by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4415
【feature】support n parameter by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4273
[ATTN]delete code and add ffn and moe layer level test by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4440
[CI] Handle unit test issues by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4483
[Graph Optimization][Speculative Decoding] Fix the bug of CUDAGraph + MTP + EP by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4456
Optimization of ‘tools’ in request fields by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4380
[Metax] adjust mctlass moe api by @handsomecoderyang in https://github.com/PaddlePaddle/FastDeploy/pull/4474
Support GPT-OSS-BF16 by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4240
[Feature] support mtp logprob by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4464
[Loader]Qwen2.5-Math-PRM-7B and Ernie-VL-RM by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4319
[fix] remove cache tensor creation for cache_transfer_manager by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4420
[Benchmark] update benchmark scripts by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4497
[XPU]Fix w4a8 garbled code issue by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4493
[XPU] Fix vl multi-card allreduce bug by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4485
[BugFix][CI] Clean up SOT code cache using tearDown in CINN unitest by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4491
Optimizing the performance of think length limit using custom operators by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4279
[APIServer] support define gunicorn timeout by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4496
[CI] update ernie-4_5-vl baseline by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4495
【BugFix】fix ep buffer clear by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/4450
[XPU] bind block_attn kernel with pybind by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4499
[Executor] Default use CUDAGraph by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3594
[Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4467
【CI】Add test cases for n parameter and streaming validation by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/4503
[Fearture] Support mm model close prefix cache by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4459
[Doc]add deepseek wint4 ce by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4517
[FDConfig]Turn on the CUDAGraph + Speculative Decoding switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4511
Add comprehensive unit tests for limit_thinking_content_length operators by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/4510
[XPU]add xpu ci ep case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4432
feat: add post-processing step for pool_output by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4462
[FDConfig]Turn on the CUDAGraph + MultiModel switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4512
[Iluvatar GPU] fix ci error caused by rebuild_padding param and cuda graph by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4504
enhance set_stop_value_multi_ends and standardize the registration of some operators by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4525
[CI] Remove redundant .coveragerc file and fix by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4521
[XPU]Modify the xpu memory display unit of log by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4534
[Feature] support fd return decode response by @zhuangzhuang12 in https://github.com/PaddlePaddle/FastDeploy/pull/4407
[Feature] Support AsyncLLM by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4458
[CI] stable test_rollout_model.py by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4536
small change in test_fusedmoe.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4538
c++ code format by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4527
[XPU] Change XPU stable third-party version and add time-out by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4524
[FDConfig]Turn on the CUDAGraph + RL switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4508
[Others]Delete useless code by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4544
[BugFix]Fix finish reason by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4543
[Doc]fix deepseek ce by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4560
[XPU] xpu support think length limit by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4539
[CI] Optimize coverage upload reporting by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4547
WINT4/WINT8 dense gemm default use Machete by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4451
[XPU] merge apply_tp, ops support token_num = 0 by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4507
[BugFix] Fix decode_type which has been deleted in req and optimize token client retry scheme by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4564
[Graph Optimization] Support CUDAGraph Padding + MTP by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4545
[FDConfig]Turn on the CUDAGraph + PD Disaggregation switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4530
[KVCache] Support Static C8 by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4568
[Metax] adapt DeepSeek by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4498
[BugFix] fix create_cache_tensor for ep by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4542
[EP] fix adapter bugs by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4572
[XPU]fix v1 hang bug by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4573
[BugFix] fix import image_ops error on some platforms by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/4559
[CLI]Update parameters in bench latecy cli tool and fix collect-env cli tool by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4558
[Graph Optimization] Add dy_runnable and introduce cudagraph_switch_threshold for cudagraph mode switching by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4578
[XPU]Moe uses a new operator by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4585
[Feature] Support Paddle-OCR by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4396
[DataProcessor] add reasoning_tokens into usage info by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4520
perf: Optimize task queue communication from engine to worker by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4531
[CI] Clean up ports after processing results by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4587
[CI] Add /re-run command in PR comments to restart failed CI workflows by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4593
[Others] api server exits when worker process is dead by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/3271
[XPU] bind some OPs for VL model with pybind by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4522
[V1 loader] Qwen25 VL support v1 loader and torch style safetensors load by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/4388
[Feature] Support logprobs_mode by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4567
[CI] Fix path error of /re-run by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4606
[Feature] mm support prefix cache by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4134
[Graph Optimization]1.fix the bug of draft model with ep 2.fix sampler bug by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4589
[XPU] update kunlun doc about supported models by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4586
benchmark工具适配SGLang框架 by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/4607
[Unitest]Add unitest of Attention Layer by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4494
remove dev sync in prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4598
[BugFix] fix offline stream output when set enable_thinking param by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4603
[BugFix] PaddleOCR-VL fix FD_DEBUG type and support v1 loader by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4605
[Feature] EngineWorkerQueue anonymous port by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/4597
[Docs] Add cli usage to docs by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4569
[CI] fix run-batch port from env in unittest by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4613
[CI] Relocate server test cases from ci_use directory to e2e by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4608
[Graph Optimization][Speculative Decoding] Update yaml and fix typo by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4612
Extend sleep time to 10 seconds in switch_service by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4618
[Speculative Decoding][MTP]Support mtp in epdptp mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4614
[BugFix] fix TPDP mix parallel infer by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/4583
[Graph Optimization] Fix IR graph dependency error exposed after enabling SOT by updating the return value of TextImageGatherScatter by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4610
[XPU]add xpu ci w4a8 case by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4501
[CI][BugFix] fix port conflicts in concurrent ci test and add more unit test on async_llm by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4616
[CI] Revert directory change of test_rollout_model due to intermittent failures by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4626
feat: add support for API usage with multimodal models by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4548
[XPU] Support PaddleOCR-VL model for XPU by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4529
[Graph Optimization] Refactor default capture list by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4617
[BugFix] fix paddleocr prefix cache bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4625
[BugFix] fix import jit.marker.unified by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4622
add einops dependency by @zhang-prog in https://github.com/PaddlePaddle/FastDeploy/pull/4633
[Feature] support logits processors by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4515
[Feature] support reward api by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4518
[BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4553
[BugFix] Fix graph opt test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4634
[Feature] add mm token usage by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4570
[XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4636
[Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4658
[XPU] fix pos_emb_type bug by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4638
[Docs] add Qwen25vl yaml by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/4662
[Feature] add a new reasoning parser by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4571
[XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4665
add noaux_tc to unitest fused_moe by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4656
[EP] fix several bugs in data parallel by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4657
[OP] Add InferShape&InferDtype for per_token_quant_padding by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4667
【Hackathon 9th No.86】autogen MoeFastHardamardImplWrapper template_instantiation by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4592
[UT] Add ut for speculative sampler by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4650
[Doc] update docs by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4675
[Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4601
[CI] Add test for paddleocr_vl by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4627
[unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4676
[noauxtc_kernel] remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4643
[BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4686
[BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4687
[BugFix] fix --logprobs-mode raw_logits by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4681
[XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4695
[XPU] [CI] Add Vl case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4649
[BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4582
[Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4668
[BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/4701
[XPU] [CI] Lock xvllm version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4715
[Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4645
[Feature] support mtp distribution equivalence verification by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4699
[KVCache] Support kv cache scale load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4624
add flops and bandwidth to test_ffn.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4704
benchmark工具支持受限解码场景指定response_format by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/4718
[CI] add missing unit tests for tokenizer_cli by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4620
[Scheduler] update v1 prefill batch by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4611
[BugFix] Fix profile run in pd-disaggregated deployment by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4584
[BugFix] fix mm prefix_cache cuda error bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4679
[Feature] Check bos url by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4711
[BugFix] fix wint2 config by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4721
[FDConfig] [PD Disaggregation] [Graph Optimization] Close Cudagraph for P node when PD Disaggregation by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/4632
[XPU] xpu support neox style ROPE by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4719
[BugFix] Skip building native architecture when specifying arch list by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4727
fix noaux by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4731
[BugFix] fix thinking bug by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4710
[CI] Fix rollout_model test logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4730
[Feature] support pooling model runner by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/4590
format code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4720
[CI] fix some ci yaml by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4747
[Docs]Update XPU document version to 2.3.0 by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4741
[Speculative Decoding][MTP]Support mtp in splitewise and scheduler_v1 mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4743
[Speculative Decoding][MTP]Support attn mask offset by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4641
[Docs]Add parameter to the start service command by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4753
[Docs]Add parameter by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4755
[Docs] fix PaddleOCR-VL docs bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4702
[Feature] Support eplb for fd by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4599
[XPU] add v1 support for bf16 by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4744
【DataProcessor】add options thinking_mode by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4735
[Optimize] Support and robust for tpN for PD by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4595
[Docs] fix error by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4768
[CI]test common model by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4697
[Metax] adapt cutlass moe for ernie-vl by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/4685
fix dynamic Cfp8 for RL load by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/4144
[Docs] PaddleOCR-VL add RTX3060 server param by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4765
[BugFix] fix deepseek cuda error by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4739
[XPU][CI] fix ci base value bug by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4783
[OP]Fix attn_params by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4787
[CI]delete test_common_model by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4794
[XPU] fix thinking bug where output only contains reasoning_content by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4761
[XPU] add deployment doc for PaddleOCR-VL in XPU by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4784
[BugFix] Fix ernie4_5_vl_processor.py and qwen_vl_processor.py can not disable thinking by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4762
supports internode_ll_two_stage by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4162
supports pd partn by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4615
[Docs] Add new support models by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4801
[CI] Refactor CE wheel upload for multiple target paths by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4790
[Docx] update mkdocs.yml by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4804
[BugFix] Fix step_shm_value in PD disaggregated deployment by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4780
Update Unit Test for PaddleOCR-VL by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4802
[Metax] adapt cutlass moe and fix mla attention for DeepSeek by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4602
[Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4769
[get_padding_offset.] clean get_padding_offset.cu by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4777
support ep+tp at op layer by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4688
[BugFix] fix reasoning parser register name by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4795
remove input_ids from ForwardMeta by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4793
[Feature] Add timestamp for profiler by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4726
[XPU]Support V1 loader in weight_only Model by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4808
[Bug Fix] process transparent image by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4807
add paddleocr_vl benchmark by @zhang-prog in https://github.com/PaddlePaddle/FastDeploy/pull/4833
[Doc] Update docs for v2.3.0rc0 by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4828
[BugFix] fix messages being inplace modified in offline chat api by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4831
【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4272
[CI] fix docker_build error and add tag-base by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4810
[PD Disaggregation] Support Qwen3-MoE use PD + EP inference. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4691
remove seq_lens_this_time by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4821
[BugFix] Fix ernie_vl_reasoning_parsers.py 'end_token' to 'think_end_token' by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4805
Fix: ci port conflict by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4840
[CI] Add unittest for activation, native_paddle_backend, w4a8, w4afp8, platforms/utils by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4812
[XPU][CI]Change ci vl model to 28 b by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4764
[Fix] fix ernie4_5_vl model torch format loadding by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/4447
[Feature] [PD] add simple router and refine splitwise deployment by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/4709
[Docs] fix: correct typo in nvidia_gpu.md by @playaswd in https://github.com/PaddlePaddle/FastDeploy/pull/4848
[BugFix] Fix list to List by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4818
[BugFix] Del get_act_fn, _load_st_projector by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4824
[Benchmark] Enhance benchmark output logging by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4682
[XPU] ep+tp all2all by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4836
[CI] Add Check PR Template by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4481
Revert "【New Feature】W4afp8 supports per group quantization" by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4854
[CI] Update deploy.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4850
[CI] Optimize port cleanup logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4860
[Bug Fix] fix ernie4_5_vl_moe by @LokeZhou in https://github.com/PaddlePaddle/FastDeploy/pull/4843
Revert "[Bug Fix] fix ernie4_5_vl_moe" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4863
[Feature] support mm disable_chunked by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4803
[CI] Update ERNIE-4.5-VL baseline to adapt to MoE changes by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4867
[CI] Refactor check-bypass logic in run_tests_with_coverage by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4655
[Others] Delete PaddleOCR Useless Function by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4815
[Feature] Optim PaddleOCR-VL by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4873
[XPU] fix ep_tp all2all ci by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4876
[XPU] modify 424B model deployment parameter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4888
[XPU][CI] Ci bug fix by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4889
[BugFix] fix token_processor zmq by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4827
[CI] fix docker_build error of ciuse by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4886
[Metax] support ERNIE-4.5-VL-28B by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/4820
[BugFix] max_lgprobes=-1 maps to ori_vocab_size by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4884
[Feature] Enable FastDeploy to support adding the “--api-key” authentication parameter. by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4806
[Docs]Supplement the English and Chinese user documentation for Tool calling by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4895
[XPU][CI]Update test assertion and base response value by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4907
[BugFix] When the value of "temperature" is 0, adjust it to 1e-06 by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4900
[Docs] add api-key usage instructions by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4902
[CI] Add four unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4906
[Bug Fix] fix bug for PD EP by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4823
[DeepEP] support async prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4899
[XPU]Update documentation by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/4917
[Docs] Improve reasoning_out docs by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4901
[BugFix] Fix inference_start_time by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4922

New Contributors

@ooooo-create made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3609
@zhupengyang made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3911
@WanRui37 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3935
@handsomecoderyang made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4027
@fmiao2372 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4161
@YuhanXu made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4218
@ccsuzzh made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4274
@xiaozude made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4208
@fangfangssj made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4312
@tianshuo78520a made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4427
@JianyuLi01 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4445
@Limerances made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4240
@ST-XX made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4597
@zhang-prog made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4633
@neilzhuu made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4685
@juncaipeng made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4709
@playaswd made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4848

Full Changelog: https://github.com/PaddlePaddle/FastDeploy/compare/v2.2.1...v2.3.0

Source: README.md, updated 2025-11-10

FastDeploy Files

High-performance Inference and Deployment Toolkit for LLMs and VLMs

新增功能

性能优化

多硬件

文档

Bug修复

其它

What's Changed

New Contributors

FastDeploy Files

High-performance Inference and Deployment Toolkit for LLMs and VLMs

Get an email when there's a new version of FastDeploy

新增功能

性能优化

多硬件

文档

Bug修复

其它

What's Changed

New Contributors