## 1.10.0 (2025-07-31)

### Breaking Changes
- Added `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both the query and the response will be evaluated; when set to `False` (the default), only the response will be evaluated. This parameter is available across all RAI service evaluators, including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`. Existing code that relies on queries being evaluated will need to explicitly set `evaluate_query=True` to maintain the previous behavior.
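As a minimal sketch of opting back into the previous behavior (the project details, query, and response below are placeholders):

```python
# Minimal sketch: restoring pre-1.10.0 behavior by evaluating the query as well.
# Subscription, resource group, and project names are placeholders.
from azure.ai.evaluation import ViolenceEvaluator
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

violence_eval = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
    evaluate_query=True,  # default is False: only the response is evaluated
)

result = violence_eval(
    query="How do I best de-escalate an argument?",
    response="Take a breath, listen actively, and keep your tone calm.",
)
```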
### Features Added
- Added support for the Azure OpenAI Python grader via the `AzureOpenAIPythonGrader` class, which serves as a wrapper around Azure OpenAI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator (see the first sketch after this list).
- Added `attack_success_thresholds` parameter to the `RedTeam` class for configuring custom thresholds that determine attack success. This allows users to set specific threshold values for each risk category, with scores greater than the threshold considered successful attacks (i.e., a higher threshold means a higher tolerance for harmful responses). A usage sketch follows this list.
- Enhanced threshold reporting in RedTeam results to include default threshold values when custom thresholds aren't specified, providing better transparency about the evaluation criteria used.
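A rough sketch of wiring an `AzureOpenAIPythonGrader` into `evaluate()`. The constructor arguments shown (`name`, `source`, `pass_threshold`, `image_tag`) and the shape of the grading function are assumptions modeled on the OpenAI Python grader fields, so check the SDK reference for the exact signature:

```python
# Sketch only: the grader constructor arguments below are assumptions.
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    AzureOpenAIPythonGrader,
    evaluate,
)

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    azure_deployment="<deployment-name>",
)

# Assumed grading-function shape: a `grade` function scoring one sample.
grader_code = """
def grade(sample, item) -> float:
    return 1.0 if sample["output_text"] == item["ground_truth"] else 0.0
"""

python_grader = AzureOpenAIPythonGrader(
    model_config=model_config,
    name="exact_match",
    source=grader_code,
    pass_threshold=1.0,
    image_tag="2025-05-08",
)

# The grader object is supplied to evaluate() like any other callable evaluator.
results = evaluate(
    data="data.jsonl",
    evaluators={"exact_match": python_grader},
)
```

And a sketch of custom attack-success thresholds on `RedTeam`; the dictionary keyed by `RiskCategory` is an assumption based on the description above, and the project details are placeholders:

```python
# Sketch: per-category attack-success thresholds (mapping format assumed).
from azure.ai.evaluation.red_team import RedTeam, RiskCategory
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

red_team = RedTeam(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
    risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
    attack_success_thresholds={
        RiskCategory.Violence: 3,        # scores above 3 count as successful attacks
        RiskCategory.HateUnfairness: 3,  # higher threshold = higher tolerance
    },
)
```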
### Bugs Fixed
- Fixed a red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files, while the user's `output_path` is reserved for the final aggregated results.
- Significant improvements to the TaskAdherence evaluator. The new version has lower variance, is much faster, and consumes fewer tokens.
- Significant improvements to the Relevance evaluator. The new version has more concrete rubrics, lower variance, is much faster, and consumes fewer tokens.
### Other Changes
- The default engine for evaluation was changed from `promptflow` (`PFClient`) to an in-SDK batch client (`RunSubmitterClient`).
- Note: We've temporarily kept an escape hatch to fall back to the legacy `promptflow` implementation by setting `_use_pf_client=True` when invoking `evaluate()`, as shown below. This is due to be removed in a future release.
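A minimal sketch of the escape hatch; the data file, deployment details, and the choice of `RelevanceEvaluator` are placeholders:

```python
# Minimal sketch: forcing the legacy promptflow (PFClient) engine during evaluation.
# Endpoint, key, deployment, and data file below are placeholders.
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    RelevanceEvaluator,
    evaluate,
)

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    azure_deployment="<deployment-name>",
)

results = evaluate(
    data="data.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
    # Escape hatch: fall back to the legacy promptflow engine. Slated for removal,
    # so new code should rely on the default in-SDK batch client instead.
    _use_pf_client=True,
)
```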