2026/02/02 - Amazon Bedrock AgentCore Control - 2 updated api methods
Changes: Adds tagging support for AgentCore Evaluations (evaluator and online evaluation config).
create_evaluator (updated)
Changes (request): {'tags': {'string': 'string'}}
Creates a custom evaluator for agent quality assessment. Custom evaluators use LLM-as-a-Judge configurations with user-defined prompts, rating scales, and model settings to evaluate agent performance at tool call, trace, or session levels.
See also: AWS API Documentation
Request Syntax
client.create_evaluator(
    clientToken='string',
    evaluatorName='string',
    description='string',
    evaluatorConfig={
        'llmAsAJudge': {
            'instructions': 'string',
            'ratingScale': {
                'numerical': [
                    {
                        'definition': 'string',
                        'value': 123.0,
                        'label': 'string'
                    },
                ],
                'categorical': [
                    {
                        'definition': 'string',
                        'label': 'string'
                    },
                ]
            },
            'modelConfig': {
                'bedrockEvaluatorModelConfig': {
                    'modelId': 'string',
                    'inferenceConfig': {
                        'maxTokens': 123,
                        'temperature': ...,
                        'topP': ...,
                        'stopSequences': [
                            'string',
                        ]
                    },
                    'additionalModelRequestFields': {...}|[...]|123|123.4|'string'|True|None
                }
            }
        }
    },
    level='TOOL_CALL'|'TRACE'|'SESSION',
    tags={
        'string': 'string'
    }
)
clientToken (string) --
A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If you don't specify this field, a value is randomly generated for you. If this token matches a previous request, the service ignores the request, but doesn't return an error. For more information, see Ensuring idempotency.
This field is autopopulated if not provided.
evaluatorName (string) -- [REQUIRED]
The name of the evaluator. Must be unique within your account.
description (string) --
The description of the evaluator that explains its purpose and evaluation criteria.
evaluatorConfig (dict) -- [REQUIRED]
The configuration for the evaluator, including LLM-as-a-Judge settings with instructions, rating scale, and model configuration.
llmAsAJudge (dict) --
The LLM-as-a-Judge configuration that uses a language model to evaluate agent performance based on custom instructions and rating scales.
instructions (string) -- [REQUIRED]
The evaluation instructions that guide the language model in assessing agent performance, including criteria and evaluation guidelines.
ratingScale (dict) -- [REQUIRED]
The rating scale that defines how the evaluator should score agent performance, either numerical or categorical.
numerical (list) --
The numerical rating scale with defined score values and descriptions for quantitative evaluation.
(dict) --
The definition of a numerical rating scale option that provides a numeric value with its description for evaluation scoring.
definition (string) -- [REQUIRED]
The description that explains what this numerical rating represents and when it should be used.
value (float) -- [REQUIRED]
The numerical value for this rating scale option.
label (string) -- [REQUIRED]
The label or name that describes this numerical rating option.
categorical (list) --
The categorical rating scale with named categories and definitions for qualitative evaluation.
(dict) --
The definition of a categorical rating scale option that provides a named category with its description for evaluation scoring.
definition (string) -- [REQUIRED]
The description that explains what this categorical rating represents and when it should be used.
label (string) -- [REQUIRED]
The label or name of this categorical rating option.
modelConfig (dict) -- [REQUIRED]
The model configuration that specifies which foundation model to use and how to configure it for evaluation.
bedrockEvaluatorModelConfig (dict) --
The Amazon Bedrock model configuration for evaluation.
modelId (string) -- [REQUIRED]
The identifier of the Amazon Bedrock model to use for evaluation. Must be a supported foundation model available in your region.
inferenceConfig (dict) --
The inference configuration parameters that control model behavior during evaluation, including temperature, token limits, and sampling settings.
maxTokens (integer) --
The maximum number of tokens to generate in the model response during evaluation.
temperature (float) --
The temperature value that controls randomness in the model's responses. Lower values produce more deterministic outputs.
topP (float) --
The top-p sampling parameter that controls the diversity of the model's responses by limiting the cumulative probability of token choices.
stopSequences (list) --
The list of sequences that will cause the model to stop generating tokens when encountered.
(string) --
additionalModelRequestFields (document) --
Additional model-specific request fields to customize model behavior beyond the standard inference configuration.
level (string) -- [REQUIRED]
The evaluation level that determines the scope of evaluation. Valid values are TOOL_CALL for individual tool invocations, TRACE for single request-response interactions, or SESSION for entire conversation sessions.
tags (dict) --
A map of tag keys and values to assign to an AgentCore Evaluator. Tags enable you to categorize your resources in different ways, for example, by purpose, owner, or environment.
(string) --
(string) --
Return type: dict
Response Syntax
{
    'evaluatorArn': 'string',
    'evaluatorId': 'string',
    'createdAt': datetime(2015, 1, 1),
    'status': 'ACTIVE'|'CREATING'|'CREATE_FAILED'|'UPDATING'|'UPDATE_FAILED'|'DELETING'
}
Response Structure
(dict) --
evaluatorArn (string) --
The Amazon Resource Name (ARN) of the created evaluator.
evaluatorId (string) --
The unique identifier of the created evaluator.
createdAt (datetime) --
The timestamp when the evaluator was created.
status (string) --
The status of the evaluator creation operation.
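Putting the request fields above together, a complete call can be sketched as follows. All concrete values (evaluator name, instructions, model ID, tag keys) are illustrative placeholders, not values prescribed by the API, and the call itself requires AWS credentials plus a boto3 release that includes this operation, so it is shown commented out.

```python
# Sketch of a create_evaluator request. Every concrete value here is a
# placeholder, not something prescribed by the API documentation.
params = {
    "evaluatorName": "response-quality-judge",
    "description": "Scores agent responses for helpfulness on a 1-5 scale.",
    "evaluatorConfig": {
        "llmAsAJudge": {
            "instructions": "Rate how helpful the agent's final answer is.",
            "ratingScale": {
                "numerical": [
                    {"value": 1.0, "label": "poor",
                     "definition": "Unhelpful or incorrect answer."},
                    {"value": 5.0, "label": "excellent",
                     "definition": "Fully and correctly answers the question."},
                ]
            },
            "modelConfig": {
                "bedrockEvaluatorModelConfig": {
                    # Any supported Bedrock foundation model ID in your region.
                    "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",
                    "inferenceConfig": {"maxTokens": 512, "temperature": 0.0},
                }
            },
        }
    },
    "level": "TRACE",  # or 'TOOL_CALL' / 'SESSION'
    "tags": {"team": "agent-quality", "env": "dev"},  # tagging is new in this update
}

# Requires credentials and a boto3/botocore release that includes this operation:
# client = boto3.client("bedrock-agentcore-control")
# response = client.create_evaluator(**params)
# print(response["evaluatorArn"], response["status"])
```

The rating scale here is numerical; a categorical scale would instead supply `label`/`definition` pairs under the `categorical` key.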
create_online_evaluation_config (updated)
Changes (request): {'tags': {'string': 'string'}}
Creates an online evaluation configuration for continuous monitoring of agent performance. Online evaluation automatically samples live traffic from CloudWatch logs at specified rates and applies evaluators to assess agent quality in production.
See also: AWS API Documentation
Request Syntax
client.create_online_evaluation_config(
    clientToken='string',
    onlineEvaluationConfigName='string',
    description='string',
    rule={
        'samplingConfig': {
            'samplingPercentage': 123.0
        },
        'filters': [
            {
                'key': 'string',
                'operator': 'Equals'|'NotEquals'|'GreaterThan'|'LessThan'|'GreaterThanOrEqual'|'LessThanOrEqual'|'Contains'|'NotContains',
                'value': {
                    'stringValue': 'string',
                    'doubleValue': 123.0,
                    'booleanValue': True|False
                }
            },
        ],
        'sessionConfig': {
            'sessionTimeoutMinutes': 123
        }
    },
    dataSourceConfig={
        'cloudWatchLogs': {
            'logGroupNames': [
                'string',
            ],
            'serviceNames': [
                'string',
            ]
        }
    },
    evaluators=[
        {
            'evaluatorId': 'string'
        },
    ],
    evaluationExecutionRoleArn='string',
    enableOnCreate=True|False,
    tags={
        'string': 'string'
    }
)
clientToken (string) --
A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If you don't specify this field, a value is randomly generated for you. If this token matches a previous request, the service ignores the request, but doesn't return an error. For more information, see Ensuring idempotency.
This field is autopopulated if not provided.
onlineEvaluationConfigName (string) -- [REQUIRED]
The name of the online evaluation configuration. Must be unique within your account.
description (string) --
The description of the online evaluation configuration that explains its monitoring purpose and scope.
rule (dict) -- [REQUIRED]
The evaluation rule that defines sampling configuration, filters, and session detection settings for the online evaluation.
samplingConfig (dict) -- [REQUIRED]
The sampling configuration that determines what percentage of agent traces to evaluate.
samplingPercentage (float) -- [REQUIRED]
The percentage of agent traces to sample for evaluation, ranging from 0.01% to 100%.
filters (list) --
The list of filters that determine which agent traces should be included in the evaluation based on trace properties.
(dict) --
The filter that applies conditions to agent traces during online evaluation to determine which traces should be evaluated.
key (string) -- [REQUIRED]
The key or field name to filter on within the agent trace data.
operator (string) -- [REQUIRED]
The comparison operator to use for filtering.
value (dict) -- [REQUIRED]
The value to compare against using the specified operator.
stringValue (string) --
The string value for text-based filtering.
doubleValue (float) --
The numeric value for numerical filtering and comparisons.
booleanValue (boolean) --
The boolean value for true/false filtering conditions.
sessionConfig (dict) --
The session configuration that defines timeout settings for detecting when agent sessions are complete and ready for evaluation.
sessionTimeoutMinutes (integer) -- [REQUIRED]
The number of minutes of inactivity after which an agent session is considered complete and ready for evaluation. Default is 15 minutes.
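The filter shape above (a key, one of eight comparison operators, and a typed value) can be illustrated with a small local model of the matching semantics. This is purely illustrative: the real filtering happens server-side inside the service, and the trace attribute name used here is made up.

```python
# Illustrative local model of how a rule filter might match a trace attribute.
# The service applies these filters server-side; this sketch only mirrors the
# documented shape: a key, an operator, and a typed value.
def filter_matches(filter_spec, trace):
    actual = trace.get(filter_spec["key"])
    v = filter_spec["value"]
    # Exactly one of the typed value fields is expected to be set.
    expected = v.get("stringValue", v.get("doubleValue", v.get("booleanValue")))
    op = filter_spec["operator"]
    if op == "Equals":
        return actual == expected
    if op == "NotEquals":
        return actual != expected
    if op == "GreaterThan":
        return actual > expected
    if op == "LessThan":
        return actual < expected
    if op == "GreaterThanOrEqual":
        return actual >= expected
    if op == "LessThanOrEqual":
        return actual <= expected
    if op == "Contains":
        return expected in actual
    if op == "NotContains":
        return expected not in actual
    raise ValueError(f"unknown operator: {op}")

# Hypothetical filter: only evaluate traces slower than 500 ms.
latency_filter = {
    "key": "latencyMs",
    "operator": "GreaterThan",
    "value": {"doubleValue": 500.0},
}
print(filter_matches(latency_filter, {"latencyMs": 750.0}))  # True
```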
dataSourceConfig (dict) -- [REQUIRED]
The data source configuration that specifies CloudWatch log groups and service names to monitor for agent traces.
cloudWatchLogs (dict) --
The CloudWatch logs configuration for reading agent traces from log groups.
logGroupNames (list) -- [REQUIRED]
The list of CloudWatch log group names to monitor for agent traces.
(string) --
serviceNames (list) -- [REQUIRED]
The list of service names to filter traces within the specified log groups. Used to identify relevant agent sessions.
(string) --
evaluators (list) -- [REQUIRED]
The list of evaluators to apply during online evaluation. Can include both built-in evaluators and custom evaluators created with CreateEvaluator.
(dict) --
The reference to an evaluator used in online evaluation configurations, containing the evaluator identifier.
evaluatorId (string) --
The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
evaluationExecutionRoleArn (string) -- [REQUIRED]
The Amazon Resource Name (ARN) of the IAM role that grants permissions to read from CloudWatch logs, write evaluation results, and invoke Amazon Bedrock models for evaluation.
enableOnCreate (boolean) -- [REQUIRED]
Whether to enable the online evaluation configuration immediately upon creation. If true, evaluation begins automatically.
tags (dict) --
A map of tag keys and values to assign to an AgentCore Online Evaluation Config. Tags enable you to categorize your resources in different ways, for example, by purpose, owner, or environment.
(string) --
(string) --
Return type: dict
Response Syntax
{
    'onlineEvaluationConfigArn': 'string',
    'onlineEvaluationConfigId': 'string',
    'createdAt': datetime(2015, 1, 1),
    'outputConfig': {
        'cloudWatchConfig': {
            'logGroupName': 'string'
        }
    },
    'status': 'ACTIVE'|'CREATING'|'CREATE_FAILED'|'UPDATING'|'UPDATE_FAILED'|'DELETING',
    'executionStatus': 'ENABLED'|'DISABLED',
    'failureReason': 'string'
}
Response Structure
(dict) --
onlineEvaluationConfigArn (string) --
The Amazon Resource Name (ARN) of the created online evaluation configuration.
onlineEvaluationConfigId (string) --
The unique identifier of the created online evaluation configuration.
createdAt (datetime) --
The timestamp when the online evaluation configuration was created.
outputConfig (dict) --
The configuration that specifies where evaluation results should be written for monitoring and analysis.
cloudWatchConfig (dict) --
The CloudWatch configuration for writing evaluation results to CloudWatch logs with embedded metric format.
logGroupName (string) --
The name of the CloudWatch log group where evaluation results will be written. The log group will be created if it doesn't exist.
status (string) --
The status of the online evaluation configuration.
executionStatus (string) --
The execution status indicating whether the online evaluation is currently running.
failureReason (string) --
The reason for failure if the online evaluation configuration creation or execution failed.
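A complete request for this operation can be sketched the same way. The config name, log group, service name, role ARN, and custom evaluator ID below are placeholders (Builtin.Helpfulness is the built-in evaluator named in the evaluatorId description), and the commented-out call requires AWS credentials plus a boto3 release that includes this operation.

```python
# Sketch of a create_online_evaluation_config request. Log group, service
# name, role ARN, and evaluator IDs are illustrative placeholders.
params = {
    "onlineEvaluationConfigName": "prod-agent-monitoring",
    "rule": {
        "samplingConfig": {"samplingPercentage": 5.0},  # evaluate 5% of traces
        "filters": [
            {
                # Hypothetical attribute name; match traces slower than 500 ms.
                "key": "latencyMs",
                "operator": "GreaterThan",
                "value": {"doubleValue": 500.0},
            },
        ],
        "sessionConfig": {"sessionTimeoutMinutes": 15},  # documented default
    },
    "dataSourceConfig": {
        "cloudWatchLogs": {
            "logGroupNames": ["/aws/bedrock-agentcore/my-agent"],
            "serviceNames": ["my-agent-service"],
        }
    },
    "evaluators": [
        {"evaluatorId": "Builtin.Helpfulness"},     # built-in evaluator
        {"evaluatorId": "my-custom-evaluator-id"},  # from create_evaluator
    ],
    "evaluationExecutionRoleArn": "arn:aws:iam::123456789012:role/AgentCoreEvalRole",
    "enableOnCreate": True,
    "tags": {"team": "agent-quality"},  # tagging is new in this update
}

# Requires credentials and a boto3/botocore release that includes this operation:
# client = boto3.client("bedrock-agentcore-control")
# response = client.create_online_evaluation_config(**params)
# print(response["onlineEvaluationConfigArn"], response["executionStatus"])
```

Because enableOnCreate is true here, evaluation of sampled live traffic would begin as soon as the configuration is created.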