2026/02/02 - Amazon Bedrock AgentCore Control - 2 updated api methods
Changes: Adds tagging support for AgentCore Evaluations (evaluator and online evaluation config).
create_evaluator (updated)
Changes (request): {'tags': {'string': 'string'}}
Creates a custom evaluator for agent quality assessment. Custom evaluators use LLM-as-a-Judge configurations with user-defined prompts, rating scales, and model settings to evaluate agent performance at tool call, trace, or session levels.
See also: AWS API Documentation
Request Syntax
client.create_evaluator(
    clientToken='string',
    evaluatorName='string',
    description='string',
    evaluatorConfig={
        'llmAsAJudge': {
            'instructions': 'string',
            'ratingScale': {
                'numerical': [
                    {
                        'definition': 'string',
                        'value': 123.0,
                        'label': 'string'
                    },
                ],
                'categorical': [
                    {
                        'definition': 'string',
                        'label': 'string'
                    },
                ]
            },
            'modelConfig': {
                'bedrockEvaluatorModelConfig': {
                    'modelId': 'string',
                    'inferenceConfig': {
                        'maxTokens': 123,
                        'temperature': ...,
                        'topP': ...,
                        'stopSequences': [
                            'string',
                        ]
                    },
                    'additionalModelRequestFields': {...}|[...]|123|123.4|'string'|True|None
                }
            }
        }
    },
    level='TOOL_CALL'|'TRACE'|'SESSION',
    tags={
        'string': 'string'
    }
)
clientToken (string) --
A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If you don't specify this field, a value is randomly generated for you. If this token matches a previous request, the service ignores the request, but doesn't return an error. For more information, see Ensuring idempotency.
This field is autopopulated if not provided.
evaluatorName (string) -- [REQUIRED]
The name of the evaluator. Must be unique within your account.
description (string) --
The description of the evaluator that explains its purpose and evaluation criteria.
evaluatorConfig (dict) -- [REQUIRED]
The configuration for the evaluator, including LLM-as-a-Judge settings with instructions, rating scale, and model configuration.
llmAsAJudge (dict) --
The LLM-as-a-Judge configuration that uses a language model to evaluate agent performance based on custom instructions and rating scales.
instructions (string) -- [REQUIRED]
The evaluation instructions that guide the language model in assessing agent performance, including criteria and evaluation guidelines.
ratingScale (dict) -- [REQUIRED]
The rating scale that defines how the evaluator should score agent performance, either numerical or categorical.
numerical (list) --
The numerical rating scale with defined score values and descriptions for quantitative evaluation.
(dict) --
The definition of a numerical rating scale option that provides a numeric value with its description for evaluation scoring.
definition (string) -- [REQUIRED]
The description that explains what this numerical rating represents and when it should be used.
value (float) -- [REQUIRED]
The numerical value for this rating scale option.
label (string) -- [REQUIRED]
The label or name that describes this numerical rating option.
categorical (list) --
The categorical rating scale with named categories and definitions for qualitative evaluation.
(dict) --
The definition of a categorical rating scale option that provides a named category with its description for evaluation scoring.
definition (string) -- [REQUIRED]
The description that explains what this categorical rating represents and when it should be used.
label (string) -- [REQUIRED]
The label or name of this categorical rating option.
modelConfig (dict) -- [REQUIRED]
The model configuration that specifies which foundation model to use and how to configure it for evaluation.
bedrockEvaluatorModelConfig (dict) --
The Amazon Bedrock model configuration for evaluation.
modelId (string) -- [REQUIRED]
The identifier of the Amazon Bedrock model to use for evaluation. Must be a supported foundation model available in your region.
inferenceConfig (dict) --
The inference configuration parameters that control model behavior during evaluation, including temperature, token limits, and sampling settings.
maxTokens (integer) --
The maximum number of tokens to generate in the model response during evaluation.
temperature (float) --
The temperature value that controls randomness in the model's responses. Lower values produce more deterministic outputs.
topP (float) --
The top-p sampling parameter that controls the diversity of the model's responses by limiting the cumulative probability of token choices.
stopSequences (list) --
The list of sequences that will cause the model to stop generating tokens when encountered.
(string) --
additionalModelRequestFields (document) --
Additional model-specific request fields to customize model behavior beyond the standard inference configuration.
level (string) -- [REQUIRED]
The evaluation level that determines the scope of evaluation. Valid values are TOOL_CALL for individual tool invocations, TRACE for single request-response interactions, or SESSION for entire conversation sessions.
tags (dict) --
A map of tag keys and values to assign to an AgentCore Evaluator. Tags enable you to categorize your resources in different ways, for example, by purpose, owner, or environment.
(string) --
(string) --
Return type: dict
Response Syntax
{
    'evaluatorArn': 'string',
    'evaluatorId': 'string',
    'createdAt': datetime(2015, 1, 1),
    'status': 'ACTIVE'|'CREATING'|'CREATE_FAILED'|'UPDATING'|'UPDATE_FAILED'|'DELETING'
}
Response Structure
(dict) --
evaluatorArn (string) --
The Amazon Resource Name (ARN) of the created evaluator.
evaluatorId (string) --
The unique identifier of the created evaluator.
createdAt (datetime) --
The timestamp when the evaluator was created.
status (string) --
The status of the evaluator creation operation.
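Putting the request fields above together, a complete call can be sketched as follows. All concrete values (evaluator name, instructions, model ID, tag keys) are illustrative placeholders, not values prescribed by the API, and the call itself requires AWS credentials plus a boto3 release that includes this operation, so it is shown commented out.

```python
# Sketch of a create_evaluator request. Every concrete value here is a
# placeholder, not something prescribed by the API documentation.
params = {
    "evaluatorName": "response-quality-judge",
    "description": "Scores agent responses for helpfulness on a 1-5 scale.",
    "evaluatorConfig": {
        "llmAsAJudge": {
            "instructions": "Rate how helpful the agent's final answer is.",
            "ratingScale": {
                "numerical": [
                    {"value": 1.0, "label": "poor",
                     "definition": "Unhelpful or incorrect answer."},
                    {"value": 5.0, "label": "excellent",
                     "definition": "Fully and correctly answers the question."},
                ]
            },
            "modelConfig": {
                "bedrockEvaluatorModelConfig": {
                    # Any supported Bedrock foundation model ID in your region.
                    "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",
                    "inferenceConfig": {"maxTokens": 512, "temperature": 0.0},
                }
            },
        }
    },
    "level": "TRACE",  # or 'TOOL_CALL' / 'SESSION'
    "tags": {"team": "agent-quality", "env": "dev"},  # tagging is new in this update
}

# Requires credentials and a boto3/botocore release that includes this operation:
# client = boto3.client("bedrock-agentcore-control")
# response = client.create_evaluator(**params)
# print(response["evaluatorArn"], response["status"])
```

The rating scale here is numerical; a categorical scale would instead supply `label`/`definition` pairs under the `categorical` key.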
create_online_evaluation_config (updated)
Changes (request): {'tags': {'string': 'string'}}
Creates an online evaluation configuration for continuous monitoring of agent performance. Online evaluation automatically samples live traffic from CloudWatch logs at specified rates and applies evaluators to assess agent quality in production.
See also: AWS API Documentation
Request Syntax
client.create_online_evaluation_config(
    clientToken='string',
    onlineEvaluationConfigName='string',
    description='string',
    rule={
        'samplingConfig': {
            'samplingPercentage': 123.0
        },
        'filters': [
            {
                'key': 'string',
                'operator': 'Equals'|'NotEquals'|'GreaterThan'|'LessThan'|'GreaterThanOrEqual'|'LessThanOrEqual'|'Contains'|'NotContains',
                'value': {
                    'stringValue': 'string',
                    'doubleValue': 123.0,
                    'booleanValue': True|False
                }
            },
        ],
        'sessionConfig': {
            'sessionTimeoutMinutes': 123
        }
    },
    dataSourceConfig={
        'cloudWatchLogs': {
            'logGroupNames': [
                'string',
            ],
            'serviceNames': [
                'string',
            ]
        }
    },
    evaluators=[
        {
            'evaluatorId': 'string'
        },
    ],
    evaluationExecutionRoleArn='string',
    enableOnCreate=True|False,
    tags={
        'string': 'string'
    }
)
clientToken (string) --
A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If you don't specify this field, a value is randomly generated for you. If this token matches a previous request, the service ignores the request, but doesn't return an error. For more information, see Ensuring idempotency.
This field is autopopulated if not provided.
onlineEvaluationConfigName (string) -- [REQUIRED]
The name of the online evaluation configuration. Must be unique within your account.
description (string) --
The description of the online evaluation configuration that explains its monitoring purpose and scope.
rule (dict) -- [REQUIRED]
The evaluation rule that defines sampling configuration, filters, and session detection settings for the online evaluation.
samplingConfig (dict) -- [REQUIRED]
The sampling configuration that determines what percentage of agent traces to evaluate.
samplingPercentage (float) -- [REQUIRED]
The percentage of agent traces to sample for evaluation, ranging from 0.01% to 100%.
filters (list) --
The list of filters that determine which agent traces should be included in the evaluation based on trace properties.
(dict) --
The filter that applies conditions to agent traces during online evaluation to determine which traces should be evaluated.
key (string) -- [REQUIRED]
The key or field name to filter on within the agent trace data.
operator (string) -- [REQUIRED]
The comparison operator to use for filtering.
value (dict) -- [REQUIRED]
The value to compare against using the specified operator.
stringValue (string) --
The string value for text-based filtering.
doubleValue (float) --
The numeric value for numerical filtering and comparisons.
booleanValue (boolean) --
The boolean value for true/false filtering conditions.
sessionConfig (dict) --
The session configuration that defines timeout settings for detecting when agent sessions are complete and ready for evaluation.
sessionTimeoutMinutes (integer) -- [REQUIRED]
The number of minutes of inactivity after which an agent session is considered complete and ready for evaluation. Default is 15 minutes.
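The filter shape above (a key, one of eight comparison operators, and a typed value) can be illustrated with a small local model of the matching semantics. This is purely illustrative: the real filtering happens server-side inside the service, and the trace attribute name used here is made up.

```python
# Illustrative local model of how a rule filter might match a trace attribute.
# The service applies these filters server-side; this sketch only mirrors the
# documented shape: a key, an operator, and a typed value.
def filter_matches(filter_spec, trace):
    actual = trace.get(filter_spec["key"])
    v = filter_spec["value"]
    # Exactly one of the typed value fields is expected to be set.
    expected = v.get("stringValue", v.get("doubleValue", v.get("booleanValue")))
    op = filter_spec["operator"]
    if op == "Equals":
        return actual == expected
    if op == "NotEquals":
        return actual != expected
    if op == "GreaterThan":
        return actual > expected
    if op == "LessThan":
        return actual < expected
    if op == "GreaterThanOrEqual":
        return actual >= expected
    if op == "LessThanOrEqual":
        return actual <= expected
    if op == "Contains":
        return expected in actual
    if op == "NotContains":
        return expected not in actual
    raise ValueError(f"unknown operator: {op}")

# Hypothetical filter: only evaluate traces slower than 500 ms.
latency_filter = {
    "key": "latencyMs",
    "operator": "GreaterThan",
    "value": {"doubleValue": 500.0},
}
print(filter_matches(latency_filter, {"latencyMs": 750.0}))  # True
```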
dataSourceConfig (dict) -- [REQUIRED]
The data source configuration that specifies CloudWatch log groups and service names to monitor for agent traces.
cloudWatchLogs (dict) --
The CloudWatch logs configuration for reading agent traces from log groups.
logGroupNames (list) -- [REQUIRED]
The list of CloudWatch log group names to monitor for agent traces.
(string) --
serviceNames (list) -- [REQUIRED]
The list of service names to filter traces within the specified log groups. Used to identify relevant agent sessions.
(string) --
evaluators (list) -- [REQUIRED]
The list of evaluators to apply during online evaluation. Can include both built-in evaluators and custom evaluators created with CreateEvaluator.
(dict) --
The reference to an evaluator used in online evaluation configurations, containing the evaluator identifier.
evaluatorId (string) --
The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
evaluationExecutionRoleArn (string) -- [REQUIRED]
The Amazon Resource Name (ARN) of the IAM role that grants permissions to read from CloudWatch logs, write evaluation results, and invoke Amazon Bedrock models for evaluation.
enableOnCreate (boolean) -- [REQUIRED]
Whether to enable the online evaluation configuration immediately upon creation. If true, evaluation begins automatically.
tags (dict) --
A map of tag keys and values to assign to an AgentCore Online Evaluation Config. Tags enable you to categorize your resources in different ways, for example, by purpose, owner, or environment.
(string) --
(string) --
Return type: dict
Response Syntax
{
    'onlineEvaluationConfigArn': 'string',
    'onlineEvaluationConfigId': 'string',
    'createdAt': datetime(2015, 1, 1),
    'outputConfig': {
        'cloudWatchConfig': {
            'logGroupName': 'string'
        }
    },
    'status': 'ACTIVE'|'CREATING'|'CREATE_FAILED'|'UPDATING'|'UPDATE_FAILED'|'DELETING',
    'executionStatus': 'ENABLED'|'DISABLED',
    'failureReason': 'string'
}
Response Structure
(dict) --
onlineEvaluationConfigArn (string) --
The Amazon Resource Name (ARN) of the created online evaluation configuration.
onlineEvaluationConfigId (string) --
The unique identifier of the created online evaluation configuration.
createdAt (datetime) --
The timestamp when the online evaluation configuration was created.
outputConfig (dict) --
The configuration that specifies where evaluation results should be written for monitoring and analysis.
cloudWatchConfig (dict) --
The CloudWatch configuration for writing evaluation results to CloudWatch logs with embedded metric format.
logGroupName (string) --
The name of the CloudWatch log group where evaluation results will be written. The log group will be created if it doesn't exist.
status (string) --
The status of the online evaluation configuration.
executionStatus (string) --
The execution status indicating whether the online evaluation is currently running.
failureReason (string) --
The reason for failure if the online evaluation configuration creation or execution failed.
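A complete request for this operation can be sketched the same way. The config name, log group, service name, role ARN, and custom evaluator ID below are placeholders (Builtin.Helpfulness is the built-in evaluator named in the evaluatorId description), and the commented-out call requires AWS credentials plus a boto3 release that includes this operation.

```python
# Sketch of a create_online_evaluation_config request. Log group, service
# name, role ARN, and evaluator IDs are illustrative placeholders.
params = {
    "onlineEvaluationConfigName": "prod-agent-monitoring",
    "rule": {
        "samplingConfig": {"samplingPercentage": 5.0},  # evaluate 5% of traces
        "filters": [
            {
                # Hypothetical attribute name; match traces slower than 500 ms.
                "key": "latencyMs",
                "operator": "GreaterThan",
                "value": {"doubleValue": 500.0},
            },
        ],
        "sessionConfig": {"sessionTimeoutMinutes": 15},  # documented default
    },
    "dataSourceConfig": {
        "cloudWatchLogs": {
            "logGroupNames": ["/aws/bedrock-agentcore/my-agent"],
            "serviceNames": ["my-agent-service"],
        }
    },
    "evaluators": [
        {"evaluatorId": "Builtin.Helpfulness"},     # built-in evaluator
        {"evaluatorId": "my-custom-evaluator-id"},  # from create_evaluator
    ],
    "evaluationExecutionRoleArn": "arn:aws:iam::123456789012:role/AgentCoreEvalRole",
    "enableOnCreate": True,
    "tags": {"team": "agent-quality"},  # tagging is new in this update
}

# Requires credentials and a boto3/botocore release that includes this operation:
# client = boto3.client("bedrock-agentcore-control")
# response = client.create_online_evaluation_config(**params)
# print(response["onlineEvaluationConfigArn"], response["executionStatus"])
```

Because enableOnCreate is true here, evaluation of sampled live traffic would begin as soon as the configuration is created.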