2026/06/17 - Amazon Bedrock AgentCore - 3 updated api methods
Changes AgentCore Harness service will be Generally Available at NYS 2026 with this Treb release. Harness will support invoking specific endpoints via the qualifier parameter, AWS Skills for pre-built agent capabilities, and improved validation for skill git source URLs.
{'dataSourceConfig': {'onlineEvaluationConfigSource': {'timeRange': {'endTime': 'timestamp',
'startTime': 'timestamp'}}},
'failureAnalysisResult': {'failures': {'subCategories': {'rootCauses': {'affectedSessions': {'failureSpans': {'signals': {'category': {'other'}}}}}}}}}
Retrieves detailed information about a batch evaluation, including its status, configuration, results, and any error details.
See also: AWS API Documentation
Request Syntax
client.get_batch_evaluation(
batchEvaluationId='string'
)
string
[REQUIRED]
The unique identifier of the batch evaluation to retrieve.
dict
Response Syntax
{
'batchEvaluationId': 'string',
'batchEvaluationArn': 'string',
'batchEvaluationName': 'string',
'status': 'PENDING'|'IN_PROGRESS'|'COMPLETED'|'COMPLETED_WITH_ERRORS'|'FAILED'|'STOPPING'|'STOPPED'|'DELETING',
'createdAt': datetime(2015, 1, 1),
'evaluators': [
{
'evaluatorId': 'string'
},
],
'insights': [
{
'insightId': 'string'
},
],
'dataSourceConfig': {
'cloudWatchLogs': {
'serviceNames': [
'string',
],
'logGroupNames': [
'string',
],
'filterConfig': {
'sessionIds': [
'string',
],
'timeRange': {
'startTime': datetime(2015, 1, 1),
'endTime': datetime(2015, 1, 1)
}
}
},
'onlineEvaluationConfigSource': {
'onlineEvaluationConfigArn': 'string',
'timeRange': {
'startTime': datetime(2015, 1, 1),
'endTime': datetime(2015, 1, 1)
}
}
},
'outputConfig': {
'cloudWatchConfig': {
'logGroupName': 'string',
'logStreamName': 'string'
}
},
'evaluationResults': {
'numberOfSessionsCompleted': 123,
'numberOfSessionsInProgress': 123,
'numberOfSessionsFailed': 123,
'totalNumberOfSessions': 123,
'numberOfSessionsIgnored': 123,
'evaluatorSummaries': [
{
'evaluatorId': 'string',
'statistics': {
'averageScore': 123.0
},
'totalEvaluated': 123,
'totalFailed': 123
},
]
},
'failureAnalysisResult': {
'failures': [
{
'clusterId': 123,
'name': 'string',
'description': 'string',
'affectedSessionCount': 123,
'subCategories': [
{
'clusterId': 123,
'name': 'string',
'description': 'string',
'affectedSessionCount': 123,
'rootCauses': [
{
'clusterId': 123,
'name': 'string',
'rootCause': 'string',
'recommendation': 'string',
'affectedSessionCount': 123,
'affectedSessions': [
{
'sessionId': 'string',
'explanation': 'string',
'fixType': 'string',
'recommendation': 'string',
'failureSpans': [
{
'spanId': 'string',
'traceId': 'string',
'signals': [
{
'category': 'execution-error-category-authentication'|'execution-error-category-resource-not-found'|'execution-error-category-service-errors'|'execution-error-category-rate-limiting'|'execution-error-category-formatting'|'execution-error-category-timeout'|'execution-error-category-resource-exhaustion'|'execution-error-category-environment'|'execution-error-category-tool-schema'|'task-instruction-category-non-compliance'|'task-instruction-category-problem-id'|'incorrect-actions-category-tool-selection'|'incorrect-actions-category-poor-information-retrieval'|'incorrect-actions-category-clarification'|'incorrect-actions-category-inappropriate-info-request'|'context-handling-error-category-context-handling-failures'|'hallucination-category-hall-capabilities'|'hallucination-category-hall-misunderstand'|'hallucination-category-hall-usage'|'hallucination-category-hall-history'|'hallucination-category-hall-params'|'hallucination-category-fabricate-tool-outputs'|'repetitive-behavior-category-repetition-tool'|'repetitive-behavior-category-repetition-info'|'repetitive-behavior-category-step-repetition'|'orchestration-related-errors-category-reasoning-mismatch'|'orchestration-related-errors-category-goal-deviation'|'orchestration-related-errors-category-premature-termination'|'orchestration-related-errors-category-unaware-termination'|'llm-output-category-nonsensical'|'configuration-mismatch-category-tool-definition'|'coding-use-case-specific-failure-types-category-edge-case-oversights'|'coding-use-case-specific-failure-types-category-dependency-issues'|'other',
'evidence': 'string',
'confidence': 123.0
},
]
},
]
},
]
},
]
},
]
},
]
},
'userIntentResult': {
'userIntents': [
{
'clusterId': 123,
'name': 'string',
'description': 'string',
'affectedSessionCount': 123,
'affectedSessions': [
{
'sessionId': 'string',
'userMessages': [
'string',
]
},
]
},
]
},
'executionSummaryResult': {
'executionSummaries': [
{
'clusterId': 123,
'name': 'string',
'description': 'string',
'affectedSessionCount': 123,
'affectedSessions': [
{
'sessionId': 'string',
'approachTaken': 'string',
'finalOutcome': 'string'
},
]
},
]
},
'errorDetails': [
'string',
],
'description': 'string',
'updatedAt': datetime(2015, 1, 1),
'kmsKeyArn': 'string'
}
Response Structure
(dict) --
batchEvaluationId (string) --
The unique identifier of the batch evaluation.
batchEvaluationArn (string) --
The Amazon Resource Name (ARN) of the batch evaluation.
batchEvaluationName (string) --
The name of the batch evaluation.
status (string) --
The current status of the batch evaluation.
createdAt (datetime) --
The timestamp when the batch evaluation was created.
evaluators (list) --
The list of evaluators applied during the batch evaluation.
(dict) --
An evaluator to run against sessions during batch evaluation.
evaluatorId (string) --
The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
insights (list) --
The list of insight analyses applied during the batch evaluation.
(dict) --
A reference to an insight analysis to run against sessions during batch evaluation. Insights provide deeper analysis beyond individual evaluator scores, including failure detection, user intent clustering, and execution summarization.
insightId (string) --
The unique identifier of the insight to run.
dataSourceConfig (dict) --
The data source configuration specifying where agent traces are pulled from.
cloudWatchLogs (dict) --
Configuration for pulling agent session traces from CloudWatch Logs.
serviceNames (list) --
The list of agent service names to filter traces within the specified log groups.
(string) --
logGroupNames (list) --
The list of CloudWatch log group names to read agent traces from. Maximum of 5 log groups.
(string) --
filterConfig (dict) --
Optional filter configuration to narrow down which sessions to evaluate.
sessionIds (list) --
A list of specific session IDs to evaluate. If specified, only these sessions are included in the evaluation.
(string) --
timeRange (dict) --
The time range filter for selecting sessions to evaluate.
startTime (datetime) --
The start time of the time range. Only sessions with activity at or after this timestamp are included.
endTime (datetime) --
The end time of the time range. Only sessions with activity before this timestamp are included.
onlineEvaluationConfigSource (dict) --
Reference an existing OnlineEvaluationConfig as session source
onlineEvaluationConfigArn (string) --
The Amazon Resource Name (ARN) of the online evaluation configuration to use as the session source.
timeRange (dict) --
Optional session filter configuration to narrow down which sessions from the online evaluation configuration to include.
startTime (datetime) --
The start time of the time range. Only sessions with activity at or after this timestamp are included.
endTime (datetime) --
The end time of the time range. Only sessions with activity before this timestamp are included.
outputConfig (dict) --
The output configuration specifying where evaluation results are written.
cloudWatchConfig (dict) --
The CloudWatch Logs configuration for writing evaluation results.
logGroupName (string) --
The name of the CloudWatch log group where evaluation results will be written.
logStreamName (string) --
The name of the CloudWatch log stream where evaluation results will be written.
evaluationResults (dict) --
The aggregated evaluation results, including session completion counts and evaluator score summaries.
numberOfSessionsCompleted (integer) --
The number of sessions that have been successfully evaluated.
numberOfSessionsInProgress (integer) --
The number of sessions currently being evaluated.
numberOfSessionsFailed (integer) --
The number of sessions that failed evaluation.
totalNumberOfSessions (integer) --
The total number of sessions included in the batch evaluation.
numberOfSessionsIgnored (integer) --
The number of sessions that were ignored during evaluation.
evaluatorSummaries (list) --
A list of per-evaluator summary statistics.
(dict) --
Summary statistics for a single evaluator within a batch evaluation.
evaluatorId (string) --
The unique identifier of the evaluator.
statistics (dict) --
The aggregated statistics for this evaluator.
averageScore (float) --
The average score across all evaluated sessions for this evaluator.
totalEvaluated (integer) --
The total number of sessions evaluated by this evaluator.
totalFailed (integer) --
The total number of sessions that failed evaluation by this evaluator.
failureAnalysisResult (dict) --
The failure analysis results from insights, containing categorized failure clusters with root causes and recommendations.
failures (list) --
The list of failure category clusters identified across analyzed sessions.
(dict) --
A top-level failure category identified by clustering similar failure patterns across sessions.
clusterId (integer) --
The unique identifier of the failure category cluster.
name (string) --
The name of the failure category.
description (string) --
A description of the failure category pattern.
affectedSessionCount (integer) --
The number of sessions affected by this failure category.
subCategories (list) --
The list of failure subcategories within this category.
(dict) --
A subcategory of failures within a top-level failure category.
clusterId (integer) --
The unique identifier of the failure subcategory cluster.
name (string) --
The name of the failure subcategory.
description (string) --
A description of the failure subcategory pattern.
affectedSessionCount (integer) --
The number of sessions affected by this failure subcategory.
rootCauses (list) --
The list of root cause clusters identified within this subcategory.
(dict) --
A cluster of similar root causes identified within a failure subcategory.
clusterId (integer) --
The unique identifier of the root cause cluster.
name (string) --
The name of the root cause cluster.
rootCause (string) --
The root cause explanation for this cluster of failures.
recommendation (string) --
The recommended fix for this root cause.
affectedSessionCount (integer) --
The number of sessions affected by this root cause.
affectedSessions (list) --
The list of sessions affected by this root cause.
(dict) --
A session affected by a detected failure pattern, including root cause details.
sessionId (string) --
The unique identifier of the affected session.
explanation (string) --
An explanation of how the failure manifested in this session.
fixType (string) --
The type of fix recommended for this failure.
recommendation (string) --
The specific fix recommendation for this session.
failureSpans (list) --
The list of spans where failures were detected in this session.
(dict) --
Details about a specific span where a failure was detected.
spanId (string) --
The unique identifier of the span where the failure occurred.
traceId (string) --
The trace identifier associated with the failure span.
signals (list) --
The failure signals detected in this span.
(dict) --
A signal indicating a detected failure within a span.
category (string) --
The failure category classification for this signal.
evidence (string) --
The evidence supporting the failure detection.
confidence (float) --
The confidence score of the failure detection.
userIntentResult (dict) --
The user intent clustering results from insights, containing grouped user intents across evaluated sessions.
userIntents (list) --
The list of user intent clusters identified across analyzed sessions.
(dict) --
A cluster of similar user intents identified across sessions.
clusterId (integer) --
The unique identifier of the user intent cluster.
name (string) --
The name of the user intent cluster.
description (string) --
A description of the user intent pattern.
affectedSessionCount (integer) --
The number of sessions with this user intent.
affectedSessions (list) --
The list of sessions with this user intent.
(dict) --
A session associated with a user intent cluster.
sessionId (string) --
The unique identifier of the session.
userMessages (list) --
The user messages from this session that contributed to the intent cluster.
(string) --
executionSummaryResult (dict) --
The execution summary clustering results from insights, containing grouped execution patterns across evaluated sessions.
executionSummaries (list) --
The list of execution summary clusters identified across analyzed sessions.
(dict) --
A cluster of similar execution patterns identified across sessions.
clusterId (integer) --
The unique identifier of the execution summary cluster.
name (string) --
The name of the execution pattern cluster.
description (string) --
A description of the execution pattern.
affectedSessionCount (integer) --
The number of sessions with this execution pattern.
affectedSessions (list) --
The list of sessions with this execution pattern.
(dict) --
A session associated with an execution summary cluster.
sessionId (string) --
The unique identifier of the session.
approachTaken (string) --
The approach taken by the agent during this session.
finalOutcome (string) --
The final outcome of the session.
errorDetails (list) --
The error details if the batch evaluation encountered failures.
(string) --
description (string) --
The description of the batch evaluation.
updatedAt (datetime) --
The timestamp when the batch evaluation was last updated.
kmsKeyArn (string) --
The ARN of the KMS key used to encrypt evaluation data.
{'qualifier': 'string', 'skills': {'awsSkills': {'paths': ['string']}}}
Operation to invoke a Harness.
See also: AWS API Documentation
Request Syntax
client.invoke_harness(
harnessArn='string',
qualifier='string',
runtimeSessionId='string',
runtimeUserId='string',
messages=[
{
'role': 'user'|'assistant',
'content': [
{
'text': 'string',
'toolUse': {
'name': 'string',
'toolUseId': 'string',
'input': {...}|[...]|123|123.4|'string'|True|None,
'type': 'tool_use'|'server_tool_use'|'mcp_tool_use',
'serverName': 'string'
},
'toolResult': {
'toolUseId': 'string',
'content': [
{
'text': 'string',
'json': {...}|[...]|123|123.4|'string'|True|None
},
],
'status': 'success'|'error',
'type': 'tool_use'|'server_tool_use'|'mcp_tool_use'
},
'reasoningContent': {
'reasoningText': {
'text': 'string',
'signature': 'string'
},
'redactedContent': b'bytes'
}
},
]
},
],
model={
'bedrockModelConfig': {
'modelId': 'string',
'maxTokens': 123,
'temperature': ...,
'topP': ...,
'apiFormat': 'converse_stream'|'responses'|'chat_completions',
'additionalParams': {...}|[...]|123|123.4|'string'|True|None
},
'openAiModelConfig': {
'modelId': 'string',
'apiKeyArn': 'string',
'maxTokens': 123,
'temperature': ...,
'topP': ...,
'apiFormat': 'chat_completions'|'responses',
'additionalParams': {...}|[...]|123|123.4|'string'|True|None
},
'geminiModelConfig': {
'modelId': 'string',
'apiKeyArn': 'string',
'maxTokens': 123,
'temperature': ...,
'topP': ...,
'topK': 123
},
'liteLlmModelConfig': {
'modelId': 'string',
'apiKeyArn': 'string',
'apiBase': 'string',
'maxTokens': 123,
'temperature': ...,
'topP': ...,
'additionalParams': {...}|[...]|123|123.4|'string'|True|None
}
},
systemPrompt=[
{
'text': 'string'
},
],
tools=[
{
'type': 'remote_mcp'|'agentcore_browser'|'agentcore_gateway'|'inline_function'|'agentcore_code_interpreter',
'name': 'string',
'config': {
'remoteMcp': {
'url': 'string',
'headers': {
'string': 'string'
}
},
'agentCoreBrowser': {
'browserArn': 'string'
},
'agentCoreGateway': {
'gatewayArn': 'string',
'outboundAuth': {
'awsIam': {}
,
'none': {}
,
'oauth': {
'providerArn': 'string',
'scopes': [
'string',
],
'customParameters': {
'string': 'string'
},
'grantType': 'CLIENT_CREDENTIALS'|'AUTHORIZATION_CODE'|'TOKEN_EXCHANGE',
'defaultReturnUrl': 'string'
}
}
},
'inlineFunction': {
'description': 'string',
'inputSchema': {...}|[...]|123|123.4|'string'|True|None
},
'agentCoreCodeInterpreter': {
'codeInterpreterArn': 'string'
}
}
},
],
skills=[
{
'path': 'string',
's3': {
'uri': 'string'
},
'git': {
'url': 'string',
'path': 'string',
'auth': {
'credentialArn': 'string',
'username': 'string'
}
},
'awsSkills': {
'paths': [
'string',
]
}
},
],
allowedTools=[
'string',
],
maxIterations=123,
maxTokens=123,
timeoutSeconds=123,
actorId='string'
)
string
[REQUIRED]
The ARN of the harness to invoke.
string
The endpoint name to invoke. If omitted, the DEFAULT endpoint is used.
string
[REQUIRED]
The session ID for the invocation. Use the same session ID across requests to continue a conversation.
string
An identifier for the end user making the request. This value is passed through to the runtime container.
list
[REQUIRED]
The messages to send to the agent.
(dict) --
A message in the conversation.
role (string) -- [REQUIRED]
The role of the message sender.
content (list) -- [REQUIRED]
The content blocks of the message.
(dict) --
A content block within a message.
text (string) --
Text content.
toolUse (dict) --
A tool use request from the model.
name (string) -- [REQUIRED]
The name of the tool to call.
toolUseId (string) -- [REQUIRED]
The unique ID of this tool use.
input (:ref:`document<document>`) -- [REQUIRED]
The JSON input to pass to the tool.
type (string) --
The type of tool use.
serverName (string) --
The name of the MCP server providing this tool.
toolResult (dict) --
A tool execution result.
toolUseId (string) -- [REQUIRED]
The tool use ID that this result corresponds to.
content (list) -- [REQUIRED]
The content of the tool result.
(dict) --
A content block within a tool result.
text (string) --
Text content.
json (:ref:`document<document>`) --
JSON content.
status (string) --
The status of the tool execution.
type (string) --
The type of tool use that produced this result.
reasoningContent (dict) --
Model reasoning content.
reasoningText (dict) --
The reasoning text.
text (string) -- [REQUIRED]
The reasoning text.
signature (string) --
Signature for verifying the reasoning content.
redactedContent (bytes) --
Redacted reasoning content.
dict
The model configuration to use for this invocation. If specified, overrides the harness default.
bedrockModelConfig (dict) --
Configuration for an Amazon Bedrock model.
modelId (string) -- [REQUIRED]
The Bedrock model ID.
maxTokens (integer) --
The maximum number of tokens to allow in the generated response per iteration.
temperature (float) --
The temperature to set when calling the model.
topP (float) --
The topP set when calling the model.
apiFormat (string) --
The API format to use when calling the Bedrock provider.
additionalParams (:ref:`document<document>`) --
Provider-specific parameters passed through to the model provider unchanged.
openAiModelConfig (dict) --
Configuration for an OpenAI model.
modelId (string) -- [REQUIRED]
The OpenAI model ID.
apiKeyArn (string) -- [REQUIRED]
The ARN of your OpenAI API key on AgentCore Identity.
maxTokens (integer) --
The maximum number of tokens to allow in the generated response per iteration.
temperature (float) --
The temperature to set when calling the model.
topP (float) --
The topP set when calling the model.
apiFormat (string) --
The API format to use when calling the OpenAI provider.
additionalParams (:ref:`document<document>`) --
Provider-specific parameters passed through to the model provider unchanged.
geminiModelConfig (dict) --
Configuration for a Google Gemini model.
modelId (string) -- [REQUIRED]
The Gemini model ID.
apiKeyArn (string) -- [REQUIRED]
The ARN of your Gemini API key on AgentCore Identity.
maxTokens (integer) --
The maximum number of tokens to allow in the generated response per iteration.
temperature (float) --
The temperature to set when calling the model.
topP (float) --
The topP set when calling the model.
topK (integer) --
The topK set when calling the model.
liteLlmModelConfig (dict) --
The LiteLLM model configuration for connecting to third-party model providers.
modelId (string) -- [REQUIRED]
The LiteLLM model identifier (e.g., "anthropic/claude-3-sonnet").
apiKeyArn (string) --
The ARN of the API key in AgentCore Identity for authenticating with the model provider.
apiBase (string) --
The base URL for the model provider's API endpoint.
maxTokens (integer) --
The maximum number of tokens to allow in the generated response per iteration.
temperature (float) --
The temperature to set when calling the model.
topP (float) --
The topP set when calling the model.
additionalParams (:ref:`document<document>`) --
Provider-specific parameters passed through to the model provider unchanged.
list
The system prompt to use for this invocation. If specified, overrides the harness default.
(dict) --
A content block in the system prompt.
text (string) --
The text content of the system prompt block.
list
The tools available to the agent for this invocation. If specified, overrides the harness default.
(dict) --
A tool available to the agent loop.
type (string) -- [REQUIRED]
The type of tool.
name (string) --
Unique name for the tool. If not provided, a name will be inferred or generated.
config (dict) --
Tool-specific configuration.
remoteMcp (dict) --
Configuration for remote MCP server.
url (string) -- [REQUIRED]
URL of the MCP endpoint.
headers (dict) --
Custom headers to include when connecting to the remote MCP server.
(string) --
The key of an HTTP header.
(string) --
The value of an HTTP header.
agentCoreBrowser (dict) --
Configuration for AgentCore Browser.
browserArn (string) --
If not populated, the built-in Browser ARN is used.
agentCoreGateway (dict) --
Configuration for AgentCore Gateway.
gatewayArn (string) -- [REQUIRED]
The ARN of the desired AgentCore Gateway.
outboundAuth (dict) --
How harness authenticates to this Gateway. Defaults to AWS_IAM (SigV4) if omitted.
awsIam (dict) --
SigV4-sign requests using the agent's execution role.
none (dict) --
No authentication.
oauth (dict) --
OAuth 2.0 authentication via AgentCore Identity.
providerArn (string) -- [REQUIRED]
The ARN of the OAuth 2.0 credential provider in AgentCore Identity.
scopes (list) -- [REQUIRED]
The OAuth 2.0 scopes to request when obtaining an access token.
(string) --
customParameters (dict) --
Additional custom parameters to include in the OAuth 2.0 token request.
(string) --
(string) --
grantType (string) --
The OAuth 2.0 grant type to use for authentication.
defaultReturnUrl (string) --
The default return URL for the OAuth 2.0 authorization flow.
inlineFunction (dict) --
Configuration for an inline function tool.
description (string) -- [REQUIRED]
Description of what the tool does, provided to the model.
inputSchema (:ref:`document<document>`) -- [REQUIRED]
JSON Schema describing the tool's input parameters.
agentCoreCodeInterpreter (dict) --
Configuration for AgentCore Code Interpreter.
codeInterpreterArn (string) --
If not populated, the built-in Code Interpreter ARN is used.
list
The skills available to the agent for this invocation. If specified, overrides the harness default.
(dict) --
A skill available to the agent.
path (string) --
The filesystem path to the skill definition.
s3 (dict) --
An S3 source containing the skill.
uri (string) -- [REQUIRED]
The S3 URI pointing to the skill directory (e.g., s3://bucket/skills/my-skill/).
git (dict) --
A git repository containing the skill.
url (string) -- [REQUIRED]
The HTTPS URL of the git repository.
path (string) --
Subdirectory within the repository containing the skill.
auth (dict) --
Authentication configuration for private repositories.
credentialArn (string) -- [REQUIRED]
The ARN of the credential in AgentCore Identity containing the password or personal access token.
username (string) --
Username for authentication. Defaults to 'oauth2' if not specified.
awsSkills (dict) --
AWS Skills baked into the Harness's underlying Runtime.
paths (list) --
Optionally filter allowed skills with glob syntax, e.g., ['core-skills/*'].
(string) --
list
The tools that the agent is allowed to use for this invocation. If specified, overrides the harness default.
(string) --
integer
The maximum number of iterations the agent loop can execute. If specified, overrides the harness default.
integer
The maximum number of tokens the agent can generate per iteration. If specified, overrides the harness default.
integer
The maximum duration in seconds for the agent loop execution. If specified, overrides the harness default.
string
The actor ID for memory operations. Overrides the actor ID configured on the harness.
dict
The response of this operation contains an :class:`.EventStream` member. When iterated the :class:`.EventStream` will yield events based on the structure below, where only one of the top level keys will be present for any given event.
Response Syntax
{
'stream': EventStream({
'messageStart': {
'role': 'user'|'assistant'
},
'contentBlockStart': {
'contentBlockIndex': 123,
'start': {
'toolUse': {
'toolUseId': 'string',
'name': 'string',
'type': 'tool_use'|'server_tool_use'|'mcp_tool_use',
'serverName': 'string'
},
'toolResult': {
'toolUseId': 'string',
'status': 'success'|'error'
}
}
},
'contentBlockDelta': {
'contentBlockIndex': 123,
'delta': {
'text': 'string',
'toolUse': {
'input': 'string'
},
'toolResult': [
{
'text': 'string',
'json': {...}|[...]|123|123.4|'string'|True|None
},
],
'reasoningContent': {
'text': 'string',
'redactedContent': b'bytes',
'signature': 'string'
}
}
},
'contentBlockStop': {
'contentBlockIndex': 123
},
'messageStop': {
'stopReason': 'end_turn'|'tool_use'|'tool_result'|'max_tokens'|'stop_sequence'|'content_filtered'|'malformed_model_output'|'malformed_tool_use'|'interrupted'|'partial_turn'|'model_context_window_exceeded'|'max_iterations_exceeded'|'max_output_tokens_exceeded'|'timeout_exceeded'
},
'metadata': {
'usage': {
'inputTokens': 123,
'outputTokens': 123,
'totalTokens': 123,
'cacheReadInputTokens': 123,
'cacheWriteInputTokens': 123
},
'metrics': {
'latencyMs': 123
}
},
'internalServerException': {
'message': 'string'
},
'validationException': {
'message': 'string',
'reason': 'CannotParse'|'FieldValidationFailed'|'IdempotentParameterMismatchException'|'EventInOtherSession'|'ResourceConflict',
'fieldList': [
{
'name': 'string',
'message': 'string'
},
]
},
'runtimeClientError': {
'message': 'string'
}
})
}
Response Structure
(dict) --
stream (:class:`.EventStream`) --
The streaming output from the harness invocation.
messageStart (dict) --
Indicates the start of a new message from the agent.
role (string) --
The role of the message sender.
contentBlockStart (dict) --
Indicates the start of a new content block.
contentBlockIndex (integer) --
The index of the content block within the message.
start (dict) --
The content block start payload.
toolUse (dict) --
Start of a tool use content block.
toolUseId (string) --
The unique ID of this tool use.
name (string) --
The name of the tool being called.
type (string) --
The type of tool use.
serverName (string) --
The name of the MCP server providing this tool.
toolResult (dict) --
Start of a tool result content block.
toolUseId (string) --
The tool use ID that this result corresponds to.
status (string) --
The status of the tool execution.
contentBlockDelta (dict) --
A delta update to the current content block.
contentBlockIndex (integer) --
The index of the content block being updated.
delta (dict) --
The delta payload.
text (string) --
A text delta.
toolUse (dict) --
A tool use input delta.
input (string) --
The partial JSON input for the tool call.
toolResult (list) --
A tool result delta.
(dict) --
A delta update to a tool result content block.
text (string) --
A text tool result delta.
json (:ref:`document<document>`) --
A JSON tool result delta.
reasoningContent (dict) --
A reasoning content delta.
text (string) --
Reasoning text delta.
redactedContent (bytes) --
Redacted reasoning content.
signature (string) --
Signature for the reasoning content.
contentBlockStop (dict) --
Indicates the end of the current content block.
contentBlockIndex (integer) --
The index of the content block that ended.
messageStop (dict) --
Indicates the end of the current message.
stopReason (string) --
The reason the agent stopped generating.
metadata (dict) --
Token usage and latency metrics for the invocation.
usage (dict) --
Token usage counts.
inputTokens (integer) --
The number of input tokens consumed.
outputTokens (integer) --
The number of output tokens generated.
totalTokens (integer) --
The total number of tokens consumed.
cacheReadInputTokens (integer) --
The number of input tokens read from cache.
cacheWriteInputTokens (integer) --
The number of input tokens written to cache.
metrics (dict) --
Latency metrics.
latencyMs (integer) --
The end-to-end latency of the invocation in milliseconds.
internalServerException (dict) --
The exception that occurs when the service encounters an unexpected internal error. This is a temporary condition that will resolve itself with retries. We recommend implementing exponential backoff retry logic in your application.
message (string) --
validationException (dict) --
The exception that occurs when the input fails to satisfy the constraints specified by the service. Check the error message for details about which input parameter is invalid and correct your request.
message (string) --
reason (string) --
fieldList (list) --
(dict) --
Stores information about a field passed inside a request that resulted in an exception.
name (string) --
The name of the field.
message (string) --
A message describing why this field failed validation.
runtimeClientError (dict) --
An error returned by the runtime container during agent execution.
message (string) --
{'dataSourceConfig': {'onlineEvaluationConfigSource': {'timeRange': {'endTime': 'timestamp',
'startTime': 'timestamp'}}}}
Starts a batch evaluation job that evaluates agent performance across multiple sessions. Batch evaluations pull agent traces from CloudWatch Logs or an existing online evaluation configuration and run specified evaluators and insights against them.
See also: AWS API Documentation
Request Syntax
client.start_batch_evaluation(
batchEvaluationName='string',
evaluators=[
{
'evaluatorId': 'string'
},
],
insights=[
{
'insightId': 'string'
},
],
dataSourceConfig={
'cloudWatchLogs': {
'serviceNames': [
'string',
],
'logGroupNames': [
'string',
],
'filterConfig': {
'sessionIds': [
'string',
],
'timeRange': {
'startTime': datetime(2015, 1, 1),
'endTime': datetime(2015, 1, 1)
}
}
},
'onlineEvaluationConfigSource': {
'onlineEvaluationConfigArn': 'string',
'timeRange': {
'startTime': datetime(2015, 1, 1),
'endTime': datetime(2015, 1, 1)
}
}
},
clientToken='string',
evaluationMetadata={
'sessionMetadata': [
{
'sessionId': 'string',
'testScenarioId': 'string',
'groundTruth': {
'inline': {
'assertions': [
{
'text': 'string'
},
],
'expectedTrajectory': {
'toolNames': [
'string',
]
},
'turns': [
{
'input': {
'prompt': 'string'
},
'expectedResponse': {
'text': 'string'
}
},
]
}
},
'metadata': {
'string': 'string'
}
},
]
},
tags={
'string': 'string'
},
kmsKeyArn='string',
description='string'
)
string
[REQUIRED]
The name of the batch evaluation. Must be unique within your account.
list
The list of evaluators to apply during the batch evaluation. Can include both built-in evaluators and custom evaluators. Maximum of 10 evaluators.
(dict) --
An evaluator to run against sessions during batch evaluation.
evaluatorId (string) -- [REQUIRED]
The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
list
The list of insight analyses to run against sessions during the batch evaluation. Maximum of 10 insights.
(dict) --
A reference to an insight analysis to run against sessions during batch evaluation. Insights provide deeper analysis beyond individual evaluator scores, including failure detection, user intent clustering, and execution summarization.
insightId (string) -- [REQUIRED]
The unique identifier of the insight to run.
dict
[REQUIRED]
The data source configuration that specifies where to pull agent session traces from for evaluation.
cloudWatchLogs (dict) --
Configuration for pulling agent session traces from CloudWatch Logs.
serviceNames (list) -- [REQUIRED]
The list of agent service names to filter traces within the specified log groups.
(string) --
logGroupNames (list) -- [REQUIRED]
The list of CloudWatch log group names to read agent traces from. Maximum of 5 log groups.
(string) --
filterConfig (dict) --
Optional filter configuration to narrow down which sessions to evaluate.
sessionIds (list) --
A list of specific session IDs to evaluate. If specified, only these sessions are included in the evaluation.
(string) --
timeRange (dict) --
The time range filter for selecting sessions to evaluate.
startTime (datetime) --
The start time of the time range. Only sessions with activity at or after this timestamp are included.
endTime (datetime) --
The end time of the time range. Only sessions with activity before this timestamp are included.
onlineEvaluationConfigSource (dict) --
Reference an existing OnlineEvaluationConfig as session source
onlineEvaluationConfigArn (string) -- [REQUIRED]
The Amazon Resource Name (ARN) of the online evaluation configuration to use as the session source.
timeRange (dict) --
Optional session filter configuration to narrow down which sessions from the online evaluation configuration to include.
startTime (datetime) --
The start time of the time range. Only sessions with activity at or after this timestamp are included.
endTime (datetime) --
The end time of the time range. Only sessions with activity before this timestamp are included.
string
A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If this token matches a previous request, the service ignores the request, but does not return an error.
This field is autopopulated if not provided.
dict
Optional metadata for the evaluation, including session-specific ground truth data and test scenario identifiers.
sessionMetadata (list) --
A list of session metadata entries containing ground truth data and test scenario identifiers for specific sessions.
(dict) --
Metadata for a specific session in a batch evaluation, including ground truth data and test scenario identifiers.
sessionId (string) -- [REQUIRED]
The unique identifier of the session this metadata applies to.
testScenarioId (string) --
An optional test scenario identifier for categorizing and tracking evaluation results.
groundTruth (dict) --
The ground truth data for this session, including expected responses and assertions.
inline (dict) --
Inline ground truth data provided directly in the request.
assertions (list) --
Assertions for evaluation, reuses common model EvaluationContentList.
(dict) --
A content block for ground truth data in evaluation reference inputs. Supports text content for expected responses and assertions.
text (string) --
The text content of the ground truth data. Used for expected response text and assertion statements.
expectedTrajectory (dict) --
The expected tool call sequence for trajectory evaluation.
toolNames (list) --
The list of tool names representing the expected tool call sequence.
(string) --
turns (list) --
A list of per-turn ground truth data, each containing an input prompt and expected response.
(dict) --
Ground truth data for a single conversation turn.
input (dict) --
The input for this conversation turn.
prompt (string) --
The text prompt for this conversation turn.
expectedResponse (dict) --
The expected response for this conversation turn.
text (string) --
The text content of the ground truth data. Used for expected response text and assertion statements.
metadata (dict) --
Additional key-value metadata associated with this session.
(string) --
(string) --
dict
A map of tag keys and values to associate with the batch evaluation.
(string) --
(string) --
string
The ARN of the KMS key used to encrypt evaluation data. If provided, customer data is encrypted at rest with the specified key.
string
The description of the batch evaluation.
dict
Response Syntax
{
'batchEvaluationId': 'string',
'batchEvaluationArn': 'string',
'batchEvaluationName': 'string',
'evaluators': [
{
'evaluatorId': 'string'
},
],
'insights': [
{
'insightId': 'string'
},
],
'status': 'PENDING'|'IN_PROGRESS'|'COMPLETED'|'COMPLETED_WITH_ERRORS'|'FAILED'|'STOPPING'|'STOPPED'|'DELETING',
'createdAt': datetime(2015, 1, 1),
'outputConfig': {
'cloudWatchConfig': {
'logGroupName': 'string',
'logStreamName': 'string'
}
},
'tags': {
'string': 'string'
},
'kmsKeyArn': 'string',
'description': 'string'
}
Response Structure
(dict) --
batchEvaluationId (string) --
The unique identifier of the created batch evaluation.
batchEvaluationArn (string) --
The Amazon Resource Name (ARN) of the created batch evaluation.
batchEvaluationName (string) --
The name of the batch evaluation.
evaluators (list) --
The list of evaluators applied during the batch evaluation.
(dict) --
An evaluator to run against sessions during batch evaluation.
evaluatorId (string) --
The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
insights (list) --
The list of insight analyses applied during the batch evaluation.
(dict) --
A reference to an insight analysis to run against sessions during batch evaluation. Insights provide deeper analysis beyond individual evaluator scores, including failure detection, user intent clustering, and execution summarization.
insightId (string) --
The unique identifier of the insight to run.
status (string) --
The status of the batch evaluation.
createdAt (datetime) --
The timestamp when the batch evaluation was created.
outputConfig (dict) --
The output configuration specifying where evaluation results are written.
cloudWatchConfig (dict) --
The CloudWatch Logs configuration for writing evaluation results.
logGroupName (string) --
The name of the CloudWatch log group where evaluation results will be written.
logStreamName (string) --
The name of the CloudWatch log stream where evaluation results will be written.
tags (dict) --
The tags associated with the batch evaluation.
(string) --
(string) --
kmsKeyArn (string) --
The ARN of the KMS key used to encrypt evaluation data.
description (string) --
The description of the batch evaluation.