Agents for Amazon Bedrock

2024/12/02 - Agents for Amazon Bedrock - 4 new, 3 updated api methods

Changes  Add support for Knowledge Base Evaluations & LLM as a judge

ListKnowledgeBaseDocuments (new)

Retrieves all the documents contained in a data source that is connected to a knowledge base. For more information, see Ingest documents into a knowledge base in real-time in the Amazon Bedrock User Guide.

See also: AWS API Documentation

Request Syntax

client.list_knowledge_base_documents(
    dataSourceId='string',
    knowledgeBaseId='string',
    maxResults=123,
    nextToken='string'
)
type dataSourceId:

string

param dataSourceId:

[REQUIRED]

The unique identifier of the data source that contains the documents.

type knowledgeBaseId:

string

param knowledgeBaseId:

[REQUIRED]

The unique identifier of the knowledge base that is connected to the data source.

type maxResults:

integer

param maxResults:

The maximum number of results to return in the response. If the total number of results is greater than this value, use the token returned in the response in the nextToken field when making another request to return the next batch of results.

type nextToken:

string

param nextToken:

If the total number of results is greater than the maxResults value provided in the request, enter the token returned in the nextToken field in the response in this field to return the next batch of results.

rtype:

dict

returns:

Response Syntax

{
    'documentDetails': [
        {
            'dataSourceId': 'string',
            'identifier': {
                'custom': {
                    'id': 'string'
                },
                'dataSourceType': 'CUSTOM'|'S3',
                's3': {
                    'uri': 'string'
                }
            },
            'knowledgeBaseId': 'string',
            'status': 'INDEXED'|'PARTIALLY_INDEXED'|'PENDING'|'FAILED'|'METADATA_PARTIALLY_INDEXED'|'METADATA_UPDATE_FAILED'|'IGNORED'|'NOT_FOUND'|'STARTING'|'IN_PROGRESS'|'DELETING'|'DELETE_IN_PROGRESS',
            'statusReason': 'string',
            'updatedAt': datetime(2015, 1, 1)
        },
    ],
    'nextToken': 'string'
}

Response Structure

  • (dict) --

    • documentDetails (list) --

      A list of objects, each of which contains information about the documents that were retrieved.

      • (dict) --

        Contains the details for a document that was ingested or deleted.

        • dataSourceId (string) --

          The identifier of the data source connected to the knowledge base that the document was ingested into or deleted from.

        • identifier (dict) --

          Contains information that identifies the document.

          • custom (dict) --

            Contains information that identifies the document in a custom data source.

            • id (string) --

              The identifier of the document to ingest into a custom data source.

          • dataSourceType (string) --

            The type of data source connected to the knowledge base that contains the document.

          • s3 (dict) --

            Contains information that identifies the document in an S3 data source.

            • uri (string) --

              The location's URI. For example, s3://my-bucket/chunk-processor/.

        • knowledgeBaseId (string) --

          The identifier of the knowledge base that the document was ingested into or deleted from.

        • status (string) --

          The ingestion status of the document. The following statuses are possible:

          • STARTING – You submitted the ingestion job containing the document.

          • PENDING – The document is waiting to be ingested.

          • IN_PROGRESS – The document is being ingested.

          • INDEXED – The document was successfully indexed.

          • PARTIALLY_INDEXED – The document was partially indexed.

          • METADATA_PARTIALLY_INDEXED – You submitted metadata for an existing document and it was partially indexed.

          • METADATA_UPDATE_FAILED – You submitted a metadata update for an existing document but it failed.

          • FAILED – The document failed to be ingested.

          • NOT_FOUND – The document wasn't found.

          • IGNORED – The document was ignored during ingestion.

          • DELETING – You submitted the delete job containing the document.

          • DELETE_IN_PROGRESS – The document is being deleted.

        • statusReason (string) --

          The reason for the status. Appears alongside the status IGNORED.

        • updatedAt (datetime) --

          The date and time at which the document was last updated.

    • nextToken (string) --

      If the total number of results is greater than the maxResults value provided in the request, use this token when making another request in the nextToken field to return the next batch of results.
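
The nextToken handling described above can be sketched as a small generator. This is a minimal illustration rather than the SDK's own pagination helper: the function name iter_knowledge_base_documents and its page_size argument are hypothetical, and client is assumed to be an object exposing list_knowledge_base_documents with the request syntax shown in this section.

```python
def iter_knowledge_base_documents(client, knowledge_base_id, data_source_id, page_size=50):
    """Yield every documentDetails entry, following nextToken across pages.

    `client` is assumed to expose list_knowledge_base_documents as documented
    above; the function name and `page_size` are illustrative only.
    """
    kwargs = {
        "knowledgeBaseId": knowledge_base_id,
        "dataSourceId": data_source_id,
        "maxResults": page_size,
    }
    while True:
        response = client.list_knowledge_base_documents(**kwargs)
        # Each page carries a (possibly empty) list of document details.
        yield from response.get("documentDetails", [])
        token = response.get("nextToken")
        if not token:
            # No token in the response means this was the last page.
            break
        # Pass the returned token back to request the next batch.
        kwargs["nextToken"] = token
```

With boto3, the client would typically be created with boto3.client("bedrock-agent") and passed in as the first argument; the knowledge base and data source identifiers are your own.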

GetKnowledgeBaseDocuments (new)

Retrieves specific documents from a data source that is connected to a knowledge base. For more information, see Ingest documents into a knowledge base in real-time in the Amazon Bedrock User Guide.

See also: AWS API Documentation

Request Syntax

client.get_knowledge_base_documents(
    dataSourceId='string',
    documentIdentifiers=[
        {
            'custom': {
                'id': 'string'
            },
            'dataSourceType': 'CUSTOM'|'S3',
            's3': {
                'uri': 'string'
            }
        },
    ],
    knowledgeBaseId='string'
)
type dataSourceId:

string

param dataSourceId:

[REQUIRED]

The unique identifier of the data source that contains the documents.

type documentIdentifiers:

list

param documentIdentifiers:

[REQUIRED]

A list of objects, each of which contains information to identify a document for which to retrieve information.

  • (dict) --

    Contains information that identifies the document.

    • custom (dict) --

      Contains information that identifies the document in a custom data source.

      • id (string) -- [REQUIRED]

        The identifier of the document to ingest into a custom data source.

    • dataSourceType (string) -- [REQUIRED]

      The type of data source connected to the knowledge base that contains the document.

    • s3 (dict) --

      Contains information that identifies the document in an S3 data source.

      • uri (string) -- [REQUIRED]

        The location's URI. For example, s3://my-bucket/chunk-processor/.

type knowledgeBaseId:

string

param knowledgeBaseId:

[REQUIRED]

The unique identifier of the knowledge base that is connected to the data source.

rtype:

dict

returns:

Response Syntax

{
    'documentDetails': [
        {
            'dataSourceId': 'string',
            'identifier': {
                'custom': {
                    'id': 'string'
                },
                'dataSourceType': 'CUSTOM'|'S3',
                's3': {
                    'uri': 'string'
                }
            },
            'knowledgeBaseId': 'string',
            'status': 'INDEXED'|'PARTIALLY_INDEXED'|'PENDING'|'FAILED'|'METADATA_PARTIALLY_INDEXED'|'METADATA_UPDATE_FAILED'|'IGNORED'|'NOT_FOUND'|'STARTING'|'IN_PROGRESS'|'DELETING'|'DELETE_IN_PROGRESS',
            'statusReason': 'string',
            'updatedAt': datetime(2015, 1, 1)
        },
    ]
}

Response Structure

  • (dict) --

    • documentDetails (list) --

      A list of objects, each of which contains information about the documents that were retrieved.

      • (dict) --

        Contains the details for a document that was ingested or deleted.

        • dataSourceId (string) --

          The identifier of the data source connected to the knowledge base that the document was ingested into or deleted from.

        • identifier (dict) --

          Contains information that identifies the document.

          • custom (dict) --

            Contains information that identifies the document in a custom data source.

            • id (string) --

              The identifier of the document to ingest into a custom data source.

          • dataSourceType (string) --

            The type of data source connected to the knowledge base that contains the document.

          • s3 (dict) --

            Contains information that identifies the document in an S3 data source.

            • uri (string) --

              The location's URI. For example, s3://my-bucket/chunk-processor/.

        • knowledgeBaseId (string) --

          The identifier of the knowledge base that the document was ingested into or deleted from.

        • status (string) --

          The ingestion status of the document. The following statuses are possible:

          • STARTING – You submitted the ingestion job containing the document.

          • PENDING – The document is waiting to be ingested.

          • IN_PROGRESS – The document is being ingested.

          • INDEXED – The document was successfully indexed.

          • PARTIALLY_INDEXED – The document was partially indexed.

          • METADATA_PARTIALLY_INDEXED – You submitted metadata for an existing document and it was partially indexed.

          • METADATA_UPDATE_FAILED – You submitted a metadata update for an existing document but it failed.

          • FAILED – The document failed to be ingested.

          • NOT_FOUND – The document wasn't found.

          • IGNORED – The document was ignored during ingestion.

          • DELETING – You submitted the delete job containing the document.

          • DELETE_IN_PROGRESS – The document is being deleted.

        • statusReason (string) --

          The reason for the status. Appears alongside the status IGNORED.

        • updatedAt (datetime) --

          The date and time at which the document was last updated.
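
The documentIdentifiers shape can be easier to read when built by small helpers. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that each identifier carries only the sub-structure matching its dataSourceType (s3 for S3, custom for CUSTOM).

```python
def s3_document_identifier(uri):
    """Identifier for a document in an S3 data source (hypothetical helper)."""
    return {"dataSourceType": "S3", "s3": {"uri": uri}}


def custom_document_identifier(document_id):
    """Identifier for a document in a custom data source (hypothetical helper)."""
    return {"dataSourceType": "CUSTOM", "custom": {"id": document_id}}


def get_document_statuses(client, knowledge_base_id, data_source_id, identifiers):
    """Return (identifier, status) pairs for the requested documents.

    `client` is assumed to expose get_knowledge_base_documents as documented above.
    """
    response = client.get_knowledge_base_documents(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        documentIdentifiers=list(identifiers),
    )
    return [(d["identifier"], d["status"]) for d in response["documentDetails"]]
```

A call might pass [custom_document_identifier("doc-1")] for a custom data source, where "doc-1" stands in for an identifier you chose at ingestion time.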

DeleteKnowledgeBaseDocuments (new)

Deletes documents from a data source and syncs the changes to the knowledge base that is connected to it. For more information, see Ingest documents into a knowledge base in real-time in the Amazon Bedrock User Guide.

See also: AWS API Documentation

Request Syntax

client.delete_knowledge_base_documents(
    clientToken='string',
    dataSourceId='string',
    documentIdentifiers=[
        {
            'custom': {
                'id': 'string'
            },
            'dataSourceType': 'CUSTOM'|'S3',
            's3': {
                'uri': 'string'
            }
        },
    ],
    knowledgeBaseId='string'
)
type clientToken:

string

param clientToken:

A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If this token matches a previous request, Amazon Bedrock ignores the request, but does not return an error. For more information, see Ensuring idempotency.

This field is autopopulated if not provided.

type dataSourceId:

string

param dataSourceId:

[REQUIRED]

The unique identifier of the data source that contains the documents.

type documentIdentifiers:

list

param documentIdentifiers:

[REQUIRED]

A list of objects, each of which contains information to identify a document to delete.

  • (dict) --

    Contains information that identifies the document.

    • custom (dict) --

      Contains information that identifies the document in a custom data source.

      • id (string) -- [REQUIRED]

        The identifier of the document to ingest into a custom data source.

    • dataSourceType (string) -- [REQUIRED]

      The type of data source connected to the knowledge base that contains the document.

    • s3 (dict) --

      Contains information that identifies the document in an S3 data source.

      • uri (string) -- [REQUIRED]

        The location's URI. For example, s3://my-bucket/chunk-processor/.

type knowledgeBaseId:

string

param knowledgeBaseId:

[REQUIRED]

The unique identifier of the knowledge base that is connected to the data source.

rtype:

dict

returns:

Response Syntax

{
    'documentDetails': [
        {
            'dataSourceId': 'string',
            'identifier': {
                'custom': {
                    'id': 'string'
                },
                'dataSourceType': 'CUSTOM'|'S3',
                's3': {
                    'uri': 'string'
                }
            },
            'knowledgeBaseId': 'string',
            'status': 'INDEXED'|'PARTIALLY_INDEXED'|'PENDING'|'FAILED'|'METADATA_PARTIALLY_INDEXED'|'METADATA_UPDATE_FAILED'|'IGNORED'|'NOT_FOUND'|'STARTING'|'IN_PROGRESS'|'DELETING'|'DELETE_IN_PROGRESS',
            'statusReason': 'string',
            'updatedAt': datetime(2015, 1, 1)
        },
    ]
}

Response Structure

  • (dict) --

    • documentDetails (list) --

      A list of objects, each of which contains information about the documents that were deleted.

      • (dict) --

        Contains the details for a document that was ingested or deleted.

        • dataSourceId (string) --

          The identifier of the data source connected to the knowledge base that the document was ingested into or deleted from.

        • identifier (dict) --

          Contains information that identifies the document.

          • custom (dict) --

            Contains information that identifies the document in a custom data source.

            • id (string) --

              The identifier of the document to ingest into a custom data source.

          • dataSourceType (string) --

            The type of data source connected to the knowledge base that contains the document.

          • s3 (dict) --

            Contains information that identifies the document in an S3 data source.

            • uri (string) --

              The location's URI. For example, s3://my-bucket/chunk-processor/.

        • knowledgeBaseId (string) --

          The identifier of the knowledge base that the document was ingested into or deleted from.

        • status (string) --

          The ingestion status of the document. The following statuses are possible:

          • STARTING – You submitted the ingestion job containing the document.

          • PENDING – The document is waiting to be ingested.

          • IN_PROGRESS – The document is being ingested.

          • INDEXED – The document was successfully indexed.

          • PARTIALLY_INDEXED – The document was partially indexed.

          • METADATA_PARTIALLY_INDEXED – You submitted metadata for an existing document and it was partially indexed.

          • METADATA_UPDATE_FAILED – You submitted a metadata update for an existing document but it failed.

          • FAILED – The document failed to be ingested.

          • NOT_FOUND – The document wasn't found.

          • IGNORED – The document was ignored during ingestion.

          • DELETING – You submitted the delete job containing the document.

          • DELETE_IN_PROGRESS – The document is being deleted.

        • statusReason (string) --

          The reason for the status. Appears alongside the status IGNORED.

        • updatedAt (datetime) --

          The date and time at which the document was last updated.
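
Because clientToken makes the request idempotent, a caller that may retry can generate the token once and reuse it. The following is a rough sketch under that assumption: delete_documents is a hypothetical wrapper, a UUID string is used as the token purely for illustration, and client is assumed to expose delete_knowledge_base_documents as documented above.

```python
import uuid
from collections import Counter


def delete_documents(client, knowledge_base_id, data_source_id, identifiers, client_token=None):
    """Submit a delete request and summarize the returned statuses.

    Returns the token that was sent (so a retry can reuse it) and a Counter
    of the `status` values in the response. The wrapper name and return
    shape are illustrative, not part of the API.
    """
    # Reuse the caller's token on retries; otherwise generate a fresh one.
    token = client_token or str(uuid.uuid4())
    response = client.delete_knowledge_base_documents(
        clientToken=token,
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        documentIdentifiers=list(identifiers),
    )
    statuses = Counter(d["status"] for d in response["documentDetails"])
    return token, statuses
```

Keeping the returned token and passing it back as client_token on a retry means a repeated submission should be treated as the same request instead of a second delete.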

IngestKnowledgeBaseDocuments (new)

Ingests documents directly into the knowledge base that is connected to the data source. The dataSourceType specified in the content for each document must match the type of the data source that you specify in the header. For more information, see Ingest documents into a knowledge base in real-time in the Amazon Bedrock User Guide.

See also: AWS API Documentation

Request Syntax

client.ingest_knowledge_base_documents(
    clientToken='string',
    dataSourceId='string',
    documents=[
        {
            'content': {
                'custom': {
                    'customDocumentIdentifier': {
                        'id': 'string'
                    },
                    'inlineContent': {
                        'byteContent': {
                            'data': b'bytes',
                            'mimeType': 'string'
                        },
                        'textContent': {
                            'data': 'string'
                        },
                        'type': 'BYTE'|'TEXT'
                    },
                    's3Location': {
                        'bucketOwnerAccountId': 'string',
                        'uri': 'string'
                    },
                    'sourceType': 'IN_LINE'|'S3_LOCATION'
                },
                'dataSourceType': 'CUSTOM'|'S3',
                's3': {
                    's3Location': {
                        'uri': 'string'
                    }
                }
            },
            'metadata': {
                'inlineAttributes': [
                    {
                        'key': 'string',
                        'value': {
                            'booleanValue': True|False,
                            'numberValue': 123.0,
                            'stringListValue': [
                                'string',
                            ],
                            'stringValue': 'string',
                            'type': 'BOOLEAN'|'NUMBER'|'STRING'|'STRING_LIST'
                        }
                    },
                ],
                's3Location': {
                    'bucketOwnerAccountId': 'string',
                    'uri': 'string'
                },
                'type': 'IN_LINE_ATTRIBUTE'|'S3_LOCATION'
            }
        },
    ],
    knowledgeBaseId='string'
)
type clientToken:

string

param clientToken:

A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If this token matches a previous request, Amazon Bedrock ignores the request, but does not return an error. For more information, see Ensuring idempotency.

This field is autopopulated if not provided.

type dataSourceId:

string

param dataSourceId:

[REQUIRED]

The unique identifier of the data source connected to the knowledge base that you're adding documents to.

type documents:

list

param documents:

[REQUIRED]

A list of objects, each of which contains information about the documents to add.

  • (dict) --

    Contains information about a document to ingest into a knowledge base and metadata to associate with it.

    • content (dict) -- [REQUIRED]

      Contains the content of the document.

      • custom (dict) --

        Contains information about the content to ingest into a knowledge base connected to a custom data source.

        • customDocumentIdentifier (dict) -- [REQUIRED]

          A unique identifier for the document.

          • id (string) -- [REQUIRED]

            The identifier of the document to ingest into a custom data source.

        • inlineContent (dict) --

          Contains information about content defined inline to ingest into a knowledge base.

          • byteContent (dict) --

            Contains information about content defined inline in bytes.

            • data (bytes) -- [REQUIRED]

              The base64-encoded string of the content.

            • mimeType (string) -- [REQUIRED]

              The MIME type of the content. For a list of MIME types, see Media Types. The following MIME types are supported:

              • text/plain

              • text/html

              • text/csv

              • text/vtt

              • message/rfc822

              • application/xhtml+xml

              • application/pdf

              • application/msword

              • application/vnd.ms-word.document.macroenabled.12

              • application/vnd.ms-word.template.macroenabled.12

              • application/vnd.ms-excel

              • application/vnd.ms-excel.addin.macroenabled.12

              • application/vnd.ms-excel.sheet.macroenabled.12

              • application/vnd.ms-excel.template.macroenabled.12

              • application/vnd.ms-excel.sheet.binary.macroenabled.12

              • application/vnd.ms-spreadsheetml

              • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

              • application/vnd.openxmlformats-officedocument.spreadsheetml.template

              • application/vnd.openxmlformats-officedocument.wordprocessingml.document

              • application/vnd.openxmlformats-officedocument.wordprocessingml.template

          • textContent (dict) --

            Contains information about content defined inline in text.

            • data (string) -- [REQUIRED]

              The text of the content.

          • type (string) -- [REQUIRED]

            The type of inline content to define.

        • s3Location (dict) --

          Contains information about the Amazon S3 location of the file from which to ingest data.

          • bucketOwnerAccountId (string) --

            The identifier of the Amazon Web Services account that owns the S3 bucket containing the content to ingest.

          • uri (string) -- [REQUIRED]

            The S3 URI of the file containing the content to ingest.

        • sourceType (string) -- [REQUIRED]

          The source of the data to ingest.

      • dataSourceType (string) -- [REQUIRED]

        The type of data source that is connected to the knowledge base to which to ingest this document.

      • s3 (dict) --

        Contains information about the content to ingest into a knowledge base connected to an Amazon S3 data source.

        • s3Location (dict) -- [REQUIRED]

          The S3 location of the file containing the content to ingest.

          • uri (string) -- [REQUIRED]

            The location's URI. For example, s3://my-bucket/chunk-processor/.

    • metadata (dict) --

      Contains the metadata to associate with the document.

      • inlineAttributes (list) --

        An array of objects, each of which defines a metadata attribute to associate with the content to ingest. You define the attributes inline.

        • (dict) --

          Contains information about a metadata attribute.

          • key (string) -- [REQUIRED]

            The key of the metadata attribute.

          • value (dict) -- [REQUIRED]

            Contains the value of the metadata attribute.

            • booleanValue (boolean) --

              The value of the Boolean metadata attribute.

            • numberValue (float) --

              The value of the numeric metadata attribute.

            • stringListValue (list) --

              An array of strings that define the value of the metadata attribute.

              • (string) --

            • stringValue (string) --

              The value of the string metadata attribute.

            • type (string) -- [REQUIRED]

              The type of the metadata attribute.

      • s3Location (dict) --

        The Amazon S3 location of the file containing metadata to associate with the content to ingest.

        • bucketOwnerAccountId (string) --

          The identifier of the Amazon Web Services account that owns the S3 bucket containing the content to ingest.

        • uri (string) -- [REQUIRED]

          The S3 URI of the file containing the content to ingest.

      • type (string) -- [REQUIRED]

        The type of the source from which to add metadata.

type knowledgeBaseId:

string

param knowledgeBaseId:

[REQUIRED]

The unique identifier of the knowledge base to ingest the documents into.

rtype:

dict

returns:

Response Syntax

{
    'documentDetails': [
        {
            'dataSourceId': 'string',
            'identifier': {
                'custom': {
                    'id': 'string'
                },
                'dataSourceType': 'CUSTOM'|'S3',
                's3': {
                    'uri': 'string'
                }
            },
            'knowledgeBaseId': 'string',
            'status': 'INDEXED'|'PARTIALLY_INDEXED'|'PENDING'|'FAILED'|'METADATA_PARTIALLY_INDEXED'|'METADATA_UPDATE_FAILED'|'IGNORED'|'NOT_FOUND'|'STARTING'|'IN_PROGRESS'|'DELETING'|'DELETE_IN_PROGRESS',
            'statusReason': 'string',
            'updatedAt': datetime(2015, 1, 1)
        },
    ]
}

Response Structure

  • (dict) --

    • documentDetails (list) --

      A list of objects, each of which contains information about the documents that were ingested.

      • (dict) --

        Contains the details for a document that was ingested or deleted.

        • dataSourceId (string) --

          The identifier of the data source connected to the knowledge base that the document was ingested into or deleted from.

        • identifier (dict) --

          Contains information that identifies the document.

          • custom (dict) --

            Contains information that identifies the document in a custom data source.

            • id (string) --

              The identifier of the document to ingest into a custom data source.

          • dataSourceType (string) --

            The type of data source connected to the knowledge base that contains the document.

          • s3 (dict) --

            Contains information that identifies the document in an S3 data source.

            • uri (string) --

              The location's URI. For example, s3://my-bucket/chunk-processor/.

        • knowledgeBaseId (string) --

          The identifier of the knowledge base that the document was ingested into or deleted from.

        • status (string) --

          The ingestion status of the document. The following statuses are possible:

          • STARTING – You submitted the ingestion job containing the document.

          • PENDING – The document is waiting to be ingested.

          • IN_PROGRESS – The document is being ingested.

          • INDEXED – The document was successfully indexed.

          • PARTIALLY_INDEXED – The document was partially indexed.

          • METADATA_PARTIALLY_INDEXED – You submitted metadata for an existing document and it was partially indexed.

          • METADATA_UPDATE_FAILED – You submitted a metadata update for an existing document but it failed.

          • FAILED – The document failed to be ingested.

          • NOT_FOUND – The document wasn't found.

          • IGNORED – The document was ignored during ingestion.

          • DELETING – You submitted the delete job containing the document.

          • DELETE_IN_PROGRESS – The document is being deleted.

        • statusReason (string) --

          The reason for the status. Appears alongside the status IGNORED.

        • updatedAt (datetime) --

          The date and time at which the document was last updated.
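
The nested documents structure is easiest to follow with a concrete case. The sketch below builds one entry for a custom data source with inline text content and inline metadata attributes. The helper names are hypothetical, and the mapping from Python value types to the BOOLEAN, NUMBER, STRING, and STRING_LIST attribute types is an assumption made for illustration.

```python
def metadata_attribute(key, value):
    """Map a Python value onto the inlineAttributes entry shape (illustrative)."""
    # bool must be tested before int/float because bool is a subclass of int.
    if isinstance(value, bool):
        typed = {"type": "BOOLEAN", "booleanValue": value}
    elif isinstance(value, (int, float)):
        typed = {"type": "NUMBER", "numberValue": float(value)}
    elif isinstance(value, str):
        typed = {"type": "STRING", "stringValue": value}
    elif isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value):
        typed = {"type": "STRING_LIST", "stringListValue": list(value)}
    else:
        raise TypeError(f"unsupported metadata value for {key!r}: {value!r}")
    return {"key": key, "value": typed}


def inline_text_document(document_id, text, attributes=None):
    """Build one `documents` entry: custom data source, inline text content."""
    document = {
        "content": {
            "dataSourceType": "CUSTOM",
            "custom": {
                "customDocumentIdentifier": {"id": document_id},
                "sourceType": "IN_LINE",
                "inlineContent": {"type": "TEXT", "textContent": {"data": text}},
            },
        }
    }
    if attributes:
        document["metadata"] = {
            "type": "IN_LINE_ATTRIBUTE",
            "inlineAttributes": [metadata_attribute(k, v) for k, v in attributes.items()],
        }
    return document
```

The resulting dict would be passed in the documents list of client.ingest_knowledge_base_documents, alongside knowledgeBaseId and the dataSourceId of a data source whose type is CUSTOM.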

CreateDataSource (updated)
Changes (request, response)
Request
{'dataSourceConfiguration': {'type': {'CUSTOM'}}}
Response
{'dataSource': {'dataSourceConfiguration': {'type': {'CUSTOM'}}}}

Connects a knowledge base to a data source. You specify the configuration for the specific data source service in the dataSourceConfiguration field.
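
As an illustration of the newly added CUSTOM type, the sketch below assembles keyword arguments for create_data_source. It assumes that a CUSTOM data source needs no service-specific sub-configuration beyond the type field and that name is supplied by the caller; neither point is confirmed by this changelog. The helper name custom_data_source_request is hypothetical.

```python
def custom_data_source_request(knowledge_base_id, name, description=None, data_deletion_policy=None):
    """Assemble create_data_source kwargs for a CUSTOM data source (illustrative).

    Assumption: only `type` is set inside dataSourceConfiguration for CUSTOM.
    """
    request = {
        "knowledgeBaseId": knowledge_base_id,
        "name": name,
        "dataSourceConfiguration": {"type": "CUSTOM"},
    }
    if description is not None:
        request["description"] = description
    if data_deletion_policy is not None:
        # The request syntax lists exactly these two policy values.
        if data_deletion_policy not in ("RETAIN", "DELETE"):
            raise ValueError("data_deletion_policy must be 'RETAIN' or 'DELETE'")
        request["dataDeletionPolicy"] = data_deletion_policy
    return request
```

The returned dict can be unpacked into the call, for example client.create_data_source(**custom_data_source_request(kb_id, "my-custom-source")), where kb_id and the name are placeholders.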

See also: AWS API Documentation

Request Syntax

client.create_data_source(
    clientToken='string',
    dataDeletionPolicy='RETAIN'|'DELETE',
    dataSourceConfiguration={
        'confluenceConfiguration': {
            'crawlerConfiguration': {
                'filterConfiguration': {
                    'patternObjectFilter': {
                        'filters': [
                            {
                                'exclusionFilters': [
                                    'string',
                                ],
                                'inclusionFilters': [
                                    'string',
                                ],
                                'objectType': 'string'
                            },
                        ]
                    },
                    'type': 'PATTERN'
                }
            },
            'sourceConfiguration': {
                'authType': 'BASIC'|'OAUTH2_CLIENT_CREDENTIALS',
                'credentialsSecretArn': 'string',
                'hostType': 'SAAS',
                'hostUrl': 'string'
            }
        },
        's3Configuration': {
            'bucketArn': 'string',
            'bucketOwnerAccountId': 'string',
            'inclusionPrefixes': [
                'string',
            ]
        },
        'salesforceConfiguration': {
            'crawlerConfiguration': {
                'filterConfiguration': {
                    'patternObjectFilter': {
                        'filters': [
                            {
                                'exclusionFilters': [
                                    'string',
                                ],
                                'inclusionFilters': [
                                    'string',
                                ],
                                'objectType': 'string'
                            },
                        ]
                    },
                    'type': 'PATTERN'
                }
            },
            'sourceConfiguration': {
                'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                'credentialsSecretArn': 'string',
                'hostUrl': 'string'
            }
        },
        'sharePointConfiguration': {
            'crawlerConfiguration': {
                'filterConfiguration': {
                    'patternObjectFilter': {
                        'filters': [
                            {
                                'exclusionFilters': [
                                    'string',
                                ],
                                'inclusionFilters': [
                                    'string',
                                ],
                                'objectType': 'string'
                            },
                        ]
                    },
                    'type': 'PATTERN'
                }
            },
            'sourceConfiguration': {
                'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                'credentialsSecretArn': 'string',
                'domain': 'string',
                'hostType': 'ONLINE',
                'siteUrls': [
                    'string',
                ],
                'tenantId': 'string'
            }
        },
        'type': 'S3'|'WEB'|'CONFLUENCE'|'SALESFORCE'|'SHAREPOINT'|'CUSTOM',
        'webConfiguration': {
            'crawlerConfiguration': {
                'crawlerLimits': {
                    'rateLimit': 123
                },
                'exclusionFilters': [
                    'string',
                ],
                'inclusionFilters': [
                    'string',
                ],
                'scope': 'HOST_ONLY'|'SUBDOMAINS'
            },
            'sourceConfiguration': {
                'urlConfiguration': {
                    'seedUrls': [
                        {
                            'url': 'string'
                        },
                    ]
                }
            }
        }
    },
    description='string',
    knowledgeBaseId='string',
    name='string',
    serverSideEncryptionConfiguration={
        'kmsKeyArn': 'string'
    },
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE'|'NONE'|'HIERARCHICAL'|'SEMANTIC',
            'fixedSizeChunkingConfiguration': {
                'maxTokens': 123,
                'overlapPercentage': 123
            },
            'hierarchicalChunkingConfiguration': {
                'levelConfigurations': [
                    {
                        'maxTokens': 123
                    },
                ],
                'overlapTokens': 123
            },
            'semanticChunkingConfiguration': {
                'breakpointPercentileThreshold': 123,
                'bufferSize': 123,
                'maxTokens': 123
            }
        },
        'customTransformationConfiguration': {
            'intermediateStorage': {
                's3Location': {
                    'uri': 'string'
                }
            },
            'transformations': [
                {
                    'stepToApply': 'POST_CHUNKING',
                    'transformationFunction': {
                        'transformationLambdaConfiguration': {
                            'lambdaArn': 'string'
                        }
                    }
                },
            ]
        },
        'parsingConfiguration': {
            'bedrockFoundationModelConfiguration': {
                'modelArn': 'string',
                'parsingPrompt': {
                    'parsingPromptText': 'string'
                }
            },
            'parsingStrategy': 'BEDROCK_FOUNDATION_MODEL'
        }
    }
)
type clientToken:

string

param clientToken:

A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If this token matches a previous request, Amazon Bedrock ignores the request, but does not return an error. For more information, see Ensuring idempotency.

This field is autopopulated if not provided.
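
As a minimal sketch of how the idempotency token can be used, the snippet below generates one token and reuses it for a retry of the same request. The helper names and the placeholder data source name are assumptions for illustration; the code only builds keyword arguments and does not call AWS.

```python
import uuid


def new_client_token() -> str:
    """Return a random, case-sensitive string to use as clientToken."""
    return str(uuid.uuid4())


def with_client_token(params: dict, token: str) -> dict:
    """Return a copy of the request parameters with clientToken set."""
    request = dict(params)
    request["clientToken"] = token
    return request


# Generate the token once, then reuse it for every retry of the same request.
token = new_client_token()
first_attempt = with_client_token({"name": "example-source"}, token)
retry_attempt = with_client_token({"name": "example-source"}, token)
```

Because both attempts carry the same token, a retry after a timeout should be treated as the same request rather than as a second creation.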

type dataDeletionPolicy:

string

param dataDeletionPolicy:

The data deletion policy for the data source.

You can set the data deletion policy to:

  • DELETE: Deletes all data from your data source that’s converted into vector embeddings upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted, only the data. This flag is ignored if an Amazon Web Services account is deleted.

  • RETAIN: Retains all data from your data source that’s converted into vector embeddings upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted if you delete a knowledge base or data source resource.

type dataSourceConfiguration:

dict

param dataSourceConfiguration:

[REQUIRED]

The connection configuration for the data source.

  • confluenceConfiguration (dict) --

    The configuration information to connect to Confluence as your data source.

    • crawlerConfiguration (dict) --

      The configuration of the Confluence content. For example, configuring specific types of Confluence content.

      • filterConfiguration (dict) --

        The configuration of filtering the Confluence content. For example, configuring regular expression patterns to include or exclude certain content.

        • patternObjectFilter (dict) --

          The configuration of filtering certain objects or content types of the data source.

          • filters (list) -- [REQUIRED]

            The configuration of specific filters applied to your data source content. You can filter out or include certain content.

            • (dict) --

              The specific filters applied to your data source content. You can filter out or include certain content.

              • exclusionFilters (list) --

                A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • inclusionFilters (list) --

                A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • objectType (string) -- [REQUIRED]

                The supported object type or content type of the data source.

        • type (string) -- [REQUIRED]

          The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

    • sourceConfiguration (dict) -- [REQUIRED]

      The endpoint information to connect to your Confluence data source.

      • authType (string) -- [REQUIRED]

        The supported authentication type to authenticate and connect to your Confluence instance.

      • credentialsSecretArn (string) -- [REQUIRED]

        The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Confluence instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Confluence connection configuration.

      • hostType (string) -- [REQUIRED]

        The supported host type, whether online/cloud or server/on-premises.

      • hostUrl (string) -- [REQUIRED]

        The Confluence host URL or instance URL.

  • s3Configuration (dict) --

    The configuration information to connect to Amazon S3 as your data source.

    • bucketArn (string) -- [REQUIRED]

      The Amazon Resource Name (ARN) of the S3 bucket that contains your data.

    • bucketOwnerAccountId (string) --

      The account ID for the owner of the S3 bucket.

    • inclusionPrefixes (list) --

      A list of S3 prefixes to include certain files or content. For more information, see Organizing objects using prefixes.

      • (string) --

  • salesforceConfiguration (dict) --

    The configuration information to connect to Salesforce as your data source.

    • crawlerConfiguration (dict) --

      The configuration of the Salesforce content. For example, configuring specific types of Salesforce content.

      • filterConfiguration (dict) --

        The configuration of filtering the Salesforce content. For example, configuring regular expression patterns to include or exclude certain content.

        • patternObjectFilter (dict) --

          The configuration of filtering certain objects or content types of the data source.

          • filters (list) -- [REQUIRED]

            The configuration of specific filters applied to your data source content. You can filter out or include certain content.

            • (dict) --

              The specific filters applied to your data source content. You can filter out or include certain content.

              • exclusionFilters (list) --

                A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • inclusionFilters (list) --

                A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • objectType (string) -- [REQUIRED]

                The supported object type or content type of the data source.

        • type (string) -- [REQUIRED]

          The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

    • sourceConfiguration (dict) -- [REQUIRED]

      The endpoint information to connect to your Salesforce data source.

      • authType (string) -- [REQUIRED]

        The supported authentication type to authenticate and connect to your Salesforce instance.

      • credentialsSecretArn (string) -- [REQUIRED]

        The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Salesforce instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Salesforce connection configuration.

      • hostUrl (string) -- [REQUIRED]

        The Salesforce host URL or instance URL.

  • sharePointConfiguration (dict) --

    The configuration information to connect to SharePoint as your data source.

    • crawlerConfiguration (dict) --

      The configuration of the SharePoint content. For example, configuring specific types of SharePoint content.

      • filterConfiguration (dict) --

        The configuration of filtering the SharePoint content. For example, configuring regular expression patterns to include or exclude certain content.

        • patternObjectFilter (dict) --

          The configuration of filtering certain objects or content types of the data source.

          • filters (list) -- [REQUIRED]

            The configuration of specific filters applied to your data source content. You can filter out or include certain content.

            • (dict) --

              The specific filters applied to your data source content. You can filter out or include certain content.

              • exclusionFilters (list) --

                A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • inclusionFilters (list) --

                A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • objectType (string) -- [REQUIRED]

                The supported object type or content type of the data source.

        • type (string) -- [REQUIRED]

          The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

    • sourceConfiguration (dict) -- [REQUIRED]

      The endpoint information to connect to your SharePoint data source.

      • authType (string) -- [REQUIRED]

        The supported authentication type to authenticate and connect to your SharePoint site/sites.

      • credentialsSecretArn (string) -- [REQUIRED]

        The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your SharePoint site/sites. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see SharePoint connection configuration.

      • domain (string) -- [REQUIRED]

        The domain of your SharePoint instance or site URL/URLs.

      • hostType (string) -- [REQUIRED]

        The supported host type, whether online/cloud or server/on-premises.

      • siteUrls (list) -- [REQUIRED]

        A list of one or more SharePoint site URLs.

        • (string) --

      • tenantId (string) --

        The identifier of your Microsoft 365 tenant.

  • type (string) -- [REQUIRED]

    The type of data source.

  • webConfiguration (dict) --

    The configuration of web URLs to crawl for your data source. You should be authorized to crawl the URLs.

    • crawlerConfiguration (dict) --

      The Web Crawler configuration details for the web data source.

      • crawlerLimits (dict) --

        The configuration of crawl limits for the web URLs.

        • rateLimit (integer) --

          The max rate at which pages are crawled, up to 300 per minute per host.

      • exclusionFilters (list) --

        A list of one or more exclusion regular expression patterns to exclude certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

        • (string) --

      • inclusionFilters (list) --

        A list of one or more inclusion regular expression patterns to include certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

        • (string) --

      • scope (string) --

        The scope of what is crawled for your URLs.

        You can choose to crawl only web pages that belong to the same host or primary domain. For example, only web pages that contain the seed URL "https://docs.aws.amazon.com/bedrock/latest/userguide/" and no other domains. You can choose to include sub domains in addition to the host or primary domain. For example, web pages that contain "aws.amazon.com" can also include sub domain "docs.aws.amazon.com".

    • sourceConfiguration (dict) -- [REQUIRED]

      The source configuration details for the web data source.

      • urlConfiguration (dict) -- [REQUIRED]

        The configuration of the URL/URLs.

        • seedUrls (list) --

          One or more seed or starting point URLs.

          • (dict) --

            The seed or starting point URL. You should be authorized to crawl the URL.

            • url (string) --

              A seed or starting point URL.
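
The nested structure described above can be assembled with small helper functions. The following is a sketch under stated assumptions: the helper names, the bucket ARN, and the URLs are hypothetical placeholders, and only a subset of the optional fields is shown. The 300 pages per minute per host ceiling comes from the rateLimit description.

```python
MAX_RATE_LIMIT = 300  # documented ceiling: pages per minute per host


def s3_data_source_configuration(bucket_arn, inclusion_prefixes=None):
    """Build a dataSourceConfiguration dict for an S3 data source."""
    s3_config = {"bucketArn": bucket_arn}
    if inclusion_prefixes:
        s3_config["inclusionPrefixes"] = list(inclusion_prefixes)
    return {"type": "S3", "s3Configuration": s3_config}


def web_data_source_configuration(seed_urls, scope="HOST_ONLY", rate_limit=None,
                                  exclusion_filters=None):
    """Build a dataSourceConfiguration dict for a web crawler data source."""
    if rate_limit is not None and rate_limit > MAX_RATE_LIMIT:
        raise ValueError(f"rateLimit must not exceed {MAX_RATE_LIMIT}")
    crawler = {"scope": scope}
    if rate_limit is not None:
        crawler["crawlerLimits"] = {"rateLimit": rate_limit}
    if exclusion_filters:
        crawler["exclusionFilters"] = list(exclusion_filters)
    return {
        "type": "WEB",
        "webConfiguration": {
            "crawlerConfiguration": crawler,
            "sourceConfiguration": {
                "urlConfiguration": {
                    "seedUrls": [{"url": url} for url in seed_urls]
                }
            },
        },
    }


# Placeholder values for illustration only.
s3_config = s3_data_source_configuration(
    "arn:aws:s3:::amzn-s3-demo-bucket", inclusion_prefixes=["docs/"]
)
web_config = web_data_source_configuration(
    ["https://docs.aws.amazon.com/bedrock/latest/userguide/"],
    scope="SUBDOMAINS",
    rate_limit=60,
    exclusion_filters=[r".*\.pdf$"],
)
```

Either dict can then be passed as the dataSourceConfiguration argument of the request. Only the configuration block that matches the chosen type is populated.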

type description:

string

param description:

A description of the data source.

type knowledgeBaseId:

string

param knowledgeBaseId:

[REQUIRED]

The unique identifier of the knowledge base to which to add the data source.

type name:

string

param name:

[REQUIRED]

The name of the data source.

type serverSideEncryptionConfiguration:

dict

param serverSideEncryptionConfiguration:

Contains details about the server-side encryption for the data source.

  • kmsKeyArn (string) --

    The Amazon Resource Name (ARN) of the KMS key used to encrypt the resource.

type vectorIngestionConfiguration:

dict

param vectorIngestionConfiguration:

Contains details about how to ingest the documents in the data source.

  • chunkingConfiguration (dict) --

    Details about how to chunk the documents in the data source. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried.

    • chunkingStrategy (string) -- [REQUIRED]

      A knowledge base can split your source data into chunks. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried. You have the following options for chunking your data. If you opt for NONE, you may want to pre-process your files by splitting them up so that each file corresponds to a chunk.

      • FIXED_SIZE – Amazon Bedrock splits your source data into chunks of the approximate size that you set in the fixedSizeChunkingConfiguration.

      • HIERARCHICAL – Split documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

      • SEMANTIC – Split documents into chunks based on groups of similar content derived with natural language processing.

      • NONE – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files.

    • fixedSizeChunkingConfiguration (dict) --

      Configurations for when you choose fixed-size chunking. If you set the chunkingStrategy as NONE, exclude this field.

      • maxTokens (integer) -- [REQUIRED]

        The maximum number of tokens to include in a chunk.

      • overlapPercentage (integer) -- [REQUIRED]

        The percentage of overlap between adjacent chunks of a data source.

    • hierarchicalChunkingConfiguration (dict) --

      Settings for hierarchical document chunking for a data source. Hierarchical chunking splits documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

      • levelConfigurations (list) -- [REQUIRED]

        Token settings for each layer.

        • (dict) --

          Token settings for a layer in a hierarchical chunking configuration.

          • maxTokens (integer) -- [REQUIRED]

            The maximum number of tokens that a chunk can contain in this layer.

      • overlapTokens (integer) -- [REQUIRED]

        The number of tokens to repeat across chunks in the same layer.

    • semanticChunkingConfiguration (dict) --

      Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content derived from the text with natural language processing.

      • breakpointPercentileThreshold (integer) -- [REQUIRED]

        The dissimilarity threshold for splitting chunks.

      • bufferSize (integer) -- [REQUIRED]

        The buffer size.

      • maxTokens (integer) -- [REQUIRED]

        The maximum number of tokens that a chunk can contain.

  • customTransformationConfiguration (dict) --

    A custom document transformer for parsed data source documents.

    • intermediateStorage (dict) -- [REQUIRED]

      An S3 bucket path for input and output objects.

      • s3Location (dict) -- [REQUIRED]

        An S3 bucket path.

        • uri (string) -- [REQUIRED]

          The location's URI. For example, s3://my-bucket/chunk-processor/.

    • transformations (list) -- [REQUIRED]

      A Lambda function that processes documents.

      • (dict) --

        A custom processing step for documents moving through a data source ingestion pipeline. To process documents after they have been converted into chunks, set the step to apply to POST_CHUNKING.

        • stepToApply (string) -- [REQUIRED]

          When the service applies the transformation.

        • transformationFunction (dict) -- [REQUIRED]

          A Lambda function that processes documents.

          • transformationLambdaConfiguration (dict) -- [REQUIRED]

            The Lambda function.

            • lambdaArn (string) -- [REQUIRED]

              The function's ARN identifier.

  • parsingConfiguration (dict) --

    A custom parser for data source documents.

    • bedrockFoundationModelConfiguration (dict) --

      Settings for a foundation model used to parse documents for a data source.

      • modelArn (string) -- [REQUIRED]

        The ARN of the foundation model or inference profile.

      • parsingPrompt (dict) --

        Instructions for interpreting the contents of a document.

        • parsingPromptText (string) -- [REQUIRED]

          Instructions for interpreting the contents of a document.

    • parsingStrategy (string) -- [REQUIRED]

      The parsing strategy for the data source.
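
The chunking options above map to one strategy value plus, at most, one matching configuration block. The sketch below builds that part of vectorIngestionConfiguration. The helper names are hypothetical and the token counts and overlap values are illustrative assumptions, not recommended settings.

```python
def fixed_size_chunking(max_tokens, overlap_percentage):
    """chunkingConfiguration for FIXED_SIZE chunking."""
    return {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": max_tokens,
            "overlapPercentage": overlap_percentage,
        },
    }


def hierarchical_chunking(level_max_tokens, overlap_tokens):
    """chunkingConfiguration for HIERARCHICAL chunking, one entry per layer."""
    return {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [{"maxTokens": t} for t in level_max_tokens],
            "overlapTokens": overlap_tokens,
        },
    }


def no_chunking():
    """chunkingConfiguration for NONE: each file is treated as one chunk."""
    return {"chunkingStrategy": "NONE"}


# Illustrative values: a large first layer and a smaller second layer.
vector_ingestion_configuration = {
    "chunkingConfiguration": hierarchical_chunking([1500, 300], overlap_tokens=60)
}
```

With NONE, the fixedSizeChunkingConfiguration field is left out entirely, as the parameter description requires.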

rtype:

dict

returns:

Response Syntax

{
    'dataSource': {
        'createdAt': datetime(2015, 1, 1),
        'dataDeletionPolicy': 'RETAIN'|'DELETE',
        'dataSourceConfiguration': {
            'confluenceConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'BASIC'|'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'hostType': 'SAAS',
                    'hostUrl': 'string'
                }
            },
            's3Configuration': {
                'bucketArn': 'string',
                'bucketOwnerAccountId': 'string',
                'inclusionPrefixes': [
                    'string',
                ]
            },
            'salesforceConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'hostUrl': 'string'
                }
            },
            'sharePointConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'domain': 'string',
                    'hostType': 'ONLINE',
                    'siteUrls': [
                        'string',
                    ],
                    'tenantId': 'string'
                }
            },
            'type': 'S3'|'WEB'|'CONFLUENCE'|'SALESFORCE'|'SHAREPOINT'|'CUSTOM',
            'webConfiguration': {
                'crawlerConfiguration': {
                    'crawlerLimits': {
                        'rateLimit': 123
                    },
                    'exclusionFilters': [
                        'string',
                    ],
                    'inclusionFilters': [
                        'string',
                    ],
                    'scope': 'HOST_ONLY'|'SUBDOMAINS'
                },
                'sourceConfiguration': {
                    'urlConfiguration': {
                        'seedUrls': [
                            {
                                'url': 'string'
                            },
                        ]
                    }
                }
            }
        },
        'dataSourceId': 'string',
        'description': 'string',
        'failureReasons': [
            'string',
        ],
        'knowledgeBaseId': 'string',
        'name': 'string',
        'serverSideEncryptionConfiguration': {
            'kmsKeyArn': 'string'
        },
        'status': 'AVAILABLE'|'DELETING'|'DELETE_UNSUCCESSFUL',
        'updatedAt': datetime(2015, 1, 1),
        'vectorIngestionConfiguration': {
            'chunkingConfiguration': {
                'chunkingStrategy': 'FIXED_SIZE'|'NONE'|'HIERARCHICAL'|'SEMANTIC',
                'fixedSizeChunkingConfiguration': {
                    'maxTokens': 123,
                    'overlapPercentage': 123
                },
                'hierarchicalChunkingConfiguration': {
                    'levelConfigurations': [
                        {
                            'maxTokens': 123
                        },
                    ],
                    'overlapTokens': 123
                },
                'semanticChunkingConfiguration': {
                    'breakpointPercentileThreshold': 123,
                    'bufferSize': 123,
                    'maxTokens': 123
                }
            },
            'customTransformationConfiguration': {
                'intermediateStorage': {
                    's3Location': {
                        'uri': 'string'
                    }
                },
                'transformations': [
                    {
                        'stepToApply': 'POST_CHUNKING',
                        'transformationFunction': {
                            'transformationLambdaConfiguration': {
                                'lambdaArn': 'string'
                            }
                        }
                    },
                ]
            },
            'parsingConfiguration': {
                'bedrockFoundationModelConfiguration': {
                    'modelArn': 'string',
                    'parsingPrompt': {
                        'parsingPromptText': 'string'
                    }
                },
                'parsingStrategy': 'BEDROCK_FOUNDATION_MODEL'
            }
        }
    }
}
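
As a sketch of how a caller might read this response, the function below pulls out a few fields from the dataSource dict. The function name is hypothetical, and the sample response is hand-written for illustration; it is not real service output and its identifiers are placeholders.

```python
from datetime import datetime


def summarize_data_source(response: dict) -> dict:
    """Extract a few commonly needed fields from the response dict."""
    data_source = response["dataSource"]
    return {
        "id": data_source["dataSourceId"],
        "status": data_source["status"],
        "type": data_source["dataSourceConfiguration"]["type"],
        # failureReasons may be absent when nothing went wrong.
        "failure_reasons": data_source.get("failureReasons", []),
    }


# Hand-written sample shaped like the response syntax (placeholder values).
sample_response = {
    "dataSource": {
        "createdAt": datetime(2015, 1, 1),
        "updatedAt": datetime(2015, 1, 1),
        "dataSourceId": "EXAMPLEDSID",
        "knowledgeBaseId": "EXAMPLEKBID",
        "name": "example-source",
        "status": "AVAILABLE",
        "dataDeletionPolicy": "RETAIN",
        "dataSourceConfiguration": {
            "type": "S3",
            "s3Configuration": {"bucketArn": "arn:aws:s3:::amzn-s3-demo-bucket"},
        },
    }
}

summary = summarize_data_source(sample_response)
```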

Response Structure

  • (dict) --

    • dataSource (dict) --

      Contains details about the data source.

      • createdAt (datetime) --

        The time at which the data source was created.

      • dataDeletionPolicy (string) --

        The data deletion policy for the data source.

      • dataSourceConfiguration (dict) --

        The connection configuration for the data source.

        • confluenceConfiguration (dict) --

          The configuration information to connect to Confluence as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the Confluence content. For example, configuring specific types of Confluence content.

            • filterConfiguration (dict) --

              The configuration of filtering the Confluence content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your Confluence data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your Confluence instance.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Confluence instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Confluence connection configuration.

            • hostType (string) --

              The supported host type, whether online/cloud or server/on-premises.

            • hostUrl (string) --

              The Confluence host URL or instance URL.

        • s3Configuration (dict) --

          The configuration information to connect to Amazon S3 as your data source.

          • bucketArn (string) --

            The Amazon Resource Name (ARN) of the S3 bucket that contains your data.

          • bucketOwnerAccountId (string) --

            The account ID for the owner of the S3 bucket.

          • inclusionPrefixes (list) --

            A list of S3 prefixes to include certain files or content. For more information, see Organizing objects using prefixes.

            • (string) --

        • salesforceConfiguration (dict) --

          The configuration information to connect to Salesforce as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the Salesforce content. For example, configuring specific types of Salesforce content.

            • filterConfiguration (dict) --

              The configuration of filtering the Salesforce content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your Salesforce data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your Salesforce instance.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Salesforce instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Salesforce connection configuration.

            • hostUrl (string) --

              The Salesforce host URL or instance URL.

        • sharePointConfiguration (dict) --

          The configuration information to connect to SharePoint as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the SharePoint content. For example, configuring specific types of SharePoint content.

            • filterConfiguration (dict) --

              The configuration of filtering the SharePoint content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your SharePoint data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your SharePoint site/sites.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your SharePoint site/sites. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see SharePoint connection configuration.

            • domain (string) --

              The domain of your SharePoint instance or site URL/URLs.

            • hostType (string) --

              The supported host type, whether online/cloud or server/on-premises.

            • siteUrls (list) --

              A list of one or more SharePoint site URLs.

              • (string) --

            • tenantId (string) --

              The identifier of your Microsoft 365 tenant.

        • type (string) --

          The type of data source.

        • webConfiguration (dict) --

          The configuration of web URLs to crawl for your data source. You should be authorized to crawl the URLs.

          • crawlerConfiguration (dict) --

            The Web Crawler configuration details for the web data source.

            • crawlerLimits (dict) --

              The configuration of crawl limits for the web URLs.

              • rateLimit (integer) --

                The max rate at which pages are crawled, up to 300 per minute per host.

            • exclusionFilters (list) --

              A list of one or more exclusion regular expression patterns to exclude certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

              • (string) --

            • inclusionFilters (list) --

              A list of one or more inclusion regular expression patterns to include certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

              • (string) --

            • scope (string) --

              The scope of what is crawled for your URLs.

              You can choose to crawl only web pages that belong to the same host or primary domain. For example, only web pages that contain the seed URL "https://docs.aws.amazon.com/bedrock/latest/userguide/" and no other domains. You can also choose to include subdomains in addition to the host or primary domain. For example, web pages that contain "aws.amazon.com" can also include the subdomain "docs.aws.amazon.com".

          • sourceConfiguration (dict) --

            The source configuration details for the web data source.

            • urlConfiguration (dict) --

              The configuration of the URL/URLs.

              • seedUrls (list) --

                One or more seed or starting point URLs.

                • (dict) --

                  The seed or starting point URL. You should be authorized to crawl the URL.

                  • url (string) --

                    A seed or starting point URL.

      • dataSourceId (string) --

        The unique identifier of the data source.

      • description (string) --

        The description of the data source.

      • failureReasons (list) --

        The detailed reasons for the failure to delete a data source.

        • (string) --

      • knowledgeBaseId (string) --

        The unique identifier of the knowledge base to which the data source belongs.

      • name (string) --

        The name of the data source.

      • serverSideEncryptionConfiguration (dict) --

        Contains details about the configuration of the server-side encryption.

        • kmsKeyArn (string) --

          The Amazon Resource Name (ARN) of the KMS key used to encrypt the resource.

      • status (string) --

        The status of the data source. The following statuses are possible:

        • Available – The data source has been created and is ready for ingestion into the knowledge base.

        • Deleting – The data source is being deleted.

      • updatedAt (datetime) --

        The time at which the data source was last updated.

      • vectorIngestionConfiguration (dict) --

        Contains details about how to ingest the documents in the data source.

        • chunkingConfiguration (dict) --

          Details about how to chunk the documents in the data source. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried.

          • chunkingStrategy (string) --

            Knowledge base can split your source data into chunks. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried. You have the following options for chunking your data. If you opt for NONE, then you may want to pre-process your files by splitting them up such that each file corresponds to a chunk.

            • FIXED_SIZE – Amazon Bedrock splits your source data into chunks of the approximate size that you set in the fixedSizeChunkingConfiguration.

            • HIERARCHICAL – Split documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

            • SEMANTIC – Split documents into chunks based on groups of similar content derived with natural language processing.

            • NONE – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files.

          • fixedSizeChunkingConfiguration (dict) --

            Configurations for when you choose fixed-size chunking. If you set the chunkingStrategy as NONE, exclude this field.

            • maxTokens (integer) --

              The maximum number of tokens to include in a chunk.

            • overlapPercentage (integer) --

              The percentage of overlap between adjacent chunks of a data source.

          • hierarchicalChunkingConfiguration (dict) --

            Settings for hierarchical document chunking for a data source. Hierarchical chunking splits documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

            • levelConfigurations (list) --

              Token settings for each layer.

              • (dict) --

                Token settings for a layer in a hierarchical chunking configuration.

                • maxTokens (integer) --

                  The maximum number of tokens that a chunk can contain in this layer.

            • overlapTokens (integer) --

              The number of tokens to repeat across chunks in the same layer.

          • semanticChunkingConfiguration (dict) --

            Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content derived from the text with natural language processing.

            • breakpointPercentileThreshold (integer) --

              The dissimilarity threshold for splitting chunks.

            • bufferSize (integer) --

              The buffer size.

            • maxTokens (integer) --

              The maximum number of tokens that a chunk can contain.

        • customTransformationConfiguration (dict) --

          A custom document transformer for parsed data source documents.

          • intermediateStorage (dict) --

            An S3 bucket path for input and output objects.

            • s3Location (dict) --

              An S3 bucket path.

              • uri (string) --

                The location's URI. For example, s3://my-bucket/chunk-processor/.

          • transformations (list) --

            A Lambda function that processes documents.

            • (dict) --

              A custom processing step for documents moving through a data source ingestion pipeline. To process documents after they have been converted into chunks, set the step to apply to POST_CHUNKING.

              • stepToApply (string) --

                When the service applies the transformation.

              • transformationFunction (dict) --

                A Lambda function that processes documents.

                • transformationLambdaConfiguration (dict) --

                  The Lambda function.

                  • lambdaArn (string) --

                    The function's ARN identifier.

        • parsingConfiguration (dict) --

          A custom parser for data source documents.

          • bedrockFoundationModelConfiguration (dict) --

            Settings for a foundation model used to parse documents for a data source.

            • modelArn (string) --

              The ARN of the foundation model or inference profile.

            • parsingPrompt (dict) --

              Instructions for interpreting the contents of a document.

              • parsingPromptText (string) --

                Instructions for interpreting the contents of a document.

          • parsingStrategy (string) --

            The parsing strategy for the data source.
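
The inclusionFilters and exclusionFilters descriptions above state that when both kinds of pattern match, the exclusion filter takes precedence. The sketch below illustrates only that precedence rule in plain Python. The helper name is_crawled is hypothetical, and two behaviors are assumptions rather than documented service behavior: patterns are matched with re.search, and an empty inclusion list is treated as "include everything".

```python
import re
from typing import Iterable


def is_crawled(candidate: str,
               inclusion_filters: Iterable[str] = (),
               exclusion_filters: Iterable[str] = ()) -> bool:
    """Illustrate the documented precedence: exclusion beats inclusion.

    Hypothetical helper, not part of boto3 or the Bedrock service.
    """
    # Documented rule: if an exclusion pattern matches, the item is not
    # crawled, even when an inclusion pattern also matches.
    if any(re.search(pattern, candidate) for pattern in exclusion_filters):
        return False

    inclusion = list(inclusion_filters)
    # Assumption: with no inclusion patterns, nothing else is filtered out.
    if not inclusion:
        return True

    return any(re.search(pattern, candidate) for pattern in inclusion)
```

For example, with an inclusion pattern for PDF files and an exclusion pattern for drafts, a draft PDF matches both patterns and is rejected by this helper.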

GetDataSource (updated)
Changes (response)
{'dataSource': {'dataSourceConfiguration': {'type': {'CUSTOM'}}}}

Gets information about a data source.

See also: AWS API Documentation

Request Syntax

client.get_data_source(
    dataSourceId='string',
    knowledgeBaseId='string'
)
type dataSourceId:

string

param dataSourceId:

[REQUIRED]

The unique identifier of the data source.

type knowledgeBaseId:

string

param knowledgeBaseId:

[REQUIRED]

The unique identifier of the knowledge base for the data source.
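
A minimal sketch of calling this operation with the two required parameters might look like the following. The wrapper name fetch_data_source is hypothetical, and the identifiers in the usage comment are placeholders. The client is passed in rather than created inside the function, so the wrapper can be exercised without AWS credentials.

```python
from typing import Any, Dict


def fetch_data_source(client: Any,
                      knowledge_base_id: str,
                      data_source_id: str) -> Dict[str, Any]:
    """Call get_data_source and return the 'dataSource' part of the response.

    `client` is expected to be a bedrock-agent client (hypothetical wrapper).
    """
    response = client.get_data_source(
        dataSourceId=data_source_id,        # required
        knowledgeBaseId=knowledge_base_id,  # required
    )
    return response["dataSource"]


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   client = boto3.client("bedrock-agent")
#   data_source = fetch_data_source(client, "KB_ID_PLACEHOLDER", "DS_ID_PLACEHOLDER")
#   print(data_source["status"])
```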

rtype:

dict

returns:

Response Syntax

{
    'dataSource': {
        'createdAt': datetime(2015, 1, 1),
        'dataDeletionPolicy': 'RETAIN'|'DELETE',
        'dataSourceConfiguration': {
            'confluenceConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'BASIC'|'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'hostType': 'SAAS',
                    'hostUrl': 'string'
                }
            },
            's3Configuration': {
                'bucketArn': 'string',
                'bucketOwnerAccountId': 'string',
                'inclusionPrefixes': [
                    'string',
                ]
            },
            'salesforceConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'hostUrl': 'string'
                }
            },
            'sharePointConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'domain': 'string',
                    'hostType': 'ONLINE',
                    'siteUrls': [
                        'string',
                    ],
                    'tenantId': 'string'
                }
            },
            'type': 'S3'|'WEB'|'CONFLUENCE'|'SALESFORCE'|'SHAREPOINT'|'CUSTOM',
            'webConfiguration': {
                'crawlerConfiguration': {
                    'crawlerLimits': {
                        'rateLimit': 123
                    },
                    'exclusionFilters': [
                        'string',
                    ],
                    'inclusionFilters': [
                        'string',
                    ],
                    'scope': 'HOST_ONLY'|'SUBDOMAINS'
                },
                'sourceConfiguration': {
                    'urlConfiguration': {
                        'seedUrls': [
                            {
                                'url': 'string'
                            },
                        ]
                    }
                }
            }
        },
        'dataSourceId': 'string',
        'description': 'string',
        'failureReasons': [
            'string',
        ],
        'knowledgeBaseId': 'string',
        'name': 'string',
        'serverSideEncryptionConfiguration': {
            'kmsKeyArn': 'string'
        },
        'status': 'AVAILABLE'|'DELETING'|'DELETE_UNSUCCESSFUL',
        'updatedAt': datetime(2015, 1, 1),
        'vectorIngestionConfiguration': {
            'chunkingConfiguration': {
                'chunkingStrategy': 'FIXED_SIZE'|'NONE'|'HIERARCHICAL'|'SEMANTIC',
                'fixedSizeChunkingConfiguration': {
                    'maxTokens': 123,
                    'overlapPercentage': 123
                },
                'hierarchicalChunkingConfiguration': {
                    'levelConfigurations': [
                        {
                            'maxTokens': 123
                        },
                    ],
                    'overlapTokens': 123
                },
                'semanticChunkingConfiguration': {
                    'breakpointPercentileThreshold': 123,
                    'bufferSize': 123,
                    'maxTokens': 123
                }
            },
            'customTransformationConfiguration': {
                'intermediateStorage': {
                    's3Location': {
                        'uri': 'string'
                    }
                },
                'transformations': [
                    {
                        'stepToApply': 'POST_CHUNKING',
                        'transformationFunction': {
                            'transformationLambdaConfiguration': {
                                'lambdaArn': 'string'
                            }
                        }
                    },
                ]
            },
            'parsingConfiguration': {
                'bedrockFoundationModelConfiguration': {
                    'modelArn': 'string',
                    'parsingPrompt': {
                        'parsingPromptText': 'string'
                    }
                },
                'parsingStrategy': 'BEDROCK_FOUNDATION_MODEL'
            }
        }
    }
}
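
Because dataSourceConfiguration carries one configuration block per source type, a caller usually wants only the block that corresponds to the type value. The sketch below shows one way to select it from the 'dataSource' dict. The helper name active_configuration and the lookup table are hypothetical. Mapping CUSTOM to no block is an assumption based on the response syntax above, which lists no configuration key for that type.

```python
from typing import Any, Dict, Optional, Tuple

# Keys taken from the response syntax above. CUSTOM has no matching block
# in that syntax, so it is deliberately absent here (assumption).
_CONFIG_KEY_BY_TYPE = {
    "S3": "s3Configuration",
    "WEB": "webConfiguration",
    "CONFLUENCE": "confluenceConfiguration",
    "SALESFORCE": "salesforceConfiguration",
    "SHAREPOINT": "sharePointConfiguration",
}


def active_configuration(
    data_source: Dict[str, Any],
) -> Tuple[Optional[str], Optional[Dict[str, Any]]]:
    """Return (type, matching configuration block or None) for a data source."""
    config = data_source.get("dataSourceConfiguration", {})
    source_type = config.get("type")
    key = _CONFIG_KEY_BY_TYPE.get(source_type)
    return source_type, (config.get(key) if key else None)
```

Passing the 'dataSource' value of a response for an S3 data source would return the string 'S3' together with the contents of s3Configuration.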

Response Structure

  • (dict) --

    • dataSource (dict) --

      Contains details about the data source.

      • createdAt (datetime) --

        The time at which the data source was created.

      • dataDeletionPolicy (string) --

        The data deletion policy for the data source.

      • dataSourceConfiguration (dict) --

        The connection configuration for the data source.

        • confluenceConfiguration (dict) --

          The configuration information to connect to Confluence as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the Confluence content. For example, configuring specific types of Confluence content.

            • filterConfiguration (dict) --

              The configuration of filtering the Confluence content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your Confluence data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your Confluence instance.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Confluence instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Confluence connection configuration.

            • hostType (string) --

              The supported host type, whether online/cloud or server/on-premises.

            • hostUrl (string) --

              The Confluence host URL or instance URL.

        • s3Configuration (dict) --

          The configuration information to connect to Amazon S3 as your data source.

          • bucketArn (string) --

            The Amazon Resource Name (ARN) of the S3 bucket that contains your data.

          • bucketOwnerAccountId (string) --

            The account ID for the owner of the S3 bucket.

          • inclusionPrefixes (list) --

            A list of S3 prefixes to include certain files or content. For more information, see Organizing objects using prefixes.

            • (string) --

        • salesforceConfiguration (dict) --

          The configuration information to connect to Salesforce as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the Salesforce content. For example, configuring specific types of Salesforce content.

            • filterConfiguration (dict) --

              The configuration of filtering the Salesforce content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your Salesforce data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your Salesforce instance.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Salesforce instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Salesforce connection configuration.

            • hostUrl (string) --

              The Salesforce host URL or instance URL.

        • sharePointConfiguration (dict) --

          The configuration information to connect to SharePoint as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the SharePoint content. For example, configuring specific types of SharePoint content.

            • filterConfiguration (dict) --

              The configuration of filtering the SharePoint content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your SharePoint data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your SharePoint site/sites.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your SharePoint site/sites. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see SharePoint connection configuration.

            • domain (string) --

              The domain of your SharePoint instance or site URL/URLs.

            • hostType (string) --

              The supported host type, whether online/cloud or server/on-premises.

            • siteUrls (list) --

              A list of one or more SharePoint site URLs.

              • (string) --

            • tenantId (string) --

              The identifier of your Microsoft 365 tenant.

        • type (string) --

          The type of data source.

        • webConfiguration (dict) --

          The configuration of web URLs to crawl for your data source. You should be authorized to crawl the URLs.

          • crawlerConfiguration (dict) --

            The Web Crawler configuration details for the web data source.

            • crawlerLimits (dict) --

              The configuration of crawl limits for the web URLs.

              • rateLimit (integer) --

                The max rate at which pages are crawled, up to 300 per minute per host.

            • exclusionFilters (list) --

              A list of one or more exclusion regular expression patterns to exclude certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

              • (string) --

            • inclusionFilters (list) --

              A list of one or more inclusion regular expression patterns to include certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

              • (string) --

            • scope (string) --

              The scope of what is crawled for your URLs.

              You can choose to crawl only web pages that belong to the same host or primary domain. For example, only web pages that contain the seed URL "https://docs.aws.amazon.com/bedrock/latest/userguide/" and no other domains. You can also choose to include subdomains in addition to the host or primary domain. For example, web pages that contain "aws.amazon.com" can also include the subdomain "docs.aws.amazon.com".

          • sourceConfiguration (dict) --

            The source configuration details for the web data source.

            • urlConfiguration (dict) --

              The configuration of the URL/URLs.

              • seedUrls (list) --

                One or more seed or starting point URLs.

                • (dict) --

                  The seed or starting point URL. You should be authorized to crawl the URL.

                  • url (string) --

                    A seed or starting point URL.

      • dataSourceId (string) --

        The unique identifier of the data source.

      • description (string) --

        The description of the data source.

      • failureReasons (list) --

        The detailed reasons for the failure to delete a data source.

        • (string) --

      • knowledgeBaseId (string) --

        The unique identifier of the knowledge base to which the data source belongs.

      • name (string) --

        The name of the data source.

      • serverSideEncryptionConfiguration (dict) --

        Contains details about the configuration of the server-side encryption.

        • kmsKeyArn (string) --

          The Amazon Resource Name (ARN) of the KMS key used to encrypt the resource.

      • status (string) --

        The status of the data source. The following statuses are possible:

        • Available – The data source has been created and is ready for ingestion into the knowledge base.

        • Deleting – The data source is being deleted.

      • updatedAt (datetime) --

        The time at which the data source was last updated.

      • vectorIngestionConfiguration (dict) --

        Contains details about how to ingest the documents in the data source.

        • chunkingConfiguration (dict) --

          Details about how to chunk the documents in the data source. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried.

          • chunkingStrategy (string) --

            A knowledge base can split your source data into chunks. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried. You have the following options for chunking your data. If you opt for NONE, then you may want to pre-process your files by splitting them up such that each file corresponds to a chunk.

            • FIXED_SIZE – Amazon Bedrock splits your source data into chunks of the approximate size that you set in the fixedSizeChunkingConfiguration.

            • HIERARCHICAL – Split documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

            • SEMANTIC – Split documents into chunks based on groups of similar content derived with natural language processing.

            • NONE – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files.

          • fixedSizeChunkingConfiguration (dict) --

            Configurations for when you choose fixed-size chunking. If you set the chunkingStrategy as NONE, exclude this field.

            • maxTokens (integer) --

              The maximum number of tokens to include in a chunk.

            • overlapPercentage (integer) --

              The percentage of overlap between adjacent chunks of a data source.

          • hierarchicalChunkingConfiguration (dict) --

            Settings for hierarchical document chunking for a data source. Hierarchical chunking splits documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

            • levelConfigurations (list) --

              Token settings for each layer.

              • (dict) --

                Token settings for a layer in a hierarchical chunking configuration.

                • maxTokens (integer) --

                  The maximum number of tokens that a chunk can contain in this layer.

            • overlapTokens (integer) --

              The number of tokens to repeat across chunks in the same layer.

          • semanticChunkingConfiguration (dict) --

            Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content derived from the text with natural language processing.

            • breakpointPercentileThreshold (integer) --

              The dissimilarity threshold for splitting chunks.

            • bufferSize (integer) --

              The buffer size.

            • maxTokens (integer) --

              The maximum number of tokens that a chunk can contain.

        • customTransformationConfiguration (dict) --

          A custom document transformer for parsed data source documents.

          • intermediateStorage (dict) --

            An S3 bucket path for input and output objects.

            • s3Location (dict) --

              An S3 bucket path.

              • uri (string) --

                The location's URI. For example, s3://my-bucket/chunk-processor/.

          • transformations (list) --

            One or more custom processing steps, each backed by a Lambda function that processes documents.

            • (dict) --

              A custom processing step for documents moving through a data source ingestion pipeline. To process documents after they have been converted into chunks, set the step to apply to POST_CHUNKING.

              • stepToApply (string) --

                When the service applies the transformation.

              • transformationFunction (dict) --

                A Lambda function that processes documents.

                • transformationLambdaConfiguration (dict) --

                  The Lambda function.

                  • lambdaArn (string) --

                    The function's ARN identifier.

        • parsingConfiguration (dict) --

          A custom parser for data source documents.

          • bedrockFoundationModelConfiguration (dict) --

            Settings for a foundation model used to parse documents for a data source.

            • modelArn (string) --

              The ARN of the foundation model or inference profile.

            • parsingPrompt (dict) --

              Instructions for interpreting the contents of a document.

              • parsingPromptText (string) --

                Instructions for interpreting the contents of a document.

          • parsingStrategy (string) --

            The parsing strategy for the data source.
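As a rough illustration of how the status and failureReasons fields described above might be consumed, the sketch below summarizes a dataSource dict. The helper name summarize_data_source and the sample values are hypothetical. The status strings are the ones shown in the UpdateDataSource response syntax and are assumed to apply to this response as well.

```python
def summarize_data_source(data_source):
    """Return a one-line, human-readable summary of a dataSource dict.

    Hypothetical helper: field names follow the response structure
    documented above; the wording of the summary is an assumption.
    """
    status = data_source.get("status", "UNKNOWN")
    name = data_source.get("name", "<unnamed>")
    if status == "AVAILABLE":
        return f"{name}: ready for ingestion"
    if status == "DELETING":
        return f"{name}: deletion in progress"
    # failureReasons describes why a delete failed, so it is only
    # reported for statuses other than the two handled above.
    reasons = data_source.get("failureReasons") or []
    detail = "; ".join(reasons) if reasons else "no failure reasons reported"
    return f"{name}: {status} ({detail})"


# Sample values are made up for illustration.
example = {
    "name": "docs-bucket",
    "status": "DELETE_UNSUCCESSFUL",
    "failureReasons": ["Vector store could not be reached."],
}
print(summarize_data_source(example))
```
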

UpdateDataSource (updated)
Changes (request, response)
Request
{'dataSourceConfiguration': {'type': {'CUSTOM'}}}
Response
{'dataSource': {'dataSourceConfiguration': {'type': {'CUSTOM'}}}}

Updates the configurations for a data source connector.

See also: AWS API Documentation

Request Syntax

client.update_data_source(
    dataDeletionPolicy='RETAIN'|'DELETE',
    dataSourceConfiguration={
        'confluenceConfiguration': {
            'crawlerConfiguration': {
                'filterConfiguration': {
                    'patternObjectFilter': {
                        'filters': [
                            {
                                'exclusionFilters': [
                                    'string',
                                ],
                                'inclusionFilters': [
                                    'string',
                                ],
                                'objectType': 'string'
                            },
                        ]
                    },
                    'type': 'PATTERN'
                }
            },
            'sourceConfiguration': {
                'authType': 'BASIC'|'OAUTH2_CLIENT_CREDENTIALS',
                'credentialsSecretArn': 'string',
                'hostType': 'SAAS',
                'hostUrl': 'string'
            }
        },
        's3Configuration': {
            'bucketArn': 'string',
            'bucketOwnerAccountId': 'string',
            'inclusionPrefixes': [
                'string',
            ]
        },
        'salesforceConfiguration': {
            'crawlerConfiguration': {
                'filterConfiguration': {
                    'patternObjectFilter': {
                        'filters': [
                            {
                                'exclusionFilters': [
                                    'string',
                                ],
                                'inclusionFilters': [
                                    'string',
                                ],
                                'objectType': 'string'
                            },
                        ]
                    },
                    'type': 'PATTERN'
                }
            },
            'sourceConfiguration': {
                'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                'credentialsSecretArn': 'string',
                'hostUrl': 'string'
            }
        },
        'sharePointConfiguration': {
            'crawlerConfiguration': {
                'filterConfiguration': {
                    'patternObjectFilter': {
                        'filters': [
                            {
                                'exclusionFilters': [
                                    'string',
                                ],
                                'inclusionFilters': [
                                    'string',
                                ],
                                'objectType': 'string'
                            },
                        ]
                    },
                    'type': 'PATTERN'
                }
            },
            'sourceConfiguration': {
                'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                'credentialsSecretArn': 'string',
                'domain': 'string',
                'hostType': 'ONLINE',
                'siteUrls': [
                    'string',
                ],
                'tenantId': 'string'
            }
        },
        'type': 'S3'|'WEB'|'CONFLUENCE'|'SALESFORCE'|'SHAREPOINT'|'CUSTOM',
        'webConfiguration': {
            'crawlerConfiguration': {
                'crawlerLimits': {
                    'rateLimit': 123
                },
                'exclusionFilters': [
                    'string',
                ],
                'inclusionFilters': [
                    'string',
                ],
                'scope': 'HOST_ONLY'|'SUBDOMAINS'
            },
            'sourceConfiguration': {
                'urlConfiguration': {
                    'seedUrls': [
                        {
                            'url': 'string'
                        },
                    ]
                }
            }
        }
    },
    dataSourceId='string',
    description='string',
    knowledgeBaseId='string',
    name='string',
    serverSideEncryptionConfiguration={
        'kmsKeyArn': 'string'
    },
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE'|'NONE'|'HIERARCHICAL'|'SEMANTIC',
            'fixedSizeChunkingConfiguration': {
                'maxTokens': 123,
                'overlapPercentage': 123
            },
            'hierarchicalChunkingConfiguration': {
                'levelConfigurations': [
                    {
                        'maxTokens': 123
                    },
                ],
                'overlapTokens': 123
            },
            'semanticChunkingConfiguration': {
                'breakpointPercentileThreshold': 123,
                'bufferSize': 123,
                'maxTokens': 123
            }
        },
        'customTransformationConfiguration': {
            'intermediateStorage': {
                's3Location': {
                    'uri': 'string'
                }
            },
            'transformations': [
                {
                    'stepToApply': 'POST_CHUNKING',
                    'transformationFunction': {
                        'transformationLambdaConfiguration': {
                            'lambdaArn': 'string'
                        }
                    }
                },
            ]
        },
        'parsingConfiguration': {
            'bedrockFoundationModelConfiguration': {
                'modelArn': 'string',
                'parsingPrompt': {
                    'parsingPromptText': 'string'
                }
            },
            'parsingStrategy': 'BEDROCK_FOUNDATION_MODEL'
        }
    }
)
type dataDeletionPolicy:

string

param dataDeletionPolicy:

The data deletion policy for the data source that you want to update.

type dataSourceConfiguration:

dict

param dataSourceConfiguration:

[REQUIRED]

The connection configuration for the data source that you want to update.

  • confluenceConfiguration (dict) --

    The configuration information to connect to Confluence as your data source.

    • crawlerConfiguration (dict) --

      The configuration of the Confluence content. For example, configuring specific types of Confluence content.

      • filterConfiguration (dict) --

        The configuration of filtering the Confluence content. For example, configuring regular expression patterns to include or exclude certain content.

        • patternObjectFilter (dict) --

          The configuration of filtering certain objects or content types of the data source.

          • filters (list) -- [REQUIRED]

            The configuration of specific filters applied to your data source content. You can filter out or include certain content.

            • (dict) --

              The specific filters applied to your data source content. You can filter out or include certain content.

              • exclusionFilters (list) --

                A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • inclusionFilters (list) --

                A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • objectType (string) -- [REQUIRED]

                The supported object type or content type of the data source.

        • type (string) -- [REQUIRED]

          The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type applies regular expression patterns to filter your content.

    • sourceConfiguration (dict) -- [REQUIRED]

      The endpoint information to connect to your Confluence data source.

      • authType (string) -- [REQUIRED]

        The supported authentication type to authenticate and connect to your Confluence instance.

      • credentialsSecretArn (string) -- [REQUIRED]

        The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Confluence instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Confluence connection configuration.

      • hostType (string) -- [REQUIRED]

        The supported host type, whether online/cloud or server/on-premises.

      • hostUrl (string) -- [REQUIRED]

        The Confluence host URL or instance URL.

  • s3Configuration (dict) --

    The configuration information to connect to Amazon S3 as your data source.

    • bucketArn (string) -- [REQUIRED]

      The Amazon Resource Name (ARN) of the S3 bucket that contains your data.

    • bucketOwnerAccountId (string) --

      The account ID for the owner of the S3 bucket.

    • inclusionPrefixes (list) --

      A list of S3 prefixes to include certain files or content. For more information, see Organizing objects using prefixes.

      • (string) --

  • salesforceConfiguration (dict) --

    The configuration information to connect to Salesforce as your data source.

    • crawlerConfiguration (dict) --

      The configuration of the Salesforce content. For example, configuring specific types of Salesforce content.

      • filterConfiguration (dict) --

        The configuration of filtering the Salesforce content. For example, configuring regular expression patterns to include or exclude certain content.

        • patternObjectFilter (dict) --

          The configuration of filtering certain objects or content types of the data source.

          • filters (list) -- [REQUIRED]

            The configuration of specific filters applied to your data source content. You can filter out or include certain content.

            • (dict) --

              The specific filters applied to your data source content. You can filter out or include certain content.

              • exclusionFilters (list) --

                A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • inclusionFilters (list) --

                A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • objectType (string) -- [REQUIRED]

                The supported object type or content type of the data source.

        • type (string) -- [REQUIRED]

          The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type applies regular expression patterns to filter your content.

    • sourceConfiguration (dict) -- [REQUIRED]

      The endpoint information to connect to your Salesforce data source.

      • authType (string) -- [REQUIRED]

        The supported authentication type to authenticate and connect to your Salesforce instance.

      • credentialsSecretArn (string) -- [REQUIRED]

        The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Salesforce instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Salesforce connection configuration.

      • hostUrl (string) -- [REQUIRED]

        The Salesforce host URL or instance URL.

  • sharePointConfiguration (dict) --

    The configuration information to connect to SharePoint as your data source.

    • crawlerConfiguration (dict) --

      The configuration of the SharePoint content. For example, configuring specific types of SharePoint content.

      • filterConfiguration (dict) --

        The configuration of filtering the SharePoint content. For example, configuring regular expression patterns to include or exclude certain content.

        • patternObjectFilter (dict) --

          The configuration of filtering certain objects or content types of the data source.

          • filters (list) -- [REQUIRED]

            The configuration of specific filters applied to your data source content. You can filter out or include certain content.

            • (dict) --

              The specific filters applied to your data source content. You can filter out or include certain content.

              • exclusionFilters (list) --

                A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • inclusionFilters (list) --

                A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                • (string) --

              • objectType (string) -- [REQUIRED]

                The supported object type or content type of the data source.

        • type (string) -- [REQUIRED]

          The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type applies regular expression patterns to filter your content.

    • sourceConfiguration (dict) -- [REQUIRED]

      The endpoint information to connect to your SharePoint data source.

      • authType (string) -- [REQUIRED]

        The supported authentication type to authenticate and connect to your SharePoint site/sites.

      • credentialsSecretArn (string) -- [REQUIRED]

        The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your SharePoint site/sites. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see SharePoint connection configuration.

      • domain (string) -- [REQUIRED]

        The domain of your SharePoint instance or site URL/URLs.

      • hostType (string) -- [REQUIRED]

        The supported host type, whether online/cloud or server/on-premises.

      • siteUrls (list) -- [REQUIRED]

        A list of one or more SharePoint site URLs.

        • (string) --

      • tenantId (string) --

        The identifier of your Microsoft 365 tenant.

  • type (string) -- [REQUIRED]

    The type of data source.

  • webConfiguration (dict) --

    The configuration of web URLs to crawl for your data source. You should be authorized to crawl the URLs.

    • crawlerConfiguration (dict) --

      The Web Crawler configuration details for the web data source.

      • crawlerLimits (dict) --

        The configuration of crawl limits for the web URLs.

        • rateLimit (integer) --

          The maximum rate at which pages are crawled, up to 300 per minute per host.

      • exclusionFilters (list) --

        A list of one or more exclusion regular expression patterns to exclude certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

        • (string) --

      • inclusionFilters (list) --

        A list of one or more inclusion regular expression patterns to include certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

        • (string) --

      • scope (string) --

        The scope of what is crawled for your URLs.

        You can choose to crawl only web pages that belong to the same host or primary domain. For example, you can crawl only web pages that contain the seed URL "https://docs.aws.amazon.com/bedrock/latest/userguide/" and no other domains. You can also choose to include subdomains in addition to the host or primary domain. For example, web pages that contain "aws.amazon.com" can also include the subdomain "docs.aws.amazon.com".

    • sourceConfiguration (dict) -- [REQUIRED]

      The source configuration details for the web data source.

      • urlConfiguration (dict) -- [REQUIRED]

        The configuration of the URL/URLs.

        • seedUrls (list) --

          One or more seed or starting point URLs.

          • (dict) --

            The seed or starting point URL. You should be authorized to crawl the URL.

            • url (string) --

              A seed or starting point URL.
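The inclusionFilters and exclusionFilters descriptions above share one precedence rule: when both kinds of pattern match, the exclusion filter wins. The following is a minimal local sketch of that rule, not the service's implementation. The function name is_crawled is hypothetical, and two details are assumptions this reference does not confirm: that patterns are tested with re.search, and that an empty inclusion list means everything is included.

```python
import re


def is_crawled(candidate, inclusion_filters=None, exclusion_filters=None):
    """Apply the documented precedence rule to one URL or object name.

    Assumptions (not confirmed by this reference): patterns are tested
    with re.search, and an empty inclusion list includes everything.
    """
    inclusion_filters = inclusion_filters or []
    exclusion_filters = exclusion_filters or []
    # Exclusion takes precedence whenever both kinds of pattern match.
    if any(re.search(p, candidate) for p in exclusion_filters):
        return False
    if inclusion_filters:
        return any(re.search(p, candidate) for p in inclusion_filters)
    return True


# Example patterns and URLs are placeholders.
include = [r".*/userguide/.*"]
exclude = [r".*\.pdf$"]
print(is_crawled("https://example.com/userguide/intro.html", include, exclude))
print(is_crawled("https://example.com/userguide/manual.pdf", include, exclude))
```

The second URL matches both lists, so it is reported as not crawled.
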

type dataSourceId:

string

param dataSourceId:

[REQUIRED]

The unique identifier of the data source.

type description:

string

param description:

Specifies a new description for the data source.

type knowledgeBaseId:

string

param knowledgeBaseId:

[REQUIRED]

The unique identifier of the knowledge base for the data source.

type name:

string

param name:

[REQUIRED]

Specifies a new name for the data source.
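The four parameters marked [REQUIRED] above (dataSourceConfiguration, dataSourceId, knowledgeBaseId and name) can be assembled as in the sketch below. The helper build_update_request, the identifiers and the bucket ARN are hypothetical placeholders, and the helper only builds the keyword arguments without contacting AWS.

```python
def build_update_request(knowledge_base_id, data_source_id, name,
                         data_source_configuration, **optional):
    """Assemble keyword arguments for client.update_data_source.

    Hypothetical helper. Only the four parameters marked [REQUIRED] are
    mandatory; optional keys such as description are passed through as-is.
    """
    if "type" not in data_source_configuration:
        raise ValueError("dataSourceConfiguration requires a 'type'")
    request = {
        "knowledgeBaseId": knowledge_base_id,
        "dataSourceId": data_source_id,
        "name": name,
        "dataSourceConfiguration": data_source_configuration,
    }
    request.update(optional)
    return request


# Placeholder identifiers and ARN, for illustration only.
s3_request = build_update_request(
    "KB_ID_PLACEHOLDER",
    "DS_ID_PLACEHOLDER",
    "my-data-source",
    {
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::amzn-s3-demo-bucket"},
    },
    description="Updated description",
)
print(sorted(s3_request))
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("bedrock-agent").update_data_source(**s3_request). For the newly accepted CUSTOM type, the configuration would presumably be just {"type": "CUSTOM"}, since the request syntax lists no CUSTOM-specific sub-dictionary. That reading is an assumption.
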

type serverSideEncryptionConfiguration:

dict

param serverSideEncryptionConfiguration:

Contains details about server-side encryption of the data source.

  • kmsKeyArn (string) --

    The Amazon Resource Name (ARN) of the KMS key used to encrypt the resource.

type vectorIngestionConfiguration:

dict

param vectorIngestionConfiguration:

Contains details about how to ingest the documents in the data source.

  • chunkingConfiguration (dict) --

    Details about how to chunk the documents in the data source. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried.

    • chunkingStrategy (string) -- [REQUIRED]

      A knowledge base can split your source data into chunks. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried. You have the following options for chunking your data. If you opt for NONE, then you may want to pre-process your files by splitting them up such that each file corresponds to a chunk.

      • FIXED_SIZE – Amazon Bedrock splits your source data into chunks of the approximate size that you set in the fixedSizeChunkingConfiguration.

      • HIERARCHICAL – Split documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

      • SEMANTIC – Split documents into chunks based on groups of similar content derived with natural language processing.

      • NONE – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files.

    • fixedSizeChunkingConfiguration (dict) --

      Configurations for when you choose fixed-size chunking. If you set the chunkingStrategy as NONE, exclude this field.

      • maxTokens (integer) -- [REQUIRED]

        The maximum number of tokens to include in a chunk.

      • overlapPercentage (integer) -- [REQUIRED]

        The percentage of overlap between adjacent chunks of a data source.

    • hierarchicalChunkingConfiguration (dict) --

      Settings for hierarchical document chunking for a data source. Hierarchical chunking splits documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

      • levelConfigurations (list) -- [REQUIRED]

        Token settings for each layer.

        • (dict) --

          Token settings for a layer in a hierarchical chunking configuration.

          • maxTokens (integer) -- [REQUIRED]

            The maximum number of tokens that a chunk can contain in this layer.

      • overlapTokens (integer) -- [REQUIRED]

        The number of tokens to repeat across chunks in the same layer.

    • semanticChunkingConfiguration (dict) --

      Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content derived from the text with natural language processing.

      • breakpointPercentileThreshold (integer) -- [REQUIRED]

        The dissimilarity threshold for splitting chunks.

      • bufferSize (integer) -- [REQUIRED]

        The buffer size.

      • maxTokens (integer) -- [REQUIRED]

        The maximum number of tokens that a chunk can contain.

  • customTransformationConfiguration (dict) --

    A custom document transformer for parsed data source documents.

    • intermediateStorage (dict) -- [REQUIRED]

      An S3 bucket path for input and output objects.

      • s3Location (dict) -- [REQUIRED]

        An S3 bucket path.

        • uri (string) -- [REQUIRED]

          The location's URI. For example, s3://my-bucket/chunk-processor/.

    • transformations (list) -- [REQUIRED]

      One or more custom processing steps, each backed by a Lambda function that processes documents.

      • (dict) --

        A custom processing step for documents moving through a data source ingestion pipeline. To process documents after they have been converted into chunks, set the step to apply to POST_CHUNKING.

        • stepToApply (string) -- [REQUIRED]

          When the service applies the transformation.

        • transformationFunction (dict) -- [REQUIRED]

          A Lambda function that processes documents.

          • transformationLambdaConfiguration (dict) -- [REQUIRED]

            The Lambda function.

            • lambdaArn (string) -- [REQUIRED]

              The function's ARN identifier.

  • parsingConfiguration (dict) --

    A custom parser for data source documents.

    • bedrockFoundationModelConfiguration (dict) --

      Settings for a foundation model used to parse documents for a data source.

      • modelArn (string) -- [REQUIRED]

        The ARN of the foundation model or inference profile.

      • parsingPrompt (dict) --

        Instructions for interpreting the contents of a document.

        • parsingPromptText (string) -- [REQUIRED]

          Instructions for interpreting the contents of a document.

    • parsingStrategy (string) -- [REQUIRED]

      The parsing strategy for the data source.
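Each chunkingStrategy value above pairs with at most one strategy-specific settings dictionary, and NONE takes none. The sketch below shows one way to build a consistent chunkingConfiguration locally. The helper build_chunking_configuration is hypothetical, the token values are placeholders rather than recommended settings, and the check that NONE carries no settings generalizes the note on fixedSizeChunkingConfiguration to the other settings dictionaries, which is an assumption.

```python
def build_chunking_configuration(strategy, **settings):
    """Return a chunkingConfiguration dict for the given strategy.

    Hypothetical helper; it checks only that the fields marked
    [REQUIRED] in the parameter list above are present.
    """
    # Map each strategy to its settings key and that key's required fields.
    required = {
        "FIXED_SIZE": ("fixedSizeChunkingConfiguration",
                       {"maxTokens", "overlapPercentage"}),
        "HIERARCHICAL": ("hierarchicalChunkingConfiguration",
                         {"levelConfigurations", "overlapTokens"}),
        "SEMANTIC": ("semanticChunkingConfiguration",
                     {"breakpointPercentileThreshold", "bufferSize",
                      "maxTokens"}),
    }
    config = {"chunkingStrategy": strategy}
    if strategy == "NONE":
        if settings:
            raise ValueError("NONE takes no strategy-specific settings")
        return config
    if strategy not in required:
        raise ValueError(f"unknown chunkingStrategy: {strategy}")
    key, fields = required[strategy]
    missing = fields - set(settings)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    config[key] = dict(settings)
    return config


# First layer holds the large chunks, second layer the smaller ones.
hierarchical = build_chunking_configuration(
    "HIERARCHICAL",
    levelConfigurations=[{"maxTokens": 1500}, {"maxTokens": 300}],
    overlapTokens=60,
)
print(hierarchical["chunkingStrategy"])
```

The returned dictionary could then be placed under the chunkingConfiguration key of vectorIngestionConfiguration.
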

rtype:

dict

returns:

Response Syntax

{
    'dataSource': {
        'createdAt': datetime(2015, 1, 1),
        'dataDeletionPolicy': 'RETAIN'|'DELETE',
        'dataSourceConfiguration': {
            'confluenceConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'BASIC'|'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'hostType': 'SAAS',
                    'hostUrl': 'string'
                }
            },
            's3Configuration': {
                'bucketArn': 'string',
                'bucketOwnerAccountId': 'string',
                'inclusionPrefixes': [
                    'string',
                ]
            },
            'salesforceConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'hostUrl': 'string'
                }
            },
            'sharePointConfiguration': {
                'crawlerConfiguration': {
                    'filterConfiguration': {
                        'patternObjectFilter': {
                            'filters': [
                                {
                                    'exclusionFilters': [
                                        'string',
                                    ],
                                    'inclusionFilters': [
                                        'string',
                                    ],
                                    'objectType': 'string'
                                },
                            ]
                        },
                        'type': 'PATTERN'
                    }
                },
                'sourceConfiguration': {
                    'authType': 'OAUTH2_CLIENT_CREDENTIALS',
                    'credentialsSecretArn': 'string',
                    'domain': 'string',
                    'hostType': 'ONLINE',
                    'siteUrls': [
                        'string',
                    ],
                    'tenantId': 'string'
                }
            },
            'type': 'S3'|'WEB'|'CONFLUENCE'|'SALESFORCE'|'SHAREPOINT'|'CUSTOM',
            'webConfiguration': {
                'crawlerConfiguration': {
                    'crawlerLimits': {
                        'rateLimit': 123
                    },
                    'exclusionFilters': [
                        'string',
                    ],
                    'inclusionFilters': [
                        'string',
                    ],
                    'scope': 'HOST_ONLY'|'SUBDOMAINS'
                },
                'sourceConfiguration': {
                    'urlConfiguration': {
                        'seedUrls': [
                            {
                                'url': 'string'
                            },
                        ]
                    }
                }
            }
        },
        'dataSourceId': 'string',
        'description': 'string',
        'failureReasons': [
            'string',
        ],
        'knowledgeBaseId': 'string',
        'name': 'string',
        'serverSideEncryptionConfiguration': {
            'kmsKeyArn': 'string'
        },
        'status': 'AVAILABLE'|'DELETING'|'DELETE_UNSUCCESSFUL',
        'updatedAt': datetime(2015, 1, 1),
        'vectorIngestionConfiguration': {
            'chunkingConfiguration': {
                'chunkingStrategy': 'FIXED_SIZE'|'NONE'|'HIERARCHICAL'|'SEMANTIC',
                'fixedSizeChunkingConfiguration': {
                    'maxTokens': 123,
                    'overlapPercentage': 123
                },
                'hierarchicalChunkingConfiguration': {
                    'levelConfigurations': [
                        {
                            'maxTokens': 123
                        },
                    ],
                    'overlapTokens': 123
                },
                'semanticChunkingConfiguration': {
                    'breakpointPercentileThreshold': 123,
                    'bufferSize': 123,
                    'maxTokens': 123
                }
            },
            'customTransformationConfiguration': {
                'intermediateStorage': {
                    's3Location': {
                        'uri': 'string'
                    }
                },
                'transformations': [
                    {
                        'stepToApply': 'POST_CHUNKING',
                        'transformationFunction': {
                            'transformationLambdaConfiguration': {
                                'lambdaArn': 'string'
                            }
                        }
                    },
                ]
            },
            'parsingConfiguration': {
                'bedrockFoundationModelConfiguration': {
                    'modelArn': 'string',
                    'parsingPrompt': {
                        'parsingPromptText': 'string'
                    }
                },
                'parsingStrategy': 'BEDROCK_FOUNDATION_MODEL'
            }
        }
    }
}
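
As a rough illustration of reading a response shaped like the syntax above, the sketch below pulls out the fields a caller usually checks first. The helper summarize_data_source, the CONFIG_KEY_BY_TYPE mapping, and the sample dictionary are hypothetical. The mapping from a type value to its configuration key is inferred from the field names and is an assumption, not documented behavior.

```python
# Assumed mapping from the 'type' value to the key that holds its settings.
# Inferred from the field names in the response syntax; CUSTOM has no entry.
CONFIG_KEY_BY_TYPE = {
    "S3": "s3Configuration",
    "WEB": "webConfiguration",
    "CONFLUENCE": "confluenceConfiguration",
    "SALESFORCE": "salesforceConfiguration",
    "SHAREPOINT": "sharePointConfiguration",
}


def summarize_data_source(response):
    """Return a small summary of a response shaped like the syntax above."""
    data_source = response.get("dataSource", {})
    connection = data_source.get("dataSourceConfiguration", {})
    source_type = connection.get("type")
    config_key = CONFIG_KEY_BY_TYPE.get(source_type)  # None for CUSTOM or unknown
    chunking = data_source.get("vectorIngestionConfiguration", {}).get(
        "chunkingConfiguration", {}
    )
    return {
        "name": data_source.get("name"),
        "status": data_source.get("status"),
        "type": source_type,
        "typeConfiguration": connection.get(config_key, {}) if config_key else {},
        "chunkingStrategy": chunking.get("chunkingStrategy"),
        "failureReasons": data_source.get("failureReasons", []),
    }


# Hand-written sample for illustration only, not real service output.
sample_response = {
    "dataSource": {
        "name": "docs-site",
        "status": "AVAILABLE",
        "dataSourceConfiguration": {
            "type": "WEB",
            "webConfiguration": {
                "crawlerConfiguration": {"scope": "HOST_ONLY"},
            },
        },
        "vectorIngestionConfiguration": {
            "chunkingConfiguration": {"chunkingStrategy": "SEMANTIC"},
        },
    }
}

summary = summarize_data_source(sample_response)
print(summary["type"], summary["chunkingStrategy"])
```

Using .get with defaults keeps the helper tolerant of fields that are absent for a given data source type, since only the configuration matching the type is expected to be populated.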

Response Structure

  • (dict) --

    • dataSource (dict) --

      Contains details about the data source.

      • createdAt (datetime) --

        The time at which the data source was created.

      • dataDeletionPolicy (string) --

        The data deletion policy for the data source.

      • dataSourceConfiguration (dict) --

        The connection configuration for the data source.

        • confluenceConfiguration (dict) --

          The configuration information to connect to Confluence as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the Confluence content. For example, configuring specific types of Confluence content.

            • filterConfiguration (dict) --

              The configuration of filtering the Confluence content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type applies regular expression patterns to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your Confluence data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your Confluence instance.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Confluence instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Confluence connection configuration.

            • hostType (string) --

              The supported host type, whether online/cloud or server/on-premises.

            • hostUrl (string) --

              The Confluence host URL or instance URL.

        • s3Configuration (dict) --

          The configuration information to connect to Amazon S3 as your data source.

          • bucketArn (string) --

            The Amazon Resource Name (ARN) of the S3 bucket that contains your data.

          • bucketOwnerAccountId (string) --

            The account ID for the owner of the S3 bucket.

          • inclusionPrefixes (list) --

            A list of S3 prefixes to include certain files or content. For more information, see Organizing objects using prefixes.

            • (string) --

        • salesforceConfiguration (dict) --

          The configuration information to connect to Salesforce as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the Salesforce content. For example, configuring specific types of Salesforce content.

            • filterConfiguration (dict) --

              The configuration of filtering the Salesforce content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type applies regular expression patterns to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your Salesforce data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your Salesforce instance.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Salesforce instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Salesforce connection configuration.

            • hostUrl (string) --

              The Salesforce host URL or instance URL.

        • sharePointConfiguration (dict) --

          The configuration information to connect to SharePoint as your data source.

          • crawlerConfiguration (dict) --

            The configuration of the SharePoint content. For example, configuring specific types of SharePoint content.

            • filterConfiguration (dict) --

              The configuration of filtering the SharePoint content. For example, configuring regular expression patterns to include or exclude certain content.

              • patternObjectFilter (dict) --

                The configuration of filtering certain objects or content types of the data source.

                • filters (list) --

                  The configuration of specific filters applied to your data source content. You can filter out or include certain content.

                  • (dict) --

                    The specific filters applied to your data source content. You can filter out or include certain content.

                    • exclusionFilters (list) --

                      A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • inclusionFilters (list) --

                      A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

                      • (string) --

                    • objectType (string) --

                      The supported object type or content type of the data source.

              • type (string) --

                The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type applies regular expression patterns to filter your content.

          • sourceConfiguration (dict) --

            The endpoint information to connect to your SharePoint data source.

            • authType (string) --

              The supported authentication type to authenticate and connect to your SharePoint site/sites.

            • credentialsSecretArn (string) --

              The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your SharePoint site/sites. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see SharePoint connection configuration.

            • domain (string) --

              The domain of your SharePoint instance or site URL/URLs.

            • hostType (string) --

              The supported host type, whether online/cloud or server/on-premises.

            • siteUrls (list) --

              A list of one or more SharePoint site URLs.

              • (string) --

            • tenantId (string) --

              The identifier of your Microsoft 365 tenant.

        • type (string) --

          The type of data source.

        • webConfiguration (dict) --

          The configuration of web URLs to crawl for your data source. You should be authorized to crawl the URLs.

          • crawlerConfiguration (dict) --

            The Web Crawler configuration details for the web data source.

            • crawlerLimits (dict) --

              The configuration of crawl limits for the web URLs.

              • rateLimit (integer) --

                The maximum rate at which pages are crawled, up to 300 per minute per host.

            • exclusionFilters (list) --

              A list of one or more exclusion regular expression patterns to exclude certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

              • (string) --

            • inclusionFilters (list) --

              A list of one or more inclusion regular expression patterns to include certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

              • (string) --

            • scope (string) --

              The scope of what is crawled for your URLs.

              You can choose to crawl only web pages that belong to the same host or primary domain. For example, you can crawl only web pages that contain the seed URL "https://docs.aws.amazon.com/bedrock/latest/userguide/" and no other domains. You can also choose to include subdomains in addition to the host or primary domain. For example, web pages that contain "aws.amazon.com" can also include the subdomain "docs.aws.amazon.com".

          • sourceConfiguration (dict) --

            The source configuration details for the web data source.

            • urlConfiguration (dict) --

              The configuration of the URL/URLs.

              • seedUrls (list) --

                One or more seed or starting point URLs.

                • (dict) --

                  The seed or starting point URL. You should be authorized to crawl the URL.

                  • url (string) --

                    A seed or starting point URL.

      • dataSourceId (string) --

        The unique identifier of the data source.

      • description (string) --

        The description of the data source.

      • failureReasons (list) --

        The detailed reasons for the failure to delete a data source.

        • (string) --

      • knowledgeBaseId (string) --

        The unique identifier of the knowledge base to which the data source belongs.

      • name (string) --

        The name of the data source.

      • serverSideEncryptionConfiguration (dict) --

        Contains details about the configuration of the server-side encryption.

        • kmsKeyArn (string) --

          The Amazon Resource Name (ARN) of the KMS key used to encrypt the resource.

      • status (string) --

        The status of the data source. The following statuses are possible:

        • Available – The data source has been created and is ready for ingestion into the knowledge base.

        • Deleting – The data source is being deleted.

      • updatedAt (datetime) --

        The time at which the data source was last updated.

      • vectorIngestionConfiguration (dict) --

        Contains details about how to ingest the documents in the data source.

        • chunkingConfiguration (dict) --

          Details about how to chunk the documents in the data source. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried.

          • chunkingStrategy (string) --

            A knowledge base can split your source data into chunks. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried. You have the following options for chunking your data. If you opt for NONE, then you may want to pre-process your files by splitting them up such that each file corresponds to a chunk.

            • FIXED_SIZE – Amazon Bedrock splits your source data into chunks of the approximate size that you set in the fixedSizeChunkingConfiguration.

            • HIERARCHICAL – Split documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

            • SEMANTIC – Split documents into chunks based on groups of similar content derived with natural language processing.

            • NONE – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files.

          • fixedSizeChunkingConfiguration (dict) --

            Configurations for when you choose fixed-size chunking. If you set the chunkingStrategy as NONE, exclude this field.

            • maxTokens (integer) --

              The maximum number of tokens to include in a chunk.

            • overlapPercentage (integer) --

              The percentage of overlap between adjacent chunks of a data source.

          • hierarchicalChunkingConfiguration (dict) --

            Settings for hierarchical document chunking for a data source. Hierarchical chunking splits documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

            • levelConfigurations (list) --

              Token settings for each layer.

              • (dict) --

                Token settings for a layer in a hierarchical chunking configuration.

                • maxTokens (integer) --

                  The maximum number of tokens that a chunk can contain in this layer.

            • overlapTokens (integer) --

              The number of tokens to repeat across chunks in the same layer.

          • semanticChunkingConfiguration (dict) --

            Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content derived from the text with natural language processing.

            • breakpointPercentileThreshold (integer) --

              The dissimilarity threshold for splitting chunks.

            • bufferSize (integer) --

              The buffer size.

            • maxTokens (integer) --

              The maximum number of tokens that a chunk can contain.

        • customTransformationConfiguration (dict) --

          A custom document transformer for parsed data source documents.

          • intermediateStorage (dict) --

            An S3 bucket path for input and output objects.

            • s3Location (dict) --

              An S3 bucket path.

              • uri (string) --

                The location's URI. For example, s3://my-bucket/chunk-processor/.

          • transformations (list) --

            A Lambda function that processes documents.

            • (dict) --

              A custom processing step for documents moving through a data source ingestion pipeline. To process documents after they have been converted into chunks, set the step to apply to POST_CHUNKING.

              • stepToApply (string) --

                When the service applies the transformation.

              • transformationFunction (dict) --

                A Lambda function that processes documents.

                • transformationLambdaConfiguration (dict) --

                  The Lambda function.

                  • lambdaArn (string) --

                    The function's ARN identifier.

        • parsingConfiguration (dict) --

          A custom parser for data source documents.

          • bedrockFoundationModelConfiguration (dict) --

            Settings for a foundation model used to parse documents for a data source.

            • modelArn (string) --

              The ARN of the foundation model or inference profile.

            • parsingPrompt (dict) --

              Instructions for interpreting the contents of a document.

              • parsingPromptText (string) --

                Instructions for interpreting the contents of a document.

          • parsingStrategy (string) --

            The parsing strategy for the data source.
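
The inclusionFilters and exclusionFilters descriptions above state that when both an inclusion and an exclusion pattern match, the exclusion filter takes precedence. A minimal sketch of that rule follows. The function is_crawled is hypothetical, and two details are assumptions not stated in this reference: patterns are matched with re.search, and an empty inclusion list is treated as "include everything". The service's actual matching rules may differ.

```python
import re


def is_crawled(url, inclusion_filters, exclusion_filters):
    """Decide whether a URL passes the filters, with exclusion taking precedence.

    Assumptions: patterns are matched with re.search, and an empty
    inclusion list means every URL is a candidate.
    """
    # Any exclusion match wins, even if an inclusion pattern also matches.
    if any(re.search(pattern, url) for pattern in exclusion_filters):
        return False
    if not inclusion_filters:
        return True
    return any(re.search(pattern, url) for pattern in inclusion_filters)


# Illustrative patterns and URLs only.
inclusion = [r".*/userguide/.*"]
exclusion = [r".*\.pdf$"]

print(is_crawled("https://example.com/userguide/intro.html", inclusion, exclusion))
print(is_crawled("https://example.com/userguide/intro.pdf", inclusion, exclusion))
```

The second URL matches both lists, so it is skipped. Checking exclusions first is what makes the precedence explicit in the code.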