AWS Glue

2026/05/06 - AWS Glue - 3 updated api methods

Changes: Adds support for a CustomLogGroupPrefix parameter in StartDataQualityRulesetEvaluationRun to specify custom CloudWatch log group paths, and a RulesetName filter in ListDataQualityRulesetEvaluationRuns to filter evaluation runs by ruleset name.

GetDataQualityRulesetEvaluationRun (updated)
Changes (response)
{'AdditionalRunOptions': {'CustomLogGroupPrefix': 'string'}}

Retrieves a specific run where a ruleset is evaluated against a data source.

See also: AWS API Documentation

Request Syntax

client.get_data_quality_ruleset_evaluation_run(
    RunId='string'
)
type RunId:

string

param RunId:

[REQUIRED]

The unique run identifier associated with this run.

rtype:

dict

returns:

Response Syntax

{
    'RunId': 'string',
    'DataSource': {
        'GlueTable': {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string',
            'AdditionalOptions': {
                'string': 'string'
            }
        },
        'DataQualityGlueTable': {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string',
            'AdditionalOptions': {
                'string': 'string'
            },
            'PreProcessingQuery': 'string'
        }
    },
    'Role': 'string',
    'NumberOfWorkers': 123,
    'Timeout': 123,
    'AdditionalRunOptions': {
        'CloudWatchMetricsEnabled': True|False,
        'ResultsS3Prefix': 'string',
        'CompositeRuleEvaluationMethod': 'COLUMN'|'ROW',
        'CustomLogGroupPrefix': 'string'
    },
    'Status': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT',
    'ErrorString': 'string',
    'StartedOn': datetime(2015, 1, 1),
    'LastModifiedOn': datetime(2015, 1, 1),
    'CompletedOn': datetime(2015, 1, 1),
    'ExecutionTime': 123,
    'RulesetNames': [
        'string',
    ],
    'ResultIds': [
        'string',
    ],
    'AdditionalDataSources': {
        'string': {
            'GlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                }
            },
            'DataQualityGlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                },
                'PreProcessingQuery': 'string'
            }
        }
    }
}

Response Structure

  • (dict) --

    • RunId (string) --

      The unique run identifier associated with this run.

    • DataSource (dict) --

      The data source (a Glue table) associated with this evaluation run.

      • GlueTable (dict) --

        A Glue table.

        • DatabaseName (string) --

          A database name in the Glue Data Catalog.

        • TableName (string) --

          A table name in the Glue Data Catalog.

        • CatalogId (string) --

          A unique identifier for the Glue Data Catalog.

        • ConnectionName (string) --

          The name of the connection to the Glue Data Catalog.

        • AdditionalOptions (dict) --

          Additional options for the table. Currently there are two keys supported:

          • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

          • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

          • (string) --

            • (string) --

      • DataQualityGlueTable (dict) --

        A Glue table for Data Quality operations.

        • DatabaseName (string) --

          A database name in the Glue Data Catalog.

        • TableName (string) --

          A table name in the Glue Data Catalog.

        • CatalogId (string) --

          A unique identifier for the Glue Data Catalog.

        • ConnectionName (string) --

          The name of the connection to the Glue Data Catalog.

        • AdditionalOptions (dict) --

          Additional options for the table. Currently there are two keys supported:

          • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

          • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

          • (string) --

            • (string) --

        • PreProcessingQuery (string) --

          A SQL query in SparkSQL format that can be used to pre-process the data for the table in the Glue Data Catalog before running the Data Quality operation.

    • Role (string) --

      An IAM role supplied to encrypt the results of the run.

    • NumberOfWorkers (integer) --

      The number of G.1X workers to be used in the run. The default is 5.

    • Timeout (integer) --

      The timeout for a run in minutes. This is the maximum time that a run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).

    • AdditionalRunOptions (dict) --

      Additional run options you can specify for an evaluation run.

      • CloudWatchMetricsEnabled (boolean) --

        Whether or not to enable CloudWatch metrics.

      • ResultsS3Prefix (string) --

        Prefix for Amazon S3 to store results.

      • CompositeRuleEvaluationMethod (string) --

        Sets the evaluation method for composite rules in the ruleset to either ROW or COLUMN.

      • CustomLogGroupPrefix (string) --

        A custom prefix for the CloudWatch log group names. When specified, evaluation run logs are written to <CustomLogGroupPrefix>/error and <CustomLogGroupPrefix>/output instead of the default /aws-glue/data-quality/error and /aws-glue/data-quality/output log groups.

    • Status (string) --

      The status for this run.

    • ErrorString (string) --

      The error strings that are associated with the run.

    • StartedOn (datetime) --

      The date and time when this run started.

    • LastModifiedOn (datetime) --

      A timestamp. The last point in time when this data quality ruleset evaluation run was modified.

    • CompletedOn (datetime) --

      The date and time when this run was completed.

    • ExecutionTime (integer) --

      The amount of time (in seconds) that the run consumed resources.

    • RulesetNames (list) --

      A list of ruleset names for the run. Currently, this parameter takes only one ruleset name.

      • (string) --

    • ResultIds (list) --

      A list of result IDs for the data quality results for the run.

      • (string) --

    • AdditionalDataSources (dict) --

      A map of reference strings to additional data sources you can specify for an evaluation run.

      • (string) --

        • (dict) --

          A data source (a Glue table) for which you want data quality results.

          • GlueTable (dict) --

            A Glue table.

            • DatabaseName (string) --

              A database name in the Glue Data Catalog.

            • TableName (string) --

              A table name in the Glue Data Catalog.

            • CatalogId (string) --

              A unique identifier for the Glue Data Catalog.

            • ConnectionName (string) --

              The name of the connection to the Glue Data Catalog.

            • AdditionalOptions (dict) --

              Additional options for the table. Currently there are two keys supported:

              • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

              • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

              • (string) --

                • (string) --

          • DataQualityGlueTable (dict) --

            A Glue table for Data Quality operations.

            • DatabaseName (string) --

              A database name in the Glue Data Catalog.

            • TableName (string) --

              A table name in the Glue Data Catalog.

            • CatalogId (string) --

              A unique identifier for the Glue Data Catalog.

            • ConnectionName (string) --

              The name of the connection to the Glue Data Catalog.

            • AdditionalOptions (dict) --

              Additional options for the table. Currently there are two keys supported:

              • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

              • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

              • (string) --

                • (string) --

            • PreProcessingQuery (string) --

              A SQL query in SparkSQL format that can be used to pre-process the data for the table in the Glue Data Catalog before running the Data Quality operation.
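
As a minimal sketch of how this response might be consumed, the helpers below poll a run until it reaches a final status and then derive the CloudWatch log group names from AdditionalRunOptions. The helper names (wait_for_run, log_groups_for_run), the polling interval, and the choice of which statuses count as final are assumptions made for illustration; client is assumed to be a Glue client such as boto3.client("glue").

```python
import time

# Assumption: these statuses are treated as final. STARTING, RUNNING and
# STOPPING are treated as "still in progress".
TERMINAL_STATUSES = {"STOPPED", "SUCCEEDED", "FAILED", "TIMEOUT"}

# Default log group prefix, as described for CustomLogGroupPrefix above.
DEFAULT_LOG_GROUP_PREFIX = "/aws-glue/data-quality"


def wait_for_run(client, run_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll GetDataQualityRulesetEvaluationRun until the run reaches a final status."""
    for _ in range(max_polls):
        run = client.get_data_quality_ruleset_evaluation_run(RunId=run_id)
        if run.get("Status") in TERMINAL_STATUSES:
            return run
        sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")


def log_groups_for_run(run):
    """Return the (error, output) log group names implied by the response."""
    options = run.get("AdditionalRunOptions") or {}
    prefix = options.get("CustomLogGroupPrefix") or DEFAULT_LOG_GROUP_PREFIX
    return f"{prefix}/error", f"{prefix}/output"


# Hypothetical usage (requires boto3 and configured credentials):
# client = boto3.client("glue")
# run = wait_for_run(client, "<run id returned by the start call>")
# error_group, output_group = log_groups_for_run(run)
```

The sleep function is passed in as a parameter so the loop can be exercised without real delays; in production code a backoff strategy may be preferable to a fixed interval.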

ListDataQualityRulesetEvaluationRuns (updated)
Changes (request)
{'Filter': {'RulesetName': 'string'}}

Lists all the runs meeting the filter criteria, where a ruleset is evaluated against a data source.

See also: AWS API Documentation

Request Syntax

client.list_data_quality_ruleset_evaluation_runs(
    Filter={
        'DataSource': {
            'GlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                }
            },
            'DataQualityGlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                },
                'PreProcessingQuery': 'string'
            }
        },
        'StartedBefore': datetime(2015, 1, 1),
        'StartedAfter': datetime(2015, 1, 1),
        'RulesetName': 'string'
    },
    NextToken='string',
    MaxResults=123
)
type Filter:

dict

param Filter:

The filter criteria.

  • DataSource (dict) -- [REQUIRED]

    Filter based on a data source (a Glue table) associated with the run.

    • GlueTable (dict) --

      A Glue table.

      • DatabaseName (string) -- [REQUIRED]

        A database name in the Glue Data Catalog.

      • TableName (string) -- [REQUIRED]

        A table name in the Glue Data Catalog.

      • CatalogId (string) --

        A unique identifier for the Glue Data Catalog.

      • ConnectionName (string) --

        The name of the connection to the Glue Data Catalog.

      • AdditionalOptions (dict) --

        Additional options for the table. Currently there are two keys supported:

        • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

        • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

        • (string) --

          • (string) --

    • DataQualityGlueTable (dict) --

      A Glue table for Data Quality operations.

      • DatabaseName (string) -- [REQUIRED]

        A database name in the Glue Data Catalog.

      • TableName (string) -- [REQUIRED]

        A table name in the Glue Data Catalog.

      • CatalogId (string) --

        A unique identifier for the Glue Data Catalog.

      • ConnectionName (string) --

        The name of the connection to the Glue Data Catalog.

      • AdditionalOptions (dict) --

        Additional options for the table. Currently there are two keys supported:

        • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

        • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

        • (string) --

          • (string) --

      • PreProcessingQuery (string) --

        A SQL query in SparkSQL format that can be used to pre-process the data for the table in the Glue Data Catalog before running the Data Quality operation.

  • StartedBefore (datetime) --

    Filter results by runs that started before this time.

  • StartedAfter (datetime) --

    Filter results by runs that started after this time.

  • RulesetName (string) --

    Filter results by the name of the ruleset.

type NextToken:

string

param NextToken:

A pagination token to offset the results.

type MaxResults:

integer

param MaxResults:

The maximum number of results to return.
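
The request parameters above can be assembled as in the following sketch, which builds a Filter that combines the required DataSource with the new RulesetName key. The function name build_run_filter and the database, table, and ruleset names are hypothetical placeholders, not values from this changelog.

```python
from datetime import datetime, timezone


def build_run_filter(database_name, table_name, ruleset_name=None,
                     started_after=None, started_before=None):
    """Build the Filter argument for ListDataQualityRulesetEvaluationRuns."""
    run_filter = {
        # DataSource is the only required key of Filter.
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database_name,
                "TableName": table_name,
            }
        }
    }
    # Optional keys are only included when a value was supplied.
    if ruleset_name is not None:
        run_filter["RulesetName"] = ruleset_name
    if started_after is not None:
        run_filter["StartedAfter"] = started_after
    if started_before is not None:
        run_filter["StartedBefore"] = started_before
    return run_filter


# Hypothetical values, for illustration only.
example_filter = build_run_filter(
    "sales_db",
    "orders",
    ruleset_name="orders_ruleset",
    started_after=datetime(2026, 5, 1, tzinfo=timezone.utc),
)
# client.list_data_quality_ruleset_evaluation_runs(Filter=example_filter)
```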

rtype:

dict

returns:

Response Syntax

{
    'Runs': [
        {
            'RunId': 'string',
            'Status': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT',
            'StartedOn': datetime(2015, 1, 1),
            'DataSource': {
                'GlueTable': {
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'CatalogId': 'string',
                    'ConnectionName': 'string',
                    'AdditionalOptions': {
                        'string': 'string'
                    }
                },
                'DataQualityGlueTable': {
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'CatalogId': 'string',
                    'ConnectionName': 'string',
                    'AdditionalOptions': {
                        'string': 'string'
                    },
                    'PreProcessingQuery': 'string'
                }
            }
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • Runs (list) --

      A list of DataQualityRulesetEvaluationRunDescription objects representing data quality ruleset runs.

      • (dict) --

        Describes the result of a data quality ruleset evaluation run.

        • RunId (string) --

          The unique run identifier associated with this run.

        • Status (string) --

          The status for this run.

        • StartedOn (datetime) --

          The date and time when the run started.

        • DataSource (dict) --

          The data source (a Glue table) associated with the run.

          • GlueTable (dict) --

            A Glue table.

            • DatabaseName (string) --

              A database name in the Glue Data Catalog.

            • TableName (string) --

              A table name in the Glue Data Catalog.

            • CatalogId (string) --

              A unique identifier for the Glue Data Catalog.

            • ConnectionName (string) --

              The name of the connection to the Glue Data Catalog.

            • AdditionalOptions (dict) --

              Additional options for the table. Currently there are two keys supported:

              • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

              • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

              • (string) --

                • (string) --

          • DataQualityGlueTable (dict) --

            A Glue table for Data Quality operations.

            • DatabaseName (string) --

              A database name in the Glue Data Catalog.

            • TableName (string) --

              A table name in the Glue Data Catalog.

            • CatalogId (string) --

              A unique identifier for the Glue Data Catalog.

            • ConnectionName (string) --

              The name of the connection to the Glue Data Catalog.

            • AdditionalOptions (dict) --

              Additional options for the table. Currently there are two keys supported:

              • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

              • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

              • (string) --

                • (string) --

            • PreProcessingQuery (string) --

              A SQL query in SparkSQL format that can be used to pre-process the data for the table in the Glue Data Catalog before running the Data Quality operation.

    • NextToken (string) --

      A pagination token, if more results are available.
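
Because the response carries a NextToken when more results are available, a caller typically loops until the token is absent. The generator below is one possible sketch of that loop; the name iter_evaluation_runs and the page size of 50 are arbitrary assumptions, and client is assumed to be a Glue client such as boto3.client("glue").

```python
def iter_evaluation_runs(client, run_filter, page_size=50):
    """Yield every run description, following NextToken until it is absent."""
    kwargs = {"Filter": run_filter, "MaxResults": page_size}
    while True:
        page = client.list_data_quality_ruleset_evaluation_runs(**kwargs)
        # Each page holds a list of run descriptions under "Runs".
        yield from page.get("Runs", [])
        token = page.get("NextToken")
        if not token:
            break
        # Pass the token back unchanged to request the next page.
        kwargs["NextToken"] = token


# Hypothetical usage:
# for run in iter_evaluation_runs(client, run_filter):
#     print(run["RunId"], run["Status"], run["StartedOn"])
```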

StartDataQualityRulesetEvaluationRun (updated)
Changes (request)
{'AdditionalRunOptions': {'CustomLogGroupPrefix': 'string'}}

Once you have a ruleset definition (either recommended or your own), you call this operation to evaluate the ruleset against a data source (Glue table). The evaluation computes results which you can retrieve with the GetDataQualityResult API.

See also: AWS API Documentation

Request Syntax

client.start_data_quality_ruleset_evaluation_run(
    DataSource={
        'GlueTable': {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string',
            'AdditionalOptions': {
                'string': 'string'
            }
        },
        'DataQualityGlueTable': {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string',
            'AdditionalOptions': {
                'string': 'string'
            },
            'PreProcessingQuery': 'string'
        }
    },
    Role='string',
    NumberOfWorkers=123,
    Timeout=123,
    ClientToken='string',
    AdditionalRunOptions={
        'CloudWatchMetricsEnabled': True|False,
        'ResultsS3Prefix': 'string',
        'CompositeRuleEvaluationMethod': 'COLUMN'|'ROW',
        'CustomLogGroupPrefix': 'string'
    },
    RulesetNames=[
        'string',
    ],
    AdditionalDataSources={
        'string': {
            'GlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                }
            },
            'DataQualityGlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                },
                'PreProcessingQuery': 'string'
            }
        }
    }
)
type DataSource:

dict

param DataSource:

[REQUIRED]

The data source (Glue table) associated with this run.

  • GlueTable (dict) --

    A Glue table.

    • DatabaseName (string) -- [REQUIRED]

      A database name in the Glue Data Catalog.

    • TableName (string) -- [REQUIRED]

      A table name in the Glue Data Catalog.

    • CatalogId (string) --

      A unique identifier for the Glue Data Catalog.

    • ConnectionName (string) --

      The name of the connection to the Glue Data Catalog.

    • AdditionalOptions (dict) --

      Additional options for the table. Currently there are two keys supported:

      • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

      • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

      • (string) --

        • (string) --

  • DataQualityGlueTable (dict) --

    A Glue table for Data Quality operations.

    • DatabaseName (string) -- [REQUIRED]

      A database name in the Glue Data Catalog.

    • TableName (string) -- [REQUIRED]

      A table name in the Glue Data Catalog.

    • CatalogId (string) --

      A unique identifier for the Glue Data Catalog.

    • ConnectionName (string) --

      The name of the connection to the Glue Data Catalog.

    • AdditionalOptions (dict) --

      Additional options for the table. Currently there are two keys supported:

      • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

      • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

      • (string) --

        • (string) --

    • PreProcessingQuery (string) --

      A SQL query in SparkSQL format that can be used to pre-process the data for the table in the Glue Data Catalog before running the Data Quality operation.

type Role:

string

param Role:

[REQUIRED]

An IAM role supplied to encrypt the results of the run.

type NumberOfWorkers:

integer

param NumberOfWorkers:

The number of G.1X workers to be used in the run. The default is 5.

type Timeout:

integer

param Timeout:

The timeout for a run in minutes. This is the maximum time that a run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).

type ClientToken:

string

param ClientToken:

Used for idempotency. Setting it to a random ID (such as a UUID) is recommended to avoid creating or starting multiple instances of the same resource.

type AdditionalRunOptions:

dict

param AdditionalRunOptions:

Additional run options you can specify for an evaluation run.

  • CloudWatchMetricsEnabled (boolean) --

    Whether or not to enable CloudWatch metrics.

  • ResultsS3Prefix (string) --

    Prefix for Amazon S3 to store results.

  • CompositeRuleEvaluationMethod (string) --

    Sets the evaluation method for composite rules in the ruleset to either ROW or COLUMN.

  • CustomLogGroupPrefix (string) --

    A custom prefix for the CloudWatch log group names. When specified, evaluation run logs are written to <CustomLogGroupPrefix>/error and <CustomLogGroupPrefix>/output instead of the default /aws-glue/data-quality/error and /aws-glue/data-quality/output log groups.

type RulesetNames:

list

param RulesetNames:

[REQUIRED]

A list of ruleset names.

  • (string) --

type AdditionalDataSources:

dict

param AdditionalDataSources:

A map of reference strings to additional data sources you can specify for an evaluation run.

  • (string) --

    • (dict) --

      A data source (a Glue table) for which you want data quality results.

      • GlueTable (dict) --

        A Glue table.

        • DatabaseName (string) -- [REQUIRED]

          A database name in the Glue Data Catalog.

        • TableName (string) -- [REQUIRED]

          A table name in the Glue Data Catalog.

        • CatalogId (string) --

          A unique identifier for the Glue Data Catalog.

        • ConnectionName (string) --

          The name of the connection to the Glue Data Catalog.

        • AdditionalOptions (dict) --

          Additional options for the table. Currently there are two keys supported:

          • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

          • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

          • (string) --

            • (string) --

      • DataQualityGlueTable (dict) --

        A Glue table for Data Quality operations.

        • DatabaseName (string) -- [REQUIRED]

          A database name in the Glue Data Catalog.

        • TableName (string) -- [REQUIRED]

          A table name in the Glue Data Catalog.

        • CatalogId (string) --

          A unique identifier for the Glue Data Catalog.

        • ConnectionName (string) --

          The name of the connection to the Glue Data Catalog.

        • AdditionalOptions (dict) --

          Additional options for the table. Currently there are two keys supported:

          • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

          • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

          • (string) --

            • (string) --

        • PreProcessingQuery (string) --

          A SQL query in SparkSQL format that can be used to pre-process the data for the table in the Glue Data Catalog before running the Data Quality operation.
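
To illustrate how the request parameters fit together, the sketch below assembles the keyword arguments for a start call, including the new CustomLogGroupPrefix option and a UUID-based ClientToken. The function name build_start_request and all argument values in the usage comment are hypothetical; only the parameter names come from the request syntax above.

```python
import uuid


def build_start_request(database_name, table_name, role, ruleset_name,
                        log_group_prefix=None, client_token=None):
    """Build keyword arguments for StartDataQualityRulesetEvaluationRun."""
    request = {
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database_name,
                "TableName": table_name,
            }
        },
        "Role": role,
        "RulesetNames": [ruleset_name],
        # A random UUID is the kind of idempotency token the parameter
        # description recommends.
        "ClientToken": client_token or str(uuid.uuid4()),
    }
    if log_group_prefix is not None:
        # Logs would then go to <prefix>/error and <prefix>/output.
        request["AdditionalRunOptions"] = {"CustomLogGroupPrefix": log_group_prefix}
    return request


# Hypothetical usage (placeholder names and role):
# request = build_start_request("sales_db", "orders", "<IAM role>",
#                               "orders_ruleset", log_group_prefix="/custom/dq")
# response = client.start_data_quality_ruleset_evaluation_run(**request)
```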

rtype:

dict

returns:

Response Syntax

{
    'RunId': 'string'
}

Response Structure

  • (dict) --

    • RunId (string) --

      The unique run identifier associated with this run.
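
Since ClientToken is described as an idempotency token, one plausible pattern is to generate it once and reuse the same request on every retry, then keep the returned RunId for later Get calls. The sketch below assumes that pattern; the function name start_run_with_retries, the attempt count, and the decision to retry on any exception are illustrative assumptions, and the exact deduplication behavior of the service is not described in this changelog.

```python
def start_run_with_retries(client, request, attempts=3):
    """Start a run, reusing the same request (and ClientToken) on each retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if "ClientToken" not in request:
        raise ValueError("request should carry a ClientToken so retries are identifiable")
    last_error = None
    for _ in range(attempts):
        try:
            response = client.start_data_quality_ruleset_evaluation_run(**request)
            # The response contains only the identifier of the new run.
            return response["RunId"]
        except Exception as error:
            # Assumption: every failure is retried; real code would likely
            # retry only on errors known to be transient.
            last_error = error
    raise last_error


# Hypothetical usage:
# run_id = start_run_with_retries(client, request)
```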