AWS Glue

2023/11/16 - AWS Glue - 5 new api methods

Changes: Introduces new column statistics APIs to support statistics generation for tables within the Glue Data Catalog.

StartColumnStatisticsTaskRun (new)

Starts a column statistics task run, for a specified table and columns.

See also: AWS API Documentation

Request Syntax

client.start_column_statistics_task_run(
    DatabaseName='string',
    TableName='string',
    ColumnNameList=[
        'string',
    ],
    Role='string',
    SampleSize=123.0,
    CatalogID='string',
    SecurityConfiguration='string'
)
type DatabaseName:

string

param DatabaseName:

[REQUIRED]

The name of the database where the table resides.

type TableName:

string

param TableName:

[REQUIRED]

The name of the table to generate statistics for.

type ColumnNameList:

list

param ColumnNameList:

A list of the names of the columns to generate statistics for. If none is supplied, all column names for the table will be used by default.

  • (string) --

type Role:

string

param Role:

[REQUIRED]

The IAM role that the service assumes to generate statistics.

type SampleSize:

float

param SampleSize:

The percentage of rows used to generate statistics. If none is supplied, the entire table will be used to generate stats.

type CatalogID:

string

param CatalogID:

The ID of the Data Catalog where the table resides. If none is supplied, the Amazon Web Services account ID is used by default.

type SecurityConfiguration:

string

param SecurityConfiguration:

Name of the security configuration that is used to encrypt CloudWatch logs for the column stats task run.

rtype:

dict

returns:

Response Syntax

{
    'ColumnStatisticsTaskRunId': 'string'
}

Response Structure

  • (dict) --

    • ColumnStatisticsTaskRunId (string) --

      The identifier for the column statistics task run.
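
As an illustration of the request above, the following is a minimal sketch of starting a run with only the optional parameters that were actually supplied. The helper name `start_stats_run` and its argument names are hypothetical; `client` is assumed to be a Glue client such as `boto3.client("glue")`.

```python
def start_stats_run(client, database, table, role, columns=None, sample_size=None):
    """Start a column statistics task run and return its run ID.

    Hypothetical helper: `client` is assumed to be a boto3 Glue client.
    """
    # Required parameters per the request syntax.
    kwargs = {"DatabaseName": database, "TableName": table, "Role": role}
    # Omit ColumnNameList to let the service use all columns by default.
    if columns:
        kwargs["ColumnNameList"] = list(columns)
    # Omit SampleSize to let the service use the entire table.
    if sample_size is not None:
        kwargs["SampleSize"] = float(sample_size)
    response = client.start_column_statistics_task_run(**kwargs)
    return response["ColumnStatisticsTaskRunId"]
```

Leaving out `ColumnNameList` and `SampleSize` relies on the defaults described for those parameters rather than sending placeholder values.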

GetColumnStatisticsTaskRuns (new)

Retrieves information about all runs associated with the specified table.

See also: AWS API Documentation

Request Syntax

client.get_column_statistics_task_runs(
    DatabaseName='string',
    TableName='string',
    MaxResults=123,
    NextToken='string'
)
type DatabaseName:

string

param DatabaseName:

[REQUIRED]

The name of the database where the table resides.

type TableName:

string

param TableName:

[REQUIRED]

The name of the table.

type MaxResults:

integer

param MaxResults:

The maximum size of the response.

type NextToken:

string

param NextToken:

A continuation token, if this is a continuation call.

rtype:

dict

returns:

Response Syntax

{
    'ColumnStatisticsTaskRuns': [
        {
            'CustomerId': 'string',
            'ColumnStatisticsTaskRunId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'ColumnNameList': [
                'string',
            ],
            'CatalogID': 'string',
            'Role': 'string',
            'SampleSize': 123.0,
            'SecurityConfiguration': 'string',
            'NumberOfWorkers': 123,
            'WorkerType': 'string',
            'Status': 'STARTING'|'RUNNING'|'SUCCEEDED'|'FAILED'|'STOPPED',
            'CreationTime': datetime(2015, 1, 1),
            'LastUpdated': datetime(2015, 1, 1),
            'StartTime': datetime(2015, 1, 1),
            'EndTime': datetime(2015, 1, 1),
            'ErrorMessage': 'string',
            'DPUSeconds': 123.0
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • ColumnStatisticsTaskRuns (list) --

      A list of column statistics task runs.

      • (dict) --

        The object that shows the details of the column stats run.

        • CustomerId (string) --

          The Amazon Web Services account ID.

        • ColumnStatisticsTaskRunId (string) --

          The identifier for the particular column statistics task run.

        • DatabaseName (string) --

          The database where the table resides.

        • TableName (string) --

          The name of the table for which column statistics are generated.

        • ColumnNameList (list) --

          A list of the column names. If none is supplied, all column names for the table will be used by default.

          • (string) --

        • CatalogID (string) --

          The ID of the Data Catalog where the table resides. If none is supplied, the Amazon Web Services account ID is used by default.

        • Role (string) --

          The IAM role that the service assumes to generate statistics.

        • SampleSize (float) --

          The percentage of rows used to generate statistics. If none is supplied, the entire table will be used to generate stats.

        • SecurityConfiguration (string) --

          Name of the security configuration that is used to encrypt CloudWatch logs for the column stats task run.

        • NumberOfWorkers (integer) --

          The number of workers used to generate column statistics. The job is preconfigured to autoscale up to 25 instances.

        • WorkerType (string) --

          The type of workers being used for generating stats. The default is g.1x.

        • Status (string) --

          The status of the task run.

        • CreationTime (datetime) --

          The time that this task was created.

        • LastUpdated (datetime) --

          The last point in time when this task was modified.

        • StartTime (datetime) --

          The start time of the task.

        • EndTime (datetime) --

          The end time of the task.

        • ErrorMessage (string) --

          The error message for the job.

        • DPUSeconds (float) --

          The calculated DPU usage in seconds for all autoscaled workers.

    • NextToken (string) --

      A continuation token, if not all task runs have yet been returned.
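
Because the response may be split across pages, a caller typically keeps passing `NextToken` back until it is absent. A minimal sketch of that loop follows; the generator name `iter_table_task_runs` and the `page_size` argument are hypothetical, and `client` is assumed to be a boto3 Glue client.

```python
def iter_table_task_runs(client, database, table, page_size=None):
    """Yield every task run dict for one table, following NextToken.

    Hypothetical helper: `client` is assumed to be a boto3 Glue client.
    """
    kwargs = {"DatabaseName": database, "TableName": table}
    if page_size is not None:
        kwargs["MaxResults"] = page_size
    while True:
        page = client.get_column_statistics_task_runs(**kwargs)
        yield from page.get("ColumnStatisticsTaskRuns", [])
        token = page.get("NextToken")
        if not token:
            break  # no further pages
        kwargs["NextToken"] = token
```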

StopColumnStatisticsTaskRun (new)

Stops a task run for the specified table.

See also: AWS API Documentation

Request Syntax

client.stop_column_statistics_task_run(
    DatabaseName='string',
    TableName='string'
)
type DatabaseName:

string

param DatabaseName:

[REQUIRED]

The name of the database where the table resides.

type TableName:

string

param TableName:

[REQUIRED]

The name of the table.

rtype:

dict

returns:

Response Syntax

{}

Response Structure

  • (dict) --
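
Since the stop request identifies a run only by database and table, one possible pattern is to check for an active run first. The sketch below assumes that `STARTING` and `RUNNING` are the active states and inspects only the first page of results; the helper name `stop_if_active` is hypothetical, and `client` is assumed to be a boto3 Glue client.

```python
# Assumption: these two statuses mean a run is still in progress.
ACTIVE_STATUSES = {"STARTING", "RUNNING"}


def stop_if_active(client, database, table):
    """Request a stop only if the table appears to have an active run.

    Returns True if a stop was requested, False otherwise.
    Only the first page of task runs is inspected in this sketch.
    """
    page = client.get_column_statistics_task_runs(
        DatabaseName=database, TableName=table
    )
    runs = page.get("ColumnStatisticsTaskRuns", [])
    if not any(run.get("Status") in ACTIVE_STATUSES for run in runs):
        return False
    client.stop_column_statistics_task_run(DatabaseName=database, TableName=table)
    return True
```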

GetColumnStatisticsTaskRun (new)

Gets the metadata for a task run, given its task run ID.

See also: AWS API Documentation

Request Syntax

client.get_column_statistics_task_run(
    ColumnStatisticsTaskRunId='string'
)
type ColumnStatisticsTaskRunId:

string

param ColumnStatisticsTaskRunId:

[REQUIRED]

The identifier for the particular column statistics task run.

rtype:

dict

returns:

Response Syntax

{
    'ColumnStatisticsTaskRun': {
        'CustomerId': 'string',
        'ColumnStatisticsTaskRunId': 'string',
        'DatabaseName': 'string',
        'TableName': 'string',
        'ColumnNameList': [
            'string',
        ],
        'CatalogID': 'string',
        'Role': 'string',
        'SampleSize': 123.0,
        'SecurityConfiguration': 'string',
        'NumberOfWorkers': 123,
        'WorkerType': 'string',
        'Status': 'STARTING'|'RUNNING'|'SUCCEEDED'|'FAILED'|'STOPPED',
        'CreationTime': datetime(2015, 1, 1),
        'LastUpdated': datetime(2015, 1, 1),
        'StartTime': datetime(2015, 1, 1),
        'EndTime': datetime(2015, 1, 1),
        'ErrorMessage': 'string',
        'DPUSeconds': 123.0
    }
}

Response Structure

  • (dict) --

    • ColumnStatisticsTaskRun (dict) --

      A ColumnStatisticsTaskRun object representing the details of the column stats run.

      • CustomerId (string) --

        The Amazon Web Services account ID.

      • ColumnStatisticsTaskRunId (string) --

        The identifier for the particular column statistics task run.

      • DatabaseName (string) --

        The database where the table resides.

      • TableName (string) --

        The name of the table for which column statistics are generated.

      • ColumnNameList (list) --

        A list of the column names. If none is supplied, all column names for the table will be used by default.

        • (string) --

      • CatalogID (string) --

        The ID of the Data Catalog where the table resides. If none is supplied, the Amazon Web Services account ID is used by default.

      • Role (string) --

        The IAM role that the service assumes to generate statistics.

      • SampleSize (float) --

        The percentage of rows used to generate statistics. If none is supplied, the entire table will be used to generate stats.

      • SecurityConfiguration (string) --

        Name of the security configuration that is used to encrypt CloudWatch logs for the column stats task run.

      • NumberOfWorkers (integer) --

        The number of workers used to generate column statistics. The job is preconfigured to autoscale up to 25 instances.

      • WorkerType (string) --

        The type of workers being used for generating stats. The default is g.1x.

      • Status (string) --

        The status of the task run.

      • CreationTime (datetime) --

        The time that this task was created.

      • LastUpdated (datetime) --

        The last point in time when this task was modified.

      • StartTime (datetime) --

        The start time of the task.

      • EndTime (datetime) --

        The end time of the task.

      • ErrorMessage (string) --

        The error message for the job.

      • DPUSeconds (float) --

        The calculated DPU usage in seconds for all autoscaled workers.
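
A common use of this call is polling a run until it finishes. The following is a rough sketch, assuming that `SUCCEEDED`, `FAILED`, and `STOPPED` are the terminal statuses; the helper name `wait_for_stats_run`, the polling interval, and the timeout are hypothetical choices, and `client` is assumed to be a boto3 Glue client.

```python
import time

# Assumption: a run in one of these statuses will not change any further.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_stats_run(client, run_id, poll_seconds=30, timeout_seconds=3600,
                       sleep=time.sleep):
    """Poll a task run until it reaches a terminal status, then return it.

    Raises TimeoutError if the run is still active after `timeout_seconds`.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_column_statistics_task_run(
            ColumnStatisticsTaskRunId=run_id
        )
        run = response["ColumnStatisticsTaskRun"]
        if run["Status"] in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task run {run_id} still {run['Status']}")
        sleep(poll_seconds)
```

On return, the caller can inspect `Status` and, for a failed run, `ErrorMessage` in the returned dict.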

ListColumnStatisticsTaskRuns (new)

Lists all task runs for a particular account.

See also: AWS API Documentation

Request Syntax

client.list_column_statistics_task_runs(
    MaxResults=123,
    NextToken='string'
)
type MaxResults:

integer

param MaxResults:

The maximum size of the response.

type NextToken:

string

param NextToken:

A continuation token, if this is a continuation call.

rtype:

dict

returns:

Response Syntax

{
    'ColumnStatisticsTaskRunIds': [
        'string',
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • ColumnStatisticsTaskRunIds (list) --

      A list of column statistics task run IDs.

      • (string) --

    • NextToken (string) --

      A continuation token, if not all task run IDs have yet been returned.
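
Since this call returns only run IDs, details such as status require a follow-up lookup per ID. The sketch below combines the two calls to count runs by status; the helper name `summarize_run_statuses` is hypothetical, `client` is assumed to be a boto3 Glue client, and the approach issues one extra request per run, which may be slow for accounts with many runs.

```python
from collections import Counter


def summarize_run_statuses(client):
    """Count task runs in the account by status.

    Hypothetical helper: pages through the run IDs, then looks up each run.
    """
    counts = Counter()
    kwargs = {}
    while True:
        page = client.list_column_statistics_task_runs(**kwargs)
        for run_id in page.get("ColumnStatisticsTaskRunIds", []):
            detail = client.get_column_statistics_task_run(
                ColumnStatisticsTaskRunId=run_id
            )
            counts[detail["ColumnStatisticsTaskRun"]["Status"]] += 1
        token = page.get("NextToken")
        if not token:
            return counts
        kwargs["NextToken"] = token
```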