AWS Glue

2024/10/31 - AWS Glue - 6 new, 2 updated API methods

Changes: Add schedule support for AWS Glue column statistics.

UpdateColumnStatisticsTaskSettings (new)

Updates settings for a column statistics task.

See also: AWS API Documentation

Request Syntax

client.update_column_statistics_task_settings(
    DatabaseName='string',
    TableName='string',
    Role='string',
    Schedule='string',
    ColumnNameList=[
        'string',
    ],
    SampleSize=123.0,
    CatalogID='string',
    SecurityConfiguration='string'
)
type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the database where the table resides.

type TableName

string

param TableName

[REQUIRED]

The name of the table for which to generate column statistics.

type Role

string

param Role

The role used for running the column statistics.

type Schedule

string

param Schedule

A schedule for running the column statistics, specified in CRON syntax.

type ColumnNameList

list

param ColumnNameList

A list of column names for which to run statistics.

  • (string) --

type SampleSize

float

param SampleSize

The percentage of data to sample.

type CatalogID

string

param CatalogID

The ID of the Data Catalog in which the database resides.

type SecurityConfiguration

string

param SecurityConfiguration

Name of the security configuration that is used to encrypt CloudWatch logs.

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
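
Only DatabaseName and TableName are required for the update call, so a request usually carries just the settings being changed. The sketch below is one way to assemble the keyword arguments while leaving out optional values that were not supplied, since boto3 generally rejects None during parameter validation. The helper name build_update_kwargs and the database and table names are illustrative assumptions, not part of the API.

```python
from typing import Any, Dict, Optional, Sequence


def build_update_kwargs(
    database: str,
    table: str,
    schedule: Optional[str] = None,
    columns: Optional[Sequence[str]] = None,
    sample_size: Optional[float] = None,
) -> Dict[str, Any]:
    """Assemble request keyword arguments, skipping optional values left as None."""
    kwargs: Dict[str, Any] = {"DatabaseName": database, "TableName": table}
    if schedule is not None:
        kwargs["Schedule"] = schedule
    if columns is not None:
        kwargs["ColumnNameList"] = list(columns)
    if sample_size is not None:
        kwargs["SampleSize"] = float(sample_size)
    return kwargs


# Hypothetical database and table names, used only for illustration.
request = build_update_kwargs("sales_db", "orders", schedule="cron(15 12 * * ? *)")
print(request)
```

With a Glue client created through boto3.client("glue"), the result could then be passed as client.update_column_statistics_task_settings(**request).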

DeleteColumnStatisticsTaskSettings (new)

Deletes settings for a column statistics task.

See also: AWS API Documentation

Request Syntax

client.delete_column_statistics_task_settings(
    DatabaseName='string',
    TableName='string'
)
type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the database where the table resides.

type TableName

string

param TableName

[REQUIRED]

The name of the table for which to delete column statistics.

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --

GetColumnStatisticsTaskSettings (new)

Gets settings for a column statistics task.

See also: AWS API Documentation

Request Syntax

client.get_column_statistics_task_settings(
    DatabaseName='string',
    TableName='string'
)
type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the database where the table resides.

type TableName

string

param TableName

[REQUIRED]

The name of the table for which to retrieve column statistics.

rtype

dict

returns

Response Syntax

{
    'ColumnStatisticsTaskSettings': {
        'DatabaseName': 'string',
        'TableName': 'string',
        'Schedule': {
            'ScheduleExpression': 'string',
            'State': 'SCHEDULED'|'NOT_SCHEDULED'|'TRANSITIONING'
        },
        'ColumnNameList': [
            'string',
        ],
        'CatalogID': 'string',
        'Role': 'string',
        'SampleSize': 123.0,
        'SecurityConfiguration': 'string'
    }
}

Response Structure

  • (dict) --

    • ColumnStatisticsTaskSettings (dict) --

      A ColumnStatisticsTaskSettings object representing the settings for the column statistics task.

      • DatabaseName (string) --

        The name of the database where the table resides.

      • TableName (string) --

        The name of the table for which to generate column statistics.

      • Schedule (dict) --

        A schedule for running the column statistics, specified in CRON syntax.

        • ScheduleExpression (string) --

          A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

        • State (string) --

          The state of the schedule.

      • ColumnNameList (list) --

        A list of column names for which to run statistics.

        • (string) --

      • CatalogID (string) --

        The ID of the Data Catalog in which the database resides.

      • Role (string) --

        The role used for running the column statistics.

      • SampleSize (float) --

        The percentage of data to sample.

      • SecurityConfiguration (string) --

        Name of the security configuration that is used to encrypt CloudWatch logs.
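
In the response, Schedule is a nested dict rather than the plain string accepted by the create and update calls. A minimal sketch of reading it defensively follows. It assumes that Schedule, or the whole settings object, may be absent from a response, which the reference above does not state either way. The function name schedule_summary and the sample values are made up for illustration.

```python
from typing import Any, Dict, Optional, Tuple


def schedule_summary(response: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (ScheduleExpression, State), using None for anything absent."""
    settings = response.get("ColumnStatisticsTaskSettings") or {}
    schedule = settings.get("Schedule") or {}
    return schedule.get("ScheduleExpression"), schedule.get("State")


# A hand-written response in the documented shape; the values are made up.
sample_response = {
    "ColumnStatisticsTaskSettings": {
        "DatabaseName": "sales_db",
        "TableName": "orders",
        "Schedule": {
            "ScheduleExpression": "cron(15 12 * * ? *)",
            "State": "SCHEDULED",
        },
    }
}

expression, state = schedule_summary(sample_response)
print(expression, state)
```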

StartColumnStatisticsTaskRunSchedule (new)

Starts a column statistics task run schedule.

See also: AWS API Documentation

Request Syntax

client.start_column_statistics_task_run_schedule(
    DatabaseName='string',
    TableName='string'
)
type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the database where the table resides.

type TableName

string

param TableName

[REQUIRED]

The name of the table for which to start a column statistics task run schedule.

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --

StopColumnStatisticsTaskRunSchedule (new)

Stops a column statistics task run schedule.

See also: AWS API Documentation

Request Syntax

client.stop_column_statistics_task_run_schedule(
    DatabaseName='string',
    TableName='string'
)
type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the database where the table resides.

type TableName

string

param TableName

[REQUIRED]

The name of the table for which to stop a column statistics task run schedule.

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
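
The start and stop operations take the same two required parameters, so they can sit behind a single switch. The following is a rough sketch of that idea. The wrapper name set_schedule_running is hypothetical, and client is assumed to be a Glue client such as the one returned by boto3.client("glue").

```python
def set_schedule_running(client, database: str, table: str, running: bool) -> None:
    """Start the table's schedule when running is True, otherwise stop it."""
    if running:
        client.start_column_statistics_task_run_schedule(
            DatabaseName=database, TableName=table
        )
    else:
        client.stop_column_statistics_task_run_schedule(
            DatabaseName=database, TableName=table
        )
```

Taking the client as an argument keeps the wrapper easy to exercise with a stub object in tests, without any network access.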

CreateColumnStatisticsTaskSettings (new)

Creates settings for a column statistics task.

See also: AWS API Documentation

Request Syntax

client.create_column_statistics_task_settings(
    DatabaseName='string',
    TableName='string',
    Role='string',
    Schedule='string',
    ColumnNameList=[
        'string',
    ],
    SampleSize=123.0,
    CatalogID='string',
    SecurityConfiguration='string',
    Tags={
        'string': 'string'
    }
)
type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the database where the table resides.

type TableName

string

param TableName

[REQUIRED]

The name of the table for which to generate column statistics.

type Role

string

param Role

[REQUIRED]

The role used for running the column statistics.

type Schedule

string

param Schedule

A schedule for running the column statistics, specified in CRON syntax.

type ColumnNameList

list

param ColumnNameList

A list of column names for which to run statistics.

  • (string) --

type SampleSize

float

param SampleSize

The percentage of data to sample.

type CatalogID

string

param CatalogID

The ID of the Data Catalog in which the database resides.

type SecurityConfiguration

string

param SecurityConfiguration

Name of the security configuration that is used to encrypt CloudWatch logs.

type Tags

dict

param Tags

A map of tags.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
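
The Schedule parameter is a cron string of the form shown for ScheduleExpression earlier on this page, where cron(15 12 * * ? *) means every day at 12:15 UTC. As an illustration, the sketch below builds such a daily expression and passes it to the create call together with the three required parameters. The helper names daily_cron and create_daily_settings are assumptions for this example, and client is assumed to be a Glue client from boto3.client("glue").

```python
from typing import Any, Dict, Optional


def daily_cron(hour: int, minute: int) -> str:
    """Build a cron expression for one run per day at hour:minute UTC."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"cron({minute} {hour} * * ? *)"


def create_daily_settings(
    client,
    database: str,
    table: str,
    role: str,
    hour: int,
    minute: int,
    tags: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Create column statistics task settings that run once per day."""
    kwargs: Dict[str, Any] = {
        "DatabaseName": database,
        "TableName": table,
        "Role": role,
        "Schedule": daily_cron(hour, minute),
    }
    if tags:
        # Tags is optional, so it is only sent when provided.
        kwargs["Tags"] = dict(tags)
    return client.create_column_statistics_task_settings(**kwargs)


print(daily_cron(12, 15))  # cron(15 12 * * ? *)
```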

GetColumnStatisticsTaskRun (updated)
Changes (response)
{'ColumnStatisticsTaskRun': {'ComputationType': 'FULL | INCREMENTAL'}}

Get the associated metadata/information for a task run, given a task run ID.

See also: AWS API Documentation

Request Syntax

client.get_column_statistics_task_run(
    ColumnStatisticsTaskRunId='string'
)
type ColumnStatisticsTaskRunId

string

param ColumnStatisticsTaskRunId

[REQUIRED]

The identifier for the particular column statistics task run.

rtype

dict

returns

Response Syntax

{
    'ColumnStatisticsTaskRun': {
        'CustomerId': 'string',
        'ColumnStatisticsTaskRunId': 'string',
        'DatabaseName': 'string',
        'TableName': 'string',
        'ColumnNameList': [
            'string',
        ],
        'CatalogID': 'string',
        'Role': 'string',
        'SampleSize': 123.0,
        'SecurityConfiguration': 'string',
        'NumberOfWorkers': 123,
        'WorkerType': 'string',
        'ComputationType': 'FULL'|'INCREMENTAL',
        'Status': 'STARTING'|'RUNNING'|'SUCCEEDED'|'FAILED'|'STOPPED',
        'CreationTime': datetime(2015, 1, 1),
        'LastUpdated': datetime(2015, 1, 1),
        'StartTime': datetime(2015, 1, 1),
        'EndTime': datetime(2015, 1, 1),
        'ErrorMessage': 'string',
        'DPUSeconds': 123.0
    }
}

Response Structure

  • (dict) --

    • ColumnStatisticsTaskRun (dict) --

      A ColumnStatisticsTaskRun object representing the details of the column stats run.

      • CustomerId (string) --

        The Amazon Web Services account ID.

      • ColumnStatisticsTaskRunId (string) --

        The identifier for the particular column statistics task run.

      • DatabaseName (string) --

        The database where the table resides.

      • TableName (string) --

        The name of the table for which column statistics are generated.

      • ColumnNameList (list) --

        A list of the column names. If none is supplied, all column names for the table will be used by default.

        • (string) --

      • CatalogID (string) --

        The ID of the Data Catalog where the table resides. If none is supplied, the Amazon Web Services account ID is used by default.

      • Role (string) --

        The IAM role that the service assumes to generate statistics.

      • SampleSize (float) --

        The percentage of rows used to generate statistics. If none is supplied, the entire table will be used to generate stats.

      • SecurityConfiguration (string) --

        Name of the security configuration that is used to encrypt CloudWatch logs for the column stats task run.

      • NumberOfWorkers (integer) --

        The number of workers used to generate column statistics. The job is preconfigured to autoscale up to 25 instances.

      • WorkerType (string) --

        The type of workers being used for generating stats. The default is g.1x.

      • ComputationType (string) --

        The type of column statistics computation.

      • Status (string) --

        The status of the task run.

      • CreationTime (datetime) --

        The time that this task was created.

      • LastUpdated (datetime) --

        The last point in time when this task was modified.

      • StartTime (datetime) --

        The start time of the task.

      • EndTime (datetime) --

        The end time of the task.

      • ErrorMessage (string) --

        The error message for the job.

      • DPUSeconds (float) --

        The calculated DPU usage in seconds for all autoscaled workers.
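
Because Status moves through several values, callers often poll this operation until the run finishes. The sketch below shows one possible polling loop. It assumes that SUCCEEDED, FAILED, and STOPPED are terminal while STARTING and RUNNING are not, which the reference above implies but does not state. The function name wait_for_task_run, the 30-second interval, and the attempt limit are arbitrary choices for illustration.

```python
import time
from typing import Any, Callable, Dict

# Assumption: these three values of the documented Status enum are final.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_task_run(
    client,
    run_id: str,
    poll_seconds: float = 30.0,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the run reaches a terminal status, then return its details."""
    for attempt in range(max_attempts):
        response = client.get_column_statistics_task_run(
            ColumnStatisticsTaskRunId=run_id
        )
        run = response["ColumnStatisticsTaskRun"]
        if run["Status"] in TERMINAL_STATUSES:
            return run
        if attempt < max_attempts - 1:
            # Wait before the next check, but not after the final one.
            sleep(poll_seconds)
    raise TimeoutError(
        f"task run {run_id} did not finish after {max_attempts} checks"
    )
```

The returned dict includes the new ComputationType field, so a caller can tell whether a finished run was FULL or INCREMENTAL.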

GetColumnStatisticsTaskRuns (updated)
Changes (response)
{'ColumnStatisticsTaskRuns': {'ComputationType': 'FULL | INCREMENTAL'}}

Retrieves information about all runs associated with the specified table.

See also: AWS API Documentation

Request Syntax

client.get_column_statistics_task_runs(
    DatabaseName='string',
    TableName='string',
    MaxResults=123,
    NextToken='string'
)
type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the database where the table resides.

type TableName

string

param TableName

[REQUIRED]

The name of the table.

type MaxResults

integer

param MaxResults

The maximum size of the response.

type NextToken

string

param NextToken

A continuation token, if this is a continuation call.

rtype

dict

returns

Response Syntax

{
    'ColumnStatisticsTaskRuns': [
        {
            'CustomerId': 'string',
            'ColumnStatisticsTaskRunId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'ColumnNameList': [
                'string',
            ],
            'CatalogID': 'string',
            'Role': 'string',
            'SampleSize': 123.0,
            'SecurityConfiguration': 'string',
            'NumberOfWorkers': 123,
            'WorkerType': 'string',
            'ComputationType': 'FULL'|'INCREMENTAL',
            'Status': 'STARTING'|'RUNNING'|'SUCCEEDED'|'FAILED'|'STOPPED',
            'CreationTime': datetime(2015, 1, 1),
            'LastUpdated': datetime(2015, 1, 1),
            'StartTime': datetime(2015, 1, 1),
            'EndTime': datetime(2015, 1, 1),
            'ErrorMessage': 'string',
            'DPUSeconds': 123.0
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • ColumnStatisticsTaskRuns (list) --

      A list of column statistics task runs.

      • (dict) --

        The object that shows the details of the column stats run.

        • CustomerId (string) --

          The Amazon Web Services account ID.

        • ColumnStatisticsTaskRunId (string) --

          The identifier for the particular column statistics task run.

        • DatabaseName (string) --

          The database where the table resides.

        • TableName (string) --

          The name of the table for which column statistics are generated.

        • ColumnNameList (list) --

          A list of the column names. If none is supplied, all column names for the table will be used by default.

          • (string) --

        • CatalogID (string) --

          The ID of the Data Catalog where the table resides. If none is supplied, the Amazon Web Services account ID is used by default.

        • Role (string) --

          The IAM role that the service assumes to generate statistics.

        • SampleSize (float) --

          The percentage of rows used to generate statistics. If none is supplied, the entire table will be used to generate stats.

        • SecurityConfiguration (string) --

          Name of the security configuration that is used to encrypt CloudWatch logs for the column stats task run.

        • NumberOfWorkers (integer) --

          The number of workers used to generate column statistics. The job is preconfigured to autoscale up to 25 instances.

        • WorkerType (string) --

          The type of workers being used for generating stats. The default is g.1x.

        • ComputationType (string) --

          The type of column statistics computation.

        • Status (string) --

          The status of the task run.

        • CreationTime (datetime) --

          The time that this task was created.

        • LastUpdated (datetime) --

          The last point in time when this task was modified.

        • StartTime (datetime) --

          The start time of the task.

        • EndTime (datetime) --

          The end time of the task.

        • ErrorMessage (string) --

          The error message for the job.

        • DPUSeconds (float) --

          The calculated DPU usage in seconds for all autoscaled workers.

    • NextToken (string) --

      A continuation token, if not all task runs have yet been returned.
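
Since results are paged through NextToken, collecting every run for a table means repeating the call until no token comes back. A minimal sketch of that loop follows. The generator name iter_task_runs and the default page size of 25 are assumptions for illustration, and client is assumed to be a Glue client from boto3.client("glue").

```python
from typing import Any, Dict, Iterator


def iter_task_runs(
    client, database: str, table: str, page_size: int = 25
) -> Iterator[Dict[str, Any]]:
    """Yield every task run for the table, following NextToken across pages."""
    kwargs: Dict[str, Any] = {
        "DatabaseName": database,
        "TableName": table,
        "MaxResults": page_size,  # arbitrary page size chosen for this sketch
    }
    while True:
        page = client.get_column_statistics_task_runs(**kwargs)
        yield from page.get("ColumnStatisticsTaskRuns", [])
        token = page.get("NextToken")
        if not token:
            break
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token
```

For example, the runs could be filtered on the new field with [run for run in iter_task_runs(client, "sales_db", "orders") if run.get("ComputationType") == "INCREMENTAL"], where the database and table names are placeholders.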