AWS Glue

2019/08/08 - AWS Glue - 13 new 16 updated api methods

Changes: Update glue client to latest version

SearchTables (new)

Searches a set of tables based on properties in the table metadata as well as on the parent database. You can search against text or filter conditions.

You can only get tables that you have access to based on the security policies defined in Lake Formation. You need at least read-only access to a table for it to be returned. If you do not have access to all the columns in the table, the columns you cannot access are not searched when the list of tables is returned. If you have access to the columns but not the data in the columns, those columns and their associated metadata are included in the search.

See also: AWS API Documentation

Request Syntax

client.search_tables(
    CatalogId='string',
    NextToken='string',
    Filters=[
        {
            'Key': 'string',
            'Value': 'string',
            'Comparator': 'EQUALS'|'GREATER_THAN'|'LESS_THAN'|'GREATER_THAN_EQUALS'|'LESS_THAN_EQUALS'
        },
    ],
    SearchText='string',
    SortCriteria=[
        {
            'FieldName': 'string',
            'Sort': 'ASC'|'DESC'
        },
    ],
    MaxResults=123
)
type CatalogId

string

param CatalogId

A unique identifier, consisting of account_id/datalake.

type NextToken

string

param NextToken

A continuation token, included if this is a continuation call.

type Filters

list

param Filters

A list of key-value pairs, and a comparator used to filter the search results. Returns all entities matching the predicate.

  • (dict) --

    Defines a property predicate.

    • Key (string) --

      The key of the property.

    • Value (string) --

      The value of the property.

    • Comparator (string) --

      The comparator used to compare this property to others.

type SearchText

string

param SearchText

A string used for a text search.

Specifying a value in quotes filters based on an exact match to the value.

type SortCriteria

list

param SortCriteria

A list of criteria for sorting the results by a field name, in ascending or descending order.

  • (dict) --

    • FieldName (string) --

    • Sort (string) --

type MaxResults

integer

param MaxResults

The maximum number of tables to return in a single response.
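
As a rough illustration of how the request parameters above fit together, the following sketch assembles keyword arguments for client.search_tables. The helper name build_search_request and the example filter key are hypothetical and not part of the API; the comparator and sort values are the ones listed in the request syntax.

```python
# Comparator values copied from the request syntax above.
COMPARATORS = {
    "EQUALS", "GREATER_THAN", "LESS_THAN",
    "GREATER_THAN_EQUALS", "LESS_THAN_EQUALS",
}


def build_search_request(search_text=None, exact=False, filters=None,
                         sort_by=None, descending=False, max_results=None):
    """Assemble keyword arguments for client.search_tables (hypothetical helper).

    filters is an iterable of (key, value, comparator) tuples.
    """
    request = {}
    if search_text is not None:
        # Per the SearchText description, a value in quotes is matched exactly.
        request["SearchText"] = f'"{search_text}"' if exact else search_text
    if filters:
        built = []
        for key, value, comparator in filters:
            if comparator not in COMPARATORS:
                raise ValueError(f"unknown comparator: {comparator}")
            built.append({"Key": key, "Value": value, "Comparator": comparator})
        request["Filters"] = built
    if sort_by is not None:
        request["SortCriteria"] = [
            {"FieldName": sort_by, "Sort": "DESC" if descending else "ASC"}
        ]
    if max_results is not None:
        request["MaxResults"] = max_results
    return request


# Usage, assuming `client` is a Glue client created elsewhere:
# response = client.search_tables(**build_search_request("orders", exact=True))
```

Only the parameters that are actually supplied end up in the request, so optional arguments are left to the service defaults.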

rtype

dict

returns

Response Syntax

{
    'NextToken': 'string',
    'TableList': [
        {
            'Name': 'string',
            'DatabaseName': 'string',
            'Description': 'string',
            'Owner': 'string',
            'CreateTime': datetime(2015, 1, 1),
            'UpdateTime': datetime(2015, 1, 1),
            'LastAccessTime': datetime(2015, 1, 1),
            'LastAnalyzedTime': datetime(2015, 1, 1),
            'Retention': 123,
            'StorageDescriptor': {
                'Columns': [
                    {
                        'Name': 'string',
                        'Type': 'string',
                        'Comment': 'string',
                        'Parameters': {
                            'string': 'string'
                        }
                    },
                ],
                'Location': 'string',
                'InputFormat': 'string',
                'OutputFormat': 'string',
                'Compressed': True|False,
                'NumberOfBuckets': 123,
                'SerdeInfo': {
                    'Name': 'string',
                    'SerializationLibrary': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
                'BucketColumns': [
                    'string',
                ],
                'SortColumns': [
                    {
                        'Column': 'string',
                        'SortOrder': 123
                    },
                ],
                'Parameters': {
                    'string': 'string'
                },
                'SkewedInfo': {
                    'SkewedColumnNames': [
                        'string',
                    ],
                    'SkewedColumnValues': [
                        'string',
                    ],
                    'SkewedColumnValueLocationMaps': {
                        'string': 'string'
                    }
                },
                'StoredAsSubDirectories': True|False
            },
            'PartitionKeys': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'ViewOriginalText': 'string',
            'ViewExpandedText': 'string',
            'TableType': 'string',
            'Parameters': {
                'string': 'string'
            },
            'CreatedBy': 'string',
            'IsRegisteredWithLakeFormation': True|False
        },
    ]
}

Response Structure

  • (dict) --

    • NextToken (string) --

      A continuation token, present if the current list segment is not the last.

    • TableList (list) --

      A list of the requested Table objects. The SearchTables response returns only the tables that you have access to.

      • (dict) --

        Represents a collection of related data organized in columns and rows.

        • Name (string) --

          The table name. For Hive compatibility, this must be entirely lowercase.

        • DatabaseName (string) --

          The name of the database where the table metadata resides. For Hive compatibility, this must be all lowercase.

        • Description (string) --

          A description of the table.

        • Owner (string) --

          The owner of the table.

        • CreateTime (datetime) --

          The time when the table definition was created in the Data Catalog.

        • UpdateTime (datetime) --

          The last time that the table was updated.

        • LastAccessTime (datetime) --

          The last time that the table was accessed. This is usually taken from HDFS, and might not be reliable.

        • LastAnalyzedTime (datetime) --

          The last time that column statistics were computed for this table.

        • Retention (integer) --

          The retention time for this table.

        • StorageDescriptor (dict) --

          A storage descriptor containing information about the physical storage of this table.

          • Columns (list) --

            A list of the Columns in the table.

            • (dict) --

              A column in a Table.

              • Name (string) --

                The name of the Column.

              • Type (string) --

                The data type of the Column.

              • Comment (string) --

                A free-form text comment.

              • Parameters (dict) --

                These key-value pairs define properties associated with the column.

                • (string) --

                  • (string) --

          • Location (string) --

            The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

          • InputFormat (string) --

            The input format: SequenceFileInputFormat (binary), TextInputFormat, or a custom format.

          • OutputFormat (string) --

            The output format: SequenceFileOutputFormat (binary), IgnoreKeyTextOutputFormat, or a custom format.

          • Compressed (boolean) --

            True if the data in the table is compressed, or False if not.

          • NumberOfBuckets (integer) --

            Must be specified if the table contains any dimension columns.

          • SerdeInfo (dict) --

            The serialization/deserialization (SerDe) information.

            • Name (string) --

              Name of the SerDe.

            • SerializationLibrary (string) --

              Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.

            • Parameters (dict) --

              These key-value pairs define initialization parameters for the SerDe.

              • (string) --

                • (string) --

          • BucketColumns (list) --

            A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

            • (string) --

          • SortColumns (list) --

            A list specifying the sort order of each bucket in the table.

            • (dict) --

              Specifies the sort order of a sorted column.

              • Column (string) --

                The name of the column.

              • SortOrder (integer) --

                Indicates that the column is sorted in ascending order (== 1), or in descending order (== 0).

          • Parameters (dict) --

            The user-supplied properties in key-value form.

            • (string) --

              • (string) --

          • SkewedInfo (dict) --

            The information about values that appear frequently in a column (skewed values).

            • SkewedColumnNames (list) --

              A list of names of columns that contain skewed values.

              • (string) --

            • SkewedColumnValues (list) --

              A list of values that appear so frequently as to be considered skewed.

              • (string) --

            • SkewedColumnValueLocationMaps (dict) --

              A mapping of skewed values to the columns that contain them.

              • (string) --

                • (string) --

          • StoredAsSubDirectories (boolean) --

            True if the table data is stored in subdirectories, or False if not.

        • PartitionKeys (list) --

          A list of columns by which the table is partitioned. Only primitive types are supported as partition keys.

          When you create a table used by Amazon Athena, and you do not specify any PartitionKeys, you must at least set the value of PartitionKeys to an empty list. For example:

          "PartitionKeys": []

          • (dict) --

            A column in a Table.

            • Name (string) --

              The name of the Column.

            • Type (string) --

              The data type of the Column.

            • Comment (string) --

              A free-form text comment.

            • Parameters (dict) --

              These key-value pairs define properties associated with the column.

              • (string) --

                • (string) --

        • ViewOriginalText (string) --

          If the table is a view, the original text of the view; otherwise null.

        • ViewExpandedText (string) --

          If the table is a view, the expanded text of the view; otherwise null.

        • TableType (string) --

          The type of this table (EXTERNAL_TABLE, VIRTUAL_VIEW, etc.).

        • Parameters (dict) --

          These key-value pairs define properties associated with the table.

          • (string) --

            • (string) --

        • CreatedBy (string) --

          The person or entity who created the table.

        • IsRegisteredWithLakeFormation (boolean) --

          Indicates whether the table has been registered with AWS Lake Formation.
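
Because a response carries a NextToken whenever the current list segment is not the last, retrieving every match means repeating the call with that token. A minimal sketch, assuming client is a Glue client created elsewhere; the generator name iter_tables is hypothetical:

```python
def iter_tables(client, **request):
    """Yield every table returned by search_tables, following NextToken."""
    kwargs = dict(request)
    while True:
        response = client.search_tables(**kwargs)
        yield from response.get("TableList", [])
        token = response.get("NextToken")
        if not token:
            # No continuation token: this was the last list segment.
            break
        kwargs["NextToken"] = token


# Usage, assuming `client` is a Glue client:
# for table in iter_tables(client, SearchText="orders", MaxResults=100):
#     print(table["DatabaseName"], table["Name"])
```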

CancelMLTaskRun (new)

Cancels (stops) a task run. Machine learning task runs are asynchronous tasks that AWS Glue runs on your behalf as part of various machine learning workflows. You can cancel a machine learning task run at any time by calling CancelMLTaskRun with the TransformId of the task run's parent transform and the task run's TaskRunId.

See also: AWS API Documentation

Request Syntax

client.cancel_ml_task_run(
    TransformId='string',
    TaskRunId='string'
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the machine learning transform.

type TaskRunId

string

param TaskRunId

[REQUIRED]

A unique identifier for the task run.

rtype

dict

returns

Response Syntax

{
    'TransformId': 'string',
    'TaskRunId': 'string',
    'Status': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT'
}

Response Structure

  • (dict) --

    • TransformId (string) --

      The unique identifier of the machine learning transform.

    • TaskRunId (string) --

      The unique identifier for the task run.

    • Status (string) --

      The status for this run.
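
The sketch below shows one way the call might be wrapped so that a cancellation is only requested for a run that is still in progress. Treating STARTING and RUNNING as the only cancellable statuses is an assumption, not something this reference states, and the helper name cancel_if_active is hypothetical.

```python
# Assumption: only these statuses describe a run that is still worth cancelling.
ACTIVE_STATUSES = {"STARTING", "RUNNING"}


def cancel_if_active(client, transform_id, task_run_id):
    """Cancel a task run if it is still active; return the resulting status."""
    run = client.get_ml_task_run(
        TransformId=transform_id, TaskRunId=task_run_id
    )
    if run["Status"] not in ACTIVE_STATUSES:
        # Nothing to cancel; report the status as found.
        return run["Status"]
    response = client.cancel_ml_task_run(
        TransformId=transform_id, TaskRunId=task_run_id
    )
    return response["Status"]
```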

GetMLTaskRun (new)

Gets details for a specific task run on a machine learning transform. Machine learning task runs are asynchronous tasks that AWS Glue runs on your behalf as part of various machine learning workflows. You can check the stats of any task run by calling GetMLTaskRun with the TaskRunId and its parent transform's TransformId.

See also: AWS API Documentation

Request Syntax

client.get_ml_task_run(
    TransformId='string',
    TaskRunId='string'
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the machine learning transform.

type TaskRunId

string

param TaskRunId

[REQUIRED]

The unique identifier of the task run.

rtype

dict

returns

Response Syntax

{
    'TransformId': 'string',
    'TaskRunId': 'string',
    'Status': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT',
    'LogGroupName': 'string',
    'Properties': {
        'TaskType': 'EVALUATION'|'LABELING_SET_GENERATION'|'IMPORT_LABELS'|'EXPORT_LABELS'|'FIND_MATCHES',
        'ImportLabelsTaskRunProperties': {
            'InputS3Path': 'string',
            'Replace': True|False
        },
        'ExportLabelsTaskRunProperties': {
            'OutputS3Path': 'string'
        },
        'LabelingSetGenerationTaskRunProperties': {
            'OutputS3Path': 'string'
        },
        'FindMatchesTaskRunProperties': {
            'JobId': 'string',
            'JobName': 'string',
            'JobRunId': 'string'
        }
    },
    'ErrorString': 'string',
    'StartedOn': datetime(2015, 1, 1),
    'LastModifiedOn': datetime(2015, 1, 1),
    'CompletedOn': datetime(2015, 1, 1),
    'ExecutionTime': 123
}

Response Structure

  • (dict) --

    • TransformId (string) --

      The unique identifier of the machine learning transform.

    • TaskRunId (string) --

      The unique run identifier associated with this run.

    • Status (string) --

      The status for this task run.

    • LogGroupName (string) --

      The name of the log group that is associated with the task run.

    • Properties (dict) --

      The properties that are associated with the task run.

      • TaskType (string) --

        The type of task run.

      • ImportLabelsTaskRunProperties (dict) --

        The configuration properties for an importing labels task run.

        • InputS3Path (string) --

          The Amazon Simple Storage Service (Amazon S3) path from where you will import the labels.

        • Replace (boolean) --

          Indicates whether to overwrite your existing labels.

      • ExportLabelsTaskRunProperties (dict) --

        The configuration properties for an exporting labels task run.

        • OutputS3Path (string) --

          The Amazon Simple Storage Service (Amazon S3) path where you will export the labels.

      • LabelingSetGenerationTaskRunProperties (dict) --

        The configuration properties for a labeling set generation task run.

        • OutputS3Path (string) --

          The Amazon Simple Storage Service (Amazon S3) path where you will generate the labeling set.

      • FindMatchesTaskRunProperties (dict) --

        The configuration properties for a find matches task run.

        • JobId (string) --

          The job ID for the Find Matches task run.

        • JobName (string) --

          The name assigned to the job for the Find Matches task run.

        • JobRunId (string) --

          The job run ID for the Find Matches task run.

    • ErrorString (string) --

      The error string that is associated with the task run.

    • StartedOn (datetime) --

      The date and time when this task run started.

    • LastModifiedOn (datetime) --

      The date and time when this task run was last modified.

    • CompletedOn (datetime) --

      The date and time when this task run was completed.

    • ExecutionTime (integer) --

      The amount of time (in seconds) that the task run consumed resources.
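
Since task runs are asynchronous, a caller usually has to poll GetMLTaskRun until the run finishes. The following is a minimal polling sketch, not a definitive implementation: the set of final statuses, the polling interval, and the helper name wait_for_task_run are all assumptions.

```python
import time

# Assumption: these statuses are treated as final. The response syntax lists
# the possible values but does not say which of them are terminal.
FINAL_STATUSES = {"STOPPED", "SUCCEEDED", "FAILED", "TIMEOUT"}


def wait_for_task_run(client, transform_id, task_run_id,
                      poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_ml_task_run until the run reaches a final status."""
    for _ in range(max_polls):
        run = client.get_ml_task_run(
            TransformId=transform_id, TaskRunId=task_run_id
        )
        if run["Status"] in FINAL_STATUSES:
            return run
        sleep(poll_seconds)
    raise TimeoutError(
        f"task run {task_run_id} not finished after {max_polls} polls"
    )
```

The sleep function is injectable so the loop can be exercised without real delays; on a FAILED run, the returned dict's ErrorString field is the place to look for details.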

UpdateMLTransform (new)

Updates an existing machine learning transform. Call this operation to tune the algorithm parameters to achieve better results.

After calling this operation, you can call the StartMLEvaluationTaskRun operation to assess how well your new parameters achieved your goals (such as improving the quality of your machine learning transform, or making it more cost-effective).

See also: AWS API Documentation

Request Syntax

client.update_ml_transform(
    TransformId='string',
    Name='string',
    Description='string',
    Parameters={
        'TransformType': 'FIND_MATCHES',
        'FindMatchesParameters': {
            'PrimaryKeyColumnName': 'string',
            'PrecisionRecallTradeoff': 123.0,
            'AccuracyCostTradeoff': 123.0,
            'EnforceProvidedLabels': True|False
        }
    },
    Role='string',
    MaxCapacity=123.0,
    WorkerType='Standard'|'G.1X'|'G.2X',
    NumberOfWorkers=123,
    Timeout=123,
    MaxRetries=123
)
type TransformId

string

param TransformId

[REQUIRED]

A unique identifier that was generated when the transform was created.

type Name

string

param Name

The unique name that you gave the transform when you created it.

type Description

string

param Description

A description of the transform. The default is an empty string.

type Parameters

dict

param Parameters

The configuration parameters that are specific to the transform type (algorithm) used. Conditionally dependent on the transform type.

  • TransformType (string) -- [REQUIRED]

    The type of machine learning transform.

    For information about the types of machine learning transforms, see Creating Machine Learning Transforms.

  • FindMatchesParameters (dict) --

    The parameters for the find matches algorithm.

    • PrimaryKeyColumnName (string) --

      The name of a column that uniquely identifies rows in the source table. Used to help identify matching records.

    • PrecisionRecallTradeoff (float) --

      The value selected when tuning your transform for a balance between precision and recall. A value of 0.5 means no preference; a value of 1.0 means a bias purely for precision, and a value of 0.0 means a bias for recall. Because this is a tradeoff, choosing values close to 1.0 means very low recall, and choosing values close to 0.0 results in very low precision.

      The precision metric indicates how often your model is correct when it predicts a match.

      The recall metric indicates how often your model predicts a match when there is an actual match.

    • AccuracyCostTradeoff (float) --

      The value that is selected when tuning your transform for a balance between accuracy and cost. A value of 0.5 means that the system balances accuracy and cost concerns. A value of 1.0 means a bias purely for accuracy, which typically results in a higher cost, sometimes substantially higher. A value of 0.0 means a bias purely for cost, which results in a less accurate FindMatches transform, sometimes with unacceptable accuracy.

      Accuracy measures how well the transform finds true positives and true negatives. Increasing accuracy requires more machine resources and cost. But it also results in increased recall.

      Cost measures how many compute resources, and thus money, are consumed to run the transform.

    • EnforceProvidedLabels (boolean) --

      The value to switch on or off to force the output to match the provided labels from users. If the value is True, the find matches transform forces the output to match the provided labels. The results override the normal conflation results. If the value is False, the find matches transform does not ensure all the labels provided are respected, and the results rely on the trained model.

      Note that setting this value to true may increase the conflation execution time.
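
To make the precision and recall descriptions above concrete, the sketch below computes both from confusion-matrix counts using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). It illustrates the metrics only; it is not a statement of how AWS Glue computes them internally, and the function name is hypothetical.

```python
def precision_recall(num_true_positives, num_false_positives,
                     num_false_negatives):
    """Return (precision, recall) using the standard definitions."""
    predicted_matches = num_true_positives + num_false_positives
    actual_matches = num_true_positives + num_false_negatives
    # Precision: of the matches the model predicted, how many were correct.
    precision = (num_true_positives / predicted_matches
                 if predicted_matches else 0.0)
    # Recall: of the actual matches, how many the model predicted.
    recall = (num_true_positives / actual_matches
              if actual_matches else 0.0)
    return precision, recall
```

For example, with 8 true positives, 2 false positives, and 8 false negatives, precision is 8 / 10 and recall is 8 / 16: the model is usually right when it predicts a match but finds only half of the actual matches.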

type Role

string

param Role

The name or Amazon Resource Name (ARN) of the IAM role with the required permissions.

type MaxCapacity

float

param MaxCapacity

The number of AWS Glue data processing units (DPUs) that are allocated to task runs for this transform. You can allocate from 2 to 100 DPUs; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

When the WorkerType field is set to a value other than Standard, the MaxCapacity field is set automatically and becomes read-only.
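
As a quick worked example of the DPU definition above (4 vCPUs and 16 GB of memory per DPU, 2 to 100 DPUs allowed), the following sketch converts a MaxCapacity value into total compute resources. The function name dpu_resources is hypothetical.

```python
VCPUS_PER_DPU = 4        # from the MaxCapacity description
MEMORY_GB_PER_DPU = 16   # from the MaxCapacity description


def dpu_resources(max_capacity):
    """Translate a MaxCapacity value in DPUs into vCPUs and memory."""
    if not 2 <= max_capacity <= 100:
        raise ValueError("MaxCapacity must be between 2 and 100 DPUs")
    return {
        "vcpus": max_capacity * VCPUS_PER_DPU,
        "memory_gb": max_capacity * MEMORY_GB_PER_DPU,
    }
```

With the default of 10 DPUs this works out to 40 vCPUs and 160 GB of memory.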

type WorkerType

string

param WorkerType

The type of predefined worker that is allocated when this task runs. Accepts a value of Standard, G.1X, or G.2X.

  • For the Standard worker type, each worker provides 4 vCPUs, 16 GB of memory, and a 50 GB disk, with 2 executors per worker.

  • For the G.1X worker type, each worker provides 4 vCPUs, 16 GB of memory, and a 64 GB disk, with 1 executor per worker.

  • For the G.2X worker type, each worker provides 8 vCPUs, 32 GB of memory, and a 128 GB disk, with 1 executor per worker.

type NumberOfWorkers

integer

param NumberOfWorkers

The number of workers of a defined WorkerType that are allocated when this task runs.

type Timeout

integer

param Timeout

The timeout for a task run for this transform in minutes. This is the maximum time that a task run for this transform can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).

type MaxRetries

integer

param MaxRetries

The maximum number of times to retry a task for this transform after a task run fails.

rtype

dict

returns

Response Syntax

{
    'TransformId': 'string'
}

Response Structure

  • (dict) --

    • TransformId (string) --

      The unique identifier for the transform that was updated.
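
The parameter rules described in this section can be checked before the request is sent. The sketch below is one possible way to do that, under two assumptions that this reference implies but does not state as hard limits: that both tradeoff values must lie between 0.0 and 1.0, and that MaxCapacity should not be passed together with a non-Standard WorkerType. The helper name build_update_request is hypothetical.

```python
def build_update_request(transform_id, precision_recall=None,
                         accuracy_cost=None, worker_type=None,
                         number_of_workers=None, max_capacity=None):
    """Assemble keyword arguments for client.update_ml_transform."""
    tradeoffs = {
        "PrecisionRecallTradeoff": precision_recall,
        "AccuracyCostTradeoff": accuracy_cost,
    }
    for name, value in tradeoffs.items():
        # Assumption: the documented endpoints 0.0 and 1.0 bound the range.
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    if max_capacity is not None and worker_type not in (None, "Standard"):
        # MaxCapacity is set automatically for G.1X and G.2X workers.
        raise ValueError("do not pass MaxCapacity with G.1X or G.2X workers")

    request = {"TransformId": transform_id}
    find_matches = {k: v for k, v in tradeoffs.items() if v is not None}
    if find_matches:
        request["Parameters"] = {
            "TransformType": "FIND_MATCHES",  # required inside Parameters
            "FindMatchesParameters": find_matches,
        }
    optional = {
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
        "MaxCapacity": max_capacity,
    }
    request.update({k: v for k, v in optional.items() if v is not None})
    return request


# Usage, assuming `client` is a Glue client:
# client.update_ml_transform(**build_update_request("tfm-123", precision_recall=0.9))
```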

GetMLTaskRuns (new)

Gets a list of runs for a machine learning transform. Machine learning task runs are asynchronous tasks that AWS Glue runs on your behalf as part of various machine learning workflows. You can get a sortable, filterable list of machine learning task runs by calling GetMLTaskRuns with their parent transform's TransformId and other optional parameters as documented in this section.

This operation returns a list of historic runs and must be paginated.

See also: AWS API Documentation

Request Syntax

client.get_ml_task_runs(
    TransformId='string',
    NextToken='string',
    MaxResults=123,
    Filter={
        'TaskRunType': 'EVALUATION'|'LABELING_SET_GENERATION'|'IMPORT_LABELS'|'EXPORT_LABELS'|'FIND_MATCHES',
        'Status': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT',
        'StartedBefore': datetime(2015, 1, 1),
        'StartedAfter': datetime(2015, 1, 1)
    },
    Sort={
        'Column': 'TASK_RUN_TYPE'|'STATUS'|'STARTED',
        'SortDirection': 'DESCENDING'|'ASCENDING'
    }
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the machine learning transform.

type NextToken

string

param NextToken

A token for pagination of the results. The default is empty.

type MaxResults

integer

param MaxResults

The maximum number of results to return.

type Filter

dict

param Filter

The filter criteria, in the TaskRunFilterCriteria structure, for the task run.

  • TaskRunType (string) --

    The type of task run.

  • Status (string) --

    The current status of the task run.

  • StartedBefore (datetime) --

    Filter on task runs started before this date.

  • StartedAfter (datetime) --

    Filter on task runs started after this date.

type Sort

dict

param Sort

The sorting criteria, in the TaskRunSortCriteria structure, for the task run.

  • Column (string) -- [REQUIRED]

    The column to be used to sort the list of task runs for the machine learning transform.

  • SortDirection (string) -- [REQUIRED]

    The sort direction to be used to sort the list of task runs for the machine learning transform.

rtype

dict

returns

Response Syntax

{
    'TaskRuns': [
        {
            'TransformId': 'string',
            'TaskRunId': 'string',
            'Status': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT',
            'LogGroupName': 'string',
            'Properties': {
                'TaskType': 'EVALUATION'|'LABELING_SET_GENERATION'|'IMPORT_LABELS'|'EXPORT_LABELS'|'FIND_MATCHES',
                'ImportLabelsTaskRunProperties': {
                    'InputS3Path': 'string',
                    'Replace': True|False
                },
                'ExportLabelsTaskRunProperties': {
                    'OutputS3Path': 'string'
                },
                'LabelingSetGenerationTaskRunProperties': {
                    'OutputS3Path': 'string'
                },
                'FindMatchesTaskRunProperties': {
                    'JobId': 'string',
                    'JobName': 'string',
                    'JobRunId': 'string'
                }
            },
            'ErrorString': 'string',
            'StartedOn': datetime(2015, 1, 1),
            'LastModifiedOn': datetime(2015, 1, 1),
            'CompletedOn': datetime(2015, 1, 1),
            'ExecutionTime': 123
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • TaskRuns (list) --

      A list of task runs that are associated with the transform.

      • (dict) --

        A task run that is associated with the machine learning transform.

        • TransformId (string) --

          The unique identifier for the transform.

        • TaskRunId (string) --

          The unique identifier for this task run.

        • Status (string) --

          The current status of the requested task run.

        • LogGroupName (string) --

          The name of the log group for secure logging that is associated with this task run.

        • Properties (dict) --

          Specifies configuration properties associated with this task run.

          • TaskType (string) --

            The type of task run.

          • ImportLabelsTaskRunProperties (dict) --

            The configuration properties for an importing labels task run.

            • InputS3Path (string) --

              The Amazon Simple Storage Service (Amazon S3) path from where you will import the labels.

            • Replace (boolean) --

              Indicates whether to overwrite your existing labels.

          • ExportLabelsTaskRunProperties (dict) --

            The configuration properties for an exporting labels task run.

            • OutputS3Path (string) --

              The Amazon Simple Storage Service (Amazon S3) path where you will export the labels.

          • LabelingSetGenerationTaskRunProperties (dict) --

            The configuration properties for a labeling set generation task run.

            • OutputS3Path (string) --

              The Amazon Simple Storage Service (Amazon S3) path where you will generate the labeling set.

          • FindMatchesTaskRunProperties (dict) --

            The configuration properties for a find matches task run.

            • JobId (string) --

              The job ID for the Find Matches task run.

            • JobName (string) --

              The name assigned to the job for the Find Matches task run.

            • JobRunId (string) --

              The job run ID for the Find Matches task run.

        • ErrorString (string) --

          The error string associated with this task run.

        • StartedOn (datetime) --

          The date and time that this task run started.

        • LastModifiedOn (datetime) --

          The last point in time that the requested task run was updated.

        • CompletedOn (datetime) --

          The last point in time that the requested task run was completed.

        • ExecutionTime (integer) --

          The amount of time (in seconds) that the task run consumed resources.

    • NextToken (string) --

      A pagination token, if more results are available.
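
The Filter, Sort, and NextToken parameters can be combined as in the sketch below, which collects all failed runs started after a given time, newest first. The helper name list_failed_runs and the choice of filter are illustrative assumptions; client is assumed to be a Glue client created elsewhere.

```python
def list_failed_runs(client, transform_id, started_after):
    """Collect all FAILED task runs started after a datetime, newest first."""
    kwargs = {
        "TransformId": transform_id,
        "Filter": {"Status": "FAILED", "StartedAfter": started_after},
        # Column and SortDirection are both required inside Sort.
        "Sort": {"Column": "STARTED", "SortDirection": "DESCENDING"},
    }
    runs = []
    while True:
        page = client.get_ml_task_runs(**kwargs)
        runs.extend(page.get("TaskRuns", []))
        token = page.get("NextToken")
        if not token:
            return runs
        kwargs["NextToken"] = token
```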

StartMLEvaluationTaskRun (new)

Starts a task to estimate the quality of the transform.

When you provide label sets as examples of truth, AWS Glue machine learning uses some of those examples to learn from them. The rest of the labels are used as a test to estimate quality.

Returns a unique identifier for the run. You can call GetMLTaskRun to get more information about the stats of the EvaluationTaskRun.

See also: AWS API Documentation

Request Syntax

client.start_ml_evaluation_task_run(
    TransformId='string'
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the machine learning transform.

rtype

dict

returns

Response Syntax

{
    'TaskRunId': 'string'
}

Response Structure

  • (dict) --

    • TaskRunId (string) --

      The unique identifier associated with this run.
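
A typical use of this operation is to re-evaluate a transform right after tuning it with UpdateMLTransform. The sketch below chains the two calls and returns the evaluation TaskRunId; the helper name retune_and_evaluate is hypothetical, and client is assumed to be a Glue client created elsewhere.

```python
def retune_and_evaluate(client, transform_id, precision_recall_tradeoff):
    """Update the precision/recall tradeoff, then start an evaluation run."""
    client.update_ml_transform(
        TransformId=transform_id,
        Parameters={
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrecisionRecallTradeoff": precision_recall_tradeoff,
            },
        },
    )
    response = client.start_ml_evaluation_task_run(TransformId=transform_id)
    # The returned identifier can be passed to get_ml_task_run later.
    return response["TaskRunId"]
```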

StartExportLabelsTaskRun (new)

Begins an asynchronous task to export all labeled data for a particular transform. This task is the only label-related API call that is not part of the typical active learning workflow. You typically use StartExportLabelsTaskRun when you want to work with all of your existing labels at the same time, such as when you want to remove or change labels that were previously submitted as truth. This API operation accepts the TransformId whose labels you want to export and an Amazon Simple Storage Service (Amazon S3) path to export the labels to. The operation returns a TaskRunId. You can check on the status of your task run by calling the GetMLTaskRun API.

See also: AWS API Documentation

Request Syntax

client.start_export_labels_task_run(
    TransformId='string',
    OutputS3Path='string'
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the machine learning transform.

type OutputS3Path

string

param OutputS3Path

[REQUIRED]

The Amazon S3 path where you export the labels.

rtype

dict

returns

Response Syntax

{
    'TaskRunId': 'string'
}

Response Structure

  • (dict) --

    • TaskRunId (string) --

      The unique identifier for the task run.
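
A minimal sketch of starting an export follows. The check that the path begins with s3:// is an assumption about the expected path format rather than a documented rule, and the helper name export_labels is hypothetical.

```python
def export_labels(client, transform_id, output_s3_path):
    """Start an export-labels task run and return its TaskRunId."""
    # Assumption: OutputS3Path is an S3 URI of the form s3://bucket/prefix.
    if not output_s3_path.startswith("s3://"):
        raise ValueError("output_s3_path should look like s3://bucket/prefix")
    response = client.start_export_labels_task_run(
        TransformId=transform_id,
        OutputS3Path=output_s3_path,
    )
    return response["TaskRunId"]
```

The returned TaskRunId can then be passed to GetMLTaskRun to check on the export.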

DeleteMLTransform (new)

Deletes an AWS Glue machine learning transform. Machine learning transforms are a special type of transform that uses machine learning to learn the details of the transformation to be performed by learning from examples provided by humans. These transformations are then saved by AWS Glue. If you no longer need a transform, you can delete it by calling DeleteMLTransform. However, any AWS Glue jobs that still reference the deleted transform will no longer succeed.

See also: AWS API Documentation

Request Syntax

client.delete_ml_transform(
    TransformId='string'
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the transform to delete.

rtype

dict

returns

Response Syntax

{
    'TransformId': 'string'
}

Response Structure

  • (dict) --

    • TransformId (string) --

      The unique identifier of the transform that was deleted.

GetMLTransform (new)

Gets an AWS Glue machine learning transform artifact and all its corresponding metadata. Machine learning transforms are a special type of transform that uses machine learning to learn the details of the transformation to be performed by learning from examples provided by humans. These transformations are then saved by AWS Glue. You can retrieve their metadata by calling GetMLTransform.

See also: AWS API Documentation

Request Syntax

client.get_ml_transform(
    TransformId='string'
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the transform, generated at the time that the transform was created.

rtype

dict

returns

Response Syntax

{
    'TransformId': 'string',
    'Name': 'string',
    'Description': 'string',
    'Status': 'NOT_READY'|'READY'|'DELETING',
    'CreatedOn': datetime(2015, 1, 1),
    'LastModifiedOn': datetime(2015, 1, 1),
    'InputRecordTables': [
        {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string'
        },
    ],
    'Parameters': {
        'TransformType': 'FIND_MATCHES',
        'FindMatchesParameters': {
            'PrimaryKeyColumnName': 'string',
            'PrecisionRecallTradeoff': 123.0,
            'AccuracyCostTradeoff': 123.0,
            'EnforceProvidedLabels': True|False
        }
    },
    'EvaluationMetrics': {
        'TransformType': 'FIND_MATCHES',
        'FindMatchesMetrics': {
            'AreaUnderPRCurve': 123.0,
            'Precision': 123.0,
            'Recall': 123.0,
            'F1': 123.0,
            'ConfusionMatrix': {
                'NumTruePositives': 123,
                'NumFalsePositives': 123,
                'NumTrueNegatives': 123,
                'NumFalseNegatives': 123
            }
        }
    },
    'LabelCount': 123,
    'Schema': [
        {
            'Name': 'string',
            'DataType': 'string'
        },
    ],
    'Role': 'string',
    'MaxCapacity': 123.0,
    'WorkerType': 'Standard'|'G.1X'|'G.2X',
    'NumberOfWorkers': 123,
    'Timeout': 123,
    'MaxRetries': 123
}

Response Structure

  • (dict) --

    • TransformId (string) --

      The unique identifier of the transform, generated at the time that the transform was created.

    • Name (string) --

      The unique name given to the transform when it was created.

    • Description (string) --

      A description of the transform.

    • Status (string) --

      The last known status of the transform (to indicate whether it can be used or not). One of "NOT_READY", "READY", or "DELETING".

    • CreatedOn (datetime) --

      The date and time when the transform was created.

    • LastModifiedOn (datetime) --

      The date and time when the transform was last modified.

    • InputRecordTables (list) --

      A list of AWS Glue table definitions used by the transform.

      • (dict) --

        The database and table in the AWS Glue Data Catalog that is used for input or output data.

        • DatabaseName (string) --

          A database name in the AWS Glue Data Catalog.

        • TableName (string) --

          A table name in the AWS Glue Data Catalog.

        • CatalogId (string) --

          A unique identifier for the AWS Glue Data Catalog.

        • ConnectionName (string) --

          The name of the connection to the AWS Glue Data Catalog.

    • Parameters (dict) --

      The configuration parameters that are specific to the algorithm used.

      • TransformType (string) --

        The type of machine learning transform.

        For information about the types of machine learning transforms, see Creating Machine Learning Transforms.

      • FindMatchesParameters (dict) --

        The parameters for the find matches algorithm.

        • PrimaryKeyColumnName (string) --

          The name of a column that uniquely identifies rows in the source table. Used to help identify matching records.

        • PrecisionRecallTradeoff (float) --

          The value selected when tuning your transform for a balance between precision and recall. A value of 0.5 means no preference; a value of 1.0 means a bias purely for precision, and a value of 0.0 means a bias for recall. Because this is a tradeoff, choosing values close to 1.0 means very low recall, and choosing values close to 0.0 results in very low precision.

          The precision metric indicates how often your model is correct when it predicts a match.

          The recall metric indicates how often your model predicts a match when there is an actual match.

        • AccuracyCostTradeoff (float) --

          The value that is selected when tuning your transform for a balance between accuracy and cost. A value of 0.5 means that the system balances accuracy and cost concerns. A value of 1.0 means a bias purely for accuracy, which typically results in a higher cost, sometimes substantially higher. A value of 0.0 means a bias purely for cost, which results in a less accurate FindMatches transform, sometimes with unacceptable accuracy.

          Accuracy measures how well the transform finds true positives and true negatives. Increasing accuracy requires more machine resources and cost. But it also results in increased recall.

          Cost measures how many compute resources, and thus money, are consumed to run the transform.

        • EnforceProvidedLabels (boolean) --

          The value to switch on or off to force the output to match the provided labels from users. If the value is True , the find matches transform forces the output to match the provided labels. The results override the normal conflation results. If the value is False , the find matches transform does not ensure all the labels provided are respected, and the results rely on the trained model.

          Note that setting this value to true may increase the conflation execution time.

    • EvaluationMetrics (dict) --

      The latest evaluation metrics.

      • TransformType (string) --

        The type of machine learning transform.

      • FindMatchesMetrics (dict) --

        The evaluation metrics for the find matches algorithm.

        • AreaUnderPRCurve (float) --

          The area under the precision/recall curve (AUPRC) is a single number measuring the overall quality of the transform, that is independent of the choice made for precision vs. recall. Higher values indicate that you have a more attractive precision vs. recall tradeoff.

          For more information, see Precision and recall in Wikipedia.

        • Precision (float) --

          The precision metric indicates how often your transform is correct when it predicts a match. Specifically, it measures the fraction of predicted matches that are true positives.

          For more information, see Precision and recall in Wikipedia.

        • Recall (float) --

          The recall metric indicates how often your transform predicts a match when there is an actual match. Specifically, it measures the fraction of actual matches in the source data that the transform finds as true positives.

          For more information, see Precision and recall in Wikipedia.

        • F1 (float) --

          The maximum F1 metric indicates the transform's accuracy between 0 and 1, where 1 is the best accuracy.

          For more information, see F1 score in Wikipedia.

        • ConfusionMatrix (dict) --

          The confusion matrix shows you what your transform is predicting accurately and what types of errors it is making.

          For more information, see Confusion matrix in Wikipedia.

          • NumTruePositives (integer) --

            The number of matches in the data that the transform correctly found, in the confusion matrix for your transform.

          • NumFalsePositives (integer) --

            The number of nonmatches in the data that the transform incorrectly classified as a match, in the confusion matrix for your transform.

          • NumTrueNegatives (integer) --

            The number of nonmatches in the data that the transform correctly rejected, in the confusion matrix for your transform.

          • NumFalseNegatives (integer) --

            The number of matches in the data that the transform didn't find, in the confusion matrix for your transform.

    • LabelCount (integer) --

      The number of labels available for this transform.

    • Schema (list) --

      The Map<Column, Type> object that represents the schema that this transform accepts. Has an upper bound of 100 columns.

      • (dict) --

        A key-value pair representing a column and data type that this transform can run against. The Schema parameter of the MLTransform may contain up to 100 of these structures.

        • Name (string) --

          The name of the column.

        • DataType (string) --

          The type of data in the column.

    • Role (string) --

      The name or Amazon Resource Name (ARN) of the IAM role with the required permissions.

    • MaxCapacity (float) --

      The number of AWS Glue data processing units (DPUs) that are allocated to task runs for this transform. You can allocate from 2 to 100 DPUs; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

      When the WorkerType field is set to a value other than Standard , the MaxCapacity field is set automatically and becomes read-only.

    • WorkerType (string) --

      The type of predefined worker that is allocated when this task runs. Accepts a value of Standard, G.1X, or G.2X.

      • For the Standard worker type, each worker provides 4 vCPU, 16 GB of memory and a 50GB disk, and 2 executors per worker.

      • For the G.1X worker type, each worker provides 4 vCPU, 16 GB of memory and a 64GB disk, and 1 executor per worker.

      • For the G.2X worker type, each worker provides 8 vCPU, 32 GB of memory and a 128GB disk, and 1 executor per worker.

    • NumberOfWorkers (integer) --

      The number of workers of a defined WorkerType that are allocated when this task runs.

    • Timeout (integer) --

      The timeout for a task run for this transform in minutes. This is the maximum time that a task run for this transform can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).

    • MaxRetries (integer) --

      The maximum number of times to retry a task for this transform after a task run fails.
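
To make the relationship between the ConfusionMatrix counts and the Precision, Recall, and F1 fields concrete, the sketch below recomputes them using the standard textbook definitions. The function name is hypothetical. The values AWS Glue reports may differ slightly: for example, F1 is described as a maximum, so treat this as an illustration rather than a reproduction of the service's calculation.

```python
def metrics_from_confusion_matrix(cm):
    """Derive standard classification metrics from a ConfusionMatrix dict."""
    tp = cm["NumTruePositives"]
    fp = cm["NumFalsePositives"]
    tn = cm["NumTrueNegatives"]
    fn = cm["NumFalseNegatives"]
    total = tp + fp + tn + fn

    # Precision: of everything predicted as a match, how much really matched.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that really matched, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0

    return {"Precision": precision, "Recall": recall,
            "F1": f1, "Accuracy": accuracy}


# Usage with a GetMLTransform response (needs AWS credentials; ID is a placeholder):
#   import boto3
#   resp = boto3.client("glue").get_ml_transform(TransformId="tfm-EXAMPLE")
#   cm = resp["EvaluationMetrics"]["FindMatchesMetrics"]["ConfusionMatrix"]
#   print(metrics_from_confusion_matrix(cm))
```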

StartMLLabelingSetGenerationTaskRun (new)

Starts the active learning workflow for your machine learning transform to improve the transform's quality by generating label sets and adding labels.

When the StartMLLabelingSetGenerationTaskRun finishes, AWS Glue will have generated a "labeling set" or a set of questions for humans to answer.

In the case of the FindMatches transform, these questions are of the form, “What is the correct way to group these rows together into groups composed entirely of matching records?”

After the labeling process is finished, you can upload your labels with a call to StartImportLabelsTaskRun . After StartImportLabelsTaskRun finishes, all future runs of the machine learning transform will use the new and improved labels and perform a higher-quality transformation.

See also: AWS API Documentation

Request Syntax

client.start_ml_labeling_set_generation_task_run(
    TransformId='string',
    OutputS3Path='string'
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the machine learning transform.

type OutputS3Path

string

param OutputS3Path

[REQUIRED]

The Amazon Simple Storage Service (Amazon S3) path where you generate the labeling set.

rtype

dict

returns

Response Syntax

{
    'TaskRunId': 'string'
}

Response Structure

  • (dict) --

    • TaskRunId (string) --

      The unique run identifier that is associated with this task run.

CreateMLTransform (new)

Creates an AWS Glue machine learning transform. This operation creates the transform and all the necessary parameters to train it.

Call this operation as the first step in the process of using a machine learning transform (such as the FindMatches transform) for deduplicating data. You can provide an optional Description , in addition to the parameters that you want to use for your algorithm.

You must also specify certain parameters for the tasks that AWS Glue runs on your behalf as part of learning from your data and creating a high-quality machine learning transform. These parameters include Role , and optionally, MaxCapacity , Timeout , and MaxRetries . For more information, see Jobs.

See also: AWS API Documentation

Request Syntax

client.create_ml_transform(
    Name='string',
    Description='string',
    InputRecordTables=[
        {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string'
        },
    ],
    Parameters={
        'TransformType': 'FIND_MATCHES',
        'FindMatchesParameters': {
            'PrimaryKeyColumnName': 'string',
            'PrecisionRecallTradeoff': 123.0,
            'AccuracyCostTradeoff': 123.0,
            'EnforceProvidedLabels': True|False
        }
    },
    Role='string',
    MaxCapacity=123.0,
    WorkerType='Standard'|'G.1X'|'G.2X',
    NumberOfWorkers=123,
    Timeout=123,
    MaxRetries=123
)
type Name

string

param Name

[REQUIRED]

The unique name that you give the transform when you create it.

type Description

string

param Description

A description of the machine learning transform that is being defined. The default is an empty string.

type InputRecordTables

list

param InputRecordTables

[REQUIRED]

A list of AWS Glue table definitions used by the transform.

  • (dict) --

    The database and table in the AWS Glue Data Catalog that is used for input or output data.

    • DatabaseName (string) -- [REQUIRED]

      A database name in the AWS Glue Data Catalog.

    • TableName (string) -- [REQUIRED]

      A table name in the AWS Glue Data Catalog.

    • CatalogId (string) --

      A unique identifier for the AWS Glue Data Catalog.

    • ConnectionName (string) --

      The name of the connection to the AWS Glue Data Catalog.

type Parameters

dict

param Parameters

[REQUIRED]

The algorithmic parameters that are specific to the transform type used. Conditionally dependent on the transform type.

  • TransformType (string) -- [REQUIRED]

    The type of machine learning transform.

    For information about the types of machine learning transforms, see Creating Machine Learning Transforms.

  • FindMatchesParameters (dict) --

    The parameters for the find matches algorithm.

    • PrimaryKeyColumnName (string) --

      The name of a column that uniquely identifies rows in the source table. Used to help identify matching records.

    • PrecisionRecallTradeoff (float) --

      The value selected when tuning your transform for a balance between precision and recall. A value of 0.5 means no preference; a value of 1.0 means a bias purely for precision, and a value of 0.0 means a bias for recall. Because this is a tradeoff, choosing values close to 1.0 means very low recall, and choosing values close to 0.0 results in very low precision.

      The precision metric indicates how often your model is correct when it predicts a match.

      The recall metric indicates how often your model predicts a match when there is an actual match.

    • AccuracyCostTradeoff (float) --

      The value that is selected when tuning your transform for a balance between accuracy and cost. A value of 0.5 means that the system balances accuracy and cost concerns. A value of 1.0 means a bias purely for accuracy, which typically results in a higher cost, sometimes substantially higher. A value of 0.0 means a bias purely for cost, which results in a less accurate FindMatches transform, sometimes with unacceptable accuracy.

      Accuracy measures how well the transform finds true positives and true negatives. Increasing accuracy requires more machine resources and cost. But it also results in increased recall.

      Cost measures how many compute resources, and thus money, are consumed to run the transform.

    • EnforceProvidedLabels (boolean) --

      The value to switch on or off to force the output to match the provided labels from users. If the value is True , the find matches transform forces the output to match the provided labels. The results override the normal conflation results. If the value is False , the find matches transform does not ensure all the labels provided are respected, and the results rely on the trained model.

      Note that setting this value to true may increase the conflation execution time.

type Role

string

param Role

[REQUIRED]

The name or Amazon Resource Name (ARN) of the IAM role with the required permissions. Ensure that this role has permission to your Amazon Simple Storage Service (Amazon S3) sources, targets, temporary directory, scripts, and any libraries that are used by the task run for this transform.

type MaxCapacity

float

param MaxCapacity

The number of AWS Glue data processing units (DPUs) that are allocated to task runs for this transform. You can allocate from 2 to 100 DPUs; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

When the WorkerType field is set to a value other than Standard , the MaxCapacity field is set automatically and becomes read-only.

type WorkerType

string

param WorkerType

The type of predefined worker that is allocated when this task runs. Accepts a value of Standard, G.1X, or G.2X.

  • For the Standard worker type, each worker provides 4 vCPU, 16 GB of memory and a 50GB disk, and 2 executors per worker.

  • For the G.1X worker type, each worker provides 4 vCPU, 16 GB of memory and a 64GB disk, and 1 executor per worker.

  • For the G.2X worker type, each worker provides 8 vCPU, 32 GB of memory and a 128GB disk, and 1 executor per worker.

type NumberOfWorkers

integer

param NumberOfWorkers

The number of workers of a defined WorkerType that are allocated when this task runs.

type Timeout

integer

param Timeout

The timeout of the task run for this transform in minutes. This is the maximum time that a task run for this transform can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).

type MaxRetries

integer

param MaxRetries

The maximum number of times to retry a task for this transform after a task run fails.

rtype

dict

returns

Response Syntax

{
    'TransformId': 'string'
}

Response Structure

  • (dict) --

    • TransformId (string) --

      A unique identifier that is generated for the transform.
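
The sketch below builds a CreateMLTransform request for a FindMatches transform and applies a few client-side checks drawn from the parameter descriptions above. The helper name is hypothetical. The checks are assumptions about sensible use (tradeoff values between 0.0 and 1.0, MaxCapacity between 2 and 100, no MaxCapacity alongside a non-Standard WorkerType) and do not reproduce the service's own validation.

```python
def build_create_ml_transform_request(name, role, database, table, primary_key,
                                      precision_recall=0.5, accuracy_cost=0.5,
                                      max_capacity=None, worker_type=None,
                                      number_of_workers=None):
    """Return keyword arguments for client.create_ml_transform()."""
    for label, value in (("PrecisionRecallTradeoff", precision_recall),
                         ("AccuracyCostTradeoff", accuracy_cost)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be between 0.0 and 1.0, got {value}")

    if max_capacity is not None and not 2 <= max_capacity <= 100:
        raise ValueError("MaxCapacity must be between 2 and 100 DPUs")

    # MaxCapacity becomes read-only when WorkerType is not Standard,
    # so this sketch refuses to send both.
    if max_capacity is not None and worker_type not in (None, "Standard"):
        raise ValueError("Do not set MaxCapacity with a non-Standard WorkerType")

    request = {
        "Name": name,
        "Role": role,
        "InputRecordTables": [{"DatabaseName": database, "TableName": table}],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": primary_key,
                "PrecisionRecallTradeoff": precision_recall,
                "AccuracyCostTradeoff": accuracy_cost,
            },
        },
    }
    # Optional capacity settings are only sent when the caller supplies them.
    if max_capacity is not None:
        request["MaxCapacity"] = float(max_capacity)
    if worker_type is not None:
        request["WorkerType"] = worker_type
    if number_of_workers is not None:
        request["NumberOfWorkers"] = number_of_workers
    return request


# Usage (needs AWS credentials; all names below are placeholders):
#   import boto3
#   kwargs = build_create_ml_transform_request(
#       "dedupe-customers", "ExampleGlueRole", "example_db", "customers", "customer_id")
#   transform_id = boto3.client("glue").create_ml_transform(**kwargs)["TransformId"]
```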

StartImportLabelsTaskRun (new)

Enables you to provide additional labels (examples of truth) to be used to teach the machine learning transform and improve its quality. This API operation is generally used as part of the active learning workflow that starts with the StartMLLabelingSetGenerationTaskRun call and that ultimately results in improving the quality of your machine learning transform.

After the StartMLLabelingSetGenerationTaskRun finishes, AWS Glue machine learning will have generated a series of questions for humans to answer. (Answering these questions is often called 'labeling' in machine learning workflows.) In the case of the FindMatches transform, these questions are of the form, “What is the correct way to group these rows together into groups composed entirely of matching records?” After the labeling process is finished, users upload their answers/labels with a call to StartImportLabelsTaskRun . After StartImportLabelsTaskRun finishes, all future runs of the machine learning transform use the new and improved labels and perform a higher-quality transformation.

By default, StartImportLabelsTaskRun continually learns from and combines all labels that you upload unless you set ReplaceAllLabels to true. If you set ReplaceAllLabels to true, StartImportLabelsTaskRun deletes and forgets all previously uploaded labels and learns only from the exact set that you upload. Replacing labels can be helpful if you realize that you previously uploaded incorrect labels, and you believe that they are having a negative effect on your transform quality.

You can check on the status of your task run by calling the GetMLTaskRun operation.

See also: AWS API Documentation

Request Syntax

client.start_import_labels_task_run(
    TransformId='string',
    InputS3Path='string',
    ReplaceAllLabels=True|False
)
type TransformId

string

param TransformId

[REQUIRED]

The unique identifier of the machine learning transform.

type InputS3Path

string

param InputS3Path

[REQUIRED]

The Amazon Simple Storage Service (Amazon S3) path from where you import the labels.

type ReplaceAllLabels

boolean

param ReplaceAllLabels

Indicates whether to overwrite your existing labels.

rtype

dict

returns

Response Syntax

{
    'TaskRunId': 'string'
}

Response Structure

  • (dict) --

    • TaskRunId (string) --

      The unique identifier for the task run.
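
The two halves of the active learning workflow described above can be sketched as a pair of thin wrappers. The function names are hypothetical, and the human labeling step between the two calls happens outside this code. Each call only starts an asynchronous task run; neither wrapper waits for it to finish.

```python
def start_labeling_set(client, transform_id, output_s3_path):
    """Ask AWS Glue to generate a labeling set and return the task run ID."""
    response = client.start_ml_labeling_set_generation_task_run(
        TransformId=transform_id,
        OutputS3Path=output_s3_path,
    )
    return response["TaskRunId"]


def import_labels(client, transform_id, input_s3_path, replace_all=False):
    """Upload labeled answers and return the task run ID.

    With replace_all=False the new labels are combined with earlier ones;
    with replace_all=True previously uploaded labels are discarded first.
    """
    response = client.start_import_labels_task_run(
        TransformId=transform_id,
        InputS3Path=input_s3_path,
        ReplaceAllLabels=replace_all,
    )
    return response["TaskRunId"]


# Usage (needs AWS credentials; identifiers and paths are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   start_labeling_set(glue, "tfm-EXAMPLE", "s3://example-bucket/labeling-set/")
#   ... humans answer the generated questions and save the results ...
#   import_labels(glue, "tfm-EXAMPLE", "s3://example-bucket/labels/answers.csv")
```

Defaulting replace_all to False mirrors the combine-by-default behavior, so discarding earlier labels has to be an explicit choice.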

GetMLTransforms (new)

Gets a sortable, filterable list of existing AWS Glue machine learning transforms. Machine learning transforms are a special type of transform that use machine learning to learn the details of the transformation to be performed by learning from examples provided by humans. These transformations are then saved by AWS Glue, and you can retrieve their metadata by calling GetMLTransforms .

See also: AWS API Documentation

Request Syntax

client.get_ml_transforms(
    NextToken='string',
    MaxResults=123,
    Filter={
        'Name': 'string',
        'TransformType': 'FIND_MATCHES',
        'Status': 'NOT_READY'|'READY'|'DELETING',
        'CreatedBefore': datetime(2015, 1, 1),
        'CreatedAfter': datetime(2015, 1, 1),
        'LastModifiedBefore': datetime(2015, 1, 1),
        'LastModifiedAfter': datetime(2015, 1, 1),
        'Schema': [
            {
                'Name': 'string',
                'DataType': 'string'
            },
        ]
    },
    Sort={
        'Column': 'NAME'|'TRANSFORM_TYPE'|'STATUS'|'CREATED'|'LAST_MODIFIED',
        'SortDirection': 'DESCENDING'|'ASCENDING'
    }
)
type NextToken

string

param NextToken

A paginated token to offset the results.

type MaxResults

integer

param MaxResults

The maximum number of results to return.

type Filter

dict

param Filter

The filter transformation criteria.

  • Name (string) --

    A unique transform name that is used to filter the machine learning transforms.

  • TransformType (string) --

    The type of machine learning transform that is used to filter the machine learning transforms.

  • Status (string) --

    Filters the list of machine learning transforms by the last known status of the transforms (to indicate whether a transform can be used or not). One of "NOT_READY", "READY", or "DELETING".

  • CreatedBefore (datetime) --

    The time and date before which the transforms were created.

  • CreatedAfter (datetime) --

    The time and date after which the transforms were created.

  • LastModifiedBefore (datetime) --

    Filter on transforms last modified before this date.

  • LastModifiedAfter (datetime) --

    Filter on transforms last modified after this date.

  • Schema (list) --

    Filters on datasets with a specific schema. The Map<Column, Type> object is an array of key-value pairs representing the schema this transform accepts, where Column is the name of a column, and Type is the type of the data such as an integer or string. Has an upper bound of 100 columns.

    • (dict) --

      A key-value pair representing a column and data type that this transform can run against. The Schema parameter of the MLTransform may contain up to 100 of these structures.

      • Name (string) --

        The name of the column.

      • DataType (string) --

        The type of data in the column.

type Sort

dict

param Sort

The sorting criteria.

  • Column (string) -- [REQUIRED]

    The column to be used in the sorting criteria that are associated with the machine learning transform.

  • SortDirection (string) -- [REQUIRED]

    The sort direction to be used in the sorting criteria that are associated with the machine learning transform.

rtype

dict

returns

Response Syntax

{
    'Transforms': [
        {
            'TransformId': 'string',
            'Name': 'string',
            'Description': 'string',
            'Status': 'NOT_READY'|'READY'|'DELETING',
            'CreatedOn': datetime(2015, 1, 1),
            'LastModifiedOn': datetime(2015, 1, 1),
            'InputRecordTables': [
                {
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'CatalogId': 'string',
                    'ConnectionName': 'string'
                },
            ],
            'Parameters': {
                'TransformType': 'FIND_MATCHES',
                'FindMatchesParameters': {
                    'PrimaryKeyColumnName': 'string',
                    'PrecisionRecallTradeoff': 123.0,
                    'AccuracyCostTradeoff': 123.0,
                    'EnforceProvidedLabels': True|False
                }
            },
            'EvaluationMetrics': {
                'TransformType': 'FIND_MATCHES',
                'FindMatchesMetrics': {
                    'AreaUnderPRCurve': 123.0,
                    'Precision': 123.0,
                    'Recall': 123.0,
                    'F1': 123.0,
                    'ConfusionMatrix': {
                        'NumTruePositives': 123,
                        'NumFalsePositives': 123,
                        'NumTrueNegatives': 123,
                        'NumFalseNegatives': 123
                    }
                }
            },
            'LabelCount': 123,
            'Schema': [
                {
                    'Name': 'string',
                    'DataType': 'string'
                },
            ],
            'Role': 'string',
            'MaxCapacity': 123.0,
            'WorkerType': 'Standard'|'G.1X'|'G.2X',
            'NumberOfWorkers': 123,
            'Timeout': 123,
            'MaxRetries': 123
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • Transforms (list) --

      A list of machine learning transforms.

      • (dict) --

        A structure for a machine learning transform.

        • TransformId (string) --

          The unique transform ID that is generated for the machine learning transform. The ID is guaranteed to be unique and does not change.

        • Name (string) --

          A user-defined name for the machine learning transform. Names are not guaranteed to be unique and can be changed at any time.

        • Description (string) --

          A user-defined, long-form description text for the machine learning transform. Descriptions are not guaranteed to be unique and can be changed at any time.

        • Status (string) --

          The current status of the machine learning transform.

        • CreatedOn (datetime) --

          A timestamp. The time and date that this machine learning transform was created.

        • LastModifiedOn (datetime) --

          A timestamp. The last point in time when this machine learning transform was modified.

        • InputRecordTables (list) --

          A list of AWS Glue table definitions used by the transform.

          • (dict) --

            The database and table in the AWS Glue Data Catalog that is used for input or output data.

            • DatabaseName (string) --

              A database name in the AWS Glue Data Catalog.

            • TableName (string) --

              A table name in the AWS Glue Data Catalog.

            • CatalogId (string) --

              A unique identifier for the AWS Glue Data Catalog.

            • ConnectionName (string) --

              The name of the connection to the AWS Glue Data Catalog.

        • Parameters (dict) --

          A TransformParameters object. You can use parameters to tune (customize) the behavior of the machine learning transform by specifying what data it learns from and your preference on various tradeoffs (such as precision vs. recall, or accuracy vs. cost).

          • TransformType (string) --

            The type of machine learning transform.

            For information about the types of machine learning transforms, see Creating Machine Learning Transforms.

          • FindMatchesParameters (dict) --

            The parameters for the find matches algorithm.

            • PrimaryKeyColumnName (string) --

              The name of a column that uniquely identifies rows in the source table. Used to help identify matching records.

            • PrecisionRecallTradeoff (float) --

              The value selected when tuning your transform for a balance between precision and recall. A value of 0.5 means no preference; a value of 1.0 means a bias purely for precision, and a value of 0.0 means a bias for recall. Because this is a tradeoff, choosing values close to 1.0 means very low recall, and choosing values close to 0.0 results in very low precision.

              The precision metric indicates how often your model is correct when it predicts a match.

              The recall metric indicates how often your model predicts a match when there is an actual match.

            • AccuracyCostTradeoff (float) --

              The value that is selected when tuning your transform for a balance between accuracy and cost. A value of 0.5 means that the system balances accuracy and cost concerns. A value of 1.0 means a bias purely for accuracy, which typically results in a higher cost, sometimes substantially higher. A value of 0.0 means a bias purely for cost, which results in a less accurate FindMatches transform, sometimes with unacceptable accuracy.

              Accuracy measures how well the transform finds true positives and true negatives. Increasing accuracy requires more machine resources and cost. But it also results in increased recall.

              Cost measures how many compute resources, and thus money, are consumed to run the transform.

            • EnforceProvidedLabels (boolean) --

              The value to switch on or off to force the output to match the provided labels from users. If the value is True , the find matches transform forces the output to match the provided labels. The results override the normal conflation results. If the value is False , the find matches transform does not ensure all the labels provided are respected, and the results rely on the trained model.

              Note that setting this value to true may increase the conflation execution time.

        • EvaluationMetrics (dict) --

          An EvaluationMetrics object. Evaluation metrics provide an estimate of the quality of your machine learning transform.

          • TransformType (string) --

            The type of machine learning transform.

          • FindMatchesMetrics (dict) --

            The evaluation metrics for the find matches algorithm.

            • AreaUnderPRCurve (float) --

              The area under the precision/recall curve (AUPRC) is a single number measuring the overall quality of the transform, that is independent of the choice made for precision vs. recall. Higher values indicate that you have a more attractive precision vs. recall tradeoff.

              For more information, see Precision and recall in Wikipedia.

            • Precision (float) --

              The precision metric indicates how often your transform is correct when it predicts a match. Specifically, it measures the fraction of predicted matches that are true positives.

              For more information, see Precision and recall in Wikipedia.

            • Recall (float) --

              The recall metric indicates how often your transform predicts a match when there is an actual match. Specifically, it measures the fraction of actual matches in the source data that the transform finds as true positives.

              For more information, see Precision and recall in Wikipedia.

            • F1 (float) --

              The maximum F1 metric indicates the transform's accuracy between 0 and 1, where 1 is the best accuracy.

              For more information, see F1 score in Wikipedia.

            • ConfusionMatrix (dict) --

              The confusion matrix shows you what your transform is predicting accurately and what types of errors it is making.

              For more information, see Confusion matrix in Wikipedia.

              • NumTruePositives (integer) --

                The number of matches in the data that the transform correctly found, in the confusion matrix for your transform.

              • NumFalsePositives (integer) --

                The number of nonmatches in the data that the transform incorrectly classified as a match, in the confusion matrix for your transform.

              • NumTrueNegatives (integer) --

                The number of nonmatches in the data that the transform correctly rejected, in the confusion matrix for your transform.

              • NumFalseNegatives (integer) --

                The number of matches in the data that the transform didn't find, in the confusion matrix for your transform.

        • LabelCount (integer) --

          A count identifier for the labeling files generated by AWS Glue for this transform. As you create a better transform, you can iteratively download, label, and upload the labeling file.

        • Schema (list) --

          A list of key-value pairs representing the columns and data types that this transform can run against. The list can contain at most 100 columns.

          • (dict) --

            A key-value pair representing a column and data type that this transform can run against. The Schema parameter of the MLTransform may contain up to 100 of these structures.

            • Name (string) --

              The name of the column.

            • DataType (string) --

              The type of data in the column.

        • Role (string) --

          The name or Amazon Resource Name (ARN) of the IAM role with the required permissions. This role needs permission to your Amazon Simple Storage Service (Amazon S3) sources, targets, temporary directory, scripts, and any libraries used by the task run for this transform.

        • MaxCapacity (float) --

          The number of AWS Glue data processing units (DPUs) that are allocated to task runs for this transform. You can allocate from 2 to 100 DPUs; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

          When the WorkerType field is set to a value other than Standard , the MaxCapacity field is set automatically and becomes read-only.

        • WorkerType (string) --

          The type of predefined worker that is allocated when a task of this transform runs. Accepts a value of Standard, G.1X, or G.2X.

          • For the Standard worker type, each worker provides 4 vCPU, 16 GB of memory, a 50 GB disk, and 2 executors per worker.

          • For the G.1X worker type, each worker provides 4 vCPU, 16 GB of memory, a 64 GB disk, and 1 executor per worker.

          • For the G.2X worker type, each worker provides 8 vCPU, 32 GB of memory, a 128 GB disk, and 1 executor per worker.

        • NumberOfWorkers (integer) --

          The number of workers of the defined WorkerType that are allocated when a task of the transform runs.

        • Timeout (integer) --

          The timeout in minutes of the machine learning transform.

        • MaxRetries (integer) --

          The maximum number of times to retry after an MLTaskRun of the machine learning transform fails.

    • NextToken (string) --

      A pagination token, if more results are available.
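
As a rough illustration of how the FindMatchesMetrics fields above relate to each other, the sketch below derives precision, recall, and F1 from the four ConfusionMatrix counts using the standard textbook definitions. The function name and the sample counts are hypothetical, and the values that AWS Glue reports may be computed differently (F1, for example, is described as a maximum), so treat this as a sanity check rather than a reimplementation.

```python
def metrics_from_confusion_matrix(confusion_matrix):
    """Derive precision, recall, and F1 from ConfusionMatrix-style counts."""
    tp = confusion_matrix["NumTruePositives"]
    fp = confusion_matrix["NumFalsePositives"]
    fn = confusion_matrix["NumFalseNegatives"]

    # Guard against empty denominators so the helper never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"Precision": precision, "Recall": recall, "F1": f1}


# Hypothetical counts, shaped like the ConfusionMatrix dict in the response.
sample = {
    "NumTruePositives": 80,
    "NumFalsePositives": 20,
    "NumTrueNegatives": 890,
    "NumFalseNegatives": 10,
}
result = metrics_from_confusion_matrix(sample)
```

NumTrueNegatives is not used here because none of the three derived metrics depends on it.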

BatchCreatePartition (updated)
Changes (request)
{'PartitionInputList': {'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Creates one or more partitions in a batch operation.

See also: AWS API Documentation

Request Syntax

client.batch_create_partition(
    CatalogId='string',
    DatabaseName='string',
    TableName='string',
    PartitionInputList=[
        {
            'Values': [
                'string',
            ],
            'LastAccessTime': datetime(2015, 1, 1),
            'StorageDescriptor': {
                'Columns': [
                    {
                        'Name': 'string',
                        'Type': 'string',
                        'Comment': 'string',
                        'Parameters': {
                            'string': 'string'
                        }
                    },
                ],
                'Location': 'string',
                'InputFormat': 'string',
                'OutputFormat': 'string',
                'Compressed': True|False,
                'NumberOfBuckets': 123,
                'SerdeInfo': {
                    'Name': 'string',
                    'SerializationLibrary': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
                'BucketColumns': [
                    'string',
                ],
                'SortColumns': [
                    {
                        'Column': 'string',
                        'SortOrder': 123
                    },
                ],
                'Parameters': {
                    'string': 'string'
                },
                'SkewedInfo': {
                    'SkewedColumnNames': [
                        'string',
                    ],
                    'SkewedColumnValues': [
                        'string',
                    ],
                    'SkewedColumnValueLocationMaps': {
                        'string': 'string'
                    }
                },
                'StoredAsSubDirectories': True|False
            },
            'Parameters': {
                'string': 'string'
            },
            'LastAnalyzedTime': datetime(2015, 1, 1)
        },
    ]
)
type CatalogId

string

param CatalogId

The ID of the catalog in which the partition is to be created. Currently, this should be the AWS account ID.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the metadata database in which the partition is to be created.

type TableName

string

param TableName

[REQUIRED]

The name of the metadata table in which the partition is to be created.

type PartitionInputList

list

param PartitionInputList

[REQUIRED]

A list of PartitionInput structures that define the partitions to be created.

  • (dict) --

    The structure used to create and update a partition.

    • Values (list) --

      The values of the partition. Although this parameter is not required by the SDK, you must specify this parameter for a valid input.

      • (string) --

    • LastAccessTime (datetime) --

      The last time at which the partition was accessed.

    • StorageDescriptor (dict) --

      Provides information about the physical location where the partition is stored.

      • Columns (list) --

        A list of the Columns in the table.

        • (dict) --

          A column in a Table .

          • Name (string) -- [REQUIRED]

            The name of the Column .

          • Type (string) --

            The data type of the Column .

          • Comment (string) --

            A free-form text comment.

          • Parameters (dict) --

            These key-value pairs define properties associated with the column.

            • (string) --

              • (string) --

      • Location (string) --

        The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

      • InputFormat (string) --

        The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

      • OutputFormat (string) --

        The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

      • Compressed (boolean) --

        True if the data in the table is compressed, or False if not.

      • NumberOfBuckets (integer) --

        Must be specified if the table contains any dimension columns.

      • SerdeInfo (dict) --

        The serialization/deserialization (SerDe) information.

        • Name (string) --

          Name of the SerDe.

        • SerializationLibrary (string) --

          Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

        • Parameters (dict) --

          These key-value pairs define initialization parameters for the SerDe.

          • (string) --

            • (string) --

      • BucketColumns (list) --

        A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

        • (string) --

      • SortColumns (list) --

        A list specifying the sort order of each bucket in the table.

        • (dict) --

          Specifies the sort order of a sorted column.

          • Column (string) -- [REQUIRED]

            The name of the column.

          • SortOrder (integer) -- [REQUIRED]

            Indicates whether the column is sorted in ascending order ( == 1 ) or in descending order ( == 0 ).

      • Parameters (dict) --

        The user-supplied properties in key-value form.

        • (string) --

          • (string) --

      • SkewedInfo (dict) --

        The information about values that appear frequently in a column (skewed values).

        • SkewedColumnNames (list) --

          A list of names of columns that contain skewed values.

          • (string) --

        • SkewedColumnValues (list) --

          A list of values that appear so frequently as to be considered skewed.

          • (string) --

        • SkewedColumnValueLocationMaps (dict) --

          A mapping of skewed values to the columns that contain them.

          • (string) --

            • (string) --

      • StoredAsSubDirectories (boolean) --

        True if the table data is stored in subdirectories, or False if not.

    • Parameters (dict) --

      These key-value pairs define partition parameters.

      • (string) --

        • (string) --

    • LastAnalyzedTime (datetime) --

      The last time at which column statistics were computed for this partition.

rtype

dict

returns

Response Syntax

{
    'Errors': [
        {
            'PartitionValues': [
                'string',
            ],
            'ErrorDetail': {
                'ErrorCode': 'string',
                'ErrorMessage': 'string'
            }
        },
    ]
}

Response Structure

  • (dict) --

    • Errors (list) --

      The errors encountered when trying to create the requested partitions.

      • (dict) --

        Contains information about a partition error.

        • PartitionValues (list) --

          The values that define the partition.

          • (string) --

        • ErrorDetail (dict) --

          The details about the partition error.

          • ErrorCode (string) --

            The code associated with this error.

          • ErrorMessage (string) --

            A message describing the error.
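
The request and response shapes above can be exercised with a small helper. This is a minimal sketch, not an official usage pattern: the helper names, the database and table names, the S3 locations, and the column parameter are all hypothetical, and the stub client only simulates a response (including the error code it returns) so that the example runs without AWS credentials.

```python
def build_partition_input(values, location, columns):
    """Build one PartitionInput dict.

    `columns` is a list of (name, type, parameters) triples, where
    `parameters` is the per-column key-value map.
    """
    return {
        "Values": list(values),
        "StorageDescriptor": {
            "Columns": [
                {"Name": name, "Type": col_type, "Parameters": dict(params)}
                for name, col_type, params in columns
            ],
            "Location": location,
        },
    }


def create_partitions(client, database, table, partition_inputs):
    """Create partitions in one call; return failures as (values, code, message)."""
    response = client.batch_create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInputList=partition_inputs,
    )
    failures = []
    for error in response.get("Errors", []):
        detail = error.get("ErrorDetail", {})
        failures.append((
            error.get("PartitionValues", []),
            detail.get("ErrorCode"),
            detail.get("ErrorMessage"),
        ))
    return failures


class _StubGlueClient:
    """Stand-in so the sketch runs offline; swap in boto3.client("glue")."""

    def batch_create_partition(self, **kwargs):
        # Simulated response: pretend the first requested partition exists.
        first = kwargs["PartitionInputList"][0]
        return {"Errors": [{
            "PartitionValues": first["Values"],
            "ErrorDetail": {"ErrorCode": "AlreadyExistsException",
                            "ErrorMessage": "Partition already exists."},
        }]}


columns = [("order_id", "bigint", {"source": "orders_csv"})]
inputs = [
    build_partition_input(["2019", "07"], "s3://example-bucket/orders/2019/07/", columns),
    build_partition_input(["2019", "08"], "s3://example-bucket/orders/2019/08/", columns),
]
failures = create_partitions(_StubGlueClient(), "sales_db", "orders", inputs)
```

Because the call reports per-partition failures in Errors, inspecting that list after each call is how a caller would learn which partitions were not created.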

BatchGetPartition (updated)
Changes (response)
{'Partitions': {'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Retrieves partitions in a batch request.

See also: AWS API Documentation

Request Syntax

client.batch_get_partition(
    CatalogId='string',
    DatabaseName='string',
    TableName='string',
    PartitionsToGet=[
        {
            'Values': [
                'string',
            ]
        },
    ]
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the partitions in question reside. If none is supplied, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the catalog database where the partitions reside.

type TableName

string

param TableName

[REQUIRED]

The name of the partitions' table.

type PartitionsToGet

list

param PartitionsToGet

[REQUIRED]

A list of partition values identifying the partitions to retrieve.

  • (dict) --

    Contains a list of values defining partitions.

    • Values (list) -- [REQUIRED]

      The list of values.

      • (string) --

rtype

dict

returns

Response Syntax

{
    'Partitions': [
        {
            'Values': [
                'string',
            ],
            'DatabaseName': 'string',
            'TableName': 'string',
            'CreationTime': datetime(2015, 1, 1),
            'LastAccessTime': datetime(2015, 1, 1),
            'StorageDescriptor': {
                'Columns': [
                    {
                        'Name': 'string',
                        'Type': 'string',
                        'Comment': 'string',
                        'Parameters': {
                            'string': 'string'
                        }
                    },
                ],
                'Location': 'string',
                'InputFormat': 'string',
                'OutputFormat': 'string',
                'Compressed': True|False,
                'NumberOfBuckets': 123,
                'SerdeInfo': {
                    'Name': 'string',
                    'SerializationLibrary': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
                'BucketColumns': [
                    'string',
                ],
                'SortColumns': [
                    {
                        'Column': 'string',
                        'SortOrder': 123
                    },
                ],
                'Parameters': {
                    'string': 'string'
                },
                'SkewedInfo': {
                    'SkewedColumnNames': [
                        'string',
                    ],
                    'SkewedColumnValues': [
                        'string',
                    ],
                    'SkewedColumnValueLocationMaps': {
                        'string': 'string'
                    }
                },
                'StoredAsSubDirectories': True|False
            },
            'Parameters': {
                'string': 'string'
            },
            'LastAnalyzedTime': datetime(2015, 1, 1)
        },
    ],
    'UnprocessedKeys': [
        {
            'Values': [
                'string',
            ]
        },
    ]
}

Response Structure

  • (dict) --

    • Partitions (list) --

      A list of the requested partitions.

      • (dict) --

        Represents a slice of table data.

        • Values (list) --

          The values of the partition.

          • (string) --

        • DatabaseName (string) --

          The name of the catalog database that contains the partition.

        • TableName (string) --

          The name of the database table that contains the partition.

        • CreationTime (datetime) --

          The time at which the partition was created.

        • LastAccessTime (datetime) --

          The last time at which the partition was accessed.

        • StorageDescriptor (dict) --

          Provides information about the physical location where the partition is stored.

          • Columns (list) --

            A list of the Columns in the table.

            • (dict) --

              A column in a Table .

              • Name (string) --

                The name of the Column .

              • Type (string) --

                The data type of the Column .

              • Comment (string) --

                A free-form text comment.

              • Parameters (dict) --

                These key-value pairs define properties associated with the column.

                • (string) --

                  • (string) --

          • Location (string) --

            The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

          • InputFormat (string) --

            The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

          • OutputFormat (string) --

            The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

          • Compressed (boolean) --

            True if the data in the table is compressed, or False if not.

          • NumberOfBuckets (integer) --

            Must be specified if the table contains any dimension columns.

          • SerdeInfo (dict) --

            The serialization/deserialization (SerDe) information.

            • Name (string) --

              Name of the SerDe.

            • SerializationLibrary (string) --

              Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

            • Parameters (dict) --

              These key-value pairs define initialization parameters for the SerDe.

              • (string) --

                • (string) --

          • BucketColumns (list) --

            A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

            • (string) --

          • SortColumns (list) --

            A list specifying the sort order of each bucket in the table.

            • (dict) --

              Specifies the sort order of a sorted column.

              • Column (string) --

                The name of the column.

              • SortOrder (integer) --

                Indicates whether the column is sorted in ascending order ( == 1 ) or in descending order ( == 0 ).

          • Parameters (dict) --

            The user-supplied properties in key-value form.

            • (string) --

              • (string) --

          • SkewedInfo (dict) --

            The information about values that appear frequently in a column (skewed values).

            • SkewedColumnNames (list) --

              A list of names of columns that contain skewed values.

              • (string) --

            • SkewedColumnValues (list) --

              A list of values that appear so frequently as to be considered skewed.

              • (string) --

            • SkewedColumnValueLocationMaps (dict) --

              A mapping of skewed values to the columns that contain them.

              • (string) --

                • (string) --

          • StoredAsSubDirectories (boolean) --

            True if the table data is stored in subdirectories, or False if not.

        • Parameters (dict) --

          These key-value pairs define partition parameters.

          • (string) --

            • (string) --

        • LastAnalyzedTime (datetime) --

          The last time at which column statistics were computed for this partition.

    • UnprocessedKeys (list) --

      A list of the partition values in the request for which partitions were not returned.

      • (dict) --

        Contains a list of values defining partitions.

        • Values (list) --

          The list of values.

          • (string) --

CreateDatabase (updated)
Changes (request)
{'DatabaseInput': {'CreateTableDefaultPermissions': [{'Permissions': ['ALL | SELECT | ALTER | DROP | DELETE | INSERT | CREATE_DATABASE | CREATE_TABLE | DATA_LOCATION_ACCESS'],
                                                      'Principal': {'DataLakePrincipalIdentifier': 'string'}}]}}

Creates a new database in a Data Catalog.

See also: AWS API Documentation

Request Syntax

client.create_database(
    CatalogId='string',
    DatabaseInput={
        'Name': 'string',
        'Description': 'string',
        'LocationUri': 'string',
        'Parameters': {
            'string': 'string'
        },
        'CreateTableDefaultPermissions': [
            {
                'Principal': {
                    'DataLakePrincipalIdentifier': 'string'
                },
                'Permissions': [
                    'ALL'|'SELECT'|'ALTER'|'DROP'|'DELETE'|'INSERT'|'CREATE_DATABASE'|'CREATE_TABLE'|'DATA_LOCATION_ACCESS',
                ]
            },
        ]
    }
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog in which to create the database. If none is provided, the AWS account ID is used by default.

type DatabaseInput

dict

param DatabaseInput

[REQUIRED]

The metadata for the database.

  • Name (string) -- [REQUIRED]

    The name of the database. For Hive compatibility, this is folded to lowercase when it is stored.

  • Description (string) --

    A description of the database.

  • LocationUri (string) --

    The location of the database (for example, an HDFS path).

  • Parameters (dict) --

    These key-value pairs define parameters and properties of the database.

    • (string) --

      • (string) --

  • CreateTableDefaultPermissions (list) --

    Creates a set of default permissions on the table for principals.

    • (dict) --

      Permissions granted to a principal.

      • Principal (dict) --

        The principal who is granted permissions.

        • DataLakePrincipalIdentifier (string) --

          An identifier for the AWS Lake Formation principal.

      • Permissions (list) --

        The permissions that are granted to the principal.

        • (string) --

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
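
A DatabaseInput that uses the new CreateTableDefaultPermissions field might be assembled as in the sketch below. The helper name, the database name, and the principal ARN are hypothetical placeholders; the permission names are the ones listed in the request syntax above, and the client-side check is only a convenience, not something the API requires.

```python
# Permission values taken from the request syntax above.
ALLOWED_PERMISSIONS = {
    "ALL", "SELECT", "ALTER", "DROP", "DELETE", "INSERT",
    "CREATE_DATABASE", "CREATE_TABLE", "DATA_LOCATION_ACCESS",
}


def build_database_input(name, principal_id, permissions, description=None):
    """Build a DatabaseInput dict with one default-permissions entry."""
    unknown = sorted(set(permissions) - ALLOWED_PERMISSIONS)
    if unknown:
        raise ValueError(f"Unsupported permissions: {unknown}")
    database_input = {
        "Name": name,
        "CreateTableDefaultPermissions": [
            {
                "Principal": {"DataLakePrincipalIdentifier": principal_id},
                "Permissions": list(permissions),
            }
        ],
    }
    if description is not None:
        database_input["Description"] = description
    return database_input


database_input = build_database_input(
    "analytics",
    "arn:aws:iam::123456789012:role/ExampleAnalyst",  # hypothetical principal
    ["SELECT", "ALTER"],
)
# With a real client:
#   boto3.client("glue").create_database(DatabaseInput=database_input)
```

The response is an empty dict, so a call that returns without raising is the only success signal.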

CreatePartition (updated)
Changes (request)
{'PartitionInput': {'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Creates a new partition.

See also: AWS API Documentation

Request Syntax

client.create_partition(
    CatalogId='string',
    DatabaseName='string',
    TableName='string',
    PartitionInput={
        'Values': [
            'string',
        ],
        'LastAccessTime': datetime(2015, 1, 1),
        'StorageDescriptor': {
            'Columns': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'Location': 'string',
            'InputFormat': 'string',
            'OutputFormat': 'string',
            'Compressed': True|False,
            'NumberOfBuckets': 123,
            'SerdeInfo': {
                'Name': 'string',
                'SerializationLibrary': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
            'BucketColumns': [
                'string',
            ],
            'SortColumns': [
                {
                    'Column': 'string',
                    'SortOrder': 123
                },
            ],
            'Parameters': {
                'string': 'string'
            },
            'SkewedInfo': {
                'SkewedColumnNames': [
                    'string',
                ],
                'SkewedColumnValues': [
                    'string',
                ],
                'SkewedColumnValueLocationMaps': {
                    'string': 'string'
                }
            },
            'StoredAsSubDirectories': True|False
        },
        'Parameters': {
            'string': 'string'
        },
        'LastAnalyzedTime': datetime(2015, 1, 1)
    }
)
type CatalogId

string

param CatalogId

The AWS account ID of the catalog in which the partition is to be created.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the metadata database in which the partition is to be created.

type TableName

string

param TableName

[REQUIRED]

The name of the metadata table in which the partition is to be created.

type PartitionInput

dict

param PartitionInput

[REQUIRED]

A PartitionInput structure defining the partition to be created.

  • Values (list) --

    The values of the partition. Although this parameter is not required by the SDK, you must specify this parameter for a valid input.

    • (string) --

  • LastAccessTime (datetime) --

    The last time at which the partition was accessed.

  • StorageDescriptor (dict) --

    Provides information about the physical location where the partition is stored.

    • Columns (list) --

      A list of the Columns in the table.

      • (dict) --

        A column in a Table .

        • Name (string) -- [REQUIRED]

          The name of the Column .

        • Type (string) --

          The data type of the Column .

        • Comment (string) --

          A free-form text comment.

        • Parameters (dict) --

          These key-value pairs define properties associated with the column.

          • (string) --

            • (string) --

    • Location (string) --

      The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

    • InputFormat (string) --

      The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

    • OutputFormat (string) --

      The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

    • Compressed (boolean) --

      True if the data in the table is compressed, or False if not.

    • NumberOfBuckets (integer) --

      Must be specified if the table contains any dimension columns.

    • SerdeInfo (dict) --

      The serialization/deserialization (SerDe) information.

      • Name (string) --

        Name of the SerDe.

      • SerializationLibrary (string) --

        Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

      • Parameters (dict) --

        These key-value pairs define initialization parameters for the SerDe.

        • (string) --

          • (string) --

    • BucketColumns (list) --

      A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

      • (string) --

    • SortColumns (list) --

      A list specifying the sort order of each bucket in the table.

      • (dict) --

        Specifies the sort order of a sorted column.

        • Column (string) -- [REQUIRED]

          The name of the column.

        • SortOrder (integer) -- [REQUIRED]

          Indicates whether the column is sorted in ascending order ( == 1 ) or in descending order ( == 0 ).

    • Parameters (dict) --

      The user-supplied properties in key-value form.

      • (string) --

        • (string) --

    • SkewedInfo (dict) --

      The information about values that appear frequently in a column (skewed values).

      • SkewedColumnNames (list) --

        A list of names of columns that contain skewed values.

        • (string) --

      • SkewedColumnValues (list) --

        A list of values that appear so frequently as to be considered skewed.

        • (string) --

      • SkewedColumnValueLocationMaps (dict) --

        A mapping of skewed values to the columns that contain them.

        • (string) --

          • (string) --

    • StoredAsSubDirectories (boolean) --

      True if the table data is stored in subdirectories, or False if not.

  • Parameters (dict) --

    These key-value pairs define partition parameters.

    • (string) --

      • (string) --

  • LastAnalyzedTime (datetime) --

    The last time at which column statistics were computed for this partition.

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
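
The SortOrder encoding described above (1 for ascending, 0 for descending) is easy to get backwards, so a small builder can help. This is a sketch under stated assumptions: the helper names, the "asc"/"desc" convention, and the database, table, column, and S3 names are all hypothetical, and only a subset of the PartitionInput fields is filled in.

```python
ASCENDING, DESCENDING = 1, 0  # SortOrder values as documented above


def build_sort_columns(order_by):
    """Convert [(column, "asc" | "desc"), ...] into a SortColumns list."""
    sort_columns = []
    for column, direction in order_by:
        if direction not in ("asc", "desc"):
            raise ValueError(f"direction must be 'asc' or 'desc', got {direction!r}")
        sort_order = ASCENDING if direction == "asc" else DESCENDING
        sort_columns.append({"Column": column, "SortOrder": sort_order})
    return sort_columns


def build_create_partition_kwargs(database, table, values, location, order_by):
    """Assemble keyword arguments for client.create_partition."""
    return {
        "DatabaseName": database,
        "TableName": table,
        "PartitionInput": {
            "Values": list(values),
            "StorageDescriptor": {
                "Location": location,
                "SortColumns": build_sort_columns(order_by),
            },
        },
    }


kwargs = build_create_partition_kwargs(
    "sales_db", "orders", ["2019", "08"],
    "s3://example-bucket/orders/2019/08/",
    [("order_ts", "desc"), ("order_id", "asc")],
)
# With a real client:
#   boto3.client("glue").create_partition(**kwargs)
```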

CreateTable (updated)
Changes (request)
{'TableInput': {'PartitionKeys': {'Parameters': {'string': 'string'}},
                'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Creates a new table definition in the Data Catalog.

See also: AWS API Documentation

Request Syntax

client.create_table(
    CatalogId='string',
    DatabaseName='string',
    TableInput={
        'Name': 'string',
        'Description': 'string',
        'Owner': 'string',
        'LastAccessTime': datetime(2015, 1, 1),
        'LastAnalyzedTime': datetime(2015, 1, 1),
        'Retention': 123,
        'StorageDescriptor': {
            'Columns': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'Location': 'string',
            'InputFormat': 'string',
            'OutputFormat': 'string',
            'Compressed': True|False,
            'NumberOfBuckets': 123,
            'SerdeInfo': {
                'Name': 'string',
                'SerializationLibrary': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
            'BucketColumns': [
                'string',
            ],
            'SortColumns': [
                {
                    'Column': 'string',
                    'SortOrder': 123
                },
            ],
            'Parameters': {
                'string': 'string'
            },
            'SkewedInfo': {
                'SkewedColumnNames': [
                    'string',
                ],
                'SkewedColumnValues': [
                    'string',
                ],
                'SkewedColumnValueLocationMaps': {
                    'string': 'string'
                }
            },
            'StoredAsSubDirectories': True|False
        },
        'PartitionKeys': [
            {
                'Name': 'string',
                'Type': 'string',
                'Comment': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
        ],
        'ViewOriginalText': 'string',
        'ViewExpandedText': 'string',
        'TableType': 'string',
        'Parameters': {
            'string': 'string'
        }
    }
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog in which to create the Table . If none is supplied, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The catalog database in which to create the new table. For Hive compatibility, this name is entirely lowercase.

type TableInput

dict

param TableInput

[REQUIRED]

The TableInput object that defines the metadata table to create in the catalog.

  • Name (string) -- [REQUIRED]

    The table name. For Hive compatibility, this is folded to lowercase when it is stored.

  • Description (string) --

    A description of the table.

  • Owner (string) --

    The table owner.

  • LastAccessTime (datetime) --

    The last time that the table was accessed.

  • LastAnalyzedTime (datetime) --

    The last time that column statistics were computed for this table.

  • Retention (integer) --

    The retention time for this table.

  • StorageDescriptor (dict) --

    A storage descriptor containing information about the physical storage of this table.

    • Columns (list) --

      A list of the Columns in the table.

      • (dict) --

        A column in a Table.

        • Name (string) -- [REQUIRED]

          The name of the Column.

        • Type (string) --

          The data type of the Column.

        • Comment (string) --

          A free-form text comment.

        • Parameters (dict) --

          These key-value pairs define properties associated with the column.

          • (string) --

            • (string) --

    • Location (string) --

      The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

    • InputFormat (string) --

      The input format: SequenceFileInputFormat (binary), TextInputFormat, or a custom format.

    • OutputFormat (string) --

      The output format: SequenceFileOutputFormat (binary), IgnoreKeyTextOutputFormat, or a custom format.

    • Compressed (boolean) --

      True if the data in the table is compressed, or False if not.

    • NumberOfBuckets (integer) --

      The number of buckets. Must be specified if the table contains any dimension columns.

    • SerdeInfo (dict) --

      The serialization/deserialization (SerDe) information.

      • Name (string) --

        Name of the SerDe.

      • SerializationLibrary (string) --

        Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.

      • Parameters (dict) --

        These key-value pairs define initialization parameters for the SerDe.

        • (string) --

          • (string) --

    • BucketColumns (list) --

      A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

      • (string) --

    • SortColumns (list) --

      A list specifying the sort order of each bucket in the table.

      • (dict) --

        Specifies the sort order of a sorted column.

        • Column (string) -- [REQUIRED]

          The name of the column.

        • SortOrder (integer) -- [REQUIRED]

          Indicates that the column is sorted in ascending order (== 1), or in descending order (== 0).

    • Parameters (dict) --

      The user-supplied properties in key-value form.

      • (string) --

        • (string) --

    • SkewedInfo (dict) --

      The information about values that appear frequently in a column (skewed values).

      • SkewedColumnNames (list) --

        A list of names of columns that contain skewed values.

        • (string) --

      • SkewedColumnValues (list) --

        A list of values that appear so frequently as to be considered skewed.

        • (string) --

      • SkewedColumnValueLocationMaps (dict) --

        A mapping of skewed values to the columns that contain them.

        • (string) --

          • (string) --

    • StoredAsSubDirectories (boolean) --

      True if the table data is stored in subdirectories, or False if not.

  • PartitionKeys (list) --

    A list of columns by which the table is partitioned. Only primitive types are supported as partition keys.

    When you create a table used by Amazon Athena and you do not specify any partition keys, you must still set the value of PartitionKeys to an empty list. For example:

    "PartitionKeys": []

    • (dict) --

      A column in a Table.

      • Name (string) -- [REQUIRED]

        The name of the Column.

      • Type (string) --

        The data type of the Column.

      • Comment (string) --

        A free-form text comment.

      • Parameters (dict) --

        These key-value pairs define properties associated with the column.

        • (string) --

          • (string) --

  • ViewOriginalText (string) --

    If the table is a view, the original text of the view; otherwise null.

  • ViewExpandedText (string) --

    If the table is a view, the expanded text of the view; otherwise null.

  • TableType (string) --

    The type of this table (EXTERNAL_TABLE, VIRTUAL_VIEW, etc.).

  • Parameters (dict) --

    These key-value pairs define properties associated with the table.

    • (string) --

      • (string) --

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
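As a rough illustration of the TableInput structure described above, the sketch below builds a minimal table definition and passes it to create_table. The helper names build_table_input and create_table, the table name, and the S3 path used in the usage note are hypothetical and chosen only for illustration; client is assumed to be a Glue client such as boto3.client('glue').

```python
def build_table_input(name, columns, location, partition_keys=None):
    """Build a TableInput dict for CreateTable (hypothetical helper).

    columns and partition_keys are lists of (name, type) pairs.
    """
    def to_columns(pairs):
        return [{"Name": n, "Type": t} for n, t in pairs]

    return {
        # The name is folded to lowercase when stored, so lowercase it up front.
        "Name": name.lower(),
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": to_columns(columns),
            "Location": location,
            "Compressed": False,
            "StoredAsSubDirectories": False,
        },
        # Always present, even when empty, so that Athena can use the table.
        "PartitionKeys": to_columns(partition_keys or []),
    }


def create_table(client, database_name, table_input):
    """Call CreateTable; client is assumed to be boto3.client("glue")."""
    return client.create_table(DatabaseName=database_name, TableInput=table_input)
```

For example, build_table_input("Events", [("id", "bigint")], "s3://example-bucket/events/") yields a definition named "events" with "PartitionKeys": [].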

GetDatabase (updated)
Changes (response)
{'Database': {'CreateTableDefaultPermissions': [{'Permissions': ['ALL | SELECT | ALTER | DROP | DELETE | INSERT | CREATE_DATABASE | CREATE_TABLE | DATA_LOCATION_ACCESS'],
                                                 'Principal': {'DataLakePrincipalIdentifier': 'string'}}]}}

Retrieves the definition of a specified database.

See also: AWS API Documentation

Request Syntax

client.get_database(
    CatalogId='string',
    Name='string'
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog in which the database resides. If none is provided, the AWS account ID is used by default.

type Name

string

param Name

[REQUIRED]

The name of the database to retrieve. For Hive compatibility, this should be all lowercase.

rtype

dict

returns

Response Syntax

{
    'Database': {
        'Name': 'string',
        'Description': 'string',
        'LocationUri': 'string',
        'Parameters': {
            'string': 'string'
        },
        'CreateTime': datetime(2015, 1, 1),
        'CreateTableDefaultPermissions': [
            {
                'Principal': {
                    'DataLakePrincipalIdentifier': 'string'
                },
                'Permissions': [
                    'ALL'|'SELECT'|'ALTER'|'DROP'|'DELETE'|'INSERT'|'CREATE_DATABASE'|'CREATE_TABLE'|'DATA_LOCATION_ACCESS',
                ]
            },
        ]
    }
}

Response Structure

  • (dict) --

    • Database (dict) --

      The definition of the specified database in the Data Catalog.

      • Name (string) --

        The name of the database. For Hive compatibility, this is folded to lowercase when it is stored.

      • Description (string) --

        A description of the database.

      • LocationUri (string) --

        The location of the database (for example, an HDFS path).

      • Parameters (dict) --

        These key-value pairs define parameters and properties of the database.

        • (string) --

          • (string) --

      • CreateTime (datetime) --

        The time at which the metadata database was created in the catalog.

      • CreateTableDefaultPermissions (list) --

        The default permissions granted to principals on tables created in this database.

        • (dict) --

          Permissions granted to a principal.

          • Principal (dict) --

            The principal who is granted permissions.

            • DataLakePrincipalIdentifier (string) --

              An identifier for the AWS Lake Formation principal.

          • Permissions (list) --

            The permissions that are granted to the principal.

            • (string) --
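One way to read the new CreateTableDefaultPermissions field from a get_database response is sketched below. The function name default_table_permissions is hypothetical; the response is assumed to have the shape shown in the Response Syntax above.

```python
def default_table_permissions(response):
    """Map each principal identifier to a sorted list of its default permissions."""
    database = response.get("Database", {})
    collected = {}
    for grant in database.get("CreateTableDefaultPermissions", []):
        principal = grant.get("Principal", {}).get("DataLakePrincipalIdentifier")
        # A principal may appear in more than one entry, so merge the lists.
        collected.setdefault(principal, set()).update(grant.get("Permissions", []))
    return {principal: sorted(perms) for principal, perms in collected.items()}
```

With a Glue client, this could be called as default_table_permissions(client.get_database(Name='sales')), where 'sales' is a placeholder database name.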

GetDatabases (updated)
Changes (response)
{'DatabaseList': {'CreateTableDefaultPermissions': [{'Permissions': ['ALL | SELECT | ALTER | DROP | DELETE | INSERT | CREATE_DATABASE | CREATE_TABLE | DATA_LOCATION_ACCESS'],
                                                     'Principal': {'DataLakePrincipalIdentifier': 'string'}}]}}

Retrieves all databases defined in a given Data Catalog.

See also: AWS API Documentation

Request Syntax

client.get_databases(
    CatalogId='string',
    NextToken='string',
    MaxResults=123
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog from which to retrieve Databases. If none is provided, the AWS account ID is used by default.

type NextToken

string

param NextToken

A continuation token, if this is a continuation call.

type MaxResults

integer

param MaxResults

The maximum number of databases to return in one response.

rtype

dict

returns

Response Syntax

{
    'DatabaseList': [
        {
            'Name': 'string',
            'Description': 'string',
            'LocationUri': 'string',
            'Parameters': {
                'string': 'string'
            },
            'CreateTime': datetime(2015, 1, 1),
            'CreateTableDefaultPermissions': [
                {
                    'Principal': {
                        'DataLakePrincipalIdentifier': 'string'
                    },
                    'Permissions': [
                        'ALL'|'SELECT'|'ALTER'|'DROP'|'DELETE'|'INSERT'|'CREATE_DATABASE'|'CREATE_TABLE'|'DATA_LOCATION_ACCESS',
                    ]
                },
            ]
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • DatabaseList (list) --

      A list of Database objects from the specified catalog.

      • (dict) --

        The Database object represents a logical grouping of tables that might reside in a Hive metastore or an RDBMS.

        • Name (string) --

          The name of the database. For Hive compatibility, this is folded to lowercase when it is stored.

        • Description (string) --

          A description of the database.

        • LocationUri (string) --

          The location of the database (for example, an HDFS path).

        • Parameters (dict) --

          These key-value pairs define parameters and properties of the database.

          • (string) --

            • (string) --

        • CreateTime (datetime) --

          The time at which the metadata database was created in the catalog.

        • CreateTableDefaultPermissions (list) --

          The default permissions granted to principals on tables created in this database.

          • (dict) --

            Permissions granted to a principal.

            • Principal (dict) --

              The principal who is granted permissions.

              • DataLakePrincipalIdentifier (string) --

                An identifier for the AWS Lake Formation principal.

            • Permissions (list) --

              The permissions that are granted to the principal.

              • (string) --

    • NextToken (string) --

      A continuation token for paginating the returned list of databases, returned if the current segment of the list is not the last.
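The NextToken handshake described above can be followed with a small loop. This is a minimal sketch: iter_databases is a hypothetical helper, page_size is an arbitrary default, and client is assumed to be a Glue client such as boto3.client('glue').

```python
def iter_databases(client, catalog_id=None, page_size=100):
    """Yield every Database dict, following NextToken until it is absent."""
    kwargs = {"MaxResults": page_size}
    if catalog_id is not None:
        # Omitting CatalogId lets the service default to the caller's account ID.
        kwargs["CatalogId"] = catalog_id
    while True:
        page = client.get_databases(**kwargs)
        yield from page.get("DatabaseList", [])
        token = page.get("NextToken")
        if not token:
            break
        # Pass the token back unchanged to request the next segment.
        kwargs["NextToken"] = token
```

Because the helper is a generator, callers can stop early without fetching the remaining pages.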

GetPartition (updated)
Changes (response)
{'Partition': {'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Retrieves information about a specified partition.

See also: AWS API Documentation

Request Syntax

client.get_partition(
    CatalogId='string',
    DatabaseName='string',
    TableName='string',
    PartitionValues=[
        'string',
    ]
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the partition in question resides. If none is provided, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the catalog database where the partition resides.

type TableName

string

param TableName

[REQUIRED]

The name of the partition's table.

type PartitionValues

list

param PartitionValues

[REQUIRED]

The values that define the partition.

  • (string) --

rtype

dict

returns

Response Syntax

{
    'Partition': {
        'Values': [
            'string',
        ],
        'DatabaseName': 'string',
        'TableName': 'string',
        'CreationTime': datetime(2015, 1, 1),
        'LastAccessTime': datetime(2015, 1, 1),
        'StorageDescriptor': {
            'Columns': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'Location': 'string',
            'InputFormat': 'string',
            'OutputFormat': 'string',
            'Compressed': True|False,
            'NumberOfBuckets': 123,
            'SerdeInfo': {
                'Name': 'string',
                'SerializationLibrary': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
            'BucketColumns': [
                'string',
            ],
            'SortColumns': [
                {
                    'Column': 'string',
                    'SortOrder': 123
                },
            ],
            'Parameters': {
                'string': 'string'
            },
            'SkewedInfo': {
                'SkewedColumnNames': [
                    'string',
                ],
                'SkewedColumnValues': [
                    'string',
                ],
                'SkewedColumnValueLocationMaps': {
                    'string': 'string'
                }
            },
            'StoredAsSubDirectories': True|False
        },
        'Parameters': {
            'string': 'string'
        },
        'LastAnalyzedTime': datetime(2015, 1, 1)
    }
}

Response Structure

  • (dict) --

    • Partition (dict) --

      The requested information, in the form of a Partition object.

      • Values (list) --

        The values of the partition.

        • (string) --

      • DatabaseName (string) --

        The name of the catalog database that contains the partition.

      • TableName (string) --

        The name of the database table that contains the partition.

      • CreationTime (datetime) --

        The time at which the partition was created.

      • LastAccessTime (datetime) --

        The last time at which the partition was accessed.

      • StorageDescriptor (dict) --

        Provides information about the physical location where the partition is stored.

        • Columns (list) --

          A list of the Columns in the table.

          • (dict) --

            A column in a Table.

            • Name (string) --

              The name of the Column.

            • Type (string) --

              The data type of the Column.

            • Comment (string) --

              A free-form text comment.

            • Parameters (dict) --

              These key-value pairs define properties associated with the column.

              • (string) --

                • (string) --

        • Location (string) --

          The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

        • InputFormat (string) --

          The input format: SequenceFileInputFormat (binary), TextInputFormat, or a custom format.

        • OutputFormat (string) --

          The output format: SequenceFileOutputFormat (binary), IgnoreKeyTextOutputFormat, or a custom format.

        • Compressed (boolean) --

          True if the data in the table is compressed, or False if not.

        • NumberOfBuckets (integer) --

          The number of buckets. Must be specified if the table contains any dimension columns.

        • SerdeInfo (dict) --

          The serialization/deserialization (SerDe) information.

          • Name (string) --

            Name of the SerDe.

          • SerializationLibrary (string) --

            Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.

          • Parameters (dict) --

            These key-value pairs define initialization parameters for the SerDe.

            • (string) --

              • (string) --

        • BucketColumns (list) --

          A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

          • (string) --

        • SortColumns (list) --

          A list specifying the sort order of each bucket in the table.

          • (dict) --

            Specifies the sort order of a sorted column.

            • Column (string) --

              The name of the column.

            • SortOrder (integer) --

              Indicates that the column is sorted in ascending order (== 1), or in descending order (== 0).

        • Parameters (dict) --

          The user-supplied properties in key-value form.

          • (string) --

            • (string) --

        • SkewedInfo (dict) --

          The information about values that appear frequently in a column (skewed values).

          • SkewedColumnNames (list) --

            A list of names of columns that contain skewed values.

            • (string) --

          • SkewedColumnValues (list) --

            A list of values that appear so frequently as to be considered skewed.

            • (string) --

          • SkewedColumnValueLocationMaps (dict) --

            A mapping of skewed values to the columns that contain them.

            • (string) --

              • (string) --

        • StoredAsSubDirectories (boolean) --

          True if the table data is stored in subdirectories, or False if not.

      • Parameters (dict) --

        These key-value pairs define partition parameters.

        • (string) --

          • (string) --

      • LastAnalyzedTime (datetime) --

        The last time at which column statistics were computed for this partition.
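To make the nested Partition structure above easier to work with, the following sketch flattens a get_partition response into a small summary. The name describe_partition and the keys of the returned dict are hypothetical; the SortOrder decoding follows the description above (1 for ascending, 0 for descending).

```python
def describe_partition(response):
    """Summarise the Partition object returned by GetPartition."""
    partition = response["Partition"]
    descriptor = partition.get("StorageDescriptor", {})
    return {
        "values": partition.get("Values", []),
        "location": descriptor.get("Location"),
        "columns": [column["Name"] for column in descriptor.get("Columns", [])],
        # Per-column Parameters are the field added to this response.
        "column_parameters": {
            column["Name"]: column.get("Parameters", {})
            for column in descriptor.get("Columns", [])
        },
        "sort": [
            (entry["Column"], "ascending" if entry["SortOrder"] == 1 else "descending")
            for entry in descriptor.get("SortColumns", [])
        ],
    }
```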

GetPartitions (updated)
Changes (response)
{'Partitions': {'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Retrieves information about the partitions in a table.

See also: AWS API Documentation

Request Syntax

client.get_partitions(
    CatalogId='string',
    DatabaseName='string',
    TableName='string',
    Expression='string',
    NextToken='string',
    Segment={
        'SegmentNumber': 123,
        'TotalSegments': 123
    },
    MaxResults=123
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the partitions in question reside. If none is provided, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the catalog database where the partitions reside.

type TableName

string

param TableName

[REQUIRED]

The name of the partitions' table.

type Expression

string

param Expression

An expression that filters the partitions to be returned.

The expression uses SQL syntax similar to the SQL WHERE filter clause. The SQL statement parser JSQLParser parses the expression.

Operators: The following operators can be used in the Expression parameter:

=

Checks whether the values of the two operands are equal; if yes, then the condition becomes true.

Example: Assume 'variable a' holds 10 and 'variable b' holds 20.

(a = b) is not true.

< >

Checks whether the values of two operands are equal; if the values are not equal, then the condition becomes true.

Example: (a < > b) is true.

>

Checks whether the value of the left operand is greater than the value of the right operand; if yes, then the condition becomes true.

Example: (a > b) is not true.

<

Checks whether the value of the left operand is less than the value of the right operand; if yes, then the condition becomes true.

Example: (a < b) is true.

>=

Checks whether the value of the left operand is greater than or equal to the value of the right operand; if yes, then the condition becomes true.

Example: (a >= b) is not true.

<=

Checks whether the value of the left operand is less than or equal to the value of the right operand; if yes, then the condition becomes true.

Example: (a <= b) is true.

AND, OR, IN, BETWEEN, LIKE, NOT, IS NULL

Logical operators.

Supported Partition Key Types: The following are the supported partition key types.

  • string

  • date

  • timestamp

  • int

  • bigint

  • long

  • tinyint

  • smallint

  • decimal

If an invalid type is encountered, an exception is thrown.

The following list shows the valid operators on each type. When you define a crawler, the partitionKey type is created as a STRING, to be compatible with the catalog partitions.

Sample API Call:
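A minimal sketch of building an Expression string and passing it to get_partitions with boto3 follows. The helpers literal, compare, all_of and find_partitions are hypothetical, as are the partition key names year and month. The sketch writes the not-equal operator as <> and rejects values containing single quotes rather than guessing at the parser's escaping rules; both choices are assumptions.

```python
OPERATORS = {"=", "<>", ">", "<", ">=", "<="}


def literal(value):
    """Render a Python value as a literal for the Expression string."""
    if isinstance(value, bool):
        raise TypeError("booleans are not a supported partition key type")
    if isinstance(value, int):
        return str(value)
    text = str(value)
    if "'" in text:
        # Escaping rules are not covered here, so such values are rejected.
        raise ValueError("value must not contain a single quote")
    return "'" + text + "'"


def compare(key, operator, value):
    """Build one comparison, for example: year = '2019'."""
    if operator not in OPERATORS:
        raise ValueError("unsupported operator: " + operator)
    return f"{key} {operator} {literal(value)}"


def all_of(*conditions):
    """Combine conditions with the AND logical operator."""
    return " AND ".join(conditions)


def find_partitions(client, database, table, expression):
    """Return the first page of partitions that match the expression."""
    response = client.get_partitions(
        DatabaseName=database, TableName=table, Expression=expression
    )
    return response.get("Partitions", [])
```

Because a crawler creates partition keys as STRING, comparisons against crawler-defined keys would normally use quoted values, as in compare("year", "=", "2019").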

type NextToken

string

param NextToken

A continuation token, if this is not the first call to retrieve these partitions.

type Segment

dict

param Segment

The segment of the table's partitions to scan in this request.

  • SegmentNumber (integer) -- [REQUIRED]

    The zero-based index number of the segment. For example, if the total number of segments is 4, SegmentNumber values range from 0 through 3.

  • TotalSegments (integer) -- [REQUIRED]

    The total number of segments.
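The Segment parameter lets several requests scan disjoint slices of a table's partitions. The sketch below, under the assumption that one Glue client can be shared across threads, scans each zero-based segment in its own worker and concatenates the results. The names scan_segment and scan_all_segments and the default of four segments are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(client, database, table, segment_number, total_segments):
    """Collect every partition in one segment, following NextToken."""
    kwargs = {
        "DatabaseName": database,
        "TableName": table,
        "Segment": {"SegmentNumber": segment_number, "TotalSegments": total_segments},
    }
    partitions = []
    while True:
        page = client.get_partitions(**kwargs)
        partitions.extend(page.get("Partitions", []))
        token = page.get("NextToken")
        if not token:
            return partitions
        kwargs["NextToken"] = token


def scan_all_segments(client, database, table, total_segments=4):
    """Scan segments 0 .. total_segments - 1 concurrently, in segment order."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        per_segment = pool.map(
            lambda number: scan_segment(client, database, table, number, total_segments),
            range(total_segments),
        )
        return [partition for segment in per_segment for partition in segment]
```

Each worker keeps its own continuation token, since a token returned for one segment applies only to that segment's scan.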

type MaxResults

integer

param MaxResults

The maximum number of partitions to return in a single response.

rtype

dict

returns

Response Syntax

{
    'Partitions': [
        {
            'Values': [
                'string',
            ],
            'DatabaseName': 'string',
            'TableName': 'string',
            'CreationTime': datetime(2015, 1, 1),
            'LastAccessTime': datetime(2015, 1, 1),
            'StorageDescriptor': {
                'Columns': [
                    {
                        'Name': 'string',
                        'Type': 'string',
                        'Comment': 'string',
                        'Parameters': {
                            'string': 'string'
                        }
                    },
                ],
                'Location': 'string',
                'InputFormat': 'string',
                'OutputFormat': 'string',
                'Compressed': True|False,
                'NumberOfBuckets': 123,
                'SerdeInfo': {
                    'Name': 'string',
                    'SerializationLibrary': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
                'BucketColumns': [
                    'string',
                ],
                'SortColumns': [
                    {
                        'Column': 'string',
                        'SortOrder': 123
                    },
                ],
                'Parameters': {
                    'string': 'string'
                },
                'SkewedInfo': {
                    'SkewedColumnNames': [
                        'string',
                    ],
                    'SkewedColumnValues': [
                        'string',
                    ],
                    'SkewedColumnValueLocationMaps': {
                        'string': 'string'
                    }
                },
                'StoredAsSubDirectories': True|False
            },
            'Parameters': {
                'string': 'string'
            },
            'LastAnalyzedTime': datetime(2015, 1, 1)
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • Partitions (list) --

      A list of requested partitions.

      • (dict) --

        Represents a slice of table data.

        • Values (list) --

          The values of the partition.

          • (string) --

        • DatabaseName (string) --

          The name of the catalog database that contains the partition.

        • TableName (string) --

          The name of the database table that contains the partition.

        • CreationTime (datetime) --

          The time at which the partition was created.

        • LastAccessTime (datetime) --

          The last time at which the partition was accessed.

        • StorageDescriptor (dict) --

          Provides information about the physical location where the partition is stored.

          • Columns (list) --

            A list of the Columns in the table.

            • (dict) --

              A column in a Table.

              • Name (string) --

                The name of the Column.

              • Type (string) --

                The data type of the Column.

              • Comment (string) --

                A free-form text comment.

              • Parameters (dict) --

                These key-value pairs define properties associated with the column.

                • (string) --

                  • (string) --

          • Location (string) --

            The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

          • InputFormat (string) --

            The input format: SequenceFileInputFormat (binary), TextInputFormat, or a custom format.

          • OutputFormat (string) --

            The output format: SequenceFileOutputFormat (binary), IgnoreKeyTextOutputFormat, or a custom format.

          • Compressed (boolean) --

            True if the data in the table is compressed, or False if not.

          • NumberOfBuckets (integer) --

            The number of buckets. Must be specified if the table contains any dimension columns.

          • SerdeInfo (dict) --

            The serialization/deserialization (SerDe) information.

            • Name (string) --

              Name of the SerDe.

            • SerializationLibrary (string) --

              Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.

            • Parameters (dict) --

              These key-value pairs define initialization parameters for the SerDe.

              • (string) --

                • (string) --

          • BucketColumns (list) --

            A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

            • (string) --

          • SortColumns (list) --

            A list specifying the sort order of each bucket in the table.

            • (dict) --

              Specifies the sort order of a sorted column.

              • Column (string) --

                The name of the column.

              • SortOrder (integer) --

                Indicates that the column is sorted in ascending order (== 1), or in descending order (== 0).

          • Parameters (dict) --

            The user-supplied properties in key-value form.

            • (string) --

              • (string) --

          • SkewedInfo (dict) --

            The information about values that appear frequently in a column (skewed values).

            • SkewedColumnNames (list) --

              A list of names of columns that contain skewed values.

              • (string) --

            • SkewedColumnValues (list) --

              A list of values that appear so frequently as to be considered skewed.

              • (string) --

            • SkewedColumnValueLocationMaps (dict) --

              A mapping of skewed values to the columns that contain them.

              • (string) --

                • (string) --

          • StoredAsSubDirectories (boolean) --

            True if the table data is stored in subdirectories, or False if not.

        • Parameters (dict) --

          These key-value pairs define partition parameters.

          • (string) --

            • (string) --

        • LastAnalyzedTime (datetime) --

          The last time at which column statistics were computed for this partition.

    • NextToken (string) --

      A continuation token, if the returned list of partitions does not include the last one.

GetTable (updated)
Changes (response)
{'Table': {'IsRegisteredWithLakeFormation': 'boolean',
           'PartitionKeys': {'Parameters': {'string': 'string'}},
           'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Retrieves the Table definition in a Data Catalog for a specified table.

See also: AWS API Documentation

Request Syntax

client.get_table(
    CatalogId='string',
    DatabaseName='string',
    Name='string'
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the table resides. If none is provided, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the database in the catalog in which the table resides. For Hive compatibility, this name is entirely lowercase.

type Name

string

param Name

[REQUIRED]

The name of the table for which to retrieve the definition. For Hive compatibility, this name is entirely lowercase.

rtype

dict

returns

Response Syntax

{
    'Table': {
        'Name': 'string',
        'DatabaseName': 'string',
        'Description': 'string',
        'Owner': 'string',
        'CreateTime': datetime(2015, 1, 1),
        'UpdateTime': datetime(2015, 1, 1),
        'LastAccessTime': datetime(2015, 1, 1),
        'LastAnalyzedTime': datetime(2015, 1, 1),
        'Retention': 123,
        'StorageDescriptor': {
            'Columns': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'Location': 'string',
            'InputFormat': 'string',
            'OutputFormat': 'string',
            'Compressed': True|False,
            'NumberOfBuckets': 123,
            'SerdeInfo': {
                'Name': 'string',
                'SerializationLibrary': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
            'BucketColumns': [
                'string',
            ],
            'SortColumns': [
                {
                    'Column': 'string',
                    'SortOrder': 123
                },
            ],
            'Parameters': {
                'string': 'string'
            },
            'SkewedInfo': {
                'SkewedColumnNames': [
                    'string',
                ],
                'SkewedColumnValues': [
                    'string',
                ],
                'SkewedColumnValueLocationMaps': {
                    'string': 'string'
                }
            },
            'StoredAsSubDirectories': True|False
        },
        'PartitionKeys': [
            {
                'Name': 'string',
                'Type': 'string',
                'Comment': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
        ],
        'ViewOriginalText': 'string',
        'ViewExpandedText': 'string',
        'TableType': 'string',
        'Parameters': {
            'string': 'string'
        },
        'CreatedBy': 'string',
        'IsRegisteredWithLakeFormation': True|False
    }
}

Response Structure

  • (dict) --

    • Table (dict) --

      The Table object that defines the specified table.

      • Name (string) --

        The table name. For Hive compatibility, this must be entirely lowercase.

      • DatabaseName (string) --

        The name of the database where the table metadata resides. For Hive compatibility, this must be all lowercase.

      • Description (string) --

        A description of the table.

      • Owner (string) --

        The owner of the table.

      • CreateTime (datetime) --

        The time when the table definition was created in the Data Catalog.

      • UpdateTime (datetime) --

        The last time that the table was updated.

      • LastAccessTime (datetime) --

        The last time that the table was accessed. This is usually taken from HDFS, and might not be reliable.

      • LastAnalyzedTime (datetime) --

        The last time that column statistics were computed for this table.

      • Retention (integer) --

        The retention time for this table.

      • StorageDescriptor (dict) --

        A storage descriptor containing information about the physical storage of this table.

        • Columns (list) --

          A list of the Columns in the table.

          • (dict) --

            A column in a Table .

            • Name (string) --

              The name of the Column .

            • Type (string) --

              The data type of the Column .

            • Comment (string) --

              A free-form text comment.

            • Parameters (dict) --

              These key-value pairs define properties associated with the column.

              • (string) --

                • (string) --

        • Location (string) --

          The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

        • InputFormat (string) --

          The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

        • OutputFormat (string) --

          The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

        • Compressed (boolean) --

          True if the data in the table is compressed, or False if not.

        • NumberOfBuckets (integer) --

          Must be specified if the table contains any dimension columns.

        • SerdeInfo (dict) --

          The serialization/deserialization (SerDe) information.

          • Name (string) --

            Name of the SerDe.

          • SerializationLibrary (string) --

            Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

          • Parameters (dict) --

            These key-value pairs define initialization parameters for the SerDe.

            • (string) --

              • (string) --

        • BucketColumns (list) --

          A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

          • (string) --

        • SortColumns (list) --

          A list specifying the sort order of each bucket in the table.

          • (dict) --

            Specifies the sort order of a sorted column.

            • Column (string) --

              The name of the column.

            • SortOrder (integer) --

              Indicates that the column is sorted in ascending order ( == 1 ) or in descending order ( == 0 ).

        • Parameters (dict) --

          The user-supplied properties in key-value form.

          • (string) --

            • (string) --

        • SkewedInfo (dict) --

          The information about values that appear frequently in a column (skewed values).

          • SkewedColumnNames (list) --

            A list of names of columns that contain skewed values.

            • (string) --

          • SkewedColumnValues (list) --

            A list of values that appear so frequently as to be considered skewed.

            • (string) --

          • SkewedColumnValueLocationMaps (dict) --

            A mapping of skewed values to the columns that contain them.

            • (string) --

              • (string) --

        • StoredAsSubDirectories (boolean) --

          True if the table data is stored in subdirectories, or False if not.

      • PartitionKeys (list) --

        A list of columns by which the table is partitioned. Only primitive types are supported as partition keys.

        When you create a table used by Amazon Athena and you do not specify any PartitionKeys , you must at least set the value of PartitionKeys to an empty list. For example:

        "PartitionKeys": []

        • (dict) --

          A column in a Table .

          • Name (string) --

            The name of the Column .

          • Type (string) --

            The data type of the Column .

          • Comment (string) --

            A free-form text comment.

          • Parameters (dict) --

            These key-value pairs define properties associated with the column.

            • (string) --

              • (string) --

      • ViewOriginalText (string) --

        If the table is a view, the original text of the view; otherwise null .

      • ViewExpandedText (string) --

        If the table is a view, the expanded text of the view; otherwise null .

      • TableType (string) --

        The type of this table ( EXTERNAL_TABLE , VIRTUAL_VIEW , etc.).

      • Parameters (dict) --

        These key-value pairs define properties associated with the table.

        • (string) --

          • (string) --

      • CreatedBy (string) --

        The person or entity who created the table.

      • IsRegisteredWithLakeFormation (boolean) --

        Indicates whether the table has been registered with AWS Lake Formation.
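
As a rough illustration of how the response structure above might be consumed, the sketch below pulls a few fields out of a response dict that has a top-level Table key. The helper name summarize_table and the sample data are hypothetical and hand-written to match the documented shape; they are not output from a real call. The sketch assumes optional keys may be absent and falls back to empty values.

```python
def summarize_table(response):
    """Extract a few fields from a response dict that has a top-level 'Table' key."""
    table = response["Table"]
    storage = table.get("StorageDescriptor", {})
    return {
        "name": table["Name"],
        # Each column is a dict with 'Name' and, optionally, 'Type', 'Comment', 'Parameters'.
        "columns": [(col["Name"], col.get("Type")) for col in storage.get("Columns", [])],
        "partition_keys": [key["Name"] for key in table.get("PartitionKeys", [])],
        # Treating a missing flag as False is an assumption of this sketch.
        "registered_with_lake_formation": table.get("IsRegisteredWithLakeFormation", False),
    }


# Hypothetical response, hand-written to match the structure documented above.
sample_response = {
    "Table": {
        "Name": "events",
        "DatabaseName": "analytics",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "user_id", "Type": "bigint"},
                {"Name": "payload", "Type": "string", "Parameters": {"pii": "false"}},
            ],
        },
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "IsRegisteredWithLakeFormation": True,
    }
}

summary = summarize_table(sample_response)
print(summary)
```

With a real client, the same helper could be applied to the dict returned by the call; the per-column Parameters map and the IsRegisteredWithLakeFormation flag are the fields this release adds to the response.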

GetTableVersion (updated)
Changes (response)
{'TableVersion': {'Table': {'IsRegisteredWithLakeFormation': 'boolean',
                            'PartitionKeys': {'Parameters': {'string': 'string'}},
                            'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}}

Retrieves a specified version of a table.

See also: AWS API Documentation

Request Syntax

client.get_table_version(
    CatalogId='string',
    DatabaseName='string',
    TableName='string',
    VersionId='string'
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the tables reside. If none is provided, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The database in the catalog in which the table resides. For Hive compatibility, this name is entirely lowercase.

type TableName

string

param TableName

[REQUIRED]

The name of the table. For Hive compatibility, this name is entirely lowercase.

type VersionId

string

param VersionId

The ID value of the table version to be retrieved. A VersionId is a string representation of an integer. Each version is incremented by 1.

rtype

dict

returns

Response Syntax

{
    'TableVersion': {
        'Table': {
            'Name': 'string',
            'DatabaseName': 'string',
            'Description': 'string',
            'Owner': 'string',
            'CreateTime': datetime(2015, 1, 1),
            'UpdateTime': datetime(2015, 1, 1),
            'LastAccessTime': datetime(2015, 1, 1),
            'LastAnalyzedTime': datetime(2015, 1, 1),
            'Retention': 123,
            'StorageDescriptor': {
                'Columns': [
                    {
                        'Name': 'string',
                        'Type': 'string',
                        'Comment': 'string',
                        'Parameters': {
                            'string': 'string'
                        }
                    },
                ],
                'Location': 'string',
                'InputFormat': 'string',
                'OutputFormat': 'string',
                'Compressed': True|False,
                'NumberOfBuckets': 123,
                'SerdeInfo': {
                    'Name': 'string',
                    'SerializationLibrary': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
                'BucketColumns': [
                    'string',
                ],
                'SortColumns': [
                    {
                        'Column': 'string',
                        'SortOrder': 123
                    },
                ],
                'Parameters': {
                    'string': 'string'
                },
                'SkewedInfo': {
                    'SkewedColumnNames': [
                        'string',
                    ],
                    'SkewedColumnValues': [
                        'string',
                    ],
                    'SkewedColumnValueLocationMaps': {
                        'string': 'string'
                    }
                },
                'StoredAsSubDirectories': True|False
            },
            'PartitionKeys': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'ViewOriginalText': 'string',
            'ViewExpandedText': 'string',
            'TableType': 'string',
            'Parameters': {
                'string': 'string'
            },
            'CreatedBy': 'string',
            'IsRegisteredWithLakeFormation': True|False
        },
        'VersionId': 'string'
    }
}

Response Structure

  • (dict) --

    • TableVersion (dict) --

      The requested table version.

      • Table (dict) --

        The table in question.

        • Name (string) --

          The table name. For Hive compatibility, this must be entirely lowercase.

        • DatabaseName (string) --

          The name of the database where the table metadata resides. For Hive compatibility, this must be all lowercase.

        • Description (string) --

          A description of the table.

        • Owner (string) --

          The owner of the table.

        • CreateTime (datetime) --

          The time when the table definition was created in the Data Catalog.

        • UpdateTime (datetime) --

          The last time that the table was updated.

        • LastAccessTime (datetime) --

          The last time that the table was accessed. This is usually taken from HDFS, and might not be reliable.

        • LastAnalyzedTime (datetime) --

          The last time that column statistics were computed for this table.

        • Retention (integer) --

          The retention time for this table.

        • StorageDescriptor (dict) --

          A storage descriptor containing information about the physical storage of this table.

          • Columns (list) --

            A list of the Columns in the table.

            • (dict) --

              A column in a Table .

              • Name (string) --

                The name of the Column .

              • Type (string) --

                The data type of the Column .

              • Comment (string) --

                A free-form text comment.

              • Parameters (dict) --

                These key-value pairs define properties associated with the column.

                • (string) --

                  • (string) --

          • Location (string) --

            The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

          • InputFormat (string) --

            The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

          • OutputFormat (string) --

            The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

          • Compressed (boolean) --

            True if the data in the table is compressed, or False if not.

          • NumberOfBuckets (integer) --

            Must be specified if the table contains any dimension columns.

          • SerdeInfo (dict) --

            The serialization/deserialization (SerDe) information.

            • Name (string) --

              Name of the SerDe.

            • SerializationLibrary (string) --

              Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

            • Parameters (dict) --

              These key-value pairs define initialization parameters for the SerDe.

              • (string) --

                • (string) --

          • BucketColumns (list) --

            A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

            • (string) --

          • SortColumns (list) --

            A list specifying the sort order of each bucket in the table.

            • (dict) --

              Specifies the sort order of a sorted column.

              • Column (string) --

                The name of the column.

              • SortOrder (integer) --

                Indicates that the column is sorted in ascending order ( == 1 ) or in descending order ( == 0 ).

          • Parameters (dict) --

            The user-supplied properties in key-value form.

            • (string) --

              • (string) --

          • SkewedInfo (dict) --

            The information about values that appear frequently in a column (skewed values).

            • SkewedColumnNames (list) --

              A list of names of columns that contain skewed values.

              • (string) --

            • SkewedColumnValues (list) --

              A list of values that appear so frequently as to be considered skewed.

              • (string) --

            • SkewedColumnValueLocationMaps (dict) --

              A mapping of skewed values to the columns that contain them.

              • (string) --

                • (string) --

          • StoredAsSubDirectories (boolean) --

            True if the table data is stored in subdirectories, or False if not.

        • PartitionKeys (list) --

          A list of columns by which the table is partitioned. Only primitive types are supported as partition keys.

          When you create a table used by Amazon Athena and you do not specify any PartitionKeys , you must at least set the value of PartitionKeys to an empty list. For example:

          "PartitionKeys": []

          • (dict) --

            A column in a Table .

            • Name (string) --

              The name of the Column .

            • Type (string) --

              The data type of the Column .

            • Comment (string) --

              A free-form text comment.

            • Parameters (dict) --

              These key-value pairs define properties associated with the column.

              • (string) --

                • (string) --

        • ViewOriginalText (string) --

          If the table is a view, the original text of the view; otherwise null .

        • ViewExpandedText (string) --

          If the table is a view, the expanded text of the view; otherwise null .

        • TableType (string) --

          The type of this table ( EXTERNAL_TABLE , VIRTUAL_VIEW , etc.).

        • Parameters (dict) --

          These key-value pairs define properties associated with the table.

          • (string) --

            • (string) --

        • CreatedBy (string) --

          The person or entity who created the table.

        • IsRegisteredWithLakeFormation (boolean) --

          Indicates whether the table has been registered with AWS Lake Formation.

      • VersionId (string) --

        The ID value that identifies this table version. A VersionId is a string representation of an integer. Each version is incremented by 1.
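
Because a VersionId is a string that represents an integer, comparing version IDs as plain strings can give a misleading order. The sketch below shows one way to compare them numerically. The helper names latest_version_id and sorted_version_ids are hypothetical, and the input list is made-up sample data.

```python
def sorted_version_ids(version_ids):
    """Sort VersionId strings by their integer value, oldest first."""
    return sorted(version_ids, key=int)


def latest_version_id(version_ids):
    """Return the highest VersionId, comparing the values as integers."""
    if not version_ids:
        raise ValueError("version_ids must not be empty")
    return max(version_ids, key=int)


ids = ["9", "10", "2"]          # hypothetical VersionId values
print(sorted_version_ids(ids))  # ['2', '9', '10']
print(latest_version_id(ids))   # 10
print(max(ids))                 # 9, because plain string comparison is lexicographic
```

The value returned by latest_version_id could then be passed as the VersionId argument of client.get_table_version, alongside DatabaseName and TableName.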

GetTableVersions (updated)
Changes (response)
{'TableVersions': {'Table': {'IsRegisteredWithLakeFormation': 'boolean',
                             'PartitionKeys': {'Parameters': {'string': 'string'}},
                             'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}}

Retrieves a list of table version objects, each identifying an available version of a specified table.

See also: AWS API Documentation

Request Syntax

client.get_table_versions(
    CatalogId='string',
    DatabaseName='string',
    TableName='string',
    NextToken='string',
    MaxResults=123
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the tables reside. If none is provided, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The database in the catalog in which the table resides. For Hive compatibility, this name is entirely lowercase.

type TableName

string

param TableName

[REQUIRED]

The name of the table. For Hive compatibility, this name is entirely lowercase.

type NextToken

string

param NextToken

A continuation token, if this is not the first call.

type MaxResults

integer

param MaxResults

The maximum number of table versions to return in one response.

rtype

dict

returns

Response Syntax

{
    'TableVersions': [
        {
            'Table': {
                'Name': 'string',
                'DatabaseName': 'string',
                'Description': 'string',
                'Owner': 'string',
                'CreateTime': datetime(2015, 1, 1),
                'UpdateTime': datetime(2015, 1, 1),
                'LastAccessTime': datetime(2015, 1, 1),
                'LastAnalyzedTime': datetime(2015, 1, 1),
                'Retention': 123,
                'StorageDescriptor': {
                    'Columns': [
                        {
                            'Name': 'string',
                            'Type': 'string',
                            'Comment': 'string',
                            'Parameters': {
                                'string': 'string'
                            }
                        },
                    ],
                    'Location': 'string',
                    'InputFormat': 'string',
                    'OutputFormat': 'string',
                    'Compressed': True|False,
                    'NumberOfBuckets': 123,
                    'SerdeInfo': {
                        'Name': 'string',
                        'SerializationLibrary': 'string',
                        'Parameters': {
                            'string': 'string'
                        }
                    },
                    'BucketColumns': [
                        'string',
                    ],
                    'SortColumns': [
                        {
                            'Column': 'string',
                            'SortOrder': 123
                        },
                    ],
                    'Parameters': {
                        'string': 'string'
                    },
                    'SkewedInfo': {
                        'SkewedColumnNames': [
                            'string',
                        ],
                        'SkewedColumnValues': [
                            'string',
                        ],
                        'SkewedColumnValueLocationMaps': {
                            'string': 'string'
                        }
                    },
                    'StoredAsSubDirectories': True|False
                },
                'PartitionKeys': [
                    {
                        'Name': 'string',
                        'Type': 'string',
                        'Comment': 'string',
                        'Parameters': {
                            'string': 'string'
                        }
                    },
                ],
                'ViewOriginalText': 'string',
                'ViewExpandedText': 'string',
                'TableType': 'string',
                'Parameters': {
                    'string': 'string'
                },
                'CreatedBy': 'string',
                'IsRegisteredWithLakeFormation': True|False
            },
            'VersionId': 'string'
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • TableVersions (list) --

      A list of table version objects identifying available versions of the specified table.

      • (dict) --

        Specifies a version of a table.

        • Table (dict) --

          The table in question.

          • Name (string) --

            The table name. For Hive compatibility, this must be entirely lowercase.

          • DatabaseName (string) --

            The name of the database where the table metadata resides. For Hive compatibility, this must be all lowercase.

          • Description (string) --

            A description of the table.

          • Owner (string) --

            The owner of the table.

          • CreateTime (datetime) --

            The time when the table definition was created in the Data Catalog.

          • UpdateTime (datetime) --

            The last time that the table was updated.

          • LastAccessTime (datetime) --

            The last time that the table was accessed. This is usually taken from HDFS, and might not be reliable.

          • LastAnalyzedTime (datetime) --

            The last time that column statistics were computed for this table.

          • Retention (integer) --

            The retention time for this table.

          • StorageDescriptor (dict) --

            A storage descriptor containing information about the physical storage of this table.

            • Columns (list) --

              A list of the Columns in the table.

              • (dict) --

                A column in a Table .

                • Name (string) --

                  The name of the Column .

                • Type (string) --

                  The data type of the Column .

                • Comment (string) --

                  A free-form text comment.

                • Parameters (dict) --

                  These key-value pairs define properties associated with the column.

                  • (string) --

                    • (string) --

            • Location (string) --

              The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

            • InputFormat (string) --

              The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

            • OutputFormat (string) --

              The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

            • Compressed (boolean) --

              True if the data in the table is compressed, or False if not.

            • NumberOfBuckets (integer) --

              Must be specified if the table contains any dimension columns.

            • SerdeInfo (dict) --

              The serialization/deserialization (SerDe) information.

              • Name (string) --

                Name of the SerDe.

              • SerializationLibrary (string) --

                Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

              • Parameters (dict) --

                These key-value pairs define initialization parameters for the SerDe.

                • (string) --

                  • (string) --

            • BucketColumns (list) --

              A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

              • (string) --

            • SortColumns (list) --

              A list specifying the sort order of each bucket in the table.

              • (dict) --

                Specifies the sort order of a sorted column.

                • Column (string) --

                  The name of the column.

                • SortOrder (integer) --

                  Indicates that the column is sorted in ascending order ( == 1 ) or in descending order ( == 0 ).

            • Parameters (dict) --

              The user-supplied properties in key-value form.

              • (string) --

                • (string) --

            • SkewedInfo (dict) --

              The information about values that appear frequently in a column (skewed values).

              • SkewedColumnNames (list) --

                A list of names of columns that contain skewed values.

                • (string) --

              • SkewedColumnValues (list) --

                A list of values that appear so frequently as to be considered skewed.

                • (string) --

              • SkewedColumnValueLocationMaps (dict) --

                A mapping of skewed values to the columns that contain them.

                • (string) --

                  • (string) --

            • StoredAsSubDirectories (boolean) --

              True if the table data is stored in subdirectories, or False if not.

          • PartitionKeys (list) --

            A list of columns by which the table is partitioned. Only primitive types are supported as partition keys.

            When you create a table used by Amazon Athena and you do not specify any PartitionKeys , you must at least set the value of PartitionKeys to an empty list. For example:

            "PartitionKeys": []

            • (dict) --

              A column in a Table .

              • Name (string) --

                The name of the Column .

              • Type (string) --

                The data type of the Column .

              • Comment (string) --

                A free-form text comment.

              • Parameters (dict) --

                These key-value pairs define properties associated with the column.

                • (string) --

                  • (string) --

          • ViewOriginalText (string) --

            If the table is a view, the original text of the view; otherwise null .

          • ViewExpandedText (string) --

            If the table is a view, the expanded text of the view; otherwise null .

          • TableType (string) --

            The type of this table ( EXTERNAL_TABLE , VIRTUAL_VIEW , etc.).

          • Parameters (dict) --

            These key-value pairs define properties associated with the table.

            • (string) --

              • (string) --

          • CreatedBy (string) --

            The person or entity who created the table.

          • IsRegisteredWithLakeFormation (boolean) --

            Indicates whether the table has been registered with AWS Lake Formation.

        • VersionId (string) --

          The ID value that identifies this table version. A VersionId is a string representation of an integer. Each version is incremented by 1.

    • NextToken (string) --

      A continuation token, if the list of available versions does not include the last one.
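
The NextToken handling described above can be sketched as a small generator that keeps calling get_table_versions until no token comes back. The function name iter_table_versions and the StubGlueClient class are hypothetical. The stub serves two hard-coded pages so that the example runs without AWS credentials; a real client created with boto3 would be passed in its place.

```python
def iter_table_versions(client, database_name, table_name, max_results=None):
    """Yield every TableVersion dict, following NextToken until it is absent."""
    kwargs = {"DatabaseName": database_name, "TableName": table_name}
    if max_results is not None:
        kwargs["MaxResults"] = max_results
    while True:
        page = client.get_table_versions(**kwargs)
        yield from page.get("TableVersions", [])
        token = page.get("NextToken")
        if not token:
            return
        # Pass the token back unchanged to request the next page.
        kwargs["NextToken"] = token


class StubGlueClient:
    """Stand-in for a Glue client that serves two hard-coded pages (hypothetical data)."""

    def __init__(self):
        self.calls = []

    def get_table_versions(self, **kwargs):
        self.calls.append(dict(kwargs))
        if "NextToken" not in kwargs:
            return {
                "TableVersions": [{"VersionId": "0"}, {"VersionId": "1"}],
                "NextToken": "page-2",
            }
        return {"TableVersions": [{"VersionId": "2"}]}


stub = StubGlueClient()
version_ids = [v["VersionId"] for v in iter_table_versions(stub, "analytics", "events")]
print(version_ids)  # ['0', '1', '2']
```

Writing the loop as a generator lets callers stop early without fetching the remaining pages.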

GetTables (updated)
Changes (response)
{'TableList': {'IsRegisteredWithLakeFormation': 'boolean',
               'PartitionKeys': {'Parameters': {'string': 'string'}},
               'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Retrieves the definitions of some or all of the tables in a given Database .

See also: AWS API Documentation

Request Syntax

client.get_tables(
    CatalogId='string',
    DatabaseName='string',
    Expression='string',
    NextToken='string',
    MaxResults=123
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the tables reside. If none is provided, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The database in the catalog whose tables to list. For Hive compatibility, this name is entirely lowercase.

type Expression

string

param Expression

A regular expression pattern. If present, only those tables whose names match the pattern are returned.

type NextToken

string

param NextToken

A continuation token, included if this is a continuation call.

type MaxResults

integer

param MaxResults

The maximum number of tables to return in a single response.

rtype

dict

returns

Response Syntax

{
    'TableList': [
        {
            'Name': 'string',
            'DatabaseName': 'string',
            'Description': 'string',
            'Owner': 'string',
            'CreateTime': datetime(2015, 1, 1),
            'UpdateTime': datetime(2015, 1, 1),
            'LastAccessTime': datetime(2015, 1, 1),
            'LastAnalyzedTime': datetime(2015, 1, 1),
            'Retention': 123,
            'StorageDescriptor': {
                'Columns': [
                    {
                        'Name': 'string',
                        'Type': 'string',
                        'Comment': 'string',
                        'Parameters': {
                            'string': 'string'
                        }
                    },
                ],
                'Location': 'string',
                'InputFormat': 'string',
                'OutputFormat': 'string',
                'Compressed': True|False,
                'NumberOfBuckets': 123,
                'SerdeInfo': {
                    'Name': 'string',
                    'SerializationLibrary': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
                'BucketColumns': [
                    'string',
                ],
                'SortColumns': [
                    {
                        'Column': 'string',
                        'SortOrder': 123
                    },
                ],
                'Parameters': {
                    'string': 'string'
                },
                'SkewedInfo': {
                    'SkewedColumnNames': [
                        'string',
                    ],
                    'SkewedColumnValues': [
                        'string',
                    ],
                    'SkewedColumnValueLocationMaps': {
                        'string': 'string'
                    }
                },
                'StoredAsSubDirectories': True|False
            },
            'PartitionKeys': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'ViewOriginalText': 'string',
            'ViewExpandedText': 'string',
            'TableType': 'string',
            'Parameters': {
                'string': 'string'
            },
            'CreatedBy': 'string',
            'IsRegisteredWithLakeFormation': True|False
        },
    ],
    'NextToken': 'string'
}
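
A minimal sketch of how the request parameters and the TableList response might be used together: it pages through get_tables, optionally passing an Expression, and collects the names of tables whose IsRegisteredWithLakeFormation flag is true. The function name lake_formation_tables, the StubGlueClient class, and the sample tables are hypothetical. The stub returns fixed pages and does not evaluate the Expression, since that filtering is done by the service.

```python
def lake_formation_tables(client, database_name, expression=None):
    """Return names of tables whose IsRegisteredWithLakeFormation flag is True."""
    kwargs = {"DatabaseName": database_name}
    if expression is not None:
        kwargs["Expression"] = expression
    names = []
    while True:
        page = client.get_tables(**kwargs)
        for table in page.get("TableList", []):
            # A missing flag is treated as False; this is an assumption of the sketch.
            if table.get("IsRegisteredWithLakeFormation"):
                names.append(table["Name"])
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


class StubGlueClient:
    """Stand-in for a Glue client that serves two hard-coded pages (hypothetical data)."""

    def __init__(self):
        self.calls = []

    def get_tables(self, **kwargs):
        self.calls.append(dict(kwargs))
        if "NextToken" not in kwargs:
            return {
                "TableList": [
                    {"Name": "events", "IsRegisteredWithLakeFormation": True},
                    {"Name": "events_raw", "IsRegisteredWithLakeFormation": False},
                ],
                "NextToken": "page-2",
            }
        return {"TableList": [{"Name": "events_daily", "IsRegisteredWithLakeFormation": True}]}


stub = StubGlueClient()
registered = lake_formation_tables(stub, "analytics", expression="events.*")
print(registered)  # ['events', 'events_daily']
```

With a real client, the Expression value is sent as-is and the service returns only tables whose names match the pattern.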

Response Structure

  • (dict) --

    • TableList (list) --

      A list of the requested Table objects.

      • (dict) --

        Represents a collection of related data organized in columns and rows.

        • Name (string) --

          The table name. For Hive compatibility, this must be entirely lowercase.

        • DatabaseName (string) --

          The name of the database where the table metadata resides. For Hive compatibility, this must be all lowercase.

        • Description (string) --

          A description of the table.

        • Owner (string) --

          The owner of the table.

        • CreateTime (datetime) --

          The time when the table definition was created in the Data Catalog.

        • UpdateTime (datetime) --

          The last time that the table was updated.

        • LastAccessTime (datetime) --

          The last time that the table was accessed. This is usually taken from HDFS, and might not be reliable.

        • LastAnalyzedTime (datetime) --

          The last time that column statistics were computed for this table.

        • Retention (integer) --

          The retention time for this table.

        • StorageDescriptor (dict) --

          A storage descriptor containing information about the physical storage of this table.

          • Columns (list) --

            A list of the Columns in the table.

            • (dict) --

              A column in a Table .

              • Name (string) --

                The name of the Column .

              • Type (string) --

                The data type of the Column .

              • Comment (string) --

                A free-form text comment.

              • Parameters (dict) --

                These key-value pairs define properties associated with the column.

                • (string) --

                  • (string) --

          • Location (string) --

            The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

          • InputFormat (string) --

            The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

          • OutputFormat (string) --

            The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

          • Compressed (boolean) --

            True if the data in the table is compressed, or False if not.

          • NumberOfBuckets (integer) --

            Must be specified if the table contains any dimension columns.

          • SerdeInfo (dict) --

            The serialization/deserialization (SerDe) information.

            • Name (string) --

              Name of the SerDe.

            • SerializationLibrary (string) --

              Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

            • Parameters (dict) --

              These key-value pairs define initialization parameters for the SerDe.

              • (string) --

                • (string) --

          • BucketColumns (list) --

            A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

            • (string) --

          • SortColumns (list) --

            A list specifying the sort order of each bucket in the table.

            • (dict) --

              Specifies the sort order of a sorted column.

              • Column (string) --

                The name of the column.

              • SortOrder (integer) --

                Indicates that the column is sorted in ascending order (== 1) or in descending order (== 0).

          • Parameters (dict) --

            The user-supplied properties in key-value form.

            • (string) --

              • (string) --

          • SkewedInfo (dict) --

            The information about values that appear frequently in a column (skewed values).

            • SkewedColumnNames (list) --

              A list of names of columns that contain skewed values.

              • (string) --

            • SkewedColumnValues (list) --

              A list of values that appear so frequently as to be considered skewed.

              • (string) --

            • SkewedColumnValueLocationMaps (dict) --

              A mapping of skewed values to the columns that contain them.

              • (string) --

                • (string) --

          • StoredAsSubDirectories (boolean) --

            True if the table data is stored in subdirectories, or False if not.

        • PartitionKeys (list) --

          A list of columns by which the table is partitioned. Only primitive types are supported as partition keys.

          When you create a table used by Amazon Athena and you do not specify any PartitionKeys, you must at least set the value of PartitionKeys to an empty list. For example:

          "PartitionKeys": []

          • (dict) --

            A column in a Table .

            • Name (string) --

              The name of the Column .

            • Type (string) --

              The data type of the Column .

            • Comment (string) --

              A free-form text comment.

            • Parameters (dict) --

              These key-value pairs define properties associated with the column.

              • (string) --

                • (string) --

        • ViewOriginalText (string) --

          If the table is a view, the original text of the view; otherwise null.

        • ViewExpandedText (string) --

          If the table is a view, the expanded text of the view; otherwise null.

        • TableType (string) --

          The type of this table (EXTERNAL_TABLE, VIRTUAL_VIEW, etc.).

        • Parameters (dict) --

          These key-value pairs define properties associated with the table.

          • (string) --

            • (string) --

        • CreatedBy (string) --

          The person or entity who created the table.

        • IsRegisteredWithLakeFormation (boolean) --

          Indicates whether the table has been registered with AWS Lake Formation.

    • NextToken (string) --

      A continuation token, present if the current list segment is not the last.
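
The NextToken field in the response above can be fed back into the next request to walk through every page of results. The following is a minimal sketch, not part of the Glue API: iter_tables is a hypothetical helper name, and fetch_page is assumed to be any callable that accepts keyword arguments and returns a dict with the TableList / NextToken shape shown above (for example, a bound client method such as client.search_tables).

```python
def iter_tables(fetch_page, **kwargs):
    """Yield every table dict from a paginated call.

    fetch_page -- hypothetical parameter: a callable returning a dict with
                  'TableList' and, when more pages exist, 'NextToken'.
    kwargs     -- request parameters passed unchanged on every call.
    """
    token = None
    while True:
        page_kwargs = dict(kwargs)
        if token:
            # Continuation call: pass back the token from the previous page.
            page_kwargs["NextToken"] = token
        response = fetch_page(**page_kwargs)
        for table in response.get("TableList", []):
            yield table
        token = response.get("NextToken")
        if not token:
            # No token means the current list segment was the last one.
            break


# Usage sketch (assumes boto3 is installed and credentials are configured):
#   import boto3
#   glue = boto3.client("glue")
#   for table in iter_tables(glue.search_tables, SearchText="sales"):
#       print(table["DatabaseName"], table["Name"])
```

Passing the method as a callable keeps the helper independent of which table-listing operation produced the response.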

UpdateDatabase (updated) Link ¶
Changes (request)
{'DatabaseInput': {'CreateTableDefaultPermissions': [{'Permissions': ['ALL | '
                                                                      'SELECT '
                                                                      '| ALTER '
                                                                      '| DROP '
                                                                      '| '
                                                                      'DELETE '
                                                                      '| '
                                                                      'INSERT '
                                                                      '| '
                                                                      'CREATE_DATABASE '
                                                                      '| '
                                                                      'CREATE_TABLE '
                                                                      '| '
                                                                      'DATA_LOCATION_ACCESS'],
                                                      'Principal': {'DataLakePrincipalIdentifier': 'string'}}]}}

Updates an existing database definition in a Data Catalog.

See also: AWS API Documentation

Request Syntax

client.update_database(
    CatalogId='string',
    Name='string',
    DatabaseInput={
        'Name': 'string',
        'Description': 'string',
        'LocationUri': 'string',
        'Parameters': {
            'string': 'string'
        },
        'CreateTableDefaultPermissions': [
            {
                'Principal': {
                    'DataLakePrincipalIdentifier': 'string'
                },
                'Permissions': [
                    'ALL'|'SELECT'|'ALTER'|'DROP'|'DELETE'|'INSERT'|'CREATE_DATABASE'|'CREATE_TABLE'|'DATA_LOCATION_ACCESS',
                ]
            },
        ]
    }
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog in which the metadata database resides. If none is provided, the AWS account ID is used by default.

type Name

string

param Name

[REQUIRED]

The name of the database to update in the catalog. For Hive compatibility, this is folded to lowercase.

type DatabaseInput

dict

param DatabaseInput

[REQUIRED]

A DatabaseInput object specifying the new definition of the metadata database in the catalog.

  • Name (string) -- [REQUIRED]

    The name of the database. For Hive compatibility, this is folded to lowercase when it is stored.

  • Description (string) --

    A description of the database.

  • LocationUri (string) --

    The location of the database (for example, an HDFS path).

  • Parameters (dict) --

    These key-value pairs define parameters and properties of the database.

    • (string) --

      • (string) --

  • CreateTableDefaultPermissions (list) --

    Creates a set of default permissions on the table for principals.

    • (dict) --

      Permissions granted to a principal.

      • Principal (dict) --

        The principal who is granted permissions.

        • DataLakePrincipalIdentifier (string) --

          An identifier for the AWS Lake Formation principal.

      • Permissions (list) --

        The permissions that are granted to the principal.

        • (string) --

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
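
As an illustration of the request shape above, the sketch below assembles the keyword arguments for update_database, including the new CreateTableDefaultPermissions field. The helper name build_update_database_request and its parameters are hypothetical; the permission names are copied from the request syntax above. The local validation is a convenience of this sketch, not something the service is documented here to require.

```python
# Permission values listed in the request syntax above.
ALLOWED_PERMISSIONS = {
    "ALL", "SELECT", "ALTER", "DROP", "DELETE", "INSERT",
    "CREATE_DATABASE", "CREATE_TABLE", "DATA_LOCATION_ACCESS",
}


def build_update_database_request(name, principal_id, permissions,
                                  description=None):
    """Return kwargs for client.update_database (hypothetical helper)."""
    unknown = set(permissions) - ALLOWED_PERMISSIONS
    if unknown:
        raise ValueError("Unknown permissions: %s" % sorted(unknown))

    database_input = {
        "Name": name,  # required inside DatabaseInput
        "CreateTableDefaultPermissions": [
            {
                "Principal": {"DataLakePrincipalIdentifier": principal_id},
                "Permissions": list(permissions),
            }
        ],
    }
    if description is not None:
        database_input["Description"] = description

    # CatalogId is omitted; per the text above, the AWS account ID is
    # then used by default.
    return {"Name": name, "DatabaseInput": database_input}


# Usage sketch (assumes boto3 and valid credentials; the principal
# identifier below is a made-up placeholder):
#   import boto3
#   request = build_update_database_request(
#       "analytics", "example-principal", ["SELECT", "ALTER"])
#   boto3.client("glue").update_database(**request)
```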

UpdatePartition (updated) Link ¶
Changes (request)
{'PartitionInput': {'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Updates a partition.

See also: AWS API Documentation

Request Syntax

client.update_partition(
    CatalogId='string',
    DatabaseName='string',
    TableName='string',
    PartitionValueList=[
        'string',
    ],
    PartitionInput={
        'Values': [
            'string',
        ],
        'LastAccessTime': datetime(2015, 1, 1),
        'StorageDescriptor': {
            'Columns': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'Location': 'string',
            'InputFormat': 'string',
            'OutputFormat': 'string',
            'Compressed': True|False,
            'NumberOfBuckets': 123,
            'SerdeInfo': {
                'Name': 'string',
                'SerializationLibrary': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
            'BucketColumns': [
                'string',
            ],
            'SortColumns': [
                {
                    'Column': 'string',
                    'SortOrder': 123
                },
            ],
            'Parameters': {
                'string': 'string'
            },
            'SkewedInfo': {
                'SkewedColumnNames': [
                    'string',
                ],
                'SkewedColumnValues': [
                    'string',
                ],
                'SkewedColumnValueLocationMaps': {
                    'string': 'string'
                }
            },
            'StoredAsSubDirectories': True|False
        },
        'Parameters': {
            'string': 'string'
        },
        'LastAnalyzedTime': datetime(2015, 1, 1)
    }
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the partition to be updated resides. If none is provided, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the catalog database in which the table in question resides.

type TableName

string

param TableName

[REQUIRED]

The name of the table in which the partition to be updated is located.

type PartitionValueList

list

param PartitionValueList

[REQUIRED]

A list of the values defining the partition.

  • (string) --

type PartitionInput

dict

param PartitionInput

[REQUIRED]

The new partition object to update the partition to.

  • Values (list) --

    The values of the partition. Although this parameter is not required by the SDK, you must specify this parameter for a valid input.

    • (string) --

  • LastAccessTime (datetime) --

    The last time at which the partition was accessed.

  • StorageDescriptor (dict) --

    Provides information about the physical location where the partition is stored.

    • Columns (list) --

      A list of the Columns in the table.

      • (dict) --

        A column in a Table .

        • Name (string) -- [REQUIRED]

          The name of the Column .

        • Type (string) --

          The data type of the Column .

        • Comment (string) --

          A free-form text comment.

        • Parameters (dict) --

          These key-value pairs define properties associated with the column.

          • (string) --

            • (string) --

    • Location (string) --

      The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

    • InputFormat (string) --

      The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

    • OutputFormat (string) --

      The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

    • Compressed (boolean) --

      True if the data in the table is compressed, or False if not.

    • NumberOfBuckets (integer) --

      Must be specified if the table contains any dimension columns.

    • SerdeInfo (dict) --

      The serialization/deserialization (SerDe) information.

      • Name (string) --

        Name of the SerDe.

      • SerializationLibrary (string) --

        Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

      • Parameters (dict) --

        These key-value pairs define initialization parameters for the SerDe.

        • (string) --

          • (string) --

    • BucketColumns (list) --

      A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

      • (string) --

    • SortColumns (list) --

      A list specifying the sort order of each bucket in the table.

      • (dict) --

        Specifies the sort order of a sorted column.

        • Column (string) -- [REQUIRED]

          The name of the column.

        • SortOrder (integer) -- [REQUIRED]

          Indicates that the column is sorted in ascending order (== 1) or in descending order (== 0).

    • Parameters (dict) --

      The user-supplied properties in key-value form.

      • (string) --

        • (string) --

    • SkewedInfo (dict) --

      The information about values that appear frequently in a column (skewed values).

      • SkewedColumnNames (list) --

        A list of names of columns that contain skewed values.

        • (string) --

      • SkewedColumnValues (list) --

        A list of values that appear so frequently as to be considered skewed.

        • (string) --

      • SkewedColumnValueLocationMaps (dict) --

        A mapping of skewed values to the columns that contain them.

        • (string) --

          • (string) --

    • StoredAsSubDirectories (boolean) --

      True if the table data is stored in subdirectories, or False if not.

  • Parameters (dict) --

    These key-value pairs define partition parameters.

    • (string) --

      • (string) --

  • LastAnalyzedTime (datetime) --

    The last time at which column statistics were computed for this partition.

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
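
The PartitionInput structure above is deeply nested, so a small builder can make a call easier to read. This is a minimal sketch under stated assumptions: build_partition_input, ASCENDING, and DESCENDING are hypothetical names, and only a subset of the documented fields is filled in. The SortOrder values (1 and 0) and the rule that Values must be supplied come from the parameter descriptions above.

```python
ASCENDING = 1   # SortOrder value described above for ascending order
DESCENDING = 0  # SortOrder value described above for descending order


def build_partition_input(values, columns, location, sort_by=None):
    """Return a PartitionInput dict (hypothetical helper).

    values   -- partition values, e.g. ["2019", "08"]
    columns  -- iterable of (name, type) pairs
    location -- storage location string
    sort_by  -- optional iterable of (column_name, ASCENDING | DESCENDING)
    """
    if not values:
        # The SDK does not enforce this, but the docs above say Values
        # must be specified for the input to be valid.
        raise ValueError("Values must be specified")

    descriptor = {
        "Columns": [{"Name": name, "Type": type_} for name, type_ in columns],
        "Location": location,
    }
    if sort_by:
        descriptor["SortColumns"] = [
            {"Column": column, "SortOrder": order} for column, order in sort_by
        ]
    return {"Values": list(values), "StorageDescriptor": descriptor}


# Usage sketch (assumes boto3 and valid credentials; all names and the
# location below are placeholders):
#   import boto3
#   boto3.client("glue").update_partition(
#       DatabaseName="analytics",
#       TableName="events",
#       PartitionValueList=["2019", "08"],
#       PartitionInput=build_partition_input(
#           ["2019", "08"], [("id", "string")], "s3://example-bucket/events/",
#           sort_by=[("id", ASCENDING)]),
#   )
```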

UpdateTable (updated) Link ¶
Changes (request)
{'TableInput': {'PartitionKeys': {'Parameters': {'string': 'string'}},
                'StorageDescriptor': {'Columns': {'Parameters': {'string': 'string'}}}}}

Updates a metadata table in the Data Catalog.

See also: AWS API Documentation

Request Syntax

client.update_table(
    CatalogId='string',
    DatabaseName='string',
    TableInput={
        'Name': 'string',
        'Description': 'string',
        'Owner': 'string',
        'LastAccessTime': datetime(2015, 1, 1),
        'LastAnalyzedTime': datetime(2015, 1, 1),
        'Retention': 123,
        'StorageDescriptor': {
            'Columns': [
                {
                    'Name': 'string',
                    'Type': 'string',
                    'Comment': 'string',
                    'Parameters': {
                        'string': 'string'
                    }
                },
            ],
            'Location': 'string',
            'InputFormat': 'string',
            'OutputFormat': 'string',
            'Compressed': True|False,
            'NumberOfBuckets': 123,
            'SerdeInfo': {
                'Name': 'string',
                'SerializationLibrary': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
            'BucketColumns': [
                'string',
            ],
            'SortColumns': [
                {
                    'Column': 'string',
                    'SortOrder': 123
                },
            ],
            'Parameters': {
                'string': 'string'
            },
            'SkewedInfo': {
                'SkewedColumnNames': [
                    'string',
                ],
                'SkewedColumnValues': [
                    'string',
                ],
                'SkewedColumnValueLocationMaps': {
                    'string': 'string'
                }
            },
            'StoredAsSubDirectories': True|False
        },
        'PartitionKeys': [
            {
                'Name': 'string',
                'Type': 'string',
                'Comment': 'string',
                'Parameters': {
                    'string': 'string'
                }
            },
        ],
        'ViewOriginalText': 'string',
        'ViewExpandedText': 'string',
        'TableType': 'string',
        'Parameters': {
            'string': 'string'
        }
    },
    SkipArchive=True|False
)
type CatalogId

string

param CatalogId

The ID of the Data Catalog where the table resides. If none is provided, the AWS account ID is used by default.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The name of the catalog database in which the table resides. For Hive compatibility, this name is entirely lowercase.

type TableInput

dict

param TableInput

[REQUIRED]

An updated TableInput object to define the metadata table in the catalog.

  • Name (string) -- [REQUIRED]

    The table name. For Hive compatibility, this is folded to lowercase when it is stored.

  • Description (string) --

    A description of the table.

  • Owner (string) --

    The table owner.

  • LastAccessTime (datetime) --

    The last time that the table was accessed.

  • LastAnalyzedTime (datetime) --

    The last time that column statistics were computed for this table.

  • Retention (integer) --

    The retention time for this table.

  • StorageDescriptor (dict) --

    A storage descriptor containing information about the physical storage of this table.

    • Columns (list) --

      A list of the Columns in the table.

      • (dict) --

        A column in a Table .

        • Name (string) -- [REQUIRED]

          The name of the Column .

        • Type (string) --

          The data type of the Column .

        • Comment (string) --

          A free-form text comment.

        • Parameters (dict) --

          These key-value pairs define properties associated with the column.

          • (string) --

            • (string) --

    • Location (string) --

      The physical location of the table. By default, this takes the form of the warehouse location, followed by the database location in the warehouse, followed by the table name.

    • InputFormat (string) --

      The input format: SequenceFileInputFormat (binary), or TextInputFormat , or a custom format.

    • OutputFormat (string) --

      The output format: SequenceFileOutputFormat (binary), or IgnoreKeyTextOutputFormat , or a custom format.

    • Compressed (boolean) --

      True if the data in the table is compressed, or False if not.

    • NumberOfBuckets (integer) --

      Must be specified if the table contains any dimension columns.

    • SerdeInfo (dict) --

      The serialization/deserialization (SerDe) information.

      • Name (string) --

        Name of the SerDe.

      • SerializationLibrary (string) --

        Usually the class that implements the SerDe. An example is org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe .

      • Parameters (dict) --

        These key-value pairs define initialization parameters for the SerDe.

        • (string) --

          • (string) --

    • BucketColumns (list) --

      A list of reducer grouping columns, clustering columns, and bucketing columns in the table.

      • (string) --

    • SortColumns (list) --

      A list specifying the sort order of each bucket in the table.

      • (dict) --

        Specifies the sort order of a sorted column.

        • Column (string) -- [REQUIRED]

          The name of the column.

        • SortOrder (integer) -- [REQUIRED]

          Indicates that the column is sorted in ascending order (== 1) or in descending order (== 0).

    • Parameters (dict) --

      The user-supplied properties in key-value form.

      • (string) --

        • (string) --

    • SkewedInfo (dict) --

      The information about values that appear frequently in a column (skewed values).

      • SkewedColumnNames (list) --

        A list of names of columns that contain skewed values.

        • (string) --

      • SkewedColumnValues (list) --

        A list of values that appear so frequently as to be considered skewed.

        • (string) --

      • SkewedColumnValueLocationMaps (dict) --

        A mapping of skewed values to the columns that contain them.

        • (string) --

          • (string) --

    • StoredAsSubDirectories (boolean) --

      True if the table data is stored in subdirectories, or False if not.

  • PartitionKeys (list) --

    A list of columns by which the table is partitioned. Only primitive types are supported as partition keys.

    When you create a table used by Amazon Athena and you do not specify any PartitionKeys, you must at least set the value of PartitionKeys to an empty list. For example:

    "PartitionKeys": []

    • (dict) --

      A column in a Table .

      • Name (string) -- [REQUIRED]

        The name of the Column .

      • Type (string) --

        The data type of the Column .

      • Comment (string) --

        A free-form text comment.

      • Parameters (dict) --

        These key-value pairs define properties associated with the column.

        • (string) --

          • (string) --

  • ViewOriginalText (string) --

    If the table is a view, the original text of the view; otherwise null.

  • ViewExpandedText (string) --

    If the table is a view, the expanded text of the view; otherwise null.

  • TableType (string) --

    The type of this table (EXTERNAL_TABLE, VIRTUAL_VIEW, etc.).

  • Parameters (dict) --

    These key-value pairs define properties associated with the table.

    • (string) --

      • (string) --

type SkipArchive

boolean

param SkipArchive

By default, UpdateTable always creates an archived version of the table before updating it. However, if SkipArchive is set to True, UpdateTable does not create the archived version.

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
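
A Table object as returned in a response (see the TableList structure earlier in this document) carries fields such as DatabaseName, CreateTime, UpdateTime, CreatedBy, and IsRegisteredWithLakeFormation that do not appear in the TableInput request syntax above. The sketch below shows one way to derive a TableInput from such a dict by keeping only the keys listed in the request syntax. The helper name table_to_table_input is hypothetical, and the usage comment assumes the table was fetched with the client's get_table method.

```python
import copy

# Top-level keys accepted by TableInput, taken from the request syntax above.
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters",
)


def table_to_table_input(table):
    """Copy only the TableInput-compatible keys from a Table dict."""
    return {
        key: copy.deepcopy(table[key])  # deep copy so edits do not alias
        for key in TABLE_INPUT_KEYS
        if key in table
    }


# Usage sketch (assumes boto3 and valid credentials; names are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   table = glue.get_table(DatabaseName="analytics", Name="events")["Table"]
#   table_input = table_to_table_input(table)
#   table_input["Description"] = "Updated description"
#   glue.update_table(DatabaseName="analytics", TableInput=table_input,
#                     SkipArchive=True)
```

Filtering against an explicit allow-list of keys, rather than deleting known response-only keys, means the helper stays safe if the response later gains further read-only fields.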