AWS Glue DataBrew

2021/01/28 - AWS Glue DataBrew - 9 updated api methods

Changes: Update the DataBrew client to the latest version

CreateDataset (updated) Link ¶
Changes (request)
{'FormatOptions': {'Csv': {'Delimiter': 'string'}}}

Creates a new DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.create_dataset(
    Name='string',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ]
        },
        'Csv': {
            'Delimiter': 'string'
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    },
    Tags={
        'string': 'string'
    }
)
type Name:

string

param Name:

[REQUIRED]

The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.

type FormatOptions:

dict

param FormatOptions:

Options that define the structure of CSV, Excel, or JSON input.

  • Json (dict) --

    Options that define how JSON input is to be interpreted by DataBrew.

    • MultiLine (boolean) --

      A value that specifies whether JSON input contains embedded new line characters.

  • Excel (dict) --

    Options that define how Excel input is to be interpreted by DataBrew.

    • SheetNames (list) --

      Specifies one or more named sheets in the Excel file, which will be included in the dataset.

      • (string) --

    • SheetIndexes (list) --

      Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.

      • (integer) --

  • Csv (dict) --

    Options that define how CSV input is to be interpreted by DataBrew.

    • Delimiter (string) --

      A single character that specifies the delimiter used in the CSV file.

type Input:

dict

param Input:

[REQUIRED]

Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.

  • S3InputDefinition (dict) --

    The Amazon S3 location where the data is stored.

    • Bucket (string) -- [REQUIRED]

      The S3 bucket name.

    • Key (string) --

      The unique name of the object in the bucket.

  • DataCatalogInputDefinition (dict) --

    The AWS Glue Data Catalog parameters for the data.

    • CatalogId (string) --

      The unique identifier of the AWS account that holds the Data Catalog that stores the data.

    • DatabaseName (string) -- [REQUIRED]

      The name of a database in the Data Catalog.

    • TableName (string) -- [REQUIRED]

      The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

    • TempDirectory (dict) --

      An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.

      • Bucket (string) -- [REQUIRED]

        The S3 bucket name.

      • Key (string) --

        The unique name of the object in the bucket.

type Tags:

dict

param Tags:

Metadata tags to apply to this dataset.

  • (string) --

    • (string) --

rtype:

dict

returns:

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) --

    • Name (string) --

      The name of the dataset that you created.
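
The new Csv format option can be exercised by building the request parameters as a plain dict before calling the client. The sketch below uses hypothetical bucket, key, and dataset names; with credentials configured, the dict would be passed to client.create_dataset(**kwargs).

```python
# A sketch of a create_dataset request using the new Csv FormatOptions.
# All names (dataset, bucket, key) are hypothetical.
create_dataset_kwargs = {
    "Name": "sales-q1",  # hypothetical dataset name
    "FormatOptions": {
        "Csv": {"Delimiter": ";"}  # must be a single character
    },
    "Input": {
        "S3InputDefinition": {
            "Bucket": "my-databrew-bucket",  # hypothetical bucket
            "Key": "raw/sales-q1.csv",
        }
    },
}

# Sanity-check the delimiter before sending the request.
assert len(create_dataset_kwargs["FormatOptions"]["Csv"]["Delimiter"]) == 1

# With AWS credentials available, the call itself would be:
# import boto3
# client = boto3.client("databrew")
# response = client.create_dataset(**create_dataset_kwargs)
```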

CreateRecipeJob (updated) Link ¶
Changes (request)
{'Outputs': {'FormatOptions': {'Csv': {'Delimiter': 'string'}}}}

Creates a new job to transform input data, using steps defined in an existing AWS Glue DataBrew recipe.

See also: AWS API Documentation

Request Syntax

client.create_recipe_job(
    DatasetName='string',
    EncryptionKeyArn='string',
    EncryptionMode='SSE-KMS'|'SSE-S3',
    Name='string',
    LogSubscription='ENABLE'|'DISABLE',
    MaxCapacity=123,
    MaxRetries=123,
    Outputs=[
        {
            'CompressionFormat': 'GZIP'|'LZ4'|'SNAPPY'|'BZIP2'|'DEFLATE'|'LZO'|'BROTLI'|'ZSTD'|'ZLIB',
            'Format': 'CSV'|'JSON'|'PARQUET'|'GLUEPARQUET'|'AVRO'|'ORC'|'XML',
            'PartitionColumns': [
                'string',
            ],
            'Location': {
                'Bucket': 'string',
                'Key': 'string'
            },
            'Overwrite': True|False,
            'FormatOptions': {
                'Csv': {
                    'Delimiter': 'string'
                }
            }
        },
    ],
    ProjectName='string',
    RecipeReference={
        'Name': 'string',
        'RecipeVersion': 'string'
    },
    RoleArn='string',
    Tags={
        'string': 'string'
    },
    Timeout=123
)
type DatasetName:

string

param DatasetName:

The name of the dataset that this job processes.

type EncryptionKeyArn:

string

param EncryptionKeyArn:

The Amazon Resource Name (ARN) of an encryption key that is used to protect the job.

type EncryptionMode:

string

param EncryptionMode:

The encryption mode for the job, which can be one of the following:

  • SSE-KMS - Server-side encryption with AWS KMS-managed keys.

  • SSE-S3 - Server-side encryption with keys managed by Amazon S3.

type Name:

string

param Name:

[REQUIRED]

A unique name for the job. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.

type LogSubscription:

string

param LogSubscription:

Enables or disables Amazon CloudWatch logging for the job. If logging is enabled, CloudWatch writes one log stream for each job run.

type MaxCapacity:

integer

param MaxCapacity:

The maximum number of nodes that DataBrew can consume when the job processes data.

type MaxRetries:

integer

param MaxRetries:

The maximum number of times to retry the job after a job run fails.

type Outputs:

list

param Outputs:

[REQUIRED]

One or more artifacts that represent the output from running the job.

  • (dict) --

    Parameters that specify how and where DataBrew will write the output generated by recipe jobs or profile jobs.

    • CompressionFormat (string) --

      The compression algorithm used to compress the output text of the job.

    • Format (string) --

      The data format of the output of the job.

    • PartitionColumns (list) --

      The names of one or more partition columns for the output of the job.

      • (string) --

    • Location (dict) -- [REQUIRED]

      The location in Amazon S3 where the job writes its output.

      • Bucket (string) -- [REQUIRED]

        The S3 bucket name.

      • Key (string) --

        The unique name of the object in the bucket.

    • Overwrite (boolean) --

      A value that, if true, means that any data in the location specified for output is overwritten with new output.

    • FormatOptions (dict) --

      Options that define how DataBrew formats job output files.

    • Csv (dict) --

      Options that define how DataBrew writes CSV output.

      • Delimiter (string) --

        A single character that specifies the delimiter used to create CSV job output.

type ProjectName:

string

param ProjectName:

Either the name of an existing project, or a combination of a recipe and a dataset to associate with the recipe.

type RecipeReference:

dict

param RecipeReference:

Represents the name and version of a DataBrew recipe.

  • Name (string) -- [REQUIRED]

    The name of the recipe.

  • RecipeVersion (string) --

    The identifier for the version of the recipe.

type RoleArn:

string

param RoleArn:

[REQUIRED]

The Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role to be assumed when DataBrew runs the job.

type Tags:

dict

param Tags:

Metadata tags to apply to this job.

  • (string) --

    • (string) --

type Timeout:

integer

param Timeout:

The job's timeout in minutes. A job that attempts to run longer than this timeout period ends with a status of TIMEOUT.

rtype:

dict

returns:

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) --

    • Name (string) --

      The name of the job that you created.
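
This release adds a per-output FormatOptions block, so each output of a recipe job can carry its own CSV delimiter. The sketch below builds the request dict with hypothetical names (job, dataset, role ARN, bucket, recipe); it would be passed to client.create_recipe_job(**kwargs).

```python
# A sketch of a create_recipe_job request using the new per-output Csv
# FormatOptions. All names and ARNs are hypothetical.
job_kwargs = {
    "Name": "sales-q1-job",
    "DatasetName": "sales-q1",
    "RoleArn": "arn:aws:iam::111122223333:role/DataBrewRole",  # hypothetical role
    "RecipeReference": {"Name": "sales-clean", "RecipeVersion": "1.0"},
    "Outputs": [
        {
            "Location": {"Bucket": "my-output-bucket", "Key": "clean/"},
            "Format": "CSV",
            # Tab-separated output; the delimiter is a single character.
            "FormatOptions": {"Csv": {"Delimiter": "\t"}},
            "Overwrite": True,
        }
    ],
}
```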

DescribeDataset (updated) Link ¶
Changes (response)
{'FormatOptions': {'Csv': {'Delimiter': 'string'}}}

Returns the definition of a specific DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.describe_dataset(
    Name='string'
)
type Name:

string

param Name:

[REQUIRED]

The name of the dataset to be described.

rtype:

dict

returns:

Response Syntax

{
    'CreatedBy': 'string',
    'CreateDate': datetime(2015, 1, 1),
    'Name': 'string',
    'FormatOptions': {
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ]
        },
        'Csv': {
            'Delimiter': 'string'
        }
    },
    'Input': {
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    },
    'LastModifiedDate': datetime(2015, 1, 1),
    'LastModifiedBy': 'string',
    'Source': 'S3'|'DATA-CATALOG',
    'Tags': {
        'string': 'string'
    },
    'ResourceArn': 'string'
}

Response Structure

  • (dict) --

    • CreatedBy (string) --

      The identifier (user name) of the user who created the dataset.

    • CreateDate (datetime) --

      The date and time that the dataset was created.

    • Name (string) --

      The name of the dataset.

    • FormatOptions (dict) --

      Options that define the structure of CSV, Excel, or JSON input.

      • Json (dict) --

        Options that define how JSON input is to be interpreted by DataBrew.

        • MultiLine (boolean) --

          A value that specifies whether JSON input contains embedded new line characters.

      • Excel (dict) --

        Options that define how Excel input is to be interpreted by DataBrew.

        • SheetNames (list) --

          Specifies one or more named sheets in the Excel file, which will be included in the dataset.

          • (string) --

        • SheetIndexes (list) --

          Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.

          • (integer) --

      • Csv (dict) --

        Options that define how CSV input is to be interpreted by DataBrew.

        • Delimiter (string) --

          A single character that specifies the delimiter used in the CSV file.

    • Input (dict) --

      Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.

      • S3InputDefinition (dict) --

        The Amazon S3 location where the data is stored.

        • Bucket (string) --

          The S3 bucket name.

        • Key (string) --

          The unique name of the object in the bucket.

      • DataCatalogInputDefinition (dict) --

        The AWS Glue Data Catalog parameters for the data.

        • CatalogId (string) --

          The unique identifier of the AWS account that holds the Data Catalog that stores the data.

        • DatabaseName (string) --

          The name of a database in the Data Catalog.

        • TableName (string) --

          The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

        • TempDirectory (dict) --

          An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.

          • Bucket (string) --

            The S3 bucket name.

          • Key (string) --

            The unique name of the object in the bucket.

    • LastModifiedDate (datetime) --

      The date and time that the dataset was last modified.

    • LastModifiedBy (string) --

      The identifier (user name) of the user who last modified the dataset.

    • Source (string) --

      The location of the data for this dataset: either Amazon S3 or the AWS Glue Data Catalog.

    • Tags (dict) --

      Metadata tags associated with this dataset.

      • (string) --

        • (string) --

    • ResourceArn (string) --

      The Amazon Resource Name (ARN) of the dataset.
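
When reading the new Csv block back out of a describe_dataset response, the key may be absent for datasets created before this release, so a fallback is useful. The response below is a hypothetical, heavily truncated example; the comma fallback assumes the service's default CSV delimiter.

```python
# A sketch of reading the new Csv options from a describe_dataset response.
# sample_response is hypothetical and truncated for illustration.
sample_response = {
    "Name": "sales-q1",
    "Source": "S3",
    "FormatOptions": {"Csv": {"Delimiter": ";"}},
}

# FormatOptions and Csv may be missing entirely; fall back to a comma,
# assuming that is the default delimiter when none was set.
csv_opts = sample_response.get("FormatOptions", {}).get("Csv", {})
delimiter = csv_opts.get("Delimiter", ",")
```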

DescribeJob (updated) Link ¶
Changes (response)
{'Outputs': {'FormatOptions': {'Csv': {'Delimiter': 'string'}}}}

Returns the definition of a specific DataBrew job.

See also: AWS API Documentation

Request Syntax

client.describe_job(
    Name='string'
)
type Name:

string

param Name:

[REQUIRED]

The name of the job to be described.

rtype:

dict

returns:

Response Syntax

{
    'CreateDate': datetime(2015, 1, 1),
    'CreatedBy': 'string',
    'DatasetName': 'string',
    'EncryptionKeyArn': 'string',
    'EncryptionMode': 'SSE-KMS'|'SSE-S3',
    'Name': 'string',
    'Type': 'PROFILE'|'RECIPE',
    'LastModifiedBy': 'string',
    'LastModifiedDate': datetime(2015, 1, 1),
    'LogSubscription': 'ENABLE'|'DISABLE',
    'MaxCapacity': 123,
    'MaxRetries': 123,
    'Outputs': [
        {
            'CompressionFormat': 'GZIP'|'LZ4'|'SNAPPY'|'BZIP2'|'DEFLATE'|'LZO'|'BROTLI'|'ZSTD'|'ZLIB',
            'Format': 'CSV'|'JSON'|'PARQUET'|'GLUEPARQUET'|'AVRO'|'ORC'|'XML',
            'PartitionColumns': [
                'string',
            ],
            'Location': {
                'Bucket': 'string',
                'Key': 'string'
            },
            'Overwrite': True|False,
            'FormatOptions': {
                'Csv': {
                    'Delimiter': 'string'
                }
            }
        },
    ],
    'ProjectName': 'string',
    'RecipeReference': {
        'Name': 'string',
        'RecipeVersion': 'string'
    },
    'ResourceArn': 'string',
    'RoleArn': 'string',
    'Tags': {
        'string': 'string'
    },
    'Timeout': 123
}

Response Structure

  • (dict) --

    • CreateDate (datetime) --

      The date and time that the job was created.

    • CreatedBy (string) --

      The identifier (user name) of the user associated with the creation of the job.

    • DatasetName (string) --

      The dataset that the job acts upon.

    • EncryptionKeyArn (string) --

      The Amazon Resource Name (ARN) of an encryption key that is used to protect the job.

    • EncryptionMode (string) --

      The encryption mode for the job, which can be one of the following:

      • SSE-KMS - Server-side encryption with AWS KMS-managed keys.

      • SSE-S3 - Server-side encryption with keys managed by Amazon S3.

    • Name (string) --

      The name of the job.

    • Type (string) --

      The job type, which must be one of the following:

      • PROFILE - The job analyzes the dataset to determine its size, data types, data distribution, and more.

      • RECIPE - The job applies one or more transformations to a dataset.

    • LastModifiedBy (string) --

      The identifier (user name) of the user who last modified the job.

    • LastModifiedDate (datetime) --

      The date and time that the job was last modified.

    • LogSubscription (string) --

      Indicates whether Amazon CloudWatch logging is enabled for this job.

    • MaxCapacity (integer) --

      The maximum number of compute nodes that DataBrew can consume when the job processes data.

    • MaxRetries (integer) --

      The maximum number of times to retry the job after a job run fails.

    • Outputs (list) --

      One or more artifacts that represent the output from running the job.

      • (dict) --

        Parameters that specify how and where DataBrew will write the output generated by recipe jobs or profile jobs.

        • CompressionFormat (string) --

          The compression algorithm used to compress the output text of the job.

        • Format (string) --

          The data format of the output of the job.

        • PartitionColumns (list) --

          The names of one or more partition columns for the output of the job.

          • (string) --

        • Location (dict) --

          The location in Amazon S3 where the job writes its output.

          • Bucket (string) --

            The S3 bucket name.

          • Key (string) --

            The unique name of the object in the bucket.

        • Overwrite (boolean) --

          A value that, if true, means that any data in the location specified for output is overwritten with new output.

        • FormatOptions (dict) --

          Options that define how DataBrew formats job output files.

          • Csv (dict) --

            Options that define how DataBrew writes CSV output.

            • Delimiter (string) --

              A single character that specifies the delimiter used to create CSV job output.

    • ProjectName (string) --

      The DataBrew project associated with this job.

    • RecipeReference (dict) --

      Represents the name and version of a DataBrew recipe.

      • Name (string) --

        The name of the recipe.

      • RecipeVersion (string) --

        The identifier for the version of the recipe.

    • ResourceArn (string) --

      The Amazon Resource Name (ARN) of the job.

    • RoleArn (string) --

      The ARN of the AWS Identity and Access Management (IAM) role to be assumed when DataBrew runs the job.

    • Tags (dict) --

      Metadata tags associated with this job.

      • (string) --

        • (string) --

    • Timeout (integer) --

      The job's timeout in minutes. A job that attempts to run longer than this timeout period ends with a status of TIMEOUT.
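
A describe_job response can carry several outputs, each with its own format and, now, its own CSV delimiter. The sketch below summarizes them from a hypothetical, truncated response; the comma fallback assumes that is the default when no delimiter was configured.

```python
# A sketch of pulling each output's format and delimiter from a
# describe_job response. sample_job is hypothetical and truncated.
sample_job = {
    "Name": "sales-q1-job",
    "Type": "RECIPE",
    "Outputs": [
        {
            "Location": {"Bucket": "my-output-bucket", "Key": "clean/"},
            "Format": "CSV",
            "FormatOptions": {"Csv": {"Delimiter": "|"}},
        }
    ],
}

# (format, delimiter) pairs; default to a comma when no Csv options exist.
summaries = [
    (o.get("Format"),
     o.get("FormatOptions", {}).get("Csv", {}).get("Delimiter", ","))
    for o in sample_job["Outputs"]
]
```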

ListDatasets (updated) Link ¶
Changes (response)
{'Datasets': {'FormatOptions': {'Csv': {'Delimiter': 'string'}}}}

Lists all of the DataBrew datasets.

See also: AWS API Documentation

Request Syntax

client.list_datasets(
    MaxResults=123,
    NextToken='string'
)
type MaxResults:

integer

param MaxResults:

The maximum number of results to return in this request.

type NextToken:

string

param NextToken:

The token returned by a previous call to retrieve the next set of results.

rtype:

dict

returns:

Response Syntax

{
    'Datasets': [
        {
            'AccountId': 'string',
            'CreatedBy': 'string',
            'CreateDate': datetime(2015, 1, 1),
            'Name': 'string',
            'FormatOptions': {
                'Json': {
                    'MultiLine': True|False
                },
                'Excel': {
                    'SheetNames': [
                        'string',
                    ],
                    'SheetIndexes': [
                        123,
                    ]
                },
                'Csv': {
                    'Delimiter': 'string'
                }
            },
            'Input': {
                'S3InputDefinition': {
                    'Bucket': 'string',
                    'Key': 'string'
                },
                'DataCatalogInputDefinition': {
                    'CatalogId': 'string',
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'TempDirectory': {
                        'Bucket': 'string',
                        'Key': 'string'
                    }
                }
            },
            'LastModifiedDate': datetime(2015, 1, 1),
            'LastModifiedBy': 'string',
            'Source': 'S3'|'DATA-CATALOG',
            'Tags': {
                'string': 'string'
            },
            'ResourceArn': 'string'
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • Datasets (list) --

      A list of datasets that are defined.

      • (dict) --

        Represents a dataset that can be processed by DataBrew.

        • AccountId (string) --

          The ID of the AWS account that owns the dataset.

        • CreatedBy (string) --

          The Amazon Resource Name (ARN) of the user who created the dataset.

        • CreateDate (datetime) --

          The date and time that the dataset was created.

        • Name (string) --

          The unique name of the dataset.

        • FormatOptions (dict) --

          Options that define how DataBrew interprets the data in the dataset.

          • Json (dict) --

            Options that define how JSON input is to be interpreted by DataBrew.

            • MultiLine (boolean) --

              A value that specifies whether JSON input contains embedded new line characters.

          • Excel (dict) --

            Options that define how Excel input is to be interpreted by DataBrew.

            • SheetNames (list) --

              Specifies one or more named sheets in the Excel file, which will be included in the dataset.

              • (string) --

            • SheetIndexes (list) --

              Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.

              • (integer) --

          • Csv (dict) --

            Options that define how CSV input is to be interpreted by DataBrew.

            • Delimiter (string) --

              A single character that specifies the delimiter used in the CSV file.

        • Input (dict) --

          Information on how DataBrew can find the dataset, in either the AWS Glue Data Catalog or Amazon S3.

          • S3InputDefinition (dict) --

            The Amazon S3 location where the data is stored.

            • Bucket (string) --

              The S3 bucket name.

            • Key (string) --

              The unique name of the object in the bucket.

          • DataCatalogInputDefinition (dict) --

            The AWS Glue Data Catalog parameters for the data.

            • CatalogId (string) --

              The unique identifier of the AWS account that holds the Data Catalog that stores the data.

            • DatabaseName (string) --

              The name of a database in the Data Catalog.

            • TableName (string) --

              The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

            • TempDirectory (dict) --

              An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.

              • Bucket (string) --

                The S3 bucket name.

              • Key (string) --

                The unique name of the object in the bucket.

        • LastModifiedDate (datetime) --

          The last modification date and time of the dataset.

        • LastModifiedBy (string) --

          The Amazon Resource Name (ARN) of the user who last modified the dataset.

        • Source (string) --

          The location of the data for the dataset, either Amazon S3 or the AWS Glue Data Catalog.

        • Tags (dict) --

          Metadata tags that have been applied to the dataset.

          • (string) --

            • (string) --

        • ResourceArn (string) --

          The unique Amazon Resource Name (ARN) for the dataset.

    • NextToken (string) --

      A token that you can use in a subsequent call to retrieve the next set of results.
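
The MaxResults/NextToken pair follows the usual AWS pagination pattern: keep calling while a NextToken comes back. The loop below is a generic sketch; fetch_page stands in for client.list_datasets, and the two stubbed pages exist only so the loop can be exercised without AWS credentials.

```python
# A generic NextToken pagination loop for list_datasets. fetch_page stands
# in for client.list_datasets(**kwargs); swap in the real client call.
def collect_datasets(fetch_page, max_results=100):
    datasets, token = [], None
    while True:
        kwargs = {"MaxResults": max_results}
        if token:
            kwargs["NextToken"] = token
        page = fetch_page(**kwargs)
        datasets.extend(page.get("Datasets", []))
        token = page.get("NextToken")
        if not token:  # no token means this was the last page
            return datasets

# Two stubbed pages, purely for illustration.
_pages = [
    {"Datasets": [{"Name": "a"}], "NextToken": "t1"},
    {"Datasets": [{"Name": "b"}]},
]

def fake_fetch(**kwargs):
    return _pages[1] if kwargs.get("NextToken") == "t1" else _pages[0]
```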

ListJobRuns (updated) Link ¶
Changes (response)
{'JobRuns': {'Outputs': {'FormatOptions': {'Csv': {'Delimiter': 'string'}}}}}

Lists all of the previous runs of a particular DataBrew job.

See also: AWS API Documentation

Request Syntax

client.list_job_runs(
    Name='string',
    MaxResults=123,
    NextToken='string'
)
type Name:

string

param Name:

[REQUIRED]

The name of the job.

type MaxResults:

integer

param MaxResults:

The maximum number of results to return in this request.

type NextToken:

string

param NextToken:

The token returned by a previous call to retrieve the next set of results.

rtype:

dict

returns:

Response Syntax

{
    'JobRuns': [
        {
            'Attempt': 123,
            'CompletedOn': datetime(2015, 1, 1),
            'DatasetName': 'string',
            'ErrorMessage': 'string',
            'ExecutionTime': 123,
            'JobName': 'string',
            'RunId': 'string',
            'State': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT',
            'LogSubscription': 'ENABLE'|'DISABLE',
            'LogGroupName': 'string',
            'Outputs': [
                {
                    'CompressionFormat': 'GZIP'|'LZ4'|'SNAPPY'|'BZIP2'|'DEFLATE'|'LZO'|'BROTLI'|'ZSTD'|'ZLIB',
                    'Format': 'CSV'|'JSON'|'PARQUET'|'GLUEPARQUET'|'AVRO'|'ORC'|'XML',
                    'PartitionColumns': [
                        'string',
                    ],
                    'Location': {
                        'Bucket': 'string',
                        'Key': 'string'
                    },
                    'Overwrite': True|False,
                    'FormatOptions': {
                        'Csv': {
                            'Delimiter': 'string'
                        }
                    }
                },
            ],
            'RecipeReference': {
                'Name': 'string',
                'RecipeVersion': 'string'
            },
            'StartedBy': 'string',
            'StartedOn': datetime(2015, 1, 1)
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • JobRuns (list) --

      A list of job runs that have occurred for the specified job.

      • (dict) --

        Represents one run of a DataBrew job.

        • Attempt (integer) --

          The number of times that DataBrew has attempted to run the job.

        • CompletedOn (datetime) --

          The date and time when the job completed processing.

        • DatasetName (string) --

          The name of the dataset for the job to process.

        • ErrorMessage (string) --

          A message indicating an error (if any) that was encountered when the job ran.

        • ExecutionTime (integer) --

          The amount of time, in seconds, during which a job run consumed resources.

        • JobName (string) --

          The name of the job being processed during this run.

        • RunId (string) --

          The unique identifier of the job run.

        • State (string) --

          The current state of the job run entity itself.

        • LogSubscription (string) --

          The current status of Amazon CloudWatch logging for the job run.

        • LogGroupName (string) --

          The name of an Amazon CloudWatch log group, where the job writes diagnostic messages when it runs.

        • Outputs (list) --

          One or more output artifacts from a job run.

          • (dict) --

            Parameters that specify how and where DataBrew will write the output generated by recipe jobs or profile jobs.

            • CompressionFormat (string) --

              The compression algorithm used to compress the output text of the job.

            • Format (string) --

              The data format of the output of the job.

            • PartitionColumns (list) --

              The names of one or more partition columns for the output of the job.

              • (string) --

            • Location (dict) --

              The location in Amazon S3 where the job writes its output.

              • Bucket (string) --

                The S3 bucket name.

              • Key (string) --

                The unique name of the object in the bucket.

            • Overwrite (boolean) --

              A value that, if true, means that any data in the location specified for output is overwritten with new output.

            • FormatOptions (dict) --

              Options that define how DataBrew formats job output files.

              • Csv (dict) --

                Options that define how DataBrew writes CSV output.

                • Delimiter (string) --

                  A single character that specifies the delimiter used to create CSV job output.

        • RecipeReference (dict) --

          The set of steps processed by the job.

          • Name (string) --

            The name of the recipe.

          • RecipeVersion (string) --

            The identifier for the version of the recipe.

        • StartedBy (string) --

          The Amazon Resource Name (ARN) of the user who initiated the job run.

        • StartedOn (datetime) --

          The date and time when the job run began.

    • NextToken (string) --

      A token that you can use in a subsequent call to retrieve the next set of results.
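
Since each run carries a State from the enum above, a common follow-up is to scan the runs for unsuccessful ones. The response below is a hypothetical, heavily truncated example of list_job_runs output.

```python
# A sketch of scanning list_job_runs output for unsuccessful runs.
# sample_runs is hypothetical and truncated for illustration.
sample_runs = {
    "JobRuns": [
        {"RunId": "r-1", "State": "SUCCEEDED"},
        {"RunId": "r-2", "State": "FAILED", "ErrorMessage": "access denied"},
        {"RunId": "r-3", "State": "TIMEOUT"},
    ]
}

# States from the documented enum that indicate the run did not succeed.
bad_states = {"FAILED", "TIMEOUT", "STOPPED"}
failed = [r for r in sample_runs["JobRuns"] if r["State"] in bad_states]
```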

ListJobs (updated) Link ¶
Changes (response)
{'Jobs': {'Outputs': {'FormatOptions': {'Csv': {'Delimiter': 'string'}}}}}

Lists all of the DataBrew jobs that are defined.

See also: AWS API Documentation

Request Syntax

client.list_jobs(
    DatasetName='string',
    MaxResults=123,
    NextToken='string',
    ProjectName='string'
)
type DatasetName:

string

param DatasetName:

The name of a dataset. If specified, only jobs that act on this dataset are returned.

type MaxResults:

integer

param MaxResults:

The maximum number of results to return in this request.

type NextToken:

string

param NextToken:

A token generated by DataBrew that specifies where to continue pagination if a previous request was truncated. To get the next set of pages, pass in the NextToken value from the response object of the previous page call.

type ProjectName:

string

param ProjectName:

The name of a project. If specified, only jobs that are associated with this project are returned.

rtype:

dict

returns:

Response Syntax

{
    'Jobs': [
        {
            'AccountId': 'string',
            'CreatedBy': 'string',
            'CreateDate': datetime(2015, 1, 1),
            'DatasetName': 'string',
            'EncryptionKeyArn': 'string',
            'EncryptionMode': 'SSE-KMS'|'SSE-S3',
            'Name': 'string',
            'Type': 'PROFILE'|'RECIPE',
            'LastModifiedBy': 'string',
            'LastModifiedDate': datetime(2015, 1, 1),
            'LogSubscription': 'ENABLE'|'DISABLE',
            'MaxCapacity': 123,
            'MaxRetries': 123,
            'Outputs': [
                {
                    'CompressionFormat': 'GZIP'|'LZ4'|'SNAPPY'|'BZIP2'|'DEFLATE'|'LZO'|'BROTLI'|'ZSTD'|'ZLIB',
                    'Format': 'CSV'|'JSON'|'PARQUET'|'GLUEPARQUET'|'AVRO'|'ORC'|'XML',
                    'PartitionColumns': [
                        'string',
                    ],
                    'Location': {
                        'Bucket': 'string',
                        'Key': 'string'
                    },
                    'Overwrite': True|False,
                    'FormatOptions': {
                        'Csv': {
                            'Delimiter': 'string'
                        }
                    }
                },
            ],
            'ProjectName': 'string',
            'RecipeReference': {
                'Name': 'string',
                'RecipeVersion': 'string'
            },
            'ResourceArn': 'string',
            'RoleArn': 'string',
            'Timeout': 123,
            'Tags': {
                'string': 'string'
            }
        },
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • Jobs (list) --

      A list of jobs that are defined.

      • (dict) --

        Represents all of the attributes of a DataBrew job.

        • AccountId (string) --

          The ID of the AWS account that owns the job.

        • CreatedBy (string) --

          The Amazon Resource Name (ARN) of the user who created the job.

        • CreateDate (datetime) --

          The date and time that the job was created.

        • DatasetName (string) --

          A dataset that the job is to process.

        • EncryptionKeyArn (string) --

          The Amazon Resource Name (ARN) of an encryption key that is used to protect the job output. For more information, see Encrypting data written by DataBrew jobs.

        • EncryptionMode (string) --

          The encryption mode for the job, which can be one of the following:

          • SSE-KMS - Server-side encryption with AWS KMS-managed keys.

          • SSE-S3 - Server-side encryption with keys managed by Amazon S3.

        • Name (string) --

          The unique name of the job.

        • Type (string) --

          The job type of the job, which must be one of the following:

          • PROFILE - A job to analyze a dataset, to determine its size, data types, data distribution, and more.

          • RECIPE - A job to apply one or more transformations to a dataset.

        • LastModifiedBy (string) --

          The Amazon Resource Name (ARN) of the user who last modified the job.

        • LastModifiedDate (datetime) --

          The modification date and time of the job.

        • LogSubscription (string) --

          The current status of Amazon CloudWatch logging for the job.

        • MaxCapacity (integer) --

          The maximum number of nodes that can be consumed when the job processes data.

        • MaxRetries (integer) --

          The maximum number of times to retry the job after a job run fails.

        • Outputs (list) --

          One or more artifacts that represent output from running the job.

          • (dict) --

            Parameters that specify how and where DataBrew will write the output generated by recipe jobs or profile jobs.

            • CompressionFormat (string) --

              The compression algorithm used to compress the output text of the job.

            • Format (string) --

              The data format of the output of the job.

            • PartitionColumns (list) --

              The names of one or more partition columns for the output of the job.

              • (string) --

            • Location (dict) --

              The location in Amazon S3 where the job writes its output.

              • Bucket (string) --

                The S3 bucket name.

              • Key (string) --

                The unique name of the object in the bucket.

            • Overwrite (boolean) --

              A value that, if true, means that any data in the location specified for output is overwritten with new output.

            • FormatOptions (dict) --

              Options that define how DataBrew formats job output files.

              • Csv (dict) --

                Options that define how DataBrew writes CSV output.

                • Delimiter (string) --

                  A single character that specifies the delimiter used to create CSV job output.

        • ProjectName (string) --

          The name of the project that the job is associated with.

        • RecipeReference (dict) --

          A set of steps that the job runs.

          • Name (string) --

            The name of the recipe.

          • RecipeVersion (string) --

            The identifier for the version of the recipe.

        • ResourceArn (string) --

          The unique Amazon Resource Name (ARN) for the job.

        • RoleArn (string) --

          The Amazon Resource Name (ARN) of the role that will be assumed for this job.

        • Timeout (integer) --

          The job's timeout in minutes. A job that attempts to run longer than this timeout period ends with a status of TIMEOUT.

        • Tags (dict) --

          Metadata tags that have been applied to the job.

          • (string) --

            • (string) --

    • NextToken (string) --

      A token that you can use in a subsequent call to retrieve the next set of results.
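
The NextToken field above drives pagination. A minimal sketch of following it until the job list is exhausted, and of reading the CSV delimiter from each job's outputs (in real use, pass `boto3.client('databrew')`; the helper names here are illustrative, not part of the SDK):

```python
def list_all_jobs(client):
    """Collect every DataBrew job by calling list_jobs until NextToken is absent."""
    jobs, token = [], None
    while True:
        kwargs = {'NextToken': token} if token else {}
        page = client.list_jobs(**kwargs)
        jobs.extend(page.get('Jobs', []))
        token = page.get('NextToken')
        if not token:
            return jobs

def csv_delimiters(job):
    """Yield the CSV delimiter of each job output that defines one."""
    for output in job.get('Outputs', []):
        csv_opts = output.get('FormatOptions', {}).get('Csv', {})
        if 'Delimiter' in csv_opts:
            yield csv_opts['Delimiter']
```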

UpdateDataset (updated) Link ¶
Changes (request)
{'FormatOptions': {'Csv': {'Delimiter': 'string'}}}

Modifies the definition of an existing DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.update_dataset(
    Name='string',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ]
        },
        'Csv': {
            'Delimiter': 'string'
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    }
)
type Name:

string

param Name:

[REQUIRED]

The name of the dataset to be updated.

type FormatOptions:

dict

param FormatOptions:

Options that define the structure of either CSV, Excel, or JSON input.

  • Json (dict) --

    Options that define how JSON input is to be interpreted by DataBrew.

    • MultiLine (boolean) --

      A value that specifies whether JSON input contains embedded new line characters.

  • Excel (dict) --

    Options that define how Excel input is to be interpreted by DataBrew.

    • SheetNames (list) --

      Specifies one or more named sheets in the Excel file, which will be included in the dataset.

      • (string) --

    • SheetIndexes (list) --

      Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.

      • (integer) --

  • Csv (dict) --

    Options that define how CSV input is to be interpreted by DataBrew.

    • Delimiter (string) --

      A single character that specifies the delimiter being used in the CSV file.

type Input:

dict

param Input:

[REQUIRED]

Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.

  • S3InputDefinition (dict) --

    The Amazon S3 location where the data is stored.

    • Bucket (string) -- [REQUIRED]

      The S3 bucket name.

    • Key (string) --

      The unique name of the object in the bucket.

  • DataCatalogInputDefinition (dict) --

    The AWS Glue Data Catalog parameters for the data.

    • CatalogId (string) --

      The unique identifier of the AWS account that holds the Data Catalog that stores the data.

    • DatabaseName (string) -- [REQUIRED]

      The name of a database in the Data Catalog.

    • TableName (string) -- [REQUIRED]

      The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

    • TempDirectory (dict) --

      An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.

      • Bucket (string) -- [REQUIRED]

        The S3 bucket name.

      • Key (string) --

        The unique name of the object in the bucket.

rtype:

dict

returns:

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) --

    • Name (string) --

      The name of the dataset that you updated.
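
Putting the request syntax above together: to point an existing dataset at, say, a pipe-delimited CSV file in S3, only Name, the Csv format options, and the S3 input definition are needed. A sketch that assembles the kwargs (the bucket, key, and helper name are placeholders, not SDK API):

```python
def build_update_dataset_request(name, bucket, key, delimiter='|'):
    """Assemble kwargs for client.update_dataset for a delimited CSV source in S3."""
    return {
        'Name': name,
        'FormatOptions': {'Csv': {'Delimiter': delimiter}},
        'Input': {'S3InputDefinition': {'Bucket': bucket, 'Key': key}},
    }

# In real use:
# response = boto3.client('databrew').update_dataset(
#     **build_update_dataset_request('my-dataset', 'my-bucket', 'data/input.csv'))
```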

UpdateRecipeJob (updated) Link ¶
Changes (request)
{'Outputs': {'FormatOptions': {'Csv': {'Delimiter': 'string'}}}}

Modifies the definition of an existing DataBrew recipe job.

See also: AWS API Documentation

Request Syntax

client.update_recipe_job(
    EncryptionKeyArn='string',
    EncryptionMode='SSE-KMS'|'SSE-S3',
    Name='string',
    LogSubscription='ENABLE'|'DISABLE',
    MaxCapacity=123,
    MaxRetries=123,
    Outputs=[
        {
            'CompressionFormat': 'GZIP'|'LZ4'|'SNAPPY'|'BZIP2'|'DEFLATE'|'LZO'|'BROTLI'|'ZSTD'|'ZLIB',
            'Format': 'CSV'|'JSON'|'PARQUET'|'GLUEPARQUET'|'AVRO'|'ORC'|'XML',
            'PartitionColumns': [
                'string',
            ],
            'Location': {
                'Bucket': 'string',
                'Key': 'string'
            },
            'Overwrite': True|False,
            'FormatOptions': {
                'Csv': {
                    'Delimiter': 'string'
                }
            }
        },
    ],
    RoleArn='string',
    Timeout=123
)
type EncryptionKeyArn:

string

param EncryptionKeyArn:

The Amazon Resource Name (ARN) of an encryption key that is used to protect the job.

type EncryptionMode:

string

param EncryptionMode:

The encryption mode for the job, which can be one of the following:

  • SSE-KMS - Server-side encryption with AWS KMS-managed keys.

  • SSE-S3 - Server-side encryption with keys managed by Amazon S3.

type Name:

string

param Name:

[REQUIRED]

The name of the job to update.

type LogSubscription:

string

param LogSubscription:

Enables or disables Amazon CloudWatch logging for the job. If logging is enabled, CloudWatch writes one log stream for each job run.

type MaxCapacity:

integer

param MaxCapacity:

The maximum number of nodes that DataBrew can consume when the job processes data.

type MaxRetries:

integer

param MaxRetries:

The maximum number of times to retry the job after a job run fails.

type Outputs:

list

param Outputs:

[REQUIRED]

One or more artifacts that represent the output from running the job.

  • (dict) --

    Parameters that specify how and where DataBrew will write the output generated by recipe jobs or profile jobs.

    • CompressionFormat (string) --

      The compression algorithm used to compress the output text of the job.

    • Format (string) --

      The data format of the output of the job.

    • PartitionColumns (list) --

      The names of one or more partition columns for the output of the job.

      • (string) --

    • Location (dict) -- [REQUIRED]

      The location in Amazon S3 where the job writes its output.

      • Bucket (string) -- [REQUIRED]

        The S3 bucket name.

      • Key (string) --

        The unique name of the object in the bucket.

    • Overwrite (boolean) --

      A value that, if true, means that any data in the location specified for output is overwritten with new output.

    • FormatOptions (dict) --

      Options that define how DataBrew formats job output files.

      • Csv (dict) --

        Options that define how DataBrew writes CSV output.

        • Delimiter (string) --

          A single character that specifies the delimiter used to create CSV job output.

type RoleArn:

string

param RoleArn:

[REQUIRED]

The Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role to be assumed when DataBrew runs the job.

type Timeout:

integer

param Timeout:

The job's timeout in minutes. A job that attempts to run longer than this timeout period ends with a status of TIMEOUT.

rtype:

dict

returns:

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) --

    • Name (string) --

      The name of the job that you updated.
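
The new FormatOptions element is most useful on UpdateRecipeJob for switching an output's delimiter. A sketch that builds the three required parameters (Name, Outputs, RoleArn) for a single GZIP-compressed, tab-delimited CSV output; the helper name and S3 values are placeholders, not SDK API:

```python
def build_update_recipe_job_request(job_name, role_arn, bucket, key):
    """Assemble kwargs for client.update_recipe_job: one tab-delimited CSV output."""
    return {
        'Name': job_name,
        'RoleArn': role_arn,
        'Outputs': [{
            'CompressionFormat': 'GZIP',
            'Format': 'CSV',
            'Location': {'Bucket': bucket, 'Key': key},
            'Overwrite': True,
            # Delimiter must be a single character.
            'FormatOptions': {'Csv': {'Delimiter': '\t'}},
        }],
    }

# In real use:
# response = boto3.client('databrew').update_recipe_job(
#     **build_update_recipe_job_request(
#         'my-recipe-job', 'arn:aws:iam::123456789012:role/DataBrewRole',
#         'my-bucket', 'output/'))
```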