AWS Glue DataBrew

2022/03/31 - AWS Glue DataBrew - 4 updated api methods

Changes  This AWS Glue DataBrew release adds support for ORC as an input format.

CreateDataset (updated)
Changes (request)
{'Format': {'ORC'}}

Creates a new DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.create_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string',
            'BucketOwner': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            }
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'QueryString': 'string'
        },
        'Metadata': {
            'SourceArn': 'string'
        }
    },
    PathOptions={
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {
                'string': 'string'
            }
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                }
            }
        }
    },
    Tags={
        'string': 'string'
    }
)
type Name:

string

param Name:

[REQUIRED]

The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.

type Format:

string

param Format:

The file format of a dataset that is created from an Amazon S3 file or folder.

type FormatOptions:

dict

param FormatOptions:

Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.

  • Json (dict) --

    Options that define how JSON input is to be interpreted by DataBrew.

    • MultiLine (boolean) --

      A value that specifies whether JSON input contains embedded new line characters.

  • Excel (dict) --

    Options that define how Excel input is to be interpreted by DataBrew.

    • SheetNames (list) --

      One or more named sheets in the Excel file that will be included in the dataset.

      • (string) --

    • SheetIndexes (list) --

      One or more sheet numbers in the Excel file that will be included in the dataset.

      • (integer) --

    • HeaderRow (boolean) --

      A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

  • Csv (dict) --

    Options that define how CSV input is to be interpreted by DataBrew.

    • Delimiter (string) --

      A single character that specifies the delimiter being used in the CSV file.

    • HeaderRow (boolean) --

      A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
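
The FormatOptions shapes described above can be sketched as small helper functions. This is a minimal illustration only: the helper names are hypothetical and not part of boto3, and the single-character check simply mirrors the Delimiter description above.

```python
def csv_format_options(delimiter=",", header_row=True):
    """Return a FormatOptions dict for CSV input (hypothetical helper)."""
    # The Delimiter description above calls for a single character.
    if len(delimiter) != 1:
        raise ValueError("Delimiter must be a single character")
    return {"Csv": {"Delimiter": delimiter, "HeaderRow": header_row}}


def json_format_options(multi_line=False):
    """Return a FormatOptions dict for JSON input (hypothetical helper)."""
    return {"Json": {"MultiLine": multi_line}}


def excel_format_options(sheet_names, header_row=True):
    """Return a FormatOptions dict that selects Excel sheets by name."""
    return {"Excel": {"SheetNames": list(sheet_names), "HeaderRow": header_row}}
```

The returned dict would be passed as the FormatOptions argument of client.create_dataset. Since FormatOptions is described only for CSV, Excel, and JSON input, it is presumably left out for other formats.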

type Input:

dict

param Input:

[REQUIRED]

Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.

  • S3InputDefinition (dict) --

    The Amazon S3 location where the data is stored.

    • Bucket (string) -- [REQUIRED]

      The Amazon S3 bucket name.

    • Key (string) --

      The unique name of the object in the bucket.

    • BucketOwner (string) --

      The Amazon Web Services account ID of the bucket owner.

  • DataCatalogInputDefinition (dict) --

    The Glue Data Catalog parameters for the data.

    • CatalogId (string) --

      The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.

    • DatabaseName (string) -- [REQUIRED]

      The name of a database in the Data Catalog.

    • TableName (string) -- [REQUIRED]

      The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

    • TempDirectory (dict) --

      Represents an Amazon S3 location where DataBrew can store intermediate results.

      • Bucket (string) -- [REQUIRED]

        The Amazon S3 bucket name.

      • Key (string) --

        The unique name of the object in the bucket.

      • BucketOwner (string) --

        The Amazon Web Services account ID of the bucket owner.

  • DatabaseInputDefinition (dict) --

    Connection information for dataset input files stored in a database.

    • GlueConnectionName (string) -- [REQUIRED]

      The Glue Connection that stores the connection information for the target database.

    • DatabaseTableName (string) --

      The table within the target database.

    • TempDirectory (dict) --

      Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.

      • Bucket (string) -- [REQUIRED]

        The Amazon S3 bucket name.

      • Key (string) --

        The unique name of the object in the bucket.

      • BucketOwner (string) --

        The Amazon Web Services account ID of the bucket owner.

    • QueryString (string) --

      Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.

  • Metadata (dict) --

    Contains additional resource information needed for specific datasets.

    • SourceArn (string) --

      The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.

type PathOptions:

dict

param PathOptions:

A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.

  • LastModifiedDateCondition (dict) --

    If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.

    • Expression (string) -- [REQUIRED]

      An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

    • ValuesMap (dict) -- [REQUIRED]

      The map of substitution variable names to their values used in this filter expression.

      • (string) --

        • (string) --

  • FilesLimit (dict) --

    If provided, this structure imposes a limit on the number of files that should be selected.

    • MaxFiles (integer) -- [REQUIRED]

      The number of Amazon S3 files to select.

    • OrderedBy (string) --

      The criterion used to sort Amazon S3 files before they are selected. Defaults to LAST_MODIFIED_DATE, which is currently the only allowed value.

    • Order (string) --

      The sort order applied to Amazon S3 files before they are selected. Defaults to DESCENDING, that is, the most recent files are selected first. The other possible value is ASCENDING.

  • Parameters (dict) --

    A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.

    • (string) --

      • (dict) --

        Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.

        • Name (string) -- [REQUIRED]

          The name of the parameter that is used in the dataset's Amazon S3 path.

        • Type (string) -- [REQUIRED]

          The type of the dataset parameter: one of 'String', 'Number', or 'Datetime'.

        • DatetimeOptions (dict) --

          Additional parameter options such as a format and a timezone. Required for datetime parameters.

          • Format (string) -- [REQUIRED]

            Required option that defines the datetime format used for a date parameter in the Amazon S3 path. Use only supported datetime specifiers and separator characters, and escape all literal a-z or A-Z characters with single quotes. For example, "MM.dd.yyyy-'at'-HH:mm".

          • TimezoneOffset (string) --

            Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.

          • LocaleCode (string) --

            Optional value for a non-US locale code, needed for correct interpretation of some date formats.

        • CreateColumn (boolean) --

          Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.

        • Filter (dict) --

          The optional filter expression structure to apply additional matching criteria to the parameter.

          • Expression (string) -- [REQUIRED]

            An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

          • ValuesMap (dict) -- [REQUIRED]

            The map of substitution variable names to their values used in this filter expression.

            • (string) --

              • (string) --
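
As a rough sketch of how the PathOptions pieces above fit together, the following builds a structure that keeps only the most recent files and filters a String path parameter by prefix. The function name, the parameter name, and the prefix value are hypothetical. Keeping the leading ':' in the ValuesMap keys is an assumption based on the substitution-variable syntax described above.

```python
def build_path_options(max_files, param_name, prefix):
    """Build a PathOptions dict (hypothetical helper, not part of boto3)."""
    if max_files < 1:
        raise ValueError("max_files must be a positive integer")
    return {
        # Keep only the newest files; LAST_MODIFIED_DATE is the only
        # documented OrderedBy value.
        "FilesLimit": {
            "MaxFiles": max_files,
            "OrderedBy": "LAST_MODIFIED_DATE",
            "Order": "DESCENDING",
        },
        "Parameters": {
            param_name: {
                "Name": param_name,
                "Type": "String",
                # Expose the captured path value as a dataset column.
                "CreateColumn": True,
                "Filter": {
                    "Expression": "starts_with :prefix1",
                    # Assumption: keys carry the same ':' prefix as the
                    # substitution variable in the expression.
                    "ValuesMap": {":prefix1": prefix},
                },
            }
        },
    }
```

The result would be passed as the PathOptions argument of client.create_dataset, alongside an S3InputDefinition whose Key references the same parameter name.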

type Tags:

dict

param Tags:

Metadata tags to apply to this dataset.

  • (string) --

    • (string) --

rtype:

dict

returns:

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) --

    • Name (string) --

      The name of the dataset that you created.
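
To illustrate the new ORC value, here is a minimal sketch of a create_dataset call for ORC files stored in Amazon S3. The helper names, dataset name, bucket, and key are hypothetical placeholders. The client is passed in so the request-building logic can be exercised without AWS credentials.

```python
def build_orc_dataset_request(name, bucket, key, tags=None):
    """Build keyword arguments for create_dataset with ORC input from S3."""
    request = {
        "Name": name,
        "Format": "ORC",
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }
    if tags:
        request["Tags"] = dict(tags)
    return request


def create_orc_dataset(client, name, bucket, key, tags=None):
    """Call create_dataset and return the dataset name from the response."""
    request = build_orc_dataset_request(name, bucket, key, tags)
    response = client.create_dataset(**request)
    return response["Name"]


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   client = boto3.client("databrew")
#   create_orc_dataset(client, "orders-orc", "example-bucket", "orders/")
```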

DescribeDataset (updated)
Changes (response)
{'Format': {'ORC'}}

Returns the definition of a specific DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.describe_dataset(
    Name='string'
)
type Name:

string

param Name:

[REQUIRED]

The name of the dataset to be described.

rtype:

dict

returns:

Response Syntax

{
    'CreatedBy': 'string',
    'CreateDate': datetime(2015, 1, 1),
    'Name': 'string',
    'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
    'FormatOptions': {
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    'Input': {
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string',
            'BucketOwner': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            }
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'QueryString': 'string'
        },
        'Metadata': {
            'SourceArn': 'string'
        }
    },
    'LastModifiedDate': datetime(2015, 1, 1),
    'LastModifiedBy': 'string',
    'Source': 'S3'|'DATA-CATALOG'|'DATABASE',
    'PathOptions': {
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {
                'string': 'string'
            }
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                }
            }
        }
    },
    'Tags': {
        'string': 'string'
    },
    'ResourceArn': 'string'
}

Response Structure

  • (dict) --

    • CreatedBy (string) --

      The identifier (user name) of the user who created the dataset.

    • CreateDate (datetime) --

      The date and time that the dataset was created.

    • Name (string) --

      The name of the dataset.

    • Format (string) --

      The file format of a dataset that is created from an Amazon S3 file or folder.

    • FormatOptions (dict) --

      Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.

      • Json (dict) --

        Options that define how JSON input is to be interpreted by DataBrew.

        • MultiLine (boolean) --

          A value that specifies whether JSON input contains embedded new line characters.

      • Excel (dict) --

        Options that define how Excel input is to be interpreted by DataBrew.

        • SheetNames (list) --

          One or more named sheets in the Excel file that will be included in the dataset.

          • (string) --

        • SheetIndexes (list) --

          One or more sheet numbers in the Excel file that will be included in the dataset.

          • (integer) --

        • HeaderRow (boolean) --

          A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

      • Csv (dict) --

        Options that define how CSV input is to be interpreted by DataBrew.

        • Delimiter (string) --

          A single character that specifies the delimiter being used in the CSV file.

        • HeaderRow (boolean) --

          A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

    • Input (dict) --

      Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.

      • S3InputDefinition (dict) --

        The Amazon S3 location where the data is stored.

        • Bucket (string) --

          The Amazon S3 bucket name.

        • Key (string) --

          The unique name of the object in the bucket.

        • BucketOwner (string) --

          The Amazon Web Services account ID of the bucket owner.

      • DataCatalogInputDefinition (dict) --

        The Glue Data Catalog parameters for the data.

        • CatalogId (string) --

          The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.

        • DatabaseName (string) --

          The name of a database in the Data Catalog.

        • TableName (string) --

          The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

        • TempDirectory (dict) --

          Represents an Amazon S3 location where DataBrew can store intermediate results.

          • Bucket (string) --

            The Amazon S3 bucket name.

          • Key (string) --

            The unique name of the object in the bucket.

          • BucketOwner (string) --

            The Amazon Web Services account ID of the bucket owner.

      • DatabaseInputDefinition (dict) --

        Connection information for dataset input files stored in a database.

        • GlueConnectionName (string) --

          The Glue Connection that stores the connection information for the target database.

        • DatabaseTableName (string) --

          The table within the target database.

        • TempDirectory (dict) --

          Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.

          • Bucket (string) --

            The Amazon S3 bucket name.

          • Key (string) --

            The unique name of the object in the bucket.

          • BucketOwner (string) --

            The Amazon Web Services account ID of the bucket owner.

        • QueryString (string) --

          Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.

      • Metadata (dict) --

        Contains additional resource information needed for specific datasets.

        • SourceArn (string) --

          The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.

    • LastModifiedDate (datetime) --

      The date and time that the dataset was last modified.

    • LastModifiedBy (string) --

      The identifier (user name) of the user who last modified the dataset.

    • Source (string) --

      The location of the data for this dataset: Amazon S3, the Glue Data Catalog, or a database.

    • PathOptions (dict) --

      A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.

      • LastModifiedDateCondition (dict) --

        If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.

        • Expression (string) --

          An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

        • ValuesMap (dict) --

          The map of substitution variable names to their values used in this filter expression.

          • (string) --

            • (string) --

      • FilesLimit (dict) --

        If provided, this structure imposes a limit on the number of files that should be selected.

        • MaxFiles (integer) --

          The number of Amazon S3 files to select.

        • OrderedBy (string) --

          The criterion used to sort Amazon S3 files before they are selected. Defaults to LAST_MODIFIED_DATE, which is currently the only allowed value.

        • Order (string) --

          The sort order applied to Amazon S3 files before they are selected. Defaults to DESCENDING, that is, the most recent files are selected first. The other possible value is ASCENDING.

      • Parameters (dict) --

        A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.

        • (string) --

          • (dict) --

            Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.

            • Name (string) --

              The name of the parameter that is used in the dataset's Amazon S3 path.

            • Type (string) --

              The type of the dataset parameter: one of 'String', 'Number', or 'Datetime'.

            • DatetimeOptions (dict) --

              Additional parameter options such as a format and a timezone. Required for datetime parameters.

              • Format (string) --

                Required option that defines the datetime format used for a date parameter in the Amazon S3 path. Use only supported datetime specifiers and separator characters, and escape all literal a-z or A-Z characters with single quotes. For example, "MM.dd.yyyy-'at'-HH:mm".

              • TimezoneOffset (string) --

                Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.

              • LocaleCode (string) --

                Optional value for a non-US locale code, needed for correct interpretation of some date formats.

            • CreateColumn (boolean) --

              Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.

            • Filter (dict) --

              The optional filter expression structure to apply additional matching criteria to the parameter.

              • Expression (string) --

                An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

              • ValuesMap (dict) --

                The map of substitution variable names to their values used in this filter expression.

                • (string) --

                  • (string) --

    • Tags (dict) --

      Metadata tags associated with this dataset.

      • (string) --

        • (string) --

    • ResourceArn (string) --

      The Amazon Resource Name (ARN) of the dataset.
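
A short sketch of reading the Format field, which may now be ORC, from a describe_dataset response. The helper name and the shape of the summary dict are hypothetical. Format and Source are read with .get on the assumption that they may be absent for some datasets.

```python
def summarize_dataset(client, name):
    """Return a small summary of a dataset from describe_dataset."""
    response = client.describe_dataset(Name=name)
    dataset_format = response.get("Format")
    return {
        "Name": response["Name"],
        "Format": dataset_format,
        "Source": response.get("Source"),
        "IsOrc": dataset_format == "ORC",
    }


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   client = boto3.client("databrew")
#   summarize_dataset(client, "orders-orc")
```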

ListDatasets (updated)
Changes (response)
{'Datasets': {'Format': {'ORC'}}}

Lists all of the DataBrew datasets.

See also: AWS API Documentation

Request Syntax

client.list_datasets(
    MaxResults=123,
    NextToken='string'
)
type MaxResults:

integer

param MaxResults:

The maximum number of results to return in this request.

type NextToken:

string

param NextToken:

The token returned by a previous call to retrieve the next set of results.

rtype:

dict

returns:

Response Syntax

{
    'Datasets': [
        {
            'AccountId': 'string',
            'CreatedBy': 'string',
            'CreateDate': datetime(2015, 1, 1),
            'Name': 'string',
            'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
            'FormatOptions': {
                'Json': {
                    'MultiLine': True|False
                },
                'Excel': {
                    'SheetNames': [
                        'string',
                    ],
                    'SheetIndexes': [
                        123,
                    ],
                    'HeaderRow': True|False
                },
                'Csv': {
                    'Delimiter': 'string',
                    'HeaderRow': True|False
                }
            },
            'Input': {
                'S3InputDefinition': {
                    'Bucket': 'string',
                    'Key': 'string',
                    'BucketOwner': 'string'
                },
                'DataCatalogInputDefinition': {
                    'CatalogId': 'string',
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'TempDirectory': {
                        'Bucket': 'string',
                        'Key': 'string',
                        'BucketOwner': 'string'
                    }
                },
                'DatabaseInputDefinition': {
                    'GlueConnectionName': 'string',
                    'DatabaseTableName': 'string',
                    'TempDirectory': {
                        'Bucket': 'string',
                        'Key': 'string',
                        'BucketOwner': 'string'
                    },
                    'QueryString': 'string'
                },
                'Metadata': {
                    'SourceArn': 'string'
                }
            },
            'LastModifiedDate': datetime(2015, 1, 1),
            'LastModifiedBy': 'string',
            'Source': 'S3'|'DATA-CATALOG'|'DATABASE',
            'PathOptions': {
                'LastModifiedDateCondition': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                },
                'FilesLimit': {
                    'MaxFiles': 123,
                    'OrderedBy': 'LAST_MODIFIED_DATE',
                    'Order': 'DESCENDING'|'ASCENDING'
                },
                'Parameters': {
                    'string': {
                        'Name': 'string',
                        'Type': 'Datetime'|'Number'|'String',
                        'DatetimeOptions': {
                            'Format': 'string',
                            'TimezoneOffset': 'string',
                            'LocaleCode': 'string'
                        },
                        'CreateColumn': True|False,
                        'Filter': {
                            'Expression': 'string',
                            'ValuesMap': {
                                'string': 'string'
                            }
                        }
                    }
                }
            },
            'Tags': {
                'string': 'string'
            },
            'ResourceArn': 'string'
        },
    ],
    'NextToken': 'string'
}
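
The MaxResults and NextToken parameters suggest a simple pagination loop. The sketch below follows NextToken until it is no longer returned and collects the names of datasets whose Format is ORC. The function names and the page size of 50 are hypothetical choices.

```python
def iter_datasets(client, page_size=50):
    """Yield every dataset dict from list_datasets, following NextToken."""
    kwargs = {"MaxResults": page_size}
    while True:
        response = client.list_datasets(**kwargs)
        yield from response.get("Datasets", [])
        token = response.get("NextToken")
        if not token:
            break
        # Pass the token back to fetch the next page of results.
        kwargs["NextToken"] = token


def orc_dataset_names(client):
    """Return the names of all datasets whose Format is ORC."""
    return [d["Name"] for d in iter_datasets(client) if d.get("Format") == "ORC"]


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   print(orc_dataset_names(boto3.client("databrew")))
```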

Response Structure

  • (dict) --

    • Datasets (list) --

      A list of datasets that are defined.

      • (dict) --

        Represents a dataset that can be processed by DataBrew.

        • AccountId (string) --

          The ID of the Amazon Web Services account that owns the dataset.

        • CreatedBy (string) --

          The Amazon Resource Name (ARN) of the user who created the dataset.

        • CreateDate (datetime) --

          The date and time that the dataset was created.

        • Name (string) --

          The unique name of the dataset.

        • Format (string) --

          The file format of a dataset that is created from an Amazon S3 file or folder.

        • FormatOptions (dict) --

          A set of options that define how DataBrew interprets the data in the dataset.

          • Json (dict) --

            Options that define how JSON input is to be interpreted by DataBrew.

            • MultiLine (boolean) --

              A value that specifies whether JSON input contains embedded new line characters.

          • Excel (dict) --

            Options that define how Excel input is to be interpreted by DataBrew.

            • SheetNames (list) --

              One or more named sheets in the Excel file that will be included in the dataset.

              • (string) --

            • SheetIndexes (list) --

              One or more sheet numbers in the Excel file that will be included in the dataset.

              • (integer) --

            • HeaderRow (boolean) --

              A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

          • Csv (dict) --

            Options that define how CSV input is to be interpreted by DataBrew.

            • Delimiter (string) --

              A single character that specifies the delimiter being used in the CSV file.

            • HeaderRow (boolean) --

              A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

        • Input (dict) --

          Information on how DataBrew can find the dataset, in either the Glue Data Catalog or Amazon S3.

          • S3InputDefinition (dict) --

            The Amazon S3 location where the data is stored.

            • Bucket (string) --

              The Amazon S3 bucket name.

            • Key (string) --

              The unique name of the object in the bucket.

            • BucketOwner (string) --

              The Amazon Web Services account ID of the bucket owner.

          • DataCatalogInputDefinition (dict) --

            The Glue Data Catalog parameters for the data.

            • CatalogId (string) --

              The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.

            • DatabaseName (string) --

              The name of a database in the Data Catalog.

            • TableName (string) --

              The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

            • TempDirectory (dict) --

              Represents an Amazon S3 location where DataBrew can store intermediate results.

              • Bucket (string) --

                The Amazon S3 bucket name.

              • Key (string) --

                The unique name of the object in the bucket.

              • BucketOwner (string) --

                The Amazon Web Services account ID of the bucket owner.

          • DatabaseInputDefinition (dict) --

            Connection information for dataset input files stored in a database.

            • GlueConnectionName (string) --

              The Glue Connection that stores the connection information for the target database.

            • DatabaseTableName (string) --

              The table within the target database.

            • TempDirectory (dict) --

              Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.

              • Bucket (string) --

                The Amazon S3 bucket name.

              • Key (string) --

                The unique name of the object in the bucket.

              • BucketOwner (string) --

                The Amazon Web Services account ID of the bucket owner.

            • QueryString (string) --

              Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.

          • Metadata (dict) --

            Contains additional resource information needed for specific datasets.

            • SourceArn (string) --

              The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.

        • LastModifiedDate (datetime) --

          The last modification date and time of the dataset.

        • LastModifiedBy (string) --

          The Amazon Resource Name (ARN) of the user who last modified the dataset.

        • Source (string) --

          The location of the data for the dataset: Amazon S3, the Glue Data Catalog, or a database.

        • PathOptions (dict) --

          A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.

          • LastModifiedDateCondition (dict) --

            If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.

            • Expression (string) --

              An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

            • ValuesMap (dict) --

              The map of substitution variable names to their values used in this filter expression.

              • (string) --

                • (string) --

          • FilesLimit (dict) --

            If provided, this structure imposes a limit on the number of files that should be selected.

            • MaxFiles (integer) --

              The number of Amazon S3 files to select.

            • OrderedBy (string) --

              The criterion used to sort Amazon S3 files before they are selected. The default, LAST_MODIFIED_DATE, is currently the only allowed value.

            • Order (string) --

              The order in which Amazon S3 files are sorted before they are selected. The default is DESCENDING, meaning the most recent files are selected first. The other possible value is ASCENDING.

          • Parameters (dict) --

            A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.

            • (string) --

              • (dict) --

                Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.

                • Name (string) --

                  The name of the parameter that is used in the dataset's Amazon S3 path.

                • Type (string) --

                  The type of the dataset parameter: one of 'String', 'Number', or 'Datetime'.

                • DatetimeOptions (dict) --

                  Additional parameter options such as a format and a timezone. Required for datetime parameters.

                  • Format (string) --

                    Required option that defines the datetime format used for a date parameter in the Amazon S3 path. Use only supported datetime specifiers and separator characters. All literal a-z or A-Z characters should be escaped with single quotes, for example "MM.dd.yyyy-'at'-HH:mm".

                  • TimezoneOffset (string) --

                    Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.

                  • LocaleCode (string) --

                    Optional value for a non-US locale code, needed for correct interpretation of some date formats.

                • CreateColumn (boolean) --

                  Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.

                • Filter (dict) --

                  The optional filter expression structure to apply additional matching criteria to the parameter.

                  • Expression (string) --

                    An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

                  • ValuesMap (dict) --

                    The map of substitution variable names to their values used in this filter expression.

                    • (string) --

                      • (string) --

        • Tags (dict) --

          Metadata tags that have been applied to the dataset.

          • (string) --

            • (string) --

        • ResourceArn (string) --

          The unique Amazon Resource Name (ARN) for the dataset.

    • NextToken (string) --

      A token that you can use in a subsequent call to retrieve the next set of results.
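
The NextToken field above can be followed with a simple loop. The sketch below is illustrative only: it assumes the response belongs to list_datasets, that the entries sit under a Datasets key, and that the token is passed back as a NextToken argument. The names iter_datasets and fake_list_page are hypothetical, and the fake page function stands in for the service so the loop can run offline.

```python
from typing import Any, Callable, Dict, Iterator


def iter_datasets(list_page: Callable[..., Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every dataset entry by following NextToken until it is absent.

    list_page is any callable with the keyword interface assumed here,
    for example the list_datasets method of a DataBrew client.
    """
    kwargs: Dict[str, Any] = {}
    while True:
        page = list_page(**kwargs)
        # 'Datasets' is the assumed key for the list of dataset entries.
        yield from page.get("Datasets", [])
        token = page.get("NextToken")
        if not token:
            break
        # Hand the token back to request the next set of results.
        kwargs["NextToken"] = token


def fake_list_page(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the service: two pages linked by a token."""
    pages = {
        None: {"Datasets": [{"Name": "a"}, {"Name": "b"}], "NextToken": "t1"},
        "t1": {"Datasets": [{"Name": "c"}]},
    }
    return pages[kwargs.get("NextToken")]


names = [d["Name"] for d in iter_datasets(fake_list_page)]
```

Passing the page function in as an argument keeps the loop independent of any particular client object.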

UpdateDataset (updated)
Changes (request)
{'Format': {'ORC'}}

Modifies the definition of an existing DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.update_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string',
            'BucketOwner': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            }
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'QueryString': 'string'
        },
        'Metadata': {
            'SourceArn': 'string'
        }
    },
    PathOptions={
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {
                'string': 'string'
            }
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                }
            }
        }
    }
)
type Name:

string

param Name:

[REQUIRED]

The name of the dataset to be updated.

type Format:

string

param Format:

The file format of a dataset that is created from an Amazon S3 file or folder.

type FormatOptions:

dict

param FormatOptions:

Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.

  • Json (dict) --

    Options that define how JSON input is to be interpreted by DataBrew.

    • MultiLine (boolean) --

      A value that specifies whether JSON input contains embedded new line characters.

  • Excel (dict) --

    Options that define how Excel input is to be interpreted by DataBrew.

    • SheetNames (list) --

      One or more named sheets in the Excel file that will be included in the dataset.

      • (string) --

    • SheetIndexes (list) --

      One or more sheet numbers in the Excel file that will be included in the dataset.

      • (integer) --

    • HeaderRow (boolean) --

      A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

  • Csv (dict) --

    Options that define how CSV input is to be interpreted by DataBrew.

    • Delimiter (string) --

      A single character that specifies the delimiter being used in the CSV file.

    • HeaderRow (boolean) --

      A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
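
As a rough illustration of the FormatOptions structure above, the following sketch builds the Csv and Json branches and returns nothing for formats that have no branch in the request syntax, such as PARQUET and ORC. The helper name format_options and its defaults are assumptions made for this example. Excel is left out to keep it short.

```python
from typing import Any, Dict, Optional


def format_options(fmt: str, *, delimiter: str = ",", header_row: bool = True,
                   multi_line: bool = False) -> Optional[Dict[str, Any]]:
    """Return a FormatOptions dict for fmt, or None when no branch applies."""
    if fmt == "CSV":
        # The Delimiter description calls for a single character.
        if len(delimiter) != 1:
            raise ValueError("Delimiter must be a single character")
        return {"Csv": {"Delimiter": delimiter, "HeaderRow": header_row}}
    if fmt == "JSON":
        return {"Json": {"MultiLine": multi_line}}
    # PARQUET and ORC have no branch in the documented structure.
    return None


pipe_csv = format_options("CSV", delimiter="|")
orc_options = format_options("ORC")
```

A caller would then include FormatOptions in the request only when the helper returns a dict.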

type Input:

dict

param Input:

[REQUIRED]

Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.

  • S3InputDefinition (dict) --

    The Amazon S3 location where the data is stored.

    • Bucket (string) -- [REQUIRED]

      The Amazon S3 bucket name.

    • Key (string) --

      The unique name of the object in the bucket.

    • BucketOwner (string) --

      The Amazon Web Services account ID of the bucket owner.

  • DataCatalogInputDefinition (dict) --

    The Glue Data Catalog parameters for the data.

    • CatalogId (string) --

      The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.

    • DatabaseName (string) -- [REQUIRED]

      The name of a database in the Data Catalog.

    • TableName (string) -- [REQUIRED]

      The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

    • TempDirectory (dict) --

      Represents an Amazon S3 location where DataBrew can store intermediate results.

      • Bucket (string) -- [REQUIRED]

        The Amazon S3 bucket name.

      • Key (string) --

        The unique name of the object in the bucket.

      • BucketOwner (string) --

        The Amazon Web Services account ID of the bucket owner.

  • DatabaseInputDefinition (dict) --

    Connection information for dataset input files stored in a database.

    • GlueConnectionName (string) -- [REQUIRED]

      The Glue Connection that stores the connection information for the target database.

    • DatabaseTableName (string) --

      The table within the target database.

    • TempDirectory (dict) --

      Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.

      • Bucket (string) -- [REQUIRED]

        The Amazon S3 bucket name.

      • Key (string) --

        The unique name of the object in the bucket.

      • BucketOwner (string) --

        The Amazon Web Services account ID of the bucket owner.

    • QueryString (string) --

      Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.

  • Metadata (dict) --

    Contains additional resource information needed for specific datasets.

    • SourceArn (string) --

      The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.
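
The Input structure above can be assembled from small helpers that include optional keys only when they are given. This is a sketch under assumptions: the helper names, bucket names, database name, and table name are all hypothetical, and only the fields marked [REQUIRED] above are enforced.

```python
from typing import Any, Dict, Optional


def s3_location(bucket: str, key: Optional[str] = None,
                bucket_owner: Optional[str] = None) -> Dict[str, str]:
    """Build a Bucket/Key/BucketOwner structure; only Bucket is required."""
    if not bucket:
        raise ValueError("Bucket is required")
    location = {"Bucket": bucket}
    if key is not None:
        location["Key"] = key
    if bucket_owner is not None:
        location["BucketOwner"] = bucket_owner
    return location


def catalog_input(database: str, table: str,
                  temp_dir: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Build an Input dict that reads from the Glue Data Catalog."""
    definition: Dict[str, Any] = {"DatabaseName": database, "TableName": table}
    if temp_dir is not None:
        definition["TempDirectory"] = temp_dir
    return {"DataCatalogInputDefinition": definition}


# Hypothetical values, for illustration only.
s3_only = {"S3InputDefinition": s3_location("example-bucket", "orc/sales/")}
from_catalog = catalog_input("analytics", "sales", s3_location("example-temp-bucket"))
```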

type PathOptions:

dict

param PathOptions:

A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.

  • LastModifiedDateCondition (dict) --

    If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.

    • Expression (string) -- [REQUIRED]

      An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

    • ValuesMap (dict) -- [REQUIRED]

      The map of substitution variable names to their values used in this filter expression.

      • (string) --

        • (string) --

  • FilesLimit (dict) --

    If provided, this structure imposes a limit on the number of files that should be selected.

    • MaxFiles (integer) -- [REQUIRED]

      The number of Amazon S3 files to select.

    • OrderedBy (string) --

      The criterion used to sort Amazon S3 files before they are selected. The default, LAST_MODIFIED_DATE, is currently the only allowed value.

    • Order (string) --

      The order in which Amazon S3 files are sorted before they are selected. The default is DESCENDING, meaning the most recent files are selected first. The other possible value is ASCENDING.

  • Parameters (dict) --

    A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.

    • (string) --

      • (dict) --

        Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.

        • Name (string) -- [REQUIRED]

          The name of the parameter that is used in the dataset's Amazon S3 path.

        • Type (string) -- [REQUIRED]

          The type of the dataset parameter: one of 'String', 'Number', or 'Datetime'.

        • DatetimeOptions (dict) --

          Additional parameter options such as a format and a timezone. Required for datetime parameters.

          • Format (string) -- [REQUIRED]

            Required option that defines the datetime format used for a date parameter in the Amazon S3 path. Use only supported datetime specifiers and separator characters. All literal a-z or A-Z characters should be escaped with single quotes, for example "MM.dd.yyyy-'at'-HH:mm".

          • TimezoneOffset (string) --

            Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.

          • LocaleCode (string) --

            Optional value for a non-US locale code, needed for correct interpretation of some date formats.

        • CreateColumn (boolean) --

          Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.

        • Filter (dict) --

          The optional filter expression structure to apply additional matching criteria to the parameter.

          • Expression (string) -- [REQUIRED]

            An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.

          • ValuesMap (dict) -- [REQUIRED]

            The map of substitution variable names to their values used in this filter expression.

            • (string) --

              • (string) --
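
To make the PathOptions structure above more concrete, here is a minimal sketch that pairs a filter expression with its ValuesMap and assembles a FilesLimit and two path parameters. Several details are assumptions rather than documented behavior: the helper name filter_expression, the parameter names region and day, the convention that ValuesMap keys carry the leading ':', and the choice to key each Parameters entry by the same string as its Name.

```python
import re
from typing import Any, Dict


def filter_expression(expression: str, values: Dict[str, str]) -> Dict[str, Any]:
    """Pair an Expression with its ValuesMap and check that they agree."""
    # Substitution variables start with ':' as described above.
    used = set(re.findall(r":\w+", expression))
    # Assumption: ValuesMap keys are written with the leading ':'.
    missing = used - set(values)
    if missing:
        raise ValueError(f"no value supplied for {sorted(missing)}")
    return {"Expression": expression, "ValuesMap": dict(values)}


path_options = {
    "FilesLimit": {
        "MaxFiles": 10,
        "OrderedBy": "LAST_MODIFIED_DATE",
        "Order": "DESCENDING",
    },
    "Parameters": {
        # Hypothetical String parameter restricted by a prefix filter.
        "region": {
            "Name": "region",
            "Type": "String",
            "CreateColumn": True,
            "Filter": filter_expression("starts_with :prefix1", {":prefix1": "eu-"}),
        },
        # Hypothetical Datetime parameter using the format from the text above.
        "day": {
            "Name": "day",
            "Type": "Datetime",
            "DatetimeOptions": {"Format": "MM.dd.yyyy-'at'-HH:mm"},
        },
    },
}
```

Checking the expression against the map before sending the request catches a forgotten substitution value locally rather than in a service error.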

rtype:

dict

returns:

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) --

    • Name (string) --

      The name of the dataset that you updated.
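
Putting the request and response together, the sketch below switches an existing dataset to the ORC format and reads the Name field from the response. The function name, the dataset name, and the S3 location are hypothetical. StubClient is an offline stand-in that only mimics the response shape shown above; it is not part of any library.

```python
from typing import Any, Dict, List


def switch_dataset_to_orc(client: Any, name: str, bucket: str, key: str) -> str:
    """Point an existing dataset at ORC data in S3 and return the updated name."""
    response: Dict[str, Any] = client.update_dataset(
        Name=name,
        Format="ORC",
        Input={"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    )
    return response["Name"]


class StubClient:
    """Hypothetical stand-in that records calls and echoes the dataset name."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def update_dataset(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"Name": kwargs["Name"]}


stub = StubClient()
updated = switch_dataset_to_orc(stub, "sales", "example-bucket", "orc/sales/")
```

With a real DataBrew client in place of the stub, the same function would issue the update_dataset request described in this section, assuming valid credentials and an SDK version that includes the ORC value.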