2021/03/30 - AWS Glue DataBrew - 4 updated api methods
Changes: This SDK release adds two new dataset features: (1) support for specifying a database connection as a dataset input, and (2) support for dynamic datasets that accept configurable parameters in the S3 path.
{'Input': {'DatabaseInputDefinition': {'DatabaseTableName': 'string', 'GlueConnectionName': 'string', 'TempDirectory': {'Bucket': 'string', 'Key': 'string'}}}, 'PathOptions': {'FilesLimit': {'MaxFiles': 'integer', 'Order': 'DESCENDING | ASCENDING', 'OrderedBy': 'LAST_MODIFIED_DATE'}, 'LastModifiedDateCondition': {'Expression': 'string', 'ValuesMap': {'string': 'string'}}, 'Parameters': {'string': {'CreateColumn': 'boolean', 'DatetimeOptions': {'Format': 'string', 'LocaleCode': 'string', 'TimezoneOffset': 'string'}, 'Filter': {'Expression': 'string', 'ValuesMap': {'string': 'string'}}, 'Name': 'string', 'Type': 'Datetime | Number | String'}}}}
Creates a new DataBrew dataset.
See also: AWS API Documentation
Request Syntax
client.create_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL',
    FormatOptions={
        'Json': {'MultiLine': True|False},
        'Excel': {'SheetNames': ['string', ], 'SheetIndexes': [123, ], 'HeaderRow': True|False},
        'Csv': {'Delimiter': 'string', 'HeaderRow': True|False}
    },
    Input={
        'S3InputDefinition': {'Bucket': 'string', 'Key': 'string'},
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {'Bucket': 'string', 'Key': 'string'}
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {'Bucket': 'string', 'Key': 'string'}
        }
    },
    PathOptions={
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {'string': 'string'}
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {'string': 'string'}
                }
            }
        }
    },
    Tags={'string': 'string'}
)
Name (string) -- [REQUIRED]
The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.
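The character rule above can be checked on the client side before the request is sent. The following is a minimal sketch, assuming a hypothetical helper named is_valid_dataset_name; it covers only the character set stated above and not any other rule (such as a length limit) that the service may enforce.

```python
import re

# Characters allowed by the description above:
# A-Z, a-z, 0-9, hyphen, period, and space.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9.\- ]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if every character of `name` is in the documented set.

    Hypothetical helper for illustration; the service may apply
    additional rules that are not checked here.
    """
    return bool(_NAME_PATTERN.fullmatch(name))
```

For example, is_valid_dataset_name("sales-2021.v1 raw") passes, while a name containing an underscore does not.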
Format (string) --
The file format of a dataset that is created from an S3 file or folder.
FormatOptions (dict) --
Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded new line characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
One or more named sheets in the Excel file that will be included in the dataset.
(string) --
SheetIndexes (list) --
One or more sheet numbers in the Excel file that will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter being used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Input (dict) -- [REQUIRED]
Represents information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) -- [REQUIRED]
The name of a database in the Data Catalog.
TableName (string) -- [REQUIRED]
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DatabaseInputDefinition (dict) --
Connection information for dataset input files stored in a database.
GlueConnectionName (string) -- [REQUIRED]
The AWS Glue Connection that stores the connection information for the target database.
DatabaseTableName (string) -- [REQUIRED]
The table within the target database.
TempDirectory (dict) --
Represents an Amazon S3 location (bucket name and object key) where DataBrew can read input data, or write output from a job.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
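As an illustration of the new DatabaseInputDefinition structure, the sketch below builds an Input value for a dataset backed by a database table. The helper name database_input and all literal values (connection name, table name, bucket, key) are placeholders introduced here, not values from the API.

```python
def database_input(connection_name, table_name, temp_bucket=None, temp_key=None):
    """Build the Input argument for a dataset that reads from a database table.

    Hypothetical helper: only the dictionary keys come from the request
    syntax above; the function itself is not part of the SDK.
    """
    definition = {
        "GlueConnectionName": connection_name,  # required
        "DatabaseTableName": table_name,        # required
    }
    if temp_bucket is not None:
        # Bucket is required inside TempDirectory; Key is optional.
        temp_dir = {"Bucket": temp_bucket}
        if temp_key is not None:
            temp_dir["Key"] = temp_key
        definition["TempDirectory"] = temp_dir
    return {"DatabaseInputDefinition": definition}


# Placeholder values; substitute your own connection, table, and bucket.
example_input = database_input(
    "my-jdbc-connection", "public.orders", "my-temp-bucket", "databrew/tmp/"
)
```

The resulting dictionary can be passed as the Input argument of create_dataset or update_dataset.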
PathOptions (dict) --
A set of options that defines how DataBrew interprets an S3 path of the dataset.
LastModifiedDateCondition (dict) --
If provided, this structure defines a date range for matching S3 objects based on their LastModifiedDate attribute in S3.
Expression (string) -- [REQUIRED]
An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
ValuesMap (dict) -- [REQUIRED]
The map of substitution variable names to their values used in this filter expression.
(string) --
(string) --
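An Expression and its ValuesMap have to agree on the substitution variables they use. The sketch below is a local consistency check, not part of the SDK: the helper name check_expression is hypothetical, and it assumes that the ValuesMap keys are written with the leading ':' exactly as they appear in the expression.

```python
import re


def check_expression(expression, values_map):
    """Return a list of problems; an empty list means the pair is consistent.

    Assumption: ValuesMap keys include the leading ':' character.
    """
    used = set(re.findall(r":\w+", expression))  # variables in the expression
    defined = set(values_map)                    # variables that have a value
    problems = [
        f"key {key!r} does not start with ':'"
        for key in sorted(defined)
        if not key.startswith(":")
    ]
    problems += [f"{var} has no value in ValuesMap" for var in sorted(used - defined)]
    problems += [f"{key} is not used in the expression" for key in sorted(defined - used)]
    return problems
```

Running the check before a request can catch a misspelled variable name early; the service performs its own validation regardless.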
FilesLimit (dict) --
If provided, this structure imposes a limit on the number of files that should be selected.
MaxFiles (integer) -- [REQUIRED]
The number of S3 files to select.
OrderedBy (string) --
The criterion to use for sorting S3 files before their selection. Defaults to LAST_MODIFIED_DATE, which is currently the only allowed value.
Order (string) --
The order in which to sort S3 files before their selection. Defaults to DESCENDING, i.e. the most recent files are selected first. The other possible value is ASCENDING.
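The effect of FilesLimit can be pictured as a sort followed by a cut. The sketch below imitates that behavior locally on a hand-written file list; it illustrates the description above and is not how DataBrew is implemented. The function name select_files and the sample keys are made up for the example.

```python
from datetime import datetime


def select_files(files, max_files, order="DESCENDING"):
    """Pick up to `max_files` keys from (key, last_modified) pairs.

    Files are sorted by last-modified date first; DESCENDING means the
    most recent files are selected first.
    """
    newest_first = order == "DESCENDING"
    ranked = sorted(files, key=lambda item: item[1], reverse=newest_first)
    return [key for key, _ in ranked[:max_files]]


# Sample data invented for the illustration.
files = [
    ("logs/a.csv", datetime(2021, 3, 1)),
    ("logs/b.csv", datetime(2021, 3, 15)),
    ("logs/c.csv", datetime(2021, 3, 30)),
]

print(select_files(files, 2))               # ['logs/c.csv', 'logs/b.csv']
print(select_files(files, 2, "ASCENDING"))  # ['logs/a.csv', 'logs/b.csv']
```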
Parameters (dict) --
A structure that maps names of parameters used in the S3 path of a dataset to their definitions.
(string) --
(dict) --
Represents a dataset parameter that defines the type and conditions for a parameter in the S3 path of the dataset.
Name (string) -- [REQUIRED]
The name of the parameter that is used in the dataset's S3 path.
Type (string) -- [REQUIRED]
The type of the dataset parameter; one of 'String', 'Number', or 'Datetime'.
DatetimeOptions (dict) --
Additional parameter options such as a format and a timezone. Required for datetime parameters.
Format (string) -- [REQUIRED]
Required option that defines the datetime format used for a date parameter in the S3 path. Use only supported datetime specifiers and separation characters; all literal a-z or A-Z characters should be escaped with single quotes. E.g. "MM.dd.yyyy-'at'-HH:mm".
TimezoneOffset (string) --
Optional value for a timezone offset of the datetime parameter value in the S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
LocaleCode (string) --
Optional value for a non-US locale code, needed for correct interpretation of some date formats.
CreateColumn (boolean) --
Optional boolean value that defines whether the captured value of this parameter should be loaded as an additional column in the dataset.
Filter (dict) --
The optional filter expression structure to apply additional matching criteria to the parameter.
Expression (string) -- [REQUIRED]
An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
ValuesMap (dict) -- [REQUIRED]
The map of substitution variable names to their values used in this filter expression.
(string) --
(string) --
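To make the PathOptions structure more concrete, the sketch below assembles one with a file limit and two parameters, then checks the rule that datetime parameters need DatetimeOptions. Several things here are assumptions rather than facts from this document: the parameter names region and day, the filter value "eu-", the choice to use each parameter's Name as its key in the Parameters map, and the helper missing_datetime_options.

```python
def example_path_options():
    """Return an illustrative PathOptions value (all values are placeholders)."""
    return {
        "FilesLimit": {
            "MaxFiles": 10,
            "OrderedBy": "LAST_MODIFIED_DATE",
            "Order": "DESCENDING",
        },
        "Parameters": {
            # Assumption: the map key repeats the parameter's Name.
            "region": {
                "Name": "region",
                "Type": "String",
                "CreateColumn": True,  # load the captured value as a column
                "Filter": {
                    "Expression": "starts_with :prefix1",
                    "ValuesMap": {":prefix1": "eu-"},
                },
            },
            "day": {
                "Name": "day",
                "Type": "Datetime",
                # Format derived from the "MM.dd.yyyy-'at'-HH:mm" example above.
                "DatetimeOptions": {"Format": "MM.dd.yyyy"},
            },
        },
    }


def missing_datetime_options(parameters):
    """Return names of Datetime parameters that lack DatetimeOptions.Format."""
    return [
        name
        for name, param in parameters.items()
        if param.get("Type") == "Datetime"
        and not param.get("DatetimeOptions", {}).get("Format")
    ]
```

A non-empty result from missing_datetime_options points at parameters that the description above marks as incomplete.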
Tags (dict) --
Metadata tags to apply to this dataset.
(string) --
(string) --
dict
Response Syntax
{ 'Name': 'string' }
Response Structure
(dict) --
Name (string) --
The name of the dataset that you created.
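Putting the request and response together, a call that creates a database-backed dataset might look like the sketch below. The function names and all literal values are placeholders, and the code assumes that boto3 is installed and that credentials and a region are configured; the client is passed in so the helper can also be exercised with a stub.

```python
def create_dataset_from_table(client, name, connection_name, table_name, tags=None):
    """Create a dataset from a database table and return its name.

    `client` is expected to behave like boto3.client("databrew").
    """
    kwargs = {
        "Name": name,
        "Input": {
            "DatabaseInputDefinition": {
                "GlueConnectionName": connection_name,
                "DatabaseTableName": table_name,
            }
        },
    }
    if tags:
        kwargs["Tags"] = tags
    response = client.create_dataset(**kwargs)
    return response["Name"]  # the response carries only the dataset name


def main():
    # Imported here so the helper above has no hard dependency on boto3.
    import boto3

    client = boto3.client("databrew")
    # Placeholder names; replace with your own resources.
    print(create_dataset_from_table(client, "orders-db", "my-jdbc-connection", "public.orders"))
```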
{'Input': {'DatabaseInputDefinition': {'DatabaseTableName': 'string', 'GlueConnectionName': 'string', 'TempDirectory': {'Bucket': 'string', 'Key': 'string'}}}, 'PathOptions': {'FilesLimit': {'MaxFiles': 'integer', 'Order': 'DESCENDING | ASCENDING', 'OrderedBy': 'LAST_MODIFIED_DATE'}, 'LastModifiedDateCondition': {'Expression': 'string', 'ValuesMap': {'string': 'string'}}, 'Parameters': {'string': {'CreateColumn': 'boolean', 'DatetimeOptions': {'Format': 'string', 'LocaleCode': 'string', 'TimezoneOffset': 'string'}, 'Filter': {'Expression': 'string', 'ValuesMap': {'string': 'string'}}, 'Name': 'string', 'Type': 'Datetime | Number | String'}}}, 'Source': {'DATABASE'}}
Returns the definition of a specific DataBrew dataset.
See also: AWS API Documentation
Request Syntax
client.describe_dataset( Name='string' )
Name (string) -- [REQUIRED]
The name of the dataset to be described.
dict
Response Syntax
{
    'CreatedBy': 'string',
    'CreateDate': datetime(2015, 1, 1),
    'Name': 'string',
    'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL',
    'FormatOptions': {
        'Json': {'MultiLine': True|False},
        'Excel': {'SheetNames': ['string', ], 'SheetIndexes': [123, ], 'HeaderRow': True|False},
        'Csv': {'Delimiter': 'string', 'HeaderRow': True|False}
    },
    'Input': {
        'S3InputDefinition': {'Bucket': 'string', 'Key': 'string'},
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {'Bucket': 'string', 'Key': 'string'}
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {'Bucket': 'string', 'Key': 'string'}
        }
    },
    'LastModifiedDate': datetime(2015, 1, 1),
    'LastModifiedBy': 'string',
    'Source': 'S3'|'DATA-CATALOG'|'DATABASE',
    'PathOptions': {
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {'string': 'string'}
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {'string': 'string'}
                }
            }
        }
    },
    'Tags': {'string': 'string'},
    'ResourceArn': 'string'
}
Response Structure
(dict) --
CreatedBy (string) --
The identifier (user name) of the user who created the dataset.
CreateDate (datetime) --
The date and time that the dataset was created.
Name (string) --
The name of the dataset.
Format (string) --
The file format of a dataset that is created from an S3 file or folder.
FormatOptions (dict) --
Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded new line characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
One or more named sheets in the Excel file that will be included in the dataset.
(string) --
SheetIndexes (list) --
One or more sheet numbers in the Excel file that will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter being used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Input (dict) --
Represents information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) --
The name of a database in the Data Catalog.
TableName (string) --
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DatabaseInputDefinition (dict) --
Connection information for dataset input files stored in a database.
GlueConnectionName (string) --
The AWS Glue Connection that stores the connection information for the target database.
DatabaseTableName (string) --
The table within the target database.
TempDirectory (dict) --
Represents an Amazon S3 location (bucket name and object key) where DataBrew can read input data, or write output from a job.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
LastModifiedDate (datetime) --
The date and time that the dataset was last modified.
LastModifiedBy (string) --
The identifier (user name) of the user who last modified the dataset.
Source (string) --
The location of the data for this dataset, Amazon S3 or the AWS Glue Data Catalog.
PathOptions (dict) --
A set of options that defines how DataBrew interprets an S3 path of the dataset.
LastModifiedDateCondition (dict) --
If provided, this structure defines a date range for matching S3 objects based on their LastModifiedDate attribute in S3.
Expression (string) --
An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
ValuesMap (dict) --
The map of substitution variable names to their values used in this filter expression.
(string) --
(string) --
FilesLimit (dict) --
If provided, this structure imposes a limit on the number of files that should be selected.
MaxFiles (integer) --
The number of S3 files to select.
OrderedBy (string) --
The criterion to use for sorting S3 files before their selection. Defaults to LAST_MODIFIED_DATE, which is currently the only allowed value.
Order (string) --
The order in which to sort S3 files before their selection. Defaults to DESCENDING, i.e. the most recent files are selected first. The other possible value is ASCENDING.
Parameters (dict) --
A structure that maps names of parameters used in the S3 path of a dataset to their definitions.
(string) --
(dict) --
Represents a dataset parameter that defines the type and conditions for a parameter in the S3 path of the dataset.
Name (string) --
The name of the parameter that is used in the dataset's S3 path.
Type (string) --
The type of the dataset parameter; one of 'String', 'Number', or 'Datetime'.
DatetimeOptions (dict) --
Additional parameter options such as a format and a timezone. Required for datetime parameters.
Format (string) --
Required option that defines the datetime format used for a date parameter in the S3 path. Use only supported datetime specifiers and separation characters; all literal a-z or A-Z characters should be escaped with single quotes. E.g. "MM.dd.yyyy-'at'-HH:mm".
TimezoneOffset (string) --
Optional value for a timezone offset of the datetime parameter value in the S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
LocaleCode (string) --
Optional value for a non-US locale code, needed for correct interpretation of some date formats.
CreateColumn (boolean) --
Optional boolean value that defines whether the captured value of this parameter should be loaded as an additional column in the dataset.
Filter (dict) --
The optional filter expression structure to apply additional matching criteria to the parameter.
Expression (string) --
An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
ValuesMap (dict) --
The map of substitution variable names to their values used in this filter expression.
(string) --
(string) --
Tags (dict) --
Metadata tags associated with this dataset.
(string) --
(string) --
ResourceArn (string) --
The Amazon Resource Name (ARN) of the dataset.
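With DATABASE added to the possible Source values, code that inspects a describe_dataset response now has three input shapes to handle. The sketch below shows one way to branch on them; the function name describe_source and the wording of the returned strings are invented for this example.

```python
def describe_source(response):
    """Return a short description of where a dataset reads its data from.

    `response` is a dict shaped like the describe_dataset response above.
    """
    source = response.get("Source")
    dataset_input = response.get("Input", {})
    if source == "DATABASE":
        d = dataset_input.get("DatabaseInputDefinition", {})
        return (
            f"database table {d.get('DatabaseTableName')} "
            f"via Glue connection {d.get('GlueConnectionName')}"
        )
    if source == "DATA-CATALOG":
        d = dataset_input.get("DataCatalogInputDefinition", {})
        return f"Data Catalog table {d.get('DatabaseName')}.{d.get('TableName')}"
    if source == "S3":
        d = dataset_input.get("S3InputDefinition", {})
        return f"s3://{d.get('Bucket')}/{d.get('Key', '')}"
    return f"unknown source {source!r}"


# Typical use, assuming a configured boto3 client named `client`:
#   print(describe_source(client.describe_dataset(Name="orders-db")))
```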
{'Datasets': {'Input': {'DatabaseInputDefinition': {'DatabaseTableName': 'string', 'GlueConnectionName': 'string', 'TempDirectory': {'Bucket': 'string', 'Key': 'string'}}}, 'PathOptions': {'FilesLimit': {'MaxFiles': 'integer', 'Order': 'DESCENDING | ASCENDING', 'OrderedBy': 'LAST_MODIFIED_DATE'}, 'LastModifiedDateCondition': {'Expression': 'string', 'ValuesMap': {'string': 'string'}}, 'Parameters': {'string': {'CreateColumn': 'boolean', 'DatetimeOptions': {'Format': 'string', 'LocaleCode': 'string', 'TimezoneOffset': 'string'}, 'Filter': {'Expression': 'string', 'ValuesMap': {'string': 'string'}}, 'Name': 'string', 'Type': 'Datetime | Number | String'}}}, 'Source': {'DATABASE'}}}
Lists all of the DataBrew datasets.
See also: AWS API Documentation
Request Syntax
client.list_datasets( MaxResults=123, NextToken='string' )
MaxResults (integer) --
The maximum number of results to return in this request.
NextToken (string) --
The token returned by a previous call to retrieve the next set of results.
dict
Response Syntax
{
    'Datasets': [
        {
            'AccountId': 'string',
            'CreatedBy': 'string',
            'CreateDate': datetime(2015, 1, 1),
            'Name': 'string',
            'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL',
            'FormatOptions': {
                'Json': {'MultiLine': True|False},
                'Excel': {'SheetNames': ['string', ], 'SheetIndexes': [123, ], 'HeaderRow': True|False},
                'Csv': {'Delimiter': 'string', 'HeaderRow': True|False}
            },
            'Input': {
                'S3InputDefinition': {'Bucket': 'string', 'Key': 'string'},
                'DataCatalogInputDefinition': {
                    'CatalogId': 'string',
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'TempDirectory': {'Bucket': 'string', 'Key': 'string'}
                },
                'DatabaseInputDefinition': {
                    'GlueConnectionName': 'string',
                    'DatabaseTableName': 'string',
                    'TempDirectory': {'Bucket': 'string', 'Key': 'string'}
                }
            },
            'LastModifiedDate': datetime(2015, 1, 1),
            'LastModifiedBy': 'string',
            'Source': 'S3'|'DATA-CATALOG'|'DATABASE',
            'PathOptions': {
                'LastModifiedDateCondition': {
                    'Expression': 'string',
                    'ValuesMap': {'string': 'string'}
                },
                'FilesLimit': {
                    'MaxFiles': 123,
                    'OrderedBy': 'LAST_MODIFIED_DATE',
                    'Order': 'DESCENDING'|'ASCENDING'
                },
                'Parameters': {
                    'string': {
                        'Name': 'string',
                        'Type': 'Datetime'|'Number'|'String',
                        'DatetimeOptions': {
                            'Format': 'string',
                            'TimezoneOffset': 'string',
                            'LocaleCode': 'string'
                        },
                        'CreateColumn': True|False,
                        'Filter': {
                            'Expression': 'string',
                            'ValuesMap': {'string': 'string'}
                        }
                    }
                }
            },
            'Tags': {'string': 'string'},
            'ResourceArn': 'string'
        },
    ],
    'NextToken': 'string'
}
Response Structure
(dict) --
Datasets (list) --
A list of datasets that are defined.
(dict) --
Represents a dataset that can be processed by DataBrew.
AccountId (string) --
The ID of the AWS account that owns the dataset.
CreatedBy (string) --
The Amazon Resource Name (ARN) of the user who created the dataset.
CreateDate (datetime) --
The date and time that the dataset was created.
Name (string) --
The unique name of the dataset.
Format (string) --
The file format of a dataset that is created from an S3 file or folder.
FormatOptions (dict) --
A set of options that define how DataBrew interprets the data in the dataset.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded new line characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
One or more named sheets in the Excel file that will be included in the dataset.
(string) --
SheetIndexes (list) --
One or more sheet numbers in the Excel file that will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter being used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Input (dict) --
Information on how DataBrew can find the dataset, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) --
The name of a database in the Data Catalog.
TableName (string) --
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DatabaseInputDefinition (dict) --
Connection information for dataset input files stored in a database.
GlueConnectionName (string) --
The AWS Glue Connection that stores the connection information for the target database.
DatabaseTableName (string) --
The table within the target database.
TempDirectory (dict) --
Represents an Amazon S3 location (bucket name and object key) where DataBrew can read input data, or write output from a job.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
LastModifiedDate (datetime) --
The last modification date and time of the dataset.
LastModifiedBy (string) --
The Amazon Resource Name (ARN) of the user who last modified the dataset.
Source (string) --
The location of the data for the dataset, either Amazon S3 or the AWS Glue Data Catalog.
PathOptions (dict) --
A set of options that defines how DataBrew interprets an S3 path of the dataset.
LastModifiedDateCondition (dict) --
If provided, this structure defines a date range for matching S3 objects based on their LastModifiedDate attribute in S3.
Expression (string) --
An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
ValuesMap (dict) --
The map of substitution variable names to their values used in this filter expression.
(string) --
(string) --
FilesLimit (dict) --
If provided, this structure imposes a limit on the number of files that should be selected.
MaxFiles (integer) --
The number of S3 files to select.
OrderedBy (string) --
The criterion to use for sorting S3 files before their selection. Defaults to LAST_MODIFIED_DATE, which is currently the only allowed value.
Order (string) --
The order in which to sort S3 files before their selection. Defaults to DESCENDING, i.e. the most recent files are selected first. The other possible value is ASCENDING.
Parameters (dict) --
A structure that maps names of parameters used in the S3 path of a dataset to their definitions.
(string) --
(dict) --
Represents a dataset parameter that defines the type and conditions for a parameter in the S3 path of the dataset.
Name (string) --
The name of the parameter that is used in the dataset's S3 path.
Type (string) --
The type of the dataset parameter; one of 'String', 'Number', or 'Datetime'.
DatetimeOptions (dict) --
Additional parameter options such as a format and a timezone. Required for datetime parameters.
Format (string) --
Required option that defines the datetime format used for a date parameter in the S3 path. Use only supported datetime specifiers and separation characters; all literal a-z or A-Z characters should be escaped with single quotes. E.g. "MM.dd.yyyy-'at'-HH:mm".
TimezoneOffset (string) --
Optional value for a timezone offset of the datetime parameter value in the S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
LocaleCode (string) --
Optional value for a non-US locale code, needed for correct interpretation of some date formats.
CreateColumn (boolean) --
Optional boolean value that defines whether the captured value of this parameter should be loaded as an additional column in the dataset.
Filter (dict) --
The optional filter expression structure to apply additional matching criteria to the parameter.
Expression (string) --
An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
ValuesMap (dict) --
The map of substitution variable names to their values used in this filter expression.
(string) --
(string) --
Tags (dict) --
Metadata tags that have been applied to the dataset.
(string) --
(string) --
ResourceArn (string) --
The unique Amazon Resource Name (ARN) for the dataset.
NextToken (string) --
A token that you can use in a subsequent call to retrieve the next set of results.
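Because list_datasets returns results in pages, a caller has to pass each NextToken back until none is returned. The sketch below shows that loop and, as a usage example, filters for datasets whose Source is DATABASE. The function names and the page size of 50 are choices made for this illustration; the client is passed in and is expected to behave like boto3.client("databrew").

```python
def iter_datasets(client, page_size=50):
    """Yield every dataset entry, following NextToken until it is absent."""
    kwargs = {"MaxResults": page_size}
    while True:
        page = client.list_datasets(**kwargs)
        yield from page.get("Datasets", [])
        token = page.get("NextToken")
        if not token:
            break  # no further pages
        kwargs["NextToken"] = token


def database_dataset_names(client):
    """Return the names of all datasets that read from a database."""
    return [d["Name"] for d in iter_datasets(client) if d.get("Source") == "DATABASE"]
```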
{'Input': {'DatabaseInputDefinition': {'DatabaseTableName': 'string', 'GlueConnectionName': 'string', 'TempDirectory': {'Bucket': 'string', 'Key': 'string'}}}, 'PathOptions': {'FilesLimit': {'MaxFiles': 'integer', 'Order': 'DESCENDING | ASCENDING', 'OrderedBy': 'LAST_MODIFIED_DATE'}, 'LastModifiedDateCondition': {'Expression': 'string', 'ValuesMap': {'string': 'string'}}, 'Parameters': {'string': {'CreateColumn': 'boolean', 'DatetimeOptions': {'Format': 'string', 'LocaleCode': 'string', 'TimezoneOffset': 'string'}, 'Filter': {'Expression': 'string', 'ValuesMap': {'string': 'string'}}, 'Name': 'string', 'Type': 'Datetime | Number | String'}}}}
Modifies the definition of an existing DataBrew dataset.
See also: AWS API Documentation
Request Syntax
client.update_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL',
    FormatOptions={
        'Json': {'MultiLine': True|False},
        'Excel': {'SheetNames': ['string', ], 'SheetIndexes': [123, ], 'HeaderRow': True|False},
        'Csv': {'Delimiter': 'string', 'HeaderRow': True|False}
    },
    Input={
        'S3InputDefinition': {'Bucket': 'string', 'Key': 'string'},
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {'Bucket': 'string', 'Key': 'string'}
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {'Bucket': 'string', 'Key': 'string'}
        }
    },
    PathOptions={
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {'string': 'string'}
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {'string': 'string'}
                }
            }
        }
    }
)
Name (string) -- [REQUIRED]
The name of the dataset to be updated.
Format (string) --
The file format of a dataset that is created from an S3 file or folder.
FormatOptions (dict) --
Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded new line characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
One or more named sheets in the Excel file that will be included in the dataset.
(string) --
SheetIndexes (list) --
One or more sheet numbers in the Excel file that will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter being used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Input (dict) -- [REQUIRED]
Represents information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) -- [REQUIRED]
The name of a database in the Data Catalog.
TableName (string) -- [REQUIRED]
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DatabaseInputDefinition (dict) --
Connection information for dataset input files stored in a database.
GlueConnectionName (string) -- [REQUIRED]
The AWS Glue Connection that stores the connection information for the target database.
DatabaseTableName (string) -- [REQUIRED]
The table within the target database.
TempDirectory (dict) --
Represents an Amazon S3 location (bucket name and object key) where DataBrew can read input data, or write output from a job.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
PathOptions (dict) --
A set of options that defines how DataBrew interprets an S3 path of the dataset.
LastModifiedDateCondition (dict) --
If provided, this structure defines a date range for matching S3 objects based on their LastModifiedDate attribute in S3.
Expression (string) -- [REQUIRED]
An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
ValuesMap (dict) -- [REQUIRED]
The map of substitution variable names to their values used in this filter expression.
(string) --
(string) --
FilesLimit (dict) --
If provided, this structure imposes a limit on the number of files that should be selected.
MaxFiles (integer) -- [REQUIRED]
The number of S3 files to select.
OrderedBy (string) --
The criterion to use for sorting S3 files before their selection. Defaults to LAST_MODIFIED_DATE, which is currently the only allowed value.
Order (string) --
The order in which to sort S3 files before their selection. Defaults to DESCENDING, i.e. the most recent files are selected first. The other possible value is ASCENDING.
Parameters (dict) --
A structure that maps names of parameters used in the S3 path of a dataset to their definitions.
(string) --
(dict) --
Represents a dataset parameter that defines the type and conditions for a parameter in the S3 path of the dataset.
Name (string) -- [REQUIRED]
The name of the parameter that is used in the dataset's S3 path.
Type (string) -- [REQUIRED]
The type of the dataset parameter; one of 'String', 'Number', or 'Datetime'.
DatetimeOptions (dict) --
Additional parameter options such as a format and a timezone. Required for datetime parameters.
Format (string) -- [REQUIRED]
Required option that defines the datetime format used for a date parameter in the S3 path. Use only supported datetime specifiers and separation characters; all literal a-z or A-Z characters should be escaped with single quotes. E.g. "MM.dd.yyyy-'at'-HH:mm".
TimezoneOffset (string) --
Optional value for a timezone offset of the datetime parameter value in the S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
LocaleCode (string) --
Optional value for a non-US locale code, needed for correct interpretation of some date formats.
CreateColumn (boolean) --
Optional boolean value that defines whether the captured value of this parameter should be loaded as an additional column in the dataset.
Filter (dict) --
The optional filter expression structure to apply additional matching criteria to the parameter.
Expression (string) -- [REQUIRED]
An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
ValuesMap (dict) -- [REQUIRED]
The map of substitution variable names to their values used in this filter expression.
(string) --
(string) --
dict
Response Syntax
{ 'Name': 'string' }
Response Structure
(dict) --
Name (string) --
The name of the dataset that you updated.
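Since update_dataset requires Input on every call, a common pattern is to read the current definition first and send it back with one field changed. The sketch below does this to set a file limit. It rests on several assumptions: the dataset is S3-based (PathOptions describes an S3 path), the function name set_files_limit is invented here, and whether fields omitted from an update are kept or cleared is not stated in this document, so the current values are passed back explicitly.

```python
# Fields that appear in both the describe_dataset response and the
# update_dataset request, according to the syntax shown above.
_UPDATABLE = ("Format", "FormatOptions", "Input", "PathOptions")


def set_files_limit(client, name, max_files):
    """Set PathOptions.FilesLimit.MaxFiles on an existing dataset.

    `client` is expected to behave like boto3.client("databrew").
    Returns the dataset name reported by update_dataset.
    """
    current = client.describe_dataset(Name=name)
    # Copy only request-compatible fields; response-only fields such as
    # Source or CreateDate are left out.
    kwargs = {key: current[key] for key in _UPDATABLE if key in current}
    path_options = dict(kwargs.get("PathOptions", {}))
    path_options["FilesLimit"] = {"MaxFiles": max_files}  # MaxFiles is required
    kwargs["PathOptions"] = path_options
    return client.update_dataset(Name=name, **kwargs)["Name"]
```

This read-then-write sequence is not atomic; a concurrent change made between the two calls would be overwritten.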