AWS API Changes

2022/03/31 - AWS Glue DataBrew - 4 updated api methods

Changes: This AWS Glue DataBrew release adds support for ORC as an input format.

CreateDataset (updated)

Changes (request)

{'Format': {'ORC'}}

Creates a new DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.create_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string',
            'BucketOwner': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            }
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'QueryString': 'string'
        },
        'Metadata': {
            'SourceArn': 'string'
        }
    },
    PathOptions={
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {
                'string': 'string'
            }
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                }
            }
        }
    },
    Tags={
        'string': 'string'
    }
)

type Name

string

param Name

[REQUIRED]

The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.

type Format

string

param Format

The file format of a dataset that is created from an Amazon S3 file or folder.

type FormatOptions

dict

param FormatOptions

Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.

Json (dict) --

Options that define how JSON input is to be interpreted by DataBrew.
- MultiLine (boolean) --
  
  A value that specifies whether JSON input contains embedded new line characters.
Excel (dict) --

Options that define how Excel input is to be interpreted by DataBrew.
- SheetNames (list) --
  
  One or more named sheets in the Excel file that will be included in the dataset.
  - (string) --
- SheetIndexes (list) --
  
  One or more sheet numbers in the Excel file that will be included in the dataset.
  - (integer) --
- HeaderRow (boolean) --
  
  A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Csv (dict) --

Options that define how CSV input is to be interpreted by DataBrew.
- Delimiter (string) --
  
  A single character that specifies the delimiter being used in the CSV file.
- HeaderRow (boolean) --
  
  A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

type Input

dict

param Input

[REQUIRED]

Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.

S3InputDefinition (dict) --

The Amazon S3 location where the data is stored.
- Bucket (string) -- [REQUIRED]
  
  The Amazon S3 bucket name.
- Key (string) --
  
  The unique name of the object in the bucket.
- BucketOwner (string) --
  
  The Amazon Web Services account ID of the bucket owner.
DataCatalogInputDefinition (dict) --

The Glue Data Catalog parameters for the data.
- CatalogId (string) --
  
  The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.
- DatabaseName (string) -- [REQUIRED]
  
  The name of a database in the Data Catalog.
- TableName (string) -- [REQUIRED]
  
  The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
- TempDirectory (dict) --
  
  Represents an Amazon S3 location where DataBrew can store intermediate results.
  - Bucket (string) -- [REQUIRED]
    
    The Amazon S3 bucket name.
  - Key (string) --
    
    The unique name of the object in the bucket.
  - BucketOwner (string) --
    
    The Amazon Web Services account ID of the bucket owner.
DatabaseInputDefinition (dict) --

Connection information for dataset input files stored in a database.
- GlueConnectionName (string) -- [REQUIRED]
  
  The Glue Connection that stores the connection information for the target database.
- DatabaseTableName (string) --
  
  The table within the target database.
- TempDirectory (dict) --
  
  Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.
  - Bucket (string) -- [REQUIRED]
    
    The Amazon S3 bucket name.
  - Key (string) --
    
    The unique name of the object in the bucket.
  - BucketOwner (string) --
    
    The Amazon Web Services account ID of the bucket owner.
- QueryString (string) --
  
  Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.
Metadata (dict) --

Contains additional resource information needed for specific datasets.
- SourceArn (string) --
  
  The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.

type PathOptions

dict

param PathOptions

A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.

LastModifiedDateCondition (dict) --

If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.
- Expression (string) -- [REQUIRED]
  
  An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
- ValuesMap (dict) -- [REQUIRED]
  
  The map of substitution variable names to their values used in this filter expression.
  - (string) --
    - (string) --
FilesLimit (dict) --

If provided, this structure imposes a limit on the number of files that should be selected.
- MaxFiles (integer) -- [REQUIRED]
  
  The number of Amazon S3 files to select.
- OrderedBy (string) --
  
  The criterion used to sort Amazon S3 files before they are selected. The default is LAST_MODIFIED_DATE, which is currently the only allowed value.
- Order (string) --
  
  The order in which Amazon S3 files are sorted before they are selected. The default is DESCENDING, which selects the most recent files first. The other possible value is ASCENDING.
Parameters (dict) --

A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.
- (string) --
  - (dict) --
    
    Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.
    - Name (string) -- [REQUIRED]
      
      The name of the parameter that is used in the dataset's Amazon S3 path.
    - Type (string) -- [REQUIRED]
      
      The type of the dataset parameter. Can be one of 'String', 'Number', or 'Datetime'.
    - DatetimeOptions (dict) --
      
      Additional parameter options such as a format and a timezone. Required for datetime parameters.
      - Format (string) -- [REQUIRED]
        
        A required option that defines the datetime format used for a date parameter in the Amazon S3 path. It should use only supported datetime specifiers and separation characters; all literal a-z or A-Z characters should be escaped with single quotes. For example, "MM.dd.yyyy-'at'-HH:mm".
      - TimezoneOffset (string) --
        
        Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. It shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
      - LocaleCode (string) --
        
        Optional value for a non-US locale code, needed for correct interpretation of some date formats.
    - CreateColumn (boolean) --
      
      Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.
    - Filter (dict) --
      
      The optional filter expression structure to apply additional matching criteria to the parameter.
      - Expression (string) -- [REQUIRED]
        
        An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
      - ValuesMap (dict) -- [REQUIRED]
        
        The map of substitution variable names to their values used in this filter expression.
        
        - (string) --
          - (string) --
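
The PathOptions fields described above can be assembled as plain dictionaries before they are passed to create_dataset. The following is a minimal sketch, not DataBrew's own validation logic: the helper names build_filter and build_path_options, the parameter name "region", and the value "eu-" are hypothetical. It also assumes that ValuesMap keys carry the leading ':' and that a single starts_with condition is a valid expression on its own.

```python
import re
from typing import Any, Dict

# Substitution variables are written as ':name' inside an expression.
_VARIABLE = re.compile(r":[A-Za-z0-9_]+")


def build_filter(expression: str, values: Dict[str, str]) -> Dict[str, Any]:
    """Return an Expression/ValuesMap dict after checking its variables.

    Assumption: ValuesMap keys include the leading ':' symbol.
    """
    bad_keys = sorted(k for k in values if not k.startswith(":"))
    if bad_keys:
        raise ValueError(f"variables must start with ':': {bad_keys}")
    missing = sorted(set(_VARIABLE.findall(expression)) - set(values))
    if missing:
        raise ValueError(f"no value supplied for: {missing}")
    return {"Expression": expression, "ValuesMap": dict(values)}


def build_path_options(max_files: int) -> Dict[str, Any]:
    """Build a PathOptions dict: newest files first, one filtered parameter."""
    return {
        "FilesLimit": {
            "MaxFiles": max_files,
            "OrderedBy": "LAST_MODIFIED_DATE",  # the only allowed value
            "Order": "DESCENDING",              # most recent files first
        },
        "Parameters": {
            # Hypothetical parameter; the map key is assumed to match Name.
            "region": {
                "Name": "region",
                "Type": "String",
                "CreateColumn": True,
                "Filter": build_filter("starts_with :prefix1", {":prefix1": "eu-"}),
            }
        },
    }
```

The resulting dictionary would be passed as the PathOptions argument of client.create_dataset. Checking the variables locally only catches obvious mismatches early; the service remains the authority on which expressions are valid.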

type Tags

dict

param Tags

Metadata tags to apply to this dataset.

- (string) --
  - (string) --

rtype

dict

returns

Response Syntax

{
    'Name': 'string'
}

Response Structure

(dict) --
- Name (string) --
  
  The name of the dataset that you created.
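
As an illustration of the change in this release, the sketch below builds a CreateDataset request for an ORC file stored in Amazon S3. It is a minimal example under stated assumptions: the helper names, dataset name, bucket, and key are hypothetical, and FormatOptions is left out because the options documented above cover only CSV, Excel, and JSON input.

```python
from typing import Any, Dict, Optional

# Formats listed in the request syntax; 'ORC' is the value added by this release.
SUPPORTED_FORMATS = {"CSV", "JSON", "PARQUET", "EXCEL", "ORC"}


def build_create_dataset_request(
    name: str,
    bucket: str,
    key: Optional[str] = None,
    dataset_format: str = "ORC",
    tags: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for create_dataset with an S3 input."""
    if dataset_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {dataset_format!r}")

    s3_input: Dict[str, Any] = {"Bucket": bucket}  # Bucket is the only required key
    if key is not None:
        s3_input["Key"] = key

    request: Dict[str, Any] = {
        "Name": name,
        "Format": dataset_format,
        "Input": {"S3InputDefinition": s3_input},
    }
    if tags:
        request["Tags"] = dict(tags)
    return request


def create_orc_dataset(client: Any, name: str, bucket: str, key: str) -> str:
    """Call create_dataset on a DataBrew client and return the dataset name."""
    response = client.create_dataset(
        **build_create_dataset_request(name, bucket, key=key)
    )
    return response["Name"]


# Usage with hypothetical names (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("databrew")
#   create_orc_dataset(client, "orders-orc", "example-bucket", "landing/orders.orc")
```

Keeping the request builder separate from the client call makes it possible to inspect or unit-test the payload without contacting AWS.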

DescribeDataset (updated)

Changes (response)

{'Format': {'ORC'}}

Returns the definition of a specific DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.describe_dataset(
    Name='string'
)

type Name

string

param Name

[REQUIRED]

The name of the dataset to be described.

rtype

dict

returns

Response Syntax

{
    'CreatedBy': 'string',
    'CreateDate': datetime(2015, 1, 1),
    'Name': 'string',
    'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
    'FormatOptions': {
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    'Input': {
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string',
            'BucketOwner': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            }
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'QueryString': 'string'
        },
        'Metadata': {
            'SourceArn': 'string'
        }
    },
    'LastModifiedDate': datetime(2015, 1, 1),
    'LastModifiedBy': 'string',
    'Source': 'S3'|'DATA-CATALOG'|'DATABASE',
    'PathOptions': {
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {
                'string': 'string'
            }
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                }
            }
        }
    },
    'Tags': {
        'string': 'string'
    },
    'ResourceArn': 'string'
}

Response Structure

(dict) --
- CreatedBy (string) --
  
  The identifier (user name) of the user who created the dataset.
- CreateDate (datetime) --
  
  The date and time that the dataset was created.
- Name (string) --
  
  The name of the dataset.
- Format (string) --
  
  The file format of a dataset that is created from an Amazon S3 file or folder.
- FormatOptions (dict) --
  
  Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.
  - Json (dict) --
    
    Options that define how JSON input is to be interpreted by DataBrew.
    - MultiLine (boolean) --
      
      A value that specifies whether JSON input contains embedded new line characters.
  - Excel (dict) --
    
    Options that define how Excel input is to be interpreted by DataBrew.
    - SheetNames (list) --
      
      One or more named sheets in the Excel file that will be included in the dataset.
      - (string) --
    - SheetIndexes (list) --
      
      One or more sheet numbers in the Excel file that will be included in the dataset.
      - (integer) --
    - HeaderRow (boolean) --
      
      A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
  - Csv (dict) --
    
    Options that define how CSV input is to be interpreted by DataBrew.
    - Delimiter (string) --
      
      A single character that specifies the delimiter being used in the CSV file.
    - HeaderRow (boolean) --
      
      A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
- Input (dict) --
  
  Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.
  - S3InputDefinition (dict) --
    
    The Amazon S3 location where the data is stored.
    - Bucket (string) --
      
      The Amazon S3 bucket name.
    - Key (string) --
      
      The unique name of the object in the bucket.
    - BucketOwner (string) --
      
      The Amazon Web Services account ID of the bucket owner.
  - DataCatalogInputDefinition (dict) --
    
    The Glue Data Catalog parameters for the data.
    - CatalogId (string) --
      
      The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.
    - DatabaseName (string) --
      
      The name of a database in the Data Catalog.
    - TableName (string) --
      
      The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
    - TempDirectory (dict) --
      
      Represents an Amazon S3 location where DataBrew can store intermediate results.
      - Bucket (string) --
        
        The Amazon S3 bucket name.
      - Key (string) --
        
        The unique name of the object in the bucket.
      - BucketOwner (string) --
        
        The Amazon Web Services account ID of the bucket owner.
  - DatabaseInputDefinition (dict) --
    
    Connection information for dataset input files stored in a database.
    - GlueConnectionName (string) --
      
      The Glue Connection that stores the connection information for the target database.
    - DatabaseTableName (string) --
      
      The table within the target database.
    - TempDirectory (dict) --
      
      Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.
      - Bucket (string) --
        
        The Amazon S3 bucket name.
      - Key (string) --
        
        The unique name of the object in the bucket.
      - BucketOwner (string) --
        
        The Amazon Web Services account ID of the bucket owner.
    - QueryString (string) --
      
      Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.
  - Metadata (dict) --
    
    Contains additional resource information needed for specific datasets.
    - SourceArn (string) --
      
      The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.
- LastModifiedDate (datetime) --
  
  The date and time that the dataset was last modified.
- LastModifiedBy (string) --
  
  The identifier (user name) of the user who last modified the dataset.
- Source (string) --
  
  The location of the data for this dataset, either Amazon S3 or the Glue Data Catalog.
- PathOptions (dict) --
  
  A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.
  - LastModifiedDateCondition (dict) --
    
    If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.
    - Expression (string) --
      
      An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
    - ValuesMap (dict) --
      
      The map of substitution variable names to their values used in this filter expression.
      - (string) --
        - (string) --
  - FilesLimit (dict) --
    
    If provided, this structure imposes a limit on the number of files that should be selected.
    - MaxFiles (integer) --
      
      The number of Amazon S3 files to select.
    - OrderedBy (string) --
      
      The criterion used to sort Amazon S3 files before they are selected. The default is LAST_MODIFIED_DATE, which is currently the only allowed value.
    - Order (string) --
      
      The order in which Amazon S3 files are sorted before they are selected. The default is DESCENDING, which selects the most recent files first. The other possible value is ASCENDING.
  - Parameters (dict) --
    
    A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.
    - (string) --
      - (dict) --
        
        Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.
        
        - Name (string) --
          
          The name of the parameter that is used in the dataset's Amazon S3 path.
        - Type (string) --
          
          The type of the dataset parameter. Can be one of 'String', 'Number', or 'Datetime'.
        - DatetimeOptions (dict) --
          
          Additional parameter options such as a format and a timezone. Required for datetime parameters.
          - Format (string) --
            
            A required option that defines the datetime format used for a date parameter in the Amazon S3 path. It should use only supported datetime specifiers and separation characters; all literal a-z or A-Z characters should be escaped with single quotes. For example, "MM.dd.yyyy-'at'-HH:mm".
          - TimezoneOffset (string) --
            
            Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. It shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
          - LocaleCode (string) --
            
            Optional value for a non-US locale code, needed for correct interpretation of some date formats.
        - CreateColumn (boolean) --
          
          Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.
        - Filter (dict) --
          
          The optional filter expression structure to apply additional matching criteria to the parameter.
          - Expression (string) --
            
            An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
          - ValuesMap (dict) --
            
            The map of substitution variable names to their values used in this filter expression.
            - (string) --
              - (string) --
- Tags (dict) --
  
  Metadata tags associated with this dataset.
  - (string) --
    - (string) --
- ResourceArn (string) --
  
  The Amazon Resource Name (ARN) of the dataset.
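
Because Format in the response can now be 'ORC', callers that branch on the format may want to read it defensively. A minimal sketch follows; the helper name dataset_format_and_source is hypothetical, and it assumes Format may be absent from the response for some datasets, which is why it uses dict.get.

```python
from typing import Any, Optional, Tuple


def dataset_format_and_source(client: Any, name: str) -> Tuple[Optional[str], Optional[str]]:
    """Return the (Format, Source) pair reported by describe_dataset."""
    response = client.describe_dataset(Name=name)
    # Either key is read with .get in case the service omits it.
    return response.get("Format"), response.get("Source")


# Usage with a hypothetical dataset name (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("databrew")
#   fmt, source = dataset_format_and_source(client, "orders-orc")
#   if fmt == "ORC":
#       print("ORC dataset stored in", source)
```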

ListDatasets (updated)

Changes (response)

{'Datasets': {'Format': {'ORC'}}}

Lists all of the DataBrew datasets.

See also: AWS API Documentation

Request Syntax

client.list_datasets(
    MaxResults=123,
    NextToken='string'
)

type MaxResults

integer

param MaxResults

The maximum number of results to return in this request.

type NextToken

string

param NextToken

The token returned by a previous call to retrieve the next set of results.
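
MaxResults and NextToken together imply a paging loop: request a page, then pass the returned NextToken back until none is returned. The sketch below shows one way to write that loop and to pick out datasets whose Format is 'ORC'. The helper names and the page size of 50 are assumptions made for illustration.

```python
from typing import Any, Dict, Iterator, List


def iter_datasets(client: Any, page_size: int = 50) -> Iterator[Dict[str, Any]]:
    """Yield every dataset by following NextToken across pages."""
    kwargs: Dict[str, Any] = {"MaxResults": page_size}
    while True:
        page = client.list_datasets(**kwargs)
        yield from page.get("Datasets", [])
        token = page.get("NextToken")
        if not token:
            break  # no further pages
        kwargs["NextToken"] = token


def orc_dataset_names(client: Any) -> List[str]:
    """Return the names of all datasets whose Format is 'ORC'."""
    return [d["Name"] for d in iter_datasets(client) if d.get("Format") == "ORC"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(orc_dataset_names(boto3.client("databrew")))
```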

rtype

dict

returns

Response Syntax

{
    'Datasets': [
        {
            'AccountId': 'string',
            'CreatedBy': 'string',
            'CreateDate': datetime(2015, 1, 1),
            'Name': 'string',
            'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
            'FormatOptions': {
                'Json': {
                    'MultiLine': True|False
                },
                'Excel': {
                    'SheetNames': [
                        'string',
                    ],
                    'SheetIndexes': [
                        123,
                    ],
                    'HeaderRow': True|False
                },
                'Csv': {
                    'Delimiter': 'string',
                    'HeaderRow': True|False
                }
            },
            'Input': {
                'S3InputDefinition': {
                    'Bucket': 'string',
                    'Key': 'string',
                    'BucketOwner': 'string'
                },
                'DataCatalogInputDefinition': {
                    'CatalogId': 'string',
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'TempDirectory': {
                        'Bucket': 'string',
                        'Key': 'string',
                        'BucketOwner': 'string'
                    }
                },
                'DatabaseInputDefinition': {
                    'GlueConnectionName': 'string',
                    'DatabaseTableName': 'string',
                    'TempDirectory': {
                        'Bucket': 'string',
                        'Key': 'string',
                        'BucketOwner': 'string'
                    },
                    'QueryString': 'string'
                },
                'Metadata': {
                    'SourceArn': 'string'
                }
            },
            'LastModifiedDate': datetime(2015, 1, 1),
            'LastModifiedBy': 'string',
            'Source': 'S3'|'DATA-CATALOG'|'DATABASE',
            'PathOptions': {
                'LastModifiedDateCondition': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                },
                'FilesLimit': {
                    'MaxFiles': 123,
                    'OrderedBy': 'LAST_MODIFIED_DATE',
                    'Order': 'DESCENDING'|'ASCENDING'
                },
                'Parameters': {
                    'string': {
                        'Name': 'string',
                        'Type': 'Datetime'|'Number'|'String',
                        'DatetimeOptions': {
                            'Format': 'string',
                            'TimezoneOffset': 'string',
                            'LocaleCode': 'string'
                        },
                        'CreateColumn': True|False,
                        'Filter': {
                            'Expression': 'string',
                            'ValuesMap': {
                                'string': 'string'
                            }
                        }
                    }
                }
            },
            'Tags': {
                'string': 'string'
            },
            'ResourceArn': 'string'
        },
    ],
    'NextToken': 'string'
}

Response Structure

(dict) --
- Datasets (list) --
  
  A list of datasets that are defined.
  - (dict) --
    
    Represents a dataset that can be processed by DataBrew.
    - AccountId (string) --
      
      The ID of the Amazon Web Services account that owns the dataset.
    - CreatedBy (string) --
      
      The Amazon Resource Name (ARN) of the user who created the dataset.
    - CreateDate (datetime) --
      
      The date and time that the dataset was created.
    - Name (string) --
      
      The unique name of the dataset.
    - Format (string) --
      
      The file format of a dataset that is created from an Amazon S3 file or folder.
    - FormatOptions (dict) --
      
      A set of options that define how DataBrew interprets the data in the dataset.
      - Json (dict) --
        
        Options that define how JSON input is to be interpreted by DataBrew.
        
        - MultiLine (boolean) --
          
          A value that specifies whether JSON input contains embedded new line characters.
      - Excel (dict) --
        
        Options that define how Excel input is to be interpreted by DataBrew.
        
        - SheetNames (list) --
          
          One or more named sheets in the Excel file that will be included in the dataset.
          - (string) --
        - SheetIndexes (list) --
          
          One or more sheet numbers in the Excel file that will be included in the dataset.
          - (integer) --
        - HeaderRow (boolean) --
          
          A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
      - Csv (dict) --
        
        Options that define how CSV input is to be interpreted by DataBrew.
        
        - Delimiter (string) --
          
          A single character that specifies the delimiter being used in the CSV file.
        - HeaderRow (boolean) --
          
          A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
    - Input (dict) --
      
      Information on how DataBrew can find the dataset, in either the Glue Data Catalog or Amazon S3.
      - S3InputDefinition (dict) --
        
        The Amazon S3 location where the data is stored.
        
        - Bucket (string) --
          
          The Amazon S3 bucket name.
        - Key (string) --
          
          The unique name of the object in the bucket.
        - BucketOwner (string) --
          
          The Amazon Web Services account ID of the bucket owner.
      - DataCatalogInputDefinition (dict) --
        
        The Glue Data Catalog parameters for the data.
        
        - CatalogId (string) --
          
          The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.
        - DatabaseName (string) --
          
          The name of a database in the Data Catalog.
        - TableName (string) --
          
          The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
        - TempDirectory (dict) --
          
          Represents an Amazon S3 location where DataBrew can store intermediate results.
          - Bucket (string) --
            
            The Amazon S3 bucket name.
          - Key (string) --
            
            The unique name of the object in the bucket.
          - BucketOwner (string) --
            
            The Amazon Web Services account ID of the bucket owner.
      - DatabaseInputDefinition (dict) --
        
        Connection information for dataset input files stored in a database.
        
        - GlueConnectionName (string) --
          
          The Glue Connection that stores the connection information for the target database.
        - DatabaseTableName (string) --
          
          The table within the target database.
        - TempDirectory (dict) --
          
          Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.
          - Bucket (string) --
            
            The Amazon S3 bucket name.
          - Key (string) --
            
            The unique name of the object in the bucket.
          - BucketOwner (string) --
            
            The Amazon Web Services account ID of the bucket owner.
        - QueryString (string) --
          
          Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.
      - Metadata (dict) --
        
        Contains additional resource information needed for specific datasets.
        
        - SourceArn (string) --
          
          The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.
    - LastModifiedDate (datetime) --
      
      The last modification date and time of the dataset.
    - LastModifiedBy (string) --
      
      The Amazon Resource Name (ARN) of the user who last modified the dataset.
    - Source (string) --
      
      The location of the data for the dataset, either Amazon S3 or the Glue Data Catalog.
    - PathOptions (dict) --
      
      A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.
      - LastModifiedDateCondition (dict) --
        
        If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.
        
        - Expression (string) --
          
          An expression that includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables should start with the ':' symbol.
        - ValuesMap (dict) --
          
          The map of substitution variable names to their values used in this filter expression.
          - (string) --
            - (string) --
      - FilesLimit (dict) --
        
        If provided, this structure imposes a limit on a number of files that should be selected.
        
        MaxFiles (integer) --
        
        The number of Amazon S3 files to select.
        
        OrderedBy (string) --
        
        The criterion used to sort Amazon S3 files before they are selected. Defaults to LAST_MODIFIED_DATE, which is currently the only allowed value.
        
        Order (string) --
        
        The sort order applied to Amazon S3 files before they are selected. Defaults to DESCENDING, i.e. the most recent files are selected first. The other possible value is ASCENDING.
      - Parameters (dict) --
        
        A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.
        
        (string) --
        
        (dict) --
        
        Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.
        
        Name (string) --
        
        The name of the parameter that is used in the dataset's Amazon S3 path.
        
        Type (string) --
        
        The type of the dataset parameter. Can be one of 'String', 'Number', or 'Datetime'.
        
        DatetimeOptions (dict) --
        
        Additional parameter options such as a format and a timezone. Required for datetime parameters.
        
        Format (string) --
        
        Required option that defines the datetime format used for a date parameter in the Amazon S3 path. Use only supported datetime specifiers and separator characters; all literal a-z or A-Z characters must be escaped with single quotes, e.g. "MM.dd.yyyy-'at'-HH:mm".
        
        TimezoneOffset (string) --
        
        Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
        
        LocaleCode (string) --
        
        Optional value for a non-US locale code, needed for correct interpretation of some date formats.
        
        CreateColumn (boolean) --
        
        Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.
        
        Filter (dict) --
        
        The optional filter expression structure to apply additional matching criteria to the parameter.
        
        Expression (string) --
        
        An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables must start with the ':' symbol.
        
        ValuesMap (dict) --
        
        The map of substitution variable names to their values used in this filter expression.
        
        (string) --
        
        (string) --
    - Tags (dict) --
      
      Metadata tags that have been applied to the dataset.
      - (string) --
        
        (string) --
    - ResourceArn (string) --
      
      The unique Amazon Resource Name (ARN) for the dataset.
- NextToken (string) --
  
  A token that you can use in a subsequent call to retrieve the next set of results.
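
The NextToken handling described above can be sketched as a small generator. This is a minimal illustration rather than a reference implementation: `iter_datasets` and `FakePagingClient` are hypothetical names, the fake client and its token format exist only so the example runs without AWS access, and the `Datasets` response key and `MaxResults` argument are assumed to follow the ListDatasets shape.

```python
from typing import Any, Dict, Iterator, List


def iter_datasets(client: Any, page_size: int = 50) -> Iterator[Dict[str, Any]]:
    """Yield dataset entries from every page, following NextToken until it is absent."""
    kwargs: Dict[str, Any] = {"MaxResults": page_size}
    while True:
        page = client.list_datasets(**kwargs)
        yield from page.get("Datasets", [])
        token = page.get("NextToken")
        if not token:
            return
        # Pass the token back unchanged to request the next page.
        kwargs["NextToken"] = token


class FakePagingClient:
    """Hypothetical in-memory stand-in so the sketch runs without AWS access."""

    def __init__(self, pages: List[List[Dict[str, Any]]]) -> None:
        self._pages = pages

    def list_datasets(self, MaxResults: int = 50, NextToken: str = "") -> Dict[str, Any]:
        index = int(NextToken) if NextToken else 0
        page: Dict[str, Any] = {"Datasets": self._pages[index]}
        if index + 1 < len(self._pages):
            page["NextToken"] = str(index + 1)  # made-up token format
        return page


fake = FakePagingClient([[{"Name": "orders"}], [{"Name": "clicks"}]])
dataset_names = [entry["Name"] for entry in iter_datasets(fake)]
print(dataset_names)
```

With a real client, something like `iter_datasets(boto3.client("databrew"))` should behave the same way, since the loop relies only on the `NextToken` field being present or absent.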

UpdateDataset (updated)

Changes (request)

{'Format': {'ORC'}}

Modifies the definition of an existing DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.update_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string',
            'BucketOwner': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            }
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'QueryString': 'string'
        },
        'Metadata': {
            'SourceArn': 'string'
        }
    },
    PathOptions={
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {
                'string': 'string'
            }
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                }
            }
        }
    }
)

type Name

string

param Name

[REQUIRED]

The name of the dataset to be updated.

type Format

string

param Format

The file format of a dataset that is created from an Amazon S3 file or folder.
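
As a rough illustration of the new ORC value, the sketch below builds the keyword arguments for `update_dataset` and rejects format strings that are not in the request syntax. The helper name `build_update_request`, the bucket name, and the object key are hypothetical placeholders.

```python
from typing import Any, Dict

# Values listed for Format in the request syntax, including the newly added ORC.
SUPPORTED_FORMATS = {"CSV", "JSON", "PARQUET", "EXCEL", "ORC"}


def build_update_request(name: str, data_format: str, bucket: str, key: str) -> Dict[str, Any]:
    """Return keyword arguments for update_dataset with an Amazon S3 input."""
    if data_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {data_format!r}")
    return {
        "Name": name,
        "Format": data_format,
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


# Hypothetical dataset name, bucket, and key, for illustration only.
request = build_update_request("sales-orc", "ORC", "example-bucket", "sales/part-0000.orc")
print(request)
```

The resulting dictionary could then be passed as `client.update_dataset(**request)`.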

type FormatOptions

dict

param FormatOptions

Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.

Json (dict) --

Options that define how JSON input is to be interpreted by DataBrew.
- MultiLine (boolean) --
  
  A value that specifies whether JSON input contains embedded new line characters.
Excel (dict) --

Options that define how Excel input is to be interpreted by DataBrew.
- SheetNames (list) --
  
  One or more named sheets in the Excel file that will be included in the dataset.
  - (string) --
- SheetIndexes (list) --
  
  One or more sheet numbers in the Excel file that will be included in the dataset.
  - (integer) --
- HeaderRow (boolean) --
  
  A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
Csv (dict) --

Options that define how CSV input is to be interpreted by DataBrew.
- Delimiter (string) --
  
  A single character that specifies the delimiter being used in the CSV file.
- HeaderRow (boolean) --
  
  A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
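
The FormatOptions structure above only has entries for JSON, Excel, and CSV. A possible helper for picking the matching entry is sketched below; `format_options_for` is a hypothetical name, and returning an empty dict for formats without an entry (such as PARQUET or ORC) is an assumption of this sketch, not documented behavior.

```python
from typing import Any, Dict, Sequence


def format_options_for(
    data_format: str,
    *,
    delimiter: str = ",",
    header_row: bool = True,
    multi_line: bool = False,
    sheet_names: Sequence[str] = (),
) -> Dict[str, Any]:
    """Return a FormatOptions dict for the given format, or {} when none applies."""
    if data_format == "CSV":
        # The Delimiter field is documented as a single character.
        if len(delimiter) != 1:
            raise ValueError("Delimiter must be a single character")
        return {"Csv": {"Delimiter": delimiter, "HeaderRow": header_row}}
    if data_format == "JSON":
        return {"Json": {"MultiLine": multi_line}}
    if data_format == "EXCEL":
        return {"Excel": {"SheetNames": list(sheet_names), "HeaderRow": header_row}}
    # Assumption: formats with no FormatOptions entry need no options.
    return {}


csv_options = format_options_for("CSV", delimiter="|", header_row=False)
orc_options = format_options_for("ORC")
print(csv_options, orc_options)
```

When the result is empty, the `FormatOptions` argument can simply be left out of the call.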

type Input

dict

param Input

[REQUIRED]

Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.

S3InputDefinition (dict) --

The Amazon S3 location where the data is stored.
- Bucket (string) -- [REQUIRED]
  
  The Amazon S3 bucket name.
- Key (string) --
  
  The unique name of the object in the bucket.
- BucketOwner (string) --
  
  The Amazon Web Services account ID of the bucket owner.
DataCatalogInputDefinition (dict) --

The Glue Data Catalog parameters for the data.
- CatalogId (string) --
  
  The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.
- DatabaseName (string) -- [REQUIRED]
  
  The name of a database in the Data Catalog.
- TableName (string) -- [REQUIRED]
  
  The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
- TempDirectory (dict) --
  
  Represents an Amazon S3 location where DataBrew can store intermediate results.
  - Bucket (string) -- [REQUIRED]
    
    The Amazon S3 bucket name.
  - Key (string) --
    
    The unique name of the object in the bucket.
  - BucketOwner (string) --
    
    The Amazon Web Services account ID of the bucket owner.
DatabaseInputDefinition (dict) --

Connection information for dataset input files stored in a database.
- GlueConnectionName (string) -- [REQUIRED]
  
  The Glue Connection that stores the connection information for the target database.
- DatabaseTableName (string) --
  
  The table within the target database.
- TempDirectory (dict) --
  
  Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.
  - Bucket (string) -- [REQUIRED]
    
    The Amazon S3 bucket name.
  - Key (string) --
    
    The unique name of the object in the bucket.
  - BucketOwner (string) --
    
    The Amazon Web Services account ID of the bucket owner.
- QueryString (string) --
  
  Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.
Metadata (dict) --

Contains additional resource information needed for specific datasets.
- SourceArn (string) --
  
  The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.
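
The three input definitions above differ mainly in which fields are marked [REQUIRED]. The builders below are a minimal sketch of that: required fields are positional arguments and optional ones are dropped when not supplied. All function names and the sample values are hypothetical.

```python
from typing import Any, Dict, Optional


def _drop_none(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Remove optional fields that were not supplied."""
    return {key: value for key, value in fields.items() if value is not None}


def s3_location(bucket: str, key: Optional[str] = None,
                bucket_owner: Optional[str] = None) -> Dict[str, Any]:
    """Bucket is required; Key and BucketOwner are optional."""
    return _drop_none({"Bucket": bucket, "Key": key, "BucketOwner": bucket_owner})


def s3_input(bucket: str, key: Optional[str] = None) -> Dict[str, Any]:
    return {"S3InputDefinition": s3_location(bucket, key)}


def data_catalog_input(database_name: str, table_name: str,
                       catalog_id: Optional[str] = None,
                       temp_directory: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """DatabaseName and TableName are required."""
    return {"DataCatalogInputDefinition": _drop_none({
        "CatalogId": catalog_id,
        "DatabaseName": database_name,
        "TableName": table_name,
        "TempDirectory": temp_directory,
    })}


def database_input(glue_connection_name: str,
                   database_table_name: Optional[str] = None,
                   query_string: Optional[str] = None,
                   temp_directory: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """GlueConnectionName is required; a table name or custom SQL may be added."""
    return {"DatabaseInputDefinition": _drop_none({
        "GlueConnectionName": glue_connection_name,
        "DatabaseTableName": database_table_name,
        "TempDirectory": temp_directory,
        "QueryString": query_string,
    })}


catalog = data_catalog_input("analytics", "orders",
                             temp_directory=s3_location("example-temp-bucket"))
sql = database_input("example-connection", query_string="SELECT * FROM orders")
print(catalog)
print(sql)
```

Each returned dict is intended to be used as the `Input` argument of `update_dataset`.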

type PathOptions

dict

param PathOptions

A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.

LastModifiedDateCondition (dict) --

If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.
- Expression (string) -- [REQUIRED]
  
  An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables must start with the ':' symbol.
- ValuesMap (dict) -- [REQUIRED]
  
  The map of substitution variable names to their values used in this filter expression.
  - (string) --
    - (string) --
FilesLimit (dict) --

If provided, this structure limits the number of files that are selected.
- MaxFiles (integer) -- [REQUIRED]
  
  The number of Amazon S3 files to select.
- OrderedBy (string) --
  
  The criterion used to sort Amazon S3 files before they are selected. Defaults to LAST_MODIFIED_DATE, which is currently the only allowed value.
- Order (string) --
  
  The sort order applied to Amazon S3 files before they are selected. Defaults to DESCENDING, i.e. the most recent files are selected first. The other possible value is ASCENDING.
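
To make the FilesLimit semantics concrete, the sketch below builds the structure and then mimics the selection locally on a list of (key, last-modified) pairs. `files_limit` and `preview_selection` are hypothetical helpers. The local preview only illustrates the documented ordering and is not how DataBrew itself is implemented; the lower bound of 1 on MaxFiles is also an assumption.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple


def files_limit(max_files: int, order: str = "DESCENDING") -> Dict[str, Any]:
    """Build a FilesLimit structure; MaxFiles is the only required field."""
    if max_files < 1:  # assumed lower bound
        raise ValueError("MaxFiles must be a positive integer")
    if order not in ("DESCENDING", "ASCENDING"):
        raise ValueError(f"Unsupported order: {order!r}")
    return {"MaxFiles": max_files, "OrderedBy": "LAST_MODIFIED_DATE", "Order": order}


def preview_selection(objects: List[Tuple[str, datetime]], limit: Dict[str, Any]) -> List[str]:
    """Local illustration: sort by last-modified date, then keep MaxFiles keys."""
    newest_first = limit["Order"] == "DESCENDING"
    ranked = sorted(objects, key=lambda item: item[1], reverse=newest_first)
    return [key for key, _ in ranked[: limit["MaxFiles"]]]


# Hypothetical object listing, for illustration only.
objects = [
    ("logs/a.csv", datetime(2022, 3, 1)),
    ("logs/b.csv", datetime(2022, 3, 3)),
    ("logs/c.csv", datetime(2022, 3, 2)),
]
latest_two = preview_selection(objects, files_limit(2))
oldest_one = preview_selection(objects, files_limit(1, "ASCENDING"))
print(latest_two, oldest_one)
```

In a real request, only the dict returned by `files_limit` would be sent, under `PathOptions['FilesLimit']`.
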
Parameters (dict) --

A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.
- (string) --
  - (dict) --
    
    Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.
    - Name (string) -- [REQUIRED]
      
      The name of the parameter that is used in the dataset's Amazon S3 path.
    - Type (string) -- [REQUIRED]
      
      The type of the dataset parameter. Can be one of 'String', 'Number', or 'Datetime'.
    - DatetimeOptions (dict) --
      
      Additional parameter options such as a format and a timezone. Required for datetime parameters.
      - Format (string) -- [REQUIRED]
        
        Required option that defines the datetime format used for a date parameter in the Amazon S3 path. Use only supported datetime specifiers and separator characters; all literal a-z or A-Z characters must be escaped with single quotes, e.g. "MM.dd.yyyy-'at'-HH:mm".
      - TimezoneOffset (string) --
        
        Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn't be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
      - LocaleCode (string) --
        
        Optional value for a non-US locale code, needed for correct interpretation of some date formats.
    - CreateColumn (boolean) --
      
      Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.
    - Filter (dict) --
      
      The optional filter expression structure to apply additional matching criteria to the parameter.
      - Expression (string) -- [REQUIRED]
        
        An expression that consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables must start with the ':' symbol.
      - ValuesMap (dict) -- [REQUIRED]
        
        The map of substitution variable names to their values used in this filter expression.
        - (string) --
          - (string) --
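
A Parameters map could be assembled along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the map key is assumed to equal each parameter's Name, ValuesMap keys are assumed to carry the leading ':' just as the variables do in the expression, and the parameter names and values are made up. The datetime format string is the one quoted in the description above.

```python
from typing import Any, Dict, List


def datetime_parameter(name: str, fmt: str, create_column: bool = False) -> Dict[str, Any]:
    """Datetime parameters need DatetimeOptions with a Format."""
    return {
        "Name": name,
        "Type": "Datetime",
        "DatetimeOptions": {"Format": fmt},
        "CreateColumn": create_column,
    }


def filtered_string_parameter(name: str, expression: str,
                              values_map: Dict[str, str]) -> Dict[str, Any]:
    """String parameter with a Filter; every variable must start with ':'."""
    for variable in values_map:
        if not variable.startswith(":"):
            raise ValueError(f"Substitution variable {variable!r} must start with ':'")
        if variable not in expression:
            raise ValueError(f"Variable {variable!r} is not used in the expression")
    return {
        "Name": name,
        "Type": "String",
        "Filter": {"Expression": expression, "ValuesMap": dict(values_map)},
    }


def path_parameters(definitions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Key each definition by its Name (assumed convention)."""
    return {"Parameters": {definition["Name"]: definition for definition in definitions}}


# Hypothetical parameter names and values, for illustration only.
day = datetime_parameter("day", "MM.dd.yyyy-'at'-HH:mm", create_column=True)
region = filtered_string_parameter(
    "region",
    "starts_with :prefix1 or starts_with :prefix2",
    {":prefix1": "us", ":prefix2": "eu"},
)
path_options = path_parameters([day, region])
print(path_options)
```

The result is meant to be passed as the `PathOptions` argument, optionally merged with the `LastModifiedDateCondition` and `FilesLimit` entries.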

rtype

dict

returns

Response Syntax

{
    'Name': 'string'
}

Response Structure

(dict) --
- Name (string) --
  
  The name of the dataset that you updated.
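
Because Input is marked [REQUIRED], changing a single field such as Format still means resending the input definition. One possible pattern is sketched below, assuming `describe_dataset` returns the current definition under the same keys that `update_dataset` accepts. `update_dataset_with` and `FakeDatasetClient` are hypothetical names, and the fake client exists only so the example runs without AWS access.

```python
from typing import Any, Dict

# Request fields of update_dataset other than Name.
UPDATABLE_KEYS = ("Format", "FormatOptions", "Input", "PathOptions")


def update_dataset_with(client: Any, name: str, **changes: Any) -> str:
    """Read the current definition, apply the changes, and send the update."""
    current = client.describe_dataset(Name=name)
    # Copy only fields that update_dataset accepts; skip ARNs, dates, tags, etc.
    request = {key: current[key] for key in UPDATABLE_KEYS if key in current}
    request.update(changes)
    response = client.update_dataset(Name=name, **request)
    return response["Name"]


class FakeDatasetClient:
    """Hypothetical in-memory stand-in for the DataBrew client."""

    def __init__(self, datasets: Dict[str, Dict[str, Any]]) -> None:
        self.datasets = datasets

    def describe_dataset(self, Name: str) -> Dict[str, Any]:
        return dict(self.datasets[Name], Name=Name, ResourceArn="arn:hypothetical")

    def update_dataset(self, Name: str, **definition: Any) -> Dict[str, Any]:
        self.datasets[Name] = definition
        return {"Name": Name}


store = FakeDatasetClient({
    "sales": {
        "Format": "PARQUET",
        "Input": {"S3InputDefinition": {"Bucket": "example-bucket", "Key": "sales/"}},
    }
})
updated_name = update_dataset_with(store, "sales", Format="ORC")
print(updated_name, store.datasets["sales"]["Format"])
```

The returned value is the `Name` field from the response structure above, which is the only field the response carries.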