AWS API Changes

2021/02/25 - AWS Glue DataBrew - 4 updated API methods

Changes: This SDK release adds two new dataset features: 1) support for specifying the file format of a dataset, and 2) support for specifying whether the first row of a CSV or Excel file contains a header.

CreateDataset (updated)

Changes (request)

{'Format': 'CSV | JSON | PARQUET | EXCEL',
 'FormatOptions': {'Csv': {'HeaderRow': 'boolean'},
                   'Excel': {'HeaderRow': 'boolean'}}}

Creates a new DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.create_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    },
    Tags={
        'string': 'string'
    }
)

type Name

string

param Name

[REQUIRED]

The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.

type Format

string

param Format

Specifies the file format of a dataset created from an S3 file or folder.

type FormatOptions

dict

param FormatOptions

Options that define the structure of either CSV, Excel, or JSON input.

Json (dict) --

Options that define how JSON input is to be interpreted by DataBrew.
- MultiLine (boolean) --
  
  A value that specifies whether JSON input contains embedded new line characters.
Excel (dict) --

Options that define how Excel input is to be interpreted by DataBrew.
- SheetNames (list) --
  
  Specifies one or more named sheets in the Excel file, which will be included in the dataset.
  - (string) --
- SheetIndexes (list) --
  
  Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
  - (integer) --
- HeaderRow (boolean) --
  
  A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --

Options that define how CSV input is to be interpreted by DataBrew.
- Delimiter (string) --
  
  A single character that specifies the delimiter being used in the CSV file.
- HeaderRow (boolean) --
  
  A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
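
As a minimal sketch of how the FormatOptions sections described above line up with the Format values, the helper below builds the dict for one format at a time. The function name build_format_options and its keyword arguments are hypothetical, not part of the SDK, and treating PARQUET as needing no options is an assumption based only on this page listing no Parquet section.

```python
def build_format_options(file_format, header_row=None, delimiter=None, multi_line=None):
    """Return a FormatOptions dict containing only the keys that were supplied.

    Hypothetical helper: the section and key names come from the request
    syntax above; everything else is illustrative.
    """
    # Map each Format value to the FormatOptions section that describes it.
    section_for_format = {"CSV": "Csv", "EXCEL": "Excel", "JSON": "Json"}
    if file_format == "PARQUET":
        # Assumption: no Parquet section is documented, so send no options.
        return {}
    if file_format not in section_for_format:
        raise ValueError(f"unsupported format: {file_format!r}")

    section = {}
    if file_format in ("CSV", "EXCEL") and header_row is not None:
        section["HeaderRow"] = header_row
    if file_format == "CSV" and delimiter is not None:
        section["Delimiter"] = delimiter
    if file_format == "JSON" and multi_line is not None:
        section["MultiLine"] = multi_line

    # Omit the section entirely when nothing was set.
    return {section_for_format[file_format]: section} if section else {}


csv_options = build_format_options("CSV", header_row=False, delimiter=";")
excel_options = build_format_options("EXCEL", header_row=True)
```

The resulting dict can be passed as the FormatOptions argument alongside a matching Format value.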

type Input

dict

param Input

[REQUIRED]

Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.

S3InputDefinition (dict) --

The Amazon S3 location where the data is stored.
- Bucket (string) -- [REQUIRED]
  
  The S3 bucket name.
- Key (string) --
  
  The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --

The AWS Glue Data Catalog parameters for the data.
- CatalogId (string) --
  
  The unique identifier of the AWS account that holds the Data Catalog that stores the data.
- DatabaseName (string) -- [REQUIRED]
  
  The name of a database in the Data Catalog.
- TableName (string) -- [REQUIRED]
  
  The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
- TempDirectory (dict) --
  
  An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
  - Bucket (string) -- [REQUIRED]
    
    The S3 bucket name.
  - Key (string) --
    
    The unique name of the object in the bucket.

type Tags

dict

param Tags

Metadata tags to apply to this dataset.

(string) --
- (string) --

rtype

dict

returns

Response Syntax

{
    'Name': 'string'
}

Response Structure

(dict) --
- Name (string) --
  
  The name of the dataset that you created.
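
The request syntax above can be exercised with a small sketch such as the following, which assembles the keyword arguments for a CSV dataset stored in Amazon S3 using the new Format and HeaderRow fields. The helper name create_dataset_request and the dataset, bucket, and key values are placeholders chosen for illustration.

```python
def create_dataset_request(name, bucket, key, has_header):
    """Build keyword arguments for create_dataset (hypothetical helper)."""
    return {
        "Name": name,
        "Format": "CSV",
        "FormatOptions": {
            "Csv": {
                "Delimiter": ",",
                # False asks DataBrew to auto-generate column names.
                "HeaderRow": has_header,
            }
        },
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


# Placeholder names; substitute real resources before sending the request.
request = create_dataset_request("sales-2021", "example-bucket", "raw/sales.csv", True)

# With AWS credentials configured, the call itself would look like:
#   import boto3
#   response = boto3.client("databrew").create_dataset(**request)
#   print(response["Name"])
```

Building the arguments separately from the call keeps the example runnable without credentials and makes the request easy to inspect before it is sent.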

DescribeDataset (updated)

Changes (response)

{'Format': 'CSV | JSON | PARQUET | EXCEL',
 'FormatOptions': {'Csv': {'HeaderRow': 'boolean'},
                   'Excel': {'HeaderRow': 'boolean'}}}

Returns the definition of a specific DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.describe_dataset(
    Name='string'
)

type Name

string

param Name

[REQUIRED]

The name of the dataset to be described.

rtype

dict

returns

Response Syntax

{
    'CreatedBy': 'string',
    'CreateDate': datetime(2015, 1, 1),
    'Name': 'string',
    'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL',
    'FormatOptions': {
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    'Input': {
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    },
    'LastModifiedDate': datetime(2015, 1, 1),
    'LastModifiedBy': 'string',
    'Source': 'S3'|'DATA-CATALOG',
    'Tags': {
        'string': 'string'
    },
    'ResourceArn': 'string'
}

Response Structure

(dict) --
- CreatedBy (string) --
  
  The identifier (user name) of the user who created the dataset.
- CreateDate (datetime) --
  
  The date and time that the dataset was created.
- Name (string) --
  
  The name of the dataset.
- Format (string) --
  
  Specifies the file format of a dataset created from an S3 file or folder.
- FormatOptions (dict) --
  
  Options that define the structure of either CSV, Excel, or JSON input.
  - Json (dict) --
    
    Options that define how JSON input is to be interpreted by DataBrew.
    - MultiLine (boolean) --
      
      A value that specifies whether JSON input contains embedded new line characters.
  - Excel (dict) --
    
    Options that define how Excel input is to be interpreted by DataBrew.
    - SheetNames (list) --
      
      Specifies one or more named sheets in the Excel file, which will be included in the dataset.
      - (string) --
    - SheetIndexes (list) --
      
      Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
      - (integer) --
    - HeaderRow (boolean) --
      
      A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
  - Csv (dict) --
    
    Options that define how CSV input is to be interpreted by DataBrew.
    - Delimiter (string) --
      
      A single character that specifies the delimiter being used in the CSV file.
    - HeaderRow (boolean) --
      
      A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
- Input (dict) --
  
  Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
  - S3InputDefinition (dict) --
    
    The Amazon S3 location where the data is stored.
    - Bucket (string) --
      
      The S3 bucket name.
    - Key (string) --
      
      The unique name of the object in the bucket.
  - DataCatalogInputDefinition (dict) --
    
    The AWS Glue Data Catalog parameters for the data.
    - CatalogId (string) --
      
      The unique identifier of the AWS account that holds the Data Catalog that stores the data.
    - DatabaseName (string) --
      
      The name of a database in the Data Catalog.
    - TableName (string) --
      
      The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
    - TempDirectory (dict) --
      
      An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
      - Bucket (string) --
        
        The S3 bucket name.
      - Key (string) --
        
        The unique name of the object in the bucket.
- LastModifiedDate (datetime) --
  
  The date and time that the dataset was last modified.
- LastModifiedBy (string) --
  
  The identifier (user name) of the user who last modified the dataset.
- Source (string) --
  
  The location of the data for this dataset, Amazon S3 or the AWS Glue Data Catalog.
- Tags (dict) --
  
  Metadata tags associated with this dataset.
  - (string) --
    - (string) --
- ResourceArn (string) --
  
  The Amazon Resource Name (ARN) of the dataset.
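
To illustrate reading the new response fields, the sketch below pulls the HeaderRow setting out of a describe_dataset response. The helper name header_row_setting is hypothetical, the sample dict is hand-written to match the response syntax above rather than captured from a real call, and treating Format and FormatOptions as possibly absent is a defensive assumption.

```python
def header_row_setting(description):
    """Return the HeaderRow flag from a describe_dataset response.

    Returns None when the format has no HeaderRow option or it was not set.
    """
    # Only the Csv and Excel sections carry a HeaderRow key.
    section = {"CSV": "Csv", "EXCEL": "Excel"}.get(description.get("Format"))
    if section is None:
        return None
    return description.get("FormatOptions", {}).get(section, {}).get("HeaderRow")


# Hand-written sample shaped like the response syntax above.
sample_description = {
    "Name": "sales-2021",
    "Format": "EXCEL",
    "FormatOptions": {"Excel": {"SheetIndexes": [0], "HeaderRow": False}},
    "Source": "S3",
}

has_header = header_row_setting(sample_description)
```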

ListDatasets (updated)

Changes (response)

{'Datasets': {'Format': 'CSV | JSON | PARQUET | EXCEL',
              'FormatOptions': {'Csv': {'HeaderRow': 'boolean'},
                                'Excel': {'HeaderRow': 'boolean'}}}}

Lists all of the DataBrew datasets.

See also: AWS API Documentation

Request Syntax

client.list_datasets(
    MaxResults=123,
    NextToken='string'
)

type MaxResults

integer

param MaxResults

The maximum number of results to return in this request.

type NextToken

string

param NextToken

The token returned by a previous call to retrieve the next set of results.

rtype

dict

returns

Response Syntax

{
    'Datasets': [
        {
            'AccountId': 'string',
            'CreatedBy': 'string',
            'CreateDate': datetime(2015, 1, 1),
            'Name': 'string',
            'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL',
            'FormatOptions': {
                'Json': {
                    'MultiLine': True|False
                },
                'Excel': {
                    'SheetNames': [
                        'string',
                    ],
                    'SheetIndexes': [
                        123,
                    ],
                    'HeaderRow': True|False
                },
                'Csv': {
                    'Delimiter': 'string',
                    'HeaderRow': True|False
                }
            },
            'Input': {
                'S3InputDefinition': {
                    'Bucket': 'string',
                    'Key': 'string'
                },
                'DataCatalogInputDefinition': {
                    'CatalogId': 'string',
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'TempDirectory': {
                        'Bucket': 'string',
                        'Key': 'string'
                    }
                }
            },
            'LastModifiedDate': datetime(2015, 1, 1),
            'LastModifiedBy': 'string',
            'Source': 'S3'|'DATA-CATALOG',
            'Tags': {
                'string': 'string'
            },
            'ResourceArn': 'string'
        },
    ],
    'NextToken': 'string'
}

Response Structure

(dict) --
- Datasets (list) --
  
  A list of datasets that are defined.
  - (dict) --
    
    Represents a dataset that can be processed by DataBrew.
    - AccountId (string) --
      
      The ID of the AWS account that owns the dataset.
    - CreatedBy (string) --
      
      The Amazon Resource Name (ARN) of the user who created the dataset.
    - CreateDate (datetime) --
      
      The date and time that the dataset was created.
    - Name (string) --
      
      The unique name of the dataset.
    - Format (string) --
      
      Specifies the file format of a dataset created from an S3 file or folder.
    - FormatOptions (dict) --
      
      Options that define how DataBrew interprets the data in the dataset.
      - Json (dict) --
        
        Options that define how JSON input is to be interpreted by DataBrew.
        - MultiLine (boolean) --
          
          A value that specifies whether JSON input contains embedded new line characters.
      - Excel (dict) --
        
        Options that define how Excel input is to be interpreted by DataBrew.
        - SheetNames (list) --
          
          Specifies one or more named sheets in the Excel file, which will be included in the dataset.
          - (string) --
        - SheetIndexes (list) --
          
          Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
          - (integer) --
        - HeaderRow (boolean) --
          
          A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
      - Csv (dict) --
        
        Options that define how CSV input is to be interpreted by DataBrew.
        - Delimiter (string) --
          
          A single character that specifies the delimiter being used in the CSV file.
        - HeaderRow (boolean) --
          
          A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
    - Input (dict) --
      
      Information on how DataBrew can find the dataset, in either the AWS Glue Data Catalog or Amazon S3.
      - S3InputDefinition (dict) --
        
        The Amazon S3 location where the data is stored.
        - Bucket (string) --
          
          The S3 bucket name.
        - Key (string) --
          
          The unique name of the object in the bucket.
      - DataCatalogInputDefinition (dict) --
        
        The AWS Glue Data Catalog parameters for the data.
        - CatalogId (string) --
          
          The unique identifier of the AWS account that holds the Data Catalog that stores the data.
        - DatabaseName (string) --
          
          The name of a database in the Data Catalog.
        - TableName (string) --
          
          The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
        - TempDirectory (dict) --
          
          An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
          - Bucket (string) --
            
            The S3 bucket name.
          - Key (string) --
            
            The unique name of the object in the bucket.
    - LastModifiedDate (datetime) --
      
      The last modification date and time of the dataset.
    - LastModifiedBy (string) --
      
      The Amazon Resource Name (ARN) of the user who last modified the dataset.
    - Source (string) --
      
      The location of the data for the dataset, either Amazon S3 or the AWS Glue Data Catalog.
    - Tags (dict) --
      
      Metadata tags that have been applied to the dataset.
      - (string) --
        - (string) --
    - ResourceArn (string) --
      
      The unique Amazon Resource Name (ARN) for the dataset.
- NextToken (string) --
  
  A token that you can use in a subsequent call to retrieve the next set of results.
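
The MaxResults and NextToken parameters imply a paging loop, which might be sketched as follows. The generator iter_datasets and the FakeDataBrewClient stand-in are hypothetical names introduced here so the example runs without AWS credentials; a real client would come from boto3.client("databrew").

```python
def iter_datasets(client, page_size=50):
    """Yield every dataset, following NextToken until no token is returned."""
    kwargs = {"MaxResults": page_size}
    while True:
        page = client.list_datasets(**kwargs)
        yield from page.get("Datasets", [])
        token = page.get("NextToken")
        if not token:
            break
        # Pass the token back to fetch the next page.
        kwargs["NextToken"] = token


class FakeDataBrewClient:
    """Stand-in that serves two canned pages shaped like the response above."""

    def __init__(self):
        self._pages = {
            None: {
                "Datasets": [
                    {"Name": "orders", "Format": "CSV",
                     "FormatOptions": {"Csv": {"HeaderRow": True}}},
                ],
                "NextToken": "page-2",
            },
            "page-2": {
                "Datasets": [
                    {"Name": "ledger", "Format": "EXCEL",
                     "FormatOptions": {"Excel": {"HeaderRow": False}}},
                ],
            },
        }

    def list_datasets(self, MaxResults=None, NextToken=None):
        return self._pages[NextToken]


# Collect the new Format field for each dataset across both pages.
formats = {d["Name"]: d.get("Format") for d in iter_datasets(FakeDataBrewClient())}
```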

UpdateDataset (updated)

Changes (request)

{'Format': 'CSV | JSON | PARQUET | EXCEL',
 'FormatOptions': {'Csv': {'HeaderRow': 'boolean'},
                   'Excel': {'HeaderRow': 'boolean'}}}

Modifies the definition of an existing DataBrew dataset.

See also: AWS API Documentation

Request Syntax

client.update_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    }
)

type Name

string

param Name

[REQUIRED]

The name of the dataset to be updated.

type Format

string

param Format

Specifies the file format of a dataset created from an S3 file or folder.

type FormatOptions

dict

param FormatOptions

Options that define the structure of either CSV, Excel, or JSON input.

Json (dict) --

Options that define how JSON input is to be interpreted by DataBrew.
- MultiLine (boolean) --
  
  A value that specifies whether JSON input contains embedded new line characters.
Excel (dict) --

Options that define how Excel input is to be interpreted by DataBrew.
- SheetNames (list) --
  
  Specifies one or more named sheets in the Excel file, which will be included in the dataset.
  - (string) --
- SheetIndexes (list) --
  
  Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
  - (integer) --
- HeaderRow (boolean) --
  
  A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --

Options that define how CSV input is to be interpreted by DataBrew.
- Delimiter (string) --
  
  A single character that specifies the delimiter being used in the CSV file.
- HeaderRow (boolean) --
  
  A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.

type Input

dict

param Input

[REQUIRED]

Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.

S3InputDefinition (dict) --

The Amazon S3 location where the data is stored.
- Bucket (string) -- [REQUIRED]
  
  The S3 bucket name.
- Key (string) --
  
  The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --

The AWS Glue Data Catalog parameters for the data.
- CatalogId (string) --
  
  The unique identifier of the AWS account that holds the Data Catalog that stores the data.
- DatabaseName (string) -- [REQUIRED]
  
  The name of a database in the Data Catalog.
- TableName (string) -- [REQUIRED]
  
  The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
- TempDirectory (dict) --
  
  An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
  - Bucket (string) -- [REQUIRED]
    
    The S3 bucket name.
  - Key (string) --
    
    The unique name of the object in the bucket.

rtype

dict

returns

Response Syntax

{
    'Name': 'string'
}

Response Structure

(dict) --
- Name (string) --
  
  The name of the dataset that you updated.
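
Because Input is required on update_dataset, changing only the header setting still means resending the dataset's input definition. The sketch below derives an update request from a describe_dataset response; the helper name build_update_request is hypothetical, the sample description is hand-written, and passing the described Input back unchanged is an assumption based on the two shapes matching on this page.

```python
import copy


def build_update_request(description, header_row):
    """Build update_dataset keyword arguments that change only HeaderRow."""
    file_format = description["Format"]
    sections = {"CSV": "Csv", "EXCEL": "Excel"}
    if file_format not in sections:
        raise ValueError(f"HeaderRow does not apply to format {file_format!r}")

    # Copy so the caller's describe_dataset response is left untouched.
    options = copy.deepcopy(description.get("FormatOptions", {}))
    options.setdefault(sections[file_format], {})["HeaderRow"] = header_row

    # Only fields accepted by update_dataset are carried over.
    return {
        "Name": description["Name"],
        "Format": file_format,
        "FormatOptions": options,
        "Input": copy.deepcopy(description["Input"]),
    }


# Hand-written sample shaped like a describe_dataset response.
description = {
    "Name": "orders",
    "Format": "CSV",
    "FormatOptions": {"Csv": {"Delimiter": ",", "HeaderRow": True}},
    "Input": {"S3InputDefinition": {"Bucket": "example-bucket", "Key": "raw/orders.csv"}},
    "Source": "S3",
}

update_request = build_update_request(description, header_row=False)
# With credentials configured: boto3.client("databrew").update_dataset(**update_request)
```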