2021/02/25 - AWS Glue DataBrew - 4 updated api methods
Changes: This SDK release adds two new dataset features: 1) support for specifying the file format for a dataset, and 2) support for specifying whether the first row of a CSV or Excel file contains a header.
{'Format': 'CSV | JSON | PARQUET | EXCEL',
'FormatOptions': {'Csv': {'HeaderRow': 'boolean'},
'Excel': {'HeaderRow': 'boolean'}}}
Creates a new DataBrew dataset.
Request Syntax
client.create_dataset(
Name='string',
Format='CSV'|'JSON'|'PARQUET'|'EXCEL',
FormatOptions={
'Json': {
'MultiLine': True|False
},
'Excel': {
'SheetNames': [
'string',
],
'SheetIndexes': [
123,
],
'HeaderRow': True|False
},
'Csv': {
'Delimiter': 'string',
'HeaderRow': True|False
}
},
Input={
'S3InputDefinition': {
'Bucket': 'string',
'Key': 'string'
},
'DataCatalogInputDefinition': {
'CatalogId': 'string',
'DatabaseName': 'string',
'TableName': 'string',
'TempDirectory': {
'Bucket': 'string',
'Key': 'string'
}
}
},
Tags={
'string': 'string'
}
)
Name (string) -- [REQUIRED]
The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.
Format (string) --
Specifies the file format of a dataset created from an S3 file or folder.
FormatOptions (dict) --
Options that define the structure of CSV, Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded newline characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
Specifies one or more named sheets in the Excel file, which will be included in the dataset.
(string) --
SheetIndexes (list) --
Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Input (dict) -- [REQUIRED]
Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) -- [REQUIRED]
The name of a database in the Data Catalog.
TableName (string) -- [REQUIRED]
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
Tags (dict) --
Metadata tags to apply to this dataset.
(string) --
(string) --
dict
Response Syntax
{
'Name': 'string'
}
Response Structure
(dict) --
Name (string) --
The name of the dataset that you created.
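A minimal sketch of a `create_dataset` call that sets the new `Format` field and the new `HeaderRow` option for a CSV file in Amazon S3. It assumes a boto3 `databrew` client is passed in; the helper names `build_csv_dataset_request` and `create_csv_dataset` are hypothetical and not part of the SDK.

```python
def build_csv_dataset_request(name, bucket, key, header_row=True, delimiter=","):
    """Build keyword arguments for client.create_dataset() for a CSV file in S3.

    Hypothetical helper: only the dictionary keys come from the request syntax above.
    """
    return {
        "Name": name,
        "Format": "CSV",
        "FormatOptions": {
            # HeaderRow=False asks DataBrew to auto-generate column names.
            "Csv": {"Delimiter": delimiter, "HeaderRow": header_row},
        },
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


def create_csv_dataset(client, name, bucket, key, header_row=True):
    """Create the dataset and return the name reported in the response."""
    request = build_csv_dataset_request(name, bucket, key, header_row=header_row)
    response = client.create_dataset(**request)
    return response["Name"]
```

With real credentials, the client would typically come from `boto3.client("databrew")`; the dataset, bucket, and key names are whatever applies in your account.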
{'Format': 'CSV | JSON | PARQUET | EXCEL',
'FormatOptions': {'Csv': {'HeaderRow': 'boolean'},
'Excel': {'HeaderRow': 'boolean'}}}
Returns the definition of a specific DataBrew dataset.
Request Syntax
client.describe_dataset(
Name='string'
)
Name (string) -- [REQUIRED]
The name of the dataset to be described.
dict
Response Syntax
{
'CreatedBy': 'string',
'CreateDate': datetime(2015, 1, 1),
'Name': 'string',
'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL',
'FormatOptions': {
'Json': {
'MultiLine': True|False
},
'Excel': {
'SheetNames': [
'string',
],
'SheetIndexes': [
123,
],
'HeaderRow': True|False
},
'Csv': {
'Delimiter': 'string',
'HeaderRow': True|False
}
},
'Input': {
'S3InputDefinition': {
'Bucket': 'string',
'Key': 'string'
},
'DataCatalogInputDefinition': {
'CatalogId': 'string',
'DatabaseName': 'string',
'TableName': 'string',
'TempDirectory': {
'Bucket': 'string',
'Key': 'string'
}
}
},
'LastModifiedDate': datetime(2015, 1, 1),
'LastModifiedBy': 'string',
'Source': 'S3'|'DATA-CATALOG',
'Tags': {
'string': 'string'
},
'ResourceArn': 'string'
}
Response Structure
(dict) --
CreatedBy (string) --
The identifier (user name) of the user who created the dataset.
CreateDate (datetime) --
The date and time that the dataset was created.
Name (string) --
The name of the dataset.
Format (string) --
Specifies the file format of a dataset created from an S3 file or folder.
FormatOptions (dict) --
Options that define the structure of CSV, Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded newline characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
Specifies one or more named sheets in the Excel file, which will be included in the dataset.
(string) --
SheetIndexes (list) --
Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Input (dict) --
Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) --
The name of a database in the Data Catalog.
TableName (string) --
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
LastModifiedDate (datetime) --
The date and time that the dataset was last modified.
LastModifiedBy (string) --
The identifier (user name) of the user who last modified the dataset.
Source (string) --
The location of the data for this dataset, Amazon S3 or the AWS Glue Data Catalog.
Tags (dict) --
Metadata tags associated with this dataset.
(string) --
(string) --
ResourceArn (string) --
The Amazon Resource Name (ARN) of the dataset.
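As a rough illustration of reading the new fields from a `describe_dataset` response, the sketch below extracts `Format` and the matching `HeaderRow` value from a response dictionary. The function name `summarize_format` is hypothetical, and the code assumes that either field may be missing from the response, which is an assumption rather than documented behavior.

```python
def summarize_format(description):
    """Return (format, header_row) from a describe_dataset response dict.

    header_row is None when the response carries no HeaderRow value,
    for example for JSON or PARQUET datasets.
    """
    fmt = description.get("Format")
    options = description.get("FormatOptions", {})
    # Only the Csv and Excel option groups have a HeaderRow field.
    group = {"CSV": "Csv", "EXCEL": "Excel"}.get(fmt)
    header_row = options.get(group, {}).get("HeaderRow") if group else None
    return fmt, header_row
```

A call such as `summarize_format(client.describe_dataset(Name=...))` would then give both values in one step.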
{'Datasets': {'Format': 'CSV | JSON | PARQUET | EXCEL',
'FormatOptions': {'Csv': {'HeaderRow': 'boolean'},
'Excel': {'HeaderRow': 'boolean'}}}}
Lists all of the DataBrew datasets.
Request Syntax
client.list_datasets(
MaxResults=123,
NextToken='string'
)
MaxResults (integer) --
The maximum number of results to return in this request.
NextToken (string) --
The token returned by a previous call to retrieve the next set of results.
dict
Response Syntax
{
'Datasets': [
{
'AccountId': 'string',
'CreatedBy': 'string',
'CreateDate': datetime(2015, 1, 1),
'Name': 'string',
'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL',
'FormatOptions': {
'Json': {
'MultiLine': True|False
},
'Excel': {
'SheetNames': [
'string',
],
'SheetIndexes': [
123,
],
'HeaderRow': True|False
},
'Csv': {
'Delimiter': 'string',
'HeaderRow': True|False
}
},
'Input': {
'S3InputDefinition': {
'Bucket': 'string',
'Key': 'string'
},
'DataCatalogInputDefinition': {
'CatalogId': 'string',
'DatabaseName': 'string',
'TableName': 'string',
'TempDirectory': {
'Bucket': 'string',
'Key': 'string'
}
}
},
'LastModifiedDate': datetime(2015, 1, 1),
'LastModifiedBy': 'string',
'Source': 'S3'|'DATA-CATALOG',
'Tags': {
'string': 'string'
},
'ResourceArn': 'string'
},
],
'NextToken': 'string'
}
Response Structure
(dict) --
Datasets (list) --
A list of datasets that are defined.
(dict) --
Represents a dataset that can be processed by DataBrew.
AccountId (string) --
The ID of the AWS account that owns the dataset.
CreatedBy (string) --
The Amazon Resource Name (ARN) of the user who created the dataset.
CreateDate (datetime) --
The date and time that the dataset was created.
Name (string) --
The unique name of the dataset.
Format (string) --
Specifies the file format of a dataset created from an S3 file or folder.
FormatOptions (dict) --
Options that define how DataBrew interprets the data in the dataset.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded newline characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
Specifies one or more named sheets in the Excel file, which will be included in the dataset.
(string) --
SheetIndexes (list) --
Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Input (dict) --
Information on how DataBrew can find the dataset, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) --
The name of a database in the Data Catalog.
TableName (string) --
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
LastModifiedDate (datetime) --
The last modification date and time of the dataset.
LastModifiedBy (string) --
The Amazon Resource Name (ARN) of the user who last modified the dataset.
Source (string) --
The location of the data for the dataset, either Amazon S3 or the AWS Glue Data Catalog.
Tags (dict) --
Metadata tags that have been applied to the dataset.
(string) --
(string) --
ResourceArn (string) --
The unique Amazon Resource Name (ARN) for the dataset.
NextToken (string) --
A token that you can use in a subsequent call to retrieve the next set of results.
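The `NextToken` handshake described above can be sketched as a generator that keeps calling `list_datasets` until no token is returned. The function name `iter_datasets` and the default page size of 50 are assumptions for illustration; the client is expected to be a boto3 `databrew` client.

```python
def iter_datasets(client, page_size=50):
    """Yield every dataset summary, following NextToken until it is absent."""
    kwargs = {"MaxResults": page_size}
    while True:
        page = client.list_datasets(**kwargs)
        yield from page.get("Datasets", [])
        token = page.get("NextToken")
        if not token:
            break
        # Pass the token back to fetch the next set of results.
        kwargs["NextToken"] = token
```

Each yielded item is one entry of `Datasets`, so the new `Format` and `FormatOptions` fields can be read from it directly.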
{'Format': 'CSV | JSON | PARQUET | EXCEL',
'FormatOptions': {'Csv': {'HeaderRow': 'boolean'},
'Excel': {'HeaderRow': 'boolean'}}}
Modifies the definition of an existing DataBrew dataset.
Request Syntax
client.update_dataset(
Name='string',
Format='CSV'|'JSON'|'PARQUET'|'EXCEL',
FormatOptions={
'Json': {
'MultiLine': True|False
},
'Excel': {
'SheetNames': [
'string',
],
'SheetIndexes': [
123,
],
'HeaderRow': True|False
},
'Csv': {
'Delimiter': 'string',
'HeaderRow': True|False
}
},
Input={
'S3InputDefinition': {
'Bucket': 'string',
'Key': 'string'
},
'DataCatalogInputDefinition': {
'CatalogId': 'string',
'DatabaseName': 'string',
'TableName': 'string',
'TempDirectory': {
'Bucket': 'string',
'Key': 'string'
}
}
}
)
Name (string) -- [REQUIRED]
The name of the dataset to be updated.
Format (string) --
Specifies the file format of a dataset created from an S3 file or folder.
FormatOptions (dict) --
Options that define the structure of CSV, Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded newline characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
Specifies one or more named sheets in the Excel file, which will be included in the dataset.
(string) --
SheetIndexes (list) --
Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Input (dict) -- [REQUIRED]
Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) -- [REQUIRED]
The name of a database in the Data Catalog.
TableName (string) -- [REQUIRED]
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
dict
Response Syntax
{
'Name': 'string'
}
Response Structure
(dict) --
Name (string) --
The name of the dataset that you updated.
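A minimal sketch of an `update_dataset` request that switches an existing dataset to an Excel file and sets the new `HeaderRow` option. The helper name `build_excel_update_request` is hypothetical, and the example assumes the data lives in Amazon S3 rather than the AWS Glue Data Catalog.

```python
def build_excel_update_request(name, bucket, key, sheet_names, header_row=True):
    """Build keyword arguments for client.update_dataset() for an Excel file in S3."""
    return {
        "Name": name,
        "Format": "EXCEL",
        "FormatOptions": {
            "Excel": {"SheetNames": list(sheet_names), "HeaderRow": header_row},
        },
        # Input is marked [REQUIRED] for update_dataset, so it is always sent.
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }
```

The resulting dictionary can be passed as `client.update_dataset(**request)`; the response then carries the `Name` of the updated dataset.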