AWS Glue

2019/02/22 - AWS Glue - 11 new 4 updated api methods

Changes: AWS Glue adds support for assigning AWS resource tags to jobs, triggers, development endpoints, and crawlers. Each tag consists of a key and an optional value, both of which you define. With this capability, you can use tags in AWS Glue to organize and identify your resources, create cost allocation reports, and control access to resources.

BatchGetCrawlers (new)

Returns a list of resource metadata for a given list of crawler names. After calling the ListCrawlers operation, you can call this operation to access the data to which you have been granted permissions. This operation supports all IAM permissions, including permission conditions that use tags.

See also: AWS API Documentation

Request Syntax

client.batch_get_crawlers(
    CrawlerNames=[
        'string',
    ]
)
type CrawlerNames

list

param CrawlerNames

[REQUIRED]

A list of crawler names, which may be the names returned from the ListCrawlers operation.

  • (string) --

rtype

dict

returns

Response Syntax

{
    'Crawlers': [
        {
            'Name': 'string',
            'Role': 'string',
            'Targets': {
                'S3Targets': [
                    {
                        'Path': 'string',
                        'Exclusions': [
                            'string',
                        ]
                    },
                ],
                'JdbcTargets': [
                    {
                        'ConnectionName': 'string',
                        'Path': 'string',
                        'Exclusions': [
                            'string',
                        ]
                    },
                ],
                'DynamoDBTargets': [
                    {
                        'Path': 'string'
                    },
                ]
            },
            'DatabaseName': 'string',
            'Description': 'string',
            'Classifiers': [
                'string',
            ],
            'SchemaChangePolicy': {
                'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE',
                'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'
            },
            'State': 'READY'|'RUNNING'|'STOPPING',
            'TablePrefix': 'string',
            'Schedule': {
                'ScheduleExpression': 'string',
                'State': 'SCHEDULED'|'NOT_SCHEDULED'|'TRANSITIONING'
            },
            'CrawlElapsedTime': 123,
            'CreationTime': datetime(2015, 1, 1),
            'LastUpdated': datetime(2015, 1, 1),
            'LastCrawl': {
                'Status': 'SUCCEEDED'|'CANCELLED'|'FAILED',
                'ErrorMessage': 'string',
                'LogGroup': 'string',
                'LogStream': 'string',
                'MessagePrefix': 'string',
                'StartTime': datetime(2015, 1, 1)
            },
            'Version': 123,
            'Configuration': 'string',
            'CrawlerSecurityConfiguration': 'string'
        },
    ],
    'CrawlersNotFound': [
        'string',
    ]
}

Response Structure

  • (dict) --

    • Crawlers (list) --

      A list of crawler definitions.

      • (dict) --

        Specifies a crawler program that examines a data source and uses classifiers to try to determine its schema. If successful, the crawler records metadata concerning the data source in the AWS Glue Data Catalog.

        • Name (string) --

          The crawler name.

        • Role (string) --

          The IAM role (or ARN of an IAM role) used to access customer resources, such as data in Amazon S3.

        • Targets (dict) --

          A collection of targets to crawl.

          • S3Targets (list) --

            Specifies Amazon S3 targets.

            • (dict) --

              Specifies a data store in Amazon S3.

              • Path (string) --

                The path to the Amazon S3 target.

              • Exclusions (list) --

                A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

                • (string) --

          • JdbcTargets (list) --

            Specifies JDBC targets.

            • (dict) --

              Specifies a JDBC data store to crawl.

              • ConnectionName (string) --

                The name of the connection to use to connect to the JDBC target.

              • Path (string) --

                The path of the JDBC target.

              • Exclusions (list) --

                A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

                • (string) --

          • DynamoDBTargets (list) --

            Specifies DynamoDB targets.

            • (dict) --

              Specifies a DynamoDB table to crawl.

              • Path (string) --

                The name of the DynamoDB table to crawl.

        • DatabaseName (string) --

          The database where metadata is written by this crawler.

        • Description (string) --

          A description of the crawler.

        • Classifiers (list) --

          A list of custom classifiers associated with the crawler.

          • (string) --

        • SchemaChangePolicy (dict) --

          Sets the behavior when the crawler finds a changed or deleted object.

          • UpdateBehavior (string) --

            The update behavior when the crawler finds a changed schema.

          • DeleteBehavior (string) --

            The deletion behavior when the crawler finds a deleted object.

        • State (string) --

          Indicates whether the crawler is running, or whether a run is pending.

        • TablePrefix (string) --

          The prefix added to the names of tables that are created.

        • Schedule (dict) --

          For scheduled crawlers, the schedule when the crawler runs.

          • ScheduleExpression (string) --

            A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

          • State (string) --

            The state of the schedule.

        • CrawlElapsedTime (integer) --

          If the crawler is running, contains the total time elapsed since the last crawl began.

        • CreationTime (datetime) --

          The time when the crawler was created.

        • LastUpdated (datetime) --

          The time the crawler was last updated.

        • LastCrawl (dict) --

          The status of the last crawl, and potentially error information if an error occurred.

          • Status (string) --

            Status of the last crawl.

          • ErrorMessage (string) --

            If an error occurred, the error information about the last crawl.

          • LogGroup (string) --

            The log group for the last crawl.

          • LogStream (string) --

            The log stream for the last crawl.

          • MessagePrefix (string) --

            The prefix for a message about this crawl.

          • StartTime (datetime) --

            The time at which the crawl started.

        • Version (integer) --

          The version of the crawler.

        • Configuration (string) --

          Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

        • CrawlerSecurityConfiguration (string) --

          The name of the SecurityConfiguration structure to be used by this Crawler.

    • CrawlersNotFound (list) --

      A list of crawlers not found.

      • (string) --
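
As an illustration of the response shape above, the following minimal sketch separates the crawlers that were returned from the names reported in CrawlersNotFound. The helper name describe_crawlers is hypothetical, and client is assumed to be a Glue client such as the one returned by boto3.client("glue").

```python
def describe_crawlers(client, names):
    """Return (crawlers_by_name, missing_names) for the given crawler names.

    `client` is assumed to expose batch_get_crawlers as documented above.
    """
    response = client.batch_get_crawlers(CrawlerNames=list(names))
    # Index the returned crawler definitions by their Name field.
    found = {crawler["Name"]: crawler for crawler in response.get("Crawlers", [])}
    # Names that could not be resolved come back in CrawlersNotFound.
    missing = list(response.get("CrawlersNotFound", []))
    return found, missing
```

Checking the second return value before using the first avoids silently skipping crawlers that were renamed or deleted between the list call and the batch call.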

GetTags (new)

Retrieves a list of tags associated with a resource.

See also: AWS API Documentation

Request Syntax

client.get_tags(
    ResourceArn='string'
)
type ResourceArn

string

param ResourceArn

[REQUIRED]

The Amazon Resource Name (ARN) of the resource for which to retrieve tags.

rtype

dict

returns

Response Syntax

{
    'Tags': {
        'string': 'string'
    }
}

Response Structure

  • (dict) --

    • Tags (dict) --

      The requested tags.

      • (string) --

        • (string) --
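
A short sketch of reading the Tags mapping returned by this operation. The helper has_tag is a hypothetical name, and client is assumed to be a Glue client such as boto3.client("glue").

```python
def has_tag(client, resource_arn, key, value=None):
    """Return True if the resource carries `key` (and `value`, when given)."""
    tags = client.get_tags(ResourceArn=resource_arn).get("Tags", {})
    if key not in tags:
        return False
    # When no value is requested, the presence of the key is enough.
    return value is None or tags[key] == value
```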

BatchGetTriggers (new)

Returns a list of resource metadata for a given list of trigger names. After calling the ListTriggers operation, you can call this operation to access the data to which you have been granted permissions. This operation supports all IAM permissions, including permission conditions that use tags.

See also: AWS API Documentation

Request Syntax

client.batch_get_triggers(
    TriggerNames=[
        'string',
    ]
)
type TriggerNames

list

param TriggerNames

[REQUIRED]

A list of trigger names, which may be the names returned from the ListTriggers operation.

  • (string) --

rtype

dict

returns

Response Syntax

{
    'Triggers': [
        {
            'Name': 'string',
            'Id': 'string',
            'Type': 'SCHEDULED'|'CONDITIONAL'|'ON_DEMAND',
            'State': 'CREATING'|'CREATED'|'ACTIVATING'|'ACTIVATED'|'DEACTIVATING'|'DEACTIVATED'|'DELETING'|'UPDATING',
            'Description': 'string',
            'Schedule': 'string',
            'Actions': [
                {
                    'JobName': 'string',
                    'Arguments': {
                        'string': 'string'
                    },
                    'Timeout': 123,
                    'NotificationProperty': {
                        'NotifyDelayAfter': 123
                    },
                    'SecurityConfiguration': 'string'
                },
            ],
            'Predicate': {
                'Logical': 'AND'|'ANY',
                'Conditions': [
                    {
                        'LogicalOperator': 'EQUALS',
                        'JobName': 'string',
                        'State': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT'
                    },
                ]
            }
        },
    ],
    'TriggersNotFound': [
        'string',
    ]
}

Response Structure

  • (dict) --

    • Triggers (list) --

      A list of trigger definitions.

      • (dict) --

        Information about a specific trigger.

        • Name (string) --

          Name of the trigger.

        • Id (string) --

          Reserved for future use.

        • Type (string) --

          The type of trigger that this is.

        • State (string) --

          The current state of the trigger.

        • Description (string) --

          A description of this trigger.

        • Schedule (string) --

          A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

        • Actions (list) --

          The actions initiated by this trigger.

          • (dict) --

            Defines an action to be initiated by a trigger.

            • JobName (string) --

              The name of a job to be executed.

            • Arguments (dict) --

              The job arguments used when this trigger fires. For this job run, they replace the default arguments set in the job definition itself.

              You can specify arguments here that your own job-execution script consumes, as well as arguments that AWS Glue itself consumes.

              For information about how to specify and consume your own Job arguments, see the Calling AWS Glue APIs in Python topic in the developer guide.

              For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special Parameters Used by AWS Glue topic in the developer guide.

              • (string) --

                • (string) --

            • Timeout (integer) --

              The JobRun timeout in minutes. This is the maximum time that a job run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours). This overrides the timeout value set in the parent job.

            • NotificationProperty (dict) --

              Specifies configuration properties of a job run notification.

              • NotifyDelayAfter (integer) --

                After a job run starts, the number of minutes to wait before sending a job run delay notification.

            • SecurityConfiguration (string) --

              The name of the SecurityConfiguration structure to be used with this action.

        • Predicate (dict) --

          The predicate of this trigger, which defines when it will fire.

          • Logical (string) --

            Optional field if only one condition is listed. If multiple conditions are listed, then this field is required.

          • Conditions (list) --

            A list of the conditions that determine when the trigger will fire.

            • (dict) --

              Defines a condition under which a trigger fires.

              • LogicalOperator (string) --

                A logical operator.

              • JobName (string) --

                The name of the Job to whose JobRuns this condition applies and on which this trigger waits.

              • State (string) --

                The condition state. Currently, the values supported are SUCCEEDED, STOPPED, TIMEOUT and FAILED.

    • TriggersNotFound (list) --

      A list of names of triggers not found.

      • (string) --
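
The Actions list in each trigger definition names the jobs a trigger starts. The sketch below, with the hypothetical helper jobs_started_by, maps each returned trigger to those job names; client is assumed to be a Glue client such as boto3.client("glue").

```python
def jobs_started_by(client, trigger_names):
    """Map each found trigger name to the job names listed in its Actions."""
    response = client.batch_get_triggers(TriggerNames=list(trigger_names))
    jobs = {}
    for trigger in response.get("Triggers", []):
        jobs[trigger["Name"]] = [
            action["JobName"]
            for action in trigger.get("Actions", [])
            if "JobName" in action
        ]
    return jobs, list(response.get("TriggersNotFound", []))
```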

ListJobs (new)

Retrieves the names of all job resources in this AWS account, or the resources with the specified tag. This operation allows you to see which resources are available in your account, and their names.

This operation takes the optional Tags field, which you can use as a filter on the response so that tagged resources can be retrieved as a group. If you choose to use tag filtering, only resources with the tag are retrieved.

See also: AWS API Documentation

Request Syntax

client.list_jobs(
    NextToken='string',
    MaxResults=123,
    Tags={
        'string': 'string'
    }
)
type NextToken

string

param NextToken

A continuation token, if this is a continuation request.

type MaxResults

integer

param MaxResults

The maximum size of a list to return.

type Tags

dict

param Tags

Specifies to return only these tagged resources.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{
    'JobNames': [
        'string',
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • JobNames (list) --

      The names of all jobs in the account, or the jobs with the specified tags.

      • (string) --

    • NextToken (string) --

      A continuation token, if the returned list does not contain the last item available.
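
Because the response carries a NextToken whenever more names remain, a caller typically loops until the token is absent. A minimal sketch, assuming client is a Glue client such as boto3.client("glue"); the helper name list_all_job_names and the example tag are illustrative only.

```python
def list_all_job_names(client, tags=None, page_size=None):
    """Collect job names across all pages, optionally filtered by tags."""
    kwargs = {}
    if tags:
        kwargs["Tags"] = tags
    if page_size is not None:
        kwargs["MaxResults"] = page_size
    names = []
    while True:
        page = client.list_jobs(**kwargs)
        names.extend(page.get("JobNames", []))
        token = page.get("NextToken")
        if not token:
            # No continuation token means this was the last page.
            return names
        kwargs["NextToken"] = token
```

For example, list_all_job_names(client, tags={"team": "etl"}) would return only the jobs carrying that hypothetical tag.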

TagResource (new)

Adds tags to a resource. A tag is a label you can assign to an AWS resource. In AWS Glue, you can tag only certain resources. For information about what resources you can tag, see AWS Tags in AWS Glue.

See also: AWS API Documentation

Request Syntax

client.tag_resource(
    ResourceArn='string',
    TagsToAdd={
        'string': 'string'
    }
)
type ResourceArn

string

param ResourceArn

[REQUIRED]

The ARN of the AWS Glue resource to which to add the tags. For more information about AWS Glue resource ARNs, see the AWS Glue ARN string pattern.

type TagsToAdd

dict

param TagsToAdd

[REQUIRED]

Tags to add to this resource.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
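
Since the response body is empty, one way to confirm the result is to read the tags back with GetTags. The sketch below assumes client is a Glue client such as boto3.client("glue"); add_tags is a hypothetical helper name.

```python
def add_tags(client, resource_arn, tags):
    """Add `tags` to the resource, then return its full tag set."""
    client.tag_resource(ResourceArn=resource_arn, TagsToAdd=dict(tags))
    # tag_resource returns an empty dict, so read the tags back to verify.
    return client.get_tags(ResourceArn=resource_arn).get("Tags", {})
```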

ListDevEndpoints (new)

Retrieves the names of all DevEndpoint resources in this AWS account, or the resources with the specified tag. This operation allows you to see which resources are available in your account, and their names.

This operation takes the optional Tags field, which you can use as a filter on the response so that tagged resources can be retrieved as a group. If you choose to use tag filtering, only resources with the tag are retrieved.

See also: AWS API Documentation

Request Syntax

client.list_dev_endpoints(
    NextToken='string',
    MaxResults=123,
    Tags={
        'string': 'string'
    }
)
type NextToken

string

param NextToken

A continuation token, if this is a continuation request.

type MaxResults

integer

param MaxResults

The maximum size of a list to return.

type Tags

dict

param Tags

Specifies to return only these tagged resources.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{
    'DevEndpointNames': [
        'string',
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • DevEndpointNames (list) --

      The names of all DevEndpoints in the account, or the DevEndpoints with the specified tags.

      • (string) --

    • NextToken (string) --

      A continuation token, if the returned list does not contain the last item available.
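
The List operations in this release share the same NextToken convention, so the paging logic can be written once as a generator. This is a sketch under that assumption; iter_names is a hypothetical helper, and the bound method passed in is assumed to come from a Glue client such as boto3.client("glue").

```python
def iter_names(list_call, result_key, **kwargs):
    """Yield every name from a paginated List* call.

    `list_call` is a bound client method such as client.list_dev_endpoints;
    `result_key` is the response key holding the names.
    """
    while True:
        page = list_call(**kwargs)
        yield from page.get(result_key, [])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
```

A call might look like iter_names(client.list_dev_endpoints, "DevEndpointNames", Tags={"env": "dev"}), where the tag is only an example.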

BatchGetDevEndpoints (new)

Returns a list of resource metadata for a given list of DevEndpoint names. After calling the ListDevEndpoints operation, you can call this operation to access the data to which you have been granted permissions. This operation supports all IAM permissions, including permission conditions that use tags.

See also: AWS API Documentation

Request Syntax

client.batch_get_dev_endpoints(
    DevEndpointNames=[
        'string',
    ]
)
type DevEndpointNames

list

param DevEndpointNames

[REQUIRED]

The list of DevEndpoint names, which may be the names returned from the ListDevEndpoints operation.

  • (string) --

rtype

dict

returns

Response Syntax

{
    'DevEndpoints': [
        {
            'EndpointName': 'string',
            'RoleArn': 'string',
            'SecurityGroupIds': [
                'string',
            ],
            'SubnetId': 'string',
            'YarnEndpointAddress': 'string',
            'PrivateAddress': 'string',
            'ZeppelinRemoteSparkInterpreterPort': 123,
            'PublicAddress': 'string',
            'Status': 'string',
            'NumberOfNodes': 123,
            'AvailabilityZone': 'string',
            'VpcId': 'string',
            'ExtraPythonLibsS3Path': 'string',
            'ExtraJarsS3Path': 'string',
            'FailureReason': 'string',
            'LastUpdateStatus': 'string',
            'CreatedTimestamp': datetime(2015, 1, 1),
            'LastModifiedTimestamp': datetime(2015, 1, 1),
            'PublicKey': 'string',
            'PublicKeys': [
                'string',
            ],
            'SecurityConfiguration': 'string'
        },
    ],
    'DevEndpointsNotFound': [
        'string',
    ]
}

Response Structure

  • (dict) --

    • DevEndpoints (list) --

      A list of DevEndpoint definitions.

      • (dict) --

        A development endpoint where a developer can remotely debug ETL scripts.

        • EndpointName (string) --

          The name of the DevEndpoint.

        • RoleArn (string) --

          The AWS ARN of the IAM role used in this DevEndpoint.

        • SecurityGroupIds (list) --

          A list of security group identifiers used in this DevEndpoint.

          • (string) --

        • SubnetId (string) --

          The subnet ID for this DevEndpoint.

        • YarnEndpointAddress (string) --

          The YARN endpoint address used by this DevEndpoint.

        • PrivateAddress (string) --

          A private IP address to access the DevEndpoint within a VPC, if the DevEndpoint is created within one. The PrivateAddress field is present only when you create the DevEndpoint within your virtual private cloud (VPC).

        • ZeppelinRemoteSparkInterpreterPort (integer) --

          The Apache Zeppelin port for the remote Apache Spark interpreter.

        • PublicAddress (string) --

          The public IP address used by this DevEndpoint. The PublicAddress field is present only when you create a non-VPC (virtual private cloud) DevEndpoint.

        • Status (string) --

          The current status of this DevEndpoint.

        • NumberOfNodes (integer) --

          The number of AWS Glue Data Processing Units (DPUs) allocated to this DevEndpoint.

        • AvailabilityZone (string) --

          The AWS availability zone where this DevEndpoint is located.

        • VpcId (string) --

          The ID of the virtual private cloud (VPC) used by this DevEndpoint.

        • ExtraPythonLibsS3Path (string) --

          Path(s) to one or more Python libraries in an S3 bucket that should be loaded in your DevEndpoint. Multiple values must be complete paths separated by a comma.

          Please note that only pure Python libraries can currently be used on a DevEndpoint. Libraries that rely on C extensions, such as the pandas Python data analysis library, are not yet supported.

        • ExtraJarsS3Path (string) --

          Path to one or more Java Jars in an S3 bucket that should be loaded in your DevEndpoint.

          Please note that only pure Java/Scala libraries can currently be used on a DevEndpoint.

        • FailureReason (string) --

          The reason for a current failure in this DevEndpoint.

        • LastUpdateStatus (string) --

          The status of the last update.

        • CreatedTimestamp (datetime) --

          The point in time at which this DevEndpoint was created.

        • LastModifiedTimestamp (datetime) --

          The point in time at which this DevEndpoint was last modified.

        • PublicKey (string) --

          The public key to be used by this DevEndpoint for authentication. This attribute is provided for backward compatibility, as the recommended attribute to use is public keys.

        • PublicKeys (list) --

          A list of public keys to be used by the DevEndpoints for authentication. The use of this attribute is preferred over a single public key because the public keys allow you to have a different private key per client.

          Note

          If you previously created an endpoint with a public key, you must remove that key to be able to set a list of public keys: call the UpdateDevEndpoint API with the public key content in the deletePublicKeys attribute, and the list of new keys in the addPublicKeys attribute.

          • (string) --

        • SecurityConfiguration (string) --

          The name of the SecurityConfiguration structure to be used with this DevEndpoint.

    • DevEndpointsNotFound (list) --

      A list of DevEndpoints not found.

      • (string) --
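
As described above, PrivateAddress is present only for a DevEndpoint created within a VPC and PublicAddress only for a non-VPC DevEndpoint. The following sketch picks whichever one is present; endpoint_addresses is a hypothetical helper and client is assumed to be a Glue client such as boto3.client("glue").

```python
def endpoint_addresses(client, names):
    """Map each found DevEndpoint name to its reachable address, if any."""
    response = client.batch_get_dev_endpoints(DevEndpointNames=list(names))
    addresses = {}
    for endpoint in response.get("DevEndpoints", []):
        # VPC endpoints expose PrivateAddress; non-VPC ones expose PublicAddress.
        addresses[endpoint["EndpointName"]] = (
            endpoint.get("PrivateAddress") or endpoint.get("PublicAddress")
        )
    return addresses, list(response.get("DevEndpointsNotFound", []))
```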

BatchGetJobs (new)

Returns a list of resource metadata for a given list of job names. After calling the ListJobs operation, you can call this operation to access the data to which you have been granted permissions. This operation supports all IAM permissions, including permission conditions that use tags.

See also: AWS API Documentation

Request Syntax

client.batch_get_jobs(
    JobNames=[
        'string',
    ]
)
type JobNames

list

param JobNames

[REQUIRED]

A list of job names, which may be the names returned from the ListJobs operation.

  • (string) --

rtype

dict

returns

Response Syntax

{
    'Jobs': [
        {
            'Name': 'string',
            'Description': 'string',
            'LogUri': 'string',
            'Role': 'string',
            'CreatedOn': datetime(2015, 1, 1),
            'LastModifiedOn': datetime(2015, 1, 1),
            'ExecutionProperty': {
                'MaxConcurrentRuns': 123
            },
            'Command': {
                'Name': 'string',
                'ScriptLocation': 'string'
            },
            'DefaultArguments': {
                'string': 'string'
            },
            'Connections': {
                'Connections': [
                    'string',
                ]
            },
            'MaxRetries': 123,
            'AllocatedCapacity': 123,
            'Timeout': 123,
            'MaxCapacity': 123.0,
            'NotificationProperty': {
                'NotifyDelayAfter': 123
            },
            'SecurityConfiguration': 'string'
        },
    ],
    'JobsNotFound': [
        'string',
    ]
}

Response Structure

  • (dict) --

    • Jobs (list) --

      A list of job definitions.

      • (dict) --

        Specifies a job definition.

        • Name (string) --

          The name you assign to this job definition.

        • Description (string) --

          Description of the job being defined.

        • LogUri (string) --

          This field is reserved for future use.

        • Role (string) --

          The name or ARN of the IAM role associated with this job.

        • CreatedOn (datetime) --

          The time and date that this job definition was created.

        • LastModifiedOn (datetime) --

          The last point in time when this job definition was modified.

        • ExecutionProperty (dict) --

          An ExecutionProperty specifying the maximum number of concurrent runs allowed for this job.

          • MaxConcurrentRuns (integer) --

            The maximum number of concurrent runs allowed for the job. The default is 1. An error is returned when this threshold is reached. The maximum value you can specify is controlled by a service limit.

        • Command (dict) --

          The JobCommand that executes this job.

          • Name (string) --

            The name of the job command. This must be glueetl for an Apache Spark ETL job, or pythonshell for a Python shell job.

          • ScriptLocation (string) --

            Specifies the S3 path to a script that executes a job (required).

        • DefaultArguments (dict) --

          The default arguments for this job, specified as name-value pairs.

          You can specify arguments here that your own job-execution script consumes, as well as arguments that AWS Glue itself consumes.

          For information about how to specify and consume your own Job arguments, see the Calling AWS Glue APIs in Python topic in the developer guide.

          For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special Parameters Used by AWS Glue topic in the developer guide.

          • (string) --

            • (string) --

        • Connections (dict) --

          The connections used for this job.

          • Connections (list) --

            A list of connections used by the job.

            • (string) --

        • MaxRetries (integer) --

          The maximum number of times to retry this job after a JobRun fails.

        • AllocatedCapacity (integer) --

          This field is deprecated. Use MaxCapacity instead.

          The number of AWS Glue data processing units (DPUs) allocated to runs of this job. From 2 to 100 DPUs can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

        • Timeout (integer) --

          The job timeout in minutes. This is the maximum time that a job run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).

        • MaxCapacity (float) --

          The number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

          The value that can be allocated for MaxCapacity depends on whether you are running a Python shell job or an Apache Spark ETL job:

          • When you specify a Python shell job (JobCommand.Name="pythonshell"), you can allocate either 0.0625 or 1 DPU. The default is 0.0625 DPU.

          • When you specify an Apache Spark ETL job (JobCommand.Name="glueetl"), you can allocate from 2 to 100 DPUs. The default is 10 DPUs. This job type cannot have a fractional DPU allocation.

        • NotificationProperty (dict) --

          Specifies configuration properties of a job notification.

          • NotifyDelayAfter (integer) --

            After a job run starts, the number of minutes to wait before sending a job run delay notification.

        • SecurityConfiguration (string) --

          The name of the SecurityConfiguration structure to be used with this job.

    • JobsNotFound (list) --

      A list of names of jobs not found.

      • (string) --
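
The MaxCapacity rules described above can be expressed as a small check. This is a sketch of those stated rules only, not a reproduction of any service-side validation; both helper names are hypothetical, and client is assumed to be a Glue client such as boto3.client("glue").

```python
def is_valid_max_capacity(command_name, max_capacity):
    """Check MaxCapacity against the documented rule for the job command."""
    if command_name == "pythonshell":
        # Python shell jobs: either 0.0625 or 1 DPU.
        return max_capacity in (0.0625, 1)
    if command_name == "glueetl":
        # Spark ETL jobs: a whole number of DPUs from 2 to 100.
        return float(max_capacity).is_integer() and 2 <= max_capacity <= 100
    raise ValueError(f"unknown job command: {command_name!r}")


def jobs_with_unexpected_capacity(client, job_names):
    """Return names of found jobs whose MaxCapacity falls outside the rule."""
    response = client.batch_get_jobs(JobNames=list(job_names))
    return [
        job["Name"]
        for job in response.get("Jobs", [])
        if "MaxCapacity" in job
        and not is_valid_max_capacity(job["Command"]["Name"], job["MaxCapacity"])
    ]
```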

ListTriggers (new)

Retrieves the names of all trigger resources in this AWS account, or the resources with the specified tag. This operation allows you to see which resources are available in your account, and their names.

This operation takes the optional Tags field, which you can use as a filter on the response so that tagged resources can be retrieved as a group. If you choose to use tag filtering, only resources with the tag are retrieved.

See also: AWS API Documentation

Request Syntax

client.list_triggers(
    NextToken='string',
    DependentJobName='string',
    MaxResults=123,
    Tags={
        'string': 'string'
    }
)
type NextToken

string

param NextToken

A continuation token, if this is a continuation request.

type DependentJobName

string

param DependentJobName

The name of the job for which to retrieve triggers. The trigger that can start this job will be returned, and if there is no such trigger, all triggers will be returned.

type MaxResults

integer

param MaxResults

The maximum size of a list to return.

type Tags

dict

param Tags

Specifies to return only these tagged resources.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{
    'TriggerNames': [
        'string',
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • TriggerNames (list) --

      The names of all triggers in the account, or the triggers with the specified tags.

      • (string) --

    • NextToken (string) --

      A continuation token, if the returned list does not contain the last item available.
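
The DependentJobName description notes that all triggers are returned when no trigger starts the named job, so a caller may want to confirm the match. The sketch below does that by inspecting each trigger's Actions through BatchGetTriggers. It assumes client is a Glue client such as boto3.client("glue") and that the collected names fit in a single batch call; triggers_starting_job is a hypothetical helper.

```python
def triggers_starting_job(client, job_name):
    """Return names of triggers whose Actions actually include `job_name`."""
    kwargs = {"DependentJobName": job_name}
    names = []
    while True:
        page = client.list_triggers(**kwargs)
        names.extend(page.get("TriggerNames", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    if not names:
        return []
    # Confirm the match, since an unmatched job name can yield every trigger.
    details = client.batch_get_triggers(TriggerNames=names)
    return [
        trigger["Name"]
        for trigger in details.get("Triggers", [])
        if any(a.get("JobName") == job_name for a in trigger.get("Actions", []))
    ]
```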

ListCrawlers (new)

Retrieves the names of all crawler resources in this AWS account, or the resources with the specified tag. This operation allows you to see which resources are available in your account, and their names.

This operation takes the optional Tags field, which you can use as a filter on the response so that tagged resources can be retrieved as a group. If you choose to use tag filtering, only resources with the tag are retrieved.

See also: AWS API Documentation

Request Syntax

client.list_crawlers(
    MaxResults=123,
    NextToken='string',
    Tags={
        'string': 'string'
    }
)
type MaxResults

integer

param MaxResults

The maximum size of a list to return.

type NextToken

string

param NextToken

A continuation token, if this is a continuation request.

type Tags

dict

param Tags

Specifies to return only these tagged resources.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{
    'CrawlerNames': [
        'string',
    ],
    'NextToken': 'string'
}

Response Structure

  • (dict) --

    • CrawlerNames (list) --

      The names of all crawlers in the account, or the crawlers with the specified tags.

      • (string) --

    • NextToken (string) --

      A continuation token, if the returned list does not contain the last item available.
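
ListCrawlers returns only names, so it is usually paired with BatchGetCrawlers to fetch details. The sketch below handles one page at a time and hands the NextToken back to the caller. crawler_states_page is a hypothetical helper, and client is assumed to be a Glue client such as boto3.client("glue").

```python
def crawler_states_page(client, tags, next_token=None):
    """Return ({crawler_name: State}, next_token) for one page of tagged crawlers."""
    kwargs = {"Tags": tags}
    if next_token:
        kwargs["NextToken"] = next_token
    page = client.list_crawlers(**kwargs)
    names = page.get("CrawlerNames", [])
    if not names:
        return {}, page.get("NextToken")
    details = client.batch_get_crawlers(CrawlerNames=names)
    states = {c["Name"]: c["State"] for c in details.get("Crawlers", [])}
    return states, page.get("NextToken")
```

Returning the token rather than looping lets the caller decide how many pages to fetch.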

UntagResource (new)

Removes tags from a resource.

See also: AWS API Documentation

Request Syntax

client.untag_resource(
    ResourceArn='string',
    TagsToRemove=[
        'string',
    ]
)
type ResourceArn

string

param ResourceArn

[REQUIRED]

The ARN of the resource from which to remove the tags.

type TagsToRemove

list

param TagsToRemove

[REQUIRED]

Tags to remove from this resource.

  • (string) --

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
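
A common cleanup task is to remove every tag except a chosen set. The sketch below reads the current tags with GetTags and passes the remaining tag keys in TagsToRemove. remove_tags_except is a hypothetical helper, and client is assumed to be a Glue client such as boto3.client("glue").

```python
def remove_tags_except(client, resource_arn, keep):
    """Remove all tags whose keys are not in `keep`; return the removed keys."""
    keep_set = set(keep)
    current = client.get_tags(ResourceArn=resource_arn).get("Tags", {})
    to_remove = sorted(key for key in current if key not in keep_set)
    if to_remove:
        # Skip the call entirely when there is nothing to remove.
        client.untag_resource(ResourceArn=resource_arn, TagsToRemove=to_remove)
    return to_remove
```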

CreateCrawler (updated)
Changes (request)
{'Tags': {'string': 'string'}}

Creates a new crawler with specified targets, role, configuration, and optional schedule. At least one crawl target must be specified, in the S3Targets field, the JdbcTargets field, or the DynamoDBTargets field.

See also: AWS API Documentation

Request Syntax

client.create_crawler(
    Name='string',
    Role='string',
    DatabaseName='string',
    Description='string',
    Targets={
        'S3Targets': [
            {
                'Path': 'string',
                'Exclusions': [
                    'string',
                ]
            },
        ],
        'JdbcTargets': [
            {
                'ConnectionName': 'string',
                'Path': 'string',
                'Exclusions': [
                    'string',
                ]
            },
        ],
        'DynamoDBTargets': [
            {
                'Path': 'string'
            },
        ]
    },
    Schedule='string',
    Classifiers=[
        'string',
    ],
    TablePrefix='string',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'
    },
    Configuration='string',
    CrawlerSecurityConfiguration='string',
    Tags={
        'string': 'string'
    }
)
type Name

string

param Name

[REQUIRED]

Name of the new crawler.

type Role

string

param Role

[REQUIRED]

The IAM role (or ARN of an IAM role) used by the new crawler to access customer resources.

type DatabaseName

string

param DatabaseName

[REQUIRED]

The AWS Glue database where results are written, such as: arn:aws:daylight:us-east-1::database/sometable/* .

type Description

string

param Description

A description of the new crawler.

type Targets

dict

param Targets

[REQUIRED]

A collection of targets to crawl.

  • S3Targets (list) --

    Specifies Amazon S3 targets.

    • (dict) --

      Specifies a data store in Amazon S3.

      • Path (string) --

        The path to the Amazon S3 target.

      • Exclusions (list) --

        A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

        • (string) --

  • JdbcTargets (list) --

    Specifies JDBC targets.

    • (dict) --

      Specifies a JDBC data store to crawl.

      • ConnectionName (string) --

        The name of the connection to use to connect to the JDBC target.

      • Path (string) --

        The path of the JDBC target.

      • Exclusions (list) --

        A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

        • (string) --

  • DynamoDBTargets (list) --

    Specifies DynamoDB targets.

    • (dict) --

      Specifies a DynamoDB table to crawl.

      • Path (string) --

        The name of the DynamoDB table to crawl.

type Schedule

string

param Schedule

A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *) .

type Classifiers

list

param Classifiers

A list of custom classifiers that the user has registered. By default, all built-in classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification.

  • (string) --

type TablePrefix

string

param TablePrefix

The table prefix used for catalog tables that are created.

type SchemaChangePolicy

dict

param SchemaChangePolicy

Policy for the crawler's update and deletion behavior.

  • UpdateBehavior (string) --

    The update behavior when the crawler finds a changed schema.

  • DeleteBehavior (string) --

    The deletion behavior when the crawler finds a deleted object.

type Configuration

string

param Configuration

Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

type CrawlerSecurityConfiguration

string

param CrawlerSecurityConfiguration

The name of the SecurityConfiguration structure to be used by this Crawler.

type Tags

dict

param Tags

The tags to use with this crawler request. You may use tags to limit access to the crawler. For more information about tags in AWS Glue, see AWS Tags in AWS Glue in the developer guide.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{}

Response Structure

  • (dict) --
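
The request described above can be assembled as a plain dict before it is passed to create_crawler. This is a minimal sketch under stated assumptions: the helper name `build_crawler_request` is hypothetical, only S3 and DynamoDB targets are covered, and the client-side check for at least one target mirrors the requirement stated in the description rather than reproducing service-side validation.

```python
def build_crawler_request(name, role, database_name, s3_paths=(),
                          dynamodb_tables=(), schedule=None, tags=None):
    """Build keyword arguments for client.create_crawler(**request)."""
    targets = {}
    if s3_paths:
        targets["S3Targets"] = [{"Path": path} for path in s3_paths]
    if dynamodb_tables:
        targets["DynamoDBTargets"] = [{"Path": table} for table in dynamodb_tables]
    if not targets:
        # The API description requires at least one crawl target.
        raise ValueError("at least one crawl target must be specified")

    request = {
        "Name": name,
        "Role": role,
        "DatabaseName": database_name,
        "Targets": targets,
    }
    if schedule:
        request["Schedule"] = schedule  # e.g. "cron(15 12 * * ? *)"
    if tags:
        request["Tags"] = dict(tags)
    return request
```

With a Glue client in hand, the result would be used as `client.create_crawler(**request)`. Keeping the request as a dict makes it easy to inspect or log before the call is made.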

CreateDevEndpoint (updated)
Changes (request)
{'Tags': {'string': 'string'}}

Creates a new DevEndpoint.

See also: AWS API Documentation

Request Syntax

client.create_dev_endpoint(
    EndpointName='string',
    RoleArn='string',
    SecurityGroupIds=[
        'string',
    ],
    SubnetId='string',
    PublicKey='string',
    PublicKeys=[
        'string',
    ],
    NumberOfNodes=123,
    ExtraPythonLibsS3Path='string',
    ExtraJarsS3Path='string',
    SecurityConfiguration='string',
    Tags={
        'string': 'string'
    }
)
type EndpointName

string

param EndpointName

[REQUIRED]

The name to be assigned to the new DevEndpoint.

type RoleArn

string

param RoleArn

[REQUIRED]

The IAM role for the DevEndpoint.

type SecurityGroupIds

list

param SecurityGroupIds

Security group IDs for the security groups to be used by the new DevEndpoint.

  • (string) --

type SubnetId

string

param SubnetId

The subnet ID for the new DevEndpoint to use.

type PublicKey

string

param PublicKey

The public key to be used by this DevEndpoint for authentication. This attribute is provided for backward compatibility, as the recommended attribute to use is public keys.

type PublicKeys

list

param PublicKeys

A list of public keys to be used by the DevEndpoints for authentication. The use of this attribute is preferred over a single public key because the public keys allow you to have a different private key per client.

Note

If you previously created an endpoint with a public key, you must remove that key to be able to set a list of public keys: call the UpdateDevEndpoint API with the public key content in the deletePublicKeys attribute, and the list of new keys in the addPublicKeys attribute.

  • (string) --

type NumberOfNodes

integer

param NumberOfNodes

The number of AWS Glue Data Processing Units (DPUs) to allocate to this DevEndpoint.

type ExtraPythonLibsS3Path

string

param ExtraPythonLibsS3Path

Path(s) to one or more Python libraries in an S3 bucket that should be loaded in your DevEndpoint. Multiple values must be complete paths separated by a comma.

Please note that only pure Python libraries can currently be used on a DevEndpoint. Libraries that rely on C extensions, such as the pandas Python data analysis library, are not yet supported.

type ExtraJarsS3Path

string

param ExtraJarsS3Path

Path to one or more Java Jars in an S3 bucket that should be loaded in your DevEndpoint.

type SecurityConfiguration

string

param SecurityConfiguration

The name of the SecurityConfiguration structure to be used with this DevEndpoint.

type Tags

dict

param Tags

The tags to use with this DevEndpoint. You may use tags to limit access to the DevEndpoint. For more information about tags in AWS Glue, see AWS Tags in AWS Glue in the developer guide.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{
    'EndpointName': 'string',
    'Status': 'string',
    'SecurityGroupIds': [
        'string',
    ],
    'SubnetId': 'string',
    'RoleArn': 'string',
    'YarnEndpointAddress': 'string',
    'ZeppelinRemoteSparkInterpreterPort': 123,
    'NumberOfNodes': 123,
    'AvailabilityZone': 'string',
    'VpcId': 'string',
    'ExtraPythonLibsS3Path': 'string',
    'ExtraJarsS3Path': 'string',
    'FailureReason': 'string',
    'SecurityConfiguration': 'string',
    'CreatedTimestamp': datetime(2015, 1, 1)
}

Response Structure

  • (dict) --

    • EndpointName (string) --

      The name assigned to the new DevEndpoint.

    • Status (string) --

      The current status of the new DevEndpoint.

    • SecurityGroupIds (list) --

      The security groups assigned to the new DevEndpoint.

      • (string) --

    • SubnetId (string) --

      The subnet ID assigned to the new DevEndpoint.

    • RoleArn (string) --

      The AWS ARN of the role assigned to the new DevEndpoint.

    • YarnEndpointAddress (string) --

      The address of the YARN endpoint used by this DevEndpoint.

    • ZeppelinRemoteSparkInterpreterPort (integer) --

      The Apache Zeppelin port for the remote Apache Spark interpreter.

    • NumberOfNodes (integer) --

      The number of AWS Glue Data Processing Units (DPUs) allocated to this DevEndpoint.

    • AvailabilityZone (string) --

      The AWS availability zone where this DevEndpoint is located.

    • VpcId (string) --

      The ID of the VPC used by this DevEndpoint.

    • ExtraPythonLibsS3Path (string) --

      Path(s) to one or more Python libraries in an S3 bucket that will be loaded in your DevEndpoint.

    • ExtraJarsS3Path (string) --

      Path to one or more Java Jars in an S3 bucket that will be loaded in your DevEndpoint.

    • FailureReason (string) --

      The reason for a current failure in this DevEndpoint.

    • SecurityConfiguration (string) --

      The name of the SecurityConfiguration structure being used with this DevEndpoint.

    • CreatedTimestamp (datetime) --

      The point in time at which this DevEndpoint was created.
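
The following minimal sketch shows one way to assemble a create_dev_endpoint request that uses the preferred PublicKeys list and joins several Python library paths with commas, as the ExtraPythonLibsS3Path description requires. The helper name `build_dev_endpoint_request` and its parameters are hypothetical.

```python
def build_dev_endpoint_request(endpoint_name, role_arn, public_keys=(),
                               python_lib_paths=(), number_of_nodes=None,
                               tags=None):
    """Build keyword arguments for client.create_dev_endpoint(**request)."""
    request = {"EndpointName": endpoint_name, "RoleArn": role_arn}
    if public_keys:
        # PublicKeys is preferred over the single PublicKey attribute.
        request["PublicKeys"] = list(public_keys)
    if python_lib_paths:
        # Multiple values must be complete paths separated by a comma.
        request["ExtraPythonLibsS3Path"] = ",".join(python_lib_paths)
    if number_of_nodes is not None:
        request["NumberOfNodes"] = number_of_nodes
    if tags:
        request["Tags"] = dict(tags)
    return request
```

The returned dict would be passed as `client.create_dev_endpoint(**request)`, where `client` is assumed to be a boto3 Glue client.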

CreateJob (updated)
Changes (request)
{'Tags': {'string': 'string'}}

Creates a new job definition.

See also: AWS API Documentation

Request Syntax

client.create_job(
    Name='string',
    Description='string',
    LogUri='string',
    Role='string',
    ExecutionProperty={
        'MaxConcurrentRuns': 123
    },
    Command={
        'Name': 'string',
        'ScriptLocation': 'string'
    },
    DefaultArguments={
        'string': 'string'
    },
    Connections={
        'Connections': [
            'string',
        ]
    },
    MaxRetries=123,
    AllocatedCapacity=123,
    Timeout=123,
    MaxCapacity=123.0,
    NotificationProperty={
        'NotifyDelayAfter': 123
    },
    SecurityConfiguration='string',
    Tags={
        'string': 'string'
    }
)
type Name

string

param Name

[REQUIRED]

The name you assign to this job definition. It must be unique in your account.

type Description

string

param Description

Description of the job being defined.

type LogUri

string

param LogUri

This field is reserved for future use.

type Role

string

param Role

[REQUIRED]

The name or ARN of the IAM role associated with this job.

type ExecutionProperty

dict

param ExecutionProperty

An ExecutionProperty specifying the maximum number of concurrent runs allowed for this job.

  • MaxConcurrentRuns (integer) --

    The maximum number of concurrent runs allowed for the job. The default is 1. An error is returned when this threshold is reached. The maximum value you can specify is controlled by a service limit.

type Command

dict

param Command

[REQUIRED]

The JobCommand that executes this job.

  • Name (string) --

    The name of the job command: this must be glueetl for an Apache Spark ETL job, or pythonshell for a Python shell job.

  • ScriptLocation (string) --

    Specifies the S3 path to a script that executes a job (required).

type DefaultArguments

dict

param DefaultArguments

The default arguments for this job.

You can specify arguments here that your own job-execution script consumes, as well as arguments that AWS Glue itself consumes.

For information about how to specify and consume your own Job arguments, see the Calling AWS Glue APIs in Python topic in the developer guide.

For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special Parameters Used by AWS Glue topic in the developer guide.

  • (string) --

    • (string) --

type Connections

dict

param Connections

The connections used for this job.

  • Connections (list) --

    A list of connections used by the job.

    • (string) --

type MaxRetries

integer

param MaxRetries

The maximum number of times to retry this job if it fails.

type AllocatedCapacity

integer

param AllocatedCapacity

This parameter is deprecated. Use MaxCapacity instead.

The number of AWS Glue data processing units (DPUs) to allocate to this Job. From 2 to 100 DPUs can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

type Timeout

integer

param Timeout

The job timeout in minutes. This is the maximum time that a job run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).

type MaxCapacity

float

param MaxCapacity

The number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

The value that can be allocated for MaxCapacity depends on whether you are running a Python shell job or an Apache Spark ETL job:

  • When you specify a Python shell job (JobCommand.Name="pythonshell"), you can allocate either 0.0625 or 1 DPU. The default is 0.0625 DPU.

  • When you specify an Apache Spark ETL job (JobCommand.Name="glueetl"), you can allocate from 2 to 100 DPUs. The default is 10 DPUs. This job type cannot have a fractional DPU allocation.

type NotificationProperty

dict

param NotificationProperty

Specifies configuration properties of a job notification.

  • NotifyDelayAfter (integer) --

    After a job run starts, the number of minutes to wait before sending a job run delay notification.

type SecurityConfiguration

string

param SecurityConfiguration

The name of the SecurityConfiguration structure to be used with this job.

type Tags

dict

param Tags

The tags to use with this job. You may use tags to limit access to the job. For more information about tags in AWS Glue, see AWS Tags in AWS Glue in the developer guide.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) --

    • Name (string) --

      The unique name that was provided for this job definition.
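
The MaxCapacity rules given for the two job command names can be checked before create_job is called. This is a minimal sketch: the helper names `validate_max_capacity` and `build_job_request` are hypothetical, and the checks only restate the limits in the parameter description above, so the service may apply further limits.

```python
def validate_max_capacity(command_name, max_capacity):
    """Return max_capacity as a float if it is allowed for the command name."""
    if command_name == "pythonshell":
        if max_capacity not in (0.0625, 1):
            raise ValueError("pythonshell jobs accept 0.0625 or 1 DPU")
    elif command_name == "glueetl":
        is_whole = float(max_capacity) == int(max_capacity)
        if not is_whole or not 2 <= max_capacity <= 100:
            raise ValueError("glueetl jobs accept a whole number from 2 to 100 DPUs")
    else:
        raise ValueError("command name must be 'glueetl' or 'pythonshell'")
    return float(max_capacity)


def build_job_request(name, role, script_location, command_name="glueetl",
                      max_capacity=None, tags=None):
    """Build keyword arguments for client.create_job(**request)."""
    request = {
        "Name": name,
        "Role": role,
        "Command": {"Name": command_name, "ScriptLocation": script_location},
    }
    if max_capacity is not None:
        request["MaxCapacity"] = validate_max_capacity(command_name, max_capacity)
    if tags:
        request["Tags"] = dict(tags)
    return request
```

The sketch sets only MaxCapacity and leaves out the deprecated AllocatedCapacity parameter.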

CreateTrigger (updated)
Changes (request)
{'Tags': {'string': 'string'}}

Creates a new trigger.

See also: AWS API Documentation

Request Syntax

client.create_trigger(
    Name='string',
    Type='SCHEDULED'|'CONDITIONAL'|'ON_DEMAND',
    Schedule='string',
    Predicate={
        'Logical': 'AND'|'ANY',
        'Conditions': [
            {
                'LogicalOperator': 'EQUALS',
                'JobName': 'string',
                'State': 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT'
            },
        ]
    },
    Actions=[
        {
            'JobName': 'string',
            'Arguments': {
                'string': 'string'
            },
            'Timeout': 123,
            'NotificationProperty': {
                'NotifyDelayAfter': 123
            },
            'SecurityConfiguration': 'string'
        },
    ],
    Description='string',
    StartOnCreation=True|False,
    Tags={
        'string': 'string'
    }
)
type Name

string

param Name

[REQUIRED]

The name of the trigger.

type Type

string

param Type

[REQUIRED]

The type of the new trigger.

type Schedule

string

param Schedule

A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *) .

This field is required when the trigger type is SCHEDULED.

type Predicate

dict

param Predicate

A predicate to specify when the new trigger should fire.

This field is required when the trigger type is CONDITIONAL.

  • Logical (string) --

    Optional field if only one condition is listed. If multiple conditions are listed, then this field is required.

  • Conditions (list) --

    A list of the conditions that determine when the trigger will fire.

    • (dict) --

      Defines a condition under which a trigger fires.

      • LogicalOperator (string) --

        A logical operator.

      • JobName (string) --

        The name of the job whose job runs this condition applies to, and on which this trigger waits.

      • State (string) --

        The condition state. Currently, the values supported are SUCCEEDED, STOPPED, TIMEOUT and FAILED.

type Actions

list

param Actions

[REQUIRED]

The actions initiated by this trigger when it fires.

  • (dict) --

    Defines an action to be initiated by a trigger.

    • JobName (string) --

      The name of a job to be executed.

    • Arguments (dict) --

      The job arguments used when this trigger fires. For this job run, they replace the default arguments set in the job definition itself.

      You can specify arguments here that your own job-execution script consumes, as well as arguments that AWS Glue itself consumes.

      For information about how to specify and consume your own Job arguments, see the Calling AWS Glue APIs in Python topic in the developer guide.

      For information about the key-value pairs that AWS Glue consumes to set up your job, see the Special Parameters Used by AWS Glue topic in the developer guide.

      • (string) --

        • (string) --

    • Timeout (integer) --

      The JobRun timeout in minutes. This is the maximum time that a job run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours). This overrides the timeout value set in the parent job.

    • NotificationProperty (dict) --

      Specifies configuration properties of a job run notification.

      • NotifyDelayAfter (integer) --

        After a job run starts, the number of minutes to wait before sending a job run delay notification.

    • SecurityConfiguration (string) --

      The name of the SecurityConfiguration structure to be used with this action.

type Description

string

param Description

A description of the new trigger.

type StartOnCreation

boolean

param StartOnCreation

Set to true to start SCHEDULED and CONDITIONAL triggers when created. True is not supported for ON_DEMAND triggers.

type Tags

dict

param Tags

The tags to use with this trigger. You may use tags to limit access to the trigger. For more information about tags in AWS Glue, see AWS Tags in AWS Glue in the developer guide.

  • (string) --

    • (string) --

rtype

dict

returns

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) --

    • Name (string) --

      The name of the trigger.
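
The conditional requirements stated for the parameters above (Schedule for SCHEDULED triggers, Predicate for CONDITIONAL triggers, Logical when several conditions are listed, and no StartOnCreation for ON_DEMAND triggers) can be sketched as client-side checks. The helper name `build_trigger_request` is hypothetical, and the checks only restate the parameter descriptions, so they do not replace service-side validation.

```python
def build_trigger_request(name, trigger_type, job_names, schedule=None,
                          conditions=None, logical=None,
                          start_on_creation=False, tags=None):
    """Build keyword arguments for client.create_trigger(**request)."""
    if trigger_type not in ("SCHEDULED", "CONDITIONAL", "ON_DEMAND"):
        raise ValueError("unknown trigger type")
    if trigger_type == "SCHEDULED" and not schedule:
        raise ValueError("Schedule is required for SCHEDULED triggers")
    if trigger_type == "CONDITIONAL" and not conditions:
        raise ValueError("Predicate is required for CONDITIONAL triggers")
    if conditions and len(conditions) > 1 and logical is None:
        raise ValueError("Logical is required when multiple conditions are listed")
    if trigger_type == "ON_DEMAND" and start_on_creation:
        raise ValueError("StartOnCreation=True is not supported for ON_DEMAND triggers")

    request = {
        "Name": name,
        "Type": trigger_type,
        "Actions": [{"JobName": job_name} for job_name in job_names],
        "StartOnCreation": start_on_creation,
    }
    if schedule:
        request["Schedule"] = schedule
    if conditions:
        predicate = {"Conditions": list(conditions)}
        if logical:
            predicate["Logical"] = logical
        request["Predicate"] = predicate
    if tags:
        request["Tags"] = dict(tags)
    return request
```

For example, a daily trigger could be built with `schedule='cron(15 12 * * ? *)'` and then passed as `client.create_trigger(**request)`, where `client` is assumed to be a boto3 Glue client.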