Configuring the ADLS Gen2 Sink Connector

Overview

The settings of the Azure Data Lake Storage Gen2 Sink connector are grouped into the following categories.

  • Azure Connection
    The settings to control the target Azure Data Lake Storage account and container, as well as the Azure client retry options.

  • Account Key / Access Key Authentication
    The settings needed to use Account Key based authentication.

  • Client Secret Authentication
    The settings needed to use Azure AD Client Secret based authentication.

  • Retries
    The settings to control retry and failure handling logic.

  • Container File
    The settings determining the staging and target locations for the Avro Object Container file, as well as file rotation settings.

  • Offset Commit
    The settings determining when the offsets of processed records are committed to Kafka.

To find out how to configure a connector in Axual Self Service, see Starting Connectors.

Azure Connection

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| adls.endpoint | String | <null> | The URL used to connect to the storage service. It usually looks like https://<account name>.dfs.core.windows.net |
| adls.container.name | String | <null> | The name of the container in the storage account |
| adls.auth.method | String | AccountKey | The authentication method for Azure. Available options are AccountKey and ClientSecret |
| adls.client.timeout.seconds | Integer | 15 | The maximum number of seconds the Azure Data Lake Storage client will wait for a call to return before failing |
| adls.client.retry.maximum.tries | Integer | 4 | The maximum number of times the Azure Data Lake Storage client will retry a call before failing |
| adls.client.retry.interval | Long | 10000 | The number of milliseconds the Azure Data Lake Storage client will wait before retrying. When exponential retry is enabled, this value is doubled for each retry up to the maximum retry interval |
| adls.client.retry.exponential | Boolean | false | Selects whether the Azure Data Lake Storage client will use exponential backoff for retries |
| adls.client.retry.maximum.interval | Long | 60000 | The maximum number of milliseconds the Azure Data Lake Storage client will wait before retrying when using exponential backoff |
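
As an illustration, the Azure Connection settings could be collected as shown here. This is a minimal sketch and not part of the connector: the helper `azure_connection_settings`, the account name `examplestorage` and the container name `events` are hypothetical, and all values are written as strings on the assumption that the configuration is submitted as a flat key/value map.

```python
import json


def azure_connection_settings(account_name: str, container: str) -> dict:
    """Build the Azure Connection part of a connector configuration.

    Hypothetical helper for illustration; only the keys come from the
    Azure Connection settings documented above.
    """
    return {
        # Usual endpoint form: https://<account name>.dfs.core.windows.net
        "adls.endpoint": f"https://{account_name}.dfs.core.windows.net",
        "adls.container.name": container,
        "adls.auth.method": "AccountKey",
        # The remaining values repeat the documented defaults.
        "adls.client.timeout.seconds": "15",
        "adls.client.retry.maximum.tries": "4",
        "adls.client.retry.interval": "10000",
        "adls.client.retry.exponential": "false",
        "adls.client.retry.maximum.interval": "60000",
    }


settings = azure_connection_settings("examplestorage", "events")
print(json.dumps(settings, indent=2))
```

Settings that keep their default value can normally be left out; they are listed here only to show the complete category.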

Account Key / Access Key Authentication

The account key method uses an access key of the storage account to access the resources.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| adls.account.name | String | <null> | The name of the Azure Data Lake Storage account |
| adls.account.key | String | <null> | One of the access keys of the Azure Data Lake Storage account. These can be found on the Azure portal page of the account |

Client Secret Authentication

Client secret authentication uses a secret of an Azure Active Directory user or application registration.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| adls.tenant.id | String | <null> | The id of the Azure Tenant for the Azure AD user/application registration |
| adls.client.id | String | <null> | The id of the client in the Azure AD user/application registration |
| adls.client.secret | String | <null> | The secret for the client in the Azure AD user/application registration |
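
Because the required settings depend on the value of adls.auth.method, a configuration can be checked before it is submitted. The sketch below is an illustration only: the function `missing_auth_settings` is hypothetical, and it assumes that each method needs exactly the settings listed in its own category. The connector itself may validate differently.

```python
# Assumption: each authentication method needs exactly the settings listed
# in its own category; the connector may apply other rules.
REQUIRED_BY_METHOD = {
    "AccountKey": ["adls.account.name", "adls.account.key"],
    "ClientSecret": ["adls.tenant.id", "adls.client.id", "adls.client.secret"],
}


def missing_auth_settings(config: dict) -> list:
    """Return the authentication keys that are absent or empty in config."""
    method = config.get("adls.auth.method", "AccountKey")  # documented default
    if method not in REQUIRED_BY_METHOD:
        raise ValueError(f"Unknown adls.auth.method: {method}")
    return [key for key in REQUIRED_BY_METHOD[method] if not config.get(key)]


example = {
    "adls.auth.method": "ClientSecret",
    "adls.tenant.id": "00000000-0000-0000-0000-000000000000",  # placeholder
    "adls.client.id": "11111111-1111-1111-1111-111111111111",  # placeholder
}
print(missing_auth_settings(example))
```

In this example the client secret is missing, so the returned list contains adls.client.secret.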

Retries

The number of retries and wait intervals can be configured to prevent immediate failures and allow the connector to survive small interruptions in network and storage services.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| retry.maximum.tries | Integer | 10 | The maximum number of times to retry an action before failing |
| retry.interval | Long | 500 | The number of milliseconds to wait before retrying |
| retry.exponential | Boolean | true | Use exponential backoff for retries. This doubles the retry interval for each subsequent retry |
| retry.maximum.interval | Long | 15000 | The maximum number of milliseconds to wait before retrying when using exponential backoff |
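
To get a feeling for how long the connector keeps retrying with a given combination of these settings, the wait times can be computed by hand. The following sketch does that for the default values. It assumes that the first retry waits for retry.interval and that each later wait doubles until it is capped at retry.maximum.interval; the function name `retry_waits` is hypothetical and the connector's exact timing may differ.

```python
def retry_waits(max_tries: int = 10, interval_ms: int = 500,
                exponential: bool = True, max_interval_ms: int = 15000) -> list:
    """Return the wait in milliseconds before each retry.

    Assumption: the first retry waits interval_ms, and with exponential
    backoff every later wait doubles until it is capped at max_interval_ms.
    """
    waits = []
    wait = interval_ms
    for _ in range(max_tries):
        waits.append(wait)
        if exponential:
            wait = min(wait * 2, max_interval_ms)
    return waits


waits = retry_waits()
print(waits)       # one entry per retry
print(sum(waits))  # total time spent waiting, in milliseconds
```

Under these assumptions the defaults add up to roughly 90 seconds of waiting before the action fails, which gives an indication of the length of interruption the connector can survive.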

Container File

Processed records are stored in Avro Object Container files in the staging directory. Files are moved, or rotated, to the target directory when a specific state is detected.
Each file contains the records of a single partition.

A file is currently rotated to the target directory in these situations:

  • A target directory change is detected when a timestamp pattern is in use.

  • The maximum number of records per file is reached.

  • The maximum file size is exceeded.

  • The inactivity time is reached. This rotation occurs when a container file with records is stored in staging, but no records have been processed for a specific time.
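
The rotation triggers listed above can be sketched as a single decision function. This is an illustration under stated assumptions, not the connector's implementation: the `StagingFile` class, its field names and the order in which the triggers are checked are hypothetical. The default arguments mirror the documented defaults of rotation.record.count, rotation.filesize and rotation.inactivity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StagingFile:
    """Hypothetical view of one staging file; field names are illustrative."""
    target_directory: str  # target directory resolved when the file was opened
    record_count: int      # records written so far
    size_bytes: int        # current file size
    last_record_ms: int    # time the last record was written, epoch millis


def rotation_reason(f: StagingFile, resolved_target: str, now_ms: int,
                    max_records: int = 100, max_bytes: int = 100_000_000,
                    inactivity_ms: int = 1_800_000) -> Optional[str]:
    """Return the first matching rotation trigger, or None to keep writing."""
    if resolved_target != f.target_directory:
        return "target directory changed"
    if f.record_count >= max_records:
        return "record count reached"
    if f.size_bytes >= max_bytes:
        return "file size reached"
    # Inactivity only applies to files that already contain records.
    if f.record_count > 0 and now_ms - f.last_record_ms >= inactivity_ms:
        return "inactivity time reached"
    return None


day_31 = "target/year=2024/month=03/day=31"
day_01 = "target/year=2024/month=04/day=01"
f = StagingFile(day_31, record_count=42, size_bytes=1_000_000, last_record_ms=0)
print(rotation_reason(f, day_31, now_ms=60_000))
print(rotation_reason(f, day_01, now_ms=60_000))
print(rotation_reason(f, day_31, now_ms=1_800_000))
```

The three calls show a file that stays in staging, a rotation caused by a new target directory, and a rotation caused by inactivity.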

You can control the file locations and rotation triggers using these configuration options:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| base.directory | String | <empty string> | The base path for the files. The target and staging paths use this directory as root. The base directory will be created if it does not exist |
| staging.directory | String | staging | The directory where the Avro container files will be created before being moved to the target directory |
| target.directory | String | target | The directory pattern where the finalized Avro container files will be loaded. To separate files based on time, this setting supports a time format. The value target/year={yyyy}/month={MM}/day={dd} will store |
| compression | String | <null> | Set the compression type for the container file. Valid values are snappy and <null> |
| sync.interval | Integer | 64000 | The approximate number of uncompressed bytes to write in each Avro block. A higher number results in fewer calls to Azure Data Lake Storage, better compression and higher throughput. Valid values range from 32 to 2^30. Suggested values are between 2K and 2M. Lower values can result in lower throughput, as the block is written in a synchronous call to Azure |
| rotation.time.zone | String | UTC | The Java ZoneID name to use when determining the target directory when a time format is used. This can be a value like GMT, UTC, Europe/Amsterdam or UTC+1:00 |
| rotation.time.source | String | processed | The source of the timestamp used for time based rotation. Use processed to use the time when the connector is processing the record. Use produced to use the record timestamp for the file rotation |
| rotation.record.count | Integer | 100 | The maximum number of records that each Avro container file can contain before rotating to the next file |
| rotation.inactivity | Long | 1800000 | The number of milliseconds of inactivity to wait for new incoming records before rotating to a new file. This prevents a file from remaining in staging when no new data comes in |
| rotation.filesize | Long | 100000000 | The maximum file size of the container file. A file rotation will take place when this limit is reached |
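
The interaction between the target.directory pattern and rotation.time.zone is easiest to see with a timestamp close to midnight. The sketch below resolves the example pattern for one timestamp in two time zones. It is an approximation written for illustration: the function `resolve_target_directory` is hypothetical, only the tokens yyyy, MM, dd and HH are handled, and fixed offsets are used instead of named zones to keep the example free of time zone data dependencies.

```python
import re
from datetime import datetime, timedelta, timezone, tzinfo

# Assumption: only these pattern tokens are handled in this sketch.
TOKEN_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def resolve_target_directory(pattern: str, timestamp: datetime, zone: tzinfo) -> str:
    """Replace each {token} in pattern using timestamp converted to zone."""
    local = timestamp.astimezone(zone)

    def substitute(match: re.Match) -> str:
        token = match.group(1)
        if token not in TOKEN_TO_STRFTIME:
            raise ValueError(f"Token not supported by this sketch: {token}")
        return local.strftime(TOKEN_TO_STRFTIME[token])

    return re.sub(r"\{([^}]+)\}", substitute, pattern)


pattern = "target/year={yyyy}/month={MM}/day={dd}"
ts = datetime(2024, 3, 31, 23, 30, tzinfo=timezone.utc)

utc_dir = resolve_target_directory(pattern, ts, timezone.utc)
plus_one_dir = resolve_target_directory(pattern, ts, timezone(timedelta(hours=1)))
print(utc_dir)
print(plus_one_dir)
```

The same record lands in the directory for 31 March when UTC is used and in the directory for 1 April with an offset of one hour, so the chosen time zone also decides when a target directory change triggers a rotation.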

Offset Commit

These settings control when a record will be included as part of the offset commit flow.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| commit.rotated.only | Boolean | true | If set to true, only the offsets of the records in a rotated file are committed. When false, the offsets of records in staging files are committed as well |
| commit.record.count | Integer | 100 | The maximum number of records processed by a task before requesting an offset commit. This applies when the setting to commit only rotated files is disabled |
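
The effect of the two settings can be summarised in a small sketch. It is a simplified reading of the descriptions above and not the connector's code: the function names are hypothetical, offsets are tracked for a single partition, and it assumes that commit.record.count has no effect while commit.rotated.only is true.

```python
from typing import Optional


def committable_offset(last_rotated_offset: Optional[int],
                       last_staged_offset: Optional[int],
                       rotated_only: bool = True) -> Optional[int]:
    """Return the highest record offset that may be committed for one partition.

    An offset is None when no record has reached that stage yet.
    """
    if rotated_only:
        # Records that are still in a staging file are not committed.
        return last_rotated_offset
    # Otherwise records already written to a staging file count as well.
    candidates = [o for o in (last_rotated_offset, last_staged_offset) if o is not None]
    return max(candidates) if candidates else None


def should_request_commit(records_since_commit: int, rotated_only: bool = True,
                          commit_record_count: int = 100) -> bool:
    """Assumed reading: the record count only matters when rotated_only is False."""
    return (not rotated_only) and records_since_commit >= commit_record_count


print(committable_offset(199, 250))                      # rotated files only
print(committable_offset(199, 250, rotated_only=False))  # staging included
```

With the default commit.rotated.only=true, the records in the staging file stay uncommitted until that file is rotated.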