Configuring the ADLS Gen2 Sink Connector
Overview
The Azure Data Lake Storage Gen2 Sink connector can be configured in the following categories:

- Azure Connection: the settings to control the target Azure Data Lake Storage account and container, as well as the Azure client retry options
- Account Key / Access Key Authentication: the settings needed to use Account Key based authentication
- Client Secret Authentication: the settings needed to use Azure AD Client Secret based authentication
- Retries: the settings to control retry and failure handling logic
- Converter Configuration: the converter settings to provide under the common plugin configuration
- Container File: the settings determining the staging and target locations for the Avro Object Container files, as well as file rotation
- Offset Commit: the settings determining when the offsets of processed records are committed to Kafka

To find out how to configure a connector in Axual Self Service, see Starting Connectors.
Azure Connection
Key | Type | Default | Description |
---|---|---|---|
 | String | <null> | The URL to connect to the storage service, usually of the form https://<account name>.dfs.core.windows.net |
 | String | <null> | The name of the container in the storage account |
 | String | AccountKey | The authentication method for Azure. Available options are AccountKey and ClientSecret |
 | Integer | 15 | The maximum number of seconds the Azure Data Lake Storage client will wait for a call to return before failing |
 | Integer | 4 | The maximum number of times the Azure Data Lake Storage client will retry a call before failing |
 | Long | 10000 | The number of milliseconds the Azure Data Lake Storage client will wait before retrying. When exponential retry is enabled, this value is doubled for each retry, up to the maximum retry timeout |
 | Boolean | false | Whether the Azure Data Lake Storage client uses exponential backoff for retries |
 | Long | 60000 | The maximum number of milliseconds the Azure Data Lake Storage client will wait between retries when using exponential backoff |
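As an illustration, enabling exponential backoff in the Azure client could look like the sketch below. The property names are hypothetical placeholders, not the connector's actual configuration keys; only the values and their meaning follow the table above.

```properties
# Hypothetical property names, for illustration only.
azure.endpoint=https://mystorageaccount.dfs.core.windows.net
azure.container.name=kafka-sink
azure.authentication.method=AccountKey
azure.client.timeout.seconds=15
azure.client.max.retries=4
azure.client.retry.interval.ms=10000
azure.client.retry.exponential=true
# With exponential retry enabled the waits double per retry:
# 10000, 20000, 40000 ms, then capped at the 60000 ms maximum below.
azure.client.retry.max.interval.ms=60000
```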
Account Key / Access Key Authentication
The account key method uses an Access Key for the Storage account to access the resources.
Key | Type | Default | Description |
---|---|---|---|
 | String | <null> | The name of the Azure Data Lake Storage account |
 | String | <null> | One of the access keys of the Azure Data Lake Storage account. These can be found on the Azure portal page of the account |
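A minimal sketch of account key authentication, again with hypothetical property names:

```properties
# Hypothetical property names, for illustration only.
azure.account.name=mystorageaccount
# One of the access keys shown on the storage account's Azure portal page.
azure.account.key=<access key>
```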
Client Secret Authentication
The client secret authentication uses a secret of an Azure Active Directory user or application registration.
Key | Type | Default | Description |
---|---|---|---|
 | String | <null> | The id of the Azure tenant for the Azure AD user/application registration |
 | String | <null> | The id of the client in the Azure AD user/application registration |
 | String | <null> | The secret for the client in the Azure AD user/application registration |
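A corresponding sketch for client secret authentication; the property names are hypothetical placeholders, while the tenant id, client id and secret come from the Azure AD user/application registration:

```properties
# Hypothetical property names, for illustration only.
azure.authentication.method=ClientSecret
azure.tenant.id=00000000-0000-0000-0000-000000000000
azure.client.id=11111111-1111-1111-1111-111111111111
azure.client.secret=<client secret>
```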
Retries
The number of retries and wait intervals can be configured to prevent immediate failures and allow the connector to survive small interruptions in network and storage services.
Key | Type | Default | Description |
---|---|---|---|
 | Integer | 10 | The maximum number of times to retry an action before failing |
 | Long | 500 | The number of milliseconds to wait before retrying |
 | Boolean | true | Use exponential backoff for retries. This doubles the retry interval for each subsequent retry |
 | Long | 15000 | The maximum number of milliseconds to wait between retries when using exponential backoff |
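As a worked example of the backoff logic, again with hypothetical property names: with the default values below, the waits between retries are 500, 1000, 2000, 4000 and 8000 ms, after which every further wait is capped at 15000 ms.

```properties
# Hypothetical property names, for illustration only.
max.retries=10
retry.interval.ms=500
# Doubled per retry: 500, 1000, 2000, 4000, 8000, then capped at 15000 ms.
retry.exponential=true
retry.max.interval.ms=15000
```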
Converter Configuration
Provide the following under the common plugin configuration.

Key | Value | Description |
---|---|---|
 | | This converter attempts to create string data from all headers, which may result in unexpected characters. |
 | | A custom Avro converter is needed to read from Avro topics while keeping the schema. It is advised to use explicit converters, because otherwise the system defaults are used, and these can be changed by operators. |

The following converters can also be used as key or value converter if the data on the topic is in the matching format:

- org.apache.kafka.connect.converters.ByteArrayConverter
- org.apache.kafka.connect.converters.DoubleConverter
- org.apache.kafka.connect.converters.FloatConverter
- org.apache.kafka.connect.converters.IntegerConverter
- org.apache.kafka.connect.converters.LongConverter
- org.apache.kafka.connect.storage.StringConverter
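For example, explicit converter settings for a topic with string keys and raw byte values could look as follows; key.converter, value.converter and header.converter are the standard Kafka Connect property names.

```properties
# Standard Kafka Connect converter properties, set explicitly so that
# changed system defaults cannot silently alter the output format.
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
# Renders all header data as strings, which may yield unexpected characters.
header.converter=org.apache.kafka.connect.storage.StringConverter
```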
Container File
Processed records are stored in Avro Object Container files in the staging directory. Files are moved, or rotated, to the target directory when a specific state is detected. Each file contains the records of a single partition.

These are the current situations in which a file is rotated to the target directory:

- A target directory change is detected while a timestamp pattern is in use
- The maximum number of records per file is reached
- The maximum file size is exceeded
- The inactivity time is reached. This rotation occurs when a container file with records is stored in staging, but no records have been processed for a specific time

You can control the file locations and rotation triggers using these configuration options:
Key | Type | Default | Description |
---|---|---|---|
 | String | <empty string> | The base path for the files. The target and staging paths use this directory as their root. The base directory will be created if it does not exist |
 | String | staging | The directory where the Avro container files are created before being moved to the target directory |
 | String | target | The directory pattern where the finalized Avro container files are placed. To separate files based on time, this setting supports a time format: the value target/year={yyyy}/month={MM}/day={dd} will store the files in a separate directory per day |
 | String | <null> | The compression type for the container file. Valid values are snappy and <null> |
 | Integer | 64000 | The approximate number of uncompressed bytes to write in each Avro block. A higher number results in fewer calls to Azure Data Lake Storage, better compression and higher throughput. Valid values range from 32 to 2^30; suggested values are between 2K and 2M. Lower values can result in lower throughput, as each block is written in a synchronous call to Azure |
 | String | UTC | The Java ZoneId name to use when determining the target directory when a time format is used. This can be a value like GMT, UTC, Europe/Amsterdam or UTC+01:00 |
 | String | processed | The source of the timestamp used for time based rotation. Use processed to use the time at which the connector processes the record; use produced to use the record timestamp for the file rotation |
 | Integer | 100 | The maximum number of records each Avro container file can have before rotating to the next file |
 | Long | 1800000 | The number of milliseconds of inactivity to wait for new incoming records before rotating to a new file. This prevents a file from remaining in staging when no new data comes in |
 | Long | 100000000 | The maximum file size of the container file. A file rotation takes place when this limit is reached |
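To illustrate the time format: assuming the hypothetical property names below (placeholders, not the connector's actual keys), a record rotated with timestamp 2024-03-05T10:15 in the Europe/Amsterdam zone would be finalized under target/year=2024/month=03/day=05.

```properties
# Hypothetical property names, for illustration only.
file.target.directory=target/year={yyyy}/month={MM}/day={dd}
file.timezone=Europe/Amsterdam
# Use the record (produce) timestamp rather than processing time.
file.timestamp.source=produced
```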
Offset Commit
These settings control when a record will be included as part of the offset commit flow.
Key | Type | Default | Description |
---|---|---|---|
 | Boolean | true | If set to true, only the offsets of records in rotated files are committed. When false, the offsets of records in staging files are committed as well |
 | Integer | 100 | The maximum number of records processed by a task before requesting an offset commit, when the setting to commit only rotated files is disabled |
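Finally, a sketch with hypothetical property names for committing the offsets of staging files as well, requesting a commit at most every 500 processed records:

```properties
# Hypothetical property names, for illustration only.
commit.rotated.only=false
commit.max.records=500
```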