Configuring the ADLS Gen2 Sink Connector
Overview
The Azure Data Lake Storage Gen2 Sink connector can be configured in the following categories:

- Azure Connection: the settings to control the target Azure Data Lake Storage account and container, as well as the Azure client retry options
- Account Key / Access Key Authentication: the settings needed to use Account Key based authentication
- Client Secret Authentication: the settings needed to use Azure AD Client Secret based authentication
- Retries: the settings to control retry and failure handling logic
- Converter Configuration: the converter settings to provide under the common plugin configuration
- Container File: the settings determining the staging and target locations for the Avro Object Container files, as well as file rotation
- Offset Commit: the settings determining when the offsets of processed records are committed to Kafka

To find out how to configure a connector in Axual Self Service, see Starting Connectors.
Azure Connection
Key | Type | Default | Description |
---|---|---|---|
 | String | <null> | The URL to connect to the storage service, usually of the form https://<account name>.dfs.core.windows.net |
 | String | <null> | The name of the container in the storage account |
 | String | AccountKey | The authentication method for Azure. Available options are AccountKey and ClientSecret |
 | Integer | 15 | The maximum number of seconds the Azure Data Lake Storage client will wait for a call to return before failing |
 | Integer | 4 | The maximum number of times the Azure Data Lake Storage client will retry a call before failing |
 | Long | 10000 | The number of milliseconds the Azure Data Lake Storage client will wait before retrying. When exponential retry is enabled, this value is doubled for each retry, up to the maximum retry timeout |
 | Boolean | false | Whether the Azure Data Lake Storage client uses exponential backoff for retries |
 | Long | 60000 | The maximum number of milliseconds the Azure Data Lake Storage client will wait between retries when using exponential backoff |
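As an illustration, enabling exponential backoff in the Azure client could look like the sketch below. The property names are hypothetical placeholders, not the connector's actual configuration keys; only the values and their meaning follow the table above.

```properties
# Hypothetical property names, for illustration only.
azure.endpoint=https://mystorageaccount.dfs.core.windows.net
azure.container.name=kafka-sink
azure.authentication.method=AccountKey
azure.client.timeout.seconds=15
azure.client.max.retries=4
azure.client.retry.interval.ms=10000
azure.client.retry.exponential=true
# With exponential retry enabled the waits double per retry:
# 10000, 20000, 40000 ms, then capped at the 60000 ms maximum below.
azure.client.retry.max.interval.ms=60000
```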
Account Key / Access Key Authentication
The account key method uses an Access Key for the Storage account to access the resources.
Key | Type | Default | Description |
---|---|---|---|
 | String | <null> | The name of the Azure Data Lake Storage account |
 | String | <null> | One of the access keys of the Azure Data Lake Storage account. These can be found on the Azure portal page of the account |
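A minimal sketch of account key authentication, again with hypothetical property names:

```properties
# Hypothetical property names, for illustration only.
azure.account.name=mystorageaccount
# One of the access keys shown on the storage account's Azure portal page.
azure.account.key=<access key>
```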
Client Secret Authentication
The client secret authentication uses a secret of an Azure Active Directory user or application registration.
Key | Type | Default | Description |
---|---|---|---|
 | String | <null> | The id of the Azure tenant for the Azure AD user/application registration |
 | String | <null> | The id of the client in the Azure AD user/application registration |
 | String | <null> | The secret for the client in the Azure AD user/application registration |
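A corresponding sketch for client secret authentication; the property names are hypothetical placeholders, while the tenant id, client id and secret come from the Azure AD user/application registration:

```properties
# Hypothetical property names, for illustration only.
azure.authentication.method=ClientSecret
azure.tenant.id=00000000-0000-0000-0000-000000000000
azure.client.id=11111111-1111-1111-1111-111111111111
azure.client.secret=<client secret>
```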
Retries
The number of retries and wait intervals can be configured to prevent immediate failures and allow the connector to survive small interruptions in network and storage services.
Key | Type | Default | Description |
---|---|---|---|
 | Integer | 10 | The maximum number of times to retry an action before failing |
 | Long | 500 | The number of milliseconds to wait before retrying |
 | Boolean | true | Use exponential backoff for retries. This doubles the retry interval for each subsequent retry |
 | Long | 15000 | The maximum number of milliseconds to wait between retries when using exponential backoff |
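As a worked example of the backoff logic, again with hypothetical property names: with the default values below, the waits between retries are 500, 1000, 2000, 4000 and 8000 ms, after which every further wait is capped at 15000 ms.

```properties
# Hypothetical property names, for illustration only.
max.retries=10
retry.interval.ms=500
# Doubled per retry: 500, 1000, 2000, 4000, 8000, then capped at 15000 ms.
retry.exponential=true
retry.max.interval.ms=15000
```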
Converter Configuration
Provide the following under the common plugin configuration.

Key | Value | Description |
---|---|---|
 | | This converter attempts to create string data from all headers, which may result in unexpected characters. |
 | | A custom Avro converter is needed to read from Avro topics while keeping the schema. It is advised to use explicit converters, because otherwise the system defaults are used, and these can be changed by operators. |

The following converters can also be used as key or value converter if the data on the topic is in the matching format:

- org.apache.kafka.connect.converters.ByteArrayConverter
- org.apache.kafka.connect.converters.DoubleConverter
- org.apache.kafka.connect.converters.FloatConverter
- org.apache.kafka.connect.converters.IntegerConverter
- org.apache.kafka.connect.converters.LongConverter
- org.apache.kafka.connect.storage.StringConverter
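For example, explicit converter settings for a topic with string keys and raw byte values could look as follows; key.converter, value.converter and header.converter are the standard Kafka Connect property names.

```properties
# Standard Kafka Connect converter properties, set explicitly so that
# changed system defaults cannot silently alter the output format.
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
# Renders all header data as strings, which may yield unexpected characters.
header.converter=org.apache.kafka.connect.storage.StringConverter
```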
Container File
Processed records are stored in Avro Object Container files in the staging directory. Files are moved, or rotated, to the target directory when a specific state is detected. Each file contains the records of a single partition.

These are the current situations in which a file is rotated to the target directory:

- A target directory change is detected while a timestamp pattern is in use
- The maximum number of records per file is reached
- The maximum file size is exceeded
- The inactivity time is reached. This rotation occurs when a container file with records is stored in staging, but no records have been processed for a specific time

You can control the file locations and rotation triggers using these configuration options:
Key | Type | Default | Description |
---|---|---|---|
 | String | <empty string> | The base path for the files. The target and staging paths use this directory as their root. The base directory will be created if it does not exist |
 | String | staging | The directory where the Avro container files are created before being moved to the target directory |
 | String | target | The directory pattern where the finalized Avro container files are placed. To separate files based on time, this setting supports a time format: the value target/year={yyyy}/month={MM}/day={dd} will store the files in a separate directory per day |
 | String | <null> | The compression type for the container file. Valid values are snappy and <null> |
 | Integer | 64000 | The approximate number of uncompressed bytes to write in each Avro block. A higher number results in fewer calls to Azure Data Lake Storage, better compression and higher throughput. Valid values range from 32 to 2^30; suggested values are between 2K and 2M. Lower values can result in lower throughput, as each block is written in a synchronous call to Azure |
 | String | UTC | The Java ZoneId name to use when determining the target directory when a time format is used. This can be a value like GMT, UTC, Europe/Amsterdam or UTC+01:00 |
 | String | processed | The source of the timestamp used for time based rotation. Use processed to use the time at which the connector processes the record; use produced to use the record timestamp for the file rotation |
 | Integer | 100 | The maximum number of records each Avro container file can have before rotating to the next file |
 | Long | 1800000 | The number of milliseconds of inactivity to wait for new incoming records before rotating to a new file. This prevents a file from remaining in staging when no new data comes in |
 | Long | 100000000 | The maximum file size of the container file. A file rotation takes place when this limit is reached |
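To illustrate the time format: assuming the hypothetical property names below (placeholders, not the connector's actual keys), a record rotated with timestamp 2024-03-05T10:15 in the Europe/Amsterdam zone would be finalized under target/year=2024/month=03/day=05.

```properties
# Hypothetical property names, for illustration only.
file.target.directory=target/year={yyyy}/month={MM}/day={dd}
file.timezone=Europe/Amsterdam
# Use the record (produce) timestamp rather than processing time.
file.timestamp.source=produced
```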
Offset Commit
These settings control when a record will be included as part of the offset commit flow.
Key | Type | Default | Description |
---|---|---|---|
 | Boolean | true | If set to true, only the offsets of records in rotated files are committed. When false, the offsets of records in staging files are committed as well |
 | Integer | 100 | The maximum number of records processed by a task before requesting an offset commit, when the setting to commit only rotated files is disabled |
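Finally, a sketch with hypothetical property names for committing the offsets of staging files as well, requesting a commit at most every 500 processed records:

```properties
# Hypothetical property names, for illustration only.
commit.rotated.only=false
commit.max.records=500
```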