Step 1: Selecting Data Formats
Data formats in Kafka
Axual supports the following data formats for keys and values:

- String
- Binary
- JSON Schemaless
- XML Schemaless
- AVRO
- Protobuf
- JSON Schema
Once selected, the data format cannot be changed for a given Topic. It is therefore important to make this decision early in the design process, involving both producers and consumers to ensure aligned expectations.
Serialization and Schema Support
Schema-enforced formats are preferred because they make the data contract between producers and consumers explicit. A Schema Registry mediates and validates that schema structures are followed, even for complex object hierarchies. This ensures smooth interaction between producers and consumers, both at initial setup and as APIs evolve. With proper schema evolution and backward compatibility, disruptive migrations can be avoided.
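To make the backward-compatibility point concrete, here is a minimal sketch using the Avro Java library; the `User` record and its fields are hypothetical. A new schema version adds an optional field with a default, and Avro's compatibility checker confirms that a reader on the new schema can still decode data written with the old one.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class EvolutionCheck {
    public static void main(String[] args) {
        // Hypothetical v1 schema with a single required field.
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"}]}");

        // v2 adds an optional field with a default: a backward-compatible change.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        // A v2 reader can decode v1 data: the missing field falls back to its default.
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println(result.getType()); // prints COMPATIBLE
    }
}
```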
By contrast, semi-structured formats like JSON Schemaless and XML Schemaless rely on client-side schema handling. This is error-prone, since developers must handle validation themselves. As a result, major producer changes can break consumers at runtime when deserialization fails.
Picking a Data Type
Based on the arguments above, and since schema evolution is the key factor for future compatibility, we strongly recommend using one of the schema types: Avro, Protobuf, or JSON Schema.
Of the three options, Axual recommends Avro as the preferred schema type, as it is the most natural fit for Kafka. Avro has been supported in our platform since the early days and is also widely adopted across our customer base. It has also become the de facto choice for many Kafka-based solutions in the industry, making it easier to align with common practices.
Protobuf was developed by Google, primarily to enable high-performance gRPC integrations. It offers slight performance gains in exchange for somewhat weaker schema evolution support.
JSON Schema is the least preferred option, mostly because of its weak schema evolution capabilities. In addition, the resulting messages are larger than those of the other two formats, making it both the least efficient and the least performant of the three.
| Schema Format | Binary Encoding | Schema Evolution | Self-describing | Kafka Ecosystem Fit | Use Case |
|---|---|---|---|---|---|
| Avro | ✅ | Flexible | ✅ | Excellent | Streaming, big data |
| Protobuf | ✅ | Rigid, field IDs | ❌ | Good | RPC, microservices |
| JSON Schema | ❌ (text) | Weak | ❌ | Moderate | APIs, validation |
Schema Registries
The Axual Platform supports two types of Schema Registries:
- Apicurio
- Confluent Schema Registry
Apicurio version 2.6.x is the latest currently supported version in the Platform.
The preferred option is Apicurio, since it supports all three schema types (AVRO, Protobuf, and JSON Schema). Furthermore, Apicurio is actively maintained with regular updates and is the preferred option in our cloud offering.
| Schema Registry | AVRO | Protobuf | JSON Schema | Active Updates |
|---|---|---|---|---|
| Apicurio | ✅ | ✅ | ✅ | ✅ |
| Confluent | ✅ | ❌ | ❌ | ❌ |
You can follow the steps in the Instance Configuration page to view and configure the Instance Schema Registry.
Serializers and Deserializers
Serializers are pluggable pieces of software that convert the data objects a producer wants to write to a Topic into the byte array format the broker stores on the Topic log. Conversely, Deserializers convert byte arrays fetched from the broker back into structured objects, retrieving the Schema from the Registry and using it to decode the data.
Before sending data to the broker, the producer contacts the Schema Registry to obtain a schema ID, which it serializes alongside the payload. When a consumer later deserializes the message, it uses that schema ID to retrieve the correct schema from the Registry, ensuring it can interpret the payload correctly.
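As an illustration of this flow, below is a minimal producer sketch, assuming the Confluent Avro serde artifacts are on the classpath; the broker address and registry URL are placeholders. The serializer contacts the registry, resolves the schema ID, and prepends it to each payload before the bytes are sent.

```java
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerSetup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9092");        // placeholder
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer registers/looks up the schema in the registry
        // and writes the returned schema ID alongside the serialized payload.
        props.put("value.serializer",
            "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "https://registry.example.com"); // placeholder

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // Records produced here carry the schema ID inside their payload.
        }
    }
}
```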
Serdes are configurable on the client applications irrespective of the Schema Registry selected. Apicurio Serdes are compatible with Confluent Schema Registry, and the inverse also holds when Apicurio is running in Confluent-compatible (ccompat) mode.
Available Options
Kafka serdes allow for extensibility by implementing two interfaces: `org.apache.kafka.common.serialization.Serializer` and `org.apache.kafka.common.serialization.Deserializer`.
Together, they form a new serde implementation. While many such implementations exist, we will focus on those supported by the Schema Registries integrated with the Axual Platform and explain which ones we recommend and why.
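As a minimal sketch of that extensibility point (not a recommended production serde), the pair below implements both interfaces for plain UTF-8 strings:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

public class Utf8Serde {
    // Serializer: data object -> byte array stored on the Topic log.
    public static class Utf8Serializer implements Serializer<String> {
        @Override
        public byte[] serialize(String topic, String data) {
            return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Deserializer: byte array fetched from the broker -> data object.
    public static class Utf8Deserializer implements Deserializer<String> {
        @Override
        public String deserialize(String topic, byte[] data) {
            return data == null ? null : new String(data, StandardCharsets.UTF_8);
        }
    }
}
```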
Schema ID Encoding Strategies
As mentioned earlier, it is the Serdes' responsibility to fetch and serialize the ID of the Schema used. Different implementations support two established standards:
- Payload encoding: the schema ID is written into the payload byte array after a "magic byte", preceding the actual serialized message
- Header encoding: the schema ID is written into the Kafka message headers
Axual advocates for payload encoding, which can be achieved with Apicurio Serdes in the 2.x.x version series (with the right configuration) or with Confluent Serdes (for AVRO topics).
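To illustrate what payload encoding looks like on the wire, here is a sketch that extracts the schema ID from a record value, following the widely used layout of a single magic byte followed by a 4-byte schema ID and then the serialized message. Deserializers do this parsing for you; the helper is for illustration only.

```java
import java.nio.ByteBuffer;

public class PayloadEncoding {
    // Payload layout: [magic byte 0x0][4-byte schema ID][serialized message]
    public static int extractSchemaId(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        if (buf.get() != 0x0) {      // the magic byte marks a registry-encoded payload
            throw new IllegalArgumentException("Not a payload-encoded record");
        }
        return buf.getInt();         // the 4-byte schema ID used to look up the schema
    }
}
```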
Apicurio Serde Configuration
Because we prefer payload encoding, there is a set of properties that we ask clients to use. These apply to both producers and consumers:
- `SerdeConfig.ENABLE_HEADERS`: `false`
- `SerdeConfig.ARTIFACT_RESOLVER_STRATEGY`: `TopicIdStrategy.class.getCanonicalName()`
- `SerdeConfig.USE_ID`: `"contentId"`
- `SerdeConfig.ID_HANDLER`: `Legacy4ByteIdHandler.class.getCanonicalName()`
For JSON Schema we additionally advise:

- `SerdeConfig.VALIDATION_ENABLED`: `true`
More information on each of these configurations can be found in the Apicurio documentation; a configuration sketch follows below. Following these instructions ensures smooth integration with the platform from both the producer and the consumer perspective.
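Putting these properties together, a configuration sketch in Java might look as follows, assuming Apicurio serde 2.x artifacts on the classpath (package names follow that series); the registry URL is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

import io.apicurio.registry.serde.Legacy4ByteIdHandler;
import io.apicurio.registry.serde.SerdeConfig;
import io.apicurio.registry.serde.avro.AvroKafkaSerializer;
import io.apicurio.registry.serde.strategy.TopicIdStrategy;

public class ApicurioSerdeSetup {
    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        config.put(SerdeConfig.REGISTRY_URL,
            "https://registry.example.com/apis/registry/v2");  // placeholder
        config.put(SerdeConfig.ENABLE_HEADERS, "false");       // payload encoding, not headers
        config.put(SerdeConfig.ARTIFACT_RESOLVER_STRATEGY,
            TopicIdStrategy.class.getCanonicalName());
        config.put(SerdeConfig.USE_ID, "contentId");
        config.put(SerdeConfig.ID_HANDLER,
            Legacy4ByteIdHandler.class.getCanonicalName());    // 4-byte schema IDs
        // For JSON Schema topics, additionally:
        // config.put(SerdeConfig.VALIDATION_ENABLED, "true");

        AvroKafkaSerializer<Object> serializer = new AvroKafkaSerializer<>();
        serializer.configure(config, false); // false = configure as value serializer
    }
}
```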
Code Samples
- KSML sample code with supported serde configurations
- Java sample code with supported serde configurations
Impact
Selecting the right Serde implementation and configuration can affect consuming applications as well as several Axual services that have specific expectations.
Those services include:
- Topic Browse: dynamically switches between Apicurio v2.6.x and Confluent implementations to deserialize. Supports only Apicurio serdes for Protobuf and JSON Schema.
- Distributor: expects Apicurio v2.6.x or Confluent implementations
- REST Proxy: only supports Avro with header encoding
If you are unsure whether your chosen implementation will work, we recommend using one of the suggested options or contacting us for support.
Next Step: Creating Topics
Proceed to Step 2: Creating Topics