Step 1: Selecting Data Formats
Data formats in Kafka
Axual supports the following data formats for keys and values:

- String
- Binary
- JSON Schemaless
- XML Schemaless
- AVRO
- Protobuf
- JSON Schema
Once selected, the data format cannot be changed for a given Topic. It is therefore important to make this decision early in the design process, involving both producers and consumers to ensure aligned expectations.
Serialization and Schema Support
Schema-enforced formats are preferred because they make the data contract between producers and consumers explicit. A Schema Registry mediates and validates that schema structures are followed, even for complex object hierarchies. This ensures smooth interaction between producers and consumers, both at initial setup and as APIs evolve. With proper schema evolution and backward compatibility, disruptive migrations can be avoided.
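To make the backward-compatibility point concrete, here is a minimal sketch using the Avro Java library; the `User` record and its fields are hypothetical. A new schema version adds an optional field with a default, and Avro's compatibility checker confirms that a reader on the new schema can still decode data written with the old one.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class EvolutionCheck {
    public static void main(String[] args) {
        // Hypothetical v1 schema with a single required field.
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"}]}");

        // v2 adds an optional field with a default: a backward-compatible change.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        // A v2 reader can decode v1 data: the missing field falls back to its default.
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println(result.getType()); // prints COMPATIBLE
    }
}
```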
By contrast, semi-structured formats like JSON Schemaless and XML Schemaless rely on client-side schema handling. This is error-prone, since developers must handle validation themselves. As a result, major producer changes can break consumers at runtime when deserialization fails.
Picking a Data Type
Based on the arguments above, and since schema evolution is the key factor for future compatibility, we strongly recommend using one of the schema types: Avro, Protobuf, or JSON Schema.
Of the three options, Axual recommends Avro as the preferred schema type, as it is the most natural fit for Kafka. Avro has been supported in our platform since the early days and is also widely adopted across our customer base. It has also become the de facto choice for many Kafka-based solutions in the industry, making it easier to align with common practices.
Protobuf was developed by Google, primarily to enable high-performance gRPC integrations. It offers slight performance gains in exchange for somewhat weaker schema evolution support.
JSON Schema is the least preferred option, mostly because of its weak schema evolution capabilities. In addition, the resulting messages are larger than those of the other two formats, making it both the least efficient and the least performant of the three.
| Schema Format | Binary Encoding | Schema Evolution | Self-describing | Kafka Ecosystem Fit | Use Case |
|---|---|---|---|---|---|
| Avro | ✅ | Flexible | ✅ | Excellent | Streaming, big data |
| Protobuf | ✅ | Rigid, field IDs | ❌ | Good | RPC, microservices |
| JSON Schema | ❌ (text) | Weak | ❌ | Moderate | APIs, validation |
Schema Registries
The Axual Platform supports two types of Schema Registries:
- Apicurio
- Confluent Schema Registry
Apicurio version 2.6.x is the latest currently supported version in the Platform.
The preferred option is Apicurio, since it supports all three schema types (AVRO, Protobuf, and JSON Schema). Furthermore, Apicurio is actively maintained with regular updates and is the preferred option in our cloud offering.
| Schema Registry | AVRO | Protobuf | JSON Schema | Active Updates |
|---|---|---|---|---|
| Apicurio | ✅ | ✅ | ✅ | ✅ |
| Confluent | ✅ | ❌ | ❌ | ❌ |
You can follow the steps in the Instance Configuration page to view and configure the Instance Schema Registry.
Serializers and Deserializers
Serializers are pluggable pieces of software that convert the data objects a producer wants to write to a Topic into the byte array format the broker stores on the Topic log. Conversely, Deserializers convert byte arrays fetched from the broker back into structured objects, retrieving the Schema from the Registry and using it to decode the data.
Before sending data to the broker, the producer contacts the Schema Registry to obtain a schema ID, which it serializes alongside the payload. When a consumer later deserializes the message, it uses that schema ID to retrieve the correct schema from the Registry, ensuring it can interpret the payload correctly.
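As an illustration of this flow, below is a minimal producer sketch, assuming the Confluent Avro serde artifacts are on the classpath; the broker address and registry URL are placeholders. The serializer contacts the registry, resolves the schema ID, and prepends it to each payload before the bytes are sent.

```java
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerSetup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9092");        // placeholder
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer registers/looks up the schema in the registry
        // and writes the returned schema ID alongside the serialized payload.
        props.put("value.serializer",
            "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "https://registry.example.com"); // placeholder

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // Records produced here carry the schema ID inside their payload.
        }
    }
}
```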
Serdes are configurable on the client applications irrespective of the Schema Registry selected. Apicurio Serdes are compatible with Confluent Schema Registry, and the inverse also holds when Apicurio is running in Confluent-compatible (ccompat) mode.
Available Options
Kafka serdes allow for extensibility by implementing two interfaces: `org.apache.kafka.common.serialization.Serializer` and `org.apache.kafka.common.serialization.Deserializer`.
Together, they form a new serde implementation. While many such implementations exist, we will focus on those supported by the Schema Registries integrated with the Axual Platform and explain which ones we recommend and why.
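As a minimal sketch of that extensibility point (not a recommended production serde), the pair below implements both interfaces for plain UTF-8 strings:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

public class Utf8Serde {
    // Serializer: data object -> byte array stored on the Topic log.
    public static class Utf8Serializer implements Serializer<String> {
        @Override
        public byte[] serialize(String topic, String data) {
            return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Deserializer: byte array fetched from the broker -> data object.
    public static class Utf8Deserializer implements Deserializer<String> {
        @Override
        public String deserialize(String topic, byte[] data) {
            return data == null ? null : new String(data, StandardCharsets.UTF_8);
        }
    }
}
```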
Schema ID Encoding Strategies
As mentioned earlier, it is the Serdes' responsibility to fetch and serialize the ID of the Schema used. Different implementations support two established standards:
- Payload encoding: the schema ID is written into the payload byte array after a "magic byte", preceding the actual serialized message
- Header encoding: the schema ID is written into the Kafka message headers
Axual advocates for payload encoding, which can be achieved with Apicurio Serdes in the 2.x.x version series (with the right configuration) or with Confluent Serdes (for AVRO topics).
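To illustrate what payload encoding looks like on the wire, here is a sketch that extracts the schema ID from a record value, following the widely used layout of a single magic byte followed by a 4-byte schema ID and then the serialized message. Deserializers do this parsing for you; the helper is for illustration only.

```java
import java.nio.ByteBuffer;

public class PayloadEncoding {
    // Payload layout: [magic byte 0x0][4-byte schema ID][serialized message]
    public static int extractSchemaId(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        if (buf.get() != 0x0) {      // the magic byte marks a registry-encoded payload
            throw new IllegalArgumentException("Not a payload-encoded record");
        }
        return buf.getInt();         // the 4-byte schema ID used to look up the schema
    }
}
```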
Apicurio Serde Configuration
Because we prefer payload encoding, there is a set of properties that we ask clients to use. These apply to both producers and consumers:
- `SerdeConfig.ENABLE_HEADERS`: `false`
- `SerdeConfig.ARTIFACT_RESOLVER_STRATEGY`: `TopicIdStrategy.class.getCanonicalName()`
- `SerdeConfig.USE_ID`: `"contentId"`
- `SerdeConfig.ID_HANDLER`: `Legacy4ByteIdHandler.class.getCanonicalName()`
For JSON Schema we additionally advise:

- `SerdeConfig.VALIDATION_ENABLED`: `true`
More information on each of these configurations can be found in the Apicurio documentation; a configuration sketch follows below. Following these instructions ensures smooth integration with the platform from both the producer and the consumer perspective.
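Putting these properties together, a configuration sketch in Java might look as follows, assuming Apicurio serde 2.x artifacts on the classpath (package names follow that series); the registry URL is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

import io.apicurio.registry.serde.Legacy4ByteIdHandler;
import io.apicurio.registry.serde.SerdeConfig;
import io.apicurio.registry.serde.avro.AvroKafkaSerializer;
import io.apicurio.registry.serde.strategy.TopicIdStrategy;

public class ApicurioSerdeSetup {
    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        config.put(SerdeConfig.REGISTRY_URL,
            "https://registry.example.com/apis/registry/v2");  // placeholder
        config.put(SerdeConfig.ENABLE_HEADERS, "false");       // payload encoding, not headers
        config.put(SerdeConfig.ARTIFACT_RESOLVER_STRATEGY,
            TopicIdStrategy.class.getCanonicalName());
        config.put(SerdeConfig.USE_ID, "contentId");
        config.put(SerdeConfig.ID_HANDLER,
            Legacy4ByteIdHandler.class.getCanonicalName());    // 4-byte schema IDs
        // For JSON Schema topics, additionally:
        // config.put(SerdeConfig.VALIDATION_ENABLED, "true");

        AvroKafkaSerializer<Object> serializer = new AvroKafkaSerializer<>();
        serializer.configure(config, false); // false = configure as value serializer
    }
}
```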
Code Samples
- KSML sample code with supported serde configurations
- Java sample code with supported serde configurations
Impact
Selecting the right Serde implementation and configuration can affect consuming applications as well as several Axual services that have specific expectations.
Those services include:
- Topic Browse: dynamically switches between Apicurio v2.6.x and Confluent implementations to deserialize. Supports only Apicurio serdes for Protobuf and JSON Schema.
- Distributor: expects Apicurio v2.6.x or Confluent implementations
- REST Proxy: only supports Avro with header encoding
If you are unsure whether your chosen implementation will work, we recommend using one of the suggested options or contacting us for support.
Next Step: Creating Topics
Proceed to Step 2: Creating Topics