Choosing a KSML Application Deployment Size
A guide to selecting a deployment size for KSML applications on the Axual Platform, and to diagnosing resource constraints once an application is running in production.
Figures labelled "measured" come directly from our benchmark and can be trusted for the scenario described; the rest are based on published papers (see References). The quick-start grid and tier ranges are starting points to validate under your own load. Your own workload is always the final word.
Description
When you deploy a KSML application, you choose a deployment size that sets how much CPU and memory it gets. Pick too small a size and the application can fall behind, get throttled, or run out of memory; pick too large and you pay for resources you never use.
A KSML application has a fixed memory cost of roughly 150–200 MB (median ~170 MB in our benchmark) for the Python runtime. On top of that, two things drive resource use:
- What the pipeline does (its shape). A filter is cheap. Stateful operations (aggregations, joins, windows) need far more memory and CPU because they maintain state between messages.
- How hard you push it (the load). More messages per second, and bigger messages, raise CPU and memory above that baseline.
Shape sets the floor; load sets the ceiling. A heavy pipeline at low load can use more than a light pipeline at high load, so size for the peak combination of both, not the average.
These recommendations come from a benchmark across ten common KSML patterns (simple filters through windowed aggregations and joins) at rates from a few hundred to tens of thousands of messages per second, measuring memory, CPU, and consumer lag.
If you know what your pipeline does and roughly its message rate, go straight to Quick start: pick a size in two questions. For the reasoning, read Selecting an initial size. If an application is already lagging or under memory pressure, go to Diagnosing resource constraints. The right fix is often not a bigger size.
Quick start: pick a size in two questions
1. What does your pipeline do?
- Just passing or filtering: drops, keeps, or routes messages, maybe tweaks a field; remembers nothing (`filter`, `map`, `route`).
- Light processing: a few steps chained, reformatting, simple enrichment; still stateless.
- Remembers or combines: counts, sums, groups, or joins (`aggregate`, `count`, `join`).
- Remembers over time windows: groups into time buckets ("per 5 minutes") or detects patterns across many messages; the heaviest kind (windowed aggregations, complex event processing).
2. How many messages per second?
| Pipeline | ≤1k/s | 1k–10k/s | 10k–50k/s | 50k+/s |
|---|---|---|---|---|
| Passing / filtering | Smallest | Small | Medium | Medium |
| Light processing | Smallest | Small | Medium | Large |
| Remembers / combines | Small | Medium | Large | Large + scale out |
| Remembers over windows | Medium | Medium | Large | XL + scale out |
Size names map to whatever the platform offers (commonly XS, S, M).
Bump up a size if: messages are large (tens of KB); per-message logic is heavy (large regex, heavy text processing); or there are many distinct keys (e.g. millions of users) and the pipeline is stateful.
Scale out above ~50k msg/s. A single KSML instance does its main work on one processing path, so making it bigger gives diminishing returns. Instead, run several instances side by side, each handling a share of the partitions. The input topic needs at least as many partitions as instances. Think "a few medium instances together," not "one giant instance."
When in doubt, start one size up, watch how it runs, and adjust.
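The two questions above amount to a table lookup. A minimal sketch in Python, assuming hypothetical shape keys (`passing`, `light`, `stateful`, `windowed`) and treating each rate boundary as belonging to the lower column; neither choice comes from the platform itself:

```python
# Quick-start grid, one row per pipeline shape.
# Columns: <=1k/s, 1k-10k/s, 10k-50k/s, 50k+/s.
# The dictionary keys are hypothetical labels for the four shapes.
GRID = {
    "passing": ["Smallest", "Small", "Medium", "Medium"],
    "light": ["Smallest", "Small", "Medium", "Large"],
    "stateful": ["Small", "Medium", "Large", "Large + scale out"],
    "windowed": ["Medium", "Medium", "Large", "XL + scale out"],
}


def rate_column(msgs_per_sec: float) -> int:
    """Map a peak message rate to a grid column (boundaries go to the lower column)."""
    if msgs_per_sec <= 1_000:
        return 0
    if msgs_per_sec <= 10_000:
        return 1
    if msgs_per_sec <= 50_000:
        return 2
    return 3


def pick_size(shape: str, msgs_per_sec: float) -> str:
    """Return the suggested starting size for a pipeline shape and peak rate."""
    return GRID[shape][rate_column(msgs_per_sec)]


print(pick_size("stateful", 5_000))
```

Pass the peak rate rather than the average, and apply the "bump up a size" conditions by hand; the lookup does not model message size or key cardinality.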
Selecting an initial size
Deployment size reference
The following tier ranges are guidance, not guarantees. Tier names vary by installation: a typical Axual Cloud install offers three sizes (often XS, S, M); on-premise installations may define more, fewer, or differently named ones. Ask your platform operator for the exact CPU/memory values. Where only three tiers exist, map the smallest, medium, and largest to tiers 1, 3, and 5 (Smallest, Medium, and Extra Large), and pick the nearest fit for the rest.
| Tier | Typical workload | CPU | Memory | Throughput | Data volume | State store |
|---|---|---|---|---|---|---|
| Smallest | Dev/test; simple stateless filter at low rate; lightweight producer | ~0.5 core | ~1 Gi | <1k msg/s | <1 MB/s | none |
| Small | Light stateless: filter + transform, content routing at moderate rate | ~1 core | ~2 Gi | 1–10k msg/s | 1–10 MB/s | none or <256 MB |
| Medium | Filter + transform, JSON serdes, regex, a single non-windowed aggregate | ~1–2 cores | ~4 Gi | 5–25k msg/s | 5–25 MB/s | 256 MB–1 GB |
| Large | Windowed aggregates, stream–table joins, multi-step CEP, manual state stores | ~2–3 cores | ~8 Gi | 10–50k msg/s | 25–100 MB/s | 1–4 GB |
| Extra Large | Heavy CEP at high throughput, large windowed state, multi-stream joins | ~3–4+ cores | ~12–16 Gi | 50k+ msg/s sustained | 100+ MB/s | 4–8 GB |
Throughput ranges overlap deliberately: a stateless filter sustains the top of a tier’s range, while a CEP pipeline at the same tier saturates near the bottom. Match both the workload profile and the expected load.
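The overlap can be made concrete with a small sketch. It assumes the throughput bounds from the tier table are inclusive on both ends (the table does not say), and the `heavy` flag is a hypothetical stand-in for "CEP-like" versus "stateless filter":

```python
import math

# Throughput ranges in msg/s, copied from the tier table.
# Treating both bounds as inclusive is an assumption.
TIER_RATES = [
    ("Smallest", 0, 1_000),
    ("Small", 1_000, 10_000),
    ("Medium", 5_000, 25_000),
    ("Large", 10_000, 50_000),
    ("Extra Large", 50_000, math.inf),
]


def candidate_tiers(msgs_per_sec: float) -> list[str]:
    """Every tier whose throughput range covers the given rate."""
    return [name for name, low, high in TIER_RATES if low <= msgs_per_sec <= high]


def starting_tier(msgs_per_sec: float, heavy: bool) -> str:
    """Light pipelines start at the smallest covering tier, heavy ones at the largest."""
    tiers = candidate_tiers(msgs_per_sec)
    return tiers[-1] if heavy else tiers[0]


print(candidate_tiers(20_000))              # two tiers cover this rate
print(starting_tier(20_000, heavy=False))   # stateless filter
print(starting_tier(20_000, heavy=True))    # CEP pipeline
```

At 20k msg/s both Medium and Large qualify; the workload profile decides between them, which is the point of the overlap.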
Memory
Two components scale very differently:
- GraalVM non-heap (the Python interpreter): effectively constant. Measured at 147–200 MB (median 169) across every scenario and load level. Plain Kafka Streams uses only 49–76 MB, so the KSML Python layer adds ~100 MB (median +99) regardless of what the pipeline does. This is the floor under every KSML application.
- JVM heap: scales with topology and load. Stateless pipelines held ~110–250 MB of peak heap from low to medium load, rising to ~400–480 MB at high load (in-flight buffering). Stateful pipelines rose from ~280–295 MB at low load to ~585–656 MB at high load (state-store write buffers).
A stateful pipeline at low load (447 MB total) already uses more memory than a stateless filter at medium load (432 MB total), and both keep growing with load.
| Load | Stateless filter | Stateful filter + aggregate |
|---|---|---|
| Low | 176 / 194 / 370 MB | 166 / 281 / 447 MB |
| Medium | 182 / 250 / 432 MB | 188 / 354 / 542 MB |
| High | 188 / 477 / 665 MB | 195 / 656 / 851 MB |
Each cell reads non-heap / peak heap / total. Non-heap stays flat; heap grows with load, modestly when stateless and steeply when stateful.
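As a rough check of a tier against these measurements, the figures can be summed and compared with the tier's memory limit. This is a sketch only: it reads each table cell as non-heap / peak heap / total, treats 1 Gi as 1024 MB, and ignores any memory outside heap and non-heap, all of which are simplifying assumptions.

```python
# (non-heap MB, peak heap MB) per load level, from the measured table.
MEASURED = {
    "stateless": {"low": (176, 194), "medium": (182, 250), "high": (188, 477)},
    "stateful": {"low": (166, 281), "medium": (188, 354), "high": (195, 656)},
}


def total_mb(pipeline: str, load: str) -> int:
    """Non-heap plus peak heap for one measured scenario."""
    non_heap, heap = MEASURED[pipeline][load]
    return non_heap + heap


def headroom_fraction(pipeline: str, load: str, limit_mb: float) -> float:
    """Share of the memory limit left unused at the measured peak."""
    return 1 - total_mb(pipeline, load) / limit_mb


# Stateless filter at high load on a ~1 Gi tier (1024 MB assumed).
print(total_mb("stateless", "high"))
print(round(headroom_fraction("stateless", "high", 1024), 2))
```

A negative result means the measured peak would not fit the limit at all; a small positive one leaves little room for bursts.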
CPU
Measured as container CPU, where 100% denotes one fully utilised core. Three roughly independent factors:
- Message rate. The stateless filter rose from ~20–40% of a core at low/medium load to ~55% (peak 72%) at high load. Stateful pipelines climbed more steeply: the windowed aggregate reached ~87%, and the manual state store peaked above 90% under the heaviest load.
- Per-record logic. A field comparison is cheap; regex, multi-step chains, and `flatMap` expansion cost more. Heavier stateless patterns drew ~55–63% of a core at high load.
- Statefulness. Stateful pipelines consistently use much more CPU than stateless ones at the same load, approaching a full core under heavy load; stateless filters and transforms stay near half a core.
Scale out, don’t add threads blindly. A KSML instance can use more than one core only when it is assigned multiple partitions with matching stream threads, and even then per-partition Python work is the limiting factor.
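The thread and partition arithmetic behind this advice can be sketched as follows. The ~50k msg/s per-instance figure is the guide's rule of thumb, not a measured limit for any particular pipeline, and both function names are hypothetical:

```python
import math


def active_threads(stream_threads: int, assigned_partitions: int) -> int:
    """Threads that can actually do work: extra threads beyond the partition count sit idle."""
    return min(stream_threads, assigned_partitions)


def plan_scale_out(peak_rate: float, per_instance_rate: float, topic_partitions: int):
    """Return (instances needed, partitions the input topic must have)."""
    instances = max(1, math.ceil(peak_rate / per_instance_rate))
    # The input topic needs at least as many partitions as instances.
    return instances, max(topic_partitions, instances)


print(active_threads(stream_threads=4, assigned_partitions=2))
print(plan_scale_out(peak_rate=120_000, per_instance_rate=50_000, topic_partitions=2))
```

In the second call the topic has too few partitions for the instance count, so the plan includes adding partitions as well as instances.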
Resource cost by operation type
The kind of operations in your pipeline is the best single predictor of what it needs. Costs below are relative — what each adds on top of the fixed runtime baseline.
| Operation | Stateful | Record volume | CPU | Memory | Notes |
|---|---|---|---|---|---|
| | No | Reduces | Low | Negligible | Cheapest; place early in the pipeline |
| | No | Unchanged | Low–moderate | Negligible | Key-changing map may force repartition |
| | No | Unchanged | Low–moderate | Negligible | Cost is the user logic |
| | No | Increases (fan-out) | Moderate | Low | Multiplies all downstream cost |
| | No | Splits | Low | Negligible | Total cost depends on the branches |
| | No | Combines | Low | Negligible | Inexpensive |
| | No (triggers repartition) | Unchanged | Moderate | Low | Writes an internal repartition topic |
| | Yes | Reduces to one/key | Moderate | Moderate | Memory grows with key cardinality |
| | Yes | Reduces to one/key | Moderate–high | Moderate–high | State store + combiner; grows with keys and value size |
| Windowed aggregate / count / reduce | Yes | Reduces per window | High | High | Holds many concurrent windows per key; most memory-intensive |
| Join (stream–table) | Yes | Varies | Moderate–high | High | Materialises the table side |
| Windowed join (stream–stream) | Yes | Varies | High | High | Buffers both streams for the window |
Three principles:
- State is the dominant memory driver. Going from stateless to any aggregation, window, or join is the single largest memory step; windowing multiplies it, since several windows per key are held until retention expires.
- Selectivity propagates. A selective filter early cuts work for everything downstream; a `flatMap` early multiplies it. Placing the most selective operations first is among the most effective optimisations.
- Re-keying has a hidden cost. `groupBy` or a key-changing `map` writes to an internal repartition topic, adding network and broker load not visible in the app’s own metrics, but real for capacity planning.
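The selectivity principle is easy to quantify. A minimal sketch, assuming a hypothetical filter that keeps 25% of records and a hypothetical `flatMap` that emits three records per input; neither figure comes from the benchmark:

```python
def stage_loads(input_rate: float, stages):
    """Return (records/s arriving at each stage, final output rate).

    stages is a list of (name, factor) pairs, where factor is the number of
    records emitted per record received (0.25 for a filter keeping 25%).
    """
    arriving = {}
    rate = input_rate
    for name, factor in stages:
        arriving[name] = rate
        rate = rate * factor
    return arriving, rate


filter_first = [("filter", 0.25), ("flatMap", 3)]
fan_out_first = [("flatMap", 3), ("filter", 0.25)]

for order in (filter_first, fan_out_first):
    arriving, output = stage_loads(10_000, order)
    # Total records handled per second across all stages, then the output rate.
    print(sum(arriving.values()), output)
```

Both orders emit the same output rate, but the stages handle 12,500 records per second with the filter first and 40,000 with the fan-out first.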
Tier profiles in brief
Smallest: one predicate or field transform at low rate; no state. Budget is mostly the ~170 MB runtime, so almost no bursting headroom. For dev/test, simple producers, low-rate routers. Avoid anything that may burst past a few thousand msg/s.
Small: light stateless at moderate sustained rates. Buys headroom, not new capability. Avoid stateful work or anything needing >1 core sustained.
Medium: the right start for most apps — filter + transform, JSON, regex, routing, a single non-windowed aggregate. Carries light state and heavier per-record logic at once. Avoid windowed CEP, multi-stream joins, or large state.
Large: windowed aggregation, stream–table joins, multi-stage CEP, meaningful state cardinality, or peaks beyond 1–2 cores. First tier where state-store buffering (not in-flight records) dominates memory. Avoid sustained >~50k msg/s or very large state.
Extra Large: very high-throughput stateful work, large windowed state, multi-stream joins. Here the question becomes "how many instances," not "how much bigger." Beyond this, scale horizontally by adding instances and partitions.
Worked example: simple filter vs complex event processing
The same input topic at the same rate can need very different sizes depending on the pipeline.
A stateless filter — one predicate per record:

    pipelines:
      drop_low_value:
        from: events
        via:
          # Stateless: one predicate evaluated per record
          - type: filter
            if:
              expression: value.get("score") > 50
        to: filtered_events

A CEP pipeline on the same input (filter, group by key, aggregate in a 5-minute tumbling window, Python pattern detection across the windowed state, filter again):
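
    pipelines:
      windowed_pattern:
        from: events
        via:
          # Same predicate as the simple filter
          - type: filter
            if: { expression: value.get("score") > 50 }
          # Group by key, then bucket into 5-minute windows
          - type: groupByKey
          - type: windowedBy
            windowedBy: { windowType: time, duration: PT5M }
          # Stateful step: windowed state is kept in the pattern_state store
          - type: aggregate
            store: pattern_state
          # Python pattern detection across the windowed state
          - type: mapValues
            mapper: detect_pattern
          - type: filter
            if: { expression: value.get("alert") == True }
        to: alerts
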
| | Simple filter | Windowed aggregate |
|---|---|---|
| Input topic, rate, partitions, threads | identical | identical |
| Peak total, low load | ~370 MB | ~460 MB |
| Peak total, high load | ~665 MB | ~823 MB |
| CPU, low load | ~25% core | ~26% core |
| CPU, high load | ~55% core | ~87% core |
| State store | none | grows with keys until retention expires |
| Tier, low load | Smallest | Medium |
| Tier, high load | Small / Medium | Large / Extra Large |
Pipeline complexity alone drives a multi-tier difference even at identical rate, partitions, and threads. Within either pipeline, the right tier depends on peak load: sizing for low load and then running high exhausts the headroom. Provision the windowed pipeline with a state-store volume sized for its key cardinality.
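For a first estimate of that state-store volume, one rough upper bound multiplies distinct keys by the number of windows retained per key and by the stored entry size. This sketch assumes tumbling windows, every key active in every retained window, and a hypothetical entry size; real stores add overhead that it does not model.

```python
import math
from datetime import timedelta


def windowed_state_bytes(distinct_keys: int, window: timedelta,
                         retention: timedelta, bytes_per_entry: int) -> int:
    """Worst-case bytes held: keys x windows retained per key x entry size."""
    windows_per_key = math.ceil(retention / window)
    return distinct_keys * windows_per_key * bytes_per_entry


# Hypothetical inputs: 1M keys, 5-minute windows, 1-hour retention, 200-byte entries.
estimate = windowed_state_bytes(1_000_000, timedelta(minutes=5), timedelta(hours=1), 200)
print(estimate / 1e9, "GB")
```

With these inputs the bound is 2.4 GB, inside the 1–4 GB state-store range listed for the Large tier. Shortening retention or the key space lowers it proportionally.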
Diagnosing resource constraints
In production, the right response to a resource problem is rarely just "bigger." A larger size fixes only a genuine CPU/memory shortage on an instance already using its parallelism well. First identify what is constrained and why.
| Symptom | Likely cause | Response |
|---|---|---|
| Heap >80%, stable under load | Tier memory genuinely insufficient | Next tier up |
| Heap >80% and rising continuously | State store not cleaned up (no TTL/retention, unbounded keys) | Fix cleanup; a bigger size only delays failure |
| Out-of-memory terminations | Insufficient memory | Next tier up |
| CPU >70% on one thread, others idle | Threads outnumber assigned partitions | Align thread count with assigned partitions |
| CPU >70% across all threads, all partitions assigned | Genuine single-instance saturation | Bigger tier, or scale out |
| CPU >70%, partitions per instance at the parallelism limit | One instance can’t get more parallelism | Scale out; bigger won’t help past partition count |
| Lag rising, CPU low | Network bottleneck, slow downstream, or producer stall | Investigate network/downstream; size changes nothing |
| Lag rising, high CPU at peaks only | Burst exceeds single-instance capacity | Bigger tier for headroom, or scale out |
| State-store volume >80% | Volume undersized or state unbounded | If stable, grow the volume; if growing, fix cleanup first |
| GC >1/min sustained | Heap pressure | More memory, or cut allocation in the Python logic |
| Pod CPU throttling | Hitting the CPU limit | Move to a tier with a higher CPU limit; throttling means the limit is being reached |
| All metrics healthy, lag still rising | Producers outpace total consumer capacity | Scale out: more instances and partitions |
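A few rows of the table can be expressed as a triage function. This is a sketch covering only the heap, per-thread CPU, and low-CPU lag rows; the `Snapshot` fields and the `IDLE` and `LOW_CPU` cut-offs are hypothetical and would need mapping onto whatever your monitoring exposes.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    heap_used: float        # fraction of the heap limit, 0.0-1.0
    heap_rising: bool       # True if heap keeps climbing under steady load
    busiest_thread: float   # CPU of the busiest stream thread, fraction of a core
    quietest_thread: float  # CPU of the least busy stream thread
    lag_rising: bool        # True if consumer lag is growing


IDLE = 0.10     # assumption: below this a thread counts as idle
LOW_CPU = 0.30  # assumption: below this CPU counts as "low"


def diagnose(s: Snapshot) -> list[str]:
    """Return the table's suggested responses for the symptoms present."""
    findings = []
    if s.heap_used > 0.80:
        findings.append("fix state cleanup" if s.heap_rising else "next tier up")
    if s.busiest_thread > 0.70:
        if s.quietest_thread < IDLE:
            findings.append("align threads with partitions")
        else:
            findings.append("bigger tier or scale out")
    if s.lag_rising and s.busiest_thread < LOW_CPU:
        findings.append("investigate network or downstream")
    return findings


print(diagnose(Snapshot(0.85, True, 0.40, 0.35, False)))
```

An empty list means none of the modelled symptoms is present, not that the application is healthy; the remaining rows still need checking by hand.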
Increase the size when CPU or memory is at its limit under steady-state (not bursty) load and stable rather than growing, or when pod CPU throttling appears.
Do not increase the size when the constraint is parallelism (align threads/partitions), the network (scale out), unbounded state (fix cleanup), a memory leak (fix it), or one saturated thread while others idle (fix topology/partitioning).
Scale out instead when per-instance network bandwidth is the bottleneck (above ~100 MB/s on gigabit), you’ve hit the input topic’s partition count and need more parallelism (add partitions too), HA needs a redundant instance, or peak load needs more than four cores (two smaller instances beat one large one).
Optimise first where you can: precompute lookups in global init and avoid object creation in hot paths; use an efficient serialisation format (e.g. Avro over verbose text); add window retention/TTL to bound state; filter unwanted records in the first stage, before expensive processing.
Reduce the size when, for at least seven consecutive days: CPU stays below 30% at peak, heap below 50% of the limit, lag near zero, and memory shows no upward trend.
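The downsizing rule can be checked mechanically against daily metrics. A minimal sketch: the `DayStats` shape is hypothetical, "lag near zero" is approximated by a configurable `lag_tolerance`, and "no upward trend" by comparing the first and last day's peak heap, both of which are simplifications of the rule as stated.

```python
from dataclasses import dataclass


@dataclass
class DayStats:
    peak_cpu: float   # peak CPU that day, as a fraction (0.30 = 30%)
    peak_heap: float  # peak heap that day, as a fraction of the limit
    max_lag: int      # highest consumer lag seen that day, in records


def can_downsize(days: list[DayStats], lag_tolerance: int = 100,
                 trend_tolerance: float = 0.05) -> bool:
    """True if the most recent seven days all satisfy the downsizing criteria."""
    if len(days) < 7:
        return False
    recent = days[-7:]
    quiet = all(
        d.peak_cpu < 0.30 and d.peak_heap < 0.50 and d.max_lag <= lag_tolerance
        for d in recent
    )
    # Crude trend check: the last day's heap peak must not exceed the first by much.
    no_growth = recent[-1].peak_heap - recent[0].peak_heap <= trend_tolerance
    return quiet and no_growth


calm_week = [DayStats(0.20, 0.40, 10)] * 7
print(can_downsize(calm_week))
```

A proper trend test (for example a fitted slope over a longer period) would be more robust than the two-point comparison used here.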
Scope and limitations
- Throughput, CPU, and data-volume ranges are rule-of-thumb starting points, not guarantees; achievable figures depend on payload size, serialisation format, per-record logic, and infrastructure. Validate against your own workload before treating any figure as a hard limit.
- The guide uses consumer lag as the signal of whether an instance is keeping up, not end-to-end latency. Apps with strict latency targets (e.g. P99) should validate against those directly, with extra memory headroom on stateful tiers for GC pauses.
- Standby replicas (`num.standby.replicas`) aren’t exposed in the size selector; if enabled for HA, double the per-instance memory and state-volume footprint in planning.
- Network bandwidth binds before CPU at very high throughput; beyond ~100 MB/s per instance on gigabit, scale out rather than size up.
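Whether an instance is near the network ceiling follows from message rate and average message size. A minimal sketch, counting inbound traffic only and using decimal megabytes; outbound and repartition traffic would add to the figure, and the function names are hypothetical:

```python
def throughput_mb_per_sec(msgs_per_sec: float, avg_message_bytes: float) -> float:
    """Inbound data volume in MB/s (1 MB = 1,000,000 bytes)."""
    return msgs_per_sec * avg_message_bytes / 1_000_000


def network_bound(msgs_per_sec: float, avg_message_bytes: float,
                  limit_mb_per_sec: float = 100.0) -> bool:
    """True if inbound volume alone exceeds the ~100 MB/s per-instance guideline."""
    return throughput_mb_per_sec(msgs_per_sec, avg_message_bytes) > limit_mb_per_sec


# 20k msg/s of 10 KB messages already exceeds the guideline.
print(throughput_mb_per_sec(20_000, 10_000))
print(network_bound(20_000, 10_000))
```

This is why large messages push a pipeline toward scaling out well before its message rate reaches 50k msg/s.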
References
This guide draws primarily on our own benchmark of KSML across ten processing patterns (memory, CPU, and consumer lag under sustained load). The general relationships (stateful and windowing operations cost far more than stateless ones, and resource use rises with load) are also documented in published Kafka Streams research:
- van Dongen, G. & Van den Poel, D. (2020). Evaluation of Stream Processing Frameworks. IEEE TPDS 31(8), 1845–1858. https://doi.org/10.1109/TPDS.2020.2978480
- Bordin, M. V. et al. (2020). DSPBench: A Suite of Benchmark Applications for Distributed Data Stream Processing Systems. IEEE Access 8, 222900–222917. https://doi.org/10.1109/ACCESS.2020.3043948
- Apache Kafka: Kafka Streams Memory Management
- Axual: Managing KSML Applications
- KSML: Performance Optimization, KSML Documentation