Choosing a KSML Application Deployment Size

Table of Contents

Description
Quick start: pick a size in two questions
- 1. What does your pipeline do?
- 2. How many messages per second?
Selecting an initial size
Resource cost by operation type
Tier profiles in brief
Worked example: simple filter vs complex event processing
Diagnosing resource constraints
Scope and limitations
References

A guide to selecting a deployment size for KSML applications on the Axual Platform, and to diagnosing resource constraints once an application is running in production.

Figures labelled measured come directly from our benchmark and can be trusted for the scenario described; the rest are based on published papers (see References). The quick-start grid and tier ranges are starting points to validate under your own load. Your own workload is always the final word.

Description

When you deploy a KSML application, you choose a deployment size that sets how much CPU and memory it gets. Pick too small a size and the application can fall behind, get throttled, or run out of memory; pick too large and you pay for resources you never use.

A KSML application has a fixed memory cost of ~150 MB for the Python runtime. On top of that, two things drive resource use:

What the pipeline does (its shape). A filter is cheap. Stateful operations (aggregations, joins, windows) need far more memory and CPU because they hold state in memory.
How hard you push it (the load). More messages per second, and bigger messages, raise CPU and memory above that baseline.

Shape sets the floor; load sets the ceiling. A heavy pipeline at low load can use more than a light pipeline at high load, so size for the peak combination of both, not the average.

These recommendations come from a benchmark across ten common KSML patterns (simple filters through windowed aggregations and joins) at rates from a few hundred to tens of thousands of messages per second, measuring memory, CPU, and consumer lag.

If you know what your pipeline does and roughly its message rate, go straight to Quick start: pick a size in two questions. For the reasoning, read Selecting an initial size. If an application is already lagging or under memory pressure, go to Diagnosing resource constraints. The right fix is often not a bigger size.

Quick start: pick a size in two questions

1. What does your pipeline do?

Just passing or filtering: drops, keeps, or routes messages, maybe tweaks a field; remembers nothing (filter, map, route).
Light processing: a few steps chained, reformatting, simple enrichment; still stateless.
Remembers or combines: counts, sums, groups, or joins (aggregate, count, join).
Remembers over time windows: groups into time buckets ("per 5 minutes") or detects patterns across many messages; the heaviest kind (windowed aggregations, complex event processing).

2. How many messages per second?

Pipeline	≤1k/s	1k–10k/s	10k–50k/s	50k+/s
Passing / filtering	Smallest	Small	Medium	Medium
Light processing	Smallest	Small	Medium	Large
Remembers / combines	Small	Medium	Large	Large + scale out
Remembers over windows	Medium	Medium	Large	XL + scale out

Pipeline

≤1k/s

1k–10k/s

10k–50k/s

50k+/s

Passing / filtering

Smallest

Small

Medium

Light processing

Smallest

Small

Medium

Large

Remembers / combines

Small

Medium

Large

Large + scale out

Remembers over windows

Medium

Large

XL + scale out

Size names map to whatever the platform offers (commonly XS, S, M).

Bump up a size if: messages are large (tens of KB); per-message logic is heavy (large regex, heavy text processing); or there are many distinct keys (e.g. millions of users) and the pipeline is stateful.

Scale out above ~50k msg/s. A single KSML instance does its main work on one processing path, so making it bigger gives diminishing returns. Run several instances side by side, each handling a share of partitions. The input topic needs at least as many partitions as instances. Think "a few medium instances together," not "one giant instance."

When in doubt, start one size up, watch how it runs, and adjust.

Selecting an initial size

Deployment size reference

The following tier ranges are guidance, not guarantees. Tier names vary by installation — a typical Axual Cloud install offers three sizes (often XS, S, M); on-premise may define more, fewer, or differently named ones. Ask your platform operator for the exact CPU/memory values.

Where only three tiers exist, map the smallest, medium, and largest to tiers 1, 3, and 5; pick the nearest fit for the rest.

Tier	Typical workload	CPU	Memory	Throughput	Data volume	State store
Smallest	Dev/test; simple stateless filter at low rate; lightweight producer	~0.5 core	~1 Gi	<1k msg/s	<1 MB/s	none
Small	Light stateless: filter + transform, content routing at moderate rate	~1 core	~2 Gi	1–10k msg/s	1–10 MB/s	none or <256 MB
Medium	Filter + transform, JSON serdes, regex, a single non-windowed aggregate	~1–2 cores	~4 Gi	5–25k msg/s	5–25 MB/s	256 MB–1 GB
Large	Windowed aggregates, stream–table joins, multi-step CEP, manual state stores	~2–3 cores	~8 Gi	10–50k msg/s	25–100 MB/s	1–4 GB
Extra Large	Heavy CEP at high throughput, large windowed state, multi-stream joins	~3–4+ cores	~12–16 Gi	50k+ msg/s sustained	100+ MB/s	4–8 GB

Tier

Typical workload

CPU

Memory

Throughput

Data volume

State store

Smallest

Dev/test; simple stateless filter at low rate; lightweight producer

~0.5 core

~1 Gi

<1k msg/s

<1 MB/s

none

Small

Light stateless: filter + transform, content routing at moderate rate

~1 core

~2 Gi

1–10k msg/s

1–10 MB/s

none or <256 MB

Medium

Filter + transform, JSON serdes, regex, a single non-windowed aggregate

~1–2 cores

~4 Gi

5–25k msg/s

5–25 MB/s

256 MB–1 GB

Large

Windowed aggregates, stream–table joins, multi-step CEP, manual state stores

~2–3 cores

~8 Gi

10–50k msg/s

25–100 MB/s

1–4 GB

Extra Large

Heavy CEP at high throughput, large windowed state, multi-stream joins

~3–4+ cores

~12–16 Gi

50k+ msg/s sustained

100+ MB/s

4–8 GB

Throughput ranges overlap deliberately: a stateless filter sustains the top of a tier’s range, while a CEP pipeline at the same tier saturates near the bottom. Match both the workload profile and the expected load.

Memory

Two components scale very differently:

GraalVM non-heap (the Python interpreter): effectively constant. Measured at 147–200 MB (median 169) across every scenario and load level. Plain Kafka Streams uses only 49–76 MB, so the KSML Python layer adds ~100 MB (median +99) regardless of what the pipeline does. This is the floor under every KSML application.
JVM heap: scales with topology and load. Stateless pipelines held ~110–250 MB of peak heap from low to medium load, rising to ~400–480 MB at high load (in-flight buffering). Stateful pipelines rose from ~280–295 MB at low load to ~585–656 MB at high load (state-store write buffers).

A stateful windowed pipeline at low load already uses more memory than a stateless filter at high load, yet both keep growing with load.

Table 1. Measured peak memory (small payload), non-heap / heap / total
Load	Stateless filter	Stateful filter + aggregate
Low	176 / 194 / 370 MB	166 / 281 / 447 MB
Medium	182 / 250 / 432 MB	188 / 354 / 542 MB
High	188 / 477 / 665 MB	195 / 656 / 851 MB

Non-heap stays flat; heap grows with load — modestly when stateless, steeply when stateful.

CPU

Measured as container CPU, where 100% denotes one fully utilised core. Three roughly independent factors:

Message rate. The stateless filter rose from ~20–40% of a core at low/medium load to ~55% (peak 72%) at high load; stateful pipelines climbed more steeply; the windowed aggregate reached ~87% and the manual state store peaked above 90% under the heaviest load.
Per-record logic. A field comparison is cheap; regex, multi-step chains, and flatMap expansion cost more; heavier stateless patterns drew ~55–63% of a core at high load.
Statefulness. Stateful pipelines consistently use much more CPU than stateless ones at the same load, approaching a full core under heavy load; stateless filters and transforms stay near half a core.

Scale out, don’t add threads blindly. A KSML instance can use more than one core only when assigned multiple partitions with matching stream threads, but per-partition Python work is the limiting factor. num.stream.threads helps only when each thread has a partition to process: four threads on an instance with two partitions leaves two idle and wastes the allocation. Match thread count to assigned partitions; the reliable way to add capacity is more instances and partitions.

Resource cost by operation type

The kind of operations in your pipeline is the best single predictor of what it needs. Costs below are relative — what each adds on top of the fixed runtime baseline.

Operation Stateful Record volume CPU Memory Notes

Operation	Stateful	Record volume	CPU	Memory	Notes
`filter` / `filterNot`	No	Reduces	Low	Negligible	Cheapest; place early in the pipeline
`mapValues` / `map`	No	Unchanged	Low–mod	Negligible	Key-changing map may force repartition
`transformValues`	No	Unchanged	Low–mod	Negligible	Cost is the user logic
`flatMap` / `flatMapValues`	No	Increases (fan-out)	Moderate	Low	Multiplies all downstream cost
`branch` / routing	No	Splits	Low	Negligible	Total cost depends on the branches
`merge`	No	Combines	Low	Negligible	Inexpensive
`groupBy` / `groupByKey`	No (triggers repartition)	Unchanged	Moderate	Low	Writes an internal repartition topic
`count`	Yes	Reduces to one/key	Moderate	Moderate	Memory grows with key cardinality
`reduce` / `aggregate`	Yes	Reduces to one/key	Mod–high	Mod–high	State store + combiner; grows with keys and value size
Windowed aggregate / count / reduce	Yes	Reduces per window	High	High	Holds many concurrent windows per key; most memory-intensive
Join (stream–table)	Yes	Varies	Mod–high	High	Materialises the table side
Windowed join (stream–stream)	Yes	Varies	High	High	Buffers both streams for the window

filter / filterNot

Reduces

Low

Negligible

Cheapest; place early in the pipeline

mapValues / map

Unchanged

Low–mod

Negligible

Key-changing map may force repartition

transformValues

Unchanged

Low–mod

Negligible

Cost is the user logic

flatMap / flatMapValues

Increases (fan-out)

Moderate

Low

Multiplies all downstream cost

branch / routing

Splits

Low

Negligible

Total cost depends on the branches

merge

Combines

Low

Negligible

Inexpensive

groupBy / groupByKey

No (triggers repartition)

Unchanged

Moderate

Low

Writes an internal repartition topic

count

Yes

Reduces to one/key

Moderate

Memory grows with key cardinality

reduce / aggregate

Yes

Reduces to one/key

Mod–high

State store + combiner; grows with keys and value size

Windowed aggregate / count / reduce

Yes

Reduces per window

High

Holds many concurrent windows per key; most memory-intensive

Join (stream–table)

Yes

Varies

Mod–high

High

Materialises the table side

Windowed join (stream–stream)

Yes

Varies

High

Buffers both streams for the window

Three principles:

State is the dominant memory driver. Going from stateless to any aggregation, window, or join is the single largest memory step; windowing multiplies it, since several windows per key are held until retention expires.
Selectivity propagates. A selective filter early cuts work for everything downstream; a flatMap early multiplies it. Placing the most selective operations first is among the most effective optimisations.
Re-keying has a hidden cost. groupBy or a key-changing map writes to an internal repartition topic, adding network and broker load not visible in the app’s own metrics, but real for capacity planning.

Tier profiles in brief

Smallest: one predicate or field transform at low rate; no state. Budget is mostly the ~170 MB runtime, so almost no bursting headroom. For dev/test, simple producers, low-rate routers. Avoid anything that may burst past a few thousand msg/s.

Small: light stateless at moderate sustained rates. Buys headroom, not new capability. Avoid stateful work or anything needing >1 core sustained.

Medium: the right start for most apps — filter + transform, JSON, regex, routing, a single non-windowed aggregate. Carries light state and heavier per-record logic at once. Avoid windowed CEP, multi-stream joins, or large state.

Large: windowed aggregation, stream–table joins, multi-stage CEP, meaningful state cardinality, or peaks beyond 1–2 cores. First tier where state-store buffering (not in-flight records) dominates memory. Avoid sustained >~50k msg/s or very large state.

Extra Large: very high-throughput stateful work, large windowed state, multi-stream joins. Here the question becomes "how many instances," not "how much bigger." Beyond this, scale horizontally by adding instances and partitions.

Worked example: simple filter vs complex event processing

The same input topic at the same rate can need very different sizes depending on the pipeline.

A stateless filter — one predicate per record:

pipelines:
  drop_low_value:
    from: events
    via:
      - type: filter
        if:
          expression: value.get("score") > 50
    to: filtered_events

A CEP pipeline on the same input (filter, group by key, aggregate in a 5-minute tumbling window, Python pattern detection across the windowed state, filter again):

pipelines:
  windowed_pattern:
    from: events
    via:
      - type: filter
        if: { expression: value.get("score") > 50 }
      - type: groupByKey
      - type: windowedBy
        windowedBy: { windowType: time, duration: PT5M }
      - type: aggregate
        store: pattern_state
      - type: mapValues
        mapper: detect_pattern
      - type: filter
        if: { expression: value.get("alert") == True }
    to: alerts

Table 2. Measured peak JVM total and CPU (small payload)
	Simple filter	Windowed aggregate
Input topic, rate, partitions, threads	identical	identical
Peak total, low load	~370 MB	~460 MB
Peak total, high load	~665 MB	~823 MB
CPU, low load	~25% core	~26% core
CPU, high load	~55% core	~87% core
State store	none	grows with keys until retention expires
Tier, low load	Smallest	Medium
Tier, high load	Small / Medium	Large / Extra Large

Pipeline complexity alone drives a multi-tier difference even at identical rate, partitions, and threads. Within either pipeline, the right tier depends on peak load: sizing for low load and then running high exhausts the headroom. Provision the windowed pipeline with a state-store volume sized for its key cardinality.

Diagnosing resource constraints

In production, the right response to a resource problem is rarely just "bigger." A larger size fixes only a genuine CPU/memory shortage on an instance already using its parallelism well. First identify what is constrained and why.

Symptom Likely cause Response

Symptom	Likely cause	Response
Heap >80%, stable under load	Tier memory genuinely insufficient	Next tier up
Heap >80% and rising continuously	State store not cleaned up (no TTL/retention, unbounded keys)	Fix cleanup; a bigger size only delays failure
Out-of-memory terminations	Insufficient memory	Next tier up
CPU >70% on one thread, others idle	Threads outnumber assigned partitions	Align `num.stream.threads` with partitions, or add partitions
CPU >70% across all threads, all partitions assigned	Genuine single-instance saturation	Bigger tier, or scale out
CPU >70%, partitions per instance at the parallelism limit	One instance can’t get more parallelism	Scale out; bigger won’t help past partition count
Lag rising, CPU low	Network bottleneck, slow downstream, or producer stall	Investigate network/downstream; size changes nothing
Lag rising, high CPU at peaks only	Burst exceeds single-instance capacity	Bigger tier for headroom, or scale out
State-store volume >80%	Volume undersized or state unbounded	If stable, grow the volume; if growing, fix cleanup first
GC >1/min sustained	Heap pressure	More memory, or cut allocation in the Python logic
Pod CPU throttling	Hitting the CPU limit	Increase the CPU tier; this is what throttling means
All metrics healthy, lag still rising	Producers outpace total consumer capacity	Scale out: more instances and partitions

Heap >80%, stable under load

Tier memory genuinely insufficient

Next tier up

Heap >80% and rising continuously

State store not cleaned up (no TTL/retention, unbounded keys)

Fix cleanup; a bigger size only delays failure

Out-of-memory terminations

Insufficient memory

Next tier up

CPU >70% on one thread, others idle

Threads outnumber assigned partitions

Align num.stream.threads with partitions, or add partitions

CPU >70% across all threads, all partitions assigned

Genuine single-instance saturation

Bigger tier, or scale out

CPU >70%, partitions per instance at the parallelism limit

One instance can’t get more parallelism

Scale out; bigger won’t help past partition count

Lag rising, CPU low

Network bottleneck, slow downstream, or producer stall

Investigate network/downstream; size changes nothing

Lag rising, high CPU at peaks only

Burst exceeds single-instance capacity

Bigger tier for headroom, or scale out

State-store volume >80%

Volume undersized or state unbounded

If stable, grow the volume; if growing, fix cleanup first

GC >1/min sustained

Heap pressure

More memory, or cut allocation in the Python logic

Pod CPU throttling

Hitting the CPU limit

Increase the CPU tier; this is what throttling means

All metrics healthy, lag still rising

Producers outpace total consumer capacity

Scale out: more instances and partitions

Increase the size when CPU or memory is at its limit under steady-state (not bursty) load and stable rather than growing, or when pod CPU throttling appears.

Do not increase the size when the constraint is parallelism (align threads/partitions), the network (scale out), unbounded state (fix cleanup), a memory leak (fix it), or one saturated thread while others idle (fix topology/partitioning).

Scale out instead when per-instance network bandwidth is the bottleneck (above ~100 MB/s on gigabit), you’ve hit the input topic’s partition count and need more parallelism (add partitions too), HA needs a redundant instance, or peak load needs more than four cores (two smaller instances beat one large one).

Optimise first where you can: precompute lookups in global init and avoid object creation in hot paths; use an efficient serialisation format (e.g. Avro over verbose text); add window retention/TTL to bound state; filter unwanted records in the first stage, before expensive processing.

Reduce the size when, for at least seven consecutive days: CPU stays below 30% at peak, heap below 50% of the limit, lag near zero, and memory shows no upward trend.

Scope and limitations

Throughput, CPU, and data-volume ranges are rule-of-thumb starting points, not guarantees; achievable figures depend on payload size, serialisation format, per-record logic, and infrastructure. Validate against your own workload before treating any figure as a hard limit.
The guide uses consumer lag as the signal of whether an instance is keeping up, not end-to-end latency. Apps with strict latency targets (e.g. P99) should validate against those directly, with extra memory headroom on stateful tiers for GC pauses.
Standby replicas (num.standby.replicas) aren’t exposed in the size selector; if enabled for HA, double the per-instance memory and state-volume footprint in planning.
Network bandwidth binds before CPU at very high throughput; beyond ~100 MB/s per instance on gigabit, scale out rather than size up.

References

Primarily our own benchmark of KSML across ten processing patterns (memory, CPU, consumer lag under sustained load). The general relationships — stateful and windowing operations cost far more than stateless, and resource use rises with load — are also documented in published Kafka Streams research:

van Dongen, G. & Van den Poel, D. (2020). Evaluation of Stream Processing Frameworks. IEEE TPDS 31(8), 1845–1858. https://doi.org/10.1109/TPDS.2020.2978480
Bordin, M. V. et al. (2020). DSPBench: A Suite of Benchmark Applications for Distributed Data Stream Processing Systems. IEEE Access 8, 222900–222917. https://doi.org/10.1109/ACCESS.2020.3043948
Apache Kafka: Kafka Streams Memory Management
Axual: KSML Application Deployment Sizes
Axual: Managing KSML Applications
KSML: Performance Optimization — KSML Documentation