Choosing a KSML Application Deployment Size

A guide to selecting a deployment size for KSML applications on the Axual Platform, and to diagnosing resource constraints once an application is running in production.

Figures labelled measured come directly from our benchmark and can be trusted for the scenario described; the rest are based on published papers (see References). The quick-start grid and tier ranges are starting points to validate under your own load. Your own workload is always the final word.

Description

When you deploy a KSML application, you choose a deployment size that sets how much CPU and memory it gets. Pick too small a size and the application can fall behind, get throttled, or run out of memory; pick too large and you pay for resources you never use.

A KSML application has a fixed memory cost of ~150 MB for the Python runtime. On top of that, two things drive resource use:

  • What the pipeline does (its shape). A filter is cheap. Stateful operations (aggregations, joins, windows) need far more memory and CPU because they hold state in memory.

  • How hard you push it (the load). More messages per second, and bigger messages, raise CPU and memory above that baseline.

Shape sets the floor; load sets the ceiling. A heavy pipeline at low load can use more than a light pipeline at high load, so size for the peak combination of both, not the average.

These recommendations come from a benchmark across ten common KSML patterns (simple filters through windowed aggregations and joins) at rates from a few hundred to tens of thousands of messages per second, measuring memory, CPU, and consumer lag.

If you know what your pipeline does and roughly its message rate, go straight to Quick start: pick a size in two questions. For the reasoning, read Selecting an initial size. If an application is already lagging or under memory pressure, go to Diagnosing resource constraints. The right fix is often not a bigger size.

Quick start: pick a size in two questions

1. What does your pipeline do?

  • Just passing or filtering: drops, keeps, or routes messages, maybe tweaks a field; remembers nothing (filter, map, route).

  • Light processing: a few steps chained, reformatting, simple enrichment; still stateless.

  • Remembers or combines: counts, sums, groups, or joins (aggregate, count, join).

  • Remembers over time windows: groups into time buckets ("per 5 minutes") or detects patterns across many messages; the heaviest kind (windowed aggregations, complex event processing).

2. How many messages per second?

Pipeline ≤1k/s 1k–10k/s 10k–50k/s 50k+/s

Passing / filtering

Smallest

Small

Medium

Medium

Light processing

Smallest

Small

Medium

Large

Remembers / combines

Small

Medium

Large

Large + scale out

Remembers over windows

Medium

Medium

Large

XL + scale out

Size names map to whatever the platform offers (commonly XS, S, M).

Bump up a size if: messages are large (tens of KB); per-message logic is heavy (large regex, heavy text processing); or there are many distinct keys (e.g. millions of users) and the pipeline is stateful.

Scale out above ~50k msg/s. A single KSML instance does its main work on one processing path, so making it bigger gives diminishing returns. Run several instances side by side, each handling a share of partitions. The input topic needs at least as many partitions as instances. Think "a few medium instances together," not "one giant instance."

When in doubt, start one size up, watch how it runs, and adjust.

Selecting an initial size

Deployment size reference

The following tier ranges are guidance, not guarantees. Tier names vary by installation — a typical Axual Cloud install offers three sizes (often XS, S, M); on-premise may define more, fewer, or differently named ones. Ask your platform operator for the exact CPU/memory values.

Where only three tiers exist, map the smallest, medium, and largest to tiers 1, 3, and 5; pick the nearest fit for the rest.

Tier Typical workload CPU Memory Throughput Data volume State store

Smallest

Dev/test; simple stateless filter at low rate; lightweight producer

~0.5 core

~1 Gi

<1k msg/s

<1 MB/s

none

Small

Light stateless: filter + transform, content routing at moderate rate

~1 core

~2 Gi

1–10k msg/s

1–10 MB/s

none or <256 MB

Medium

Filter + transform, JSON serdes, regex, a single non-windowed aggregate

~1–2 cores

~4 Gi

5–25k msg/s

5–25 MB/s

256 MB–1 GB

Large

Windowed aggregates, stream–table joins, multi-step CEP, manual state stores

~2–3 cores

~8 Gi

10–50k msg/s

25–100 MB/s

1–4 GB

Extra Large

Heavy CEP at high throughput, large windowed state, multi-stream joins

~3–4+ cores

~12–16 Gi

50k+ msg/s sustained

100+ MB/s

4–8 GB

Throughput ranges overlap deliberately: a stateless filter sustains the top of a tier’s range, while a CEP pipeline at the same tier saturates near the bottom. Match both the workload profile and the expected load.

Memory

Two components scale very differently:

  • GraalVM non-heap (the Python interpreter): effectively constant. Measured at 147–200 MB (median 169) across every scenario and load level. Plain Kafka Streams uses only 49–76 MB, so the KSML Python layer adds ~100 MB (median +99) regardless of what the pipeline does. This is the floor under every KSML application.

  • JVM heap: scales with topology and load. Stateless pipelines held ~110–250 MB of peak heap from low to medium load, rising to ~400–480 MB at high load (in-flight buffering). Stateful pipelines rose from ~280–295 MB at low load to ~585–656 MB at high load (state-store write buffers).

A stateful windowed pipeline at low load already uses more memory than a stateless filter at high load, yet both keep growing with load.

Table 1. Measured peak memory (small payload), non-heap / heap / total
Load Stateless filter Stateful filter + aggregate

Low

176 / 194 / 370 MB

166 / 281 / 447 MB

Medium

182 / 250 / 432 MB

188 / 354 / 542 MB

High

188 / 477 / 665 MB

195 / 656 / 851 MB

Non-heap stays flat; heap grows with load — modestly when stateless, steeply when stateful.

CPU

Measured as container CPU, where 100% denotes one fully utilised core. Three roughly independent factors:

  • Message rate. The stateless filter rose from ~20–40% of a core at low/medium load to ~55% (peak 72%) at high load; stateful pipelines climbed more steeply; the windowed aggregate reached ~87% and the manual state store peaked above 90% under the heaviest load.

  • Per-record logic. A field comparison is cheap; regex, multi-step chains, and flatMap expansion cost more; heavier stateless patterns drew ~55–63% of a core at high load.

  • Statefulness. Stateful pipelines consistently use much more CPU than stateless ones at the same load, approaching a full core under heavy load; stateless filters and transforms stay near half a core.

Scale out, don’t add threads blindly. A KSML instance can use more than one core only when assigned multiple partitions with matching stream threads, but per-partition Python work is the limiting factor. num.stream.threads helps only when each thread has a partition to process: four threads on an instance with two partitions leaves two idle and wastes the allocation. Match thread count to assigned partitions; the reliable way to add capacity is more instances and partitions.

Resource cost by operation type

The kind of operations in your pipeline is the best single predictor of what it needs. Costs below are relative — what each adds on top of the fixed runtime baseline.

Operation Stateful Record volume CPU Memory Notes

filter / filterNot

No

Reduces

Low

Negligible

Cheapest; place early in the pipeline

mapValues / map

No

Unchanged

Low–mod

Negligible

Key-changing map may force repartition

transformValues

No

Unchanged

Low–mod

Negligible

Cost is the user logic

flatMap / flatMapValues

No

Increases (fan-out)

Moderate

Low

Multiplies all downstream cost

branch / routing

No

Splits

Low

Negligible

Total cost depends on the branches

merge

No

Combines

Low

Negligible

Inexpensive

groupBy / groupByKey

No (triggers repartition)

Unchanged

Moderate

Low

Writes an internal repartition topic

count

Yes

Reduces to one/key

Moderate

Moderate

Memory grows with key cardinality

reduce / aggregate

Yes

Reduces to one/key

Mod–high

Mod–high

State store + combiner; grows with keys and value size

Windowed aggregate / count / reduce

Yes

Reduces per window

High

High

Holds many concurrent windows per key; most memory-intensive

Join (stream–table)

Yes

Varies

Mod–high

High

Materialises the table side

Windowed join (stream–stream)

Yes

Varies

High

High

Buffers both streams for the window

Three principles:

  • State is the dominant memory driver. Going from stateless to any aggregation, window, or join is the single largest memory step; windowing multiplies it, since several windows per key are held until retention expires.

  • Selectivity propagates. A selective filter early cuts work for everything downstream; a flatMap early multiplies it. Placing the most selective operations first is among the most effective optimisations.

  • Re-keying has a hidden cost. groupBy or a key-changing map writes to an internal repartition topic, adding network and broker load not visible in the app’s own metrics, but real for capacity planning.

Tier profiles in brief

Smallest: one predicate or field transform at low rate; no state. Budget is mostly the ~170 MB runtime, so almost no bursting headroom. For dev/test, simple producers, low-rate routers. Avoid anything that may burst past a few thousand msg/s.

Small: light stateless at moderate sustained rates. Buys headroom, not new capability. Avoid stateful work or anything needing >1 core sustained.

Medium: the right start for most apps — filter + transform, JSON, regex, routing, a single non-windowed aggregate. Carries light state and heavier per-record logic at once. Avoid windowed CEP, multi-stream joins, or large state.

Large: windowed aggregation, stream–table joins, multi-stage CEP, meaningful state cardinality, or peaks beyond 1–2 cores. First tier where state-store buffering (not in-flight records) dominates memory. Avoid sustained >~50k msg/s or very large state.

Extra Large: very high-throughput stateful work, large windowed state, multi-stream joins. Here the question becomes "how many instances," not "how much bigger." Beyond this, scale horizontally by adding instances and partitions.

Worked example: simple filter vs complex event processing

The same input topic at the same rate can need very different sizes depending on the pipeline.

A stateless filter — one predicate per record:

pipelines:
  drop_low_value:
    from: events
    via:
      - type: filter
        if:
          expression: value.get("score") > 50
    to: filtered_events

A CEP pipeline on the same input (filter, group by key, aggregate in a 5-minute tumbling window, Python pattern detection across the windowed state, filter again):

pipelines:
  windowed_pattern:
    from: events
    via:
      - type: filter
        if: { expression: value.get("score") > 50 }
      - type: groupByKey
      - type: windowedBy
        windowedBy: { windowType: time, duration: PT5M }
      - type: aggregate
        store: pattern_state
      - type: mapValues
        mapper: detect_pattern
      - type: filter
        if: { expression: value.get("alert") == True }
    to: alerts
Table 2. Measured peak JVM total and CPU (small payload)
Simple filter Windowed aggregate

Input topic, rate, partitions, threads

identical

identical

Peak total, low load

~370 MB

~460 MB

Peak total, high load

~665 MB

~823 MB

CPU, low load

~25% core

~26% core

CPU, high load

~55% core

~87% core

State store

none

grows with keys until retention expires

Tier, low load

Smallest

Medium

Tier, high load

Small / Medium

Large / Extra Large

Pipeline complexity alone drives a multi-tier difference even at identical rate, partitions, and threads. Within either pipeline, the right tier depends on peak load: sizing for low load and then running high exhausts the headroom. Provision the windowed pipeline with a state-store volume sized for its key cardinality.

Diagnosing resource constraints

In production, the right response to a resource problem is rarely just "bigger." A larger size fixes only a genuine CPU/memory shortage on an instance already using its parallelism well. First identify what is constrained and why.

Symptom Likely cause Response

Heap >80%, stable under load

Tier memory genuinely insufficient

Next tier up

Heap >80% and rising continuously

State store not cleaned up (no TTL/retention, unbounded keys)

Fix cleanup; a bigger size only delays failure

Out-of-memory terminations

Insufficient memory

Next tier up

CPU >70% on one thread, others idle

Threads outnumber assigned partitions

Align num.stream.threads with partitions, or add partitions

CPU >70% across all threads, all partitions assigned

Genuine single-instance saturation

Bigger tier, or scale out

CPU >70%, partitions per instance at the parallelism limit

One instance can’t get more parallelism

Scale out; bigger won’t help past partition count

Lag rising, CPU low

Network bottleneck, slow downstream, or producer stall

Investigate network/downstream; size changes nothing

Lag rising, high CPU at peaks only

Burst exceeds single-instance capacity

Bigger tier for headroom, or scale out

State-store volume >80%

Volume undersized or state unbounded

If stable, grow the volume; if growing, fix cleanup first

GC >1/min sustained

Heap pressure

More memory, or cut allocation in the Python logic

Pod CPU throttling

Hitting the CPU limit

Increase the CPU tier; this is what throttling means

All metrics healthy, lag still rising

Producers outpace total consumer capacity

Scale out: more instances and partitions

Increase the size when CPU or memory is at its limit under steady-state (not bursty) load and stable rather than growing, or when pod CPU throttling appears.

Do not increase the size when the constraint is parallelism (align threads/partitions), the network (scale out), unbounded state (fix cleanup), a memory leak (fix it), or one saturated thread while others idle (fix topology/partitioning).

Scale out instead when per-instance network bandwidth is the bottleneck (above ~100 MB/s on gigabit), you’ve hit the input topic’s partition count and need more parallelism (add partitions too), HA needs a redundant instance, or peak load needs more than four cores (two smaller instances beat one large one).

Optimise first where you can: precompute lookups in global init and avoid object creation in hot paths; use an efficient serialisation format (e.g. Avro over verbose text); add window retention/TTL to bound state; filter unwanted records in the first stage, before expensive processing.

Reduce the size when, for at least seven consecutive days: CPU stays below 30% at peak, heap below 50% of the limit, lag near zero, and memory shows no upward trend.

Scope and limitations

  • Throughput, CPU, and data-volume ranges are rule-of-thumb starting points, not guarantees; achievable figures depend on payload size, serialisation format, per-record logic, and infrastructure. Validate against your own workload before treating any figure as a hard limit.

  • The guide uses consumer lag as the signal of whether an instance is keeping up, not end-to-end latency. Apps with strict latency targets (e.g. P99) should validate against those directly, with extra memory headroom on stateful tiers for GC pauses.

  • Standby replicas (num.standby.replicas) aren’t exposed in the size selector; if enabled for HA, double the per-instance memory and state-volume footprint in planning.

  • Network bandwidth binds before CPU at very high throughput; beyond ~100 MB/s per instance on gigabit, scale out rather than size up.

References

Primarily our own benchmark of KSML across ten processing patterns (memory, CPU, consumer lag under sustained load). The general relationships — stateful and windowing operations cost far more than stateless, and resource use rises with load — are also documented in published Kafka Streams research: