Consumer Time to Catch Up
The metric provides the time for consumer group to catch up on a topic/partition at any given time, i.e. number of seconds to consume published but not consumed yet messages by this consumer group.
Use cases
Is my application able to handle the amount of messages that are produced to the topic I am consuming?
It’s crucial to have consumers that can handle the load produced to the topic.
Using this metric you can see if your application has any issues with that. To do so, use basic usage (depending on your case, you can do it with or without aggregator)
If value is constant or negative and increasing, that means that your application is good, and you shouldn’t investigate it.
In other cases, there are possible problems:
-
not enough consumers
-
one of the consumers is stuck (for example, can’t deserialize a message from one of the partitions)
-
not enough partitions - parallelism feature of Kafka to increase consuming speed
Basic usage
Please refer to the example consumer time to catch up
provided in the API docs
This request is asking for the value in seconds of the stream payment-events-stream
on environment dev
for the application consumer_app
for the last 5 minutes with the step-size of 1 minute.
Basic Request
{ "metric": "io.axual.application/consumer_time_to_catch_up", "stepSize": "PT1M", "timeWindow": "PT5M", "filter": { "type": "AND", "filters": [ { "type": "FIELD", "field": "environment", "operation": "EQUALS", "value": "dev" }, { "type": "FIELD", "field": "stream", "operation": "EQUALS", "value": "payment-events-stream" }, { "type": "FIELD", "field": "applicationId", "operation": "EQUALS", "value": "payment_event_emitter" } ] } }
The response could represent positive and negative values which means:
-
positive value : the consumer is lagging behind
-
negative value : the estimate time after which the consumer will be reading the last offset message
-
zero value : the consumer is neither catching up nor lagging behind
Basic Response
{ "type": "UNGROUPED", "dataPoints": [ { "timestamp": "2022-10-24T11:16:00", "value": 0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "0", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:17:00", "value": -2502.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "0", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:18:00", "value": -372.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "0", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:19:00", "value": -30.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "0", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:16:00", "value": 0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "1", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:17:00", "value": 0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "1", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:18:00", "value": -66.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "1", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:19:00", "value": 0.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "1", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:16:00", "value": -780.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "2", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:17:00", "value": -3414.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "2", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:18:00", "value": -30.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "2", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:19:00", "value": -6.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "2", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:16:00", "value": 0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "3", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:17:00", "value": 0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "3", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:18:00", "value": -54.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "3", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:19:00", "value": -6.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "3", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:16:00", "value": 0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "4", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:17:00", "value": 0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "4", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:18:00", "value": -78.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "4", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:19:00", "value": -12.0, "labels": { "__name__": "kafka_consumergroup_lag", "axual_cluster": "jupiter", "consumergroup": "axual-qa-dev-payment_event_emitter", "partition": "4", "pod": "jupiter-kafka-exporter-6867f8ccf4-9b8fs", "topic": "axual-qa-dev-payment-events-stream" }, "unit": "Seconds" } ] }
This metric could be used to know how many seconds are left for consumers to catch up, configure alerting for situations when consumer is lagging,
Advanced usage
Using aggregator
By adding aggregator
to the request, the seconds left to consume messages will be aggregated over all partitions.
For instance asking for the avg aggregation function, will lead to get the average estimated time to catch up.
Request using avg aggregator
{ "metric": "io.axual.application/consumer_time_to_catch_up", "stepSize": "PT1M", "timeWindow": "PT5M", "groupBy": [], "aggregator": "avg", "filter": { "type": "AND", "filters": [ { "type": "FIELD", "field": "environment", "operation": "EQUALS", "value": "dev" }, { "type": "FIELD", "field": "stream", "operation": "EQUALS", "value": "payment-events-stream" }, { "type": "FIELD", "field": "applicationId", "operation": "EQUALS", "value": "payment_event_emitter" } ] } }
The below response represents the average value of seconds to consume messages of the stream on a Kafka cluster. As you can see, value decreases over time meaning that consumers are catching up.
Response using avg aggregator
{ "type": "UNGROUPED", "dataPoints": [ { "timestamp": "2022-10-24T11:18:00", "value": -120.0, "labels": {}, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:19:00", "value": -10.8, "labels": {}, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:16:00", "value": -156.0, "labels": {}, "unit": "Seconds" }, { "timestamp": "2022-10-24T11:17:00", "value": -1183.2, "labels": {}, "unit": "Seconds" } ] }
As sum aggregator shows all unconsumed messages, another option is to use max to see the most lagging replica.
Using groupBy
If you want to get response grouped by some label - you can use groupBy
Request using groupBy
{ "metric": "io.axual.application/consumer_time_to_catch_up", "stepSize": "PT1M", "timeWindow": "2022-10-24T11:14:00Z/2022-10-24T11:19:00Z", "groupBy": [ "partition" ], "aggregator": "avg", "filter": { "type": "AND", "filters": [ { "type": "FIELD", "field": "environment", "operation": "EQUALS", "value": "dev" }, { "type": "FIELD", "field": "stream", "operation": "EQUALS", "value": "payment-events-stream" }, { "type": "FIELD", "field": "applicationId", "operation": "EQUALS", "value": "payment_event_emitter" } ] } }
The below response is different from every other metrics. Data is aggregated over timestamps |
Response using groupBy
{ "type": "GROUPED", "groups": [ { "labels": { "partition": "0" }, "dataPoints": [ { "timestamp": null, "value": -726.0, "labels": { "partition": "0" }, "unit": "Seconds" } ] }, { "labels": { "partition": "1" }, "dataPoints": [ { "timestamp": null, "value": -16.5, "labels": { "partition": "1" }, "unit": "Seconds" } ] }, { "labels": { "partition": "2" }, "dataPoints": [ { "timestamp": null, "value": -1057.5, "labels": { "partition": "2" }, "unit": "Seconds" } ] }, { "labels": { "partition": "3" }, "dataPoints": [ { "timestamp": null, "value": -15.0, "labels": { "partition": "3" }, "unit": "Seconds" } ] }, { "labels": { "partition": "4" }, "dataPoints": [ { "timestamp": null, "value": -22.5, "labels": { "partition": "4" }, "unit": "Seconds" } ] } ] }