Acting on Alerts

Process

Assuming a proper monitoring solution is in place, PrometheusRules can be used to define alerts that trigger AlertManager notifications and call-outs to standby engineers to resolve issues.

A common sequence of events is the following:

  • A PrometheusRule alert triggers on a Production cluster

  • AlertManager evaluates the active alerts and routes notifications based on its configuration (see the example routing sketch at the end of this section)

  • The standby team is notified by telephone or another configured channel

  • The responsible standby engineer investigates the issue, reports it, and resolves the underlying problem.

  • Optionally, additional colleagues are called in for incident management

  • Eventually, the alerts resolve in Prometheus and, if so configured, AlertManager sends 'resolved' (OK) notifications.

This page is intended to assist the responsible standby engineer in investigating and resolving these problems.
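
For reference, the sketch below shows a minimal AlertManager routing configuration that pages the standby team for major alerts and sends warnings to a chat channel. The receiver names and webhook URLs are hypothetical placeholders; the actual routing depends entirely on your AlertManager setup. The severity values match the labels used in the PrometheusRules below.

# alertmanager.yaml (sketch) -- receiver names and URLs are placeholders
route:
  receiver: default-chat                # fallback receiver for everything else
  group_by: ['alertname', 'namespace']
  routes:
    - match:
        severity: major                 # the 'major' alerts defined in the rules below
      receiver: standby-oncall          # pages the standby engineer
      repeat_interval: 1h
receivers:
  - name: standby-oncall
    webhook_configs:
      - url: 'https://oncall.example.com/alert'        # hypothetical paging endpoint
  - name: default-chat
    webhook_configs:
      - url: 'https://chat.example.com/hooks/alerts'   # hypothetical chat webhook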

Alert details

Example Strimzi PrometheusRules

The alerts are based on the example Strimzi PrometheusRules originally found here; the full alert rule details are shown below.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    role: alert-rules
    app: strimzi
  name: prometheus-k8s-rules
spec:
  groups:
    - name: kafka
      rules:
        - alert: KafkaRunningOutOfSpace
          expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka is running out of free disk space'
            description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
        - alert: UnderReplicatedPartitions
          expr: kafka_server_replicamanager_underreplicatedpartitions > 0
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka under replicated partitions'
            description: 'There are {{ $value }} under replicated partitions on {{ $labels.kubernetes_pod_name }}'
        - alert: AbnormalControllerState
          expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka abnormal controller state'
            description: 'There are {{ $value }} active controllers in the cluster'
        - alert: OfflinePartitions
          expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka offline partitions'
            description: 'One or more partitions have no leader'
        - alert: UnderMinIsrPartitionCount
          expr: kafka_server_replicamanager_underminisrpartitioncount > 0
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka under min ISR partitions'
            description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.kubernetes_pod_name }}'
        - alert: OfflineLogDirectoryCount
          expr: kafka_log_logmanager_offlinelogdirectorycount > 0
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka offline log directories'
            description: 'There are {{ $value }} offline log directories on {{ $labels.kubernetes_pod_name }}'
        - alert: ScrapeProblem
          expr: up{kubernetes_namespace!~"openshift-.+",kubernetes_pod_name=~".+-kafka-[0-9]+"} == 0
          for: 3m
          labels:
            severity: major
          annotations:
            summary: 'Prometheus unable to scrape metrics from {{ $labels.kubernetes_pod_name }}/{{ $labels.instance }}'
            description: 'Prometheus was unable to scrape metrics from {{ $labels.kubernetes_pod_name }}/{{ $labels.instance }} for more than 3 minutes'
        - alert: ClusterOperatorContainerDown
          expr: count((container_last_seen{container="strimzi-cluster-operator"} > (time() - 90))) < 1 or absent(container_last_seen{container="strimzi-cluster-operator"})
          for: 1m
          labels:
            severity: major
          annotations:
            summary: 'Cluster Operator down'
            description: 'The Cluster Operator has been down for longer than 90 seconds'
        - alert: KafkaBrokerContainersDown
          expr: absent(container_last_seen{container="kafka",pod=~".+-kafka-[0-9]+"})
          for: 3m
          labels:
            severity: major
          annotations:
            summary: 'All `kafka` containers down or in CrashLoopBackOff status'
            description: 'All `kafka` containers have been down or in CrashLoopBackOff status for 3 minutes'
        - alert: KafkaContainerRestartedInTheLast5Minutes
          expr: count(count_over_time(container_last_seen{container="kafka"}[5m])) > 2 * count(container_last_seen{container="kafka",pod=~".+-kafka-[0-9]+"})
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'One or more Kafka containers restarted too often'
            description: 'One or more Kafka containers were restarted too often within the last 5 minutes'
    - name: zookeeper
      rules:
        - alert: AvgRequestLatency
          expr: zookeeper_avgrequestlatency > 10
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Zookeeper average request latency'
            description: 'The average request latency is {{ $value }} on {{ $labels.kubernetes_pod_name }}'
        - alert: OutstandingRequests
          expr: zookeeper_outstandingrequests > 10
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Zookeeper outstanding requests'
            description: 'There are {{ $value }} outstanding requests on {{ $labels.kubernetes_pod_name }}'
        - alert: ZookeeperRunningOutOfSpace
          expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 5368709120
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Zookeeper is running out of free disk space'
            description: 'There are only {{ $value }} bytes available at {{ $labels.persistentvolumeclaim }} PVC'
        - alert: ZookeeperContainerRestartedInTheLast5Minutes
          expr: count(count_over_time(container_last_seen{container="zookeeper"}[5m])) > 2 * count(container_last_seen{container="zookeeper",pod=~".+-zookeeper-[0-9]+"})
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'One or more Zookeeper containers were restarted too often'
            description: 'One or more Zookeeper containers were restarted too often within the last 5 minutes. This alert can be ignored when the Zookeeper cluster is scaling up'
        - alert: ZookeeperContainersDown
          expr: absent(container_last_seen{container="zookeeper",pod=~".+-zookeeper-[0-9]+"})
          for: 3m
          labels:
            severity: major
          annotations:
            summary: 'All `zookeeper` containers in the Zookeeper pods down or in CrashLoopBackOff status'
            description: 'All `zookeeper` containers in the Zookeeper pods have been down or in CrashLoopBackOff status for 3 minutes'
    - name: entityOperator
      rules:
        - alert: TopicOperatorContainerDown
          expr: absent(container_last_seen{container="topic-operator",pod=~".+-entity-operator-.+"})
          for: 3m
          labels:
            severity: major
          annotations:
            summary: 'Container topic-operator in Entity Operator pod down or in CrashLoopBackOff status'
            description: 'Container topic-operator in Entity Operator pod has been down or in CrashLoopBackOff status for 3 minutes'
        - alert: UserOperatorContainerDown
          expr: absent(container_last_seen{container="user-operator",pod=~".+-entity-operator-.+"})
          for: 3m
          labels:
            severity: major
          annotations:
            summary: 'Container user-operator in Entity Operator pod down or in CrashLoopBackOff status'
            description: 'Container user-operator in Entity Operator pod has been down or in CrashLoopBackOff status for 3 minutes'
    - name: connect
      rules:
        - alert: ConnectContainersDown
          expr: absent(container_last_seen{container=~".+-connect",pod=~".+-connect-.+"})
          for: 3m
          labels:
            severity: major
          annotations:
            summary: 'All Kafka Connect containers down or in CrashLoopBackOff status'
            description: 'All Kafka Connect containers have been down or in CrashLoopBackOff status for 3 minutes'
    - name: bridge
      rules:
        - alert: BridgeContainersDown
          expr: absent(container_last_seen{container=~".+-bridge",pod=~".+-bridge-.+"})
          for: 3m
          labels:
            severity: major
          annotations:
            summary: 'All Kafka Bridge containers down or in CrashLoopBackOff status'
            description: 'All Kafka Bridge containers have been down or in CrashLoopBackOff status for 3 minutes'
        - alert: AvgProducerLatency
          expr: strimzi_bridge_kafka_producer_request_latency_avg > 10
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka Bridge producer average request latency'
            description: 'The average producer request latency is {{ $value }} on {{ $labels.clientId }}'
        - alert: AvgConsumerFetchLatency
          expr: strimzi_bridge_kafka_consumer_fetch_latency_avg > 500
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka Bridge consumer average fetch latency'
            description: 'The average consumer fetch latency is {{ $value }} on {{ $labels.clientId }}'
        - alert: AvgConsumerCommitLatency
          expr: strimzi_bridge_kafka_consumer_commit_latency_avg > 200
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Kafka Bridge consumer average commit latency'
            description: 'The average consumer commit latency is {{ $value }} on {{ $labels.clientId }}'
        - alert: Http4xxErrorRate
          expr: strimzi_bridge_http_server_requestCount_total{code=~"^4..$", container=~"^.+-bridge", path !="/favicon.ico"} > 10
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: 'Kafka Bridge returns code 4xx too often'
            description: 'Kafka Bridge returns code 4xx too often ({{ $value }}) for the path {{ $labels.path }}'
        - alert: Http5xxErrorRate
          expr: strimzi_bridge_http_server_requestCount_total{code=~"^5..$", container=~"^.+-bridge"} > 10
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: 'Kafka Bridge returns code 5xx too often'
            description: 'Kafka Bridge returns code 5xx too often ({{ $value }}) for the path {{ $labels.path }}'
    - name: mirrorMaker
      rules:
        - alert: MirrorMakerContainerDown
          expr: absent(container_last_seen{container=~".+-mirror-maker",pod=~".+-mirror-maker-.+"})
          for: 3m
          labels:
            severity: major
          annotations:
            summary: 'All Kafka Mirror Maker containers down or in CrashLoopBackOff status'
            description: 'All Kafka Mirror Maker containers have been down or in CrashLoopBackOff status for 3 minutes'
    - name: kafkaExporter
      rules:
        - alert: UnderReplicatedPartition
          expr: kafka_topic_partition_under_replicated_partition > 0
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Topic has under-replicated partitions'
            description: 'Topic  {{ $labels.topic }} has {{ $value }} under-replicated partition {{ $labels.partition }}'
        - alert: TooLargeConsumerGroupLag
          expr: kafka_consumergroup_lag > 1000
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'Consumer group lag is too big'
            description: 'Consumer group {{ $labels.consumergroup}} lag is too big ({{ $value }}) on topic {{ $labels.topic }}/partition {{ $labels.partition }}'
        - alert: NoMessageForTooLong
          expr: changes(kafka_topic_partition_current_offset[10m]) == 0
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: 'No message for 10 minutes'
            description: 'There are no messages in topic {{ $labels.topic }}/partition {{ $labels.partition }} for 10 minutes'
    - name: certificates
      interval: 1m0s
      rules:
        - alert: CertificateExpiration
          expr: |
            strimzi_certificate_expiration_timestamp_ms/1000 - time() < 30 * 24 * 60 * 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'Certificate will expire in less than 30 days'
            description: 'Certificate of type {{ $labels.type }} in cluster {{ $labels.cluster }} in namespace {{ $labels.resource_namespace }} will expire in less than 30 days'

Kafka

KafkaRunningOutOfSpace

  • Severity: warning

  • Summary: 'Kafka is running out of free disk space'

  • Default Threshold: less than 15% free space (85% full)

  • Description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'

  • Comment: This is a potential cluster-breaking issue. Brokers will continue writing data until their disks are completely full, after which they cannot function properly: they may fail to start and cannot run housekeeping to remove data past its retention time.
    Once the disks are completely full, only dynamically resizing them will save the brokers.

  • Action:

    • Identify which topics are the largest; if these are not compaction topics, consider reducing their retention time (see the sketch after this list).

    • Then identify which producers add the most data, reach out to the responsible teams, and decide whether to increase the disks, lower retention, or change producer behaviour.

    • In case of emergency, where a producer floods the cluster, revoke that producer's ACLs on the topic to stop new data from coming in without deleting existing data.
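
As a sketch of the retention change mentioned above, retention.ms can be lowered on the affected KafkaTopic resource. The topic name, cluster label and values below are hypothetical; choose a retention that still covers how far behind your consumers may be. Note that retention only deletes whole segments, so lowering segment.ms as well makes old data eligible for deletion sooner.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-large-topic                  # hypothetical topic that is filling the disks
  labels:
    strimzi.io/cluster: my-cluster      # must match the name of your Kafka cluster
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 86400000              # lowered to 1 day as an example
    segment.ms: 3600000                 # roll segments hourly so they can be deleted sooner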

UnderReplicatedPartitions

  • Severity: warning

  • Summary: 'Kafka under replicated partitions'

  • Default Threshold: More than 0

  • Description: 'There are {{ $value }} under replicated partitions on {{ $labels.kubernetes_pod_name }}'

  • Comment:

    • Means some partitions are online but have fewer In Sync Replicas (ISRs) than expected.

    • This is a relatively common occurrence; it may happen when a broker is briefly unavailable due to a transient network issue.

    • This also occurs when a broker is offline; note that the Kafka cluster is still fully functional, but in a strained state.

    • This may be an early warning for a larger problem if not resolved automatically.

  • Action: Wait for the issue to resolve itself; if it persists for longer than 5 minutes, investigate further.

AbnormalControllerState

  • Severity: warning

  • Summary: 'Kafka abnormal controller state'

  • Default Threshold: Other than 1

  • Description: 'There are {{ $value }} active controllers in the cluster'

  • Comment: During a quick rolling restart of the brokers, combined with slow metrics scraping, two controllers may temporarily be reported.

  • Action: Wait for the issue to resolve itself; if it persists for longer than 5 minutes, investigate further.

OfflinePartitions

  • Severity: warning

  • Summary: 'Kafka offline partitions'

  • Default Threshold: More than 0

  • Description: 'One or more partitions have no leader'

  • Comment: This should never occur as long as partitions have sufficient replicas and no more than one broker is unavailable.

  • Action:

    • If multiple brokers are offline, this is a major incident and probably a loss of service. Quickly escalate and investigate why brokers are not available.

    • If only one broker is offline and this alert still triggers, identify which topics are affected and ensure they have the correct number of replicas (see the sketch after this list).
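
As a reference for the replica check above, a properly replicated topic in Strimzi looks roughly like the sketch below. The topic name and partition count are hypothetical; the point is a replication factor of at least 3 with min.insync.replicas set to 2, so that a single broker outage never takes partitions offline. Note that raising the replication factor of an existing topic may require a partition reassignment rather than a simple edit, depending on your Strimzi version.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-important-topic              # hypothetical topic that had offline partitions
  labels:
    strimzi.io/cluster: my-cluster      # must match the name of your Kafka cluster
spec:
  partitions: 6
  replicas: 3                           # survives a single broker failure
  config:
    min.insync.replicas: 2              # producers with acks=all keep working with one replica down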

UnderMinIsrPartitionCount

  • Severity: warning

  • Summary: 'Kafka under min ISR partitions'

  • Default Threshold: More than 0

  • Description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.kubernetes_pod_name }}'

  • Comment: When a partition is below its minimum number of In Sync Replicas (ISR), producers that require full acknowledgement (acks=all) can no longer write to it, which protects against data loss on broker failure.

  • Action: With a typical configuration (replication factor 3, min ISR 2), this only occurs when at least two replicas are unavailable, so it is usually part of a cluster outage. Quickly escalate and investigate why multiple brokers are unavailable.

OfflineLogDirectoryCount

  • Severity: warning

  • Summary: 'Kafka offline log directories'

  • Default Threshold: More than 0

  • Description: 'There are {{ $value }} offline log directories on {{ $labels.kubernetes_pod_name }}'

  • Comment: This may signify an issue with the underlying storage; some log directories may be missing or unreachable.

  • Action: Involve the infra team and investigate the root cause of the failure.

ScrapeProblem

  • Severity: major

  • Summary: 'Prometheus unable to scrape metrics from {{ $labels.kubernetes_pod_name }}/{{ $labels.instance }}'

  • Default Threshold: scrape target down (up == 0) for 3 minutes

  • Description: 'Prometheus was unable to scrape metrics from {{ $labels.kubernetes_pod_name }}/{{ $labels.instance }} for more than 3 minutes'

  • Comment: If metrics are not scraped, none of the alerts will work, so this is a critical issue.

  • Action:

    • Quickly check the cluster health, broker Pod status, etc. Without Prometheus, the only monitoring is done with your eyeballs, so keep them on the cluster.

    • Investigate where the problem originates; it could also be an issue with the upstream Prometheus if a 'federated' setup is used. Also verify that the PodMonitor (or ServiceMonitor) selecting the Kafka pods is still in place (see the sketch after this list).
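
If the target has simply disappeared from Prometheus, the sketch below shows roughly what a PodMonitor for the Strimzi pods looks like, loosely based on the Strimzi example PodMonitor. The namespace and the metrics port name are assumptions and may differ per setup and Strimzi version.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-resources-metrics
  labels:
    app: strimzi
spec:
  selector:
    matchExpressions:
      - key: "strimzi.io/kind"
        operator: In
        values: ["Kafka", "KafkaConnect", "KafkaMirrorMaker", "KafkaMirrorMaker2"]
  namespaceSelector:
    matchNames:
      - kafka                           # namespace in which the Kafka cluster runs (adjust)
  podMetricsEndpoints:
    - path: /metrics
      port: tcp-prometheus              # metrics port name exposed by the Strimzi pods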

KafkaBrokerContainersDown

  • Severity: major

  • Summary: 'All kafka containers down or in CrashLoopBackOff status'

  • Default Threshold: All absent

  • Description: 'All kafka containers have been down or in CrashLoopBackOff status for 3 minutes'

  • Comment: The whole Kafka cluster is down; this is a critical incident.

  • Action:

    • Escalate, the Kafka service is unavailable.

    • Look into Kubernetes Events for scheduling issues or any missing volumeMounts like Secrets.

    • This alert should never be your first alert; it is normally preceded by UnderReplicatedPartitions and OfflinePartitions.

KafkaContainerRestartedInTheLast5Minutes

  • Severity: warning

  • Summary: 'One or more Kafka containers restarted too often'

  • Default Threshold: > 2 restarts

  • Description: 'One or more Kafka containers were restarted too often within the last 5 minutes'

  • Comment: This may occur during broker updates, in which case it is not a problem.

  • Action: If this occurs outside of maintenance, check memory usage (OOMKilled Pods) and crash-looping Pods, and check the Kubernetes Events. If brokers are being OOMKilled, their memory can be raised as in the sketch below.
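
If OOMKilled brokers turn out to be the cause, the broker memory limits and JVM heap can be raised on the Kafka custom resource. This is only a sketch showing the relevant fragment; the cluster name and sizes are hypothetical, and the JVM heap should stay well below the container limit.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                      # hypothetical cluster name
spec:
  kafka:
    resources:
      requests:
        memory: 8Gi
        cpu: "2"
      limits:
        memory: 8Gi                     # raise this if brokers are OOMKilled
    jvmOptions:
      "-Xms": 4g                        # keep the heap well below the memory limit
      "-Xmx": 4g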

Zookeeper

AvgRequestLatency

  • Severity: warning

  • Summary: 'Zookeeper average request latency'

  • Default Threshold: > 10 ms

  • Description: 'The average request latency is {{ $value }} on {{ $labels.kubernetes_pod_name }}'

  • Comment: Latencies that are too high signify network or I/O issues.

  • Action: Observe for a while; involve the networking/infra team if the issue persists.

OutstandingRequests

  • Severity: warning

  • Summary: 'Zookeeper outstanding requests'

  • Default Threshold: > 10 requests

  • Description: 'There are {{ $value }} outstanding requests on {{ $labels.kubernetes_pod_name }}'

  • Comment: Outstanding requests are a rare occurrence; they may happen during broker restarts.

  • Action: Observe for a while; if the value remains high, review the Zookeeper resource usage.

ZookeeperRunningOutOfSpace

  • Severity: warning

  • Summary: 'Zookeeper is running out of free disk space'

  • Default Threshold: less than 5 GiB (5368709120 bytes) available (tip: adjust this threshold to your volume size)

  • Description: 'There are only {{ $value }} bytes available at {{ $labels.persistentvolumeclaim }} PVC'

  • Comment: Zookeeper data grows very slowly, so there should be time to act.

  • Action: Optionally prevent new topics or partitions from being created on Kafka while increasing the Zookeeper disks (see the sketch below).
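
Increasing the Zookeeper volumes is done on the Kafka custom resource; if the storage class supports volume expansion, the Cluster Operator resizes the PVCs. Only the relevant fragment is shown below, and the cluster name and size are hypothetical examples.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                      # hypothetical cluster name
spec:
  zookeeper:
    storage:
      type: persistent-claim
      size: 20Gi                        # increase from the previous size; shrinking is not supported
      deleteClaim: false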

ZookeeperContainerRestartedInTheLast5Minutes

  • Severity: warning

  • Summary: 'One or more Zookeeper containers were restarted too often'

  • Default Threshold: > 2 restarts

  • Description: 'One or more Zookeeper containers were restarted too often within the last 5 minutes. This alert can be ignored when the Zookeeper cluster is scaling up'

  • Comment: This may occur during Kafka updates, in which case it is not a problem.

  • Action: If this occurs outside of maintenance, check for memory usage (OOMKilled Pods) or crash looping Pods.

ZookeeperContainersDown

  • Severity: major

  • Summary: 'All zookeeper containers in the Zookeeper pods down or in CrashLoopBackOff status'

  • Default Threshold: All absent

  • Description: 'All zookeeper containers in the Zookeeper pods have been down or in CrashLoopBackOff status for 3 minutes'

  • Comment: Without Zookeeper, Kafka cannot perform leader elections, so this can potentially interrupt the Kafka service.

  • Action:

    • Escalate, the Kafka service is at risk.

    • Look into Kubernetes Events for scheduling issues or any missing volumeMounts like Secrets.

Strimzi

ClusterOperatorContainerDown

  • Severity: major

  • Summary: 'Cluster Operator down'

  • Default Threshold: Less than 1

  • Description: 'The Cluster Operator has been down for longer than 90 seconds'

  • Comment: If the cluster is stable, the impact is low, because Kafka can continue functioning without the Strimzi Cluster Operator being active.
    This becomes a problem as soon as changes to the cluster need to be applied.

  • Action: The Strimzi Cluster Operator Pod should be in a Ready state. Investigate whether the Pod is crash looping or cannot be scheduled by Kubernetes.
