This page is the alert reference for Lenses.
Alert | Alert Identifier | Description | Category | Instance | Severity |
---|---|---|---|---|---|
Kafka Broker is down | 1000 | Raised when the Kafka broker is not part of the cluster for at least 1 minute, e.g. host-1, host-2. | Infrastructure | brokerID | INFO, CRITICAL |
Zookeeper Node is down | 1001 | Raised when the Zookeeper node is not reachable. This information is based on the Zookeeper JMX: if the node responds to JMX queries, it is considered to be running. | Infrastructure | service name | INFO, CRITICAL |
Connect Worker is down | 1002 | Raised when the Kafka Connect worker has not responded to the API call for /connectors for more than 1 minute. | Infrastructure | worker URL | MEDIUM |
Schema Registry is down | 1003 | Raised when the Schema Registry node has not responded to the root API call for more than 1 minute. | Infrastructure | service URL | HIGH, INFO |
Under replicated partitions | 1005 | Raised when there are (topic, partition) pairs that do not meet the configured replication factor. | Infrastructure | partitions | HIGH, INFO |
Partitions offline | 1006 | Raised when there are partitions which do not have an active leader. These partitions are not writable or readable. | Infrastructure | brokers | HIGH, INFO |
Active Controllers | 1007 | Raised when the number of active controllers is not 1. Each cluster should have exactly one controller. | Infrastructure | brokers | HIGH, INFO |
Multiple Broker Versions | 1008 | Raised when brokers in the cluster are running different Kafka versions. | Infrastructure | brokers versions | HIGH, INFO |
File-open descriptors high capacity on Brokers | 1009 | Raised when a broker has too many open file descriptors. | Infrastructure | brokerID | HIGH, INFO, CRITICAL |
Average % the request handler is idle | 1010 | Raised when the average fraction of time the request handler threads are idle falls below a threshold. When the value is smaller than 0.02 the alert level is CRITICAL. When the value is smaller than 0.1 the alert level is HIGH. | Infrastructure | brokerID | HIGH, INFO, CRITICAL |
Fetch requests failure | 1011 | Raised when the rate of failed Fetch requests (per second) is greater than a threshold. If the value is greater than 0.1 the alert level is set to CRITICAL; otherwise it is set to HIGH. | Infrastructure | brokerID | HIGH, INFO, CRITICAL |
Produce requests failure | 1012 | Raised when the rate of failed Produce requests (per second) is greater than a threshold. If the value is greater than 0.1 the alert level is set to CRITICAL; otherwise it is set to HIGH. | Infrastructure | brokerID | HIGH, INFO, CRITICAL |
Broker disk usage is greater than the cluster average | 1013 | Raised when the Kafka broker disk usage is greater than the cluster average. The default threshold is 1GB of disk usage. | Infrastructure | brokerID | MEDIUM, INFO |
Leader Imbalance | 1014 | Raised when the Kafka broker has more leader replicas than the cluster average. | Infrastructure | brokerID | INFO |
Consumer Lag exceeded | 2000 | Raised when the consumer lag exceeds the threshold on any partition. | Consumers | topic | HIGH, INFO |
Connector deleted | 3000 | Raised when a connector is deleted. | Kafka Connect | connector name | INFO |
Topic has been created | 4000 | Raised when a new topic is added. | Topics | topic | INFO |
Topic has been deleted | 4001 | Raised when a topic is deleted. | Topics | topic | INFO |
Topic data has been deleted | 4002 | Raised when records are deleted from a topic. | Topics | topic | INFO |
Data Produced | 5000 | Raised when the data produced on a topic does not match the expected threshold. | Data Produced | topic | LOW, INFO |
Connector Failed | 6000 | Raised when a connector, or any worker in a connector, is down. | Apps | connector | LOW, INFO |
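The severity thresholds quoted for alerts 1010, 1011 and 1012 can be read as a small decision function. The sketch below is only an illustration of that reading, not Lenses code: the function names and the `threshold` parameter are hypothetical, and the value of the "threshold" that triggers the failed-request alerts is not stated on this page, so it is left as an argument.

```python
from typing import Optional


def request_handler_idle_severity(idle_fraction: float) -> Optional[str]:
    """Alert 1010: lower idle fraction means busier request handler threads."""
    if idle_fraction < 0.02:
        return "CRITICAL"
    if idle_fraction < 0.1:
        return "HIGH"
    return None  # no alert at this level


def failed_request_severity(failed_per_sec: float, threshold: float) -> Optional[str]:
    """Alerts 1011/1012: failed Fetch or Produce requests per second.

    `threshold` is an assumed, caller-supplied trigger value.
    """
    if failed_per_sec <= threshold:
        return None  # rate is not above the trigger threshold
    return "CRITICAL" if failed_per_sec > 0.1 else "HIGH"
```

For example, an idle fraction of 0.05 maps to HIGH, while a failed-request rate of 0.2 per second maps to CRITICAL for any trigger threshold below 0.2.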
This section describes how to configure alerting in Lenses.
Alert rules are configurable in Lenses, and the alerts they generate can be sent to specific channels. Several integration points are available for channels.
Infrastructure alerts are a set of built-in alerting rules for the core connections: Kafka, Schema Registry, Zookeeper, and Kafka Connect. See infrastructure health.
Data produced rules are user-defined alerts on the amount of data a topic receives over time. Users can choose to be notified when the topic receives either more than or less than the expected threshold.
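As a rough illustration of the "more than" / "less than" choice, a data produced rule can be modelled as a comparison against a threshold. This is a sketch under assumptions, not the Lenses rule format: `DataProducedRule`, its field names, and the unit of the threshold (messages per evaluation window) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataProducedRule:
    topic: str
    condition: str  # hypothetical values: "more_than" or "less_than"
    threshold: int  # assumed unit: messages per evaluation window


def should_alert(rule: DataProducedRule, observed: int) -> bool:
    """Return True when the observed amount violates the rule."""
    if rule.condition == "more_than":
        return observed > rule.threshold
    if rule.condition == "less_than":
        return observed < rule.threshold
    raise ValueError(f"unknown condition: {rule.condition!r}")
```

A "less than" rule is the usual way to catch a producer that has gone quiet, while a "more than" rule catches unexpected bursts.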
Consumer rules alert on consumer group lag. Users can define:
- a lag
- on a topic
- for a consumer group
- which channels to send an alert to
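The four parts of a consumer rule map naturally onto a small record, and the check itself follows the alert reference: the alert fires when lag exceeds the threshold on any partition. The sketch below assumes hypothetical names (`ConsumerLagRule`, `partitions_over_lag`) and takes per-partition lag as a plain dictionary; it does not show how Lenses obtains or stores these values.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConsumerLagRule:
    group: str
    topic: str
    max_lag: int
    channels: List[str] = field(default_factory=list)


def partitions_over_lag(rule: ConsumerLagRule, lag_by_partition: Dict[int, int]) -> List[int]:
    """Return the partitions whose lag exceeds the rule's threshold."""
    return sorted(p for p, lag in lag_by_partition.items() if lag > rule.max_lag)


def should_alert(rule: ConsumerLagRule, lag_by_partition: Dict[int, int]) -> bool:
    """One partition over the threshold is enough to raise the alert."""
    return bool(partitions_over_lag(rule, lag_by_partition))
```

Returning the offending partitions, rather than only a boolean, makes it easier to include useful detail in whatever message is sent to the channels.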
Lenses allows operators to configure alerting on Connectors. Operators can:
- Set channels to send alerts to
- Enable auto restart of connector tasks. Lenses will restart failed tasks with a grace period.
The sequence is:
1. Lenses watches for task failures.
2. If a task fails, Lenses will restart it.
3. If the restart is successful, Lenses resets the "restart attempts" back to zero.
4. If the restart is not successful, Lenses increments the restart attempts, waits for the grace period, and tries another restart if the task is still in a failed state.
5. Step 4 is repeated until the configured number of restart attempts is reached. After that, Lenses will only reset the restart attempts to zero once the tasks have been brought back to a healthy state by manual intervention.
The number of times Lenses attempts a restart is taken from the entry in the alert settings.
The restart attempts can be tracked in the Audits page.
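The restart sequence above can be sketched as a bounded retry loop with a grace period. This is a minimal model of the described behaviour, not the Lenses implementation: `TaskRestarter`, `FakeTask`, and the parameter names are hypothetical, and the behaviour when a task recovers by itself during the grace period is an assumption (the loop simply stops restarting).

```python
import time
from typing import Callable


class TaskRestarter:
    def __init__(self, max_attempts: int, grace_period_s: float,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_attempts = max_attempts
        self.grace_period_s = grace_period_s
        self.sleep = sleep
        self.attempts = 0  # the "restart attempts" counter

    def handle_failure(self, is_failed: Callable[[], bool],
                       restart: Callable[[], None]) -> bool:
        """Call when a task failure is observed; True means the task recovered."""
        while self.attempts < self.max_attempts:
            if not is_failed():
                return True  # assumed: recovered without a further restart
            restart()
            if not is_failed():
                self.attempts = 0  # successful restart resets the counter
                return True
            self.attempts += 1
            self.sleep(self.grace_period_s)  # wait before the next attempt
        # Attempts exhausted: the counter stays put until manual intervention.
        return not is_failed()


class FakeTask:
    """Stand-in task that becomes healthy on its N-th restart."""

    def __init__(self, succeed_on: int) -> None:
        self.succeed_on = succeed_on
        self.restarts = 0
        self.failed = True

    def is_failed(self) -> bool:
        return self.failed

    def restart(self) -> None:
        self.restarts += 1
        if self.restarts >= self.succeed_on:
            self.failed = False
```

Passing `sleep` as a parameter keeps the grace period out of the way when exercising the loop, for example `TaskRestarter(5, 30.0, sleep=lambda s: None)` with a `FakeTask`.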
To view events, go to Admin -> Alerts -> Events.