Lenses Alerting Capabilities

Monitoring the infrastructure is not complete without the option to set rules for raising alerts, so that you can operate your platform with confidence. One of the most requested notifications is for consumer lag. Lenses makes it easy to set up conditions for consumer lag alerts; it is as simple as declaring:

lag >= 1000 on group live-stats and topic site_events

For the condition above, an alert is raised when the group live-stats has a lag of at least 1000 for the topic site_events. Since the Kafka consumer's offset commit interval is not known, Lenses works out when it should raise the alert, and thus avoids raising the same notification repeatedly as the value fluctuates.
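The fluctuation handling described above can be pictured as a small state machine that notifies only when the condition goes from "not met" to "met", rather than on every sample that satisfies it. The following is a minimal sketch of that idea in Python, not Lenses' actual implementation; the class LagAlertState and its observe method are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class LagAlertState:
    """Remembers whether a lag alert is currently firing (hypothetical helper)."""

    threshold: int
    firing: bool = False

    def observe(self, lag: int) -> bool:
        """Return True only when the condition newly becomes true."""
        breached = lag >= self.threshold
        should_notify = breached and not self.firing
        self.firing = breached  # remember the state for the next sample
        return should_notify


state = LagAlertState(threshold=1000)
# The lag crosses the threshold, stays above it, drops, then crosses again.
notifications = [state.observe(lag) for lag in (200, 1000, 1500, 1200, 900, 1100)]
```

With these samples only the second and the last observation would trigger a notification; the values in between stay above the threshold and are therefore suppressed.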

Lenses provides a few alerts out of the box, grouped by their specific area, for example Infrastructure or Consumers. An alert can be static (condition-less, e.g. Kafka Broker is down) or dynamic (condition-based, e.g. Consumer Lag). For alerts that allow dynamic conditions, notifications are raised only when the condition is met.

The list of the alerting options that are provided out of the box will grow with time.

Alerts

With this version, the platform supports 4 different types of alerts:

  • Infrastructure
  • Consumers
  • Kafka Connect
  • Topics

Infrastructure

These alerts target the services that make up the Kafka ecosystem. Here is the list of available alerts:

  • Kafka Broker is down: Raised when the Kafka broker has not been part of the cluster for at least 1 minute, e.g. host-1,host-2.
  • Zookeeper Node is down: Raised when the Zookeeper node is not reachable. This information is based on the Zookeeper JMX. If the node responds to JMX queries, it is considered to be running.
  • Schema Registry is down: Raised when the Schema Registry node has not responded to the root API call for more than 1 minute.
  • Alert Manager is down: Raised when the AlertManager has not been seen online for more than 1 minute.
  • Connect worker is down: Raised when the Kafka Connect worker has not responded to the API call for /connectors for more than 1 minute.
  • Under replicated partitions: Raised when there are (topic, partition) pairs not meeting the configured replication factor.
  • Partitions offline: Raised when there are partitions which do not have an active leader. These partitions are not writable or readable.
  • Active controllers: Raised when the number of active controllers is not 1. Each cluster should have exactly one controller.
  • Multiple Broker Versions: Raised when brokers in the cluster are running different Kafka versions.
  • File-open descriptors: Raised when a Kafka Broker's open file descriptor count exceeds 90% of the file descriptors available from the operating system.
  • Average % the request handler is idle: Raised when the average fraction of time the request handler threads are idle is low. When the value is smaller than 0.02 the alert level is CRITICAL. When the value is smaller than 0.1 the alert level is HIGH.
  • Fetch requests failure: Raised when the rate of failed Fetch requests (per second) is greater than a threshold. If the value is greater than 0.1 the alert level is set to CRITICAL; otherwise it is set to HIGH.
  • Produce requests failure: Raised when the rate of failed Produce requests (per second) is greater than a threshold. If the value is greater than 0.1 the alert level is set to CRITICAL; otherwise it is set to HIGH.
  • Broker disk usage: Raised when the Kafka Broker disk usage is greater than the cluster average. A default threshold of 1 GB of disk usage is provided.
  • Leader imbalance: Raised when the Kafka Broker has more leader replicas than the cluster average.
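
The numeric thresholds quoted for the request handler idle ratio and the failed request rates can be read as a simple mapping from a metric value to an alert level. The sketch below only restates those rules in Python to make them concrete; the function names are hypothetical, and because the text does not give the rate at which a failed-request alert is first raised, that lower bound is passed in as an assumed parameter.

```python
from typing import Optional


def idle_ratio_level(avg_idle: float) -> Optional[str]:
    """Alert level for the average request handler idle fraction (0.0 to 1.0)."""
    if avg_idle < 0.02:
        return "CRITICAL"
    if avg_idle < 0.1:
        return "HIGH"
    return None  # handlers have enough idle time, no alert


def failed_request_level(failed_per_sec: float, raise_threshold: float) -> Optional[str]:
    """Alert level for the failed Fetch or Produce request rate (per second).

    raise_threshold is an assumption: the rate above which the alert is
    raised at all is not stated in the text.
    """
    if failed_per_sec <= raise_threshold:
        return None
    return "CRITICAL" if failed_per_sec > 0.1 else "HIGH"
```

For example, an idle fraction of 0.05 falls in the HIGH band, while a failed fetch rate of 0.2 per second is CRITICAL.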

Consumers

Currently, this type of alert supports monitoring consumer lag only. The user has full control over the behavior by adding conditions to the alert. To raise an alert when the lag of the live-stats consumer group on the site_events topic is at least 1000, all that is required is the following condition:

lag >= 1000 on group live-stats and topic site_events
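
To show what such a condition expresses, the sketch below parses the expression and evaluates it against a set of offsets. It assumes the common definition of consumer lag (log end offset minus committed offset, summed over the partitions of the topic); how Lenses aggregates lag is not specified here. The grammar is inferred from the single example above, and parse_condition, total_lag, and the offset values are hypothetical.

```python
import operator
import re
from typing import Dict, NamedTuple

# Grammar inferred from the one example in the text; the operators that
# Lenses really accepts are an assumption.
_CONDITION = re.compile(
    r"^lag\s*(>=|>)\s*(\d+)\s+on\s+group\s+(\S+)\s+and\s+topic\s+(\S+)$"
)
_OPS = {">=": operator.ge, ">": operator.gt}


class LagCondition(NamedTuple):
    op: str
    threshold: int
    group: str
    topic: str

    def is_met(self, lag: int) -> bool:
        """Apply the comparison operator to the observed lag."""
        return _OPS[self.op](lag, self.threshold)


def parse_condition(text: str) -> LagCondition:
    """Turn the textual condition into its four components."""
    match = _CONDITION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {text!r}")
    op, threshold, group, topic = match.groups()
    return LagCondition(op, int(threshold), group, topic)


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum (log end offset - committed offset) over all partitions.

    A partition without a committed offset is treated as committed at 0,
    which is a simplification made for this sketch.
    """
    return sum(max(end - committed.get(p, 0), 0) for p, end in end_offsets.items())


condition = parse_condition("lag >= 1000 on group live-stats and topic site_events")
# Hypothetical offsets for two partitions of site_events.
lag = total_lag({0: 5200, 1: 4800}, {0: 4700, 1: 4300})
alert_needed = condition.is_met(lag)
```

Each of the two partitions is 500 messages behind, so the total lag is exactly 1000 and the condition is met because the comparison is inclusive.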

Kafka Connect

Sometimes you need to be alerted when a Kafka Connect connector has been deleted. The Connector deleted alert provides this.

Topics

To receive an alert whenever a Kafka topic is added or removed, use the Topics alert. The alert can be easily disabled via the user interface if it is not needed.

Store all the alerts

Lenses stores its alerts in a Kafka topic. For this, the lenses.topics.alerts.storage configuration entry needs to be present. Since the data is stored in a topic, Lenses SQL can be used to gain insight into the alerts raised at any point in time.

Manage the alerts

The Lenses web UI allows the user to enable or disable specific alerts, and to define conditions for the alerts that support them.

../../_images/alert_manager.png

Adding a new condition to a dynamic alert only requires filling in a friendly syntax. Here is how a consumer lag condition is set up:

../../_images/consumer_lag.png

Note

As with any Lenses capability, you can control alerts programmatically via the Alerts REST API.

Alert Manager Integration

The platform integrates with AlertManager (Prometheus). When you set the lenses.alert.manager.endpoints configuration key to a comma-separated list of webhooks, all alerts raised by Lenses are forwarded to AlertManager.
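
As a rough illustration of what forwarding involves, the sketch below splits a comma-separated endpoint value and builds a JSON body in the shape accepted by the Alertmanager v2 alerts API (an array of objects carrying labels and annotations). How Lenses itself formats and sends its alerts is not described here; the URLs, label names, alert level, and helper functions are all hypothetical.

```python
import json
from typing import Dict, List


def split_endpoints(value: str) -> List[str]:
    """Split a comma-separated endpoint setting, dropping blank entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


def build_payload(alert_name: str, level: str, summary: str) -> str:
    """Serialize one alert as a JSON array with labels and annotations."""
    alert: Dict[str, Dict[str, str]] = {
        "labels": {"alertname": alert_name, "severity": level},
        "annotations": {"summary": summary},
    }
    return json.dumps([alert])


# Hypothetical value for lenses.alert.manager.endpoints
endpoints = split_endpoints("http://am-1:9093, http://am-2:9093")
# Hypothetical alert content; the level is chosen only for illustration.
body = build_payload("Kafka Broker is down", "CRITICAL", "host-1 left the cluster")
```

A forwarder built on this would send the same body to every entry in endpoints, so that each AlertManager instance receives the alert.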

See the AlertManager integration section for details.