Alerting

This section describes how to configure alerting in Lenses.

Alert rules are configurable in Lenses, and the alerts they generate can be sent to specific channels. Several integration points are available for channels.

Infrastructure alerts

These are a set of built-in alerting rules for the core connections: Kafka, Schema Registry, Zookeeper, and Kafka Connect. See infrastructure health.

Data produced alerts

Data produced alerts are user-defined rules on the amount of data a topic receives over time. Users can choose to be notified if the topic receives either:

more than a given threshold
or less than a given threshold
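
The rule described above can be sketched as a small threshold check. This is only an illustration of the "more than" / "less than" choice, not Lenses code: the names `DataProducedRule`, `Comparison`, `should_alert`, and all field names are hypothetical, and the unit of "amount" (messages here) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Comparison(Enum):
    MORE_THAN = "more_than"
    LESS_THAN = "less_than"


@dataclass(frozen=True)
class DataProducedRule:
    # Hypothetical fields; these are not Lenses configuration keys.
    topic: str
    comparison: Comparison
    threshold: int        # amount of data expected within the window
    window_seconds: int   # length of the observation window


def should_alert(rule: DataProducedRule, amount_in_window: int) -> bool:
    """Return True when the observed amount violates the rule."""
    if rule.comparison is Comparison.MORE_THAN:
        return amount_in_window > rule.threshold
    return amount_in_window < rule.threshold
```

For example, a "less than 100 messages per minute" rule on a topic would fire when only 40 messages arrive within the window, and stay silent at exactly 100.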

Consumer lag alerts

Consumer rules alert on consumer group lag. Users can define:

a lag
on a topic
for a consumer group
which channels to send an alert to
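
As a rough sketch of how such a rule could be evaluated, the snippet below computes per-partition lag as the log end offset minus the committed offset and reports the partitions that exceed the rule's lag. The `ConsumerLagRule` type, its field names, and the helper functions are hypothetical and only illustrate the four settings listed above; they are not the Lenses API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConsumerLagRule:
    # Hypothetical fields mirroring the settings listed above.
    group: str
    topic: str
    max_lag: int
    channels: List[str]


def partition_lags(end_offsets: Dict[int, int],
                   committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: log end offset minus committed offset, floored at 0."""
    return {p: max(end - committed.get(p, 0), 0)
            for p, end in end_offsets.items()}


def breaching_partitions(rule: ConsumerLagRule,
                         end_offsets: Dict[int, int],
                         committed: Dict[int, int]) -> List[int]:
    """Partitions whose lag exceeds the rule threshold, in ascending order."""
    lags = partition_lags(end_offsets, committed)
    return sorted(p for p, lag in lags.items() if lag > rule.max_lag)
```

In this sketch a non-empty result from `breaching_partitions` would mean the rule fires and an alert goes to every entry in `rule.channels`.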

Application alerts

Lenses allows operators to configure alerting on Connectors. Operators can:

Set channels to send alerts to
Enable auto restart of connector tasks. Lenses will restart failed tasks with a grace period.

The sequence is:

1. Lenses watches for task failures.
2. If a task fails, Lenses restarts it.
3. If the restart is successful, Lenses resets the "restart attempts" counter to zero.
4. If the restart is not successful, Lenses increments the restart attempts, waits for the grace period, and tries another restart if the task is still in a failed state.

Step 4 is repeated until the maximum number of restart attempts is reached. After that, Lenses only resets the restart attempts to zero once the tasks have been brought back to a healthy state by manual intervention.
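
The restart sequence can be sketched as a bounded retry loop. This is a minimal illustration under stated assumptions, not the Lenses implementation: `handle_failed_task` and its parameters are hypothetical, and the task is modelled by two callables supplied by the caller.

```python
import time
from typing import Callable


def handle_failed_task(is_failed: Callable[[], bool],
                       restart: Callable[[], None],
                       max_attempts: int,
                       grace_period_s: float,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Retry restarts up to max_attempts; return the attempts counter.

    A return value of 0 means a restart succeeded and the counter was reset.
    A return value equal to max_attempts means the limit was reached and,
    per the text above, manual intervention is needed.
    """
    attempts = 0
    while is_failed() and attempts < max_attempts:
        restart()
        if not is_failed():
            attempts = 0           # successful restart: reset the counter
            break
        attempts += 1              # failed restart: count it
        sleep(grace_period_s)      # wait for the grace period before retrying
    return attempts
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would normally leave the default.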

The number of times Lenses attempts to restart is based on the entry in the alert setting.

The restart attempts can be tracked in the Audits page.

Viewing alert events

To view events go to Admin -> Alerts -> Events.

Alert Reference

This page describes the alert references for Lenses.

| Alert | Alert Identifier | Description | Category | Instance | Severity |
| --- | --- | --- | --- | --- | --- |
| Kafka Broker is down | 1000 | Raised when the Kafka broker is not part of the cluster for at least 1 minute. i.e. host-1, host-2 | Infrastructure | brokerID | INFO, CRITICAL |
| Zookeeper Node is down | 1001 | Raised when the Zookeeper node is not reachable. This information is based on the Zookeeper JMX. If the node responds to JMX queries, it is considered to be running. | Infrastructure | service name | INFO, CRITICAL |
| Connect Worker is down | 1002 | Raised when the Kafka Connect worker has not responded to the API call for /connectors for more than 1 minute. | Infrastructure | worker URL | MEDIUM |
| Schema Registry is down | 1003 | Raised when the Schema Registry node has not responded to the root API call for more than 1 minute. | Infrastructure | service URL | HIGH, INFO |
| Under replicated partitions | 1005 | Raised when there are (topic, partition) pairs not meeting the configured replication factor. | Infrastructure | partitions | HIGH, INFO |
| Partitions offline | 1006 | Raised when there are partitions which do not have an active leader. These partitions are not writable or readable. | Infrastructure | brokers | HIGH, INFO |
| Active Controllers | 1007 | Raised when the number of active controllers is not 1. Each cluster should have exactly one controller. | Infrastructure | brokers | HIGH, INFO |
| Multiple Broker Versions | 1008 | Raised when brokers in the cluster are running different Kafka versions. | Infrastructure | brokers versions | HIGH, INFO |
| File-open descriptors high capacity on Brokers | 1009 | A broker has too many open file descriptors. | Infrastructure | brokerID | HIGH, INFO, CRITICAL |
| Average % the request handler is idle | 1010 | Raised when the average fraction of time the request handler threads are idle is too low. When the value is smaller than 0.02 the alert level is CRITICAL. When the value is smaller than 0.1 the alert level is HIGH. | Infrastructure | brokerID | HIGH, INFO, CRITICAL |
| Fetch requests failure | 1011 | Raised when the Fetch request rate (the value is per second) for requests that failed is greater than a threshold. If the value is greater than 0.1 the alert level is set to CRITICAL; otherwise it is set to HIGH. | Infrastructure | brokerID | HIGH, INFO, CRITICAL |
| Produce requests failure | 1012 | Raised when the Produce request rate (the value is per second) for requests that failed is greater than a threshold. If the value is greater than 0.1 the alert level is set to CRITICAL; otherwise it is set to HIGH. | Infrastructure | brokerID | HIGH, INFO, CRITICAL |
| Broker disk usage is greater than the cluster average | 1013 | Raised when the Kafka Broker disk usage is greater than the cluster average. The default threshold is 1GB of disk usage. | Infrastructure | brokerID | MEDIUM, INFO |
| Leader Imbalance | 1014 | Raised when the Kafka Broker has more leader replicas than the cluster average. | Infrastructure | brokerID | INFO |
| Consumer Lag exceeded | 2000 | Raises an alert when the consumer lag exceeds the threshold on any partition. | Consumers | topic | HIGH, INFO |
| Connector deleted | 3000 | Connector was deleted. | Kafka Connect | connector name | INFO |
| Topic has been created | 4000 | New topic was added. | Topics | topic | INFO |
| Topic has been deleted | 4001 | Topic was deleted. | Topics | topic | INFO |
| Topic data has been deleted | 4002 | Records from topic were deleted. | Topics | topic | INFO |
| Data Produced | 5000 | Raises an alert when the data produced on a topic doesn’t match the expected threshold. | Data Produced | topic | LOW, INFO |
| Connector Failed | 6000 | Raises an alert when a connector, or any worker in a connector, is down. | Apps | connector | LOW, INFO |
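
The numeric thresholds quoted for alerts 1010, 1011, and 1012 can be read as a simple mapping from a metric value to a severity. The sketch below only restates those descriptions in code; the function names are hypothetical, and the `threshold` parameter for failed requests is an assumption because the reference does not give its value.

```python
from typing import Optional


def idle_handler_severity(avg_idle_fraction: float) -> Optional[str]:
    """Alert 1010: severity from the average request handler idle fraction."""
    if avg_idle_fraction < 0.02:
        return "CRITICAL"
    if avg_idle_fraction < 0.1:
        return "HIGH"
    return None  # idle fraction is high enough, no alert


def failed_request_severity(failed_per_sec: float,
                            threshold: float) -> Optional[str]:
    """Alerts 1011 and 1012: severity from the failed request rate per second.

    `threshold` is a placeholder for the raise threshold, whose value the
    reference does not state.
    """
    if failed_per_sec <= threshold:
        return None
    return "CRITICAL" if failed_per_sec > 0.1 else "HIGH"
```

For instance, an idle fraction of 0.05 maps to HIGH, and a failed fetch rate of 0.2 per second maps to CRITICAL for any threshold below that rate.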