Query performance on Apache Kafka with SQL


There are many reasons why one would want to explore or query the data using SQL on Apache Kafka. They can range from democratizing and socializing your data (and deliver better data experiences) to debugging a specific transaction or device identifier.

With the native SQL queries that comes with Lenses, that capability is available over any data format (AVRO, XML, JSon, ProtoBuf), and is governed via a data-centric and field-centric security model.

Testing environment 

As a testing environment, we are using 5 x Apache Kafka brokers within a DMZ zone, with 1 x instance of Lenses serving traffic over a secure (HTTPS) connection.

Apache Kafka SQL Query performance - test environment

SQL queries on Kafka topics with Avro messages 

For the first exploration, we will be running a SQL query over a Kafka topic that contains AVRO serialized data in both the key and the value of the messages. The total payload of the data is 500 bytes when un-serialized.

We will be fetching 100K messages, thus downloading a chunk of 50MB of data over the internet by using a simple SQL query over a Kafka topic:

Querying 100K AVRO messages from Apache Kafka

Executing this query takes 18 seconds

Lenses Configuration

Let’s check our Prometheus monitoring to review the performance further. We notice network traffic increased with a peak at ~6MB/sec. CPU usage increased by 30% during the execution of the query. This is because of the deserializing of more than 5K Avro messages/sec, the JVM memory usage, and GC time slightly increased due to the intermediate object creation.

Prometheus while querying 100K AVRO messages with SQL

SQL queries on Kafka topics with JSON messages 

For the second case, we will run a SQL query over a Kafka topic that contains JSON serialized data. The total payload of the data is 1000+ bytes when un-serialized.

We will be fetching 50K messages, thus downloading a chunk of ~50MB of data over the internet by using simple SQL query over a Kafka topic:

Querying 50K JSON messages from Apache Kafka with SQL

Executing this query takes 15 seconds

Querying 50K JSON messages from Apache Kafka with SQL

Let’s check our Prometheus monitoring to review the performance further. We notice network traffic increased with a peak at ~2MB/sec. CPU usage increased by ~50% during the execution of the query. This is because of the deserializing of more than ~3.3K JSON messages/sec, the JVM memory usage, and GC time slightly increased due to the intermediate object creation.

Prometheus while querying 50K AVRO messages with SQL

Summary 

While working remotely, in this performance article, we analyzed how to query data on Apache Kafka using SQL queries. It highlights how to deliver secure data access to a real-time environment, and how one can execute queries that run up to 6MB / sec ( about 400% faster than a Netflix stream ) over AVRO data.

Note that the analysis was done over the internet and on a live Apache Kafka cluster. Performance characteristics will also depend on cluster configuration and whether the data is cached, so it might vary.

Technical product notes 

For the above article we used the April 2020 release of Lenses 3.0.9 and 3.1.1

--
Last modified: March 20, 2024