# Querying data

When querying Kafka topic data with SQL such as

```sql
SELECT *
FROM topicA
WHERE transaction_id=123
```

a **full scan** will be executed, and the query processes the **entire data** on that topic to identify all records that match the transaction ID.

If the Kafka topic contains a billion 50KB messages - that would require querying 50 GB of data. Depending on your network capabilities, brokers’ performance, any quotas on your account, and other parameters, fetching **50 GB** of data could take some time! Even more, if the data is compressed. In the last case, the client has to decompress it before parsing the raw bytes to translate into a structure to which the query can be applied.

## Does Apache Kafka have indexing capabilities? <a href="#does-apache-kafka-have-indexing-capabilities" id="does-apache-kafka-have-indexing-capabilities"></a>

No. Apache Kafka does not have the full indexing capabilities in the payload (indexes typically come at a high cost even on an RDBMS / DB or a system like Elastic Search), however, Kafka indexes the ***metadata***.

## Can we push down predicates in Apache Kafka? <a href="#can-we-push-down-predicates-in-apache-kafka" id="can-we-push-down-predicates-in-apache-kafka"></a>

The only filters Kafka is supporting are ***topic***, ***partition*** and ***offsets*** or ***timestamp***.

### Partitions <a href="#id-1-partitions" id="id-1-partitions"></a>

```sql
SELECT *
FROM topicA
WHERE transaction_id=123
   AND _meta.partition = 1
```

If we specify in our query that we are only interested in partition 1, and for the sake of example the above Kafka topic has 50 x partitions. Then Lenses will automatically push this predicate down, meaning that we will only need to scan **1GB** instead of 50GB of data.

```sql
SELECT *
FROM topicA
WHERE transaction_id=123
  AND _meta.offset > 100
  AND _meta.offset < 100100
  AND _meta.partition = 1
```

If we specify the offset range and the partition, we would only need to scan the specific range of 100K messages resulting in scanning **5MB**.

### Timestamp <a href="#id-3-timestamp" id="id-3-timestamp"></a>

```sql
SELECT *
FROM topicA
WHERE transaction_id=123
  AND _meta.timestamp > NOW() - "1H"
```

The above will query only the data added to the topic up to 1 hour ago. Thus we would query just **10MB**.

### Time-traveling <a href="#id-4-time-traveling" id="id-4-time-traveling"></a>

```sql
SELECT *
FROM position_reports
WHERE
   _meta.timestamp > "2024-04-01" AND
   _meta.timestamp < "2026-04-02"
```

The above will query only the data that have been added to the Kafka topic on a specific day. If we are storing 1,000 days of data, we would query just **50MB**.

you could also filter on time with :

```sql
SELECT * from mytopic
WHERE 
    _meta.timestamp > '2024-10-16 14:30:00' AND 
    _meta.timestamp < "2025-04-02 14:30:00"
```

## How can I have 100s of queries without impacting my cluster? <a href="#how-can-i-have-100s-of-queries-without-impacting-my-cluster" id="how-can-i-have-100s-of-queries-without-impacting-my-cluster"></a>

Lenses provides a set of rules for

* termination control
* resource protection
* query management

### Termination Control <a href="#termination-control" id="termination-control"></a>

```sql
SELECT * FROM topicA WHERE _key.deviceId=123 LIMIT 10
```

Adding a **LIMIT 10** in the SQL query will result in the SQL terminating early, as soon as 10 x messages have been discovered. It’s not a perfect solution as we might never find 10 x messages, and thus perform a full scan.

{% hint style="info" %}
Add LIMIT to your query to have quick and efficient query completion
{% endhint %}

```sql
SET max.query.time = 30s;
```

One can control the maximum time a SQL query will run for. The admin can set up a default value, and a user can override it.

```sql
SET max.size = 1M;
```

{% hint style="info" %}
Complete doc on [execution can be found here](https://github.com/lensesio-dev/user-guide-docs/blob/6.2/using/sql-studio/best-practices.md#control-execution)
{% endhint %}

One can control the maximum bytes the SQL query will fetch. The admin can set up a **default** value, but a more advanced user can override it.

```sql
SET max.idle.time = 5s;
```

The above will make sure the query terminates after 5 seconds of reaching the end of the topic. The admin can set up a **default** value. The idea is that there is no reason to keep polling if we have exhausted the entire topic.

### Resource Protection <a href="#resource-protection" id="resource-protection"></a>

The complete set of SQL Queries on Apache Kafka are currently being executed under a specific client-id `lenses.sql.engine` and an admin can apply a global Kafka quota to restrict the maximum total network I/O.

By adding a Quota on your Kafka cluster under the `lenses.sql.engine` CLIENT name, you can also control the global network I/O that is allocated to all users querying Kafka data with SQL.

### Query management <a href="#query-management" id="query-management"></a>

An admin can view all active queries with:

```sql
SHOW QUERIES
```

and control them, i.e., stop a running query with the following statement

```sql
KILL QUERY <id>
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.lenses.io/latest/user-guide/using/tutorials/sql-studio/querying-data.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
