Filtering
This page describes common ways to filter Kafka data with Lenses SQL Studio.
WHERE clause
The WHERE clause allows you to define a set of logical predicates the data needs to match in order to be returned. The standard comparison operators (`>`, `>=`, `<`, `<=`, `=`, and `!=`) are supported, as well as calling functions.
We are going to use the groceries table created earlier. Select all items purchased where the price is greater than or equal to 2.00:
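A sketch of such a query, assuming the groceries table has `name` and `price` fields (adjust the names to your actual schema):

```sql
SELECT name, price
FROM groceries
WHERE price >= 2.00;
```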
Select all customers whose last name is exactly 5 characters long:
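A hedged sketch, assuming a `customers` table with a `last_name` field and a string-length function named `LENGTH` (the function name may differ in your Lenses version):

```sql
SELECT first_name, last_name
FROM customers
WHERE LENGTH(last_name) = 5;
```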
Search for all customers containing Ana in their first name:
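For example, using a `LIKE` pattern (the field names are illustrative):

```sql
SELECT first_name, last_name
FROM customers
WHERE first_name LIKE '%Ana%';
```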
Keep in mind that text search is case-sensitive. To use case insensitive text search, you can write:
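One option is to normalize the case before matching, assuming a `LOWER` function is available:

```sql
SELECT first_name, last_name
FROM customers
WHERE LOWER(first_name) LIKE '%ana%';
```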
Missing values
Sometimes data can contain explicit NULL values, or it can omit fields entirely. Using IS [ NOT ] NULL or the exists function allows you to check for these situations.
exists is a keyword in Lenses SQL grammar, so it needs to be escaped; the escape character is the backtick (`` ` ``).
Lenses supports JSON. Since JSON does not enforce a schema, you can insert null values or omit fields entirely.
Create the following table named customers_json:
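A sketch of the table and some sample records; the DDL options and field list below are illustrative, so adapt them to your setup:

```sql
CREATE TABLE customers_json(
    username string,
    first_name string,
    middle_name string,
    last_name string)
FORMAT (string, json);

INSERT INTO customers_json(username, first_name, last_name)
VALUES ('anasmith', 'Ana', 'Smith');

INSERT INTO customers_json(username, first_name, middle_name, last_name)
VALUES ('mikejones', 'Mike', 'Albert', 'Jones');
```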
Query this table for all its entries:
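For example:

```sql
SELECT *
FROM customers_json;
```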
The middle_name field is only present on the mikejones record.
Write a query which filters out records where middle_name is not present:
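A sketch using the escaped `exists` function mentioned above:

```sql
SELECT *
FROM customers_json
WHERE `exists`(middle_name);
```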
This can also be written as:
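Using the `IS NOT NULL` form instead:

```sql
SELECT *
FROM customers_json
WHERE middle_name IS NOT NULL;
```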
Whether a field is explicitly NULL or missing entirely, checking as in the above query produces the same outcome.
Multiple WHERE conditions
You can use AND/OR to specify complex conditions for filtering your data.
To filter the purchased items where more than one item has been bought for a given product and the unit price is greater than 2:
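A sketch, assuming the groceries table tracks the purchased amount in a `quantity` field (the names are illustrative):

```sql
SELECT name, quantity, price
FROM groceries
WHERE quantity > 1 AND price > 2;
```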
Now try changing the AND logical operator to OR and observe the difference in the output.
HAVING clause
Use the HAVING clause to filter the entries returned by a grouping query. It plays the same role for grouped results that the WHERE clause plays for individual records.
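For example, to keep only the products purchased more than once (field names are illustrative):

```sql
SELECT name, COUNT(*) AS purchases
FROM groceries
GROUP BY name
HAVING COUNT(*) > 1;
```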
Read a table partition only
To select data from a specific partition, access the metadata of the topic.
In the following example, a table is created with three partitions. The message key is hashed, and the remainder HashValue % partitions determines the table partition the record is sent to.
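A sketch of such a table; the PROPERTIES clause and field names below are illustrative and may differ in your Lenses version:

```sql
CREATE TABLE visits(
    _key string,
    url string)
FORMAT (string, json)
PROPERTIES (partitions = 3);
```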
Next, run the following query:
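A sketch that projects the record metadata alongside the data, assuming Lenses exposes it under `_meta` (as in `_meta.partition`, `_meta.offset`, `_meta.timestamp`); the table name is illustrative:

```sql
SELECT _key, url, _meta.partition, _meta.offset, _meta.timestamp
FROM visits;
```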
As you can see from the results (your timestamps will be different), the records span the three partitions. Now query specific partitions:
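For example, restricting the read to a single partition (the table name is illustrative):

```sql
SELECT _key, url, _meta.partition
FROM visits
WHERE _meta.partition = 1;
```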
Query by partitions
Kafka reads are non-deterministic over multiple partitions. The Snapshot engine may reach its max.size limit before it finds your record in one run, while the next run might find it.
Suppose, for the sake of example, the above Kafka topic has 50 partitions, each holding 1GB of data, and we specify in our query that we are only interested in partition 1. Lenses will automatically push this predicate down, meaning we only need to scan 1GB instead of 50GB of data.
Query by offsets
If we specify the offset range and the partition, we only need to scan that specific range of 100K messages, resulting in roughly 5MB scanned.
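A sketch with hypothetical offset bounds covering a 100K-message range:

```sql
SELECT *
FROM visits
WHERE _meta.partition = 1
  AND _meta.offset >= 100000
  AND _meta.offset <= 200000;
```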
Query by timestamp
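A sketch, assuming `_meta.timestamp` holds the record's epoch-millisecond timestamp; replace the literal with the cutoff for one hour ago:

```sql
SELECT *
FROM visits
WHERE _meta.timestamp >= 1621000000000; -- hypothetical epoch-millis cutoff (one hour ago)
```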
The above will query only the data added to the topic up to 1 hour ago; thus we would query just 10MB.
Time-traveling
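A sketch, bounding `_meta.timestamp` to a single day with hypothetical epoch-millisecond literals:

```sql
SELECT *
FROM visits
WHERE _meta.timestamp >= 1620950400000  -- start of the chosen day
  AND _meta.timestamp <  1621036800000; -- start of the next day
```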
The above will query only the data that was added to the Kafka topic on a specific day. If we are storing 1,000 days of data, we would query just 50MB.