A declarative SQL interface for querying, transforming and manipulating data at rest and data in motion. It works with Apache Kafka topics and other data sources, and gives developers and Kafka users a familiar way to explore and work with their data.
The Lenses SQL Snapshot engine accesses the data at the point in time the query is executed. This means that, for Apache Kafka, data added just after the query was initiated will not be processed.
Typical use cases include, but are not limited to:
Identifying a specific message.
Identifying a particular payment transaction that your system has processed.
Identifying all thermostat readings for a specific customer, for example if you work for an energy provider.
Counting transactions processed within a given time window.
The Snapshot engine presents a familiar SQL interface, but remember that it queries Kafka with no indexes. Use Kafka's metadata (partition, offset, timestamp) to improve query performance.
Go to Workspace->Sql Studio, enter your query, and click run.
This page describes common filtering of data in Kafka with Lenses SQL Studio.
The WHERE clause allows you to define a set of logical predicates the data needs to match in order to be returned. Standard comparison operators are supported (`>`, `>=`, `<`, `<=`, `=`, and `!=`), as well as calling functions.
We are going to use the `groceries` table created earlier. Select all items purchased where the price is greater than or equal to 2.00:
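A minimal sketch of such a query; the field name `price` is an assumption about the groceries schema:

```sql
SELECT *
FROM groceries
WHERE price >= 2.0;
```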
Select all customers whose last name is exactly 5 characters long:
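A sketch, assuming a `last_name` field and a `LEN` string function (both assumptions, not taken from the original):

```sql
SELECT *
FROM customers
WHERE LEN(last_name) = 5;
```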
Search for all customers containing Ana in their first name:
Keep in mind that text search is case-sensitive. To use case-insensitive text search, you can write:
Sometimes data can contain explicit NULL values, or it can omit fields entirely. Using `IS [NOT] NULL`, or the `EXISTS` function, allows you to check for these situations.
`EXISTS` is a keyword in the Lenses SQL grammar, so it needs to be escaped; the escape character is the backtick (`` ` ``).
Lenses supports JSON. JSON does not enforce a schema, which allows you to insert null values or omit fields entirely.
Create the following table, named `customers_json`:
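A sketch of what such a table and its data might look like; the field names and the exact CREATE TABLE/INSERT forms shown here are assumptions for illustration (the CREATE TABLE section later in this document describes the general shape):

```sql
CREATE TABLE customers_json (
    first_name string,
    middle_name string,
    last_name string
)
FORMAT (string, json);

-- only mikejones gets a middle_name
INSERT INTO customers_json (_key, first_name, last_name)
VALUES ('anamartinez', 'Ana', 'Martinez');

INSERT INTO customers_json (_key, first_name, middle_name, last_name)
VALUES ('mikejones', 'Mike', 'Albert', 'Jones');
```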
Query this table for all its entries:
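A minimal example, using the table above:

```sql
SELECT *
FROM customers_json;
```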
The `middle_name` field is only present on the `mikejones` record.
Write a query which filters out records where `middle_name` is not present:
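A sketch using the escaped `EXISTS` function described above; the exact casing may differ:

```sql
SELECT *
FROM customers_json
WHERE `EXISTS`(middle_name);
```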
This can also be written as:
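For instance:

```sql
SELECT *
FROM customers_json
WHERE middle_name IS NOT NULL;
```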
When a field is explicitly NULL or is missing entirely, the check in the above query has the same outcome.
You can use AND/OR to specify complex conditions for filtering your data.
To filter the purchased items where more than one item has been bought for a given product, and the unit price is greater than 2:
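A sketch; the `quantity` and `price` field names are assumptions about the groceries schema:

```sql
SELECT *
FROM groceries
WHERE quantity > 1
  AND price > 2;
```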
Now try changing the AND logical operator to OR and see the differences in output.
To filter the entries returned from a grouping query, use the HAVING clause. It plays the same role for grouped queries that the WHERE statement plays for regular ones.
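A sketch of a grouped query with a HAVING filter; the table and field names are assumptions:

```sql
SELECT country, COUNT(*) AS customers
FROM customers
GROUP BY country
HAVING COUNT(*) > 1;
```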
To select data from a specific partition, access the metadata of the topic.
In the following example, a table is created with three partitions. The message key is hashed, and the remainder of `HashValue % partitions` determines the partition the record is sent to.
Next, run the following query:
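Something along these lines; the table name `customers_partitioned` is an assumption:

```sql
SELECT _key, _meta.partition, _meta.offset, _meta.timestamp
FROM customers_partitioned;
```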
As you can see from the results (your timestamps will be different) the records span over the three partitions. Now query specific partitions:
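For example, restricting the scan to a single partition:

```sql
SELECT _key, _meta.partition, _meta.offset
FROM customers_partitioned
WHERE _meta.partition = 1;
```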
Kafka reads are non-deterministic over multiple partitions. The Snapshot engine may reach its `max.size` limit before it finds your record in one run; in the next run it might find it.
If we specify in our query that we are only interested in partition 1 and, for the sake of example, the above Kafka topic has 50 partitions, then Lenses will automatically push this predicate down, meaning that we only need to scan 1 GB instead of 50 GB of data.
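A sketch of such a partition-restricted query; the topic name is illustrative:

```sql
SELECT *
FROM trips
WHERE _meta.partition = 1;
```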
If we specify the offset range and the partition, we would only need to scan that specific range of 100K messages, resulting in roughly 5 MB being scanned.
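For example (partition and offset bounds are illustrative):

```sql
SELECT *
FROM trips
WHERE _meta.partition = 1
  AND _meta.offset >= 100000
  AND _meta.offset <= 200000;
```

Timestamp predicates can be pushed down in the same way. A sketch, using an epoch-milliseconds lower bound standing in for "one hour ago" (the actual value depends on when you run it):

```sql
SELECT *
FROM trips
WHERE _meta.timestamp >= 1700000000000; -- epoch millis for one hour ago
```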
The above will query only the data added to the topic within the last hour; thus we would query just 10 MB.
Time-traveling
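A sketch of querying a specific day by bounding the record timestamp; the topic name and epoch-millisecond bounds are illustrative:

```sql
SELECT *
FROM trips
WHERE _meta.timestamp >= 1696118400000   -- start of the day (epoch millis)
  AND _meta.timestamp <  1696204800000;  -- start of the next day
```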
The above will query only the data that was added to the Kafka topic on a specific day. If we are storing 1,000 days of data, we would query just 50 MB.
This page describes the best practices when using Lenses SQL Studio to query data in Kafka.
Does Apache Kafka have indexing?
No. Apache Kafka does not have full indexing capabilities on the payload (indexes typically come at a high cost, even in an RDBMS or a system like Elasticsearch); however, Kafka does index its metadata.
The only filters Kafka supports are topic, partition and offsets or timestamps.
When querying Kafka topic data with SQL that filters on a payload field, such as a specific transaction id, a full scan will be executed: the query processes the entire data on that topic to identify all records that match.
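A sketch of such a query; the topic and field names are illustrative:

```sql
SELECT *
FROM payments
WHERE transaction_id = 'tx-1234';
```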
If the Kafka topic contains a million 50 KB messages, that requires querying 50 GB of data. Depending on your network capabilities, the brokers’ performance, any quotas on your account, and other parameters, fetching 50 GB of data could take some time! Even more so if the data is compressed: in that case, the client has to decompress it before parsing the raw bytes into a structure to which the query can be applied.
When Lenses can’t read (deserialize) your topic’s messages, it classifies them as “bad records”. This happens for one of the following reasons:
Kafka records are corrupted. On an AVRO topic, a rogue producer might have published a different format
Lenses topic settings do not match the payload data. Maybe a topic was incorrectly given AVRO format when it’s JSON or vice versa
If AVRO payload is involved, maybe the Schema Registry is down or not accessible from the machine running Lenses
By default, Lenses skips them and displays the records’ metadata in the Bad Records tab. If you want to force the query to stop in such cases, use:
Querying a table can take a long time if it contains a lot of records. The underlying Kafka topic has to be read, the filter conditions applied, and the projections made.
Additionally, the SELECT statement could end up bringing a large amount of data to the client. To constrain the resources involved, Lenses allows for context customization, which drives the execution and gives control to the user. Here is the list of context parameters to overwrite:
All the above values can be given a default via the configuration file. Using `lenses.sql.settings` as the prefix, `format.timestamp` can be set like this:
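For instance, a configuration entry of this shape (a sketch of the key/value form, not an exact excerpt):

```
lenses.sql.settings.format.timestamp=true
```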
Lenses SQL uses a Kafka consumer to read the data. This means that an advanced user with knowledge of Kafka could tweak the consumer properties to achieve better throughput, although this should only be needed on rare occasions. The query context can receive Kafka consumer settings. For example, the `max.poll.records` consumer setting can be set as:
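A sketch, assuming consumer properties are passed through the same SET mechanism as the other context parameters:

```sql
SET max.poll.records = 10000;
```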
Streaming SQL operates on unbounded streams of events: a query would normally never end. In order to bring query termination semantics into Apache Kafka, we introduced four controls:
LIMIT = 10000 - Force the query to terminate when 10,000 records are matched.
max.bytes = 20000000 - Force the query to terminate once 20 MBytes have been retrieved.
max.time = 60000 - Force the query to terminate after 60 seconds.
max.zero.polls = 8 - Force the query to terminate after 8 consecutive polls are empty, indicating we have exhausted a topic.
Thus, when retrieving data, you can cap the retrieved bytes at 1 GB and the query time at one hour like this:
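For example (the topic name is illustrative):

```sql
SET max.size = '1g';
SET max.query.time = '1h';

SELECT * FROM payments;
```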
This page describes the concepts of the Lenses SQL snapshot engine that drives the SQL Studio allowing you to query data in Kafka.
Escape topic names with backticks if they contain non-alphanumeric characters.
Snapshot queries on streaming data provide answers to a direct question, e.g. "the current balance is $10". The query is active; the data is passive.
A single entry in a Kafka topic is called a message.
The engine considers a message to have four distinct components: key, value, headers and metadata.
Currently, the Snapshot Engine supports four different facets: `_key`, `_value`, `_headers` and `_metadata`. These strings can be used to reference properties of each of the aforementioned message components when building a query.
By default, unqualified properties are assumed to belong to the `_value` facet:
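For example (topic and field names are illustrative):

```sql
-- "amount" here resolves to _value.amount
SELECT amount
FROM payments;
```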
In order to reference a different facet, a facet qualifier can be added:
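For instance (field names are illustrative):

```sql
SELECT _key.customer_id, _meta.offset, amount
FROM payments;
```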
When more than one source/topic is specified in a query (as happens when two topics are joined), a table reference can be added to the selection to resolve the ambiguity:
The same can be done for any of the other facets (`_key`, `_meta`, `_headers`).
Note: using a wildcard selection statement (SELECT *) provides only the value component of a message.
Headers are interpreted as a simple mapping of strings to strings. This means that if a header is a JSON, XML or any other structured type, the snapshot engine will still read it as a string value.
Messages can contain nested elements and embedded arrays. The `.` operator is used to refer to children, and the `[]` operator is used to refer to an element in an array.
You can use a combination of these two operators to access data of any depth.
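A sketch combining both operators; the structure and names are illustrative:

```sql
SELECT address.street.name, phone_numbers[0]
FROM customers;
```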
You can explicitly reference the key, value and metadata. For the key use `_key`, for the value use `_value`, and for metadata use `_meta`. When there is no prefix, the engine will resolve the field(s) as being part of the message value. For example, the following two queries are identical:
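A sketch (table and field names are illustrative):

```sql
SELECT first_name FROM customers;
SELECT _value.first_name FROM customers;
```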
When the key or the value content is a primitive data type, use the prefix alone to address it.
For example, if messages contain a device identifier as the key and the temperature as the value, SQL code would be:
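A sketch of what that could look like, assuming a device_readings topic with a string key and a numeric value:

```sql
SELECT _key AS device_id, _value AS temperature
FROM device_readings;
```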
Use the `_meta` keyword to address the metadata. For example:
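For instance (the topic name is illustrative):

```sql
SELECT _meta.partition, _meta.offset, _meta.timestamp
FROM device_readings;
```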
When projecting a field into a target record, Lenses allows complex structures to be built. This can be done by using a nested alias like below:
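A sketch of a nested alias; the field names are illustrative:

```sql
SELECT amount   AS transaction.amount,
       currency AS transaction.currency
FROM payments;
```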
The result would be a struct with the following shape:
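Continuing the sketch above, the resulting record value would look roughly like this:

```
{
  "transaction": {
    "amount": 12.5,
    "currency": "USD"
  }
}
```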
When two alias names clash, the snapshot engine does not “override” that field. Lenses will instead generate a new name by appending a unique integer. This means that a query like the following:
will generate a structure like the following:
Tabled (nested) queries allow you to nest one query inside another. Let us take the query in the previous section and say we are only interested in those entries where there is more than one customer per country.
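A sketch of such a nested query; the inner query mirrors a simple GROUP BY over an assumed customers table:

```sql
SELECT country, customers
FROM (
    SELECT country, COUNT(*) AS customers
    FROM customers
    GROUP BY country
)
WHERE customers > 1;
```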
Run the query, and you will only see those entries for which there is more than one person registered per country.
Functions can be used directly in a projection or a filter. For example, the ROUND function allows you to round numeric values:
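A sketch; the table and field names are illustrative:

```sql
SELECT name, ROUND(price) AS rounded_price
FROM groceries;
```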
This page describes how to limit return and sample data in Kafka with Lenses SQL Studio.
To limit the output of the query you can use two approaches:
use the LIMIT clause
set the max size of the data to be returned
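For example, either of the following (the topic name is illustrative):

```sql
SELECT * FROM payments LIMIT 100;

SET max.size = '10m';
SELECT * FROM payments;
```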
To restrict the time the query can run for, use SET max.query.time:
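For example (the topic name is illustrative):

```sql
SET max.query.time = '30s';
SELECT * FROM payments;
```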
To sample data and discard the first rows:
This statement instructs Lenses to skip the first record matched and then sample the next two.
This page describes joining data in Kafka with Lenses SQL Studio.
Lenses allows you to combine records from two tables. A query can contain zero, one or multiple JOIN operations.
Create an `orders` table and insert some data into it:
With these tables in place, join them to get more information about an order by combining it with the customer information found in the `customer` table:
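A sketch of such a join; the join keys and selected fields are assumptions about the two schemas:

```sql
SELECT o.order_id, o.amount, c.first_name, c.last_name
FROM orders AS o
JOIN customer AS c ON o.customer_id = c.id;
```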
With lateral joins, Lenses allows you to combine records from a table with the elements of an array expression.
We are going to see in more detail what lateral joins are with an example.
Create a `batched_readings` table and insert some data into it:
You can now use a LATERAL join to inspect, extract and filter the single elements of the `readings` array, as if they were a normal field:
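A sketch of a lateral join over the readings array; the exact field names are assumptions:

```sql
SELECT meter_id, reading
FROM batched_readings
LATERAL readings AS reading
WHERE reading > 90;
```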
Running that query we will get the values:
You can use multiple LATERAL joins, one inside the other, if you want to extract elements from a nested array:
Running the following query, we will obtain the same records as in the previous example:
This page describes how to create and delete topics in the Lenses SQL Studio.
Lenses supports the typical SQL commands supported by a relational database:
CREATE
DROP
TRUNCATE
DELETE
SHOW TABLES
DESCRIBE TABLE
DESCRIBE FORMATTED
The CREATE statement has the following parts:
CREATE TABLE - Instructs the construction of a table
$Table - The actual name given to the table.
Schema - Constructed as a list of (field, type) tuples, it describes the data each record in the table contains
FORMAT - Defines the storage format. Since it is an Apache Kafka topic, both the Key and the Value formats are required. Valid values are STRING, INT, LONG, JSON, AVRO.
PROPERTIES - Specifies the number of partitions the final Kafka topic should have, the replication factor required to ensure high availability (it cannot be higher than the current number of Kafka brokers), and whether the topic should be compacted.
A Kafka topic which is compacted is a special type of topic with a finer-grained retention mechanism that retains the last update record for each key.
A compacted topic (once the compaction has been completed) contains a full snapshot of the final record values for every record key and not just the recently changed keys. They are useful for in-memory services, persistent data stores, reloading caches, etc.
Example:
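A sketch of a CREATE TABLE statement; the table name, fields and property values are illustrative, and the exact PROPERTIES keys may differ:

```sql
CREATE TABLE customer (
    id string,
    name string,
    country string
)
FORMAT (string, json)
PROPERTIES (partitions=3, replication=1, compacted=true);
```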
Best practice is to use Avro as the storage format over other formats. In this case, the key can still be stored as STRING, but the value can be Avro.
To list all tables:
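Using the command listed above:

```sql
SHOW TABLES;
```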
To examine the schema and metadata for a topic:
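The general form, using the command listed above:

```sql
DESCRIBE TABLE $tableName;
```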
The `$tableName` should contain the name of the table to describe.
Given the two tables created earlier, a user can run the following SQL to get the information on each table:
The following information will be displayed:
To drop a table:
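The general form (a sketch; $Table is a placeholder for the table name):

```sql
DROP TABLE $Table;
```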
Dropping a table results in the underlying Kafka topics being removed.
Lenses provides a set of virtual tables that contain information about all the fields in all the tables.
Using the virtual table, you can quickly search for a table name but also see the table type.
The `__table` virtual table has a `table_name` column containing the table name, and a `table_type` column describing the table type (system, user, etc.).
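A sketch using the virtual table and columns named above:

```sql
SELECT table_name, table_type
FROM __table;
```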
To see all the tables' fields, select from the `_fields` virtual table.
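For example:

```sql
SELECT *
FROM _fields;
```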
Each Kafka message contains information related to partition, offset, timestamp, and topic. Additionally, the engine adds the key and value raw byte size.
Create a topic and insert a few entries.
Now we can query for specific metadata related to the records.
To query for metadata such as the underlying Kafka topic offset, partition and timestamp, prefix your desired fields with `_meta`.
Run the following query to see each tutorial name along with its metadata information:
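A sketch; the topic name `tutorials` and the `name` field are assumptions:

```sql
SELECT name, _meta.partition, _meta.offset, _meta.timestamp
FROM tutorials;
```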
For a full list of functions see the SQL Reference.
For more details on the subject, you should look at .
Name | Description | Example
---|---|---
max.size | The maximum amount of Kafka data to scan. This is to avoid a full topic scan over large topics. It can be expressed as bytes (1024), kilobytes (1024k), megabytes (10m) or gigabytes (5g). Default is 20MB. | SET max.size = '1g';
max.query.time | The maximum amount of time the query is allowed to run. It can be specified as milliseconds (2000ms), hours (2h), minutes (10m) or seconds (60s). Default is 1 hour. | SET max.query.time = '60000ms';
max.idle.time | The amount of time to wait when no more records are read from the source before the query is completed. Default is 5 seconds. | SET max.idle.time = '5s';
LIMIT N | The maximum number of records to return. Default is 10000. | SELECT * FROM payments LIMIT 100;
show.bad.records | Flag to drive the behavior of handling topic records when their payload does not correspond with the table storage format. Default is true: bad records are processed and displayed separately in the Bad Records section. Set it to false to skip them completely. | SET show.bad.records=false;
format.timestamp | Flag to control the values for Avro date time. Avro encodes date time via Long values. Set the value to true if you want the values to be returned as text in a human-readable format. | SET format.timestamp=true;
format.decimal | Flag to control the formatting of decimal types; it specifies how many decimal places are shown. | SET format.decimal=2;
format.uppercase | Flag to control the formatting of string types; it specifies whether strings should all be made uppercase. Default is false. | SET format.uppercase=true;
live.aggs | Flag to control whether aggregation queries are allowed to run. Since they accumulate data, they require more memory to retain the state. | SET live.aggs=true;
max.group.records | When an aggregation is calculated, this config defines the maximum number of records over which the engine computes the result. Default is 10,000,000. | SET max.group.records=10000000;
optimize.kafka.partition | When enabled, the primitive used for the _key filter determines the partition the same way the default Kafka partitioner logic does. Queries like SELECT * FROM trips WHERE _key='customer_id_value'; on multi-partition topics will therefore read only one partition as opposed to the entire topic. To disable it, set the flag to false. | SET optimize.kafka.partition=false;
query.parallel | When used, it will parallelize the query. The number provided is capped by the target topic partition count. | SET query.parallel=2;
query.buffer | Internal buffer used when processing messages. A higher number might yield better performance when coupled with max.poll.records. | SET query.buffer=50000;
kafka.offset.timeout | Timeout for retrieving target topic start/end offsets. | SET kafka.offset.timeout=20000;
meter_id | reading
---|---
1 | 100
1 | 95
1 | 91
2 | 93
1 | 92
1 | 94
This page describes how to aggregate Kafka data in Lenses SQL Studio.
For a full list of aggregation functions see the SQL Reference.
Using the COUNT aggregate function you can count the records in a table. Run the following SQL to see how many records we have on the `customers_partitioned` table:
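For example:

```sql
SELECT COUNT(*)
FROM customers_partitioned;
```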
Using the SUM function you can sum a numeric field across the records in a table.
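A sketch; the table and field names are illustrative:

```sql
SELECT SUM(price)
FROM groceries;
```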
To group data use the GROUP BY clause:
Let’s see how many customers there are from each country. Here is the code which computes that:
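A sketch, assuming a customers table with a country field:

```sql
SELECT country, COUNT(*) AS customers
FROM customers
GROUP BY country;
```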
This page describes how to insert and delete data into Kafka with Lenses SQL Studio.
Lenses SQL allows you to utilize the ANSI SQL INSERT command to store new records into a table.
Single or multi-record inserts are supported:
$Table - The name of the table to insert the data into
Columns - The target columns to populate with data. Adding a record does not require you to fill all the available columns. In the case of Avro stored Key, Value pairs, the user needs to make sure that a value is specified for all the required Avro fields.
VALUES - The set of values to insert. It has to match the list of columns provided, including their data types. You can use simple constants or more complex expressions as values, like `1 + 1` or `NOW()`.
Example:
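A sketch of a single-record insert; the table and column names are illustrative:

```sql
INSERT INTO customer (_key, first_name, last_name, country)
VALUES ('ana.martinez', 'Ana', 'Martinez', 'Spain');
```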
Records can be inserted from the result of a SELECT statement.
The syntax is:
For example, to copy all the records from the customer table into the customer_avro one:
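Something along these lines:

```sql
INSERT INTO customer_avro
SELECT *
FROM customer;
```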
There are scenarios where a record key is a complex type. Regardless of the storage format, JSON or Avro, the SQL engine allows the insertion of such entries:
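A sketch of inserting a record with a composite key; the key fields and values are assumptions:

```sql
INSERT INTO payments (_key.customer_id, _key.payment_ref, amount, currency)
VALUES ('cust-1', 'pay-42', 12.50, 'USD');
```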
There are two ways to delete data:
If the topic is not compacted, then DELETE expects an offset to delete records up to.
If the topic is compacted, then DELETE expects the record Key to be provided. For a compacted topic, a delete translates to inserting a record with the existing Key but a null Value. For the `customer_avro` topic (which has the compacted flag on), a delete operation for a specific customer identifier would look like this:
Deleting is an insert operation. Until the compaction takes place, there will be at least one record with the Key used earlier. The latest (or last) record will have the Value set to null.
To remove all records from a table:
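The general form, a sketch ($Table is a placeholder for the table name):

```sql
TRUNCATE TABLE $Table;
```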
where the `$Table` is the table name to delete all records from. This operation is only supported on non-compacted topics, which is a Kafka design restriction. To remove the data from a compacted topic, you have two options: either dropping and recreating the topic, or inserting null Value records for each unique Key on the topic.
After rebuilding the `customer` table to be non-compacted, perform the truncate:
Truncating a compacted Kafka topic is not supported. This is an Apache Kafka restriction. You can drop and recreate the table, or insert a record with a null Value for each unique key in the topic.
This page describes how to use views and synonyms in Lenses SQL Studio to query Kafka.
Lenses supports the typical SQL commands supported by a relational database:
CREATE
DROP
TRUNCATE
DELETE
SHOW VIEWS
A view is a virtual table, generated dynamically based on the results of a SELECT statement.
A view looks and acts just like a real table, but is always created on the fly as required, so it is always up to date.
A synonym is an alias for a table. This is useful if you have a topic with a long, unwieldy name like customer_reports_emea_april_2018_to_march_2018 and you want to access this as customer_reports.
To create a view:
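A sketch of the general form; the projection and filter are illustrative, and `viewname` is explained below:

```sql
CREATE VIEW viewname AS
SELECT first_name, last_name
FROM customers
WHERE country = 'UK';
```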
Where `viewname` is the name of the virtual table that is used to access the records in the view, and the query is a standard SELECT statement.
Then we can query the view:
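For example:

```sql
SELECT *
FROM viewname;
```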
A view acts as a virtual table. This means that a view can be filtered even more or that a projection can be applied to a view:
To delete a view:
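A sketch, assuming the DROP command listed above applies to views:

```sql
DROP VIEW viewname;
```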
If you wish to modify an existing view, use the syntax above to delete it, and then create a new view with the same name.
To see the definition of a view, you can use the following syntax:
To create a synonym:
To delete a synonym:
If you wish to modify an existing synonym, use the syntax above to delete it, and then create a new synonym with the same name.
Three common reasons for using a view are:
creating a projection from a table with a large number of fields
representing joins as a single table
and creating a preset filter
We will cover each scenario with an example.
Suppose we have a table called customers which contains full customer records - name, email, age, registration date, country, password, and many others - and we find ourselves repeatedly querying it for just name and email.
A view could be created that returns just the name and email as a projection.
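A sketch of such a view; the view name and field names are assumptions:

```sql
CREATE VIEW customer_contacts AS
SELECT name, email
FROM customers;
```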
There is no reason to specify the projection each time.
The benefit is more significant when we want to select a higher number of fields - say a topic with 50 fields, and we want to select only 15.
The statement that is used to generate the view can consist of one or more tables. One use case of views is to represent joined tables as if they were a single table. This avoids the need for writing a complex join query each time.
Then we can select from this join like this:
Finally, another use case is to define a filter that is commonly used. If a topic contains transactions and we often find ourselves searching for transactions from the UK, we could run this query each time:
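A sketch; the field name is an assumption:

```sql
SELECT *
FROM transactions
WHERE country = 'UK';
```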
Alternatively, we can set up a view with this filter pre-applied:
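For example (the view name is illustrative):

```sql
CREATE VIEW uk_transactions AS
SELECT *
FROM transactions
WHERE country = 'UK';
```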
Then use a SELECT query:
This page describes examples of using arrays in Lenses SQL Studio to query Kafka.
For a full list of array functions see the SQL Reference.
You can create array fields using the `[]` syntax:
Tables can store data containing arrays. Here is a SQL statement for querying an array item:
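A sketch, reusing the batched_readings table from the lateral-join example:

```sql
SELECT readings[0] AS first_reading
FROM batched_readings;
```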
When working with arrays, it is good to check the array bounds. See the SIZEOF function in the list of supported functions.
Sometimes you want to find out how many items are in your array. To do so you can run:
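For example, using SIZEOF (table and field names as above):

```sql
SELECT SIZEOF(readings) AS readings_count
FROM batched_readings;
```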
This page describes how to manage and control queries against Kafka in Lenses SQL Studio.
Adding a LIMIT 10 to the SQL query will result in the query terminating early, as soon as 10 messages have been found. It is not a perfect solution, since we might never find 10 matching messages and would thus still perform a full scan.
You can also set a maximum query or idle time:
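For instance, capping the query at two minutes:

```sql
SET max.query.time = '2m';
```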
or a max idle time; the idea is that there is no reason to keep polling if we have exhausted the entire topic:
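For instance:

```sql
SET max.idle.time = '10s';
```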
or a maximum amount of data to read from Kafka. This controls how much data is read from Kafka, NOT the required memory:
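For instance:

```sql
SET max.size = '1g';
```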
Recent queries are displayed, but only for the current session; they are not currently retained.
Click on the play button to run a previous query. If a query is already running, you will be asked if you want to stop it first.
View All queries
View Running queries
You can see all running queries by Lenses users using SQL:
You can force stop a query by another user using SQL:
This page describes how to access Kafka message metadata in Lenses SQL Studio.
When running queries against Kafka, the Snapshot Engine enables you to access the record metadata through the special `_meta` facet.
These are the available meta fields:
Field | Description
---|---
_meta.offset | The offset of the record in its Kafka topic partition
_meta.partition | The Kafka topic partition of the record
_meta.timestamp | The Kafka record timestamp
_meta.__keysize | The length in bytes of the raw key stored in Kafka
_meta.__valuesize | The length in bytes of the raw value stored in Kafka
The following query will select all the meta fields listed above:
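A sketch; the topic name is illustrative:

```sql
SELECT _meta.offset,
       _meta.partition,
       _meta.timestamp,
       _meta.__keysize,
       _meta.__valuesize
FROM payments;
```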
To view the value of a specific header you can run:
To read records from a specific partition, the following query can be used:
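For instance (the topic name is illustrative):

```sql
SELECT *
FROM payments
WHERE _meta.partition = 0;
```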
Here is the query to use when the record offset and partition are known:
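For instance (partition and offset values are illustrative):

```sql
SELECT *
FROM payments
WHERE _meta.partition = 0
  AND _meta.offset = 1234;
```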
This query will get the latest 100 records per partition (assuming the topic is not compacted):
This instead will get the latest 100 records for a given partition (again assuming the topic is not compacted):