1 of 100

Lenses SQL

This page describes how to explore and process data in real time with Lenses SQL engines, and use Kafka Connectors to source and sink data, with monitoring and alerting.

For automation use the CLI.

Lenses has two SQL engines to allow users to explore and process streaming data:

SQL Snapshot - This is point in time SQL engine powering the Explore and SQL Studio for debugging and exploration of data.
SQL Processors (Streaming) - These are long-running applications, configured, deployed and managed by Lenses to perform joins, aggregations, data conversion and more.

Exploring

Learn how to use Lenses to explore topics, their data, configurations and consumers.

SQL Studio

Learn how to use the SQL Studio to interactively query data in your Kafka topics and more.

Processing

Learn how to process data, join, filter aggregate and more with Lenses SQL Processors.

SQL Studio

A declarative SQL interface, for querying, transforming and manipulating data at rest and data in motion. It works with Apache Kafka topics and other data sources. It helps developers and Kafka users

The Lenses SQL Snapshot engine accesses the data at the point in time the query is executed. This means, that for Apache Kafka, data added just after the query was initiated will not be processed.

Typical use cases are but are not limited to:

Identifying a specific message.
Identifying a particular transaction of payment that your system has processed
Identifying all thermostats readings for a specific customer if you are working for an energy provider
Counting transactions processed within a given time window.

The Snapshot engine presents a familiar SQL interface, but remember that it queries Kafka with no indexes. Use Kafka's metadata (partition, offset, timestamp) to improve query performance.

Writing queries

Go to Workspace->Sql Studio, enter your query, and click run.

Concepts

This page describes the concepts of the Lenses SQL snapshot engine that drives the SQL Studio allowing you to query data in Kafka.

Escape topic names with backticks if they contain non-alpha numeric characters

Snapshot queries on streaming data provide answers to a direct question, e.g. The current balance is $10. The query is active, the data is passive.

What is a message?

A single entry in a Kafka topic is called a message.

The engine considers a message to have four distinct components key, value, headers and metadata.

Facets

Currently, the Snapshot Engine supports four different facets _key, _value, _headers and _metadata; These strings can be used to reference properties of each of the aforementioned message components and build a query that way.

By default, unqualified properties are assumed to belong to the _value facet:

SELECT 
  property
FROM source_topic;

In order to reference a different facet, a facet qualifier can be added:

SELECT 
  _value.valueField,
  _key.keyField,
  _meta.metaField,
  _headers.headerField
FROM source_topic;

When more than one sources/topics are specified in a query (like it happens when two topics are joined) a table reference can be added to the selection to fix the ambiguity:

SELECT 
  users._value.field
FROM users JOIN purchases

the same can be done for any of the other facets (_key,_meta,_headers).

Note Using a wildcard selection statement SELECT * provides only the value component of a message.

Headers are interpreted as a simple mapping of strings to strings. This means that if a header is a JSON, XML or any other structured type, the snapshot engine will still read it as a string value.

Selecting nested fields

Messages can contain nested elements and embedded arrays. The . operator is used to refer to children, and the [] operator is used for referring to an element in an array.

You can use a combination of these two operators to access data of any depth.

SELECT 
    dependencies[0].first_name AS childName
FROM policy_holder
WHERE policyId='100001'

You explicitly reference the key, value and metadata.

For the key use _key, for the value use _value, and for metadata use _meta. When there is no prefix, the engine will resolve the field(s) as being part of the message value. For example, the following two queries are identical:

SELECT 
    amount
FROM payments;

SELECT 
    _value.amount
FROM payments;

Primitive types

When the key or a value content is a primitive data type use the prefix only to address them.

For example, if messages contain a device identifier as the key and the temperature as the value, SQL code would be:

SELECT 
    _key AS deviceId
  , _value AS temperature
FROM iot_data

Accessing metadata

Use the _meta keyword to address the metadata. For example:

SELECT 
    _meta.timestamp AS timestamp
    , _meta.offset AS index
FROM iot_data

Projections and nested aliases

When projecting a field into a target record, Lenses allows complex structures to be built. This can be done by using a nested alias like below:

SELECT 
    amount as user.amount
    userId as user.id
FROM payments;

The result would be a struct with the following shape:

{
  "user": {
    "amount" : 10.19,
    "id": 10
  }  
}

Alias clashes (repeated fields)

When two alias names clash, the snapshot engine does not “override” that field. Lenses will instead generate a new name by appending a unique integer. This means that a query like the following:

SELECT 
    amount as result.amount,
    amount + 5 as result.amount
FROM payments;

will generate a structure like the following:

{
  "result": {
    "amount" : 10, 
    "amount0": 15
  }  
}

Nested queries

The tabled query allows you to nest queries. Let us take the query in the previous section and say we are only interested in those entries where there exist more than 1 customer per country.

SELECT *
FROM (
    SELECT 
        COUNT(*) AS count
        , country
    FROM customer
    GROUP BY country
    )
WHERE count > 1

Run the query, and you will only see those entries for which there is more than one person registered per country.

Functions

Functions can be used directly.

For example, the ROUND function allows you to round numeric functions:


SELECT
    name 
    , ROUND(quantity * price) AS rounded_total
FROM groceries

/*The output:
Fairtrade Bananas                                   2
Meridian Crunchy Peanut Butter                      3
Green & Black's organic 85% dark chocolate bar      4
Activia fat free cherry yogurts                     6
Green & Blacks Organic Chocolate Ice Cream          8
*/

For a full list of functions see SQL Reference.

Best practices

This page describes the best practices when using Lenses SQL Studio to query data in Kafka.

Does Apache Kafka have indexing?

No. Apache Kafka does not have the full indexing capabilities in the payload (indexes typically come at a high cost even on an RDBMS / DB or a system like Elastic Search), however, Kafka indexes the metadata.

The only filters Kafka supports are topic, partition and offsets or timestamps.

Push down queries using the Apache Kafka metadata

When querying Kafka topic data with SQL such as

SELECT *
FROM topicA
WHERE transaction_id=123

a full scan will be executed, and the query processes the entire data on that topic to identify all records that match the transaction id.

If the Kafka topic contains a billion 50KB messages - that would require querying 50 GB of data. Depending on your network capabilities, brokers’ performance, any quotas on your account, and other parameters, fetching 50 GB of data could take some time! Even more, if the data is compressed. In the last case, the client has to decompress it before parsing the raw bytes to translate into a structure to which the query can be applied.

Query by partition, offset or timestamp to avoid full scans.

Understanding bad records

When Lenses can’t read (deserialize) your topic’s messages, it classifies them as “bad records”. This happens for one of the following reasons:

Kafka records are corrupted. On an AVRO topic, a rogue producer might have published a different format
Lenses topic settings do not match the payload data. Maybe a topic was incorrectly given AVRO format when it’s JSON or vice versa
If AVRO payload is involved, maybe the Schema Registry is down or not accessible from the machine running Lenses

By default, Lenses skips them and displays the records’ metadata in the Bad Records tab. If you want to force stop the query in such case use:

SET skip.bad.records=false;
SELECT * FROM topicA LIMIT 100

Control execution

Querying a table can take a long time if it contains a lot of records. The underlying Kafka topic has to be read, the filter conditions applied, and the projections made.

Additionally, the SELECT statement could end up bringing a large amount of data to the client. To be able to constrain the resources involved, Lenses allows for context customization, which ends up driving the execution, thus giving control to the user. Here is the list of context parameters to overwrite:

Name

Description

Example

max.size

The maximum amount of Kafka data to scan. This is to avoid full topic scan over large topics. It can be expressed as bytes (1024), as kilo-bytes (1024k), as mega-bytes (10m) or as giga-bytes (5g). Default is 20MB.

SET max.size = '1g';

max.query.time

The maximum amount of time the query is allowed to run. It can be specified as milliseconds (2000ms), as hours (2h), minutes (10m) or seconds (60s). Default is 1 hour.

SET max.query.time = '60000ms';

max.idle.time

The amount of time to wait when no more records are read from the source before the query is completed. Default is 5 seconds

SET max.idle.time = '5s';

LIMIT N

The maximum of records to return. Default is 10000

SELECT * FROM payments LIMIT 100;

show.bad.records

Flag to drive the behavior of handling topic records when their payload does not correspond with the table storage format. Default is true. This means bad records are processed, and displayed seperately in the Bad Records section. Set it to false to fail to skip them completely.

SET show.bad.records=false;

format.timestamp

Flag to control the values for Avro date time. Avro encodes date time via Long values. Set the value to true if you want the values to be returned as text and in a human readable format.

SET format.timestamp=true;

format.decimal

Flag to control the formatting of decimal types. Use to specify how many decimal places are shown.

SET format.decimal= 2;

format.uppercase

Flag to control the formatting of string types. Use to specify if strings should all be made uppercase. Default is false.

SET format.decimal= 2;

live.aggs

Flag to control if aggregation queries should be allowed to run. Since they accumulate data they require more memory to retain the state.

SET live.aggs=true;

max.group.records

When an aggregation is calculated, this config is used to define the maximum number of records over which the engine is computed. Default is 10 000 000

SET max.group.records=10000000

optimize.kafka.partition

When enabled, it will use the primitive used for the _key filter to determine the partition the same way the default Kafka partitioner logic does. Therefore, queries like SELECT * FROM trips WHERE _key='customer_id_value'; on multiple partition topics will only read one partition as opposed to the entire topic. To disable it, set the flag to false.

SET optimize.kafka.partition=false;

query.parallel

When used, it will parallelize the query. The number provided will be capped by the target topic partitions count.

SET query.parallel=2;

query.buffer

Internal buffer when processing messages. Higher number might yield better performance when coupled with max.poll.records.

SET query.buffer=50000;

kafka.offset.timeout

Timeout for retrieving target topic start/end offsets.

SET kafka.offset.timeout=20000;

All the above values can be given a default value via the configuration file. Using lenses.sql.settings as prefix the format.timestamp can be set like this:

lenses.sql.settings.format.timestamp=true

Query tuning

Lenses SQL uses Kafka Consumer to read the data. This means that an advanced user with knowledge of Kafka could tweak the consumer properties to achieve better throughput. This would occur on very rare occasions. The query context can receive Kafka consumer settings. For example, the max.poll.records consumer can be set as:

SET max.poll.records= 100000;

SELECT *
FROM payments
LIMIT 1000000

Example

The fact is that streaming SQL is operating on unbounded streams of events: a query would normally be a never-ending query. In order to bring query termination semantics into Apache Kafka we introduced 4 controls:

LIMIT = 10000 - Force the query to terminate when 10,000 records are matched.
max.bytes = 20000000 - Force the query to terminate once 20 MBytes have been retrieved.
max.time = 60000 - Force the query to terminate after 60 seconds.
max.zero.polls = 8 - Force the query to terminate after 8 consecutive polls are empty, indicating we have exhausted a topic.

Thus, when retrieving data, you can set a limit of 1GB to the maximum number of bytes retrieved and a maximum query time of one hour like this:

SET max.bytes = 1000000000;
SET max.time = 60000000;

SELECT * 
FROM topicA 
WHERE customer.id = "XXX";

Creating & deleting Kafka topics

This page describes how to create and delete topics in the Lenses SQL Studio.

Lenses supports the typical SQL commands supported by a relational database:

CREATE
DROP
TRUNCATE
DELETE
SHOW TABLES
DESCRIBE TABLE
DESCRIBE FORMATTED

CREATE TABLE

CREATE TABLE 
    table_name(
        $field $fieldType [, $field $fieldType,...]
    )
FORMAT ($keyStorageFormat, $valueStorageFormat)
[PROPERTIES(
        partitions= *, 
        replication=$replication, 
        compacted=true/false)
];

The CREATE statement has the following parts:

CREATE TABLE - Instructs the construction of a table
$Table - The actual name given to the table.
Schema - Constructed as a list of (field, type) tuple, it describes the data each record in the table contains
FORMAT - Defines the storage format. Since it is an Apache Kafka topic, both the Key and the Value formats are required. Valid values are STRING, INT, LONG, JSON, AVRO.
PROPERTIES - Specifies the number of partitions the final Kafka topic should have, the replication factor in order to ensure high availability (it cannot be a number higher than the current Kafka Brokers number) and if the topic should be compacted.

A Kafka topic which is compacted is a special type of topic with a finer-grained retention mechanism that retains the last update record for each key.

A compacted topic (once the compaction has been completed) contains a full snapshot of the final record values for every record key and not just the recently changed keys. They are useful for in-memory services, persistent data stores, reloading caches, etc.

For more details on the subject, you should look at Kafka Documentation.

Example:

CREATE TABLE customer (
        id STRING 
        , address.line STRING 
        , address.city STRING, 
        , address.postcode INT 
        , email STRING
    )
FORMAT (string, json)
PROPERTIES (
    partitions=1, 
    compacted=true
);

Best practices dictate to use Avro as a storage format over other formats. In this case, the key can still be stored as STRING but the value can be Avro.

CREATE TABLE customer_avro (
        id STRING
        , address.line STRING
        , address.city STRING 
        , address.postcode int 
        , email string
    )
FORMAT (string, avro)
PROPERTIES (
    partitions=1, 
    compacted=true
);

SHOW TABLES

To list all tables:

SHOW TABLES;

DESCRIBE TABLE

To examine the schema an metadata for a topic:

DESCRIBE TABLE $tableName

The $tableName should contain the name of the table to describe.

Given the two tables created earlier, a user can run the following SQL to get the information on each table:

DESCRIBE TABLE customer_avro

the following information will be displayed:

/* Output:
# Column Name                           # Data Type
_key                                    String
_value.address.postcode                 Int
_value.address.city                     String
_value.address.line                     String
_value.email                            String
_value.id                               String

# Config Key                            # Config Value
cleanup.policy                          compact
compression.type                        producer
delete.retention.ms                     86400000
file.delete.delay.ms                    60000
flush.messages                          9223372036854775807
flush.ms                                9223372036854775807
index.interval.bytes                    4096
max.message.bytes                       1000012
message.format.version                  1.1-IV0
message.timestamp.difference.max.ms     9223372036854775807
message.timestamp.type                  CreateTime
min.cleanable.dirty.ratio               0.5
min.compaction.lag.ms                   0
min.insync.replicas                     1
preallocate                             false
retention.bytes                         2147483648
retention.ms                            604800000
segment.bytes                           1073741824
segment.index.bytes                     10485760
segment.jitter.ms                       0
segment.ms                              604800000
unclean.leader.election.enable          false
*/

DROP TABLE

To drop a table:

DROP TABLE $Table;

Dropping a table results in the underlying Kafka topics being removed.

System virtual tables (tables, fields)

Lenses provides a set of virtual tables that contain information about all the fields in all the tables.

Using the virtual table, you can quickly search for a table name but also see the table type.

The __table has a table_name column containing the table name, and a table_type column describing the table type (system, user, etc).

SELECT * FROM __tables;

SELECT *
FROM __tables
WHERE table_name LIKE '%customer%';

To see all the tables fields select from the _fields virtual table.

SELECT * 
FROM __fields;

SELECT * 
FROM __fields
WHERE table_name LIKE '%customer%'

Record metadata

Each Kafka message contains information related to partition, offset, timestamp, and topic. Additionally, the engine adds the key and value raw byte size.

Create a topic and insert a few entries.

CREATE TABLE tutorial(
    _key STRING
  , name STRING
  , difficulty INT
) 
FORMAT (Avro, Avro);

INSERT INTO tutorial(_key, name, difficulty)
VALUES
("1", "Learn Lenses SQL", 3),
("2", "Learn Quantum Physics", 10),
("3", "Learn French", 5);

Now we can query for specific metadata related to the records.

To query for metadata such as the underlying Kafka topic offset, partition and timestamp prefix your desired fields with _meta.

Run the following query to see each tutorial name along with its metadata information:

SELECT name
    , _meta.offset
    , _meta.timestamp
    , _meta.partition
    , _meta.__keysize
    , _meta.__valsize
FROM tutorial

/* The output is (timestamp will be different)
Learn Lenses SQL         0    1540575169198    0    7   23
Learn Quantum Physics    1    1540575169226    0    7   28
Learn French             2    1540575169247    0    7   19
*/

Filtering

This page describes common filtering of data in Kafka with Lenses SQL Studio.

WHERE clause

WHERE clause allows you to define a set of logical predicates the data needs to match in order to be returned. Standard comparison operators are supported (>, >=, <, <=, =, and !=) as well as calling functions.

We are going to use the groceries table created earlier. Select all items purchased where the prices are greater or equal to 2.00:

SELECT
     name
     , price
FROM groceries
WHERE price >= 2.0

/* Output
Meridian Crunchy Peanut Butter              2.5
Activia fat free cherry yogurts             2
Green & Blacks Organic Chocolate Ice Cream  4.2
*/

Select all customers whose last name length equals to 5:

SELECT *
FROM customers
WHERE LEN(last_name) = 5

/* Output
key         value.first_name    value.last_name
mikejones       Mike                Jones
anasmith        Ana                 Smith
*/

Search all customers containing Ana in their first name:

SELECT *
FROM customers
WHERE first_name LIKE '%Ana%'

Keep in mind that text search is case-sensitive. To use case insensitive text search, you can write:

SELECT *
FROM customers
WHERE LOWERCASE(first_name) LIKE '%ana%';

-- And here is the negated version
SELECT *
FROM customers
WHERE LOWERCASE(first_name) NOT LIKE '%ana%';

Missing values

Sometimes data can contain explicit NULL values, or it can omit fields entirely. Using IS [ NOT ] NULL, or EXISTS functions allows you to check for these situations.

Exists is a keyword in Lenses SQL grammar so it needs to be escaped, the escape character is `````.

Lenses supports JSON. JSON does not enforce a schema allowing you to insert null values.

Create the following table named customers_json:

CREATE TABLE customers_json (
    _key STRING
    , first_name STRING
    , last_name STRING
    , middle_name STRING
) FORMAT(string, json);


INSERT INTO customers_json(_key, first_name, last_name, middle_name) VALUES("mikejones", "Mike", "Jones", "Albert");
INSERT INTO customers_json(_key, first_name, last_name) VALUES("anasmith", "Ana", "Smith");
INSERT INTO customers_json(_key, first_name, last_name) VALUES("shannonelliott", "Shannon","Elliott");

Query this table for all its entries:

SELECT * 
FROM customers_json

/* The output
key             value.first_name   value.middle_name   value.last_name
mikejones           Mike                Albert          Jones
anasmith            Ana                                 Smith
shannonelliott      Shannon                             Elliott
*/

The middle_name is only present on the mikejones record.

Write a query which filters out records where middle_name is not present:


SELECT *
FROM customers_json
WHERE `EXISTS`(middle_name)

/* The output
 key             value.first_name   value.middle_name   value.last_name
mikejones            Mike            Albert                Jones
*/

This can also be written as:

SELECT *
FROM customers_json
WHERE middle_name IS NULL

When a field is actually NULL or is missing, checking like in the above query has the same outcome.

Multiple WHERE conditions

You can use AND/OR to specify complex conditions for filtering your data.

To filter the purchased items where more than one item has been bought for a given product, and the unit price is greater than 2:

SELECT *
FROM groceries
WHERE quantity > 1 
    AND price > 2

Now try changing the AND logical operand to OR and see the differences in output.

HAVING clause

To filter the entries returned from a grouping query. As with the WHERE statement, you can use HAVING syntax to achieve the same result when it comes to grouped queries.

SELECT
    COUNT(*) AS count
    , country
FROM customer
GROUP BY country
HAVING count > 1

Read a table partition only

To select data from a specific partition access the metadata of the topic.

In the following example, a table is created with three partitions and the message key is hashed and then the remainder HashValue % partitions will be the table partition the record is sent to.

-- Run
CREATE TABLE customers_partitioned (
    _key STRING
    , first_name STRING
    , last_name STRING
) 
FORMAT(string, Avro)
properties(partitions = 3);

INSERT INTO customers_partitioned(
    _key
    , first_name
    , last_name)
VALUES
("mikejones", "Mike", "Jones"),
("anasmith", "Ana", "Smith"),
("shannonelliott", "Shannon","Elliott"),
("tomwood", "Tom","Wood"),
("adelewatson", "Adele","Watson"),
("mariasanchez", "Maria", "Sanchez");

Next, run the following query:

SELECT *
FROM customers_partitioned

/* The output
offset  partition   timestamp       key         value.first_name    value.last_name
0       0           1540830780401   mikejones       Mike                Jones
1       0           1540830780441   anasmith        Ana                 Smith
2       0           1540830780464   shannonelliott  Shannon             Elliott
0       2           1540831270170   mariasanchez    Maria               Sanchez
0       1           1540830984698   tomwood         Tom                 Wood
1       1           1540831183308   adelewatson     Adele               Watson
*/

As you can see from the results (your timestamps will be different) the records span over the three partitions. Now query specific partitions:

-- selects only records from partition = 0
SELECT *
FROM customers_partitioned
WHERE _meta.partition = 0;

-- selects only records from partition  0 and 2
SELECT *
FROM customers_partitioned
WHERE _meta.partition = 0
   OR _meta.partition = 2;

Query by partitions

Kafka reads are non-deterministic over multiple partitions. The Snapshot engine may reach its max.size before it finds your record in one run, next time it might.

SELECT *
FROM topicA
WHERE transaction_id=123
   AND _meta.partition = 1

If we specify in our query that we are only interested in partition 1, and for the sake of example the above Kafka topic has 50 x partitions. Then Lenses will automatically push this predicate down, meaning that we will only need to scan 1GB instead of 50GB of data.

Query by offsets

SELECT *
FROM topicA
WHERE transaction_id=123
  AND _meta.offset > 100
  AND _meta.offset < 100100
  AND _meta.partition = 1

If we specify the offset range and the partition, we would only need to scan the specific range of 100K messages resulting in scanning 5MB.

Query by timestamp

SELECT *
FROM topicA
WHERE transaction_id=123
  AND _meta.timestamp > NOW() - "1H"

The above will query only the data added to the topic up to 1 hour ago. Thus we would query just 10MB.Time-traveling

SELECT *
FROM position_reports
WHERE
   _meta.timestamp > "2020-04-01" AND
   _meta.timestamp < "2020-04-02"

The above will query only the data that have been added to the Kafka topic on a specific day. If we are storing 1,000 days of data, we would query just 50MB.

Limit & Sampling

This page describes how to limit return and sample data in Kafka with Lenses SQL Studio.

Limit the output

To limit the output of the query you can use two approaches:

use the LIMIT clause
set the max size of the data to be returned

-- limit to 1 record
SELECT *
FROM groceries
LIMIT 1

-- set the data size returned to be 1 megabyte.
SET max.size="1m";

-- on the small dataset we have here 1 MB will accommodate all records added and more
SELECT *
FROM groceries

Set a time limit for a query

To restrict the time to run the query, use SET max.query.time:

SET  max.query.time = '1h';

SELECT ...
    FROM table
    LIMIT 1000

Sampling data

To sample data and discard the first rows:

SELECT *
FROM groceries
LIMIT 1,2

This statement instructs Lenses to skip the first record matched and then sample the next two.

Joins

This page describes joining data in Kafka with Lenses SQL Studio.

Joins

Lenses allows you to combine records from two tables. A query can contain zero, one or multiple JOIN operations.

Create an orders table and insert some data into it:

CREATE TABLE orders(
    _key INT
    , orderDate STRING
    , customerId STRING
    , amount DOUBLE
) 
FORMAT(int, avro);

INSERT INTO orders (
    _key
    , orderDate
    , customerId
    , amount
)
VALUES
(1, '2018-10-01', '1', 200.50),
(2, '2018-10-11', '1', 813.00),
(3, '2018-10-11', '3', 625.20),
(4, '2018-10-11', '14', 730.00),
(5, '2018-10-11', '10', 440.00),
(6, '2018-10-11', '9', 444.80);

With these tables in place, join them to get more information about an order by combining it with the customer information found in the customer table:

SELECT 
    o._key AS orderNumber
    , o.amount AS totalAmount
    , c.firstName
    , c.lastName
    , c.city
    , c.country
FROM orders o INNER JOIN customer c
    ON o.customerId = c._key

/*
city        orderNumber     country     totalAmount     lastName    firstName
New York        1               USA         200.5       Smith       Craig
New York        2               USA         813         Smith       Craig
Leeds           3               UK          625.2       Anthony     William
Rio De Janeiro  6               Brazil      444.8       de Ellis    Marquis
Houston         5               USA         440         Milton      Joseph
London          4               UK          730         Wilde       C. J.
*/

Lateral Joins

With lateral joins, Lenses allows you to combine records from a table with the elements of an array expression.

We are going to see in more detail what lateral joins are with an example.

Create a batched_readings table and insert some data into it:

CREATE TABLE batched_readings(
  meter_id int
  , readings int[]
)
FORMAT(int, AVRO);

INSERT INTO batched_readings(
    meter_id
    , readings
) VALUES
(1, [100, 80, 95, 91]),
(2, [87, 93, 100]),
(1, [88, 89, 92, 94]),
(2, [81])

You now can use a LATERAL join to inspect, extract and filter the single elements of the readings array, as if they were a normal field:

SELECT
    meter_id
    , reading
 FROM
    batched_readings
    LATERAL readings AS reading
WHERE 
    reading > 90

Running that query we will get the values:

meter_id

reading

100

You can use multiple LATERAL joins, one inside the other, if you want to extract elements from a nested array:

CREATE TABLE batched_readings_nested(
  sensor_id int
  , nested_readings int[][]
)
FORMAT(int, AVRO);

INSERT INTO batched_readings_nested(
    sensor_id
    , nested_readings
) VALUES
(1, [[100, 101], [103]]),
(2, [[80, 81], [82, 83, 82]]),
(1, [[100], [103, 102], [104]])

Running the following query we will obtain the same records of the previous example:

SELECT
    meter_id
    , reading
 FROM
    batched_readings
    LATERAL nested_readings AS readings
    LATERAL readings as reading
WHERE 
    reading > 90

Inserting & deleting data

This page describes how to insert and delete data into Kafka with Lenses SQL Studio.

Lenses SQL allows you to utilize the ANSI SQL command to store new records into a table.

Single or multi-record inserts are supported:

INSERT INTO $Table(column1[, column2, column3])
VALUES(value1[,value2, value3])

INSERT INTO $Table(column1[, column2, column3])
VALUES
(value1[,value2, value3]),
(value4[,value5, value6])

$Table - The name of the table to insert the data into
Columns - The target columns to populate with data. Adding a record does not require you to fill all the available columns. In the case of Avro stored Key, Value pairs, the user needs to make sure that a value is specified for all the required Avro fields.
VALUES - The set of value to insert. It has to match the list of columns provided, including their data types. You can use simple constants or more complex expressions as values, like 1 + 1 or NOW().

Example:

INSERT INTO customer (
    _key, id
    , address.line
    , address.city
    , address.postcode
    , email)
VALUES
('maria.wood','maria.wood', '698 E. Bedford Lane','Los Angeles', 90044, 'maria.wood@lenses.io'),
('david.green', 'david.green', '4309 S Morgan St', 'Chicago', 60609, 'david.green@lenses.io');

Inserting data from a SELECT

Records can be inserted from the result of SELECT statement.

The syntax is:

INSERT INTO $TABLE1
SELECT */column1[,column2, ...]
FROM $Table2
[WHERE $condition]
[LIMIT N]

For example, to copy all the records from the customer table into customer_avro one:

INSERT INTO customer_avro
SELECT *
FROM customer

Insert complex key

There are scenarios where a record key is a complex type. Regardless of the storage format, JSON or Avro, the SQL engine allows the insertion of such entries:

-- creates a smart_devices table where the key is a type with one field {deviceId:100}
CREATE TABLE smart_devices(
                              _key.deviceId INT
    , room INT
    , temperature double)
    FORMAT(avro, avro);

INSERT INTO smart_devices(
                           _key.deviceId
                         , room
                         , temperature)
VALUES(11223, 99, 22.1);

SELECT *
FROM smart_devices;

Deleting data in Kafka

There are two ways to delete data:

If the topic is not compacted, then DELETE expects an offset to delete records up to.

-- Delete records across all partitions up to the offset 10
DELETE FROM customer 
WHERE _meta.offset <= 10;

-- Delete records from a specific partition 
DELETE FROM customer 
WHERE _meta.offset <= 10 AND _meta.partition = 2

If the topic is compacted, then DELETE expects the record Key to be provided. For a compacted topic a delete translates to inserting a record with the existing Key, but the Value is null. For the customer_avro topic (which has the compacted flag on), a delete operation for a specific customer identifier would look like this:

DELETE FROM customer_avro
WHERE _key = 'andy.perez'

Deleting is an insert operation. Until the compaction takes place, there will be at least one record with the Key used earlier. The latest (or last) record will have the Value set to null.

Truncating a table

To remove all records from a table:

TRUNCATE TABLE $Table

where the $Table is the table name to delete all records from. This operation is only supported on non-compacted topics, which is a Kafka design restriction. To remove the data from a compacted topic, you have two options: either dropping and recreating the topic or inserting null Value records for each unique Key on the topic.

After rebuilding the customer table to be non-compacted, perform the truncate:


TRUNCATE TABLE customer;
/* SELECT count(*) FROM customer returns 0 after the previous statement */

Truncating a compacted Kafka topic is not supported. This is an Apache Kafka restriction. You can drop and recreate the table, or insert a record with a null Value for each unique key in the topic.

Aggregations

This page describes how to aggregate Kafka data in Lenses SQL Studio.

For a full list of aggregation functions see the SQL Reference.

Count the records

Using the COUNT aggregate function you can count the records in a table. Run the following SQL to see how many records we have on the customers_partitioned:

SELECT 
    COUNT(*) AS total
FROM customers_partitioned

Use SUM to aggregate your amounts

Using the SUM function you can sum records in a table.

SELECT 
    SUM(quantity * price) AS amount
FROM groceries

Group data with GROUP BY

To group data use the GROUP BY clause:

CREATE TABLE customer (
    firstName STRING
    , lastName STRING
    , city STRING
    , country STRING
    , phone STRING) 
FORMAT(string, avro) 
properties(compacted=true);

INSERT INTO customer (
    _key
    ,firstName
    , lastName
    , city
    , country
    , phone)
VALUES
('1','Craig', 'Smith', 'New York', 'USA', '1-01-993 2800'),
('2','William', 'Maugham','Toronto','Canada','+1-613-555-0110'),
('3','William', 'Anthony','Leeds','UK', '+44 1632 960427'),
('4','S.B.','Swaminathan','Bangalore','India','+91 7513200000'),
('5','Thomas','Morgan','Arnsberg','Germany','+49-89-636-48018'),
('6','Thomas', 'Merton','New York','USA', '+1-617-555-0147'),
('7','Piers','Gibson','London','UK', '+44 1632 960269'),
('8','Nikolai','Dewey','Atlanta','USA','+1-404-555-0178'),
('9','Marquis', 'de Ellis','Rio De Janeiro','Brazil','+55 21 5555 5555'),
('10','Joseph', 'Milton','Houston','USA','+1-202-555-0153'),
('11','John','Betjeman Hunter','Sydney','Australia','+61 1900 654 321'),
('12','Evan', 'Hayek','Vancouver','Canada','+1-613-555-0130'),
('13','E.','Howard','Adelaide','Australia','+61 491 570 157'),
('14','C. J.', 'Wilde','London','UK','+44 1632 960111'),
('15','Butler', 'Andre','Florida','USA','+1-202-555-0107');

Let’s see how many customers there are from each country. Here is the code which computes that:

SELECT 
    COUNT(*) AS count
    , country
FROM customer
GROUP BY country

/* Output
2    Canada
3    UK
2    Australia
1    India
1    Brazil
5    USA
1    Germany
*/

Metadata fields

This page describes access Kafka message metadata in Lenses SQL Studio.

When running queries against Kafka, Snapshot Engine enables you to access the record metadata through the special _meta facet.

These are the available meta fields:

Field

Description

_meta.offset

The offset of the record in its Kafka topic partition

_meta.partition

The Kafka topic partition of the record

_meta.timestamp

The Kafka record timestamp

_meta.__keysize

The length in bytes of the raw key stored in Kafka

_meta.__valuesize

The length in bytes of the raw value stored in Kafka

Select all the meta fields

The following query will select all the meta fields listed above:

SELECT _meta.* FROM topic

View headers

To view the value of a specific header you can run:

SELECT HEADERASSTRING("User") as user
FROM trips
LIMIT 100

Filter on record timestamp

SELECT ...
    FROM topic
WHERE _meta.timestamp > YESTERDAY()

Filter on table partition

To read records from a specific partition, the following query can be used:

SELECT ...
    FROM topic
WHERE _meta.partition = 1
    OR _meta.partition = 8

Search for a record on a specific offset

Here is the query to use when the record offset and partition are known:

SELECT ...
    FROM topic
WHERE _meta.partition = 2
    AND _meta.offset = 8
LIMIT 1

Get the latest N records per partition

This query will get the latest 100 records per partition (assuming the topic is not compacted):

SELECT ...
    FROM topic
WHERE _meta.offset >= LAST_OFFSET() - 100

This instead will get the latest 100 records for a given partition (again assuming the topic is not compacted):

SELECT ...
    FROM topic
WHERE _meta.offset >= LAST_OFFSET() - 100 
    AND _meta.partition = 2

Views & synonyms

This page describes how to use views and synonyms in Lenses SQL Studio to query Kafka.

Lenses supports the typical SQL commands supported by a relational database:

CREATE
DROP
TRUNCATE
DELETE
SHOW VIEWS

A view is a virtual table, generated dynamically based on the results of a SELECT statement.

A view looks and acts just like a real table, but is always created on the fly as required, so it is always up to date.

A synonym is an alias for a table. This is useful if you have a topic with a long, unwieldy name like customer_reports_emea_april_2018_to_march_2018 and you want to access this as customer_reports.

CREATE VIEW

To create a view:

CREATE VIEW <viewname> AS <query>

Where viewname is the name of the virtual table that is used to access the records in the view, and the query is a standard SELECT statement.

Then we can query the view:

SELECT *
FROM customer_emails

A view acts as a virtual table. This means that a view can be filtered even more or that a projection can be applied to a view:

SELECT *
FROM customer_emails
WHERE name LIKE 'sam%'

DROP VIEW

To delete a view:

DROP VIEW <viewname>

If you wish to modify an existing view, use the syntax above to delete it, and then create a new view with the same name.

SHOW VIEW

To see a definition of a view. You can use the following syntax:

SHOW VIEW <viewname>

CREATE SYNONYM

To create a synonym:

CREATE SYNONYM <name> FOR <table>

DROP SYNONYM

To delete a synonym:

DROP SYNONYM <name>

If you wish to modify an existing synonym, use the syntax above to delete it, and then create a new synonym with the same name.

Common use cases

Three common reasons for using a view are:

creating a projection from a table with a large number of fields
representing joins as a single table
and creating a preset filter

We will cover each scenario with an example.

Projection subset

If we have a table called customers which contains full customer records - name, email, age, registration date, country, password, and many others – and we find ourselves repeatedly querying it for just name and email.

A view could be created that returns just the name and email as a projection.

CREATE VIEW customer_emails AS
SELECT 
    name
    , email
FROM customers

There is no reason to specify the projection each time.

The benefit is more significant when we want to select a higher number of fields - say a topic with 50 fields, and we want to select only 15.

Representing joins

The statement that is used to generate the view can consist of one or more tables. One use case of views is to represent joined tables as if they were a single table. This avoids the need for writing a complex join query each time.

CREATE VIEW orders AS
    SELECT 
        c.customer_id
        , c.customer_name
        , c.customer_email
        , o.order_id
        , o.order_date
        , a.address
        , a.postcode
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
        JOIN addresses a ON o.delivery_address_id = a.address_id

Then we can select from this join like this:


SELECT *
FROM orders

Preset filter

Finally, another use case is to define a filter that is commonly used. If a topic contains transactions, and we often found ourselves searching for transactions from the UK. We could run this query each time:

SELECT *
FROM transactions
WHERE country = "UK"

Alternatively, we can set up a view with this filter pre-applied:

CREATE VIEW transactions_uk AS
SELECT *
FROM transactions
WHERE country = "UK"

Then use a SELECT query:

SELECT *
FROM transactions_uk

Arrays

This page describes examples of using arrays in Lenses SQL Studio to query Kafka.

For a full list of array functions see the SQL Reference.

Create an array field

You can create array fields using the ..[] syntax:

CREATE TABLE table(
                      _key INT,
    , fialdA INT[]            -- Simple array
    , fieldB.fieldC STRING[]) -- Array in a subfield
    , fieldD INT[][]          -- Nested array
FORMAT(avro, avro);

Select an array field

Tables can store data containing arrays. Here is a SQL statement for querying an array item:

SELECT
    fieldA[1]
     , fieldB.fieldC[2].x
FROM table
WHERE fieldA[1] LIKE '%Lenses%'

When working with arrays is good to check the array bounds. See the SIZEOF function in the list of supported functions.

Query for array size

Sometimes you want to find out how many items are in your array. To do so you can run:

SELECT
    SIZEOF(arrayFieldA)
FROM table

Managing queries

This page describes how to manage and control queries against Kafka in Lenses SQL Studio.

Termination Control

SELECT * FROM topicA WHERE _key.deviceId=123 LIMIT 10

Adding a LIMIT 10 in the SQL query will result in the SQL terminating early, as soon as 10 x messages have been discovered. It’s not a perfect solution as we might never find 10 x messages, and thus perform a full scan.

You can also set a maximum query or idle time:

SET max.query.time = 30s;

or max idle time, the idea is that there is no reason to keep polling if we have exhausted the entire topic:

SET max.idle.time = 5s;

or a maximum amount of data to read from Kafka. This controls how much data to read from Kafka NOT the required memory.

SET max.size = 1M;

Recent queries

Recent queries are displayed, but only for the current session, they are not currently retained.

Click on the play button to run a previous query. If a query is already running, you will be asked if you want to stop it first.

View All queries

SHOW ALL QUERIES

View Running queries

You can see all running queries by Lenses users using SQL:

SHOW QUERIES

Kill Running queries

You can force stop a query by another user using SQL:

KILL QUERY <id>

SQL Processors

In this guide you will learn about SQL Processor apps in Lenses, how to use them to quickly create streaming flows using SQL streaming, deploy and scale them.

SQL Processors continuously process data in Kafka topics. Under the hood, the Kafka Streams APIs are combined with the internal Lenses application deployment framework. A user gets a seamless experience for creating scalable processing components to filter, aggregate, join, and transform Kafka topics, which can scale natively to Kubernetes.

SQL Processors are continuous queries, they process data as it arrives and react to events, e.g. payment made, $5 to Coffeshop. The data is active, the query is passive.

SQL Processors offer:

A no-code, stand-alone application executing a given Lenses SQL query on current and future data
Query graph visualisation
Fully integrated experience within Lenses
ACLs and Monitoring functionality out-of-the-box
Ability to scale up and down workflows via Kubernetes

SQL Processors are long-lived applications that continuously process data. In use with Kubernetes mode (recommended), they are deployed via Lenses, as Kubernetes deployments separate from the Lenses instance.

Creating a SQL Processor

To create SQL Processors go to Workspace->Apps->New App->SQL Processor.

Enter a name
Enter your SQL statement, the editor will help you with IntelliSense
Optionally specify a description, tag and ProcessorID (consumer group ID)
Select the deployment target, Kubernetes this will be a Kubernetes cluster and namespace.

If the configuration is valid the SQL Processor will be created. You can then click Start to deploy. Processors are not started automatically, click Start in the Actions menu to start the Processor.

Starting a SQL Processor

To start a Processor, select Start from the Action menu.

Viewing SQL Processors

Select the SQL processor. Lenses will show an overview of the health of the processor.

Selecting a Runner will display further information about each individual runner.

The metadata in the Summary tab also contains the consumer group ID (Processor ID).

The SQL tab shows the SQL Streaming statement the processor is running. A visual representation is shown in the Data Flow tab.

The Configuration tab shows the low-level settings.

Scaling a SQL Processor

To scale a processor, Select Scale from the Actions menu, and enter the desired amount of runners. In Kubernetes mode, the runners are pods.

Stopping a SQL Processor

Select Stop in the Actions Menu.

Snippets

Lenses provides helpful snippets for common scenarios. Select the snippet from the Help section.

Processor ID vs Application ID

ProcessorID is the public unique identifier for an SQL processor. It is customizable, meaning that you, as a user, have control over it and can set this identifier to any arbitrary string.

Restrictions on custom ProcessorIDs:

They have to be unique across all processors
Match the following regex: ^[a-zA-Z0-9\-\._]+:
- Only letters, capital letters, numbers and -, _ & - are allowed.
- It has to start with a letter or a number.
- It cannot be empty.

One important aspect of the ProcessorID is that it is used as the Kafka consumer group identifier. That means that, in practice, this is the value that allows an SQL processor to build its consumer group and coordinate record ingestion from Kafka between all Processor replicas. Consequently, if the ProcessorID of a given SQL processor is changed, that processor will restart consuming messages from the beginning of the existing records in the topic.

The ApplicationID is the Lenses unique identifier, is automatically created by Lenses, and cannot be customized.

This is unique among all applications types; it does not matter if it’s an SQL processor or a different (new) sort of future application.

Lenses uses the ApplicationID to manage applications. This means that, when Starting, Stopping or Scaling an application, Lenses will use this attribute to pick the right instance.

Concepts

This page describes the concepts of Lenses SQL Processors from joining, aggregating and filtering data in Kafka.

SQL Processors see data as an independent sequence of infinite events. An Event in this context is a datum of information; the smallest element of information that the underlying system uses to communicate. In Kafka’s case, this is a Kafka record/message.

Two parts of the Kafka record are relevant:

Key
Value

These are referred to as facets by the engine. These two components can hold any type of data and Kafka itself is agnostic on the actual storage format for either of these two fields. SQL Processors interpret records as (key, value) pairs, and it exposes ways to manipulate these pairs in several ways.

Queries are applications: SQL Processors

As mentioned above, queries that are meant to be run on streaming data are treated as stand-alone applications. These applications, in the context of the Lenses platform, are referred to as SQL Processors.

A SQL Processor encapsulates a specific Lenses SQL query, its details and everything else Lenses needs to be able to run the query continuously.

Schemas must be available for Structured Data

To support features like:

Inference of output schemas
Creation-time validation of input query
Selections
Expressions

Lenses SQL Engine Streaming mode needs to have up-to-date schema information for all structured topics that are used as input in a given query. In this context, structured means topics that are using complex storage formats like AVRO or JSON.

INSERT INTO daily-item-purchases-stats
SELECT STREAM
    itemId
    , COUNT(*) AS dailyPurchases
    , AVG(price / quantity) AS average_per_unit
FROM purchases
WINDOW BY TUMBLE 1d
GROUP BY itemId;

For the above query, for example, the purchases topic will need to have a value set to a structured format and a valid schema will need to already have been configured in Lenses. In such schema, fields itemId, price and quantity must be defined, the latter two being of a numerical type.

These requirements ensure the Engine will always be in a position to know what kind of data it will be working with, guaranteeing at the same time that all obvious errors are caught before a query is submitted.

M-N topologies

The UI allows us to visualise any SQL Processor out of the box. For the example:

This visualisation helps to highlight that the Lenses SQL fully supports M-N topologies.

What this means is that multiple input topics can be read at the same time, their data manipulated in different ways and then the corresponding results stored in several output topics, all as part of the same Processor’s topology.

This means that all processing can be done in one go, without having to split parts of a topology into different Processors (which could result in more data being stored and shuffled by Kafka).

Expressions in Lenses SQL

An expression is any part of a Lenses SQL query that can be evaluated to a concrete value (not to be confused with a record value).

In a query like the following:

INSERT INTO target-topic
SELECT STREAM
    CONCAT('a', 'b') AS result1
    , (1 + field1) AS result2
    , field2 AS result3
    , CASE
        WHEN field3 = 'Robert' THEN 'It's bobby'
        WHEN field3 = 'William' THEN 'It's willy'
        ELSE 'Unknown'
      END AS who_is_it
FROM input-topic
WHERE LENGTH(field2) > 5;

CONCAT('a', 'b'), (1 + field1) and field2 are all expressions whose values will be _projected_ onto the output topic, whereas LENGTH(field2) > 5 is an expression whose values will be used to filter out input records.

SQL Processors and Kafka Streams

SQL Processors are built on top of Kafka Streams, and it enriches this tool with an implementation of Lenses SQL that fits well with the architecture and design of Kafka Streams. When executed, they run a Kafka Streams instance.

Consumer groups

Each SQL Processor has an application ID which uniquely identifies it within Lenses. The application ID is used as the Kafka Streams application ID which in turn becomes the underlying Kafka Consumer(s) group identifier.

Scaling

Scaling up or down the number of runners automatically adapts and rebalances the underlying Kafka Streams application in line with the Kafka group semantics.

The advantages of using Kafka Streams as the underlying technology for SQL Processors are several:

Kafka Streams is an enterprise-ready, widely adopted and understood technology that integrates natively with Kafka
Using consumer group semantics allows leveraging Kafka’s distribution of workload, fault tolerance and replication out of the box

Data as a flow of events: Streams

A stream is probably the most fundamental abstraction that SQL Processors provide, and it represents an unbounded sequence of independent events over a continuously changing dataset.

Let’s clarify the key terms in the above definition:

event: an event, as explained earlier, is a datum, that is a (key, value) pair. In Kafka, it is a record.
continuously changing dataset: the dataset is the totality of all data described by every event received so far. As such, it is changed every time a new event is received.
unbounded: this means that the number of events changing the dataset is unknown and it could even be infinite
independent: events don’t relate to each other and, in a stream, they are to be considered in isolation

The main implication of this is that stream transformations (e.g. operations that preserve the stream semantics) are stateless because the only thing they need to take into account is the single event being transformed. Most Projections fall within this category.

Stream Example

To illustrate the meaning of the above definition, imagine that the following two events are received by a stream:

("key1", 10)
("key1", 20)

Now, if the desired operation on this stream was to sum the values of all events with the same key (this is called an Aggregation), the result for "key1" would be 30, because each event is taken in isolation.

Finally, compare this behaviour with that of tables, as explained below, to get an intuition of how these two abstractions are related but different.

Stream syntax

Lenses SQL streaming supports reading a data source (e.g. a Kafka topic) into a stream by using SELECT STREAM.

SELECT STREAM *
FROM input-topic;

The above example will create a stream that will emit an event for each record, including future ones.

Data as a snapshot of the state of the world: Tables

While a stream is useful to have visibility to every change in a dataset, sometimes it is necessary to hold a snapshot of the most current state of the dataset at any given time.

This is a familiar use-case for a database and the Streaming abstraction for this is aptly called table.

For each key, a table holds the latest version received of its value, which means that upon receiving events for keys that already have an associated value, such values will be overridden.

A table is sometimes referred to as a changelog stream, to highlight the fact that each event in the stream is interpreted as an update.

Given its nature, a table is intrinsically a stateful construct, because it needs to keep track of what it has already been seen. The main implication of this is that table transformations will consequently also be stateful, which in this context means that they will require local storage and data being copied.

Additionally, tables support delete semantics. An input event with a given key and a null value will be interpreted as a signal to delete the (key, value) pair from the table.

Finally, a table needs the key for all the input events to not be null. To avoid issues, tables will ignore and discard input events that have a null key.

Table example

To illustrate the above definition, imagine that the following two events are received by a table:

("key1", 10)
("key1", 20)

Now, if the desired operation on this table was to sum the values of all events with the same key (this is called an Aggregation), the result for key1 would be 20, because (key1, 20) is interpreted as an update.

Finally, compare this behaviour with that of streams, as explained above, to get an intuition of how these two abstractions are related but different.

Table syntax

Lenses SQL Streaming supports reading a data source (e.g. a Kafka topic) into a table by using SELECT TABLE.

SELECT TABLE *
FROM input-topic;

The above example will create a table that will treat each event on input-topic, including future ones, as updates.

Tables and compacted topics

Given the semantics of tables, and the mechanics of how Kafka stores data, the Lenses SQL Streaming will set the cleanup.policy setting of every new topic that is created from a table to compact, unless explicitly specified otherwise.

What this means is that the data on the topic will be stored with a semantic more closely aligned to that of a table (in fact, tables in Kafka Streams use compacted topics internally). For further information regarding the implications of this, it is advisable to read the official Kafka Documentation about cleanup.policy.

The duality between streams and tables

Streams and tables have significantly different semantics and use cases, but one interesting observation is that are strongly related nonetheless.

This relationship is known as stream-table duality. It is described by the fact that every stream can be interpreted as a table, and similarly, a table can be interpreted as a stream.

Stream as Table: A stream can be seen as the changelog of a table. Each event in the stream represents a state change in the table. As such, a table can always be reconstructed by replaying all events of a stream, in order.
Table as Stream: A table can be seen as a snapshot, at a point in time, of the latest value received for each key in a stream. As such, a stream can always be reconstructed by iterating over each (Key, Value) pair and emitting it as an event.

To clarify the above duality, let’s use a chess game as an example.

On the left side of the above image, a chessboard at a specific point in time during a game is shown. This can be seen as a table where the key is a given piece and the value is its position. Also, on the right-hand side, there is the list of moves that culminated in the positioning described on the left; it should be obvious that this can be seen as a stream of events.

The idea formalised by the stream-table duality is that, as it should be clear from the above picture, we can always build a table from a stream (by applying all moves in order).

It is also always possible to build a stream from a table. In the case of the chess example, a stream could be made where each element represents the current state of a single piece (e.g. w: Q h3).

This duality is very important because it is actively used by Kafka (as well as several other storage technologies), for example, to replicate data and data stores and to guarantee fault tolerance. It is also used to translate table and stream nodes within different parts of a query.

SQL Processors and schemas: a proactive approach

One of the main goals of SQL Processors is to ensure that it uses all the information available to it when a SQL Processor is created to catch problems, suggest improvements and prevent errors. It’s more efficient and less frustrating to have an issue coming up during registration rather than at some unpredictable moment in the future, at runtime, possibly generating corrupted data.

SQL engine will actively check the following during the registration of a processor:

Validation of all user inputs
Query lexical correctness
Query semantics correctness
Existence of the input topics used within the query
User permissions to all input and output topics
Schema alignment between fields and topics used within the query
Format alignment between data written and output topics, if the latter already exist

When all the above checks pass, the Engine will:

Generate a SQL Processor able to execute the user’s query
Generate and save valid schemas for all output topics to be created
Monitor the processor and make such metrics available to Lenses

The Engine takes a principled and opinionated approach to schemas and typing information; what this means is that, for example, where there is no schema information for a given topic, that topic’s fields will not be available to the Engine, even if they are present in the data; also, if a field in a topic is a string, it will not be possible to use it as a number for example, without explicitly CASTing it.

The Engine’s approach allows it to support naming and reusing parts of a query multiple times. This can be achieved using the dedicated statement WITH.

SET defaults.topic.autocreate=true;
SET commit.interval.ms='1000';
SET enable.auto.commit=false;
SET auto.offset.reset='earliest';

WITH countriesStream AS (
  SELECT STREAM *
  FROM countries
);

WITH merchantsStream AS (
  SELECT STREAM *
  FROM merchants
);


WITH merchantsWithCountryInfoStream AS (
  SELECT STREAM
    m._key AS l_key
    , CONCAT(surname, ', ', name) AS fullname
    , address.country
    , language
    , platform
  FROM merchantsStream AS m
        JOIN countriesStream AS c
            ON m.address.country = c._key  
  WITHIN 1h
);

WITH merchantsCorrectKey AS(
  SELECT STREAM
    l_key AS _key
    , fullname
    , country
    , language
    , platform
  FROM merchantsWithCountryInfoStream
);

INSERT INTO currentMerchants
SELECT STREAM *
FROM merchantsCorrectKey;

INSERT INTO merchantsPerPlatform
SELECT TABLE
  COUNT(*) AS merchants
FROM merchantsCorrectKey
GROUP BY platform;

The WITHs allow for whole sections of the query to be reused and manipulated independently by successive statements, and all this is done by maintaining schema and format alignment and correctness. The reason why this is useful is that it allows to specify queries that split their processing flow without having to redefine parts of the topology. This, in turn, means that less data needs to be read and written to Kafka, improving performance.

This is just an example of what SQL Processors can offer because of the design choices taken and the strict rules implemented at query registration.

Projections

This page describes using projections for data in Kafka with Lenses SQL Processors.

A projection represents the ability to project a given input value onto a target location in an output record. Projections are the main building block of SELECT statements.

A projection is composed of several parts. Projections have a source section that allows one to select a specific value and to ensure that it will be present in the output, in the desired structure and with the desired name, as described by the target section.

In the below query:

INSERT INTO target-topic
SELECT STREAM
    CONCAT('a', 'b') AS result1
    , field4
    , (1 + field1) AS _key.a
    , _key.field2 AS result3
    ,  5 + 7 AS constantField
    ,  CASE
        WHEN field3 = 'Robert' THEN 'It's bobby'
        WHEN field3 = 'William' THEN 'It's willy'
        ELSE 'Unknown'
       END AS who_is_it
FROM input-topic;

These are projections:

CONCAT(‘a’, ‘b’) as result1
field4
(1 + field1) as _key.a
_key.field2 as result3
5 + 7 as constantField
CASE … as who_is_it

It is worth highlighting that projections themselves are stateless. While the calculation of the source value could be stateful (depending on the type of expression being evaluated, the act of making the value available in the output is not a stateful operation.

Syntax

The precise syntax of a projection is:

 <expression> [as [<facet>.]<alias>]|[<facet>]]
|------------||-------------------------------|
    source                  target

In the above, [] indicates optional sections and | is to be read as OR.

Source:
- <expression>: the source section must consist of a valid SQL expression, which will be evaluated to the value to be projected.
Target: this section is optional. When missing, the output target facet will be _value and the <alias> will be defaulted to the string representation of the source expression.
- as: this is the Lenses SQL keyword that makes specifying the target of the projection explicit.
- [<facet>.]<alias>]|[<facet>]: this nested section, which must be specified whenever as is used, is composed of two mutually exclusive sub-sections.
  - [<facet>.]<alias>]: this optional section is a string that will be used as the field name in the output record. Optionally, the output _facet_ where the field will be projected (e.g. _key or _value) can be also specified.
  - [<facet>]: this optional section specifies that the result of the projection will make up the whole of the indicated facet of the output record (e.g. the whole key or the whole value).

The above syntax highlights something important about the relationship between projections and facets: a projection can have a source facet and will always have a target facet and while they are related, they might not always be the same.

In the above example query:

field4 is a projection from value facet to value facet
(1 + field1) as _key.a is a projection from the value facet to the key facet
_key.field2 as result3 is a projection from the key facet to the value facet
5 + 7 as constantField is a projection to value facet but with no source facet, as the expression does not depend on the input record

This makes SQL projections very expressive, but it also means that attention needs to be paid when facets are manipulated explicitly. Details, edge cases and implications of this will be discussed below.

Source of a Projection

Projection from a `_value` field

This is the default type of projection when a field is selected within the source expression. Unless otherwise specified, the source facet of a projection will always be _value.

In light of the above, the following query contains only projections that read from the _value facet of the source:

INSERT INTO target-topic
SELECT STREAM
    CONCAT('--> ', _value.field2, ' <--')
    , field1
    , LENGTH(field4)
    , _value.field1 AS aliased
    , 5 + field3
FROM input-topic;

Also, notice that field1 and _value.field1 as aliased are reading the exact same input field and in the former case the _value facet is implicit.

Projection from a key field

A projection can also access a selected field on the key facet.

INSERT INTO target-topic
SELECT STREAM
    CONCAT('--> ', _key.field2, ' <--')
    , _key.field1
    , LENGTH(_key.field4)
    , 5 + _key.field3
FROM input-topic;

The above query contains only projections that read from the _key facet of the source.

This kind of projection behaves exactly the same as any other projection, but because of specific interactions with other mechanics in Lenses SQL Engine Streaming mode, they can’t be used in the same query with Aggregations or Joins.

Projection from whole facets

All examples of projections described until now focused on selecting a field from either the key or the value of an input record. However, a projection can also read a whole facet and project it to the output.

INSERT INTO target-topic
SELECT STREAM
    _key AS old-key
    , _value AS old-value
FROM input-topic;

In the above query, there are two projections:

_key as old-key: This is projecting the whole key of input-topic onto a field old-key on target-topic
_value as old-value: This is projecting the whole value of input-topic onto a field old-value on target-topic

For more details about the rules around using aliases (as done in the above example).

This can be useful when the input source uses a primitive storage format for either one or both facets, but it is desirable to map such facets to named fields within a more complex output structure, as could be the case in the above query. This said projections from whole facets are supported for all storage formats, not only the primitive ones.

Mix and match

As it should be clear from all the examples on this page so far, projections can be freely mixed within a single SELECT statement; the same query can have many projections, some of which could be reading from the key of the input record, some others from the value and yet others returning literal constants.

Lenses SQL is designed to support this mixed usage and to calculate the appropriate resulting structure given the schemas of all the projections’ inputs.

Wildcard projections

Lenses SQL assigns a special meaning to * when used as a projection.

Unqualified Wildcard projection

When * is used without any further qualification, it is interpreted as an instruction to project all fields from _key_ and _value_ to the output.

INSERT INTO target-topic
SELECT STREAM *
FROM input-topic;

The result of this query is that target-topic will have exactly the same fields, schema and data than input-topic.

Qualified wildcard projection

When * is explicitly qualified, the meaning becomes more precise and it will limit the fields to be selected to only the ones belonging to the qualified source (and optionally facet).

INSERT INTO target-topic
SELECT STREAM
    i1.*
    , i2.field1
FROM input-topic1 AS i1 JOIN input-topic2 AS i2
WITHIN 1h;

The above shows how a qualified wildcard projection can be used to target all the fields of a specific source. Additionally, a qualified wildcard can be used in addition to other normal projections (e.g. i2.field1).

Target of a Projection

The target of a projection is the location within the output record where the result of the projection’s expression is going to be mapped. As previously mentioned, Lenses SQL uses the keyword as to explicitly control this.

Using as, it is possible to:

Assign an alias to the projected field in the result. For example, field1 as aliased-field1 is reading field1 and projecting its value to a field called aliased-field1. Notice that, as no facet information is specified, this will be targeting the value of the output record.
Project directly to nested fields within structures

INSERT INTO target-topic
SELECT STREAM
    field1 AS x.a
    , field2 AS x.b,
FROM input-topic;

The above query will result in a field x that is a structure that contains two fields a and b.

Control the facet to which the field is projected. For example, field1 as _key.field1 is reading field1 (from the value facet) and projecting to a field with the same name on the key of the output. Depending on the source being projected, doing this might have important implications.
Project over the whole target value or key For example, field1 as _value is reading field1 and projecting it over the whole value of the output. One important thing about this is that Lenses SQL allows only one projection of this kind per facet per query. What this means is that the following query would be invalid, because only one projection can target _value within the same query:

INSERT INTO target-topic
SELECT STREAM  
    field1 AS _value
    , field2 AS _value,
FROM input-topic;

Rules for nested aliases

In order to avoid potential errors, Lenses defines the following rules for defining aliases:

Alias can’t add new fields to non-struct properties.
- e.g: an_int AS foo.bar, field1 AS foo.bar.field1 (since bar will be an INT we can’t define a property field1 under it)
Aliases cannot override previously defined properties.
- e.g: field1 AS foo.bar.field1, field2 as foo.bar (setting foo.bar with the contents of field2 would override the value of field1)
Fields cannot have duplicated names.
- e.g: field1 AS foo.bar, field2 as foo.bar (setting foo.bar with the contents of field2 would override it’s previous content)

Implications of projecting on key

Projecting on Key is a feature that can be useful in situations where it is desirable to quickly change the key of a Table or a Stream, maybe in preparation for further operations (e.g. joins etc…). This feature is sometimes referred to as re-keying within the industry.

However, one important implication of using this feature is that Kafka uses the key to determine in what partition a record must be stored; by changing the key of the record, the resulting partitioning of the output topic might differ from one of the input topics. While there is nothing wrong with this, it is something that must be understood clearly when using this feature.

Filtering

Sometimes it is desirable to limit the input records to be projected based on some predicate.

For example, we might want to project field1 and field2 of input-topic onto output-topic, but only if field3 contains a specific value.

INSERT INTO target-topic
SELECT STREAM  
    field1
    , field2,
FROM input-topic
WHERE field3 = 'select_me';

This is the WHERE clause is used: to filter the input dataset by some predicate, applying the rest of the query only to records that match the predicate.

The syntax for this clause is simply WHERE <expression>, where <expression> is a valid arbitrarily nested Lenses SQL boolean expression.

INSERT INTO target-topic
SELECT STREAM  
    field1
    , field2,
FROM input-topic
WHERE (field3 = 'select_me' AND LENGTH(CONCAT(field1, field2)) >= 5)
    OR field4 = (field5 + 5);

Projections and Storage Formats

Projections have a close relationship with the storage format of their target.

By default, if a query contains more than one projection for the same facet, then that facet’s storage format will be a structure (which type of structure exactly depends on other factors that are not relevant here).

INSERT INTO target-topic
SELECT STREAM  
    field1 as result1
    , field2
FROM input-topic;

The above query will make target-topic’s value storage format a structure (e.g. AVRO or JSON) with two fields named: result1 and field2. The storage format for target-topic’s key will be the same as the input-topic’s, as there are no projections targeting that facet.

The storage format of the output can be explicitly changed by a projection, however. This will often be the case when a projection on a whole facet is used. Consider the following query:

INSERT INTO target-topic
SELECT STREAM  
    field1 AS result1
    , field2 AS _key,
FROM input-topic;

In this case, target-topic’s value storage format will still be a structure, but its key will depend on field2’s schema. For example, if field2 is a string, then target-topic’s key will be changed to STRING (assuming it was not STRING already). The same behavior applies to the _value_ facet.

One example where this can be relevant is when a projection is used to map the result of a single field in the target topic. Consider the following query:

INSERT INTO target-topic
SELECT STREAM field1 AS _value
FROM input-topic;

This query will project field1 on the whole value facet, and this will result in a change of storage format as well. This behaviour is quite common when a single projection is used, because more often than not the desired output in such a scenario will be the content of the field rather than a structure with a single field.

Joins

This page describes how to join data in Kafka with in Lenses SQL Processors.

Joins allow rows from different sources to be combined.

Lenses allows two sources of data to be combined based either on the equality of their _key facet or using a user-provided expression.

A query using joins looks like a regular query apart from the definition of its source and in some cases the need to specify a window expression:

SELECT (STREAM|TABLE)
  <projection>
FROM
  <sourceA> [LEFT|RIGHT|INNER|OUTER] JOIN
  <sourceB> [ON <joinExpression>]
  [WITHIN <duration>]
WHERE
  <filterExpression>;

projection: the projection of a join expression differs very little from a regular projection. The only important consideration is that since data is selected from two sources, some fields may be common to both. The syntax table.field is recommended to avoid this type of problem.
sourceA/sourceB : the two sources of data to combine.
window: only used if two streams are joined. Specifies the interval of time to search matching results.
joinExpression: a boolean expression that specifies how the combination of the two sources is calculated.
filterExpression: a filter expression specifying which records should be filtered.

Join types

When two sources of data are combined it is possible to control which records to keep when a match is not found:

Disclaimer: The following examples do not take into consideration windowing and/or table materialization concerns.

Customers

_key

name

John

Frank

Orders

_key

customer_id

item

Computer

Mouse

Keyboard

Inner join

WITH customersTable AS (SELECT TABLE * FROM customers);
WITH ordersStream AS (SELECT STREAM * FROM orders);

INSERT INTO result
SELECT STREAM
    customersTable.name
   , ordersStream.item
 FROM
    ordersStream JOIN customersTable
        ON customersTable.id = ordersStream.customer_id;

This join type will only emit records where a match has occurred.

name

item

John

Computer

John

Mouse

(Notice there’s no item with customer.id = 2 nor is there a customer with id = 3 so these two rows are not present in the result).

Left join

WITH customersTable AS (SELECT TABLE * FROM customers);
WITH ordersStream AS (SELECT STREAM * FROM orders);

INSERT INTO result
SELECT STREAM
    customersTable.name
    , ordersStream.item
 FROM
    ordersStream LEFT JOIN customersTable
        ON customersTable.id = ordersStream.customer_id;

This join type selects all the records from the left side of the join regardless of a match:

name

item

John

Computer

John

Mouse

null

Keyboard

(Notice all the rows from orders are present but since no customer.id = 3 no name can be set.)

Right join

WITH customersTable AS (SELECT TABLE * FROM customers);
WITH ordersStream AS (SELECT STREAM * FROM orders);

INSERT INTO result
SELECT TABLE
    customersTable.name
    , ordersStream.item
 FROM
    customersTable RIGHT JOIN ordersStream
        ON customersTable.id = ordersStream.customer_id;

A right join can be seen as a mirror of a LEFT JOIN. It selects all the records from the right side of the join regardless of a match:

name

item

John

Computer

John

Mouse

null

Keyboard

Outer Join

WITH customersStream AS (SELECT STREAM * FROM customers);
WITH ordersStream AS (SELECT STREAM * FROM orders);

INSERT INTO result
SELECT TABLE
    customersStream.name
    , ordersStream.item
 FROM
    ordersStream OUTER JOIN customersStream
        ON customersTable.id = ordersStream.customer_id;

An outer join can be seen as the union of left and right joins. It selects all records from the left and right side of the join regardless of a match happening:

name

item

John

Computer

John

Mouse

null

Keyboard

Frank

null

Matching expression (ON)

By default, if no ON expression is provided, the join will be evaluated based on the equality of the _key facet. This means that the following queries are equivalent:

SELECT TABLE *
FROM customers JOIN orders;

SELECT TABLE *
FROM customers JOIN orders
    ON customers._key = orders._key;

When an expression is provided, however, there are limitations regarding what kind of expressions can be evaluated.

Currently, the following expression types are supported:

Equality expressions using equality (=) with one table on each side:
- customers.id = order.user_id
- customers.id - 1 = order.user_id - 1
- substr(customers.name, 5) = order.item
Any boolean expression which references only one table:
- len(customers.name) > 10
- substr(customer.name,1) = "J"
- len(customer.name) > 10 OR customer_key > 1
Allowed expressions mixed together using an AND operator:
- customers._key = order.user_id AND len(customers.name) > 10
- len(customers.name) > 10 AND substr(customer.name,1) = "J"
- substr(customers.name, 5) = order.item AND len(customer.name) > 10 OR customer_key > 1

Any expressions not following the rules above will be rejected:

More than one table is referenced on each side of the equality operator
- concat(customer.name, item.name) = "John"
- customer._key - order.customer_id = 0
a boolean expression not separated by an AND references more than one table:
- customer._key = 1 OR customer._key = order.customer_id

Windowing: stream to stream joins (WITHIN )

When two streams are joined Lenses needs to know how far away in the past and in the future to look for a matching record.

This approach is called a “Sliding Window” and works like this:

Customers

arrival

name

t = 6s

Frank

t = 20s

John

Purchases

arrival

item

customer_id

t = 10s

Computer

t = 11s

Keyboard

SELECT STREAM
     customers.name
    , orders.item
FROM
    customers LEFT JOIN orders WITHIN 5s
        ON customers.id = orders.customer_id
WITHIN 5s;

At t=10, when both the Computer and the Keyboard records arrive, only one customer can be found within the given time window (the specified window is 5s thus the window will be [10-5,10+5]s ).

This means that the following would be the result of running the query:

name

item

Frank

Computer

null

Keyboard

Note: John will not match the Keyboard purchase since t=20s is not within the window interval [10-5,10+5]s.

Non-windowed joins (stream to table and table to stream)

When streaming data, records can be produced at different rates and even out of order. This means that often a match may not be found because a record hasn’t arrived yet.

The following example shows an example of a join between a stream and a table where the arrival of the purchase information is made available before the customers’ information is.

(Notice that the purchase of a “Keyboard” by customer_id = 2 is produced before the record with the customer details is.)

Customers

arrival

name

t = 0s

Frank

t = 10s

John

Purchases

arrival

item

customer_id

t = 0s

Computer

t = 1s

Keyboard

Running the following query:

WITH customersTable AS (SELECT TABLE * FROM customers);

SELECT STREAM
    customers.name
   , item.item
FROM
    orders LEFT JOIN customersTable  
        ON customers.id = orders.id

would result in the following:

name

item

Frank

Computer

null

Keyboard

If later, the record for customer_id = 2 is available:

arrival

name

t = 10s

John

a record would be emitted with the result now looking like the following:

name

item

Frank

Computer

null

Keyboard

John

Keyboard

Notice that “Keyboard” appears twice, once for the situation where the data is missing and another for when the data is made available.

This scenario will happen whenever a Stream is joined with a Table using a non-inner join.

Table/Stream joins compatibility table

The following table shows which combinations of table/stream joins are available:

Left

Right

Allowed types

Window

Result

Stream

All

Required

Stream

Table

All

Table

Stream

RIGHT JOIN

Stream

Table

INNER, LEFT JOIN

Stream

Key decoder types

In order to evaluate a join between two sources, the key facet for both sources has to share the same initial format.

If formats are not the same the join can’t be evaluated. To address this issue, an intermediate topic can be created with the correct format using a STORE AS statement. This newly created topic can then be created as the new source.

Co-partitioning

In addition to the constraint aforementioned, when joining, it’s required that the partition number of both sources be the same.

When a mismatch is found, an additional step will be added to the join evaluation in order to guarantee an equal number of partitions between the two sources. This step will write the data from the source topic with a smaller count of partitions into an intermediate one.

This newly created topic will match the partition count of the source with the highest partition count.

In the topology view, this step will show up as a Repartition Node.

ON expressions and key change

Joining two topics is only possible if the two sources used in the join share the same key shape and decoder.

When an ON statement is specified, the original key facet will have to change so that it matches the expression provided in the ON statement. Lenses will do this calculation automatically. As a result, the key schema of the result will not be the same as either one of the sources. It will be a Lenses calculated object equivalent to the join expression specified in the query.

Nullability

As discussed when addressing join types, some values may have null values when non-inner joins are used.

Due to this fact, fields that may have null values will be typed as the union of null and their original type.

Joining more than 2 sources

Within the same query, joins may only be evaluated between two sources.

When a join between more than two sources is required, multiple queries can be combined using a WITH statement:

WITH customerOrders AS (
 SELECT TABLE
    customer.name
   , order.item,
   , order._key AS order_id
 FROM
    customers INNER JOIN orders
        ON orders.customer_id = customers.id
);

INSERT INTO target
SELECT TABLE *
FROM customerOrders INNER JOIN deliveryAddress
    ON customerOders.order_id = deliveryAddress.order_id;

Joining and Grouping

In order to group the results of a join, one just has to provide a GROUP BY expressions after a join expression is specified.

SET defaults.topic.autocreate=true;
WITH customersTable AS (SELECT TABLE * FROM customers);
WITH ordersStream AS (SELECT STREAM * FROM orders);

WITH joined AS (
    SELECT STREAM
        customersTable.name
        , ordersStream.item
     FROM
        ordersStream JOIN customersTable
            ON customersTable.id = ordersStream.customer_id
    GROUP BY customersTable.name;
);

Stream-Table/Table-Stream joins: table materialization

emmited

processed

name

t = 0s

t = 20s

Frank

Purchases

arrival

processed

item

customer_id

t = 0s

t = 10s

Computer

WITH customersStream AS (SELECT TABLE * FROM customers);
WITH ordersStream AS (SELECT STREAM * FROM orders);

INSERT INTO result
SELECT TABLE
    customersStream.name
   , ordersStream.item
 FROM
    ordersStream OUTER JOIN customersStream
        ON customersTable.id = ordersStream.customer_id;

When a join between a table and a join is processed, lenses will, for each stream input (orders in the example above), look for a matching record on the specified table (customers).

Notice that the record with Frank’s purchase information is processed at t = 10s at which point Frank’s Customer information hasn’t yet been processed. This means that no match will be found for this record.

At t=20s however, the record with Frank’s customer information is processed; this will only trigger the emission of a new record if an Outer Join is used.

Filter optimizations

There are some cases where filter expressions can help optimize a query. A filter can be broken down into multiple steps so that some can be applied before the join node is evaluated. This type of optimization will reduce the number of records going into the join node and consequentially increase its speed.

For this reason, in some cases, filters will show up before the join in the topology node.

Lateral Joins

This page describes how to use lateral joins for data in Kafka with Lenses SQL Processors.

With Lateral Joins you can combine a data source with any array expression. As a result, you will get a new data source, where every record of the original one will be joined with the values of the lateral array expression.

Assume you have a source where elements is an array field:

field1

field2

elements

[1, 2]

[3, 4, 5]

[6]

Then a Lateral Join of source with elements is a new table, where every record of source will be joined with all the single items of the value of elements for that record:

field1

field2

elements

element

[1, 2]

[3, 4, 5]

[6]

In this way, the single elements of the array become available and can be used as a normal field in the query.

Syntax

A query using lateral joins looks like a regular query apart from the definition of its source:

SELECT (STREAM|TABLE)
  <projection>
FROM
  <source> LATERAL
  <lateralArrayExpression> AS <lateralAlias>
WHERE
  <filterExpression>;

projection: as in a single-table select, all the fields from <source> will be available in the projection. In addition to that, the special field <lateralAlias> will be available.
source: the source of data. Note: it is not possible to specify a normal join as a source of a lateral join. This limitation will be removed in the future.
lateralArrayExpression: any expression that evaluates to an array. Fields <source> are available for defining this expression.
filterExpression: a filter expression specifying which records should be filtered.

Single Lateral Joins

Assume you have a topic batched_readings populated with the following records:

batched_readings

_key

meter_id

readings

[100, 80, 95, 91]

[87, 93, 100]

[88, 89, 92, 94]

[81]

As you can see, readings is a field containing arrays of integers.

We define a processor like this:

INSERT INTO readings
SELECT STREAM
    meter_id,
    reading
 FROM
    batched_readings
    LATERAL readings AS reading
WHERE 
    reading > 90

The processor will emil the following records:

_key

meter_id

reading

100

Things to notice:

We used the aliased lateral expression reading both in the projection and in the WHERE.
The _key for each emitted record is the one of the original record. As usual you can change this behavior projecting on the key with a projection like expression AS _key.
batched_readings records with keys a and b have been split into multiple records. That’s because they contain multiple readings greater than 90.
Record d disappeared, because it has no readings greater than 90

Multiple Lateral Joins

It is possible to use multiple LATERAL joins in the same FROM clause.

Assume you have a topic batched_nested_readings populated with the following records:

batched_readings

_key

meter_id

nested_readings

[[100, 80], [95, 91]]

[[87], [93, 100]]

[[88, 89], [92, 94]]

[[81]]

Notice how nested_readings contains arrays of arrays of integers.

To get the same results of the previous example, we use a first lateral join to unpack the first level of nested_readings into an array that we call readings. We then define a second lateral join on readings to extract the single values:

INSERT INTO readings
SELECT STREAM
    meter_id,
    reading
 FROM
    batched_readings
    LATERAL nested_readings AS readings
    LATERAL readings as reading
WHERE 
    reading > 90

Complex Lateral expressions

In the previous example we used a simple field as the <lateralArrayExpression>. In the section we will see how any array expression can be used for it.

Assume you have a topic day_night_readings populated with the following records:

day_night_readings

_key

meter_id

readings_day

readings_night

[100, 80]

[95, 91]

[87, 93]

[100]

[88]

[89, 92, 94]

[81]

[]

We can make use of Array Functions to lateral join day_night_readings on the concatenation of the two readings fields:

INSERT INTO readings
SELECT STREAM
    meter_id,
    reading
 FROM
    batched_readings
    LATERAL flatten([readings_day, readings_night]) AS reading
WHERE 
    reading > 90The processor such defined will emit the records

_key

meter_id

reading

100

Aggregations

This page describes how to aggregate data in Kafka with Lenses SQL Processors.

Aggregations are stateful transformations that allow to grouping of an unbounded set of inputs into sub-sets and then aggregate each of these sub-sets into a single output; the reason why they are stateful is because they need to maintain the current state of computation between the application of each input.

To group a given input dataset into sub-sets, a key function needs to be specified; the result of applying this key function to an input record will be used as a discriminator (sometimes called a pivot) to determine in what sub-set each input record is to be bucketed.

The specific transformation that each aggregation performs is described by the Aggregated Functions used in the input query.

Aggregations match table semantics

Notice that the behaviour described above is precisely what a Table does. For any given key, there is the state will continuously be updated as new events with the given key are received. In the case of Aggregations, new events are represented by input records in the original dataset that will map to a given key, therefore ending up in a bucket or another.

Whenever Aggregations are used, the result will be a Table. Each entry will have the key set to the grouping discriminator, and the value set to the current state of computation for all input records matching the key.

Syntax

The complete syntax for aggregations is:

SELECT (STREAM | TABLE)
  <aggregated projection1>
    [, aggregated projection2] ... [, aggregated projectionN]
    [, projection1] ... [, projectionM]
FROM
  <source>
[WINDOW BY <window description>]
GROUP BY <expression>
;

The specific syntactical elements of the above are:

(STREAM | TABLE): specifies if the <source> is to be interpreted as a stream or a table.
<aggregated projection1>: a projection is aggregated when its source contains an Aggregated Function (e.g. COUNT(*) as x, CAST(COUNT(*) as STRING) as stringed).
[, aggregated projection2] ... [, aggregated projectionN]: a query can contain any number of additional aggregated projections after the first mandatory one.
[, projection1] ... [, projectionM]: a query can contain any number of common, non-aggregated, projections. Streaming only supports full GROUP BY mode. This means that fields that are not part of the GROUP BY clause cannot be referenced by non-aggregated projections.
<source>: a normal source, like a topic or the result of a WITH statement.
[WINDOW BY <window description>]: this optional section can only be specified if STREAM is used. It allows us to describe the windowing that will be applied to the aggregation.
GROUP BY <expression>: the result of evaluating <expression> will be used to divide the input values into different groups. These groups will be the input for each aggregated projection specified. The <expression>’s result will become the key for the table resulting from the query.

Specific rules for aggregated projections

Most of the rules and syntax described for Projections apply to aggregated projections as well, but there are some additional syntactical rules due to the specific nature of Aggregations.

Aliasing rules mostly work the same, but it is not possible to project on the key facet; COUNT(*) as _key.a or SUM(x) as _key are therefore not allowed.
At least one aggregated projection must be specified in the query.
Projections using an unqualified key facet as the source are not allowed. _key.a or COUNT(_key.b) are forbidden because _key is unqualified, but <source>._key.a and COUNT(<source>._key.b) are supported.

Grouping and storage format for key

INSERT INTO target-topic
SELECT STREAM  
    COUNT(*) AS records
FROM input-topic
GROUP BY field1;

As previously mentioned, the GROUP BY is used to determine the key of the query’s result; the above query will group all records in input-topic by the value of field1 in each record, and target-topic’s key will be the schema of field1.

Just like in the case of the Projections, the Streaming mode takes an opinionated approach here and will simplify the result schema and Storage Format in the case of single field structures.

In the case above, assuming for example that field1 is an integer, target-topic’s key will not be a structure with a single integer field1 field, but rather just the value field1; the resulting storage format is going to be INT, and the label field1 will be just dropped.

In case the above behaviour is not desirable, specifying an explicit alias will allow us to override it.

INSERT INTO target-topic
SELECT STREAM  
    COUNT(*) AS records,
FROM input-topic
GROUP BY field1 AS keep_me;

This will result in target-topic’s key is a structure with a field keep_me, with the same schema as field1. The corresponding Storage Format will match the input format for input-topic, AVRO or JSON.

Semantics

An example will help clarify how aggregations work, as well as how they behave depending on the semantics of the input dataset they are being applied to.

Example scenario

Assume that we have a Kafka topic (gaming-sessions) containing these records:

Offset

Key

Value

billy

{points: 50, country: “uk”}

billy

{points: 90, country: “uk”}

willy

{points: 70, country: “uk”}

noel

{points: 82, country: “uk”}

john

{points: 50, country: “usa”}

dave

{points: 30, country: “usa”}

billy

{points: 90, country: “spain”}

What this data describes is a series of gaming sessions, performed by a player. For each gaming session, the player (used as Key), the points achieved, and the country where the game took place.

Aggregating a Stream

Let’s now assume that what we want to calculate is the total points achieved by players in a given country, as well as the average points per game. One way to achieve the desired behaviour is to build a Stream from the input topic. Remember that this means that each event will be considered in isolation.

INSERT INTO target-topic
SELECT STREAM
    SUM(points) AS total_points
    , AVG(points) AS average_points
FROM gaming-sessions
GROUP BY country

Explanations for each element of this syntax can be found below, but very briefly, this builds a Stream from gaming-sessions, grouping all events by country (e.g. all records with the same country will be aggregated together) and finally calculating the total (total_points) and the average (average_points) of all points for a given group.

The final result in target-topic will be (disregarding intermediate events):

Offset

Key

Value

{total_points: 292, average_points: 73}

usa

{total_points: 80, average_points: 40}

spain

{total_points: 90, average_points: 90}

The results are calculated from the totality of the input results because in a Stream, each event is independent and unrelated to any other.

Aggregating a Table

We now want to calculate something similar to what we obtained before, but we want to keep track only of the last session played by a player, as it might give us a better snapshot of both the performances and locations of players worldwide. The statistics we want to gather are the same as before: total and average of points per country.

The way to achieve the above requirement is simply by reading gaming-sessions into a Table, rather than a Stream, and aggregating it.

INSERT INTO target-topic
SELECT TABLE
    SUM(points) AS total_points
    , AVG(points) AS average_points
FROM gaming-sessions
GROUP BY country

The final result in target-topic will be (disregarding intermediate events):

Key

Value

{total_points: 152, average_points: 76}

usa

{total_points: 80, average_points: 40}

spain

{total_points: 90, average_points: 90}

Compare this with the behaviour from the previous scenario; the key difference is that the value for uk includes only willy and noel, and that’s because the last event moved billy to the spain bucket, removing all data regarding him from his original group.

Aggregation functions for Tables

The previous section described the behaviour of aggregations when applied to Tables, and highlighted how aggregations not only need to be able to sum the latest values received to the current state of a group but also need to be able to subtract an obsolete value that might have just been assigned to a new group. As we saw above, it is easy to do this in the case of SUM and AVG.

However, consider what would happen if we wanted to add new statistics to the ones calculated above: the maximum points achieved by a player in a given country.

In the Stream scenario, this can be achieved by simply adding MAXK(points,1) as max_points to the query.

INSERT INTO target-topic
SELECT STREAM
    SUM(points) AS total_points
    , AVG(points) AS average_points
    , MAXK(points,1) AS max_points
FROM gaming-sessions
GROUP BY country

Offset

Key

Value

{total_points: 292, average_points: 73, max_points: 90}

usa

{total_points: 80, average_points: 40, max_points: 50}

spain

{total_points: 90, average_points: 90, max_points: 90}

In the Table scenario, however, things are different. We know that the final event moves billy from uk to spain, so we need to subtract from uk all information related to billy. In case of SUM and AVG that’s possible because subtracting billy’s points to the current value of the aggregation will return the correct result.

But that’s not possible for MAXK. MAXK(points, 1) only keeps track of 1 value, the highest seen so far, and if that’s removed, what value should take its place? The aggregation function cannot inspect the entire topic data to search for the correct answer. The state the aggregation function has access to is that single number, which now is invalid.

This problem explains why some aggregated functions can be used on Streams and Tables (e.g. SUM), while others can be used only on Streams (e.g. MAXK).

The key factor is usually whether a hypothetical subtraction operation would need access to all previous inputs to calculate its new value (like MAXK) or just the aggregated state (like SUM).

Windowed aggregations

A common scenario that arises in the context of aggregations is the idea of adding a time dimension to the grouping logic expressed in the query. For example, one might want to group all input records by a given field that were received within 1 hour of each other.

To express the above Lenses SQL Streaming supports windowed aggregations, by adding a WINDOW BY clause to the query. Given their semantics, tables cannot be aggregated using a window, because it would not make sense. A table represents the latest_ state of a set of (Key, Value) pairs, not a series of events interspersed over a time continuum. Thus trying to window them is not a sensible operation.

Filtering aggregated queries

Filtering the input into aggregated queries is similar to filtering non-aggregated ones. When using a WHERE <expression> statement, where <expression> is a valid SQL boolean expression, all records that do not match the predicate will be left out.

However, aggregated functions add a further dimension to what it might be desirable to filter.

We might be interested in filtering based on some conditions of the groups themselves; for example, we might want to count all input records that have a given value offield1, but only if the total is greater than 3. In this case, WHERE would not help, because it has no access to the groups nor to the results of the aggregated projections. The below query is what is needed.

INSERT INTO target-topic
SELECT STREAM
    COUNT(*) as sessions
FROM gaming-sessions
GROUP BY country
HAVING sessions > 3;

The above query uses the HAVING clause to express a filter at a grouping level. Using this feature it is possible to express a predicate on the result of aggregated projections and filter out the output records that do not satisfy it.

Only aggregated projections specified in the SELECT clause can be used within the HAVING clause.

Time & Windows

This page describes the time and windowing of data in Kafka with Lenses SQL Processors.

A data stream is a sequence of events ordered by time. Each entry contains a timestamp component, which aligns it on the time axis.

Kafka provides the source for the data streams, and the Kafka message comes with the timestamp built in. This is used by the Push Engines by default. One thing to consider is that from a time perspective, the stream records can be out of order. Two Kafka records, R1 and R2 do not necessarily respect the rule: R1 timestamp is smaller than R2 timestamp.

Timestamps are required to perform time-dependent operations for streams - like aggregations and joins.

Timestamp semantics

A record timestamp value can have three distinct meanings. Kafka allows to configure a topic timestamp meaning via this log.message.timestamp.type setting. The two supported values are CreateTime and LogAppendTime.

Event time

When a record is created at source the producer is responsible for setting the timestamp for it. Kafka producer provides this automatically and this is aligned with the CreateTime the configuration mentioned earlier.

Ingestion time

At times, the data source timestamp is not available. When setting the topic timestamp type to LogAppendTime, the Kafka broker will attach the timestamp at the moment it writes it to the topic.

Processing time

The timestamp will be set to the time the record was read by the engine, ignoring any previously set timestamps.

Control the timestamp

Sometimes, when the data source is not under direct control, it might be that the record’s timestamp is actually embedded in the payload, either in the key or the value.

Lenses SQL Streaming allows to specify where to extract the timestamp from the record by using EVENTTIME BY.

...
SELECT STREAM ...
FROM input-topic
EVENTTIME BY <selection>
...

where <selection> is a valid selection.

Here are a few examples on how to use the syntax to use the timestamp from the record value facet:

SELECT STREAM *
FROM <source>
EVENTTIME BY startedAt;
...

// this is identical with the above; _value qualifies to the record Value component
SELECT STREAM *
FROM <source>
EVENTTIME BY _value.startedAt;
...

// `details` here is a structure and `startedAt` a nested field
SELECT STREAM *
FROM <source>
EVENTTIME BY details.startedAt;

...

For those scenarios when the timestamp value lives within the record key, the syntax is similar:

SELECT STREAM *
FROM <source>
EVENTTIME BY _key.startedAt;
...

// `details` here is a structure and `startedAt` a nested field
SELECT STREAM *
FROM <source>
EVENTTIME BY _key.details.startedAt;
...

Output timestamp

All records produced by the Lenses SQL Streaming will have a timestamp set and its value will be one of the following:

For direct transformations, where the output record is a straightforward transformation of the input, the input record timestamp will be used.
For aggregations, the timestamp of the latest input record being aggregated will be used.
In all other scenarios, the timestamp at which the output record is generated will be used.

Time windows

Some stream processing operations, like joins or aggregations, require distinct time boundaries which are called windows. For each time window, there is a start and an end, and as a result a duration. Performing aggregations over a time window, means only the records which fall within the time window boundaries are aggregated together. It might happen for the records to be out-of-order and arrive after the window end has passed, but they will be associated with the correct window.

Types

There are three-time windows to be used at the moment: hopping, tumbling and session.

Duration types

When defining a time window size, the following types are available:

Duration

Description

Example

time in milliseconds.

100ms

time in seconds.

10s

time in minutes.

10m

time in hours.

10h

Hopping window

These are fixed-size and overlapping windows. They are characterised by duration and the hop interval. The hop interval specifies how far a window moves forward in time relative to the previous window.

Since the windows can overlap, a record can be associated with more than one window.

Use this syntax to define a hopping window:

WINDOW BY HOP <duration_time>,<hop_interval>

INSERT INTO <target>
SELECT STREAM
    country
  , COUNT(*) AS occurrences
  , MAXK_UNIQUE(points,3) AS maxpoints
  , AVG(points) AS avgpoints
FROM <source>
EVENTTIME BY startedAt
WINDOW BY HOP 5m,1m
GROUP BY country
;

Tumbling window

They are a particularisation of hopping windows, where the duration and hop interval are equal. This means that two windows can never overlap, therefore a record can only be associated with one window.

WINDOW BY TUMBLE <duration_time>

Duration time takes the same unit types as described earlier for hopping windows.

INSERT INTO <target>
SELECT STREAM
    country
  , COUNT(*) AS occurrences
  , MAXK_UNIQUE(points,3) AS maxpoints
  , AVG(points) AS avgpoints
FROM <source>
EVENTTIME BY startedAt
WINDOW BY TUMBLE 5m
GROUP BY country
;

Session window

Unlike the other two window types, this window size is dynamic and driven by the data. Similar to tumbling windows, these are non-overlapping windows.

A session window is defined as a period of activity separated by a specified gap of inactivity. Any records with timestamps that occur within the boundaries of the inactivity interval are considered part of the existing sessions. When a record arrives and its timestamp is outside of the session gap, a new session window is created and the record will belong to that.

A new session window starts if the last record that arrived is further back in time than the specified inactivity gap. Additionally, different session windows might be merged into a single one if an event is received that falls in between two existing windows, and the resulting windows would then overlap.

To define a session window the following syntax should be used:

WINDOW BY SESSION <inactivity_interval>

The inactivity interval can take the time unit type seen earlier for the hopping window.

INSERT INTO <target>
SELECT STREAM
    country
  , COUNT(*) AS occurrences
FROM $source
WINDOW BY SESSION 1m
GROUP BY country

Session windows are tracked on a per-key basis. This means windows for different keys will likely have different durations. Even for the same key, the window duration can vary.

User behaviour analysis is an example of when to use session windows. They allow metrics like counting user visits, customer conversion funnel or event flows.

Late arrival

It is quite common to see records belonging to one window arriving late, that is after the window end time has passed. To accept these records the notion of a grace period is supported. This means that if a record timestamp falls within a window W and it arrives within W + G (where G is the grace interval) then the record will be processed and the aggregations or joins will update. If, however, the record comes after the grace period then it is discarded.

To control the grace interval use this syntax:

...
WINDOW BY HOP 1m,5m GRACE BY 2h

...
WINDOW BY TUMBLE 5m GRACE BY 2h

...
WINDOW BY SESSION 1m, GRACE BY 2h

The default grace period is 24 hours. Until the grace period elapses, the window is not actually closed.

Storage format

This page describes the storage formats of data in Kaka supported by Lenses SQL Processors.

The output storage format depends on the sources. For example, if the incoming data is stored as JSON, then the output will be JSON as well. The same applies when Avro is involved.

When using a custom storage format, the output will be JSON.

At times, it is required to control the resulting Key and/or Value storage. If the input is JSON, for example, the output for the streaming computation can be set to Avro.

Another scenario involves Avro source(-s), and a result which projects the Key as a primitive type. Rather than using the Avro storage format to store the primitive, it might be required to use the actual primitive format.

Syntax

Controlling the storage format can be done using the following syntax:

INSERT INTO <target>
STORE
  KEY AS <format>
  VALUE AS <format>
  ...

There is no requirement to always set both the Key and the Value. Maybe only the Key or maybe only the Value needs to be changed. For example:

 INSERT INTO <target>
 STORE KEY AS <format>
 ...

 //or

 INSERT INTO <target>
 STORE VALUE AS <format>
 ...

Considering a scenario where the input data is stored as Avro, and there is an aggregation on a field which yields an INT, using the primitive INT storage and not the Avro INT storage set the Key format to INT:

INSERT INTO <target>
STORE KEY AS INT
SELECT TABLE
    SUM(amount) AS total
FROM <source>
GROUP BY CAST(merchantId AS int)

Here is an example of the scenario of having Json input source(-s), but an Avro stored output:

INSERT INTO <target>
STORE
  KEY AS AVRO
  VALUE AS AVRO
SELECT STREAM
    _key.cId AS _key.cId
    , CONCAT(_key.name, "!") AS _key.name
    , pId
    , CONCAT("!", name) AS name
    , surname
    , age  
FROM <source>

Validation

Changing the storage format is guarded by a set of rules. The following table describes how storage formats can be converted for the output.

From \ To

INT

LONG

STRING

JSON

AVRO

XML

Custom/Protobuf

INT

yes

LONG

yes

STRING

yes

JSON

If the Json storage contains integer only

If the Json storage contains integer or long only

yes

AVRO

If Avro storage contains integer only

If the Avro storage contains integer or long only

yes

XML

yes

Custom (includes Protobuf)

yes

Time/Session window validations

Time windowed formats follow similar rules to the ones described above with the additional constraint that Session Windows(SW) cannot be converted into Time Windows (TW) nor vice-versa.

From \ To

SW[B]

TW[B]

SW[A]

yes if format A is compatible with format B

TW[A]

yes if format A is compatible with format B

Example: Changing the storage format from TWAvro to TWJson is possible since they’re both TW formats and Avro can be converted to JSON.

Example: Changing the storage format from TWString to TWJson is not possible since, even though they’re both TW formats, String formats can’t be written as JSON.

XML as well as any custom formats are only supported as an input format. Lenses will, by default translate and process these formats by translating them to JSON and writing them as such (AVRO is also supported if a store is explicitly set).

Nullibility

This page describes handling nulls values in Kafka data with Lenses SQL Processors.

Null values are used as a way to express a value that isn’t yet known.

Null values can be found in the data present in existing sources, or they can be the product of joining data using non-inner joins.

The schema of nullable types is represented as a union of the field type and a null value:

{
  "type": "record",
  "name": "record",
  "namespace": "example",
  "doc": "A schema representing a nullable type.",
  "fields": [
    {
      "name": "property",
      "type": [
        "null",
        "double"
      ],
      "doc": "A property that can be null or a double."
    }
  ]
}

Using nullable types

Working with null values can create situations where it’s not clear what the outcome of an operation is.

One example of this would be the following:

null * 1 = ?
null + 1 = ?

Looking at the first two expressions, one may be tempted to solve the problem above by saying “Null is 1 when multiplying and 0 when summing” meaning the following would be the evaluation result:

null * 1 = 1
null + 1 = 1

Rewriting the third expression applying the distributive property of multiplication however, shows that the rule creates inconsistencies:

(null + 1) * null = (null + 1) * null <=>
null * null + 1 * null = (null + 1) * null <=>
1 * 1 + 1*1 = (0 + 1) * 1 <=>
1 + 1 = 1 * 2 <=>
1 = 2  //not valid

With the intent of avoiding scenarios like the above where a computation may have different results based on the evaluation approach taken, most operations in lenses do not allow operations to use nullable types.

Address nullability

Lenses provides the following tools to address nullability in a flexible way:

COALESCE

Coalesce: A function that allows specifying a list of fields to be tried until the first non-null value is found.
- Note: the coalesce function won’t verify if a non-nullable field is provided so an error may still be thrown if all the provided fields are null
- e.g:COALESCE(nullable_fieldA, nullable_fieldB, 0)

AS_NON_NULLABLE

AS_NON_NULLABLE: a function that changes the type of a property from nullable to non-nullable.
- Note: This function is unsafe and will throw an error if a null value is passed. It should only be used if there’s a guarantee that the value won’t ever be null (for instance if used in a CASE branch where the null case has been previously handled or if the data has previously been filtered and the null values removed).
- e.g: AS_NON_NULLABLE(nullable_field)

AS_NON_NULLABLE and CASE

AS_NON_NULLABLE with CASE: A type-checked construct equivalent to using coalesce:
- e.g:

CASE
    WHEN a_nullable_field IS NULL THEN 0
    ELSE AS_NON_NULLABLE(a_nullable_field)
END

AS_NULLABLE

The AS_NULLABLE function is the inverse transformation of the AS_NON_NULLABLE version. This function allows a non-nullable field type to be transformed into a nullable type. It can be used to insert data into existing topics where the schema of the target field is nullable.

Settings

This page describes using settings in Lenses SQL Processors to process data in Kafka.

The SET syntax allows customizing the behaviour for the underlying Kafka Consumer/Producer, Kafka Streams (including RocksDB parameters), topic creation and error handling.

The general syntax is:

SET <setting_name>=<setting_value>;

Kafka topics

SQL processors can create topics that are not present. There are two levels of settings, generic (or default) applying to all target topics and specific (or topic-related) to allow distinct setups for a given topic. Maybe one of the output topics requires a different partition count or replication factor than the defaults.

To set the defaults follow this syntax:

SET defaults.topic.<topic_setting_key> = <value>;

Key

Type

Description

autocreate

BOOLEAN

Creates the topic if it does not exist already.

partitions

INT

Controls the target topic partitions count. If the topic already exists, this will not be applied.

replication

INT

Controls the topic replication factor. If the topic already exists, this will not be applied.

Each Kafka topics allows a set of parameters to be set. For example cleanup.policy can be set like this SET defaults.topic.cleanup.policy='compact,delete';

key.avro.record

STRING

Controls the output record Key schema name.

key.avro.namespace

STRING

Controls the output record Key schema namespace.

value.avro.record

STRING

Controls the output record Key schema name.

value.avro.namespace

STRING

Controls the output record Key schema namespace.

All the keys applicable for defaults are valid for controlling the settings for a given topic. Controlling the settings for a specific topic can be done via:

SET topic.<topic_name>.<topic_setting_key>=<value>;

SET topic.market_risk.cleanup.policy='compact,delete';

--escaping the topic name if it contains . or - or other non-alpha numeric
SET topic.`market.risk`.cleanup.policy='compact,delete';
SET topic.`market-risk`.cleanup.policy='compact,delete';

Error handling

The streaming engine allows users to define how errors are handled when writing to or reading from a topic.

Both sides can be set at once by doing:

SET error.policy= '<error_policy>';

or individually as described in the sections below.

Reading Errors

Data being processed might be corrupted or not aligned with the topic format (maybe you expect an Avro payload but the raw bytes represent a JSON document). Setting what happens in these scenarios can be done like this:

SET error.policy.read= '<error_policy>';

Writing Errors

While data is being written multiple errors can occur (maybe there were some network issues). Setting what happens in these scenarios can be done like this:

SET error.policy.write= '<error_policy>';

There are three possible values to control the behaviour.

Value

Description

continue

Allows the application to carry on. The problem will be logged.

fail

Stops the application. The application will be in a failed (error) state.

dlq

Allows the application to continue but it will send the payload to a dead-letter-topic. It requires dead.letter.queue to be set. The default value for dead.letter.queue is lenses.sql.dlq.

When dlq is used this setting is required. The value is the target topic where the problematic records will be sent to.

SET dead.letter.queue = '<dead_letter_topic>';

Kafka Streams Consumer and Producer settings

Using the SET syntax, the underlying Kafka Streams and Kafka Producer and Consumer settings can be adjusted.

SET <setting_key>=<value>;

Key

Type

Description

processing.guarantee

STRING

The processing guarantee that should be used. Possible values are AT_LEAST_ONCE (default) and EXACTLY_ONCE. Exactly-once processing requires a cluster of at least three brokers by default what is the recommended setting for production.

commit.interval.ms

LONG

The frequency with which to save the position of the processor. If processing.guarantee is set to EXACTLY_ONCE, the default value is 100, otherwise the default value is 30000. This setting directly impacts the behavior of Tables, as it controls how often they will emit events downstream. An event will be emitted only every commit.interval.ms, so every intermediate event that is received by the table will not be visible downstream directly.

poll.ms

LONG

The amount of time in milliseconds to block waiting for input.

cache.max.bytes.buffering

LONG

Maximum number of memory bytes to be used for buffering across all threads. It has to be at least 0. Default value is: 10 * 1024 * 1024.

client.id

STRING

An ID prefix string used for the client IDs of internal consumer, producer and restore-consumer, with pattern ‘<client.d>-StreamThread--<consumer

num.standby.replicas

INT

The number of standby replicas for each task. Default value is 0.

num.stream.threads

INT

The number of threads to execute stream processing. Default values is 1.

max.task.idle.ms

LONG

Maximum amount of time a stream task will stay idle when not all of its partition buffers contain records, to avoid potential out-of-order record processing across multiple input streams.

buffered.records.per.partition

INT

Maximum number of records to buffer per partition. Default is 1000.

buffered.records.per.partition

INT

Maximum number of records to buffer per partition. Default is 1000.

connections.max.idle.ms

LONG

Close idle connections after the number of milliseconds specified by this config.

receive.buffer.bytes

LONG

The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. If the value is -1, the OS default will be used.

reconnect.backoff.ms

LONG

The base amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all connection attempts by the client to a broker.

reconnect.backoff.max.ms

LONG

The maximum amount of time in milliseconds to wait when reconnecting to a broker that has repeatedly failed to connect. If provided, the backoff per host will increase exponentially for each consecutive connection failure, up to this maximum. After calculating the backoff increase, 20% random jitter is added to avoid connection storms. Default is 1000.

retries

INT

Setting a value greater than zero will cause the client to resend any request that fails with a potentially transient error. Default is 0

retry.backoff.ms

LONG

The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios. Default is 100.

send.buffer.bytes

LONG

The size of the TCP send buffer (SO_SNDBUF) to use when sending data. If the value is -1, the OS default will be used. Default is 128 * 1024.

state.cleanup.delay.ms

LONG

The amount of time in milliseconds to wait before deleting state when a partition has migrated.

Alongside the keys above, the Kafka consumer and producer settings can be also tweaked.

SET session.timeout.ms=120000;
SET max.poll.record = 20000;

Some of the configurations for the consumer and producer have the same name. At times, maybe there is a requirement to distinguish them. To do that the keys have to be prefixed with: consumer or producer.

SET consumer.<duplicate_config_key>=<value_1>;
SET producer.<duplicate_config_key>=<value_2>;

RocksDB

Stateful data flow applications might require, on rare occasions, some of the parameters for the underlying RocksDB to be tweaked.

To set the properties, use:

SET rocksdb.<key> = <value>;

Key

Type

Description

rocksdb.table.block.cache.size

LONG

Set the amount of cache in bytes that will be used by RocksDB. If cacheSize is non-positive, then cache will not be used. DEFAULT: 8M

rocksdb.table.block.size

LONG

Approximate size of user data packed per lock. Default: 4K

rocksdb.table.block.cache.compressed.num.shard.bits

INT

Controls the number of shards for the block compressed cache

rocksdb.table.block.cache.num.shard.bits

INT

Controls the number of shards for the block cache

rocksdb.table.block.cache.compressed.size

LONG

Size of compressed block cache. If 0,then block_cache_compressed is set to null

rocksdb.table.block.restart.interval

INT

Set block restart interval

rocksdb.table.block.cache.size.and.filter

BOOL

Indicating if we’d put index/filter blocks to the block cache. If not specified, each ’table reader’ object will pre-load index/filter block during table initialization

rocksdb.table.block.checksum.type

STRING

Sets the checksum type to be used with this table. Available values: kNoChecksum, kCRC32c, kxxHash.

rocksdb.table.block.hash.allow.collision

BOOL

Influence the behavior when kHashSearch is used. If false, stores a precise prefix to block range mapping if true, does not store prefix and allows prefix hash collision(less memory consumption)

rocksdb.table.block.index.type

STRING

Sets the index type to used with this table. Available values: kBinarySearch, kHashSearch

rocksdb.table.block.no.cache

BOOL

Disable block cache. If this is set to true, then no block cache should be used. Default: false

rocksdb.table.block.whole.key.filtering

BOOL

If true, place whole keys in the filter (not just prefixes).This must generally be true for gets to be efficient. Default: true

rocksdb.table.block.pinl0.filter

BOOL

Indicating if we’d like to pin L0 index/filter blocks to the block cache. If not specified, defaults to false.

rocksdb.total.threads

INT

The max threads RocksDB should use

rocksdb.write.buffer.size

LONG

Sets the number of bytes the database will build up in memory (backed by an unsorted log on disk) before converting to a sorted on-disk file

rocksdb.table.block.size.deviation

INT

This is used to close a block before it reaches the configured ‘block_size’. If the percentage of free space in the current block is less than this specified number and adding a new record to the block will exceed the configured block size, then this block will be closed and thenew record will be written to the next block. Default is 10.

rocksdb.compaction.style

STRING

Available values: LEVEL, UNIVERSAL, FIFO

rocksdb.max.write.buffer

INT

rocksdb.base.background.compaction

INT

rocksdb.background.compaction.max

INT

rocksdb.subcompaction.max

INT

rocksdb.background.flushes.max

INT

rocksdb.log.file.max

LONG

rocksdb.log.fle.roll.time

LONG

rocksdb.compaction.auto

BOOL

rocksdb.compaction.level.max

INT

rocksdb.files.opened.max

INT

rocksdb.wal.ttl

LONG

rocksdb.wal.size.limit

LONG

rocksdb.memtable.concurrent.write

BOOL

rocksdb.os.buffer

BOOL

rocksdb.data.sync

BOOL

rocksdb.fsync

BOOL

rocksdb.log.dir

STRING

rocksdb.wal.dir

STRING

SQL Reference

This page describes the SQL Reference for Lenses SQL Processors for data in Kafka.

Expressions

Functions

Custom Functions

Deserializers

Supported data formats

Expressions

This page describes the expressions in Lenses SQL Processors.

Expressions are the parts of a Lenses SQL query that will be evaluated to single values.

Below is the complete list of expressions that Lenses SQL supports.

Literals

A literal is an expression that represents a concrete value of a given type. This means that there is no resolution needed for evaluating a literal and its value is simply what is specified in the query.

Integers

Integer numbers can be introduced in a Lenses SQL query using integer literals:

SELECT 1 + 2 FROM myTopic

In the above query 1, 2 are integer literals.

Decimals

Decimal number literals can be used to express constant floating-point numbers:

SELECT 3.14 as pi FROM myTopic

Strings

To express strings, string literals can be used. Single quotes (') and double quotes (") are both supported as delimiters:

SELECT CONCAT("hello ", 'world!') FROM myTopic

In the example above, "hello " and 'world!' are string literals.

Booleans

Boolean constant values can be expressed using the false and true boolean literals:

SELECT false, true FROM myTopic

Nulls

Sometimes it is necessary to the NULL literal in a query, for example to test that something is or is not null, or to put a NULL the value facet, useful to delete records in a compacted topic:

INSERT INTO cleanedTopic
SELECT NULL as _value FROM myTopic WHERE myField IS NULL

Arrays

An array is a collection of elements of the same type.

Array expressions

A new array can be defined with the familiar [...] syntax:

["a", "b", "c"", "d"]

You can use more complex expressions inside the array:

[1 + 1, 7 * 2, COS(myfield)]

and nested arrays as well:

[["a"], ["b", "c"]]

Note: empty array literals like [] are currently not supported by Lenses SQL. That will change in future versions.

Array selections

An element of an array can be extracted appending, to the array expression, a pair of square brackets containing the index of the element.

Some examples:

SELECT
  myArray[0],
  myNestedArray[1][1],
  [1, 2, 3][myIndex],
  complexExpression[0].inner[1]
FROM myTopic

Note how the expression on the left of the brackets can be of arbitrary complexity, like in complexExpression[0].inner[1] or [1, 2, 3][myIndex].

Structs

A Struct is a value that is composed by fields and sub-values assigned to those fields. It is similar to what an object is in JSON.

In Lenses SQL there are two ways of building new structs.

Nested aliases

In a SELECT projection, it is possible to use nested aliases to denote the fields of a struct.

In the next example, we are building a struct field called user, with two subfields, one that is a string, and another one that is a struct:

SELECT
    myName as user.name,
    "email" as user.contact.type,
    CONCAT(myName, "@lenses.io") as user.contact.value
FROM myTopic

When the projection will be evaluated, a new struct user will be built.

The result will be a struct with a name field, and a nested struct assigned to the contact field, containing type and value subfields.

Struct Expressions

While nested aliases are a quick way to define new structs, they have some limitations: they can only be used in the projection section of a SELECT, and they do not cover all the cases where a struct can potentially be used.

Struct expressions overcome these limitations.

With struct expressions one can explicitly build complex structs, specifying the name and the values of the fields, one by one, and as any other expression, they can be used inside other expressions and in any other part of the query where an expression is allowed.

The syntax is similar to the one used to define JSON objects:

SELECT
    {
        name: myName,
        contact: { type: "email", value: CONCAT(myName, "@lenses.io") }
    } as user,
    {
        name: myName,
        contacts: [
            { type: "email", value: myEmail },
            { type: "address", value: myAddress }
        ]
    } as userWithContacts 
FROM myTopic

Note how the first projection

{
    name: myName,
    contact: { type: "email", value: CONCAT(myName, "y") }
} as user

is equivalent to the three projections used in the previous paragraph:

myName as user.name,
"email" as user.contact.type,
CONCAT(myName, "@lenses.io") as user.contact.value

while the second projection userWithContacts is not representable with nested aliases, because it defines structs inside an array.

Struct Selections

A selection is an explicit reference to a field within a struct. The syntax for a selection is:

<expression>.<field_name>

Selections can be used to directly access a field of a facet, optionally specifying the topic and the facet:

SELECT
    myField,                 -- value facet field, with implicit topic and facet
    myTopic.myField,         -- value facet field, with explicit topic and implicit facet
    _value.myField,          -- value facet field, with implicit topic and explicit facet
    myTopic._value.myField,  -- value facet field, with explicit topic and facet
    _key.myKeyField,         -- key facet field, with implicit topic and explicit facet
    myTopic._key.myKeyField  -- key facet field, with explicit topic and facet
FROM
   myTopic

It is also possible to select a field from more complex expressions. Here we use selections to select fields from array elements, or to directly access a nested field of a struct expression:

SELECT
    anArrayWithObjects[1].field,
    anArrayWithNestedObjects[1].children[2].field,
    { a: { b: 123 } }.a.b
FROM
    myTopic

In general, a field selection can be used on any expression that returns a struct.

Special characters in field names

If there are special characters in the field names, backticks (`) can be used:

SELECT { `a field!`: "hi" }.`a field!` FROM myTopic

Binary Expressions

A binary expression is an expression that is composed of the left-hand side and right-hand side sub-expressions and an operator that describes how the results of the sub-expressions are to be combined into a single result.

Currently, supported operators are:

Logical operators: AND, OR
Arithmetic operators: +, -, *, /, % (mod)
Ordering operators: >, >=, <, <=
Equality operators: =, !=
String operators: LIKE, NOT LIKE
Inclusion operators: IN, NOT IN

A binary expression is the main way to compose expressions into more complex ones.

For example, 1 + field1 and LENGTH(field2) > 5 are binary expressions, using the + and the >= operator respectively.

Case statements

CASE expressions return conditional values, depending on the evaluation of sub-expressions present in each of the CASE’s branches. This expression is Lenses SQL version of what other languages call a switch-statement or if-elseif-else construct.

SELECT
    CASE
      WHEN field3 = "Robert" THEN "It's bobby"
      WHEN field3 = "William" THEN "It's willy"
      ELSE "Unknown"
    END AS who_is_it
FROM myTopic

Functions

A function is a predefined named operation that takes a number of input arguments and is evaluated into a result. Functions usually accept the result of other expressions as input arguments, so functions can be nested.

Functions

This page describes how to use functions in Lenses SQL Processors.

Aggregate

This section describes how to use AGGREGATE functions in Lenses SQL.

AVG

This page describes the AVG function in Lenses SQL.

AVG(numExpr)

Returns the average of the values in a group. It ignores the null value. It can be used with numeric input only.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

CREATE TABLE order-events 
     (_key string, total int, orderId int) 
FORMAT 
     (string, json) 
PROPERTIES 
    (partitions=1, compacted=false); 

INSERT INTO order-events  
    (_key, name, total, orderId) 
VALUES 
   ("user1",23423,23423

sample code:

USE `kafka`;
SELECT AVG(total) 
FROM orders-events
LIMIT 1000;

Output:

{
  "value": {
    "AVG": "54.540900000000000000"
  }
}

BOTTOMK

This page describes the BOTTOMK function in Lenses SQL.

BOTTOMK(numExpr, N)

Returns the last K lowest ranked values. The ranking is based on how many times a value has been seen.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT BOTTOMK(total) 
FROM orders-events

Output:

{
  "value": {
    "BOTTOMK": [
      10.14
    ]
  }
}

COLLECT

This page describe the COLLECT function in Lenses SQL.

COLLECT(expr, maxN)

Returns an array in which each value in the input set is assigned to an element of the array.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT COLLECT(total, 5) 
FROM orders-events

Output:

{
  "value": {
    "COLLECT": [
      43.63,
      16.93,
      80.47,
      56.86,
      29.12
    ]
  }
}

COLLECT_UNIQUE

This page describes the COLLECT_UNIQUE function in Lenses SQL.

COLLECT_UNIQUE(expr, maxN)

Returns an array of unique values in which each value in the input set is assigned to an element of the array.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT COLLECT_UNIQUE(total, 5) 
FROM orders-events

Output:

{
  "value": {
    "COLLECT_UNIQUE": [
      56.8,
      25.75,
      88.97,
      19.47,
      99.26
    ]
  }
}

COUNT

This page describes the COUNT function in Lenses SQL.

COUNT(*) AS total

Returns the number of records returned by a query or the records in a group as a result of a GROUP BY statement.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT COUNT(id) 
FROM orders-events

Output:

{
  "value": {
    "COUNT": 1000
  }
}

FIRST

This page describes the FIRST function in Lenses SQL.

FIRST(expr)

Returns the first item seen in a group.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT FIRST(id) 
FROM orders-events

Output:

{
  "value": {
    "FIRST": "d4e60554-260d-4f14-873a-9c0352ad9387"
  }
}

LAST

This page describes the LAST function in Lenses SQL.

LAST(expr)

Returns the last item seen in a group.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT LAST(id) 
FROM orders-events

Output:

{
  "value": {
    "LAST": "9ce6a2c1-104f-49d8-818a-7fb632bb2f29"
  }
}

MAXK

This page describes the MAXK function in Lenses SQL.

MAXK(numExpr, N)

Returns the N largest values of a numExpr.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT MAXK(total) 
FROM orders-events

Output:

{
  "value": {
    "MAXK": [
      99.92
    ]
  }
}

MAXK_UNIQUE

This page describes the MAXK_UNIQUE function in Lenses SQL.

MAXK_UNIQUE(numExpr, N)

Returns the N smallest unique values of a numExpr.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT MAXK_UNIQUE(total) 
FROM orders-events

Output:

{
  "value": {
    "MAXK_UNIQUE": [
      99.97
    ]
  }
}

MINK

This page describes the MINK function in Lenses SQL.

MINK(numExpr, N)

Returns the N smallest values of an numExpr.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT MINK(total) 
FROM orders-events

Output:

{
  "value": {
    "MINK": [
      10.14
    ]
  }
}

MINK_UNIQUE

This page describes the MINK_UNIQUE function in Lenses SQL.

MINK_UNIQUE(numExpr, N)

Returns the N smallest unique values of a numExpr.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT MINK_UNIQUE(total) 
FROM orders-events

Output:

{
  "value": {
    "MINK_UNIQUE": [
      10.14
    ]
  }
}

SUM

This page describes the SUM function in Lenses SQL.

SUM(numExpr)

Returns the sum of all the values, in the expression. It can be used with numeric input only. Null values are ignored.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT SUM(total) 
FROM orders-events

Output:

{
  "value": {
    "SUM": "163053.270000000000000000"
  }
}

TOPK

This page describes the TOPK function in Lenses SQL.

TOPK(numExpr, N)

Returns the K highest ranked values. The ranking is based on how many times a value has been seen.

Available in:

Processor (stateless)

Processors (stateful)

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT TOPK(id) 
FROM orders-events

Output:

{
  "value": {
    "TOPK": [
      "ffd7c548-06bb-40e9-8aae-558be65c1e70"
    ]
  }
}

Array

This page describes how to use ARRAY functions in Lenses SQL Processors.

ELEMENT_OF

This page describes the ELEMENT_OF function in Lenses SQL.

ELEMENT_OF(array, index)

Return the element of array at index.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT ELEMENT_OF(products, 2) 
FROM orders-events
LIMIT 1;

Output:

{
  "value": {
    "ELEMENT_OF": {
      "product_id": "d3f085e0-c049-4e8b-9dd9-ea3ae124720b",
      "quantity": 1
    }
  }
}

FLATTEN

This page describes the FLATTEN function in Lenses SQL.

FLATTEN(array)

Flatten an array of arrays into an array.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    FLATTEN(_value), 
    _value 
FROM products-array-test-2
LIMIT 1;

Output:

{
  "value": {
    "FLATTEN": [
      "128279ca-5f8d-44ef-b340-b8054a6611ec",
      6,
      "5ae063ba-0c94-4b9c-ad40-2da08ad8bfd2",
      8,
      "41bb7268-fa7c-426f-b549-4bb215579280",
      9,
      "35df5368-8b77-4081-9aab-96bc8de16c23",
      10,
      "da6d69ce-bbd9-4e56-861d-5018f2d9b23a",
      10,
      "c9725591-e6b6-4928-a6e8-7e37707965ec",
      2,
      "ac821f76-2ff0-4d0c-9540-b775f0602332",
      3,
      "f1f98a2c-e66e-44f3-bba1-58169a425862",
      1,
      "78ea76f1-7381-4b38-a25f-ee487bae9749",
      10,
      "5b84f92d-d8b8-4532-b452-639ab44ae017",
      5
    ]
  }
}
{
  "value": [
    [
      "128279ca-5f8d-44ef-b340-b8054a6611ec",
      6
    ],
    [
      "5ae063ba-0c94-4b9c-ad40-2da08ad8bfd2",
      8
    ],
    [
      "41bb7268-fa7c-426f-b549-4bb215579280",
      9
    ],
    [
      "35df5368-8b77-4081-9aab-96bc8de16c23",
      10
    ],
    [
      "da6d69ce-bbd9-4e56-861d-5018f2d9b23a",
      10
    ],
    [
      "c9725591-e6b6-4928-a6e8-7e37707965ec",
      2
    ],
    [
      "ac821f76-2ff0-4d0c-9540-b775f0602332",
      3
    ],
    [
      "f1f98a2c-e66e-44f3-bba1-58169a425862",
      1
    ],
    [
      "78ea76f1-7381-4b38-a25f-ee487bae9749",
      10
    ],
    [
      "5b84f92d-d8b8-4532-b452-639ab44ae017",
      5
    ]
  ]
}

IN_ARRAY

This page describes the IN_ARRAY function in Lenses SQL.

IN_ARRAY(element, array)

Check if element is an element of array.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    IN_ARRAY('5ec69520-258f-4669-915f-9d842f96fa14', products), 
    products 
FROM orders-test-4
LIMIT 10;

Output:

{
  "value": {
    "IN_ARRAY": true,
    "products": [
      "5ec69520-258f-4669-915f-9d842f96fa14",
      "1946c4ac-ea7a-48c5-870c-1800c3982a14"
    ]
  }
}
{
  "value": {
    "IN_ARRAY": false,
    "products": [
      "49c537d5-078f-4bf8-a74d-49aa10e39c31",
      "c1eb9291-f292-4079-8957-88cce18c2739"
    ]
  }
}

REPEAT

This page describes the REPEAT function in Lenses SQL.

REPEAT(element, n)

Build an array repeating element n times.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    REPEAT(products, 5), 
    products 
FROM orders-topic
LIMIT 1;

Output:

{
  "value": {
    "REPEAT": [
      [
        {
          "product_id": "4e774082-0c03-4751-b4dc-c3799da54f48",
          "quantity": 3
        }
      ],
      [
        {
          "product_id": "4e774082-0c03-4751-b4dc-c3799da54f48",
          "quantity": 3
        }
      ],
      [
        {
          "product_id": "4e774082-0c03-4751-b4dc-c3799da54f48",
          "quantity": 3
        }
      ],
      [
        {
          "product_id": "4e774082-0c03-4751-b4dc-c3799da54f48",
          "quantity": 3
        }
      ],
      [
        {
          "product_id": "4e774082-0c03-4751-b4dc-c3799da54f48",
          "quantity": 3
        }
      ]
    ],
    "products": [
      {
        "product_id": "4e774082-0c03-4751-b4dc-c3799da54f48",
        "quantity": 3
      }
    ]
  }
}

SIZEOF

This page describes the SIZEOF function in Lenses SQL.

SIZEOF(expr)

Returns the number of elements in an array.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
  SIZEOF(products) AS qty_products_array, 
  products 
FROM orders-events
LIMIT 5;

Output:

{
  "value": {
    "qty_products_array": 1,
    "products": [
      {
        "product_id": "0e719f35-e7fb-4717-9151-b4f3cd2017a2",
        "quantity": 5
      }
    ]
  }
}
{
  "value": {
    "qty_products_array": 5,
    "products": [
      {
        "product_id": "0847e631-150b-45e3-a2cd-c31b11afef3c",
        "quantity": 8
      },
      {
        "product_id": "707b30ad-a506-4583-acc8-95856d3bb88e",
        "quantity": 1
      },
      {
        "product_id": "efdfe7e2-fc01-4018-8d7f-ac0a130d0292",
        "quantity": 6
      },
      {
        "product_id": "2f9c5899-3df4-4090-ba8c-3b97f34e3afe",
        "quantity": 8
      },
      {
        "product_id": "e1c71c52-fa6f-47d7-a488-1eadc8a04e5d",
        "quantity": 1
      }
    ]
  }
}
{
  "value": {
    "qty_products_array": 2,
    "products": [
      {
        "product_id": "01fb793f-baad-44b2-8a52-55cdfbe062f1",
        "quantity": 8
      },
      {
        "product_id": "29a2914c-7f81-4c06-b3a4-a42124d5fdf3",
        "quantity": 10
      }
    ]
  }
}
{
  "value": {
    "qty_products_array": 5,
    "products": [
      {
        "product_id": "77ea0afb-3184-4a1a-bc08-d1cfdab445ad",
        "quantity": 1
      },
      {
        "product_id": "798e2412-ebc0-4549-a935-632f1dccc48c",
        "quantity": 7
      },
      {
        "product_id": "768e6a17-b5d9-4488-b4b4-62e54ea9e818",
        "quantity": 8
      },
      {
        "product_id": "3876af98-2b6c-4690-a3c2-ab7a9c4e2565",
        "quantity": 3
      },
      {
        "product_id": "23c12a75-7a16-46ef-a253-21aae8b705ba",
        "quantity": 2
      }
    ]
  }
}
{
  "value": {
    "qty_products_array": 4,
    "products": [
      {
        "product_id": "acd8758a-36b0-429c-9246-3e5cf9dc0545",
        "quantity": 6
      },
      {
        "product_id": "7262d1ce-f467-47f5-b835-061eb82593ca",
        "quantity": 1
      },
      {
        "product_id": "0751080a-d707-4c76-90fd-17b8724f37a2",
        "quantity": 3
      },
      {
        "product_id": "96630315-46d2-489a-9e2c-bb5ece7fab38",
        "quantity": 4
      }
    ]
  }
}

ZIP_ALL

This page describes the ZIP_ALL function in Lenses SQL.

ZIP_ALL(array1, field1, array2, field2, …)

Zip two or more arrays into a single one, returning nulls when an array is not long enough. Example: ZIP_ALL([1, 2], 'x', [3, 4, 5], 'y') will be evaluated to [{ x: 1, y: 3 }, { x: 2, y: 4 }, { x: null, y: 5 }]

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    ZIP_ALL([1, 2], 'x', [3, 4, 5], 'y') 
FROM numbers-data
LIMIT 1;

Output:

{
  "value": {
    "ZIP_ALL": [
      {
        "x": 1,
        "y": 3
      },
      {
        "x": 2,
        "y": 4
      },
      {
        "x": null,
        "y": 5
      }
    ]
  }
}

ZIP

This page describes the ZIP function in Lenses SQL.

ZIP(array1, field1, array2, field2, …)

Zip two or more arrays into a single one. Example: ZIP([1, 2], 'x', [3, 4, 5], 'y') will be evaluated to [{ x: 1, y: 3 }, { x: 2, y: 4 }]

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    ZIP([1, 2], 'x', [3, 4, 5], 'y') 
FROM numbers-data
LIMIT 1;

Output:

{
  "value": {
    "ZIP": [
      {
        "x": 1,
        "y": 3
      },
      {
        "x": 2,
        "y": 4
      }
    ]
  }
}

Conditions

This page describes how to use Conditions in Lenses SQL Processors.

EXISTS

EXISTS (field)

Returns true if the given field is present false otherwise.

Available in:

Processors

SQL Studio

✓

Returns true if the given field is present false otherwise.

Sample code:

USE `kafka`;
SELECT `EXISTS`(id) 
FROM users-events
LIMIT 1;

Output:

{
  "value": {
    "EXISTS": true
  }
}

Conversion

This page describes how to use conversion or CAST functions in Lenses SQL Processors.

CAST

CAST (dt AS int)

Enables conversion of values from one data type to another.

Available in:

Processors

SQL Studio

✓

Sample code

USE `kafka`;
SELECT 
    CAST(price as INT),
    price 
FROM products-events
LIMIT 1;

Output:

{
  "value": {
    "alias": 2, # new value
    "price": 2.46 # old value
  }
}

Date & Time

This page describes how to use date and time functions in Lenses SQL Processors.

Every Date Math expression starts with a base date or time followed by the addition or subtraction of one or more durations.

The base date or time (from here onward) is derived from a field in a table or a function such as now() or yesterday() that generates datetime values.

The shorthand syntax is a unit value followed by a unit symbol. The symbols are:

y (year)
M (month)
w (week)
d (day)
h (hour)
m (minute)
s (second)

CONVERT_DATETIME

This page describes the CONVERT_DATETIME function in Lenses SQL.

CONVERT_DATETIME (strExpr, fromPattern, toPattern)

Converts the string format of a date [and time] to another using the pattern provided.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
 CONVERT_DATETIME(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss', 'dd-MM-yyyy HH:mm'), 
 tpep_pickup_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "CONVERT_DATETIME": "01-01-2016 00:00",
    "tpep_pickup_datetime": "2016-01-01 00:00:02"
  }
}

DATE

This page describes the DATE function in Lenses SQL.

DATE(days)

Builds a local date value from a long or int value. This function can also be used with no parameters to return the current ISO date.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT DATE() 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "DATE": "2023-11-29T00:00:00Z"
  }
}

DATETIME

This page describes the DATETIME function in Lenses SQL.

DATETIME ()

Provides the current ISO date and time.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT DATETIME() 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "DATETIME": "2023-11-29T16:30:41.146Z"
  }
}

EXTRACT_TIME

This page describes the EXTRACT_TIME function in Lenses SQL.

EXTRACT_TIME(timestamp)

Extracts the time portion of a timestamp-micros or timestamp-millis returning a time-millis or time-micros value depending on the timestamp precision.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    EXTRACT_TIME(_meta.timestamp), 
    _meta.timestamp 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "EXTRACT_TIME": "20:28:45.41",
    "timestamp": "2023-11-28T20:28:45.41Z"
  }
}

EXTRACT_DATE

This page describes the EXTRACT_DATE function in Lenses SQL.

EXTRACT_DATE(timestamp)

Extracts the date portion of a timestamp-micros or timestamp-millis returning a date value.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    EXTRACT_DATE(_meta.timestamp), 
    _meta.timestamp 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "EXTRACT_DATE": "2023-11-28",
    "timestamp": "2023-11-28T20:28:45.41Z"
  }
}

FORMAT_DATE

This page describes the FORMAT_DATE function in Lenses SQL.

FORMAT_DATE(date,output_pattern)

Returns a string representation of a date value according to a given pattern.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    FORMAT_DATE(_meta.timestamp,'dd-MM-yyyy HH:mm'), 
    _meta.timestamp 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "FORMAT_DATE": "28-11-2023 20:28",
    "timestamp": "2023-11-28T20:28:45.41Z"
  }
}

FORMAT_TIME

This page describes the FORMAT_TIME function in Lenses SQL.

FORMAT_TIME(time, output_pattern)

Returns a string representation of a time value according to a given pattern.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    FORMAT_TIME(_meta.timestamp,'K:mm a, z'), 
    _meta.timestamp 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "FORMAT_TIME": "8:28 PM, UTC",
    "timestamp": "2023-11-28T20:28:45.41Z"
  }
}

FORMAT_TIMESTAMP

This page describes the FORMAT_TIMESTAMP function in Lenses SQL.

FORMAT_TIMESTAMP(timestamp,output_pattern)

Returns a string representation of a timestamp value according to a given pattern.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    FORMAT_TIMESTAMP(_meta.timestamp,'yyyy-MM-dd HH:mm:ss'), 
    _meta.timestamp 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "FORMAT_TIMESTAMP": "2023-11-28 20:28:45",
    "timestamp": "2023-11-28T20:28:45.337Z"
  }
}

HOUR

This page describes the HOUR function in Lenses SQL.

HOUR (expr)

Extracts the hour component of an expression that is of type timestamp.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    HOUR(_meta.timestamp), 
    _meta.timestamp 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "HOUR": 20,
    "timestamp": "2023-11-28T20:28:45.337Z"
  }
}

MONTH_TEXT

This page describes the MONTH_TEXT function in Lenses SQL.

MONTH_TEXT(dataExpr)

Returns the month name.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    MONTH_TEXT(_meta.timestamp), 
    _meta.timestamp 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "MONTH_TEXT": "November",
    "timestamp": "2023-11-28T20:28:45.337Z"
  }
}

MINUTE

This page describes the MINUTE function in Lenses SQL.

MINUTE(dataExpr)

Extracts the minute component of an expression that is of type timestamp.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    MINUTE(_meta.timestamp), 
    _meta.timestamp 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "MINUTE": 28,
    "timestamp": "2023-11-28T20:28:45.337Z"
  }
}

MONTH

This page describes the MONTH function in Lenses SQL.

MONTH(dataExpr)

Builds a timestamp-millis value from a long or int value.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    MONTH(_meta.timestamp), 
    _meta.timestamp 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "MONTH": 11,
    "timestamp": "2023-11-28T20:28:45.337Z"
  }
}

PARSE_DATE

This page describes the PARSE_DATE function in Lenses SQL.

PARSE_DATE(string, pattern)

Builds a date value given a date string representation and a date pattern.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    PARSE_DATE(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'), 
    tpep_pickup_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "PARSE_DATE": "2016-01-01",
    "tpep_pickup_datetime": "2016-01-01 00:00:02"
  }
}

PARSE_TIME_MILLIS

This page describes the PARSE_TIME_MILLIS function in Lenses SQL.

PARSE_TIME_MILLIS(millis, pattern)

Builds a time-millis value given a time string representation and a time pattern.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    PARSE_TIME_MILLIS(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'), 
    tpep_pickup_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "PARSE_TIME_MILLIS": "00:00:02",
    "tpep_pickup_datetime": "2016-01-01 00:00:02"
  }
}

PARSE_TIME_MICROS

This page describes the PARSE_TIME_MICROS function in Lenses SQL.

PARSE_TIME_MICROS(micros, pattern)

Builds a time-micros value given a time string representation and a time pattern.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    PARSE_TIME_MICROS(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'), 
    tpep_pickup_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "PARSE_TIME_MICROS": "00:00:02",
    "tpep_pickup_datetime": "2016-01-01 00:00:02"
  }
}

PARSE_TIMESTAMP_MILLIS

This page describes the PARSE_TIMESTAMP_MILLIS function in Lenses SQL.

PARSE_TIMESTAMP_MILLIS(string, input_pattern)

Builds a timestamp-millis value given a datetime string representation and a date time pattern.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    PARSE_TIMESTAMP_MILLIS(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'), 
    tpep_pickup_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "PARSE_TIMESTAMP_MILLIS": "2016-01-01T00:00:02Z",
    "tpep_pickup_datetime": "2016-01-01 00:00:02"
  }
}

PARSE_TIMESTAMP_MICROS

This page describes the PARSE_TIMESTAMP_MICROS function in Lenses SQL.

PARSE_TIMESTAMP_MICROS(string, input_pattern)

Builds a timestamp-micros value given a datetime string representation and a date time pattern.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    PARSE_TIMESTAMP_MICROS(tpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'), 
    tpep_pickup_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "PARSE_TIMESTAMP_MICROS": "2016-01-01T00:00:02Z",
    "tpep_pickup_datetime": "2016-01-01 00:00:02"
  }
}

SECOND

This page describes the SECOND function in Lenses SQL.

SECOND(dataExpr)

Extracts the second component of an expression that is of type timestamp.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    SECOND(_meta.timestamp), 
    _meta.timestamp 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "SECOND": 45,
    "timestamp": "2023-11-28T20:28:45.41Z"
  }
}

TIMESTAMP

This page describes the TIMESTAMP function in Lenses SQL.

TIMESTAMP(date, time, zoneStr)

Returns a timestamp for a given date and time at a specific zone id.

Available in:

Processors

SQL Studio

✓

Sample code:

Output:

TIME_MICROS

This page describes the TIME_MICROS function in Lenses SQL.

TIME_MICROS(micros)

Builds a time-micros value from a long or int value.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    TIME_MICROS(Timestamp) 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "TIME_MICROS": "20:47:13.67939",
    "Timestamp": "1503158676433679390"
  }
}

TIMESTAMP_MICROS

This page describes the TIMESTAMP_MICROS function in Lenses SQL.

TIMESTAMP_MICROS(micros)

Builds a timestamp-micros value from a long or int value.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    TIMESTAMP_MICROS(Timestamp), 
    Timestamp 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "TIMESTAMP_MICROS": "+49603-03-15T20:47:13.67939Z",
    "Timestamp": "1503158676433679390"
  }
}

TIME_MILLIS

This page describes the TIME_MILLIS function in Lenses SQL.

TIME_MILLIS(millis)

Builds a time-millis value from a long or int value.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    TIME_MILLIS(Timestamp), 
    Timestamp 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "TIME_MILLIS": "03:07:59.39",
    "Timestamp": "1503158676433679390"
  }
}

TIMESTAMP_MILLIS

This page describes the TIMESTAMP_MILLIS function in Lenses SQL.

TIMESTAMP_MILLIS(millis)

Builds a timestamp-millis value from a long or int value.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    TIMESTAMP_MILLIS(Timestamp), 
    Timestamp 
FROM sea-vessel-position-reports
LIMIT 1;

Output:

{
  "value": {
    "TIMESTAMP_MILLIS": "+47635172-03-26T03:07:59.39Z",
    "Timestamp": "1503158676433679390"
  }
}

TO_DATE

This page describes the TO_DATE function in Lenses SQL.

TO_DATE(strExpr, pattern)

Converts a string representation of a date into epoch value using the pattern provided.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    TO_DATE(tpep_dropoff_datetime, 'yyyy-MM-dd HH:mm:ss'), 
    tpep_dropoff_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "TO_DATE": "2016-01-01T00:00:00Z",
    "tpep_dropoff_datetime": "2016-01-01 00:11:14"
  }
}

TO_DATETIME

This page describes the TO_DATETIME function in Lenses SQL.

TO_DATETIME(strExpr, pattern)

Converts a string representation of a datetime into epoch value using the pattern provided.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    TO_DATETIME(tpep_dropoff_datetime, 'yyyy-MM-dd HH:mm:ss'), 
    tpep_dropoff_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "TO_DATETIME": "2016-01-01T00:11:14Z",
    "tpep_dropoff_datetime": "2016-01-01 00:11:14"
  }
}

TOMORROW

This page describes the TOMORROW function in Lenses SQL.

TOMORROW()

Returns the current date time plus 1 day.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT TOMORROW() 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "TOMORROW": "2023-11-30T18:49:25.182Z"
  }
}

TO_TIMESTAMP

This page describes the TO_TIMESTAMP function in Lenses SQL.

TO_TIMESTAMP(longExpr)

Converts a string representation of a date into epoch value using the pattern provided.

TO_TIMESTAMP (strExpr, pattern)

Converts a string using a pattern to a date and time type.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    TO_TIMESTAMP(tpep_pickup_datetime),
    tpep_pickup_datetime 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "TO_TIMESTAMP": "2016-01-01T00:00:02Z",
    "tpep_pickup_datetime": "2016-01-01 00:00:02"
  }
}

YEAR

This page describes the YEAR function in Lenses SQL.

YEAR(expr)

Extracts the year component of an expression that is of type timestamp.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    YEAR(_meta.timestamp),
    _meta.timestamp 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "YEAR": 2023,
    "timestamp": "2023-11-28T20:28:45.41Z"
  }
}

YESTERDAY

This page describes the YESTERDAY function in Lenses SQL.

YESTERDAY()

Returns the current date time minus 1 day.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT YESTERDAY() 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "YESTERDAY": "2023-11-28T18:57:13.27Z"
  }
}

Headers

This page describes how to use HEADER functions in Lenses SQL Processors.

HEADERASSTRING

This page describes the HEADERASSTRING function in Lenses SQL.

HEADERASSTRING(keyStr)

Returns the value of the record header key as a STRING value.

Available in:

Processors

SQL Studio

✓

HEADERASINT

This page describes the HEADERASINT function in Lenses SQL.

HEADERASINT(keyStr)

Returns the value of the record header key as an INT value.

Available in:

Processors

SQL Studio

✓

HEADERASLONG

This page describes the HEADERASLONG function in Lenses SQL.

HEADERASLONG(keyStr)

Returns the value of the record header key as a LONG value.

Available in:

Processors

SQL Studio

✓

HEADERASDOUBLE

This page describes the HEADERASDOUBLE function in Lenses SQL.

HEADERASDOUBLE(keyStr)

Returns the value of the record header key as a DOUBLE value.

Available in:

Processors

SQL Studio

✓

HEADERASFLOAT

This page describes the HEADERASFLOAT function in Lenses SQL.

HEADERASFLOAT(keyStr)

Returns the value of the record header key as a FLOAT value.

Available in:

Processors

SQL Studio

✓

HEADERKEYS

This page describes the HEADERKEYS function in Lenses SQL.

HEADERKEYS()

Returns all the header keys for the current record.

Available in:

Processors

SQL Studio

✓

JSON

This page describes the JSON functions in Lenses SQL.

JSON_EXTRACT_FIRST

This page describes the JSON_EXTRACT_FIRST function in Lenses SQL.

JSON_EXTRACT_FIRST (json_string,pattern)

Interprets ‘pattern’ as a Json path pattern and applies it to ‘json_string’, returning the first match, as a string containing valid json. Examples for the pattern parameter: “$.a”, “$[‘a’]”, “$[0]”, “$.points[?(@[‘id’]==‘i4’)].x”, “$[‘points’][?(@[‘y’] >= 3)].id”, “$.conditions[?(@ == false)]”

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    JSON_EXTRACT_FIRST(time_details, '$.hour'), 
    time_details 
FROM credit-card-transactions
LIMIT 2;

Output:

{
  "value": {
    "JSON_EXTRACT_FIRST": "1",
    "time_details": "{\"hour\": 1, \"minute\": 17, \"second\": 15, \"millisecond\": 36}"
  }
}
{
  "value": {
    "JSON_EXTRACT_FIRST": "16",
    "time_details": "{\"hour\": 16, \"minute\": 23, \"second\": 2, \"millisecond\": 255}"
  }
}

JSON_EXTRACT_ALL

This page describes the JSON_EXTRACT_ALL function in Lenses SQL.

JSON_EXTRACT_ALL (json_string,pattern)

Interprets ‘pattern’ as a Json path pattern and applies it to ‘json_string’, returning all matches, as an array of strings containing valid json. Examples for the pattern parameter: “$.a”, “$[‘a’]”, “$[0]”, “$.points[?(@[‘id’]==‘i4’)].x”, “$[‘points’][?(@[‘y’] >= 3)].id”, “$.conditions[?(@ == false)]”

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    JSON_EXTRACT_ALL(time_details, '$.*'), 
    time_details 
FROM credit-card-transactions
LIMIT 2;

Output:

{
  "value": {
    "JSON_EXTRACT_ALL": [
      "1",
      "17",
      "15",
      "36"
    ],
    "time_details": "{\"hour\": 1, \"minute\": 17, \"second\": 15, \"millisecond\": 36}"
  }
}
{
  "value": {
    "JSON_EXTRACT_ALL": [
      "16",
      "23",
      "2",
      "255"
    ],
    "time_details": "{\"hour\": 16, \"minute\": 23, \"second\": 2, \"millisecond\": 255}"
  }
}

Numeric

This page describes how to use numeric functions in Lenses SQL Processors.

%

The remainder operator (%) computes the remainder after dividing its first operand by its second i.e. numExpr % numExpr

✓

/

Divides one number by another (an arithmetic operator) i.e. numExpr / numExpr

✓

-

Subtracts one number from another (an arithmetic operator) i.e. numExpr - numExpr

✓

*

Multiplies one number with another (an arithmetic operator) i.e. numExpr * numExpr

✓

+

Adds one number to another (an arithmetic operator) i.e. numExpr + numExpr

✓

- (negative)

Returns the negative of the value of a numeric expression (a unary operator) i.e. -numExpr

✓

ABS

This page describes the ABS function in Lenses SQL.

ABS(numExpr)

Returns the absolute value of an expression that evaluates to a number type.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    ACOS(total/100.0) AS arccosine_total_ratio, 
    total 
FROM orders-events
LIMIT 1;

Output:

{
  "value": {
    "arccosine_total_ratio": "1.119313791750037",
    "total": 43.63
  }
}

ACOS

This page describes the ACOS function in Lenses SQL.

ACOS(numExpr)

Returns the trigonometric arc cosine of an expression.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT ACOS(extra) 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "ACOS": "1.0471975511965979"
  }
}

ASIN

This page describes the ASIN function in Lenses SQL.

ASIN(numExpr)

Returns the trigonometric arc sine of an expression.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT ASIN(extra) 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "ASIN": "0.5235987755982989"
  }
}

ATAN

This page describes the ATAN function in Lenses SQL.

ATAN(numExpr)

Returns the trigonometric arc tangent of an expression.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT ATAN(pickup_longitude) 
FROM nyc-yellow-taxi-trip
LIMIT 1;

Output:

{
  "value": {
    "ATAN": "-1.5572801259337674"
  }
}

CBRT

This page describes the CBRT function in Lenses SQL.

CBRT(numExpr)

Returns the cube root of numExpr.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    CBRT(integer),
    integer 
FROM numbers-data
LIMIT 1;

Output:

{
  "value": {
    "CBRT": "25.483906287370523",
    "integer": 16550
  }
}

CEIL

This page describes the CEIL function in Lenses SQL.

CEIL(numExpr)

Returns the absolute value of an expression.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    CEIL(float),
    float 
FROM numbers-data
LIMIT 1;

Output:

{
  "value": {
    "CEIL": 96,
    "float": 95.95
  }
}

COSH

This page describes the COSH function in Lenses SQL.

COSH(numExpr)

Returns the hyperbolic cosine of an expression.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    COSH(float),
    float 
FROM numbers-data
LIMIT 1;

Output:

{
  "value": {
    "COSH": "2.3416691959782676E41",
    "float": 95.95
  }
}

COS

This page describes the COS function in Lenses SQL.

COS(numExpr)

Returns the trigonometric cosine of an expression.

Available in:

Processors

SQL Studio

✓

Sample code:

USE `kafka`;
SELECT 
    COS(float),
    float 
FROM numbers-data
LIMIT 1;

Output:

{
  "value": {
    "COS": "-0.13104605979102568",
    "float": 95.95
  }
}

Lenses SQL

SQL Studio

Writing queries

Concepts

What is a message?

Facets

Selecting nested fields

Primitive types

Accessing metadata

Projections and nested aliases

Alias clashes (repeated fields)

Nested queries

Functions

Best practices

Understanding bad records

Control execution

Query tuning

Example

Creating & deleting Kafka topics

CREATE TABLE

SHOW TABLES

DESCRIBE TABLE

DROP TABLE

System virtual tables (__tables, __fields)

Record metadata

Filtering

WHERE clause

Missing values

Multiple WHERE conditions

HAVING clause

Read a table partition only

Query by partitions

Query by offsets

Query by timestamp

Limit & Sampling

Limit the output

Set a time limit for a query

Sampling data

Joins

Joins

Lateral Joins

Inserting & deleting data

Inserting data from a SELECT

Insert complex key

Deleting data in Kafka

Truncating a table

Aggregations

Count the records

Use SUM to aggregate your amounts

Group data with GROUP BY

Metadata fields

Select all the meta fields

View headers

Filter on record timestamp

Filter on table partition

Search for a record on a specific offset

Get the latest N records per partition

Views & synonyms

CREATE VIEW

DROP VIEW

SHOW VIEW

CREATE SYNONYM

DROP SYNONYM

Common use cases

Projection subset

Representing joins

Preset filter

Arrays

Create an array field

Select an array field

Query for array size

Managing queries

Termination Control

Recent queries

Kill Running queries

SQL Processors

Creating a SQL Processor

Starting a SQL Processor

Viewing SQL Processors

Scaling a SQL Processor

System virtual tables (tables, fields)

Projection from a `_value` field