Learn in 10 minutes

Welcome to this Lenses SQL tabled-based query tutorial! This document will take the user through typical use cases for querying existing data. Each little exercise will first build the topics and insert the data before proceeding to the actual SELECT statement. It is expected that the reader has an understanding of the concepts behind the SQL engine and that Lenses is running. The most convenient option to run Lenses is using Lenses-Box which is a self-contained Docker image packing Lenses and its dependencies. Assuming the Docker image is up and running, the reader should navigate to http://localhost:3030/#, log in with the default username and password (admin/admin) and then from the top menu navigate to the SQL screen.

First query

Selecting data requires an existing table. Let us create a basic table for Apache Kafka which stores String values for both Key and Value and then inserts one record.

-- Run
CREATE TABLE greetings(_key string, _value string) FORMAT (string, string);

-- And then
INSERT INTO greetings(_key, _value) VALUES("Hello", "World");

Go ahead and run your first SELECT query now:

SELECT * FROM greetings;

It is always a good practice to clean after finishing, so please go ahead and drop the greetings table:

DROP TABLE greetings;

Row metadata

A table mapping on a Kafka topic contains metadata for each Kafka record. This is Kafka specific topic information like partition, offset and timestamp. Letus create a tutorial table and insert a few records.

-- Run
CREATE TABLE tutorial(_key string, name string, difficulty int) FORMAT (Avro, Avro);

-- And then
INSERT INTO tutorial(_key, name, difficulty) VALUES("1", "Learn Lenses SQL", 3);
INSERT INTO tutorial(_key, name, difficulty) VALUES("2", "Learn Quantum Physics", 10);
INSERT INTO tutorial(_key, name, difficulty) VALUES("3", "Learn French", 5);

With the data available, a query can ask for specific metadata related to the records. To query for metadata such as the underlying Kafka topic offset, partition and timestamp just prefix your desired fields with _meta. Run the following query to see each tutorial name along with its metadata information:

SELECT name, _meta.offset, _meta.timestamp, _meta.partition
FROM tutorial

/* The output is (timestamp will be different)
Learn Lenses SQL         0    1540575169198    0
Learn Quantum Physics    1    1540575169226    0
Learn French             2    1540575169247    0
*/

Performing simple arithmetic

Through the SQL expressions, you can perform basic arithmetic. Let us create a grocery table and insert data into it:

-- Run
CREATE TABLE groceries (_key int, id int, name string, quantity int, price double) FORMAT(INT, Avro);

-- And then
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (1, 1, "Fairtrade Bananas", 1, 1.90);
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (2, 2, "Meridian Crunchy Peanut Butter", 1, 2.50);
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (3, 3, "Green & Black's organic 85% dark chocolate bar", 2, 1.99);
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (4, 4, "Activia fat free cherry yogurts", 3, 2.00);
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (5, 5, "Green & Blacks Organic Chocolate Ice Cream", 2, 4.20);

With the data available, let us calculate the price of each purchase:

SELECT name, quantity * price as total FROM groceries

/*The output:
Fairtrade Bananas                               1.9
Meridian Crunchy Peanut Butter                  2.5
Green & Black's organic 85% dark chocolate bar  3.98
Activia fat free cherry yogurts                 6
Green & Blacks Organic Chocolate Ice Cream      8.4
*/

The common arithmetic operations +, -, *, / and % are supported. Try using a different arithmetic operation on your own.

Using functions

Lenses provides quite an array of functions out of the box – it even has support for user-defined functions. Built-in functions give flexibility when processing your data. The ROUND function allows you to round numeric functions:

SELECT name , round(quantity * price) as rounded_total
FROM groceries

/*The output:
Fairtrade Bananas                                   2
Meridian Crunchy Peanut Butter                      3
Green & Black's organic 85% dark chocolate bar      4
Activia fat free cherry yogurts                     6
Green & Blacks Organic Chocolate Ice Cream          8
*/

Manipulating text columns

String functions allow you to perform different operations on text. Take CONCAT for example: it allows to combine multiple values together. Let us first create a customer table:

-- Run
CREATE TABLE customers (_key string, first_name string, last_name string) FORMAT(string, Avro);

-- And then
INSERT INTO customers(_key, first_name, last_name) VALUES("mikejones", "Mike", "Jones");
INSERT INTO customers(_key, first_name, last_name) VALUES("anasmith", "Ana", "Smith");
INSERT INTO customers(_key, first_name, last_name) VALUES("shannonelliott", "Shannon","Elliott");

In the following example, the query combines first and last names into full names.

SELECT CONCAT(first_name, " ", last_name) as name
FROM customers;

/* Output
Mike        Jones
Ana         Smith
Shannon     Elliott
*/

Filtering with WHERE

WHERE clause allows the user to define a set of logical predicates the data needs to match in order to be returned. All of the standard comparison operators are supported (>, >=, <, <=, =, and !=) as well as calling functions.

We are going to use the groceries table created earlier – please recreate it if you have dropped it. Let us select all items purchased where the prices are greater or equal to 2.00:

SELECT name
     , price
FROM groceries
WHERE price >= 2.0

/* Output
Meridian Crunchy Peanut Butter              2.5
Activia fat free cherry yogurts             2
Green & Blacks Organic Chocolate Ice Cream  4.2
*/

Now let us select all customers whose last name length equals to 5:

SELECT *
FROM customers
WHERE LEN(last_name)=5

/* Output
key         value.first_name    value.last_name
mikejones       Mike                Jones
anasmith        Ana                 Smith
*/

Try changing the comparison to something else, maybe use > or < instead of =.

Before moving to the next section, let us search all customers containing Ana in their first name:

SELECT *
FROM customers
WHERE first_name like '%Ana%'

Keep in mind that text search is case sensitive. To use case insensitive text search, you could write:

SELECT *
FROM customers
WHERE lowercase(first_name) like '%ana%';

-- And here is the negated version
SELECT *
FROM customers
WHERE lowercase(first_name) not like '%ana%';

Missing values

Sometimes the data can contain explicit NULL values or it can omit fields entirely. Using IS [ NOT ] NULL, or EXISTS function allows you to check for these situations.

Currently, the CREATE syntax does not allow creating a new table stored as Avro where fields can be marked with a flag to say they can be null. The next release will cover that. However, tables created externally as a result of creating and storing Avro records via Kafka can lead to those scenarios. Since Lenses SQL is agnostic of the actual payload let us rely on the JSON storage format to create the table. JSON is not enforcing a schema and therefore is not as restrictive when inserting the records. Let us create the following table named customers_json:

-- Run
CREATE TABLE customers_json (_key string, first_name string, last_name string, middle_name string) FORMAT(string, json);

-- And then
INSERT INTO customers_json(_key, first_name, last_name, middle_name) VALUES("mikejones", "Mike", "Jones", "Albert");
INSERT INTO customers_json(_key, first_name, last_name) VALUES("anasmith", "Ana", "Smith");
INSERT INTO customers_json(_key, first_name, last_name) VALUES("shannonelliott", "Shannon","Elliott");

First, let us query this table for all its entries:

SELECT * FROM customers_json

/* The output
key             value.first_name   value.middle_name   value.last_name
mikejones           Mike                Albert          Jones
anasmith            Ana                                 Smith
shannonelliott      Shannon                             Elliott
*/

As you can see the middle_name is only present on the mikejones record. Let us write a query which filters out records where middle_name is not present:

SELECT *
FROM customers_json
WHERE exists(middle_name)

/* The output
 key             value.first_name   value.middle_name   value.last_name
mikejones            Mike            Albert                Jones
*/

A similar outcome can be achieved with the following query:

SELECT *
FROM customers_json
WHERE middle_name is NULL

When a field is actually NULL or is missing, checking like in the above query has the same outcome.

Multiple WHERE conditions

Scenarios where one condition is required are not covering the entire domain space. To cover them, you can use AND/OR to specify complex conditions for filtering your data. Let us use the groceries table again. Please go ahead and recreate it if it is not there. Here is how you could filter the purchased items where more than one item has been bought for a given product, and the unit price is greater than 2

SELECT *
FROM groceries
WHERE quantity > 1 AND price > 2

Now try changing the AND logical operand to OR and see the differences in output.

Limit the output

Sometimes, you would want to return just a few records - you are just sampling the data. Also, returning tens of thousands of records, or millions will probably be something that you do not really need. It is unlikely you will sit there and go through each record one by one. In this scenario, there are a few things you can do to limit the data returned. The first one is delivered via the LIMIT N syntax which specifies the maximum number of records that will be returned when querying the data:

SELECT *
FROM groceries
LIMIT 1

An advanced user can also set the maximum data size to be returned:

-- set the data size returned to be 1 megabyte.
SET max.size="1m";

-- on the small dataset we have here 1 MB will accommodate all records added and more
SELECT *
FROM groceries

One other scenario you might encounter is sampling data but discarding a few records before the actual sample ones. Run this query and see the difference between this and the first one in this section.

SELECT *
FROM groceries
LIMIT 1,2

What the syntax above does is instruct the engine to skip the first record matched and then sample the next two. Run the query to see the results.

Read a table partition only

By now you know that a Kafka topic maps to a table. Since a Kafka topic is partitioned, the table itself is partitioned. A SELECT statement targets all the table partitions at once. However, you have the option to look at specific partitions. Let us create a partitioned table first. We are going to recreate the customers table used earlier. And we will set the partitions count to 3. A record is sent to a specific partition based on the Key. The Key is hashed and then the remainder from doing HashValue % partitions will be the table partition the record is sent to.

-- Run
CREATE TABLE customers_partitioned (_key string, first_name string, last_name string) FORMAT(string, Avro) properties(partitions = 3);

-- And then
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("mikejones", "Mike", "Jones");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("anasmith", "Ana", "Smith");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("shannonelliott", "Shannon","Elliott");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("tomwood", "Tom","Wood");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("adelewatson", "Adele","Watson");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("mariasanchez", "Maria", "Sanchez");

Next, run the following query:

SELECT *
FROM customers_partitioned

/* The output
offset  partition   timestamp       key         value.first_name    value.last_name
0       0           1540830780401   mikejones       Mike                Jones
1       0           1540830780441   anasmith        Ana                 Smith
2       0           1540830780464   shannonelliott  Shannon             Elliott
0       2           1540831270170   mariasanchez    Maria               Sanchez
0       1           1540830984698   tomwood         Tom                 Wood
1       1           1540831183308   adelewatson     Adele               Watson
*/

As you can see from the results (your timestamps will be different) the records span over the three partitions. Now let us query specific partitions:

-- selects only records from partition = 0
SELECT *
FROM customers_partitioned
WHERE _meta.partition = 0

-- selects only records from partition  0 and 2
SELECT *
FROM customers_partitioned
WHERE _meta.partition = 0
   OR _meta.partition = 2

Count the records

Looking and manipulating single entries can only take you so far. There are scenarios where you want the information about groups of data instead of individual rows. In this exercise, we are going to use the COUNT aggregate to count the records on a table. Run the following SQL to see how many records we have on the customers_partitioned:

SELECT count(*) as total
FROM customers_partitioned

Use SUM to aggregate your amounts

Earlier on this page, the groceries table was used. You can find out the amount spent with a very simple query using the SUM function. Please execute the following query:

SELECT SUM(quantity * price) as amount
FROM groceries

Group data with GROUP BY

There are scenarios where you want to group the data of a table based on some of the fields of the table. Imagine a customer table where you store the customer country information. Let us create a new table customers with a bit more information on each row. Let us name it customer:

CREATE TABLE customer (firstName string, lastName string, city string,country string, phone string) FORMAT(string, avro) properties(compact=true);

INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('1','Craig', 'Smith', 'New York', 'USA', '1-01-993 2800');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('2','William', 'Maugham','Toronto','Canada','+1-613-555-0110');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('3','William', 'Anthony','Leeds','UK', '+44 1632 960427');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('4','S.B.','Swaminathan','Bangalore','India','+91 7513200000');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('5','Thomas','Morgan','Arnsberg','Germany','+49-89-636-48018');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('6','Thomas', 'Merton','New York','USA', '+1-617-555-0147');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('7','Piers','Gibson','London','UK', '+44 1632 960269');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('8','Nikolai','Dewey','Atlanta','USA','+1-404-555-0178');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('9','Marquis', 'de Ellis','Rio De Janeiro','Brazil','+55 21 5555 5555');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('10','Joseph', 'Milton','Houston','USA','+1-202-555-0153');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('11','John','Betjeman Hunter','Sydney','Australia','+61 1900 654 321');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('12','Evan', 'Hayek','Vancouver','Canada','+1-613-555-0130');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('13','E.','Howard','Adelaide','Australia','+61 491 570 157');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('14','C. J.', 'Wilde','London','UK','+44 1632 960111');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('15','Butler', 'Andre','Florida','USA','+1-202-555-0107');

Now let us see how many customers there are from each country. Here is the code which computes that:

SELECT COUNT(*) as count
     , country
FROM customer
GROUP BY country
/* Output
2    Canada
3    UK
2    Australia
1    India
1    Brazil
5    USA
1    Germany
*/

Nested queries

The tabled query allows you to nest queries. Let us take the query in the previous section and say we are only interested in those entries where there exist more than 1 customer per country.

SELECT *
FROM (
    SELECT COUNT(*) as count
         , country
    FROM customer
    GROUP BY country)
WHERE count > 1

Run the query and you will only see those entries for which there is more than one person registered per country.

Filtering grouped data with HAVING

As you saw from the previous query, you might want to filter the entries returned from a grouping query. Just like the WHERE statement, you can use HAVING syntax to achieve the same result when it comes to grouped queries. You will most likely agree that this is a much easier syntax than nesting two queries. Let us rewrite the query from the previous section as make it as follows:

SELECT COUNT(*) as count
     , country
FROM customer
GROUP BY country
HAVING count > 1

Just run the code and you will get the same output you got with the nested query.

Joins

Lenses Tabled-query allows you to combine records from two tables. A query can contain zero, one or multiple JOIN operations. Let us create an orders table and insert some data into it:

CREATE TABLE orders(_key int, orderDate string, customerId string, amount double) format(int, avro);

INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (1, '2018-10-01', '1', 200.50);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (2, '2018-10-11', '1', 813.00);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (3, '2018-10-11', '3', 625.20);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (4, '2018-10-11', '14', 730.00);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (5, '2018-10-11', '10', 440.00);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (6, '2018-10-11', '9', 444.80);

With these tables in place, let us join them to get more information about an order by combining it with the customer information found in the customer table:

SELECT o._key as orderNumber
    , o.amount as totalAmount
    , c.firstName
    , c.lastName
    , c.city
    , c.country
FROM orders o INNER JOIN customer c
    ON o.customerId = c._key

/*
city        orderNumber     country     totalAmount     lastName    firstName
New York        1               USA         200.5       Smith       Craig
New York        2               USA         813         Smith       Craig
Leeds           3               UK          625.2       Anthony     William
Rio De Janeiro  6               Brazil      444.8       de Ellis    Marquis
Houston         5               USA         440         Milton      Joseph
London          4               UK          730         Wilde       C. J.
*/