Learn in 10 minutes

Welcome to this Lenses SQL tabled-based query tutorial! This document will take the user through typical use cases for querying existing data. Each little exercise will first build the topics and insert the data before proceeding to the actual select statement. It is expected the reader has an understanding of the concepts behind the SQL engine and that Lenses is running. The most convenient option to run Lenses is using Lenses-Box which is a self-contained docker image packing Lenses and it’s dependencies. Assuming the docker image was run, the reader should navigate to http://localhost:3030/#, log in with the default admin/admin and then from the top menu navigate to the SQL screen.

First query

Selecting data requires an existing table. Let’s create a basic table for Apache Kafka which stores String for both Key and Value and then insert one record.

-- Run
CREATE TABLE greetings(_key string, _value string) FORMAT (string, string);

-- And then
INSERT INTO greetings(_key, _value) VALUES("Hello", "World");

Go ahead and run your first SELECT query now!

SELECT * FROM greetings;

It is always good to clean after yourself, so go ahead and drop the table.

DROP TABLE greetings;

Row metadata

A table mapping on a Kafka topic contains metadata for each Kafka record. This is Kafka specific topic information like partition, offset, timestamp. Let’s create a tutorial table and insert a few records.

-- Run
CREATE TABLE tutorial(_key string, name string, difficulty int) FORMAT (Avro, Avro);

-- And then
INSERT INTO tutorial(_key, name, difficulty) VALUES("1", "Learn Lenses SQL", 3);
INSERT INTO tutorial(_key, name, difficulty) VALUES("2", "Learn Quantum Physics", 10);
INSERT INTO tutorial(_key, name, difficulty) VALUES("3", "Learn French", 5);

With the data available, a query can ask for record’s specific metadata. To query for metadata such as the underlying Kafka topic offset, partition and timestamp just prefix them with _meta. Run this query to see each tutorial and their metadata information:

SELECT name, _meta.offset, _meta.timestamp, _meta.partition
FROM tutorial

/* The output is (timestamp will be different)
Learn Lenses SQL         0    1540575169198    0
Learn Quantum Physics    1    1540575169226    0
Learn French             2    1540575169247    0
*/

Performing simple arithmetic

Through the SQL expressions, you can perform basic arithmetic. Let’s create a grocery table and insert data into it.

-- Run
CREATE TABLE groceries (_key int, id int, name string, quantity int, price double) FORMAT(INT, Avro);

-- And then
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (1, 1, "Fairtrade Bananas", 1, 1.90);
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (2, 2, "Meridian Crunchy Peanut Butter", 1, 2.50);
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (3, 3, "Green & Black's organic 85% dark chocolate bar", 2, 1.99);
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (4, 4, "Activia fat free cherry yogurts", 3, 2.00);
INSERT INTO groceries(_key, id, name, quantity , price) VALUES (5, 5, "Green & Blacks Organic Chocolate Ice Cream", 2, 4.20);

With the data now available, let’s calculate the price on each produce bought.

SELECT name, quantity * price as total FROM groceries

/*The output:
Fairtrade Bananas                               1.9
Meridian Crunchy Peanut Butter                  2.5
Green & Black's organic 85% dark chocolate bar  3.98
Activia fat free cherry yogurts                 6
Green & Blacks Organic Chocolate Ice Cream      8.4
*/

The common arithmetic operations +, -, *, / and % are supported. Try using a different arithmetic operation.

Using functions

Lenses provides out of the box quite an array of functions and has even support for allowing user-defined functions. The built-in functions allow greater flexibility when processing your data. The ROUND function allows you to round numeric functions.

SELECT name , round(quantity * price) as rounded_total
FROM groceries

/*The output:
Fairtrade Bananas                                   2
Meridian Crunchy Peanut Butter                      3
Green & Black's organic 85% dark chocolate bar      4
Activia fat free cherry yogurts                     6
Green & Blacks Organic Chocolate Ice Cream          8
*/

Manipulating text columns

String functions allow you to perform different operations on text. Take CONCAT for example, it allows to combine multiple values together. Let’s create a customer table:

-- Run
CREATE TABLE customers (_key string, first_name string, last_name string) FORMAT(string, Avro);

-- And then
INSERT INTO customers(_key, first_name, last_name) VALUES("mikejones", "Mike", "Jones");
INSERT INTO customers(_key, first_name, last_name) VALUES("anasmith", "Ana", "Smith");
INSERT INTO customers(_key, first_name, last_name) VALUES("shannonelliott", "Shannon","Elliott");

In the following example, the query combines first and last names into full names.

SELECT CONCAT(first_name, " ", last_name) as name
FROM customers;

/* Output
Mike        Jones
Ana         Smith
Shannon     Elliott
*/

Filtering with WHERE

WHERE clause allows the user to define a set of logical predicates the data needs to match in order to be returned. All of the standard comparison operators are supported (>, >=, <, <=, =, and !=) as well as calling functions.

Considering the groceries table created earlier, please recreate it if you have dropped it. Let’s select all items purchased where the prices are greater or equal to 2.00.

SELECT name
     , price
FROM groceries
WHERE price >= 2.0

/* Output
Meridian Crunchy Peanut Butter              2.5
Activia fat free cherry yogurts             2
Green & Blacks Organic Chocolate Ice Cream  4.2
*/

Now let’s select all customers whose last name length equals to 5

SELECT *
FROM customers
WHERE LEN(last_name)=5

/* Output
key         value.first_name    value.last_name
mikejones       Mike                Jones
anasmith        Ana                 Smith
*/

Try changing the comparison to something else, maybe use > or <.

Before moving to the next section, let’s search all customers containing Ana in their first name:

SELECT *
FROM customers
WHERE first_name like '%Ana%'

Keep in mind the text search is case sensitive. To use case-insensitive, you could write:

SELECT *
FROM customers
WHERE lowercase(first_name) like '%ana%';

-- And here is the negated version
SELECT *
FROM customers
WHERE lowercase(first_name) not like '%ana%';

Missing values

Sometimes the data can contain explicit NULL values or it can omit fields entirely. Using IS [ NOT ] NULL, or EXISTS function allows you to check for these situations.

Currently, the CREATE syntax does not allow creating a new table stored as Avro where fields can be marked with a flag to say they can be null. The next release will cover that. However, tables created externally as a result of creating and storing Avro records via Kafka can lead to those scenarios. Since Lenses SQL is agnostic of the actual payload let’s rely on the JSON storage format to create the table. JSON is not enforcing a schema and therefore is not as restrictive when inserting the records. Let’s create the following customers’ table:

-- Run
CREATE TABLE customers_json (_key string, first_name string, last_name string, middle_name string) FORMAT(string, json);

-- And then
INSERT INTO customers_json(_key, first_name, last_name, middle_name) VALUES("mikejones", "Mike", "Jones", "Albert");
INSERT INTO customers_json(_key, first_name, last_name) VALUES("anasmith", "Ana", "Smith");
INSERT INTO customers_json(_key, first_name, last_name) VALUES("shannonelliott", "Shannon","Elliott");

Let’s query first this table for all the entries.

SELECT * FROM customers_json

/* The output
key             value.first_name   value.middle_name   value.last_name
mikejones           Mike                Albert          Jones
anasmith            Ana                                 Smith
shannonelliott      Shannon                             Elliott
*/

As you can see the middle_name is only present on the mikejones record. Let’s write a query which filters out records where middle_name is not present.

SELECT *
FROM customers_json
WHERE exists(middle_name)

/* The output
 key             value.first_name   value.middle_name   value.last_name
mikejones            Mike            Albert                Jones
*/

A similar outcome can be achieved with the following query.

SELECT *
FROM customers_json
WHERE middle_name is NULL

When a field is actually null or is missing, checking like in the above query has the same outcome.

Multiple WHERE conditions

Scenarios, where one condition is required, are not covering the entire domain space. To cover them, you can use AND/OR to specify complex conditions to filter the data. Let’s use the groceries table. Go ahead and recreate it. Here is how you could filter the purchased items where more than one item has been bought for a given product, and the unit price is greater than 2

SELECT *
FROM groceries
WHERE quantity > 1 AND price > 2

Now try changing the AND logical operand to OR and see the difference in output.

Limit the output

Sometimes, you would want to return just a few records - you are sampling the data. Also, returning tens of thousands of records, or millions will probably be something you don’t really need. It is unlikely you will sit there and go through each record. In this scenario, there are a few things you can use to limit the data returned. The first one is delivered via the LIMIT N syntax which specifies the maximum record to return when querying the data.

SELECT *
FROM groceries
LIMIT 1

An advanced user can set the maximum data size to be returned.

-- set the data size returned to be 1 megabyte.
SET max.size="1m";

-- on the small dataset we have here 1 MB will accommodate all records added and more
SELECT *
FROM groceries

One other scenario you might encounter is sampling data but discarding a few records before the actual sample ones. Run this query and see the difference between this and the first one in this section.

SELECT *
FROM groceries
LIMIT 1,2

What the syntax above does is instruct the engine to skip the first record matched and then sample the next two. Run the query to see the results.

Read a table partition only

By now you know a topic maps to a table. Since a topic is partitioned the table itself is partitioned. A SELECT statement targets all the table partitions at once. However, you have the option to look at a specific partition(-s). Let’s create a partitioned table first. We are going to recreate the customers table used earlier. And we will set the partitions count to 3. A record is sent to a specific partition based on the Key. The Key is hashed and then the remainder from doing HashValue % partitions will be the table partition the record is sent to.

-- Run
CREATE TABLE customers_partitioned (_key string, first_name string, last_name string) FORMAT(string, Avro) properties(partitions = 3);

-- And then
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("mikejones", "Mike", "Jones");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("anasmith", "Ana", "Smith");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("shannonelliott", "Shannon","Elliott");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("tomwood", "Tom","Wood");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("adelewatson", "Adele","Watson");
INSERT INTO customers_partitioned(_key, first_name, last_name) VALUES("mariasanchez", "Maria", "Sanchez");

Run now this query:

SELECT *
FROM customers_partitioned

/* The output
offset  partition   timestamp       key         value.first_name    value.last_name
0       0           1540830780401   mikejones       Mike                Jones
1       0           1540830780441   anasmith        Ana                 Smith
2       0           1540830780464   shannonelliott  Shannon             Elliott
0       2           1540831270170   mariasanchez    Maria               Sanchez
0       1           1540830984698   tomwood         Tom                 Wood
1       1           1540831183308   adelewatson     Adele               Watson
*/

As you can see from the results (your timestamps will be different) the records span over the three partitions. Now let’s query specific partitions.

-- selects only records from partition = 0
SELECT *
FROM customers_partitioned
WHERE _meta.partition = 0

-- selects only records from partition  0 and 2
SELECT *
FROM customers_partitioned
WHERE _meta.partition = 0
   OR _meta.partition = 2

Count the records

Looking and manipulating single entries can only take you so far. There are scenarios where you want the information about groups of data instead of individual rows. In this exercise, we are going to use the COUNT aggregate to count the records on a table. Run the following SQL to see how many records we have on the customers_partitioned.

SELECT count(*) as total
FROM customers_partitioned

Use SUM to aggregate your amounts

Earlier the groceries table was used. You can find out the amount spent with a very simple query using the SUM function. Go ahead and run this:

SELECT SUM(quantity * price) as amount
FROM groceries

Group data with GROUP BY

There are scenarios where you want to group the data based on some of the fields it contains. Imagine a customer table where you store the customer country information. Let’s create a new table customers with a bit more information on each row. Let’s name it customer:

CREATE TABLE customer (firstName string, lastName string, city string,country string, phone string) FORMAT(string, avro) properties(compact=true);

INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('1','Craig', 'Smith', 'New York', 'USA', '1-01-993 2800');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('2','William', 'Maugham','Toronto','Canada','+1-613-555-0110');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('3','William', 'Anthony','Leeds','UK', '+44 1632 960427');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('4','S.B.','Swaminathan','Bangalore','India','+91 7513200000');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('5','Thomas','Morgan','Arnsberg','Germany','+49-89-636-48018');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('6','Thomas', 'Merton','New York','USA', '+1-617-555-0147');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('7','Piers','Gibson','London','UK', '+44 1632 960269');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('8','Nikolai','Dewey','Atlanta','USA','+1-404-555-0178');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('9','Marquis', 'de Ellis','Rio De Janerio','Brazil','+55 21 5555 5555');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('10','Joseph', 'Milton','Houston','USA','+1-202-555-0153');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('11','John','Betjeman Hunter','Sydney','Australia','+61 1900 654 321');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('12','Evan', 'Hayek','Vancouver','Canada','+1-613-555-0130');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('13','E.','Howard','Adelaide','Australia','+61 491 570 157');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('14','C. J.', 'Wilde','London','UK','+44 1632 960111');
INSERT INTO customer (_key,firstName, lastName, city, country, phone) VALUES ('15','Butler', 'Andre','Florida','USA','+1-202-555-0107');

Now let’s see how many customers there are for each country. Here is the code which computes that:

SELECT COUNT(*) as count
     , country
FROM customer
GROUP BY country
/* Output
2    Canada
3    UK
2    Australia
1    India
1    Brazil
5    USA
1    Germany
*/

Nested queries

The tabled query allows you to nest queries. Let’s take the query in the previous section and say we are only interested in those entries where there more than 1 customer per country.

SELECT *
FROM (
    SELECT COUNT(*) as count
         , country
    FROM customer
    GROUP BY country)
WHERE count > 1

Run the query and you will only see those entries for which there is more than one person registered per country.

Filtering grouped data with HAVING

As you saw with the previous query, you might want to filter the entries returned from a grouping query. Just like the WHERE statement you can use HAVING syntax to achieve the same when it comes to grouped queries. Probably, you will agree it is a much easier syntax than nesting two queries. Let’s rewrite the query in the previous section:

SELECT COUNT(*) as count
     , country
FROM customer
GROUP BY country
HAVING count > 1

Just run the code and you will get the same output you got with the nested query.

Joins

Lenses Tabled-query allows you to combine records from two tables. A query can contain zero, one or multiple JOIN operations. Let’s create an orders table and insert some data into it:

CREATE TABLE orders(_key int, orderDate string, customerId string, amount double) format(int, avro);

INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (1, '2018-10-01', '1', 200.50);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (2, '2018-10-11', '1', 813.00);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (3, '2018-10-11', '3', 625.20);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (4, '2018-10-11', '14', 730.00);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (5, '2018-10-11', '10', 440.00);
INSERT INTO orders (_key, orderDate, customerId, amount) VALUES (6, '2018-10-11', '9', 444.80);

With these tables in place let’s join them to get a more information on the order by combining it with the customers’ information:

SELECT o._key as orderNumber
    , o.amount as totalAmount
    , c.firstName
    , c.lastName
    , c.city
    , c.country
FROM orders o INNER JOIN customer c
    ON o.customerId = c._key

/*
city        orderNumber     country     totalAmount     lastName    firstName
New York        1               USA         200.5       Smith       Craig
New York        2               USA         813         Smith       Craig
Leeds           3               UK          625.2       Anthony     William
Rio De Janerio  6               Brazil      444.8       de Ellis    Marquis
Houston         5               USA         440         Milton      Joseph
London          4               UK          730         Wilde       C. J.
*/