Regex within Lenses.io


Regex matching and extraction 

In this tutorial you will learn how to use the regexp and regex_matches functions to match and extract data from text/string fields.

Applying a regex to validate or extract text data is a very common scenario especially when working with string/text data sources like log files.

This tutorial is split in 5 main stages:

  1. Creating and populating an example topic.
  2. Filtering out non-standard log messages.
  3. Extracting patterns from a string (capturing groups)
  4. Creating complex structs from arrays of capturing groups
  5. Turning it into a processor

NOTE: All the code snippets bellow apart from step 5) can be used within SQL Studio. Step 5) is to be used to create a new processor.

Step 01 - Demo data 

Running the following code in SQL Studio should enable you to create a table (log_lines) with some logs taken from a Zookeeper instance:

CREATE TABLE log_lines(_key LONG, _value.log_line STRING) FORMAT (LONG, AVRO);

INSERT INTO log_lines(_key, log_line) VALUES
    (1, '[2021-09-22 08:51:39,347] INFO clientPortAddress is 0.0.0.0:2181 (org.apache.zookeeper.server.quorum.QuorumPeerConfig)'),
    (2, '[2021-09-22 08:51:39,347] INFO secureClientPort is not set (org.apache.zookeeper.server.quorum.QuorumPeerConfig)'),
    (3, '[2021-09-22 08:51:39,347] INFO Starting server (org.apache.zookeeper.server.ZooKeeperServerMain)'),
    (4, '[2021-09-22 08:51:39,351] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)'),
    (5, '[2021-09-22 10:52:39,347] INFO Starting server (org.apache.zookeeper.server.ZooKeeperServerMain)'),
    (6, '[2021-09-22 10:52:42,351] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)'),
    (7, '[2021-09-22 10:52:50,365] INFO Server environment:zookeeper.version=3.5.9-83df9301aa5c2a5d284a9940177808c01bc35cef, built on 01/06/2021 20:03 GMT (org.apache.zookeeper.server.ZooKeeperServer)')
    (8, '[19283312] JFHR Server environment:zookeeper.version=3.5.9-83df9301aa5c2a5d284a9940177808c01bc35cef, built on 01/06/2021 20:03 GMT (org.apache.zookeeper.server.ZooKeeperServer)')
;

Step 02 - Filtering messages 

Since we’re dealing with strings we should first filter any log lines that do not match our existing assumptions and in that way avoid errors. As such we first filter out any messages that do not respect the following regex ^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$:

SELECT
 *
FROM
 log_lines
WHERE
  SIZEOF(
    regexp(log_line, "^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$")
  ) >= 1

Notice that _key = 8 was filtered out from the result set since it does not start with a valid date format.

Step 03 - Capturing groups 

Now that we know the data we’re working with matches our expectations, the next step is to extract the relevant fields. For that we’ll use the regex_matches function which allows us to define capturing groups allowing us to deconstruct our string based on a pattern:

SELECT
 log_line,
 regex_matches(log_line, '^\[(\d{4})-(\d{2})-(\d{2}) (\d{2})\:(\d{2}):(\d{2}),(\d{3})\] (INFO|ERROR|WARN) (.*)$') as matches
FROM
 log_lines
WHERE
  SIZEOF(regexp(log_line, '^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$')) >= 1

Notice that in the result set, we now have a new property “matches” which is a 2D array. This is because regex_matches will return one set of results every time the whole regex provided is matched, since our regex matches the whole string, we only get one result per message.

Step 04 - Creating a struct 

A closer look at the matches property seems to indicate that:

  1. all matches have only size = 1
  2. all arrays inside matches have size = 10 (the same number of capturing groups we defined plus one)

As such we can extract the fields using the following query:

SELECT
  CAST(matches[0][1] AS INT) as year,
  CAST(matches[0][2] AS INT) as month,
  CAST(matches[0][3] AS INT) as day,
  CAST(matches[0][4] AS INT) as hour,
  CAST(matches[0][5] AS INT) as minutes,
  CAST(matches[0][6] AS INT) as seconds,
  CAST(matches[0][7] AS INT) as millis,
  matches[0][8] as log_level,
  matches[0][9] as log_msg
FROM
    (
    SELECT
       log_line,
       regex_matches(log_line, '^\[(\d{4})-(\d{2})-(\d{2}) (\d{2})\:(\d{2}):(\d{2}),(\d{3})\] (INFO|ERROR|WARN) (.*)$') as matches
     FROM
       log_lines
     WHERE
       SIZEOF(regexp(log_line, '^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$')) >= 1
    )

Step 05 - Creating a processor 

Finally, the last step is to make a processor that will continuously apply the changes we’ve describe above. That can be done by creating a processor with the following query:

SET defaults.topic.autocreate=true;
INSERT INTO
 structured_logs
SELECT STREAM
  CAST(matches[0][1] AS INT) as year,
  CAST(matches[0][2] AS INT) as month,
  CAST(matches[0][3] AS INT) as day,
  CAST(matches[0][4] AS INT) as hour,
  CAST(matches[0][5] AS INT) as minutes,
  CAST(matches[0][6] AS INT) as seconds,
  CAST(matches[0][7] AS INT) as millis,
  matches[0][8] as log_level,
  matches[0][9] as log_msg
FROM
    (
    SELECT STREAM
       log_line,
       regex_matches(log_line, '^\[(\d{4})-(\d{2})-(\d{2}) (\d{2})\:(\d{2}):(\d{2}),(\d{3})\] (INFO|ERROR|WARN) (.*)$') as matches
     FROM
       log_lines
     WHERE
       SIZEOF(regexp(log_line, '^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$')) >= 1
    )
--
Last modified: January 22, 2024