REGEXΒΆ

Processing application logs as they arrive is quite a common scenario. Looking at the log entries as a stream of data, which can be aggregated or even joined in real-time, can add a lot of value when it comes to monitoring what your applications are doing.

The Lenses SQL engine supports the projection of an unstructured text line to a well defined structure based on a regular expression used to extract the information. This way turning your log data into searchable data can be achieved in a matter of minutes.

Each log line is published to Kafka as a message of type STRING. To be able to project from the primitive type to a structured type, in this case JSON, the following decoder has been introduced STRING_TO_JSON. When used, it wraps the text in a simple structure and the message value is referenced via the content field.

SELECT content
FROM `topic`
WHERE _vtype=STRING_TO_JSON

Imagine there is a requirement to process the logs from a set of running web application and extract the information like incoming IP address, the API call, response code, etc. Here is an example of such a log entry:

127.0.0.1.800.00, 127.0.0.1.800.00 - - [08/Feb/2017:16:33:27 +0100] "GET /api/house/get_for_compare?idn=33&amp;code=99992&amp;type= HTTP/1.1<em>" 404 19636 "mywebsitelocation.com" "Mozilla/5.0 (Linux; Android 5.0.1; GT-I9505 Build/LRX22C) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36"</em>

One possible regular expression to apply to the given input can be seen below:

(.+?)\s-\s-\s\[(.+?)\]\s\"(.+?)\s(.+?)\s(.+?)\"\s(.+?)\s(.+?)\s\"(.+?)\"\s\"(.+?)\".*

Based on the above the output produced should be the following:

{
   "ip":"127.0.0.1.800.00, 127.0.0.1.800.00",
   "timestamp":"08/Feb/2017:16:33:27 +0100",
   "type":"GET",
   "request":"/api/house/get_for_compare?idn=33&amp;code=99992&amp;type=",
   "status":"404",
   "response_size":"19636",
   "website":"mywebsitelocation.com"
}

To achieve the above, the next SQL code should be used:

INSERT INTO $toTopic
WITH log_regex as
(
  SELECT STREAM regexp(content, '(.+?)\\s-\\s-\\s\\[(.+?)\\]\\s\\\"(.+?)\\s(.+?)\\s(.+?)\\\"\\s(.+?)\\s(.+?)\\s\\\"(.+?)\\\"\\s\\\"(.+?)\\\".*') as regx
  FROM $fromTopic
  WHERE _vtype='STRING_AS_JSON'
)

SELECT STREAM
  regx[1] as ip,
  regx[2] as `timestamp`,
  regx[3] as `type`,
  regx[4] as request,
  regx[6] as status,
  regx[7] as response_size,
  regx[8] as website
FROM log_regex
WHERE regx is not null

Above, is just a stream transformation. log_regex stream can be grouped or joined further if the requirements dictate to.