REGEXΒΆ

Processing application logs as they arrive is quite a common scenario. Looking at the log entries as a stream of data which can be aggregated or even joined in real-time can add a lot of value when it comes to monitoring what your apps are doing.

LSQL engine supports the projection of an unstructured text line to a well defined structure based on a regular expression used to extract the information. This way turning your log data into searchable data can be achieved in matter of minutes.

Each log line is published to Kafka as a message are of type STRING. To be able to project from the primitive type to a structured type, in this case JSON, the following decoder has been introduced STRING_TO_JSON. When used, it wraps the text in a simple structure and the message value is referenced via the content field.

SELECT content
FROM `topic`
WHERE _vtype=STRING_TO_JSON

Imagine there is a requirement to process the logs from a set of running web application, and extract the information like incoming IP address, the API call, response code, etc. Here is an example of such a log entry:

127.0.0.1.800.00, 127.0.0.1.800.00 - - [08/Feb/2017:16:33:27 +0100] "GET /api/house/get_for_compare?idn=33&amp;code=99992&amp;type= HTTP/1.1<em>" 404 19636 "mywebsitelocation.com" "Mozilla/5.0 (Linux; Android 5.0.1; GT-I9505 Build/LRX22C) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36"</em>

One possible regular expression to apply to the given input can be seen below:

(.+?)\s-\s-\s\[(.+?)\]\s\"(.+?)\s(.+?)\s(.+?)\"\s(.+?)\s(.+?)\s\"(.+?)\"\s\"(.+?)\".*

Based on the above the output produced should be the following:

{
   "ip":"127.0.0.1.800.00, 127.0.0.1.800.00",
   "timestamp":"08/Feb/2017:16:33:27 +0100",
   "type":"GET",
   "request":"/api/house/get_for_compare?idn=33&amp;code=99992&amp;type=",
   "status":"404",
   "response_size":"19636",
   "website":"mywebsitelocation.com"
}

To achieve the above the next LSQL code should be used:

INSERT INTO $toTopic
WITH log_regex as
(
  SELECT STREAM regexp(content, '(.+?)\\s-\\s-\\s\\[(.+?)\\]\\s\\\"(.+?)\\s(.+?)\\s(.+?)\\\"\\s(.+?)\\s(.+?)\\s\\\"(.+?)\\\"\\s\\\"(.+?)\\\".*') as regx
  FROM $fromTopic
  WHERE _vtype='STRING_AS_JSON'
)

SELECT STREAM
  regx[1] as ip,
  regx[2] as `timestamp`,
  regx[3] as `type`,
  regx[4] as request,
  regx[6] as status,
  regx[7] as response_size,
  regx[8] as website
FROM log_regex
WHERE regx is not null

Above, is just a stream transformation. log_regex stream can be groupd or joined further if the requirements dictate to.