1 of 29

Connectors

This page describes the Stream Reactor connector plugins.

Install

Learn how to install Connectors.

Sources

Learn about the available source connectors.

Sinks

Learn about the available sink connectors.

Install

This page describes installing the Lenses Kafka Connectors.

If you do not use the plugin.path and add the connectors directly to the CLASSPATH you may have dependency conflicts.

Download the release and unpack.

Within the unpacked directory you will find the following structure:

/opt/stream-reactor-x.x.x-x.x.x
├── bin
├── conf
├── libs
├── LICENSE

The libs directory contains all the Stream Reactor Connector jars. Edit your Connect worker properties add the path to the directory containing the connectors and restart your workers. Repeat this process for all the Connect workers in your cluster. The connectors must be available to all the workers.

Example:

plugin.path=/usr/share/connectors,/opt/stream-reactor-x.x.x-x.x.x/libs

Sources

This page details the configuration options for the Stream Reactor Kafka Connect source connectors.

Source connectors read data from external systems and write to Kafka.

AWS S3

Load data from AWS S3 including restoring topics.

Azure Data Lake Gen2

Load data from Azure Data Lake Gen2 including restoring topics.

Azure Event Hubs

Load data from Azure Event Hubs into Kafka topics.

Azure Service Bus

Load data from Azure Service Bus into Kafka topics.

Cassandra

Load data from Cassandra into Kafka topics.

GCP PubSub

Load data from GCP PubSub into Kafka topics.

GCP Storage

Load data from GCP Storage including restoring topics.

FTP

Load data from files on FTP servers into Kafka topics.

JMS

Load data from JMS topics and queues into Kafka topics.

MQTT

Load data from MQTT into Kafka topics.

AWS S3

This page describes the usage of the Stream Reactor AWS S3 Source Connector.

This connector is also available on the AWS Marketplace.

AWS Marketplace: Lenses Stream Reactor: S3 Source & Sink Kafka Connectors

Objects that have been archived to AWS Glacier storage class are skipped, in order to load these objects you must manually restore the objects. Skipped objects are logged in the Connect workers log files.

Connector Class

io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector

Example

For more examples see the tutorials.

name=aws-s3SourceConnectorParquet
connector.class=io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector
tasks.max=1
connect.s3.kcql=insert into $TOPIC_NAME select * from $BUCKET_NAME:$PREFIX_NAME STOREAS `parquet`
connect.s3.aws.region=eu-west-2
connect.s3.aws.secret.key=SECRET_KEY
connect.s3.aws.access.key=ACCESS_KEY
connect.s3.aws.auth.mode=Credentials

KCQL Support

You can specify multiple KCQL statements separated by ; to have the connector sink into multiple topics.

The connector uses a SQL-like syntax to configure the connector behaviour. The full KCQL syntax is:

INSERT INTO $kafka-topic
SELECT *
FROM bucketAddress:pathPrefix
[BATCH=batch]
[STOREAS storage_format]
[LIMIT limit]
[PROPERTIES(
  'property.1'=x,
  'property.2'=x,
)]

Please note that you can employ escaping within KCQL for the INSERT INTO, SELECT * FROM, and PARTITIONBY clauses when necessary. For example, if you need to use a topic name that contains a hyphen, you can escape it as follows:

INSERT INTO `my-topic-with-hyphen`
SELECT *
FROM bucketAddress:pathPrefix

Source Bucket & Path

The S3 source location is defined within the FROM clause. The connector will read all objects from the given location considering the data partitioning and ordering options. Each data partition will be read by a single connector task.

The FROM clause format is:

FROM [bucketname]:pathprefix
//my-bucket-called-pears:my-folder-called-apples

If your data in AWS was not written by the Lenses AWS sink set to traverse a folder hierarchy in a bucket and load based on the last modified timestamp of the objects in the bucket. If LastModified sorting is used, ensure objects do not arrive late, or use a post-processing step to handle them.

connect.s3.source.partition.extractor.regex=none

connect.s3.source.ordering.type=LastModified

To load in alpha numeric order set the ordering type to AlphaNumeric.

Target Bucket & Path

The target Kafka topic is specified via the INSERT INTO clause. The connector will write all the records to the given topic:

INSERT INTO my-apples-topic SELECT * FROM  my-bucket-called-pears:my-folder-called-apples

S3 Object formats

The connector supports a range of storage formats, each with its own distinct functionality:

JSON: The connector will read objects containing JSON content, each line representing a distinct record.
Avro: The connector will read Avro-stored messages from S3 and translate them into Kafka’s native format.
Parquet: The connector will read Parquet-stored messages from S3 and translate them into Kafka’s native format.
Text: The connector will read objects containing lines of text, each line representing a distinct record.
CSV: The connector will read objects containing lines of text, each line representing a distinct record.
CSV_WithHeaders: The connector will read objects containing lines of text, each line representing a distinct record while skipping the header row.
Bytes: The connector will read objects containing bytes, each object is translated to a Kafka message.

Use the STOREAS clause to configure the storage format. The following options are available:

STOREAS `JSON`
STOREAS `Avro`
STOREAS `Parquet`
STOREAS `Text`
STOREAS `CSV`
STOREAS `CSV_WithHeaders`
STOREAS `Bytes`

Text Processing

When using Text storage, the connector provides additional configuration options to finely control how text content is processed.

Regex

In Regex mode, the connector applies a regular expression pattern, and only when a line matches the pattern is it considered a record. For example, to include only lines that start with a number, you can use the following configuration:

connect.s3.kcql=insert into $kafka-topic select * from lensesio:regex STOREAS `text` PROPERTIES('read.text.mode'='regex', 'read.text.regex'='^[1-9].*')

Start-End line

In Start-End Line mode, the connector reads text content between specified start and end lines, inclusive. This mode is useful when you need to extract records that fall within defined boundaries. For instance, to read records where the first line is ‘SSM’ and the last line is an empty line (’’), you can configure it as follows:

connect.s3.kcql=insert into $kafka-topic select * from lensesio:multi_line STOREAS `text` PROPERTIES('read.text.mode'='startEndLine', 'read.text.start.line'='SSM', 'read.text.end.line'='')

To trim the start and end lines, set the read.text.trim property to true:

connect.s3.kcql=insert into $kafka-topic select * from lensesio:multi_line STOREAS `text` PROPERTIES('read.text.mode'='startEndLine', 'read.text.start.line'='SSM', 'read.text.end.line'='', 'read.text.trim'='true')

Start-End tag

In Start-End Tag mode, the connector reads text content between specified start and end tags, inclusive. This mode is particularly useful when a single line of text in S3 corresponds to multiple output Kafka messages. For example, to read XML records enclosed between ‘’ and ‘’, configure it as follows:

connect.s3.kcql=insert into $kafka-topic select * from lensesio:xml STOREAS `text` PROPERTIES('read.text.mode'='startEndTag', 'read.text.start.tag'='<SSM>', 'read.text.end.tag'='</SSM>')

Storage output matrix

Depending on the storage format of Kafka topics’ messages, the need for replication to a different cluster, and the specific data analysis requirements, there exists a guideline on how to effectively utilize converters for both sink and source operations. This guidance aims to optimize performance and minimize unnecessary CPU and memory usage.

S3 Storage Format

Kafka Output Format

Restore or replicate cluster

Analytics

Sink Converter

Source Converter

JSON

STRING

Same,Other

Yes, No

StringConverter

AVRO,Parquet

STRING

Same,Other

Yes

StringConverter

AVRO,Parquet

STRING

Same,Other

ByteArrayConverter

JSON

Same,Other

Yes

JsonConverter

StringConverter

JSON

Same,Other

StringConverter

AVRO,Parquet

JSON

Same,Other

Yes,No

JsonConverter

JsonConverter or Avro Converter( Glue, Confluent)

AVRO,Parquet, JSON

BYTES

Same,Other

Yes,No

ByteArrayConverter

AVRO,Parquet

AVRO

Same

Yes

Avro Converter( Glue, Confluent)

AVRO,Parquet

AVRO

Same

ByteArrayConverter

AVRO,Parquet

AVRO

Other

Yes,No

Avro Converter( Glue, Confluent)

AVRO,Parquet

Protobuf

Same

Yes

Protobuf Converter( Glue, Confluent)

AVRO,Parquet

Protobuf

Same

ByteArrayConverter

AVRO,Parquet

Protobuf

Other

Yes,No

Protobuf Converter( Glue, Confluent)

AVRO,Parquet, JSON

Other

Same, Other

Yes,No

ByteArrayConverter

Projections

Currently, the connector does not offer support for SQL projection; consequently, anything other than a SELECT * query is disregarded. The connector will faithfully write all the record fields to Kafka exactly as they are.

Ordering

The S3 sink employs zero-padding in object names to ensure precise ordering, leveraging optimizations offered by the S3 API, guaranteeing the accurate sequence of object.

When using the S3 source alongside the S3 sink, the connector can adopt the same ordering method, ensuring data processing follows the correct chronological order. However, there are scenarios where S3 data is generated by applications that do not maintain lexical object key name order.

In such cases, to process object in the correct sequence, the source needs to list all objects in the bucket and sort them based on their last modified timestamp. To enable this behavior, set the connect.s3.source.ordering.type to LastModified. This ensures that the source correctly arranges and processes the data based on the timestamps of the objects.

If using LastModified sorting, ensure objects do not arrive late, or use a post-processing step to handle them.

Throttling

To limit the number of object keys the source reads from S3 in a single poll. The default value, if not specified, is 1000:

BATCH = 100

To limit the number of result rows returned from the source in a single poll operation, you can use the LIMIT clause. The default value, if not specified, is 10000.

LIMIT 10000

Object Extension Filtering

The AWS S3 Source Connector allows you to filter the objects to be processed based on their extensions. This is controlled by two properties: connect.s3.source.extension.excludes and connect.s3.source.extension.includes.

Excluding Object Extensions

The connect.s3.source.extension.excludes property is a comma-separated list of object extensions to exclude from the source object search. If this property is not configured, all objects are considered. For example, to exclude .txt and .csv objects, you would set this property as follows:

connect.s3.source.extension.excludes=txt,csv

Including Object Extensions

The connect.s3.source.extension.includes property is a comma-separated list of object extensions to include in the source object search. If this property is not configured, all objects are considered. For example, to include only .json and .xml objects, you would set this property as follows:

connect.s3.source.extension.includes=json,xml

Note: If both connect.s3.source.extension.excludes and connect.s3.source.extension.includes are set, the connector first applies the exclusion filter and then the inclusion filter.

Post-Processing Options

Post-processing options offer flexibility in managing how objects are handled after they have been processed. By configuring these options, users can automate tasks such as deleting objects to save storage space or moving files to an archive for compliance and data retention purposes. These features are crucial for efficient data lifecycle management, particularly in environments where storage considerations or regulatory requirements dictate the need for systematic handling of processed data.

Use Cases for Post-Processing Options

Deleting Objects After Processing
For scenarios where freeing up storage is critical and reprocessing is not necessary, configure the connector to delete objects after they are processed. This option is particularly useful in environments with limited storage capacity or where processed data is redundantly stored elsewhere.
Example:
```
INSERT INTO `my-topic`
SELECT * FROM `my-s3-bucket:my-prefix`
PROPERTIES (
    'post.process.action'=`DELETE`
)
```
Result: Objects are permanently removed from the S3 bucket after processing, effectively reducing storage usage and preventing reprocessing.
Moving Objects to an Archive Bucket
To preserve processed objects for archiving or compliance reasons, set the connector to move them to a designated archive bucket. This use case applies to organizations needing data retention strategies or for regulatory adherence by keeping processed records accessible but not in active use.
Example:
```
INSERT INTO `my-topic`
SELECT * FROM `my-s3-bucket:my-prefix`
PROPERTIES (
    'post.process.action'=`MOVE`,
    'post.process.action.bucket'=`archive-bucket`,
    'post.process.action.prefix'=`processed/`
)
```
Result: Objects are transferred to an archive-bucket, stored with an updated path that includes the processed/ prefix, maintaining an organized archive structure.

Properties

The PROPERTIES clause is optional and adds a layer of configuration to the connector. It enhances versatility by permitting the application of multiple configurations (delimited by ‘,’). The following properties are supported:

Name

Description

Type

Available Values

read.text.mode

Controls how Text content is read

Enum

Regex, StartEndTag, StartEndLine

read.text.regex

Regular Expression for Text Reading (if applicable)

String

read.text.start.tag

Start Tag for Text Reading (if applicable)

String

read.text.end.tag

End Tag for Text Reading (if applicable)

String

read.text.buffer.size

Text Buffer Size (for optimization)

Int

read.text.start.line

Start Line for Text Reading (if applicable)

String

read.text.end.line

End Line for Text Reading (if applicable)

String

read.text.trim

Trim Text During Reading

Boolean

store.envelope

Messages are stored as “Envelope”

Boolean

post.process.action

Defines the action to perform on source objects after successful processing.

Enum

DELETE or MOVE

post.process.action.bucket

Specifies the target bucket for the MOVE action (required for MOVE).

String

post.process.action.prefix

Specifies a new prefix for the object’s location when using the MOVE action (required for MOVE).

String

post.process.action.retain.dirs

Ensure that paths are retained after a post-process action, using a zero-byte object to represent the path.

Boolean

false

Authentication

The connector offers two distinct authentication modes:

Default: This mode relies on the default AWS authentication chain, simplifying the authentication process.
Credentials: In this mode, explicit configuration of AWS Access Key and Secret Key is required for authentication.

When selecting the “Credentials” mode, it is essential to provide the necessary access key and secret key properties. Alternatively, if you prefer not to configure these properties explicitly, the connector will follow the credentials retrieval order as described here.

Here’s an example configuration for the “Credentials” mode:

connect.s3.aws.auth.mode=Credentials
connect.s3.aws.region=eu-west-2
connect.s3.aws.access.key=$AWS_ACCESS_KEY
connect.s3.aws.secret.key=$AWS_SECRET_KEY

For enhanced security and flexibility when using the “Credentials” mode, it is highly advisable to utilize Connect Secret Providers. This approach ensures robust security practices while handling access credentials.

API Compatible systems

The connector can also be used against API compatible systems provided they implement the following:

listObjectsV2
listObjectsV2Pagbinator
putObject
getObject
headObject
deleteObjects
deleteObject

Option Reference

Name

Description

Type

Available Values

Default Value

connect.s3.aws.auth.mode

Specifies the AWS authentication mode for connecting to S3.

string

"Credentials," "Default"

"Default"

connect.s3.aws.access.key

Access Key for AWS S3 Credentials

string

connect.s3.aws.secret.key

Secret Key for AWS S3 Credentials

string

connect.s3.aws.region

AWS Region for S3 Bucket

string

connect.s3.pool.max.connections

Maximum Connections in the Connection Pool

int

-1 (undefined)

connect.s3.custom.endpoint

Custom Endpoint URL for S3 (if applicable)

string

connect.s3.kcql

Kafka Connect Query Language (KCQL) Configuration to control the connector behaviour

string

kcql configuration

connect.s3.vhost.bucket

Enable Virtual Hosted-style Buckets for S3

boolean

true, false

false

connect.s3.source.extension.excludes

A comma-separated list of object extensions to exclude from the source object search.

string

[Object extension filtering]({{< relref "#object-extension-filtering" >}})

connect.s3.source.extension.includes

A comma-separated list of object extensions to include in the source object search.

string

[object extension filtering]({{< relref "#object-extension-filtering" >}})

connect.s3.source.partition.extractor.type

Type of Partition Extractor (Hierarchical or Regex)

string

hierarchical, regex

connect.s3.source.partition.extractor.regex

Regex Pattern for Partition Extraction (if applicable)

string

connect.s3.ordering.type

Type of ordering for the S3 object keys to ensure the processing order.

string

AlphaNumeric, LastModified

AlphaNumeric

connect.s3.source.partition.search.continuous

If set to true the connector will continuously search for new partitions.

boolean

true, false

true

connect.s3.source.partition.search.excludes

A comma-separated list of paths to exclude from the partition search.

string

".indexes"

connect.s3.source.partition.search.interval

The interval in milliseconds between searching for new partitions.

long

300000

connect.s3.source.partition.search.recurse.levels

Controls how many levels deep to recurse when searching for new partitions

int

connect.s3.source.empty.results.backoff.initial.delay

Initial delay before retrying when no results are found.

long

1000 Milliseconds

connect.s3.source.empty.results.backoff.max.delay

Maximum delay before retrying when no results are found.

long

10000 Milliseconds

connect.s3.source.empty.results.backoff.multiplier

Multiplier to apply to the delay when retrying when no results are found.

double

2.0 Multiplier (x)

connect.s3.source.write.watermark.header

Write the record with kafka headers including details of the source and line number of the file.

boolean

true, false

false

Azure Data Lake Gen2

This page describes the usage of the Stream Reactor Azure Data Lake Gen2 Source Connector.

Coming soon!

Azure Event Hubs

This page describes the usage of the Stream Reactor Azure Event Hubs Source Connector.

A Kafka Connect source connector to read events from Azure Event Hubs and push them to Kafka.

In order to leverage Kafka API in your Event Hubs it has to be at least on Standard Pricing Tier. More info here.

Connector Class

io.lenses.streamreactor.connect.azure.eventhubs.source.AzureEventHubsSourceConnector

Example

For more examples see the tutorials.

Below example presents all the necessary parameters configuration in order to use Event Hubs connector. It contains all the necessary parameters (but nothing optional, so feel free to tweak it to your needs):

name=AzureEventHubsSourceConnector
connector.class=io.lenses.streamreactor.connect.azure.eventhubs.source.AzureEventHubsSourceConnector
tasks.max=1
connect.eventhubs.kcql=INSERT INTO azureoutput SELECT * FROM inputhub;
connect.eventhubs.source.connection.settings.bootstrap.servers=MYNAMESPACE.servicebus.windows.net:9093
connect.eventhubs.source.connection.settings.sasl.mechanism=PLAIN
connect.eventhubs.source.connection.settings.security.protocol=SASL_SSL
connect.eventhubs.source.connection.settings.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://MYNAMESPACE.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=SOME_SHARED_ACCESS_STRING;EntityPath=inputhub";

KCQL support

Connector allows for multiple KCQL commands.

The following KCQL is supported:

INSERT INTO <your-kafka-topic>
SELECT *
FROM <your-event-hub>;

The selection of fields from the Event Hubs message is not supported.

Payload support

As for now Azure Event Hubs Connector supports raw bytes passthrough from source Hub to Kafka Topic specified in the KCQL config.

Authentication

You can connect to Azure EventHubs passing specific JAAS parameters in configuration.

connect.eventhubs.connection.settings.bootstrap.servers=NAMESPACENAME.servicebus.windows.net:9093
connect.eventhubs.connection.settings.sasl.mechanism=PLAIN
connect.eventhubs.connection.settings.security.protocol=SASL_SSL
connect.eventhubs.connection.settings.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="{YOUR.EVENTHUBS.CONNECTION.STRING}";

Learn more about different methods of connecting to Event Hubs on Azure website. The only caveat is to add connector-specific prefix like in example above. See Fine-tunning the Kafka Connector for more info.

Fine-tunning the Kafka Connector

The Azure Event Hubs Connector utilizes the Apache Kafka API implemented by Event Hubs. This also allows fine-tuning for user-specific needs because the Connector passes all of the properties with a specific prefix directly to the consumer. The prefix is connect.eventhubs.connection.settings and when user specifies a property with it, it will be automatically passed to the Consumer.

User wants to fine-tune how much data records comes through the network at once. He specifies below property as part of his configuration for Azure Event Hubs Connector before starting it.

connect.eventhubs.connection.settings.max.poll.records = 100

It means that internal Kafka Consumer will poll at most 100 records at time (as max.poll.records is passed directly to it)

There are certain exceptions to this rule as couple of those are internally used in order to smoothly proceed with consumption. Those exceptions are listed below:

client.id - Connector sets it by itself
group.id - Connector sets it by itself
key.deserializer - Connector transitions bytes 1-to-1
value.deserializer - Connector transitions bytes 1-to-1
enable.auto.commit - connector automatically sets it to false and checks what offsets are committed in output topic instead

Option Reference

Name

Description

Type

Default Value

connect.eventhubs.source.connection.settings.bootstrap.servers

Specifies the Event Hubs server location.

string

connect.eventhubs.source.close.timeout

Amount of time (in seconds) for Consumer to close.

int

connect.eventhubs.source.default.offset

Specifies whether by default we should consume from earliest (default) or latest offset.

string

earliest

connect.eventhubs.kcql

Comma-separated output KCQL queries

string

Azure Service Bus

This page describes the usage of the Stream Reactor Azure Service Bus Source Connector.

This Kafka connector is designed to effortlessly ingest records from Azure Service Bus into your Kafka cluster. It leverages Microsoft Azure API to seamlessly transfer data from Service Buses, allowing for their safe transition and safekeeping both payloads and metadata (see Payload support). It provides its user with AT-LEAST-ONCE guarantee as the data is committed (marked as read) in Service Bus once Connector verifies it was successfully committed to designated Kafka topic. Azure Service Bus Source Connector supports both types of Service Buses: Queues and Topics.

Connector Class

io.lenses.streamreactor.connect.azure.servicebus.source.AzureServiceBusSourceConnector

Full Config Example

For more examples see the tutorials.

The following example presents all the mandatory configuration properties for the Service Bus connector. Please note there are also optional parameters listed (link to option reference??). Feel free to tweak the configuration to your requirements.

connector.class=io.lenses.streamreactor.connect.azure.servicebus.source.AzureServiceBusSourceConnector
name=AzureServiceBusSourceConnector
tasks.max=1
connect.servicebus.connection.string="Endpoint=sb://MYNAMESPACE.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=SOME_SHARED_ACCESS_STRING";
connect.servicebus.kcql=INSERT INTO output-topic SELECT * FROM servicebus-queue PROPERTIES('servicebus.type'='QUEUE');

KCQL support

You can specify multiple KCQL statements separated by ; to have the connector map between multiple topics.

The following KCQL is supported:

INSERT INTO <your-kafka-topic>
SELECT *
FROM <your-service-bus> 
PROPERTIES(...);

It allows you to map Service Bus of name <your-service-bus> to Kafka topic of name <your-kafka-topic> using the PROPERTIES specified.

The selection of fields from the Service Bus message is not supported.

Payload support

Azure Service Bus Connector follows specific pattern (Schema) of messages. Please look below for the format of the data transferred to Kafka Topics specified in the KCQL config.

Key Format (Schema)

Field Name

Schema Type

Description

MessageId

String

The message identifier that uniquely identifies the message and its payload.

Payload Format (Schema)

Field Name

Schema Type

Description

deliveryCount

int64

The number of the times this message was delivered to clients.

enqueuedTimeUtc

int64

The time at which this message was enqueued in Azure Service Bus.

contentType

Optional String

The content type of this message.

label

Optional String

The application specific message label.

correlationId

Optional String

The correlation identifier.

messageProperties

Optional String

The map of user application properties of this message.

partitionKey

Optional String

The partition key for sending a message to a partitioned entity.

replyTo

Optional String

The address of an entity to send replies to.

replyToSessionId

Optional String

The session identifier augmenting the ReplyTo address.

deadLetterSource

Optional String

The name of the queue or subscription that this message was enqueued on, before it was deadlettered.

timeToLive

int64

The duration before this message expires.

lockedUntilUtc

Optional int64

The time when the lock of this message expires.

sequenceNumber

Optional int64

The unique number assigned to a message by Azure Service Bus.

sessionId

Optional String

The session identifier for a session-aware entity.

lockToken

Optional String

The lock token for the current message.

messageBody

Optional bytes

The body of this message as a byte array.

getTo

Optional String

The “to” address.

Authentication

You can connect to an Azure Service Bus by passing your connection string in configuration. The connection string can be found in the Shared access policies section of your Azure Portal.

connect.servicebus.connection.string=Endpoint=sb://YOURNAMESPACE.servicebus.windows.net/;SharedAccessKeyName=YOUR_KEYNAME;SharedAccessKey=YOUR_ACCESS_KEY=

Learn more about different methods of connecting to Service Bus on the Azure Website.

Queues & Topics

The Azure Service Bus Connector connects to Service Bus via Microsoft API. In order to smoothly configure your mappings you have to pay attention to PROPERTIES part of your KCQL mappings. There are two cases here: reading from Service Bus of type QUEUE and of type TOPIC. Please refer to the relevant sections below. In case of further questions check Azure Service Bus documentation to learn more about those mechanisms.

Reading from Queues

In order to be reading from the queue there's an additional parameter that you need to pass with your KCQL mapping in the PROPERTIES part. This parameter is servicebus.type and it can take one of two values depending on the type of the service bus: QUEUE or TOPIC. Naturally for Queue we're interested in QUEUE here and we need to pass it.

connect.servicebus.kcql=INSERT INTO kafka-topic SELECT * FROM azure-queue PROPERTIES('servicebus.type'='QUEUE');

This is sufficient to enable you to create the mapping with your queue.

Reading from Topics

In order to be reading from the topic there are two additional parameters that you need to pass with your KCQL mapping in the PROPERTIES part:

Parameter servicebus.type which can take one of two values depending on the type of the service bus: QUEUE or TOPIC. For topic we're interested in TOPIC here and we need to pass it.
Parameter subscription.name which takes the (case-sensitive) value of a subscription name that you've created for this topic for the connector to use. Please use Azure Portal to create one.

Make sure your subscription exists otherwise you will get a similar error to this

Caused by: com.azure.core.amqp.exception.AmqpException: The messaging entity 'streamreactor:Topic:my-topic|lenses' could not be found. To know more visit https://aka.ms/sbResourceMgrExceptions.

Create the subscription per topic in the Azure Portal.

connect.servicebus.kcql=INSERT INTO kafka-topic SELECT * FROM azure-topic PROPERTIES('servicebus.type'='TOPIC','subscription.name'='subscription1');

This is sufficient to enable you to create the mapping with your topic.

Option Reference

KCQL Properties

Please find below all the necessary KCQL properties:

Name

Description

Type

Default Value

servicebus.type

Specifies Service Bus type: QUEUE or TOPIC

string

subscription.name

Specifies subscription name if Service Bus type is TOPIC

string

Configuration parameters

Please find below all the relevant configuration parameters:

Name

Description

Type

Default Value

connect.servicebus.connection.string

Specifies the Connection String to connect to Service Bus

string

connect.servicebus.kcql

Comma-separated output KCQL queries

string

connect.servicebus.source.task.records.queue.size

Specifies the Queue size between Service Bus Receivers and Kafka

int

5000

connect.servicebus.source.sleep.on.empty.poll.ms

The duration in milliseconds to sleep when no records are returned from the poll. This avoids a tight loop in Connect.

long

250

connect.servicebus.source.complete.retries.max

The maximum number of retries to complete a message.

int

connect.servicebus.source.complete.retries.min.backoff.ms

The minimum duration in milliseconds for the first backoff

long

1000

connect.servicebus.source.prefetch.count

The number of messages to prefetch from the Azure Service Bus.

int

2000

Cassandra

This page describes the usage of the Stream Reactor Cassandra Source Connector.

Kafka Connect Cassandra is a Source Connector for reading data from Cassandra and writing to Kafka.

Connector Class

io.lenses.streamreactor.connect.cassandra.source.CassandraSourceConnector

Example

For more examples see the tutorials.

name=cassandra
connector.class=io.lenses.streamreactor.connect.cassandra.source.CassandraSourceConnector
connect.cassandra.key.space=demo
connect.cassandra.kcql=INSERT INTO orders-topic SELECT * FROM orders PK created INCREMENTALMODE=TIMEUUID
connect.cassandra.contact.points=cassandra

KCQL support

You can specify multiple KCQL statements separated by ; to have the connector sink into multiple topics.

The following KCQL is supported:

INSERT INTO <your-topic>
SELECT FIELD,...
FROM <your-cassandra-table>
[PK FIELD]
[WITHFORMAT JSON]
[INCREMENTALMODE=TIMESTAMP|TIMEUUID|TOKEN|DSESEARCHTIMESTAMP]
[WITHKEY(<your-key-field>)]

Examples:

-- Select all columns from table orders and insert into a topic
-- called orders-topic, use column created to track new rows.
-- Incremental mode set to TIMEUUID
INSERT INTO orders-topic SELECT * FROM orders PK created INCREMENTALMODE=TIMEUUID

-- Select created, product, price from table orders and insert
-- into a topic called orders-topic, use column created to track new rows.
INSERT INTO orders-topic SELECT created, product, price FROM orders PK created.

Keyed JSON Format

The connector can write JSON to your Kafka topic using the WITHFORMAT JSON clause but the key and value converters must be set:

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter

In order to facilitate scenarios like retaining the latest value for a given device identifier, or support Kafka Streams joins without having to re-map the topic data the connector supports WITHKEY in the KCQL syntax.

INSERT INTO <topic>
SELECT <fields>
FROM <column_family>
PK <PK_field>
WITHFORMAT JSON
WITHUNWRAP INCREMENTALMODE=<mode>
WITHKEY(<key_field>)

Multiple key fields are supported using a delimiter:

// `[` enclosed by `]` denotes optional values
WITHKEY(field1 [, field2.A , field3]) [KEYDELIMITER='.']

The resulting Kafka record key content will be the string concatenation for the values of the fields specified. Optionally the delimiter can be set via the KEYDELIMITER keyword.

Keying is only supported in conjunction with the WITHFORMAT JSON clause

Incremental mode

This mode tracks new records added to a table. The columns to track are identified by the PK clause in the KCQL statement. Only one column can be used to track new records. The support Cassandra column data types are:

TIMESTAMP
TIMEUUID
TOKEN
DSESEARCHTIMESTAMP

If set to TOKEN this column value is wrapped inside Cassandra's token function which needs unwrapping with the WITHUNWRAP command.

You must use the Byte Order Partitioner for the TOKEN mode to work correctly or data will be missing from the Kafka topic. This is not recommended due to the creation of hotspots in Cassandra.

DSESEARCHTIMESTAMP will make a DSE Search queries using Solr instead of a native Cassandra query.

INSERT INTO <topic>
SELECT a, b, c, d
FROM keyspace.table
WHERE solr_query= 'pkCol:{2020-03-23T15:02:21Z TO 2020-03-23T15:30:12.989Z]}'
INCREMENTALMODE=DSESEARCHTIMESTAMP

Bulk Mode

The connector constantly loads the entire table.

Controlling throughput

The connector can be configured to:

Start from a particular offset - connect.cassandra.initial.offset
Increase or decrease the poll interval - connect.cassandra.import.poll.interval
Set a slice duration to query for in milliseconds - connect.cassandra.slice.duration

For a more detailed explanation of how to use Cassandra to Kafka options.

Source Data Type Mapping

The following CQL data types are supported:

CQL Type

Connect Data Type

TimeUUID

Optional String

UUID

Optional String

Inet

Optional String

Ascii

Optional String

Text

Optional String

Timestamp

Optional String

Date

Optional String

Tuple

Optional String

UDT

Optional String

Boolean

Optional Boolean

TinyInt

Optional Int8

SmallInt

Optional Int16

Int

Optional Int32

Decimal

Optional String

Float

Optional Float32

Counter

Optional Int64

BigInt

Optional Int64

VarInt

Optional Int64

Double

Optional Int64

Time

Optional Int64

Blob

Optional Bytes

Map

Optional [String -> MAP]

List

Optional [String -> ARRAY]

Set

Optional [String -> ARRAY]

Option Reference

Name

Description

Type

Default Value

connect.cassandra.contact.points

Initial contact point host for Cassandra including port.

string

localhost

connect.cassandra.port

Cassandra native port.

int

9042

connect.cassandra.key.space

Keyspace to write to.

string

connect.cassandra.username

Username to connect to Cassandra with.

string

connect.cassandra.password

Password for the username to connect to Cassandra with.

password

connect.cassandra.ssl.enabled

Secure Cassandra driver connection via SSL.

boolean

false

connect.cassandra.trust.store.path

Path to the client Trust Store.

string

connect.cassandra.trust.store.password

Password for the client Trust Store.

password

connect.cassandra.trust.store.type

Type of the Trust Store, defaults to JKS

string

JKS

connect.cassandra.key.store.type

Type of the Key Store, defauts to JKS

string

JKS

connect.cassandra.ssl.client.cert.auth

Enable client certification authentication by Cassandra. Requires KeyStore options to be set.

boolean

false

connect.cassandra.key.store.path

Path to the client Key Store.

string

connect.cassandra.key.store.password

Password for the client Key Store

password

connect.cassandra.consistency.level

Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas. Cassandra offers tunable consistency. For any given read or write operation, the client application decides how consistent the requested data must be.

string

connect.cassandra.fetch.size

The number of records the Cassandra driver will return at once.

int

5000

connect.cassandra.load.balancing.policy

Cassandra Load balancing policy. ROUND_ROBIN, TOKEN_AWARE, LATENCY_AWARE or DC_AWARE_ROUND_ROBIN. TOKEN_AWARE and LATENCY_AWARE use DC_AWARE_ROUND_ROBIN

string

TOKEN_AWARE

connect.cassandra.error.policy

Specifies the action to be taken if an error occurs while inserting the data. There are three available options: NOOP - the error is swallowed THROW - the error is allowed to propagate. RETRY - The exception causes the Connect framework to retry the message. The number of retries is set by connect.cassandra.max.retries. All errors will be logged automatically, even if the code swallows them.

string

THROW

connect.cassandra.max.retries

The maximum number of times to try the write again.

int

connect.cassandra.retry.interval

The time in milliseconds between retries.

int

60000

connect.cassandra.task.buffer.size

The size of the queue as read writes to.

int

10000

connect.cassandra.assigned.tables

The tables a task has been assigned.

string

connect.cassandra.batch.size

The number of records the source task should drain from the reader queue.

int

100

connect.cassandra.import.poll.interval

The polling interval between queries against tables for bulk mode.

long

1000

connect.cassandra.time.slice.ms

The range of time in milliseconds the source task the timestamp/timeuuid will use for query

long

10000

connect.cassandra.import.allow.filtering

Enable ALLOW FILTERING in incremental selects.

boolean

true

connect.cassandra.slice.duration

Duration to query for in target Cassandra table. Used to restrict query timestamp span

long

10000

connect.cassandra.slice.delay.ms

The delay between the current time and the time range of the query. Used to insure all of the data in the time slice is available

long

30000

connect.cassandra.initial.offset

The initial timestamp to start querying in Cassandra from (yyyy-MM-dd HH:mm:ss.SSS’Z’). Default 1900-01-01 00:00:00.0000000Z

string

1900-01-01 00:00:00.0000000Z

connect.cassandra.mapping.collection.to.json

Mapping columns with type Map, List and Set like json

boolean

true

connect.cassandra.kcql

KCQL expression describing field selection and routes.

string

GCP PubSub

This page describes the usage of the Stream Reactor Google PubSub Source Connector.

The Kafka connector is designed to seamlessly ingest records from GCP Pub/Sub topics and queues into your Kafka cluster. This makes it useful for backing up or streaming data from Pub/Sub to your Kafka infrastructure. This connector provides robust support for at least once semantics (this connector ensures that each record reaches the Kafka topic at least once).

Connector Class

io.lenses.streamreactor.connect.gcp.pubsub.source.GCPPubSubSourceConnector

Example

For more examples see the tutorials.

name=GcpPubSubSourceDemo
connector.class=io.lenses.streamreactor.connect.gcp.pubsub.source.GCPPubSubSourceConnector
topics=kafka_topic_to_write_to
tasks.max=1
connect.pubsub.gcp.auth.mode=File
connect.pubsub.gcp.file=/path/to/gcp-service-account-key.json
connect.pubsub.gcp.project.id=gcp-project-id
connect.pubsub.kcql=insert into `kafka_topic_to_write_to` select * from `gcp-subscription-id`

KCQL Support

The connector uses a SQL-like syntax to configure the connector behaviour. The full KCQL syntax is:

INSERT INTO kafka-topic
SELECT *
FROM subscriptionId
[PROPERTIES(
  'property.1'=x,
  'property.2'=x,
)]

Please note that you can employ escaping within KCQL for the INSERT INTO and SELECT * FROM clauses when necessary. For example, if you need to use a topic name that contains a hyphen, you can escape it as follows:

INSERT INTO `my-topic-with-hyphen`
SELECT *
FROM bucketAddress:pathPrefix

Source Subscription ID and Target Topic

The source and target of the data are specified via the INSERT INTO... SELECT * FROM clause. The connector will write all the records to the given topic, from the given subscription:

INSERT INTO my-topic SELECT * FROM subscriptionId;

Properties

The PROPERTIES clause is optional and adds a layer of configurability to the connector. It enhances versatility by permitting the application of multiple configurations (delimited by ',').

Properties can be defined in any order.

The following properties are supported:

Name

Description

Type

Default Value

batch.size

The maximum number of messages the connector will retrieve and process at one time per polling request (per KCQL mapping).

int

1000

cache.ttl

The maximum amount of time (in milliseconds) to store message data to allow acknowledgement of a message.

long

1 hour

queue.max

Data is loaded into a queue asynchronously so that it stands ready when the poll call is activated. Control the maximum number of records to hold in the queue per KCQL mapping.

int

10000

Auth Mode

The connector offers three distinct authentication modes:

Default: This mode relies on the default GCP authentication chain, simplifying the authentication process.
File: This mode uses a local (to the connect worker) path for a file containing GCP authentication credentials.
Credentials: In this mode, explicit configuration of a GCP Credentials string is required for authentication.

The simplest example to configure in the connector is the "Default" mode, as this requires no other configuration.

connect.pubsub.gcp.auth.mode=Default

When selecting the "Credentials" mode, it is essential to provide the necessary credentials. Alternatively, if you prefer not to configure these properties explicitly, the connector will follow the credentials retrieval order as described here.

Here's an example configuration for the "Credentials" mode:

connect.pubsub.gcp.auth.mode=Credentials
connect.pubsub.gcp.credentials=$GCP_CREDENTIALS
connect.pubsub.gcp.project.id=$GCP_PROJECT_ID

And here is an example configuration using the "File" mode:

connect.pubsub.gcp.auth.mode=File
connect.pubsub.gcp.file=/home/secure-stuff/gcp-read-credential.txt

Remember when using file mode the file will need to exist on every worker node in your Kafka connect cluster and be readable by the Kafka Connect process.

For enhanced security and flexibility when using either the "Credentials" mode, it is highly advisable to utilize Connect Secret Providers.

Output Modes

Two modes are available: Default Mode and Compatibility Mode.

Compatibility Mode is intended to ensure compatibility with existing tools, while Default Mode offers a simpler modern redesign of the functionality.

You can choose whichever suits your requirements.

Default Mode

Configuration

connect.pubsub.output.mode=DEFAULT

Record Schema

Each Pub/Sub message is transformed into a single Kafka record, structured as follows:

Kafka Key: A String of the Pub/Sub MessageID.
Kafka Value: The Pub/Sub message value as BYTES.
Kafka Headers: Includes the "PublishTimestamp" (in seconds) and all Pub/Sub message attributes mapped as separate headers.

Key Schema

The Kafka Key is mapped from the Pub/Sub MessageID, a unique ID for a Pub/Sub message.

Value Schema

The Kafka Value is mapped from the body of the Pub/Sub message.

Headers Schema

The Kafka Headers include:

PublishTimestamp: Long value representing the time when the Pub/Sub message was published, in seconds.
GCPProjectID: The GCP Project
PubSubTopicID: The Pub/Sub Topic ID.
PubSubSubscriptionID: The Pub/Sub Subscription ID.
All Pub/Sub message attributes: Each attribute from the Pub/Sub message is mapped as a separate header.

Compatibility Mode

Configuration

connect.pubsub.output.mode=COMPATIBILITY

Record Schema

Each Pub/Sub message is transformed into a single Kafka record, structured as follows:

Kafka Key: Comprises the project ID, message ID, and subscription ID of the Pub/Sub message.
Kafka Value: Contains the message data and attributes from the Pub/Sub message.

Key Schema

The Key is a structure with these fields:

Field Name

Schema Type

Description

ProjectId

String

The Pub/Sub project containing the topic from which messages are polled.

TopicId

String

The Pub/Sub topic containing the messages.

SubscriptionId

String

The Pub/Sub subscription of the Pub/Sub topic.

MessageId

String

A unique ID for a Pub/Sub message

Value Schema

The Value is a structure with these fields:

Field Name

Schema Type

Description

MessageData

Optional String

The body of the Pub/Sub message.

AttributeMap

Optional String

The attribute map associated with the Pub/Sub message.

Option Reference

Name

Description

Type

Available Values

Default Value

connect.pubsub.gcp.auth.mode

Specifies the authentication mode for connecting to GCP.

string

Credentials, File or Default

Default

connect.pubsub.gcp.credentials

For “auth.mode” credentials: GCP Authentication credentials string.

string

(Empty)

connect.pubsub.gcp.file

For “auth.mode” file: Local file path for file containing GCP authentication credentials.

string

(Empty)

connect.pubsub.gcp.project.id

GCP Project ID.

string

(Empty)

connect.pubsub.kcql

Kafka Connect Query Language (KCQL) Configuration to control the connector behaviour

string

kcql configuration

connect.pubsub.output.mode

Output mode. Please see Output Modes documentation below.

Default or Compatibility

Default

GCP Storage

This page describes the usage of the Stream Reactor GCP Storage Source Connector.

Connector Class

io.lenses.streamreactor.connect.gcp.storage.source.GCPStorageSourceConnector

Example

For more examples see the tutorials.

name=gcp-storageSourceConnectorParquet # this can be anything
connector.class=io.lenses.streamreactor.connect.gcp.storage.source.GCPStorageSourceConnector
tasks.max=1
connect.gcpstorage.kcql=insert into $TOPIC_NAME select * from $BUCKET_NAME:$PREFIX_NAME STOREAS `parquet`
connect.gcpstorage.gcp.auth.mode=Credentials
connect.gcpstorage.gcp.credentials=$GCP_CREDENTIALS
connect.gcpstorage.gcp.project.id=$GCP_PROJECT_ID

KCQL Support

You can specify multiple KCQL statements separated by ; to have the connector sink into multiple topics.

The connector uses a SQL-like syntax to configure the connector behaviour. The full KCQL syntax is:

INSERT INTO $kafka-topic
SELECT *
FROM bucketAddress:pathPrefix
[BATCH=batch]
[STOREAS storage_format]
[LIMIT limit]
[PROPERTIES(
  'property.1'=x,
  'property.2'=x,
)]

INSERT INTO `my-topic-with-hyphen`
SELECT *
FROM bucketAddress:pathPrefix

Source Bucket & Path

The GCP Storage source location is defined within the FROM clause. The connector will read all objects from the given location considering the data partitioning and ordering options. Each data partition will be read by a single connector task.

The FROM clause format is:

FROM [bucketname]:pathprefix
//my-bucket-called-pears:my-folder-called-apples

If your data in GCS was not written by the Lenses GCS sink set to traverse a folder hierarchy in a bucket and load based on the last modified timestamp of the objects in the bucket. If LastModified sorting is used, ensure objects do not arrive late, or use a post-processing step to handle them.

connect.gcpstorage.source.partition.extractor.regex=none

connect.gcpstorage.source.ordering.type=LastModified

To load in alpha numeric order set the ordering type to AlphaNumeric.

Target Bucket & Path

The target Kafka topic is specified via the INSERT INTO clause. The connector will write all the records to the given topic:

INSERT INTO my-apples-topic SELECT * FROM  my-bucket-called-pears:my-folder-called-apples

GCP Storage object formats

The connector supports a range of storage formats, each with its own distinct functionality:

JSON: The connector will read objects containing JSON content, each line representing a distinct record.
Avro: The connector will read Avro-stored messages from GCP Storage and translate them into Kafka’s native format.
Parquet: The connector will read Parquet-stored messages from GCP Storage and translate them into Kafka’s native format.
Text: The connector will read objects containing lines of text, each line representing a distinct record.
CSV: The connector will read objects containing lines of text, each line representing a distinct record.
CSV_WithHeaders: The connector will read objects containing lines of text, each line representing a distinct record while skipping the header row.
Bytes: The connector will read objects containing bytes, each object is translated to a Kafka message.

Use the STOREAS clause to configure the storage format. The following options are available:

STOREAS `JSON`
STOREAS `Avro`
STOREAS `Parquet`
STOREAS `Text`
STOREAS `CSV`
STOREAS `CSV_WithHeaders`
STOREAS `Bytes`

Text Processing

When using Text storage, the connector provides additional configuration options to finely control how text content is processed.

Regex

connect.gcpstorage.kcql=insert into $kafka-topic select * from lensesio:regex STOREAS `text` PROPERTIES('read.text.mode'='regex', 'read.text.regex'='^[1-9].*')

Start-End line

connect.gcpstorage.kcql=insert into $kafka-topic select * from lensesio:multi_line STOREAS `text` PROPERTIES('read.text.mode'='startEndLine', 'read.text.start.line'='SSM', 'read.text.end.line'='')

To trim the start and end lines, set the read.text.trim property to true:

connect.gcpstorage.kcql=insert into $kafka-topic select * from lensesio:multi_line STOREAS `text` PROPERTIES('read.text.mode'='startEndLine', 'read.text.start.line'='SSM', 'read.text.end.line'='', 'read.text.trim'='true')

Start-End tag

 connect.gcpstorage.kcql=insert into $kafka-topic select * from lensesio:xml STOREAS `text` PROPERTIES('read.text.mode'='startEndTag', 'read.text.start.tag'='<SSM>', 'read.text.end.tag'='</SSM>')

Storage output matrix

S3 Storage Format

Kafka Output Format

Restore or replicate cluster

Analytics

Sink Converter

Source Converter

JSON

STRING

Same,Other

Yes, No

StringConverter

AVRO,Parquet

STRING

Same,Other

Yes

StringConverter

AVRO,Parquet

STRING

Same,Other

ByteArrayConverter

JSON

Same,Other

Yes

JsonConverter

StringConverter

JSON

Same,Other

StringConverter

AVRO,Parquet

JSON

Same,Other

Yes,No

JsonConverter

JsonConverter or Avro Converter( Glue, Confluent)

AVRO,Parquet, JSON

BYTES

Same,Other

Yes,No

ByteArrayConverter

AVRO,Parquet

AVRO

Same

Yes

Avro Converter( Glue, Confluent)

AVRO,Parquet

AVRO

Same

ByteArrayConverter

AVRO,Parquet

AVRO

Other

Yes,No

Avro Converter( Glue, Confluent)

AVRO,Parquet

Protobuf

Same

Yes

Protobuf Converter( Glue, Confluent)

AVRO,Parquet

Protobuf

Same

ByteArrayConverter

AVRO,Parquet

Protobuf

Other

Yes,No

Protobuf Converter( Glue, Confluent)

AVRO,Parquet, JSON

Other

Same, Other

Yes,No

ByteArrayConverter

Projections

Ordering

The sink employs zero-padding in object name[^3]s to ensure precise ordering, leveraging optimizations offered by the GCS API, guaranteeing the accurate sequence of objects.

When using the GCS source alongside the GCS sink, the connector can adopt the same ordering method, ensuring data processing follows the correct chronological order. However, there are scenarios where GCS data is generated by applications that do not maintain lexical object name order.

In such cases, to process objects in the correct sequence, the source needs to list all objects in the bucket and sort them based on their last modified timestamp. To enable this behavior, set the connect.gcpstorage.source.ordering.type to LastModified. This ensures that the source correctly arranges and processes the data based on the timestamps of the objects.

If using LastModified sorting, ensure objects do not arrive late, or use a post-processing step to handle them.

Throttling

To limit the number of object names the source reads from GCS in a single poll. The default value, if not specified, is 1000:

BATCH = 100

To limit the number of result rows returned from the source in a single poll operation, you can use the LIMIT clause. The default value, if not specified, is 10000.

LIMIT 10000

Object Extension Filtering

The GCP Storage Source Connector allows you to filter the objects to be processed based on their extensions. This is controlled by two properties: connect.gcpstorage.source.extension.excludes and connect.gcpstorage.source.extension.includes.

Excluding Object Extensions

The connect.gcpstorage.source.extension.excludes property is a comma-separated list of object extensions to exclude from the source object search. If this property is not configured, all objects are considered. For example, to exclude .txt and .csv objects, you would set this property as follows:

connect.gcpstorage.source.extension.excludes=txt,csv

Including Object Extensions

The connect.gcpstorage.source.extension.includes property is a comma-separated list of object extensions to include in the source object search. If this property is not configured, all objects are considered. For example, to include only .json and .xml objects, you would set this property as follows:

connect.gcpstorage.source.extension.includes=json,xml

Note: If both connect.gcpstorage.source.extension.excludes and connect.gcpstorage.source.extension.includes are set, the connector first applies the exclusion filter and then the inclusion filter.

Post-Processing Options

Post-processing options offer flexibility in managing how objects are handled after they have been processed. By configuring these options, users can automate tasks such as deleting objects to save storage space or moving objects to an archive for compliance and data retention purposes. These features are crucial for efficient data lifecycle management, particularly in environments where storage considerations or regulatory requirements dictate the need for systematic handling of processed data.

Use Cases for Post-Processing Options

Deleting objects After Processing
For scenarios where freeing up storage is critical and reprocessing is not necessary, configure the connector to delete objects after they are processed. This option is particularly useful in environments with limited storage capacity or where processed data is redundantly stored elsewhere.
Example:
```
INSERT INTO `my-topic`
SELECT * FROM `my-gcp-storage-bucket:my-prefix`
PROPERTIES (
    'post.process.action'=`DELETE`
)
```
Result: objects are permanently removed from the S3 bucket after processing, effectively reducing storage usage and preventing reprocessing.
Moving objects to an Archive Bucket
To preserve processed objects for archiving or compliance reasons, set the connector to move them to a designated archive bucket. This use case applies to organizations needing data retention strategies or for regulatory adherence by keeping processed records accessible but not in active use.
Example:
```
INSERT INTO `my-topic`
SELECT * FROM `my-gcp-storage-bucket:my-prefix`
PROPERTIES (
    'post.process.action'=`MOVE`,
    'post.process.action.bucket'=`archive-bucket`,
    'post.process.action.prefix'=`processed/`
)
```
Result: objects are transferred to an archive-bucket, stored with an updated path that includes the processed/ prefix, maintaining an organized archive structure.

Properties

Name

Description

Type

Available Values

read.text.mode

Controls how Text content is read

Enum

Regex, StartEndTag, StartEndLine

read.text.regex

Regular Expression for Text Reading (if applicable)

String

read.text.start.tag

Start Tag for Text Reading (if applicable)

String

read.text.end.tag

End Tag for Text Reading (if applicable)

String

read.text.buffer.size

Text Buffer Size (for optimization)

Int

read.text.start.line

Start Line for Text Reading (if applicable)

String

read.text.end.line

End Line for Text Reading (if applicable)

String

read.text.trim

Trim Text During Reading

Boolean

store.envelope

Messages are stored as “Envelope”

Boolean

post.process.action

Defines the action to perform on source objects after successful processing.

Enum

DELETE or MOVE

post.process.action.bucket

Specifies the target bucket for the MOVE action (required for MOVE).

String

post.process.action.prefix

Specifies a new prefix for the object’s location when using the MOVE action (required for MOVE).

String

Authentication

The connector offers two distinct authentication modes:

Default: This mode relies on the default GCP authentication chain, simplifying the authentication process.
File: This mode uses a local (to the connect worker) path for a file containing GCP authentication credentials.
Credentials: In this mode, explicit configuration of a GCP Credentials string is required for authentication.

The simplest example to configure in the connector is the “Default” mode, as this requires no other configuration.

connect.gcpstorage.gcp.auth.mode=Default

When selecting the “Credentials” mode, it is essential to provide the necessary credentials. Alternatively, if you prefer not to configure these properties explicitly, the connector will follow the credentials retrieval order as described here.

Here’s an example configuration for the “Credentials” mode:

connect.gcpstorage.gcp.auth.mode=Credentials
connect.gcpstorage.gcp.credentials=$GCP_CREDENTIALS
connect.gcpstorage.gcp.project.id=$GCP_PROJECT_ID

And here is an example configuration using the “File” mode:

connect.gcpstorage.gcp.auth.mode=File
connect.gcpstorage.gcp.file=/home/secure-stuff/gcp-read-credential.txt

Remember when using file mode the file will need to exist on every worker node in your Kafka connect cluster and be readable by the Kafka Connect process.

Backup and Restore

When used in tandem with the GCP Storage Sink Connector, the GCP Storage Source Connector becomes a powerful tool for restoring Kafka topics from GCP Storage. To enable this behavior, you should set store.envelope to true. This configuration ensures that the source expects the following data structure in GCP Storage:

{
  "key": <the message Key, which can be a primitive or a complex object>,
  "value": <the message Value, which can be a primitive or a complex object>,
  "headers": {
    "header1": "value1",
    "header2": "value2"
  },
  "metadata": {
    "offset": 0,
    "partition": 0,
    "timestamp": 0,
    "topic": "topic"
  }
}

When the messages are sent to Kafka, the GCP Storage Source Connector ensures that it correctly maps the key, value, headers, and metadata fields (including timestamp and partition) to their corresponding Kafka message fields. Please note that the envelope functionality can only be used with data stored in GCP Storage as Avro, JSON, or Parquet formats.

Partition Extraction

When the envelope feature is not in use, and data restoration is required, the responsibility falls on the connector to establish the original topic partition value. To ensure that the source correctly conveys the original partitions back to Kafka Connect during reads from the source, a partition extractor can be configured to extract this information from the GCP Storage object key.

To configure the partition extractor, you can utilize the connect.gcpstorage.source.partition.extractor.type property, which supports two options:

hierarchical: This option aligns with the default format used by the sink, topic/partition/offset.json.
regex: When selected, you can provide a custom regular expression to extract the partition information. Additionally, when using the regex option, you must also set the connect.gcpstorage.source.partition.extractor.regex property. It’s important to note that only one lookup group is expected. For an example of a regular expression pattern, please refer to the pattern used for hierarchical, which is:

(?i)^(?:.*)\/([0-9]*)\/(?:[0-9]*)[.](?:Json|Avro|Parquet|Text|Csv|Bytes)$

Option Reference

Name

Description

Type

Available Values

Default Value

connect.gcpstorage.gcp.auth.mode

Specifies the authentication mode for connecting to GCP.

string

"Credentials", "File" or "Default"

"Default"

connect.gcpstorage.gcp.credentials

For "auth.mode" credentials: GCP Authentication credentials string.

string

(Empty)

connect.gcpstorage.gcp.file

For "auth.mode" file: Local file path for file containing GCP authentication credentials.

string

(Empty)

connect.gcpstorage.gcp.project.id

GCP Project ID.

string

(Empty)

connect.gcpstorage.gcp.quota.project.id

GCP Quota Project ID.

string

(Empty)

connect.gcpstorage.endpoint

Endpoint for GCP Storage.

string

connect.gcpstorage.error.policy

Defines the error handling policy when errors occur during data transfer to or from GCP Storage.

string

"NOOP," "THROW," "RETRY"

"THROW"

connect.gcpstorage.max.retries

Sets the maximum number of retries the connector will attempt before reporting an error to the Connect Framework.

int

connect.gcpstorage.retry.interval

Specifies the interval (in milliseconds) between retry attempts by the connector.

int

60000

connect.gcpstorage.http.max.retries

Sets the maximum number of retries for the underlying HTTP client when interacting with GCP Storage.

long

connect.gcpstorage.http.retry.interval

Specifies the retry interval (in milliseconds) for the underlying HTTP client. An exponential backoff strategy is employed.

long

connect.gcpstorage.kcql

Kafka Connect Query Language (KCQL) Configuration to control the connector behaviour

string

[kcql configuration]({{< relref "#kcql-support" >}})

connect.gcpstorage.source.extension.excludes

A comma-separated list of object extensions to exclude from the source object search.

string

[object extension filtering]({{< relref "#object-extension-filtering" >}})

connect.gcpstorage.source.extension.includes

A comma-separated list of object extensions to include in the source object search.

string

[object extension filtering]({{< relref "#object-extension-filtering" >}})

connect.gcpstorage.source.partition.extractor.type

Type of Partition Extractor (Hierarchical or Regex)

string

hierarchical, regex

connect.gcpstorage.source.partition.extractor.regex

Regex Pattern for Partition Extraction (if applicable)

string

connect.gcpstorage.source.partition.search.continuous

If set to true the connector will continuously search for new partitions.

boolean

true, false

true

connect.gcpstorage.source.partition.search.interval

The interval in milliseconds between searching for new partitions.

long

300000

connect.gcpstorage.source.partition.search.excludes

A comma-separated list of paths to exclude from the partition search.

string

".indexes"

connect.gcpstorage.source.partition.search.recurse.levels

Controls how many levels deep to recurse when searching for new partitions

int

connect.gcpstorage.ordering,type

Type of ordering for the GCS object keys to ensure the processing order.

string

AlphaNumeric, LastModified

AlphaNumeric

connect.gcpstorage.source.empty.results.backoff.initial.delay

Initial delay before retrying when no results are found.

long

1000 Milliseconds

connect.gcpstorage.source.empty.results.backoff.max.delay

Maximum delay before retrying when no results are found.

long

10000 Milliseconds

connect.gcpstorage.source.empty.results.backoff.multiplier

Multiplier to apply to the delay when retrying when no results are found.

double

2.0 Multiplier (x)

connect.gcpstorage.source.write.watermark.header

Write the record with kafka headers including details of the source and line number of the file.

boolean

true, false

false

FTP

This page describes the usage of the Stream Reactor FTP Source Connector.

Provide the remote directories and on specified intervals, the list of files in the directories is refreshed. Files are downloaded when they were not known before, or when their timestamp or size are changed. Only files with a timestamp younger than the specified maximum age are considered. Hashes of the files are maintained and used to check for content changes. Changed files are then fed into Kafka, either as a whole (update) or only the appended part (tail), depending on the configuration. Optionally, file bodies can be transformed through a pluggable system prior to putting them into Kafka.

Connector Class

io.lenses.streamreactor.connect.ftp.source.FtpSourceConnector

Example

For more examples see the tutorials.

name=ftp-source
connector.class=io.lenses.streamreactor.connect.ftp.source.FtpSourceConnector
tasks.max=1

#server settings
connect.ftp.address=localhost:21
connect.ftp.user=ftp
connect.ftp.password=ftp

#refresh rate, every minute
connect.ftp.refresh=PT1M

#ignore files older than 14 days.
connect.ftp.file.maxage=P14D

#monitor /forecasts/weather/ and /logs/ for appends to files.
#any updates go to the topics `weather` and `error-logs` respectively.
connect.ftp.monitor.tail=/forecasts/weather/:weather,/logs/:error-logs

#keep an eye on /statuses/, files are retrieved as a whole and sent to topic `status`
connect.ftp.monitor.update=/statuses/:status

#keystyle controls the format of the key and can be string or struct.
#string only provides the file name
#struct provides a structure with the filename and offset
connect.ftp.keystyle=struct

Data types

Each Kafka record represents a file and has the following types.

The format of the keys is configurable through connect.ftp.keystyle=string|struct. It can be a string with the file name, or a FileInfo structure with the name: string and offset: long. The offset is always 0 for files that are updated as a whole, and hence only relevant for tailed files.
The values of the records contain the body of the file as bytes.

Tailing Versus Update as a Whole

The following rules are used.

Tailed files are only allowed to grow. Bytes that have been appended to it since the last inspection are yielded. Preceding bytes are not allowed to change;

Updated files can grow, shrink and change anywhere. The entire contents are yielded.

Data converters

Instead of dumping whole file bodies (and the danger of exceeding Kafka’s message.max.bytes), one might want to give an interpretation to the data contained in the files before putting it into Kafka. For example, if the files that are fetched from the FTP are comma-separated values (CSVs), one might prefer to have a stream of CSV records instead. To allow to do so, the connector provides a pluggable conversion of SourceRecords. Right before sending a SourceRecord to the Connect framework, it is run through an object that implements:

package io.lenses.streamreactor.connect.ftp

trait SourceRecordConverter extends Configurable {
    def convert(in:SourceRecord) : java.util.List[SourceRecord]
}

The default object that is used is a pass-through converter, an instance of:

class NopSourceRecordConverter extends SourceRecordConverter{
    override def configure(props: util.Map[String, _]): Unit = {}
    override def convert(in: SourceRecord): util.List[SourceRecord] = Seq(in).asJava
}

To override it, create your own implementation of SourceRecordConverter and place the jar in the plugin.path.

connect.ftp.sourcerecordconverter=your.name.space.YourConverter

To learn more examples of using the FTP Kafka connector read this blog.

Option Reference

Name

Description

Type

Default Value

connect.ftp.address

host[:port] of the ftp server

string

connect.ftp.user

Username to connect with

string

connect.ftp.password

Password to connect with

string

connect.ftp.refresh

iso8601 duration that the server is polled

string

connect.ftp.file.maxage

iso8601 duration for how old files can be

string

connect.ftp.keystyle

SourceRecord keystyle, string or struct

string

connect.ftp.protocol

Protocol to use, FTP or FTPS

string

ftp

connect.ftp.timeout

Ftp connection timeout in milliseconds

int

30000

connect.ftp.filter

Regular expression to use when selecting files for processing

string

connect.ftp.monitor.tail

Comma separated lists of path:destinationtopic; tail of file to tracked

string

connect.ftp.monitor.update

Comma separated lists of path:destinationtopic; whole file is tracked

string

connect.ftp.monitor.slicesize

File slice size in bytes

int

-1

connect.ftp.fileconverter

File converter class

string

io.lenses.streamreactor.connect.ftp.source.SimpleFileConverter

connect.ftp.sourcerecordconverter

Source record converter class

string

io.lenses.streamreactor.connect.ftp.source.NopSourceRecordConverter

connect.ftp.max.poll.records

Max number of records returned per poll

int

10000

JMS

This page describes the usage of the Stream Reactor JMS Source Connector.

A Kafka Connect JMS source connector to subscribe to messages on JMS queues and topics and write them to a Kafka topic.

The connector uses the standard JMS protocols and has been tested against ActiveMQ.

The connector allows for the JMS initial.context.factory and connection.factory to be set according to your JMS provider. The appropriate implementation jars must be added to the CLASSPATH of the connect workers or placed in the plugin.path of the connector.

Each JMS message is committed only when it has been written to Kafka. If a failure happens when writing to Kafka, i.e. the message is too large, then the JMS message will not be acknowledged. It will stay in the queue so it can be actioned upon.

The schema of the messages is fixed and can be found under Data Types unless a converter is used.

You must provide the JMS implementation jars for your JMS service.

Connector Class

io.lenses.streamreactor.connect.jms.source.JMSSourceConnector

Example

For more examples see the tutorials.

name=jms-source
connector.class=io.lenses.streamreactor.connect.jms.source.JMSSourceConnector
tasks.max=1
connect.jms.kcql=INSERT INTO jms SELECT * FROM jms-queue WITHTYPE QUEUE
connect.jms.initial.context.factory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
connect.jms.url=tcp://activemq:61616
connect.jms.connection.factory=ConnectionFactory

KCQL support

You can specify multiple KCQL statements separated by ; to have the connector sink into multiple topics.

The following KCQL is supported:

INSERT INTO kafka_topic
SELECT *
FROM jms_destination
WITHTYPE [TOPIC|QUEUE]
[WITHCONVERTER=`myclass`]

The selection of fields from the JMS message is not supported.

Examples:

-- Select from a JMS queue and write to a Kafka topic
INSERT INTO topicA SELECT * FROM jms_queue WITHTYPE QUEUE

-- Select from a JMS topic and write to a Kafka topic with a json converter
INSERT INTO topicA
SELECT * FROM jms_topic
WITHTYPE TOPIC
WITHCONVERTER=`io.lenses.streamreactor.connect.converters.source.AvroConverter`

Destination types

The connector supports both TOPICS and QUEUES, controlled by the WITHTYPE KCQL clause.

Message Converters

The connector supports converters to handle different message payload formats in the source topic or queue.

If no converter is provided the JMS message is converter to a Kafka Struct representation.

See source record converters.

Data type conversion

Name

Type

message_timestamp

Optional int64

correlation_id

Optional string

redelivered

Optional boolean

reply_to

Optional string

destination

Optional string

message_id

Optional string

mode

Optional int32

type

Optional string

priority

Optional int32

bytes_payload

Optional bytes

properties

Map of string

Option Reference

Name

Description

Type

Default Value

connect.jms.url

Provides the JMS broker url

string

connect.jms.initial.context.factory

Initial Context Factory, e.g: org.apache.activemq.jndi.ActiveMQInitialContextFactory

string

connect.jms.connection.factory

Provides the full class name for the ConnectionFactory compile to use, e.gorg.apache.activemq.ActiveMQConnectionFactory

string

ConnectionFactory

connect.jms.kcql

string

connect.jms.subscription.name

subscription name to use when subscribing to a topic, specifying this makes a durable subscription for topics

string

connect.jms.password

Provides the password for the JMS connection

password

connect.jms.username

Provides the user for the JMS connection

string

connect.jms.error.policy

Specifies the action to be taken if an error occurs while inserting the data. There are two available options: NOOP - the error is swallowed THROW - the error is allowed to propagate. RETRY - The exception causes the Connect framework to retry the message. The number of retries is based on The error will be logged automatically

string

THROW

connect.jms.retry.interval

The time in milliseconds between retries.

int

60000

connect.jms.max.retries

The maximum number of times to try the write again.

int

connect.jms.destination.selector

Selector to use for destination lookup. Either CDI or JNDI.

string

CDI

connect.jms.initial.context.extra.params

List (comma-separated) of extra properties as key/value pairs with a colon delimiter to supply to the initial context e.g. SOLACE_JMS_VPN:my_solace_vp

list

[]

connect.jms.batch.size

The number of records to poll for on the target JMS destination in each Connect poll.

int

100

connect.jms.polling.timeout

Provides the timeout to poll incoming messages

long

1000

connect.jms.source.default.converter

Contains a canonical class name for the default converter of a raw JMS message bytes to a SourceRecord. Overrides to the default can be done by using connect.jms.source.converters still. i.e. io.lenses.streamreactor.connect.converters.source.AvroConverter

string

connect.jms.converter.throw.on.error

If set to false the conversion exception will be swallowed and everything carries on BUT the message is lost!!; true will throw the exception.Default is false.

boolean

false

connect.converter.avro.schemas

If the AvroConverter is used you need to provide an avro Schema to be able to read and translate the raw bytes to an avro record. The format is $MQTT_TOPIC=$PATH_TO_AVRO_SCHEMA_FILE

string

connect.jms.headers

Contains collection of static JMS headers included in every SinkRecord The format is connect.jms.headers="$MQTT_TOPIC=rmq.jms.message.type:TextMessage,rmq.jms.message.priority:2;$MQTT_TOPIC2=rmq.jms.message.type:JSONMessage"

string

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false

connect.jms.evict.interval.minutes

Removes the uncommitted messages from the internal cache. Each JMS message is linked to the Kafka record to be published. Failure to publish a record to Kafka will mean the JMS message will not be acknowledged.

int

connect.jms.evict.threshold.minutes

The number of minutes after which an uncommitted entry becomes evictable from the connector cache.

int

connect.jms.scale.type

How the connector tasks parallelization is decided. Available values are kcql and default. If kcql is provided it will be based on the number of KCQL statements written; otherwise it will be driven based on the connector tasks.max

MQTT

This page describes the usage of the Stream Reactor MQTT Source Connector.

A Kafka Connect source connector to read events from MQTT and push them to Kafka.

Connector Class

io.lenses.streamreactor.connect.mqtt.source.MqttSourceConnector

Example

For more examples see the tutorials.

name=mqtt-source
connector.class=io.lenses.streamreactor.connect.mqtt.source.MqttSourceConnector
tasks.max=1
connect.mqtt.kcql=INSERT INTO mqtt SELECT * FROM /mjson WITHCONVERTER=`io.lenses.streamreactor.connect.converters.source.JsonSimpleConverter`
connect.mqtt.client.id=dm_source_id
connect.mqtt.hosts=tcp://mqtt:1883
connect.mqtt.service.quality=1

KCQL support

You can specify multiple KCQL statements separated by ; to have the connector sink into multiple topics.

The following KCQL is supported:

INSERT INTO <your-kafka-topic>
SELECT *
FROM <your-mqtt-topic>
[WITHCONVERTER=`myclass`]

The selection of fields from the JMS message is not supported.

Examples:

-- Insert mode, select all fields from topicA
-- and write to topic topic with converter myclass
INSERT INTO topic SELECT * FROM /mqttTopicA [WITHCONVERTER=myclass]

-- wildcard
INSERT INTO topic SELECT * FROM /mqttTopicA/+/sensors [WITHCONVERTER=`myclass`]

Keyed JSON format

To facilitate scenarios like retaining the latest value for a given device identifier, or support Kafka Streams joins without having to re-map the topic data the connector supports WITHKEY in the KCQL syntax.

Multiple key fields are supported using a delimiter:

// `[` enclosed by `]` denotes optional values
WITHKEY(field1 [, field2.A , field3]) [KEYDELIMITER='.']

The resulting Kafka record key content will be the string concatenation for the values of the fields specified. Optionally the delimiter can be set via the KEYDELIMITER keyword.

Shared and Wildcard Subscriptions

The connector supports both wildcard and shared subscriptions but the KCQL command must be placed inside single quotes.

-- wildcard
INSERT INTO kafkaTopic1
SELECT * FROM /mqttTopicA/+/sensors
WITHCONVERTER=`myclass`

Message converters

The connector supports converters to handle different messages payload format in the source topic. See source record converters.

Option Reference

Name

Description

Type

Default Value

connect.mqtt.hosts

Contains the MQTT connection end points.

string

connect.mqtt.username

Contains the Mqtt connection user name

string

connect.mqtt.password

Contains the Mqtt connection password

password

connect.mqtt.service.quality

Specifies the Mqtt quality of service

int

connect.mqtt.timeout

Provides the time interval to establish the mqtt connection

int

3000

connect.mqtt.clean

boolean

true

connect.mqtt.keep.alive

The keep alive functionality assures that the connection is still open and both broker and client are connected to the broker during the establishment of the connection. The interval is the longest possible period of time, which broker and client can endure without sending a message.

int

5000

connect.mqtt.client.id

Contains the Mqtt session client id

string

connect.mqtt.error.policy

string

THROW

connect.mqtt.retry.interval

The time in milliseconds between retries.

int

60000

connect.mqtt.max.retries

The maximum number of times to try the write again.

int

connect.mqtt.retained.messages

Specifies the Mqtt retained flag.

boolean

false

connect.mqtt.converter.throw.on.error

If set to false the conversion exception will be swallowed and everything carries on BUT the message is lost!!; true will throw the exception.Default is false.

boolean

false

connect.converter.avro.schemas

If the AvroConverter is used you need to provide an avro Schema to be able to read and translate the raw bytes to an avro record. The format is $MQTT_TOPIC=$PATH_TO_AVRO_SCHEMA_FILE in case of source converter, or $KAFKA_TOPIC=PATH_TO_AVRO_SCHEMA in case of sink converter

string

connect.mqtt.log.message

Logs received MQTT messages

boolean

false

connect.mqtt.kcql

Contains the Kafka Connect Query Language describing the sourced MQTT source and the target Kafka topics

string

connect.mqtt.polling.timeout

Provides the timeout to poll incoming messages

int

1000

connect.mqtt.share.replicate

Replicate the shared subscriptions to all tasks instead of distributing them

boolean

false

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false

connect.mqtt.ssl.ca.cert

Provides the path to the CA certificate file to use with the Mqtt connection

string

connect.mqtt.ssl.cert

Provides the path to the certificate file to use with the Mqtt connection

string

connect.mqtt.ssl.key

Certificate private [config] key file path.

string

connect.mqtt.process.duplicates

Process duplicate messages

boolean

false

Sinks

This page details the configuration options for the Stream Reactor Kafka Connect sink connectors.

Sink connectors read data from Kafka and write to an external system.

AWS S3

Sink data from Kafka to AWS S3 including backing up topics and offsets.

Azure CosmosDB

Sink data from Kafka to Azure CosmosDB.

Azure Data Lake Gen2

Sink data from Kafka to Azure Data Lake Gen2 including backing up topics and offsets.

Azure Event Hubs

Load data from Azure Event Hubs into Kafka topics.

Azure Service Bus

Sink data from Kafka to Azure Service Bus topics and queues.

Cassandra

Sink data from Kafka to Cassandra.

Elasticsearch

Sink data from Kafka to Elasticsearch.

GCP PubSub

Sink data from Kafka to GCP PubSub.

GCP Storage

Sink data from Kafka to GCP Storage.

HTTP Sink

Sink data from Kafka to a HTTP endpoint.

InfluxDB

Sink data from Kafka to InfluxDB.

JMS

Sink data from Kafka to JMS.

MongoDB

Sink data from Kafka to MongoDB.

MQTT

Sink data from Kafka to MQTT.

Redis

Sink data from Kafka to Redis.

FAQ

Can the datalakes sinks lose data?

Kafka topic retention policies determine how long a message is retained in a topic before it is deleted. If the retention period expires and the connector has not processed the messages, possibly due to not running or other issues, the unprocessed Kafka data will be deleted as per the retention policy. This can lead to significant data loss since the messages will no longer be available for the connector to sink to the target system.

Do the datalake sinks support exactly once semantics?

Yes, the datalakes connectors natively support exactly-once guarantees.

How do I escape dots in field names in KCQL?

Field names in Kafka message headers or values may contain dots (.). To access these correctly, enclose the entire target in backticks (```) and each segment which consists of a field name in single quotes ('):

INSERT INTO `_value.'customer.name'.'first.name'` SELECT * FROM topicA

How do I escape other special characters in field names in KCQL?

For field names with spaces or special characters, use a similar escaping strategy:

Field name with a space: `_value.'full name'`
Field name with special characters: `_value.'$special_characters!'`

This ensures the connector correctly extracts the intended fields and avoids parsing errors.

AWS S3

This page describes the usage of the Stream Reactor AWS S3 Sink Connector.

This Kafka Connect sink connector facilitates the seamless transfer of records from Kafka to AWS S3 Buckets. It offers robust support for various data formats, including AVRO, Parquet, JSON, CSV, and Text, making it a versatile choice for data storage. Additionally, it ensures the reliability of data transfer with built-in support for exactly-once semantics.

Connector Class

io.lenses.streamreactor.connect.aws.s3.sink.S3SinkConnector

Example

For more examples see the tutorials.

This example writes to a bucket called demo, partitioning by a field called ts, store as JSON.

connector.class=io.lenses.streamreactor.connect.aws.s3.sink.S3SinkConnector
connect.s3.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.ts STOREAS `JSON` PROPERTIES ('flush.size'=1000000, 'flush.interval'=30, 'flush.count'=5000)
topics=demo
name=demo

KCQL Support

You can specify multiple KCQL statements separated by ; to have a connector sink multiple topics. The connector properties topics or topics.regex are required to be set to a value that matches the KCQL statements.

The connector uses KCQL to map topics to S3 buckets and paths. The full KCQL syntax is:

INSERT INTO bucketAddress[:pathPrefix]
SELECT *
FROM kafka-topic
[[PARTITIONBY (partition[, partition] ...)] | NOPARTITION]
[STOREAS storage_format]
[PROPERTIES(
  'property.1'=x,
  'property.2'=x,
)]

Please note that you can employ escaping within KCQL for the INSERT INTO, SELECT * FROM, and PARTITIONBY clauses when necessary. For example, an incoming Kafka message stored as JSON can use fields containing .:

{
  ...
  "a.b": "value",
  ...
}

In this case, you can use the following KCQL statement:

INSERT INTO `bucket-name`:`prefix` SELECT * FROM `kafka-topic` PARTITIONBY `a.b`

Target Bucket and Path

The target bucket and path are specified in the INSERT INTO clause. The path is optional and if not specified, the connector will write to the root of the bucket and append the topic name to the path.

Here are a few examples:

INSERT INTO testbucket:pathToWriteTo SELECT * FROM topicA;
INSERT INTO testbucket SELECT * FROM topicA;
INSERT INTO testbucket:path/To/Write/To SELECT * FROM topicA PARTITIONBY fieldA;

SQL Projection

Currently, the connector does not offer support for SQL projection; consequently, anything other than a SELECT * query is disregarded. The connector will faithfully write all fields from Kafka exactly as they are.

Source Topic

The source topic is defined within the FROM clause. To avoid runtime errors, it’s crucial to configure either the topics or topics.regex property in the connector and ensure proper mapping to the KCQL statements.

Set the FROM clause to *. This will auto map the topic as a partition.

Partitioning & Object Keys

The object key serves as the filename used to store data in S3. There are two options for configuring the object key:

Default: The object key is automatically generated by the connector and follows the Kafka topic-partition structure. The format is $bucket/[$prefix]/$topic/$partition/offset.extension. The extension is determined by the chosen storage format.
Custom: The object key is driven by the PARTITIONBY clause. The format is either $bucket/[$prefix]/$topic/customKey1=customValue1/customKey2=customValue2/topic(partition_offset).extension (AWS Athena naming style mimicking Hive-like data partitioning) or $bucket/[$prefix]/customValue/topic(partition_offset).ext. The extension is determined by the selected storage format.

Custom keys and values can be extracted from the Kafka message key, message value, or message headers, as long as the headers are of types that can be converted to strings. There is no fixed limit to the number of elements that can form the object key, but you should be aware of AWS S3 key length restrictions.

The Connector automatically adds the topic name to the partition. There is no need to add it to the partition clause. If you want to explicitly add the topic or partition you can do so by using _topic and _partition.

The partition clause works on header, key and values fields of the Kafka message.

To extract fields from the message values, simply use the field names in the PARTITIONBY clause. For example:

PARTITIONBY fieldA, fieldB

However, note that the message fields must be of primitive types (e.g., string, int, long) to be used for partitioning.

You can also use the entire message key as long as it can be coerced into a primitive type:

PARTITIONBY _key

In cases where the Kafka message Key is not a primitive but a complex object, you can use individual fields within the message Key to create the S3 object key name:

PARTITIONBY _key.fieldA, _key.fieldB

Kafka message headers can also be used in the S3 object key definition, provided the header values are of primitive types easily convertible to strings:

PARTITIONBY _header.<header_key1>[, _header.<header_key2>]

Customizing the object key can leverage various components of the Kafka message. For example:

PARTITIONBY fieldA, _key.fieldB, _headers.fieldC

This flexibility allows you to tailor the object key to your specific needs, extracting meaningful information from Kafka messages to structure S3 object keys effectively.

To enable Athena-like partitioning, use the following syntax:

INSERT INTO $bucket[:$prefix]
SELECT * FROM $topic
PARTITIONBY fieldA, _key.fieldB, _headers.fieldC
STOREAS `AVRO`
PROPERTIES (
    'partition.include.keys'=true,
)

Rolling Windows

Storing data in Amazon S3 and partitioning it by time is a common practice in data management. For instance, you may want to organize your S3 data in hourly intervals. This partitioning can be seamlessly achieved using the PARTITIONBY clause in combination with specifying the relevant time field. However, it’s worth noting that the time field typically doesn’t adjust automatically.

To address this, we offer a Kafka Connect Single Message Transformer (SMT) designed to streamline this process.

Let’s consider an example where you need the object key to include the wallclock time (the time when the message was processed) and create an hourly window based on a field called timestamp. Here’s the connector configuration to achieve this:

connector.class=io.lenses.streamreactor.connect.aws.s3.sink.S3SinkConnector
connect.s3.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.metadata_id, _value.customer_id, _header.ts, _header.wallclock STOREAS `JSON` PROPERTIES ('flush.size'=1000000, 'flush.interval'=30, 'flush.count'=5000)
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH

In this example, the incoming Kafka message’s Value content includes a field called timestamp, represented as a long value indicating the epoch time in milliseconds. The TimestampConverter SMT will expertly convert this into a string value according to the format specified in the format.to.pattern property. Additionally, the insertWallclock SMT will incorporate the current wallclock time in the format you specify in the format property.

The PARTITIONBY clause then leverages both the timestamp field and the wallclock header to craft the object key, providing you with precise control over data partitioning.

Data Storage Format

While the STOREAS clause is optional, it plays a pivotal role in determining the storage format within AWS S3. It’s crucial to understand that this format is entirely independent of the data format stored in Kafka. The connector maintains its neutrality towards the storage format at the topic level and relies on the key.converter and value.converter settings to interpret the data.

Supported storage formats encompass:

AVRO
Parquet
JSON
CSV (including headers)
Text
BYTES

Opting for BYTES ensures that each record is stored in its own separate file. This feature proves particularly valuable for scenarios involving the storage of images or other binary data in S3. For cases where you prefer to consolidate multiple records into a single binary file, AVRO or Parquet are the recommended choices.

By default, the connector exclusively stores the Kafka message value. However, you can expand storage to encompass the entire message, including the key, headers, and metadata, by configuring the store.envelope property as true. This property operates as a boolean switch, with the default value being false. When the envelope is enabled, the data structure follows this format:

{
  "key": <the message Key, which can be a primitive or a complex object>,
  "value": <the message Key, which can be a primitive or a complex object>,
  "headers": {
    "header1": "value1",
    "header2": "value2"
  },
  "metadata": {
    "offset": 0,
    "partition": 0,
    "timestamp": 0,
    "topic": "topic"
  }
}

Utilizing the envelope is particularly advantageous in scenarios such as backup and restore or replication, where comprehensive storage of the entire message in S3 is desired.

Examples

Storing the message Value Avro data as Parquet in S3:

...
connect.s3.kcql=INSERT INTO lensesioaws:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` 
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
...

The converter also facilitates seamless JSON to AVRO/Parquet conversion, eliminating the need for an additional processing step before the data is stored in S3.

...
connect.s3.kcql=INSERT INTO lensesioaws:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` 
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
...

Enabling the full message stored as JSON in S3:

...
connect.s3.kcql=INSERT INTO lensesioaws:car_speed SELECT * FROM car_speed_events STOREAS `JSON` PROPERTIES('store.envelope'=true)
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
...

Enabling the full message stored as AVRO in S3:

...
connect.s3.kcql=INSERT INTO lensesioaws:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
...

If the restore (see the S3 Source documentation) happens on the same cluster, then the most performant way is to use the ByteConverter for both Key and Value and store as AVRO or Parquet:

...
connect.s3.kcql=INSERT INTO lensesioaws:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true)
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
...

Flush Options

The connector offers three distinct flush options for data management:

Flush by Count - triggers a file flush after a specified number of records have been written to it.
Flush by Size - initiates a file flush once a predetermined size (in bytes) has been attained.
Flush by Interval - enforces a file flush after a defined time interval (in seconds).

It’s worth noting that the interval flush is a continuous process that acts as a fail-safe mechanism, ensuring that files are periodically flushed, even if the other flush options are not configured or haven’t reached their thresholds.

Consider a scenario where the flush size is set to 10MB, and only 9.8MB of data has been written to the object, with no new Kafka messages arriving for an extended period of 6 hours. To prevent undue delays, the interval flush guarantees that the object is flushed after the specified time interval has elapsed. This ensures the timely management of data even in situations where other flush conditions are not met.

The flush options are configured using the flush.count, flush.size, and flush.interval properties. The settings are optional and if not specified the defaults are:

flush.count = 50_000
flush.size = 500000000 (500MB)
flush.interval = 3_600 (1 hour)

A connector instance can simultaneously operate on multiple topic partitions. When one partition triggers a flush, it will initiate a flush operation for all of them, even if the other partitions are not yet ready to flush.

Flushing By Interval

The next flush time is calculated based on the time the previous flush completed (the last modified time of the object written to S3). Therefore, by design, the sink connector’s behaviour will have a slight drift based on the time it takes to flush records and whether records are present or not. If Kafka Connect makes no calls to put records, the logic for flushing won't be executed. This ensures a more consistent number of records per object.

Properties

The PROPERTIES clause is optional and adds a layer of configuration to the connector. It enhances versatility by permitting the application of multiple configurations (delimited by ‘,’). The following properties are supported:

Name

Description

Type

Available Values

Default Value

padding.type

Specifies the type of padding to be applied.

LeftPad, RightPad, NoOp

LeftPad

padding.char

Defines the character used for padding.

Char

‘0’

padding.length.partition

Sets the padding length for the partition.

Int

padding.length.offset

Sets the padding length for the offset.

Int

partition.include.keys

Specifies whether partition keys are included.

Boolean

false Default (Custom Partitioning): true

store.envelope

Indicates whether to store the entire Kafka message

Boolean

store.envelope.fields.key

Indicates whether to store the envelope’s key.

Boolean

store.envelope.fields.headers

Indicates whether to store the envelope’s headers.

Boolean

store.envelope.fields.value

Indicates whether to store the envelope’s value.

Boolean

store.envelope.fields.metadata

Indicates whether to store the envelope’s metadata.

Boolean

flush.size

Specifies the size (in bytes) for the flush operation.

Long

500000000 (500MB)

flush.count

Specifies the number of records for the flush operation.

Int

50000

flush.interval

Specifies the interval (in seconds) for the flush operation

Long

3600(1h)

key.suffix

When specified it appends the given value to the resulting object key before the "extension" (avro, json, etc) is added

String

<empty>

The sink connector optimizes performance by padding the output objects. This proves beneficial when using the S3 Source connector to restore data. This object name padding ensures that objects are ordered lexicographically, allowing the S3 Source connector to skip the need for reading, sorting, and processing all objects, thereby enhancing efficiency.

Compression

AVRO and Parquet offer the capability to compress files as they are written. The GCP Storage Sink connector provides advanced users with the flexibility to configure compression options.

Here are the available options for the connect.gcpstorage.compression.codec, along with indications of their support by Avro, Parquet and JSON writers:

Compression

Avro Support

Avro (requires Level)

Parquet Support

UNCOMPRESSED

✅

SNAPPY

✅

GZIP

✅

LZ0

✅

LZ4

✅

BROTLI

✅

BZIP2

✅

ZSTD

✅

⚙️

✅

DEFLATE

✅

⚙️

✅

⚙️

Please note that not all compression libraries are bundled with the S3 connector. Therefore, you may need to manually add certain libraries to the classpath to ensure they function correctly.

Authentication

The connector offers two distinct authentication modes:

Default: This mode relies on the default AWS authentication chain, simplifying the authentication process.
Credentials: In this mode, explicit configuration of AWS Access Key and Secret Key is required for authentication.

Here’s an example configuration for the Credentials mode:

...
connect.s3.aws.auth.mode=Credentials
connect.s3.aws.region=eu-west-2
connect.s3.aws.access.key=$AWS_ACCESS_KEY
connect.s3.aws.secret.key=$AWS_SECRET_KEY
...

For enhanced security and flexibility when using the Credentials mode, it is highly advisable to utilize Connect Secret Providers.

Error policies

The connector supports Error policies.

API Compatible systems

The connector can also be used against API compatible systems provided they implement the following:

listObjectsV2
listObjectsV2Pagbinator
putObject
getObject
headObject
deleteObjects
deleteObject

Indexes Directory

The connector uses the concept of index objects that it writes to in order to store information about the latest offsets for Kafka topics and partitions as they are being processed. This allows the connector to quickly resume from the correct position when restarting and provides flexibility in naming the index objects.

By default, the index objects are grouped within a prefix named .indexes for all connectors. However, each connector will create and store its index objects within its own nested prefix inside this .indexes prefix.

You can configure the prefix for these index objects using the property connect.s3.indexes.name. This property specifies the path from the root of the S3 bucket. Note that even if you configure this property, the connector will still place the indexes within a nested prefix of the specified prefix.

Examples

Index Name (connect.s3.indexes.name)

Resulting Indexes Prefix Structure

Description

.indexes (default)

.indexes/<connector_name>/

The default setup, where each connector uses its own subdirectory within .indexes.

custom-indexes

custom-indexes/<connector_name>/

Custom root directory custom-indexes, with a subdirectory for each connector.

indexes/s3-connector-logs

indexes/s3-connector-logs/<connector_name>/

Uses a custom subdirectory s3-connector-logs within indexes, with a subdirectory for each connector.

logs/indexes

logs/indexes/<connector_name>/

Indexes are stored under logs/indexes, with a subdirectory for each connector.

Option Reference

Name

Description

Type

Available Values

Default Value

connect.s3.aws.auth.mode

Specifies the AWS authentication mode for connecting to S3.

string

“Credentials,” “Default”

“Default”

connect.s3.aws.access.key

The AWS Access Key used for authentication.

string

(Empty)

connect.s3.aws.secret.key

The AWS Secret Key used for authentication.

string

(Empty)

connect.s3.aws.region

The AWS Region where the S3 bucket is located.

string

(Empty)

connect.s3.pool.max.connections

Specifies the maximum number of connections allowed in the AWS Client’s HTTP connection pool when interacting with S3.

int

-1 (undefined)

connect.s3.custom.endpoint

Allows for the specification of a custom S3 endpoint URL if needed.

string

(Empty)

connect.s3.vhost.bucket

Enables the use of Vhost Buckets for S3 connections. Always set to true when custom endpoints are used.

boolean

true, false

false

connect.s3.error.policy

Defines the error handling policy when errors occur during data transfer to or from S3.

string

“NOOP,” “THROW,” “RETRY”

“THROW”

connect.s3.max.retries

Sets the maximum number of retries the connector will attempt before reporting an error to the Connect Framework.

int

connect.s3.retry.interval

Specifies the interval (in milliseconds) between retry attempts by the connector.

int

60000

connect.s3.http.max.retries

Sets the maximum number of retries for the underlying HTTP client when interacting with S3.

long

connect.s3.http.retry.interval

Specifies the retry interval (in milliseconds) for the underlying HTTP client. An exponential backoff strategy is employed.

long

connect.s3.local.tmp.directory

Enables the use of a local folder as a staging area for data transfer operations.

string

(Empty)

connect.s3.kcql

A SQL-like configuration that defines the behavior of the connector. Refer to the KCQL section below for details.

string

(Empty)

connect.s3.compression.codec

Sets the Parquet compression codec to be used when writing data to S3.

string

“UNCOMPRESSED,” “SNAPPY,” “GZIP,” “LZ0,” “LZ4,” “BROTLI,” “BZIP2,” “ZSTD,” “DEFLATE,” “XZ”

“UNCOMPRESSED”

connect.s3.compression.level

Sets the compression level when compression is enabled for data transfer to S3.

int

1-9

(Empty)

connect.s3.seek.max.files

Specifies the maximum threshold for the number of files the connector uses to ensure exactly-once processing of data.

int

connect.s3.indexes.name

Configure the indexes prefix for this connector.

string

".indexes"

connect.s3.exactly.once.enable

By setting to 'false', disable exactly-once semantics, opting instead for Kafka Connect’s native at-least-once offset management

boolean

true, false

true

connect.s3.schema.change.detector

Configure how the file will roll over upon receiving a record with a schema different from the accumulated ones. This property configures schema change detection with default (object equality), version (version field comparison), or compatibility (Avro compatibility checking).

string

default, version, compatibility

default

connect.s3.skip.null.values

Skip records with null values (a.k.a. tombstone records).

boolean

true, false

false

Azure CosmosDB

This page describes the usage of the Stream Reactor Azure CosmosDB Sink Connector.

A Kafka Connect sink connector for writing records from Kafka to Azure CosmosDB using the SQL API.

Connector Class

io.lenses.streamreactor.connect.azure.documentdb.sink.DocumentDbSinkConnector

Example

For more examples see the tutorials.

name=cosmosdb
connector.class=io.lenses.streamreactor.connect.azure.documentdb.sink.DocumentDbSinkConnector
tasks.max=1
topics=orders-string
connect.documentdb.kcql=INSERT INTO orders SELECT * FROM orders-string
connect.documentdb.db=dm
connect.documentdb.endpoint=[YOUR_AZURE_ENDPOINT]
connect.documentdb.db.create=true
connect.documentdb.master.key=[YOUR_MASTER_KEY]
connect.documentdb.batch.size=10

KCQL support

You can specify multiple KCQL statements separated by**;** to have a connector sink multiple topics. The connector properties topics or topics.regex are required to be set to a value that matches the KCQL statements.

The following KCQL is supported:

INSERT | UPSERT
INTO <your-collection>
SELECT FIELD, ...
FROM kafka_topic
[PK FIELDS,...]

Examples:

-- Insert mode, select all fields from topicA
-- and write to tableA
INSERT INTO collectionA SELECT * FROM topicA

-- UPSERT mode, select 3 fields and
-- rename from topicB and write to tableB
-- with primary key as the field id from the topic
UPSERT INTO tableB SELECT x AS a, y, z AS c FROM topicB PK id

Insert Mode

Insert is the default write mode of the sink. It inserts messages from Kafka topics into DocumentDB.

Upsert Mode

The Sink supports DocumentDB upsert functionality which replaces the existing row if a match is found on the primary keys.

This mode works with at least once delivery semantics on Kafka as the order is guaranteed within partitions. If the same record is delivered twice to the sink, it results in an idempotent write. The existing record will be updated with the values of the second which are the same.

If records are delivered with the same field or group of fields that are used as the primary key on the target table, but different values, the existing record in the target table will be updated.

Since records are delivered in the order they were written per partition the write is idempotent on failure or restart. Redelivery produces the same result.

Kafka payload support

This sink supports the following Kafka payloads:

Schema.Struct and Struct (Avro)
Schema.Struct and JSON
No Schema and JSON

Error policies

The connector supports Error policies.

Option Reference

Name

Description

Type

Default Value

connect.documentdb.endpoint

The Azure DocumentDb end point.

string

connect.documentdb.master.key

The connection master key

password

connect.documentdb.consistency.level

Determines the write visibility. There are four possible values: Strong,BoundedStaleness,Session or Eventual

string

Session

connect.documentdb.db

The Azure DocumentDb target database.

string

connect.documentdb.db.create

If set to true it will create the database if it doesn’t exist. If this is set to default(false) an exception will be raised.

boolean

false

connect.documentdb.proxy

Specifies the connection proxy details.

string

connect.documentdb.error.policy

Specifies the action to be taken if an error occurs while inserting the data There are two available options: NOOP - the error is swallowed THROW - the error is allowed to propagate. RETRY - The exception causes the Connect framework to retry the message. The number of retries is based on The error will be logged automatically

string

THROW

connect.documentdb.max.retries

The maximum number of times to try the write again.

int

connect.documentdb.retry.interval

The time in milliseconds between retries.

int

60000

connect.documentdb.kcql

KCQL expression describing field selection and data routing to the target DocumentDb.

string

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false

Azure Data Lake Gen2

This page describes the usage of the Stream Reactor Azure Datalake Gen 2 Sink Connector.

This Kafka Connect sink connector facilitates the seamless transfer of records from Kafka to Azure Data Lake Buckets. It offers robust support for various data formats, including AVRO, Parquet, JSON, CSV, and Text, making it a versatile choice for data storage. Additionally, it ensures the reliability of data transfer with built-in support for exactly-once semantics.

Connector Class

io.lenses.streamreactor.connect.datalake.sink.DatalakeSinkConnector

Example

For more examples see the tutorials.

connector.class=io.lenses.streamreactor.connect.datalake.sink.DatalakeSinkConnector
connect.datalake.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.metadata_id, _value.customer_id, _header.ts, _header.wallclock STOREAS `JSON` PROPERTIES('flush.interval'=600, 'flush.size'=1000000, 'flush.count'=5000)
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH

KCQL Support

The connector uses KCQL to map topics to Datalake buckets and paths. The full KCQL syntax is:

INSERT INTO bucketAddress[:pathPrefix]
SELECT *
FROM kafka-topic
[[PARTITIONBY (partition[, partition] ...)] | NOPARTITION]
[STOREAS storage_format]
[PROPERTIES(
  'property.1'=x,
  'property.2'=x,
)]

Please note that you can employ escaping within KCQL for the INSERT INTO, SELECT * FROM, and PARTITIONBY clauses when necessary. For example, an incoming Kafka message stored as JSON can use fields containing .:

{
  ...
  "a.b": "value",
  ...
}

In this case, you can use the following KCQL statement:

INSERT INTO `container-name`:`prefix` SELECT * FROM `kafka-topic` PARTITIONBY `a.b`

Target Bucket and Path

The target bucket and path are specified in the INSERT INTO clause. The path is optional and if not specified, the connector will write to the root of the bucket and append the topic name to the path.

Here are a few examples:

INSERT INTO testcontainer:pathToWriteTo SELECT * FROM topicA;
INSERT INTO testcontainer SELECT * FROM topicA;
INSERT INTO testcontainer:path/To/Write/To SELECT * FROM topicA PARTITIONBY fieldA;

SQL Projection

Source Topic

Set the FROM clause to *. This will auto map the topic as a partition.

KCQL Properties

Name

Description

Type

Available Values

Default Value

padding.type

Specifies the type of padding to be applied.

LeftPad, RightPad, NoOp

LeftPad

padding.char

Defines the character used for padding.

Char

‘0’

padding.length.partition

Sets the padding length for the partition.

Int

padding.length.offset

Sets the padding length for the offset.

Int

partition.include.keys

Specifies whether partition keys are included.

Boolean

false Default (Custom Partitioning): true

store.envelope

Indicates whether to store the entire Kafka message

Boolean

store.envelope.fields.key

Indicates whether to store the envelope’s key.

Boolean

store.envelope.fields.headers

Indicates whether to store the envelope’s headers.

Boolean

store.envelope.fields.value

Indicates whether to store the envelope’s value.

Boolean

store.envelope.fields..metadata

Indicates whether to store the envelope’s metadata.

Boolean

flush.size

Specifies the size (in bytes) for the flush operation.

Long

500000000 (500MB)

flush.count

Specifies the number of records for the flush operation.

Int

50000

flush.interval

Specifies the interval (in seconds) for the flush operation.

Long

3600 (1 hour)

key.suffix

When specified it appends the given value to the resulting object key before the "extension" (avro, json, etc) is added

String

<empty>

The sink connector optimizes performance by padding the output files, a practice that proves beneficial when using the Datalake Source connector to restore data. This file padding ensures that files are ordered lexicographically, allowing the Datalake Source connector to skip the need for reading, sorting, and processing all files, thereby enhancing efficiency.

Partitioning & File names

The object key serves as the filename used to store data in Datalake. There are two options for configuring the object key:

Default: The object key is automatically generated by the connector and follows the Kafka topic-partition structure. The format is $container/[$prefix]/$topic/$partition/offset.extension. The extension is determined by the chosen storage format.
Custom: The object key is driven by the PARTITIONBY clause. The format is either $container/[$prefix]/$topic/customKey1=customValue1/customKey2=customValue2/topic(partition_offset).extension (naming style mimicking Hive-like data partitioning) or $container/[$prefix]/customValue/topic(partition_offset).ext. The extension is determined by the selected storage format.

The partition clause works on Header, Key and Values fields of the Kafka message.

To extract fields from the message values, simply use the field names in the PARTITIONBY clause. For example:

PARTITIONBY fieldA, fieldB

However, note that the message fields must be of primitive types (e.g., string, int, long) to be used for partitioning.

You can also use the entire message key as long as it can be coerced into a primitive type:

PARTITIONBY _key

In cases where the Kafka message Key is not a primitive but a complex object, you can use individual fields within the message Key to create the Datalake object key name:

PARTITIONBY _key.fieldA, _key.fieldB

Kafka message headers can also be used in the Datalake object key definition, provided the header values are of primitive types easily convertible to strings:

PARTITIONBY _header.<header_key1>[, _header.<header_key2>]

Customizing the object key can leverage various components of the Kafka message. For example:

PARTITIONBY fieldA, _key.fieldB, _headers.fieldC

This flexibility allows you to tailor the object key to your specific needs, extracting meaningful information from Kafka messages to structure Datalake object keys effectively.

To enable Athena-like partitioning, use the following syn

INSERT INTO $container[:$prefix]
SELECT * FROM $topic
PARTITIONBY fieldA, _key.fieldB, _headers.fieldC
STOREAS `AVRO`
PROPERTIES (
    'partition.include.keys'=true,
)

Rolling Windows

Storing data in Azure Datalake and partitioning it by time is a common practice in data management. For instance, you may want to organize your Datalake data in hourly intervals. This partitioning can be seamlessly achieved using the PARTITIONBY clause in combination with specifying the relevant time field. However, it’s worth noting that the time field typically doesn’t adjust automatically.

To address this, we offer a Kafka Connect Single Message Transformer (SMT) designed to streamline this process. You can find the transformer plugin and documentation here.

connector.class=io.lenses.streamreactor.connect.azure.datalake.sink.DatalakeSinkConnector
connect.datalake.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.metadata_id, _value.customer_id, _header.ts, _header.wallclock STOREAS `JSON` PROPERTIES('flush.interval'=30, 'flush.size'=1000000, 'flush.count'=5000)
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH

The PARTITIONBY clause then leverages both the timestamp field and the wallclock header to craft the object key, providing you with precise control over data partitioning.

Data Storage Format

While the STOREAS clause is optional, it plays a pivotal role in determining the storage format within Azure Datalake. It’s crucial to understand that this format is entirely independent of the data format stored in Kafka. The connector maintains its neutrality towards the storage format at the topic level and relies on the key.converter and value.converter settings to interpret the data.

Supported storage formats encompass:

AVRO
Parquet
JSON
CSV (including headers)
Text
BYTES

Opting for BYTES ensures that each record is stored in its own separate file. This feature proves particularly valuable for scenarios involving the storage of images or other binary data in Datalake. For cases where you prefer to consolidate multiple records into a single binary file, AVRO or Parquet are the recommended choices.

{
  "key": <the message Key, which can be a primitive or a complex object>,
  "value": <the message Key, which can be a primitive or a complex object>,
  "headers": {
    "header1": "value1",
    "header2": "value2"
  },
  "metadata": {
    "offset": 0,
    "partition": 0,
    "timestamp": 0,
    "topic": "topic"
  }
}

Utilizing the envelope is particularly advantageous in scenarios such as backup and restore or replication, where comprehensive storage of the entire message in Datalake is desired.

Examples

Storing the message Value Avro data as Parquet in Datalake:

...
connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` 
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
...

The converter also facilitates seamless JSON to AVRO/Parquet conversion, eliminating the need for an additional processing step before the data is stored in Datalake.

...
connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` 
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
...

Enabling the full message stored as JSON in Datalake:

  ...
  connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `JSON` PROPERTIES('store.envelope'=true)
  value.converter=org.apache.kafka.connect.json.JsonConverter
  key.converter=org.apache.kafka.connect.storage.StringConverter
  ...

Enabling the full message stored as AVRO in Datalake:

...
connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
...

If the restore (see the Datalake Source documentation) happens on the same cluster, then the most performant way is to use the ByteConverter for both Key and Value and store as AVRO or Parquet:

...
connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true)
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
...

Flush Options

The connector offers three distinct flush options for data management:

Flush by Count - triggers a file flush after a specified number of records have been written to it.
Flush by Size - initiates a file flush once a predetermined size (in bytes) has been attained.
Flush by Interval - enforces a file flush after a defined time interval (in seconds).

Consider a scenario where the flush size is set to 10MB, and only 9.8MB of data has been written to the file, with no new Kafka messages arriving for an extended period of 6 hours. To prevent undue delays, the interval flush guarantees that the file is flushed after the specified time interval has elapsed. This ensures the timely management of data even in situations where other flush conditions are not met.

The flush options are configured using the flush.count, flush.size, and flush.interval KCQL Properties (see KCQL Properties section). The settings are optional and if not specified the defaults are:

flush.count = 50_000
flush.size = 500000000 (500MB)
flush.interval = 3600 (1 hour)

Flushing By Interval

The next flush time is calculated based on the time the previous flush completed (the last modified time of the file written to Data Lake). Therefore, by design, the sink connector’s behaviour will have a slight drift based on the time it takes to flush records and whether records are present or not. If Kafka Connect makes no calls to put records, the logic for flushing won't be executed. This ensures a more consistent number of records per file.

Compression

AVRO and Parquet offer the capability to compress files as they are written. The GCP Storage Sink connector provides advanced users with the flexibility to configure compression options.

Here are the available options for the connect.gcpstorage.compression.codec, along with indications of their support by Avro, Parquet and JSON writers:

Compression

Avro Support

Avro (requires Level)

Parquet Support

JSON

UNCOMPRESSED

✅

SNAPPY

✅

GZIP

✅

LZ0

✅

LZ4

✅

BROTLI

✅

BZIP2

✅

ZSTD

✅

⚙️

✅

DEFLATE

✅

⚙️

✅

⚙️

Please note that not all compression libraries are bundled with the Datalake connector. Therefore, you may need to manually add certain libraries to the classpath to ensure they function correctly.

Authentication

The connector offers two distinct authentication modes:

Default: This mode relies on the default Azure authentication chain, simplifying the authentication process.
Connection String: This mode enables simpler configuration by relying on the connection string to authenticate with Azure.
Credentials: In this mode, explicit configuration of Azure Access Key and Secret Key is required for authentication.

Here’s an example configuration for the “Credentials” mode:

...
connect.datalake.azure.auth.mode=Credentials
connect.datalake.azure.account.name=$AZURE_ACCOUNT_NAME
connect.datalake.azure.account.key=$AZURE_ACCOUNT_KEY
...

And here is an example configuration using the “Connection String” mode:

...
connect.datalake.azure.auth.mode=ConnectionString
connect.datalake.azure.connection.string=$AZURE_CONNECTION_STRING
...

For enhanced security and flexibility when using either the “Credentials” or “Connection String” modes, it is highly advisable to utilize Connect Secret Providers.

Error policies

The connector supports Error policies.

Indexes Directory

The connector uses the concept of index files that it writes to in order to store information about the latest offsets for Kafka topics and partitions as they are being processed. This allows the connector to quickly resume from the correct position when restarting and provides flexibility in naming the index files.

By default, the root directory for these index files is named .indexes for all connectors. However, each connector will create and store its index files within its own subdirectory inside this .indexes directory.

You can configure the root directory for these index files using the property connect.datalake.indexes.name. This property specifies the path from the root of the data lake filesystem. Note that even if you configure this property, the connector will still create a subdirectory within the specified root directory.

Examples

Index Name (connect.datalake.indexes.name)

Resulting Indexes Directory Structure

Description

.indexes (default)

.indexes/<connector_name>/

The default setup, where each connector uses its own subdirectory within .indexes.

custom-indexes

custom-indexes/<connector_name>/

Custom root directory custom-indexes, with a subdirectory for each connector.

indexes/datalake-connector-logs

indexes/datalake-connector-logs/<connector_name>/

Uses a custom subdirectory datalake-connector-logs within indexes, with a subdirectory for each connector.

logs/indexes

logs/indexes/<connector_name>/

Indexes are stored under logs/indexes, with a subdirectory for each connector.

Option Reference

Name

Description

Type

Available Values

Default Value

connect.datalake.azure.auth.mode

Specifies the Azure authentication mode for connecting to Datalake.

string

“Credentials”, “ConnectionString” or “Default”

“Default”

connect.datalake.azure.account.key

The Azure Account Key used for authentication.

string

(Empty)

connect.datalake.azure.account.name

The Azure Account Name used for authentication.

string

(Empty)

connect.datalake.pool.max.connections

Specifies the maximum number of connections allowed in the Azure Client’s HTTP connection pool when interacting with Datalake.

int

-1 (undefined)

connect.datalake.endpoint

Datalake endpoint URL.

string

(Empty)

connect.datalake.error.policy

Defines the error handling policy when errors occur during data transfer to or from Datalake.

string

“NOOP,” “THROW,” “RETRY”

“THROW”

connect.datalake.max.retries

Sets the maximum number of retries the connector will attempt before reporting an error to the Connect Framework.

int

connect.datalake.retry.interval

Specifies the interval (in milliseconds) between retry attempts by the connector.

int

60000

connect.datalake.http.max.retries

Sets the maximum number of retries for the underlying HTTP client when interacting with Datalake.

long

connect.datalake.http.retry.interval

Specifies the retry interval (in milliseconds) for the underlying HTTP client. An exponential backoff strategy is employed.

long

connect.datalake.local.tmp.directory

Enables the use of a local folder as a staging area for data transfer operations.

string

(Empty)

connect.datalake.kcql

A SQL-like configuration that defines the behavior of the connector. Refer to the KCQL section below for details.

string

(Empty)

connect.datalake.compression.codec

Sets the Parquet compression codec to be used when writing data to Datalake.

string

“UNCOMPRESSED,” “SNAPPY,” “GZIP,” “LZ0,” “LZ4,” “BROTLI,” “BZIP2,” “ZSTD,” “DEFLATE,” “XZ”

“UNCOMPRESSED”

connect.datalake.compression.level

Sets the compression level when compression is enabled for data transfer to Datalake.

int

1-9

(Empty)

connect.datalake.seek.max.files

Specifies the maximum threshold for the number of files the connector uses to ensure exactly-once processing of data.

int

connect.datalake.indexes.name

Configure the indexes root directory for this connector.

string

".indexes"

connect.datalake.exactly.once.enable

By setting to 'false', disable exactly-once semantics, opting instead for Kafka Connect’s native at-least-once offset management

boolean

true, false

true

connect.datalake.schema.change.detector

string

default, version, compatibility

default

connect.datalake.skip.null.values

Skip records with null values (a.k.a. tombstone records).

boolean

true, false

false

Azure Event Hubs

This page describes the usage of the Stream Reactor Azure Event Hubs Sink Connector.

Coming soon!

Azure Service Bus

This page describes the usage of the Stream Reactor Azure Service Bus Sink Connector.

Stream Reactor Azure Service Bus Sink Connector is designed to effortlessly translate Kafka records into your Azure Service Bus cluster. It leverages Microsoft Azure API to transfer data to Service Bus in a seamless manner, allowing for their safe transition and safekeeping both payloads and metadata (see Payload support). It supports both types of Service Buses: Queues and Topics. Azure Service Bus Source Connector provides its user with AT-LEAST-ONCE guarantee as the data is committed (marked as read) in Kafka topic (for assigned topic and partition) once Connector verifies it was successfully committed to designated Service Bus topic.

Connector Class

io.lenses.streamreactor.connect.azure.servicebus.sink.AzureServiceBusSinkConnector

Full Config Example

For more examples see the tutorials.

The following example presents all the mandatory configuration properties for the Service Bus connector. Please note there are also optional parameters listed in Option Reference. Feel free to tweak the configuration to your requirements.

connector.class=io.lenses.streamreactor.connect.azure.servicebus.sink.AzureServiceBusSinkConnector
name=AzureEventHubsSinkConnector
tasks.max=1
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
connect.servicebus.connection.string="Endpoint=sb://MYNAMESPACE.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=SOME_SHARED_ACCESS_STRING";
connect.servicebus.kcql=INSERT INTO output-servicebus SELECT * FROM input-topic PROPERTIES('servicebus.type'='QUEUE');

KCQL support

You can specify multiple KCQL statements separated by ; to have the connector map between multiple topics.

The following KCQL is supported:

INSERT INTO <your-service-bus>
SELECT *
FROM <your-kafka-topic>
PROPERTIES(...);

It allows you to map Kafka topic of name <your-kafka-topic> to Service Bus of name <your-service-bus> using the PROPERTIES specified (please check QUEUE and TOPIC Mappings for more info on necessary properties)

The selection of fields from the Service Bus message is not supported.

Authentication

You can connect to an Azure Service Bus by passing your connection string in configuration. The connection string can be found in the Shared access policies section of your Azure Portal.

connect.servicebus.connection.string=Endpoint=sb://YOURNAMESPACE.servicebus.windows.net/;SharedAccessKeyName=YOUR_KEYNAME;SharedAccessKey=YOUR_ACCESS_KEY=

Learn more about different methods of connecting to Service Bus on the Azure Website.

QUEUE and TOPIC Mappings

Writing to QUEUE ServiceBus

In order to be writing to the queue there's an additional parameter that you need to pass with your KCQL mapping in the PROPERTIES part. This parameter is servicebus.type and it can take one of two values depending on the type of the service bus: QUEUE or TOPIC. Naturally for Queue we're interested in QUEUE here and we need to pass it.

connect.servicebus.kcql=INSERT INTO azure-queue SELECT * FROM kafka-topic PROPERTIES('servicebus.type'='QUEUE');

This is sufficient to enable you to create the mapping with your queue.

Writing to TOPIC ServiceBus

In order to be writing to the topic there is an additional parameter that you need to pass with your KCQL mapping in the PROPERTIES part:

Parameter servicebus.type which can take one of two values depending on the type of the service bus: QUEUE or TOPIC. For topic we're interested in TOPIC here and we need to pass it.

connect.servicebus.kcql=INSERT INTO azure-topic SELECT * FROM kafka-topic PROPERTIES('servicebus.type'='TOPIC');

This is sufficient to enable you to create the mapping with your topic.

Disabling batching

If the Connector is supposed to transfer big messages (size of one megabyte and more), Service Bus may not want to accept a batch of such payloads, failing the Connector Task. In order to remediate that you may want to use batch.enabled parameter, setting it to false. This will sacrifice the ability to send the messages in batch (possibly doing it slower) but should enable user to transfer them safely.

For most of the usages, we recommend omitting it (it's set to true by default).

Kafka payload support

This sink supports the following Kafka payloads:

String Schema Key and Binary payload (then MessageId in Service Bus is set with Kafka Key)
any other key (or keyless) and Binary payload (this causes Service Bus messages to not have specified MessageId)
No Schema and JSON

Null Payload Transfer

Azure Service Bus doesn't allow to send messages with null content (payload)

Null Payload (sometimes referred as Kafka Tombstone) is a known concept in Kafka messages world. However, because of Service Bus limitations around that matter, we aren't allowed to send messages with null payload and we have to drop them instead.

Please keep that in mind when using Service Bus and designing business logic around null payloads!

Option Reference

KCQL Properties

Please find below all the necessary KCQL properties:

Name

Description

Type

Default Value

servicebus.type

Specifies Service Bus type: QUEUE or TOPIC

string

batch.enabled

Specifies if the Connector can send messages in batch, see #specifying-the-batching-parameter

boolean

true

Configuration parameters

Please find below all the relevant configuration parameters:

Name

Description

Type

Default Value

connect.servicebus.connection.string

Specifies the Connection String to connect to Service Bus

string

connect.servicebus.kcql

Comma-separated output KCQL queries

string

connect.servicebus.sink.retries.max

Number of retries if message has failed to be delivered to Service Bus

int

connect.servicebus.sink.retries.timeout

Timeout (in milliseconds) between retries if message has failed to be delivered to Service Bus

int

500

Cassandra

This page describes the usage of the Stream Reactor Cassandra Sink Connector.

The connector converts the value of Kafka messages to JSON and uses the Cassandra JSON insert feature to write records.

Connector Class

io.lenses.streamreactor.connect.cassandra.sink.CassandraSinkConnector

Example

For more examples see the tutorials.

name=cassandra-sink
connector.class=io.lenses.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=orders
connect.cassandra.kcql=INSERT INTO orders SELECT * FROM orders
connect.cassandra.port=9042
connect.cassandra.key.space=demo
connect.cassandra.contact.points=cassandra

KCQL support

The following KCQL is supported:

INSERT INTO <your-cassandra-table>
SELECT FIELD,...
FROM <your-table>
[TTL=Time to live]

Examples:

-- Insert mode, select all fields from topicA and
-- write to tableA
INSERT INTO tableA SELECT * FROM topicA

-- Insert mode, select 3 fields and rename from topicB
-- and write to tableB
INSERT INTO tableB SELECT x AS a, y, c FROM topicB

-- Insert mode, select 3 fields and rename from topicB
-- and write to tableB with TTL
INSERT INTO tableB SELECT x, y FROM topicB TTL=100000

Deletion in Cassandra

Compacted topics in Kafka retain the last message per key. Deletion in Kafka occurs by tombstoning. If compaction is enabled on the topic and a message is sent with a null payload, Kafka flags this record for deletion and is compacted/removed from the topic.

Deletion in Cassandra is supported based on fields in the key of messages with an empty/null payload. A Cassandra delete statement must be provided which specifies the Cassandra CQL delete statement and with parameters to bind field values from the key to, for example, with the delete statement of:

DELETE FROM orders WHERE id = ? and product = ?

If a message was received with an empty/null value and key fields key.id and key.product the final bound Cassandra statement would be:

# Message
# "{ "key": { "id" : 999, "product" : "DATAMOUNTAINEER" }, "value" : null }"
# DELETE FROM orders WHERE id = 999 and product = "DATAMOUNTAINEER"

# connect.cassandra.delete.enabled=true
# connect.cassandra.delete.statement=DELETE FROM orders WHERE id = ? and product = ?
# connect.cassandra.delete.struct_flds=id,product

Deletion will only occur if a message with an empty payload is received from Kafka.

Ensure your ordinal position of the connect.cassandra.delete.struct_flds matches the binding order in the Cassandra delete statement!

Kafka payload support

This sink supports the following Kafka payloads:

Schema.Struct and Struct (Avro)
Schema.Struct and JSON
No Schema and JSON

Error policies

The connector supports Error policies.

Option Reference

Name

Description

Type

Default Value

connect.cassandra.contact.points

Initial contact point host for Cassandra including port.

string

localhost

connect.cassandra.port

Cassandra native port.

int

9042

connect.cassandra.key.space

Keyspace to write to.

string

connect.cassandra.username

Username to connect to Cassandra with.

string

connect.cassandra.password

Password for the username to connect to Cassandra with.

password

connect.cassandra.ssl.enabled

Secure Cassandra driver connection via SSL.

boolean

false

connect.cassandra.trust.store.path

Path to the client Trust Store.

string

connect.cassandra.trust.store.password

Password for the client Trust Store.

password

connect.cassandra.trust.store.type

Type of the Trust Store, defaults to JKS

string

JKS

connect.cassandra.key.store.type

Type of the Key Store, defauts to JKS

string

JKS

connect.cassandra.ssl.client.cert.auth

Enable client certification authentication by Cassandra. Requires KeyStore options to be set.

boolean

false

connect.cassandra.key.store.path

Path to the client Key Store.

string

connect.cassandra.key.store.password

Password for the client Key Store

password

connect.cassandra.consistency.level

string

connect.cassandra.fetch.size

The number of records the Cassandra driver will return at once.

int

5000

connect.cassandra.load.balancing.policy

Cassandra Load balancing policy. ROUND_ROBIN, TOKEN_AWARE, LATENCY_AWARE or DC_AWARE_ROUND_ROBIN. TOKEN_AWARE and LATENCY_AWARE use DC_AWARE_ROUND_ROBIN

string

TOKEN_AWARE

connect.cassandra.error.policy

string

THROW

connect.cassandra.max.retries

The maximum number of times to try the write again.

int

connect.cassandra.retry.interval

The time in milliseconds between retries.

int

60000

connect.cassandra.threadpool.size

The sink inserts all the data concurrently. To fail fast in case of an error, the sink has its own thread pool. Set the value to zero and the threadpool will default to 4* NO_OF_CPUs. Set a value greater than 0 and that would be the size of this threadpool.

int

connect.cassandra.delete.struct_flds

Fields in the key struct data type used in there delete statement. Comma-separated in the order they are found in connect.cassandra.delete.statement. Keep default value to use the record key as a primitive type.

list

[]

connect.cassandra.delete.statement

Delete statement for cassandra

string

connect.cassandra.kcql

KCQL expression describing field selection and routes.

string

connect.cassandra.default.value

By default a column omitted from the JSON map will be set to NULL. Alternatively, if set UNSET, pre-existing value will be preserved.

string

connect.cassandra.delete.enabled

Enables row deletion from cassandra

boolean

false

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false

Elasticsearch

This page describes the usage of the Stream Reactor Elasticsearch Sink Connector.

Connector Class

Elasticsearch 6

io.lenses.streamreactor.connect.elastic6.ElasticSinkConnector

Elasticsearch 7

io.lenses.streamreactor.connect.elastic7.ElasticSinkConnector

Example

For more examples see the tutorials.

name=elastic
connector.class=io.lenses.streamreactor.connect.elastic7.ElasticSinkConnector
tasks.max=1
topics=orders
connect.elastic.protocol=http
connect.elastic.hosts=elastic
connect.elastic.port=9200
connect.elastic.cluster.name=elasticsearch
connect.elastic.kcql=INSERT INTO orders SELECT * FROM orders
connect.progress.enabled=true

KCQL support

You can specify multiple KCQL statements separated by ; to have a connector sink multiple topics. The connector properties topics or topics.regex are required to be set to a value that matches the KCQL statements.

The following KCQL is supported:

INSERT | UPSERT
INTO <elastic_index >
SELECT FIELD, ...
FROM kafka_topic
[PK FIELD,...]
[WITHDOCTYPE=<your_document_type>]
[WITHINDEXSUFFIX=<your_suffix>]

Examples:

-- Insert mode, select all fields from topicA and write to indexA
INSERT INTO indexA SELECT * FROM topicA

-- Insert mode, select 3 fields and rename from topicB
-- and write to indexB
INSERT INTO indexB SELECT x AS a, y, zc FROM topicB PK y

-- UPSERT
UPSERT INTO indexC SELECT id, string_field FROM topicC PK id

Kafka Tombstone Handling

It is possible to configure how the Connector handles a null value payload (called Kafka tombstones). Please use the behavior.on.null.values property in your KCQL with one of the possible values:

IGNORE (ignores tombstones entirely)
FAIL (throws Exception if tombstone happens)
DELETE (deletes index with specified id)

Example:

INSERT INTO indexA SELECT * FROM topicA PROPERTIES ('behavior.on.null.values'='IGNORE')

Primary Keys

The PK keyword allows you to specify fields that will be used to generate the key value in Elasticsearch. The values of the selected fields are concatenated and separated by a hyphen (-).

If no fields are defined, the connector defaults to using the topic name, partition, and message offset to construct the key.

Field Prefixes

When defining fields, specific prefixes can be used to determine where the data should be extracted from:

_key Prefix Specifies that the value should be extracted from the message key.
- If a path is provided after _key, it identifies the location within the key where the field value resides.
- If no path is provided, the entire message key is used as the value.
_value Prefix Specifies that the value should be extracted from the message value.
- The remainder of the path identifies the specific location within the message value to extract the field.
_header Prefix Specifies that the value should be extracted from the message header.
- The remainder of the path indicates the name of the header to be used for the field value.

Insert and Upsert modes

INSERT writes new records to Elastic, replacing existing records with the same ID set by the PK (Primary Key) keyword. UPSERT replaces existing records if a matching record is found, nor insert a new one if none is found.

Document Type

WITHDOCTYPE allows you to associate a document type to the document inserted.

Index Suffix

WITHINDEXSUFFIX allows you to specify a suffix to your index and we support date format.

Example:

WITHINDEXSUFFIX=_suffix_{YYYY-MM-dd}

Index Names

Static Index Names

To use a static index name, define the target index in the KCQL statement without any prefixes:

INSERT INTO index_name SELECT * FROM topicA

This will consistently create an index named index_name for any messages consumed from topicA.

Extracting Index Names from Headers, Keys, and Values

Headers

To extract an index name from a message header, use the _header prefix followed by the header name:

INSERT INTO _header.gate SELECT * FROM topicA

This statement extracts the value from the gate header field and uses it as the index name.

For headers with names that include dots, enclose the entire target in backticks (```) and each segment which consists of a field name in single quotes ('):

INSERT INTO `_header.'prefix.abc.suffix'` SELECT * FROM topicA

In this case, the value of the header named prefix.abc.suffix is used to form the index name.

Keys

To use the full value of the message key as the index name, use the _key prefix:

INSERT INTO _key SELECT * FROM topicA

For example, if the message key is "freddie", the resulting index name will be freddie.

Values

To extract an index name from a field within the message value, use the _value prefix followed by the field name:

INSERT INTO _value.name SELECT * FROM topicA

This example uses the value of the name field from the message's value. If the field contains "jason", the index name will be jason.

Nested Fields in Values

To access nested fields within a value, specify the full path using dot notation:

INSERT INTO _value.name.firstName SELECT * FROM topicA

If the firstName field is nested within the name structure, its value (e.g., "hans") will be used as the index name.

Fields with Dots in Their Names

For field names that include dots, enclose the entire target in backticks (```) and each segment which consists of a field name in single quotes ('):

INSERT INTO `_value.'customer.name'.'first.name'` SELECT * FROM topicA

If the value structure contains:

{
  "customer.name": {
    "first.name": "hans"
  }
}

The extracted index name will be hans.

Auto Index Creation

The Sink will automatically create missing indexes at startup.

Please note that this feature is not compatible with index names extracted from message headers/keys/values.

Options Reference

Name

Description

Type

Default Value

connect.elastic.protocol

URL protocol (http, https)

string

http

connect.elastic.hosts

List of hostnames for Elastic Search cluster node, not including protocol or port.

string

localhost

connect.elastic.port

Port on which Elastic Search node listens on

string

9300

connect.elastic.tableprefix

Table prefix (optional)

string

connect.elastic.cluster.name

Name of the elastic search cluster, used in local mode for setting the connection

string

elasticsearch

connect.elastic.write.timeout

The time to wait in millis. Default is 5 minutes.

int

300000

connect.elastic.batch.size

How many records to process at one time. As records are pulled from Kafka it can be 100k+ which will not be feasible to throw at Elastic search at once

int

4000

connect.elastic.use.http.username

Username if HTTP Basic Auth required default is null.

string

connect.elastic.use.http.password

Password if HTTP Basic Auth required default is null.

string

connect.elastic.error.policy

string

THROW

connect.elastic.max.retries

The maximum number of times to try the write again.

int

connect.elastic.retry.interval

The time in milliseconds between retries.

int

60000

connect.elastic.kcql

KCQL expression describing field selection and routes.

string

connect.elastic.pk.separator

Separator used when have more that one field in PK

string

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false

KCQL Properties

Name

Description

Type

Default Value

behavior.on.null.values

Specifies behavior on Kafka tombstones: IGNORE , DELETE or FAIL

String

IGNORE

SSL Configuration Properties

Property Name

Description

ssl.truststore.location

Path to the truststore file containing the trusted CA certificates for verifying broker certificates.

ssl.truststore.password

Password for the truststore file to protect its integrity.

ssl.truststore.type

Type of the truststore (e.g., JKS, PKCS12). Default is JKS.

ssl.keystore.location

Path to the keystore file containing the client’s private key and certificate chain for client authentication.

ssl.keystore.password

Password for the keystore to protect the private key.

ssl.keystore.type

Type of the keystore (e.g., JKS, PKCS12). Default is JKS.

ssl.protocol

The SSL protocol used for secure connections (e.g., TLSv1.2, TLSv1.3). Default is TLS.

ssl.trustmanager.algorithm

Algorithm used by the TrustManager to manage certificates. Default value is the key manager factory algorithm configured for the Java Virtual Machine.

ssl.keymanager.algorithm

Algorithm used by the KeyManager to manage certificates. Default value is the key manager factory algorithm configured for the Java Virtual Machine.

SSL Configuration

Enabling SSL connections between Kafka Connect and Elasticsearch ensures that the communication between these services is secure, protecting sensitive data from being intercepted or tampered with. SSL (or TLS) encrypts data in transit, verifying the identity of both parties and ensuring data integrity.

While newer versions of Elasticsearch have SSL enabled by default for internal communication, it’s still necessary to configure SSL for client connections, such as those from Kafka Connect. Even if Elasticsearch has SSL enabled by default, Kafka Connect still needs these configurations to establish a secure connection. By setting up SSL in Kafka Connect, you ensure:

Data encryption: Prevents unauthorized access to data being transferred.
Authentication: Confirms that Kafka Connect and Elasticsearch are communicating with trusted entities.
Compliance: Meets security standards for regulatory requirements (such as GDPR or HIPAA).

Configuration Example

ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=your_truststore_password
ssl.truststore.type=JKS  # Can also be PKCS12 if applicable

ssl.keystore.location=/path/to/keystore.jks
ssl.keystore.password=your_keystore_password
ssl.keystore.type=JKS  # Can also be PKCS12 if applicable

ssl.protocol=TLSv1.2  # Or TLSv1.3 for stronger security

ssl.trustmanager.algorithm=PKIX  # Default algorithm for managing certificates
ssl.keymanager.algorithm=PKIX  # Default algorithm for managing certificates

Terminology:

Truststore: Holds certificates to check if the node’s certificate is valid.
Keystore: Contains your client’s private key and certificate to prove your identity to the node.
SSL Protocol: Use TLSv1.2 or TLSv1.3 for up-to-date security.
Password Security: Protect passwords by encrypting them or using secure methods like environment variables or secret managers.

GCP PubSub

This page describes the usage of the Stream Reactor Google PubSub Sink Connector.

Coming soon!

GCP Storage

This page describes the usage of the Stream Reactor GCP Storage Sink Connector.

Connector Class

io.lenses.streamreactor.connect.gcp.storage.sink.GCPStorageSinkConnector

Example

For more examples see the tutorials.

connector.class=io.lenses.streamreactor.connect.gcp.storage.sink.GCPStorageSinkConnector
connect.gcpstorage.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.metadata_id, _value.customer_id, _header.ts, _header.wallclock STOREAS `JSON` PROPERTIES('flush.interval'=600, 'flush.size'=1000000, 'flush.count'=5000)
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH

KCQL Support

The connector uses KCQL to map topics to GCP Storage buckets and paths. The full KCQL syntax is:

INSERT INTO bucketAddress[:pathPrefix]
SELECT *
FROM kafka-topic
[[PARTITIONBY (partition[, partition] ...)] | NOPARTITION]
[STOREAS storage_format]
[PROPERTIES(
  'property.1'=x,
  'property.2'=x,
)]

{
  ...
  "a.b": "value",
  ...
}

In this case you can use the following KCQL statement:

INSERT INTO `container-name`:`prefix` SELECT * FROM `kafka-topic` PARTITIONBY `a.b`

Target Bucket and Path

Here are a few examples:

INSERT INTO testcontainer:pathToWriteTo SELECT * FROM topicA;
INSERT INTO testcontainer SELECT * FROM topicA;
INSERT INTO testcontainer:path/To/Write/To SELECT * FROM topicA PARTITIONBY fieldA;

SQL Projection

Source Topic

Set the FROM clause to *. This will auto map the topic as a partition.

KCQL Properties

The PROPERTIES clause is optional and adds a layer of configurability to the connector. It enhances versatility by permitting the application of multiple configurations (delimited by ‘,’). The following properties are supported:

Name

Description

Type

Available Values

Default Value

padding.type

Specifies the type of padding to be applied.

LeftPad, RightPad, NoOp

LeftPad

padding.char

Defines the character used for padding.

Char

‘0’

padding.length.partition

Sets the padding length for the partition.

Int

padding.length.offset

Sets the padding length for the offset.

Int

partition.include.keys

Specifies whether partition keys are included.

Boolean

false Default (Custom Partitioning): true

store.envelope

Indicates whether to store the entire Kafka message

Boolean

store.envelope.fields.key

Indicates whether to store the envelope’s key.

Boolean

store.envelope.fields.headers

Indicates whether to store the envelope’s headers.

Boolean

store.envelope.fiels.value

Indicates whether to store the envelope’s value.

Boolean

store.envelope.fields.metadata

Indicates whether to store the envelope’s metadata.

Boolean

flush.size

Specifies the size (in bytes) for the flush operation.

Long

500000000 (500MB)

flush.count

Specifies the number of records for the flush operation.

Int

50000

flush.interval

Specifies the interval (in seconds) for the flush operation.

Long

3600 (1 hour)

key.suffix

When specified it appends the given value to the resulting object key before the "extension" (avro, json, etc) is added

String

<empty>

The sink connector optimizes performance by padding the output object names, a practice that proves beneficial when using the GCP Storage Source connector to restore data. This object name padding ensures that objects are ordered lexicographically, allowing the GCP Storage Source connector to skip the need for reading, sorting, and processing all objects, thereby enhancing efficiency.

Partitioning & Object Keys

The object key serves as the filename used to store data in GCP Storage. There are two options for configuring the object key:

Default: The object key is automatically generated by the connector and follows the Kafka topic-partition structure. The format is $container/[$prefix]/$topic/$partition/offset.extension. The extension is determined by the chosen storage format.
Custom: The object key is driven by the PARTITIONBY clause. The format is either $container/[$prefix]/$topic/customKey1=customValue1/customKey2=customValue2/topic(partition_offset).extension (GCP Athena naming style mimicking Hive-like data partitioning) or $container/[$prefix]/customValue/topic(partition_offset).ext. The extension is determined by the selected storage format.

The partition clause works on header, key and values fields of the Kafka message.

To extract fields from the message values, simply use the field names in the PARTITIONBY clause. For example:

PARTITIONBY fieldA, fieldB

However, note that the message fields must be of primitive types (e.g., string, int, long) to be used for partitioning.

You can also use the entire message key as long as it can be coerced into a primitive type:

PARTITIONBY _key

In cases where the Kafka message Key is not a primitive but a complex object, you can use individual fields within the message Key to create the GCP Storage object key name:

PARTITIONBY _key.fieldA, _key.fieldB

Kafka message headers can also be used in the GCP Storage object key definition, provided the header values are of primitive types easily convertible to strings:

PARTITIONBY _header.<header_key1>[, _header.<header_key2>]

Customizing the object key can leverage various components of the Kafka message. For example:

PARTITIONBY fieldA, _key.fieldB, _headers.fieldC

This flexibility allows you to tailor the object key to your specific needs, extracting meaningful information from Kafka messages to structure GCP Storage object keys effectively.

To enable Athena-like partitioning, use the following syntax:

INSERT INTO $container[:$prefix]
SELECT * FROM $topic
PARTITIONBY fieldA, _key.fieldB, _headers.fieldC
STOREAS `AVRO`
PROPERTIES (
    'partition.include.keys'=true,
)

Rolling Windows

Storing data in GCP Storage and partitioning it by time is a common practice in data management. For instance, you may want to organize your GCP Storage data in hourly intervals. This partitioning can be seamlessly achieved using the PARTITIONBY clause in combination with specifying the relevant time field. However, it’s worth noting that the time field typically doesn’t adjust automatically.

To address this, we offer a Kafka Connect Single Message Transformer (SMT) designed to streamline this process. You can find the transformer plugin and documentation here.

connector.class=io.lenses.streamreactor.connect.gcp.storage.sink.GCPStorageSinkConnector
connect.gcpstorage.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.metadata_id, _value.customer_id, _header.ts, _header.wallclock STOREAS `JSON` PROPERTIES('flush.interval'=30, 'flush.size'=1000000, 'flush.count'=5000)
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH

The PARTITIONBY clause then leverages both the timestamp field and the wallclock header to craft the object key, providing you with precise control over data partitioning.

Data Storage Format

While the STOREAS clause is optional, it plays a pivotal role in determining the storage format within GCP Storage. It’s crucial to understand that this format is entirely independent of the data format stored in Kafka. The connector maintains its neutrality towards the storage format at the topic level and relies on the key.converter and value.converter settings to interpret the data.

Supported storage formats encompass:

AVRO
Parquet
JSON
CSV (including headers)
Text
BYTES

Opting for BYTES ensures that each record is stored in its own separate object. This feature proves particularly valuable for scenarios involving the storage of images or other binary data in GCP Storage. For cases where you prefer to consolidate multiple records into a single binary object, AVRO or Parquet are the recommended choices.

{
  "key": <the message Key, which can be a primitive or a complex object>,
  "value": <the message Key, which can be a primitive or a complex object>,
  "headers": {
    "header1": "value1",
    "header2": "value2"
  },
  "metadata": {
    "offset": 0,
    "partition": 0,
    "timestamp": 0,
    "topic": "topic"
  }
}

Utilizing the envelope is particularly advantageous in scenarios such as backup and restore or replication, where comprehensive storage of the entire message in GCP Storage is desired.

Examples

Storing the message Value Avro data as Parquet in GCP Storage:

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` 
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
...

The converter also facilitates seamless JSON to AVRO/Parquet conversion, eliminating the need for an additional processing step before the data is stored in GCP Storage.

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` 
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
...

Enabling the full message stored as JSON in GCP Storage:

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `JSON` PROPERTIES('store.envelope'=true)
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
...

Enabling the full message stored as AVRO in GCP Storage:

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
...

If the restore (see the GCP Storage Source documentation) happens on the same cluster, then the most performant way is to use the ByteConverter for both Key and Value and store as AVRO or Parquet:

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true)
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
...

Flush Options

The connector offers three distinct flush options for data management:

Flush by Count - triggers a object flush after a specified number of records have been written to it.
Flush by Size - initiates a object flush once a predetermined size (in bytes) has been attained.
Flush by Interval - enforces a object flush after a defined time interval (in seconds).

It’s worth noting that the interval flush is a continuous process that acts as a fail-safe mechanism, ensuring that objects are periodically flushed, even if the other flush options are not configured or haven’t reached their thresholds.

The flush options are configured using the flush.count, flush.size, and flush.interval KCQL Properties (see KCQL Propertiessection). The settings are optional and if not specified the defaults are:

flush.count = 50_000
flush.size = 500000000 (500MB)
flush.interval = 3600 (1 hour)

Flushing By Interval

The next flush time is calculated based on the time the previous flush completed (the last modified time of the object written to GCP Storage). Therefore, by design, the sink connector’s behaviour will have a slight drift based on the time it takes to flush records and whether records are present or not. If Kafka Connect makes no calls to put records, the logic for flushing won't be executed. This ensures a more consistent number of records per object.

Compression

AVRO and Parquet offer the capability to compress files as they are written. The GCP Storage Sink connector provides advanced users with the flexibility to configure compression options.

Here are the available options for the connect.gcpstorage.compression.codec, along with indications of their support by Avro, Parquet and JSON writers:

Compression

Avro Support

Avro (requires Level)

Parquet Support

JSON

UNCOMPRESSED

✅

SNAPPY

✅

GZIP

✅

LZ0

✅

LZ4

✅

BROTLI

✅

BZIP2

✅

ZSTD

✅

⚙️

✅

DEFLATE

✅

⚙️

✅

⚙️

Please note that not all compression libraries are bundled with the GCP Storage connector. Therefore, you may need to manually add certain libraries to the classpath to ensure they function correctly.

Authentication

The connector offers two distinct authentication modes:

Default: This mode relies on the default GCP authentication chain, simplifying the authentication process.
File: This mode uses a local (to the connect worker) path for a file containing GCP authentication credentials.
Credentials: In this mode, explicit configuration of a GCP Credentials string is required for authentication.

The simplest example to configure in the connector is the "Default" mode, as this requires no other configuration. This is configured, as its name suggests, by default.

Here's an example configuration for the "Credentials" mode:

...
connect.gcpstorage.gcp.auth.mode=Credentials
connect.gcpstorage.gcp.credentials=$GCP_CREDENTIALS
connect.gcpstorage.gcp.project.id=$GCP_PROJECT_ID
...

And here is an example configuration using the "File" mode:

...
connect.gcpstorage.gcp.auth.mode=File
connect.gcpstorage.gcp.file=/home/secure-stuff/gcp-write-credential.txt
...

Remember when using file mode the file will need to exist on every worker node in your Kafka connect cluster and be readable by the Kafka Connect process.

For enhanced security and flexibility when using the “Credentials” mode, it is highly advisable to utilize Connect Secret Providers. You can find detailed information on how to use the Connect Secret Providers here. This approach ensures robust security practices while handling access credentials.

Error policies

The connector supports Error policies.

Indexes Prefix

By default, the prefix for these index objects is named .indexes for all connectors. However, each connector will create and store its index objects within its own nested prefix inside this .indexes prefix.

You can configure the root prefix for these index objects using the property connect.gcpstorage.indexes.name. This property specifies the path from the root of the GCS bucket. Note that even if you configure this property, the connector will still create a nested prefix within the specified prefix.

Examples

Index Name (connect.gcpstorage.indexes.name)

Resulting Indexes Prefix Structure

Description

.indexes (default)

.indexes/<connector_name>/

The default setup, where each connector uses its own nested prefix within .indexes.

custom-indexes

custom-indexes/<connector_name>/

Custom prefix custom-indexes, with a nested prefix for each connector.

indexes/gcs-connector-logs

indexes/gcs-connector-logs/<connector_name>/

Uses a custom nested prefix gcs-connector-logs within indexes, with a nested prefix for each connector.

logs/indexes

logs/indexes/<connector_name>/

Indexes are stored under logs/indexes, with a nested prefix for each connector.

Option Reference

Name

Description

Type

Available Values

Default Value

connect.gcpstorage.gcp.auth.mode

Specifies the authentication mode for connecting to GCP.

string

"Credentials", "File" or "Default"

"Default"

connect.gcpstorage.gcp.credentials

For "auth.mode" credentials: GCP Authentication credentials string.

string

(Empty)

connect.gcpstorage.gcp.file

For "auth.mode" file: Local file path for file containing GCP authentication credentials.

string

(Empty)

connect.gcpstorage.gcp.project.id

GCP Project ID.

string

(Empty)

connect.gcpstorage.gcp.quota.project.id

GCP Quota Project ID.

string

(Empty)

connect.gcpstorage.endpoint

Endpoint for GCP Storage.

string

(Empty)

connect.gcpstorage.error.policy

Defines the error handling policy when errors occur during data transfer to or from GCP Storage.

string

"NOOP", "THROW", "RETRY"

"THROW"

connect.gcpstorage.max.retries

Sets the maximum number of retries the connector will attempt before reporting an error.

int

connect.gcpstorage.retry.interval

Specifies the interval (in milliseconds) between retry attempts by the connector.

int

60000

connect.gcpstorage.http.max.retries

Sets the maximum number of retries for the underlying HTTP client when interacting with GCP.

long

connect.gcpstorage.http.retry.interval

Specifies the retry interval (in milliseconds) for the underlying HTTP client.

long

connect.gcpstorage.http.retry.timeout.multiplier

Specifies the change in delay before the next retry or poll

double

3.0

connect.gcpstorage.local.tmp.directory

Enables the use of a local folder as a staging area for data transfer operations.

string

(Empty)

connect.gcpstorage.kcql

A SQL-like configuration that defines the behavior of the connector.

string

(Empty)

connect.gcpstorage.compression.codec

Sets the Parquet compression codec to be used when writing data to GCP Storage.

string

"UNCOMPRESSED", "SNAPPY", "GZIP", "LZ0", "LZ4", "BROTLI", "BZIP2", "ZSTD", "DEFLATE", "XZ"

"UNCOMPRESSED"

connect.gcpstorage.compression.level

Sets the compression level when compression is enabled for data transfer to GCP Storage.

int

1-9

(Empty)

connect.gcpstorage.seek.max.files

Specifies the maximum threshold for the number of files the connector uses.

int

connect.gcpstorage.indexes.name

Configure the indexes prefix for this connector.

string

".indexes"

connect.gcpstorage.exactly.once.enable

By setting to 'false', disable exactly-once semantics, opting instead for Kafka Connect’s native at-least-once offset management

boolean

true, false

true

connect.gcpstorage.schema.change.detector

string

default, version, compatibility

default

connect.gcpstorage.skip.null.values

Skip records with null values (a.k.a. tombstone records).

boolean

true, false

false

HTTP

This page describes the usage of the Stream Reactor HTTP Sink Connector.

A Kafka Connect sink connector for writing records from Kafka to HTTP endpoints.

Features

Support for Json/Avro/String/Protobuf messages via Kafka Connect (in conjunction with converters for Schema-Registry based data storage).
URL, header and content templating ability give you full control of the HTTP request.
Configurable batching of messages, even allowing you to combine them into a single request selecting which data to send with your HTTP request.

Connector Class

io.lenses.streamreactor.connect.http.sink.HttpSinkConnector

Example

For more examples see the tutorials.

name=lenseshttp
connector.class=io.lenses.streamreactor.connect.http.sink.HttpSinkConnector
tasks.max=1
topics=topicToRead
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
connect.http.authentication.type=none
connect.http.method=POST
connect.http.endpoint=http://endpoint.local/receive
connect.http.request.content="My Static Content Template"
connect.http.batch.count=1

Content Template

The Lenses HTTP sink comes with multiple options for content templating of the HTTP request.

Static Templating

If you do not wish any part of the key, value, headers or other data to form a part of the message, you can use static templating:

connect.http.request.content="My Static Content Template"

Single Message Templating

When you are confident you will be generating a single HTTP request per Kafka message, then you can use the simpler templating.

In your configuration, in the content property of your config, you can define template substitutions like the following example:

(please note the XML is only an example, your template can consist of any text format that can be submitted in a http request)

connect.http.request.content="<product><id>{{value.name}}</id></product>"

Multiple Message Templating

To collapse multiple messages into a single HTTP request, you can use the multiple messaging template. This is automatic if the template has a messages tag. See the below example:

  <messages>
    {{#message}}
        <message>
          <topic>{{topic}}</topic>
          <employee>{{value.employeeId}}</employee>
          <order>{{value.orderNo}}</order>
          <groupDomain>{{value.groupDomain}}</groupDomain>
        </message>
    {{/message}}
  </messages>

Again, this is an XML example but your message body can consist of anything including plain text, json or yaml.

Your connector configuration will look like this:

connect.http.request.content="<messages>{{#message}}<message><topic>{{topic}}</topic><employee>{{value.employeeId}}</employee><order>{{value.orderNo}}</order><groupDomain>{{value.groupDomain}}</groupDomain></message>{{/message}}</messages>"

The final result will be HTTP requests with bodies like this:

  <messages>
    <message>
      <topic>myTopic</topic>
       <employee>Abcd1234</employee>
       <order>10</order>
       <groupDomain>myExampleGroup.uk</groupDomain>
    </message>
    <message>
       <topic>myTopic</topic>
       <employee>Efgh5678</employee>
       <order>11</order>
       <groupDomain>myExampleGroup.uk</groupDomain>
    </message>
  </messages>

Available Keys

When using simple and multiple message templating, the following are available:

Field

Usage Example

Header

Value

Key

Topic

Partition

Offset

Timestamp

URL Templating

URL including protocol (eg. http://lenses.io). Template variables can be used.

The URL is also a Content Template so can contain substitutions from the message key/value/headers etc. If you are batching multiple kafka messages into a single request, then the first message will be used for the substitution of the URL.

Authentication Options

Currently, the HTTP Sink supports either no authentication, BASIC HTTP authentication and OAuth2 authentication.

No Authentication (Default)

By default, no authentication is set. This can be also done by providing a configuration like this:

connect.http.authentication.type=none

BASIC HTTP Authentication

BASIC auth can be configured by providing a configuration like this:

connect.http.authentication.type=basic
connect.http.authentication.basic.username=user
connect.http.authentication.basic.password=password

OAuth2 Authentication

OAuth auth can be configured by providing a configuration like this:

connect.http.authentication.type=oauth2
connect.http.authentication.oauth2.token.url=http://myoauth2.local/getToken
connect.http.authentication.oauth2.client.id=clientId
connect.http.authentication.oauth2.client.secret=client-secret
connect.http.authentication.oauth2.token.property=access_token
connect.http.authentication.oauth2.client.scope=any
connect.http.authentication.oauth2.client.headers=header:value

Headers List

To customise the headers sent with your HTTP request you can supply a Headers List.

Each header key and value is also a Content Template so can contain substitutions from the message key/value/headers etc. If you are batching multiple kafka messages into a single request, then the first message will be used for the substitution of the headers.

Example:

connect.http.request.headers="Content-Type","text/plain","X-User","{{header.kafkauser}}","Product","{{value.product.id}}"

SSL Configuration

Enabling SSL connections between Kafka Connect and HTTP Endpoint ensures that the communication between these services is secure, protecting sensitive data from being intercepted or tampered with. SSL (or TLS) encrypts data in transit, verifying the identity of both parties and ensuring data integrity. Please check out SSL Configuration Properties section in order to set it up.

Batch Configuration

The connector offers three distinct flush options for data management:

Flush by Count - triggers a file flush after a specified number of records have been written to it.
Flush by Size - initiates a file flush once a predetermined size (in bytes) has been attained.
Flush by Interval - enforces a file flush after a defined time interval (in seconds).

It's worth noting that the interval flush is a continuous process that acts as a fail-safe mechanism, ensuring that files are periodically flushed, even if the other flush options are not configured or haven't reached their thresholds.

The flush options are configured using the batchCount, batchSize and `timeInterval properties. The settings are optional and if not specified the defaults are:

Field

Default

batchCount

50_000 records

batchSize

500000000 (500MB)

timeInterval

3_600 seconds (1 hour)

connect.http.batch.count=50000
connect.http.batch.size=500000000
connect.http.time.interval=3600

Configuration Examples

Some configuration examples follow on how to apply this connector to different message types.

These include converters, which are required to instruct Kafka Connect on how to read the source content.

Static string template

In this case the converters are irrelevant as we are not using the message content to populate our message template.

connector.class=io.lenses.streamreactor.connect.http.sink.HttpSinkConnector
topics=mytopic
tasks.max=1
connect.http.method=POST
connect.http.endpoint="https://my-endpoint.example.com"
connect.http.request.content="My Static Content Template"
connect.http.batch.count=1

Dynamic string template

The HTTP request body contains the value of the message, which is retained as a string value via the StringConverter.

connector.class=io.lenses.streamreactor.connect.http.sink.HttpSinkConnector
topics=mytopic
tasks.max=1
connect.http.method=POST
connect.http.endpoint="https://my-endpoint.example.com"
connect.http.request.content="{{value}}"
connect.http.batch.count=1
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter

Dynamic string template containing json message fields

Specific fields from the JSON message are substituted into the HTTP request body alongside some static content.

connector.class=io.lenses.streamreactor.connect.http.sink.HttpSinkConnector
topics=mytopic
tasks.max=1
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
connect.http.method=POST
connect.http.endpoint="https://my-endpoint.example.com"
connect.http.request.content="product: {{value.product}}"
connect.http.batch.size=1
value.converter.schemas.enable=false

Dynamic string template containing whole json message

The entirety of the message value is substituted into a placeholder in the message body. The message is treated as a string via the StringConverter.

connector.class=io.lenses.streamreactor.connect.http.sink.HttpSinkConnector
topics=mytopic
tasks.max=1
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
connect.http.method=POST
connect.http.endpoint="https://my-endpoint.example.com"
connect.http.request.content="whole product message: {{value}}"
connect.http.time.interval=5

Dynamic string template containing avro message fields

Fields from the AVRO message are substituted into the message body in the following example:

connector.class=io.lenses.streamreactor.connect.http.sink.HttpSinkConnector
topics=mytopic
tasks.max=1
connect.http.method=POST
connect.http.endpoint="https://my-endpoint.example.com"
connect.http.request.content="product: {{value.product}}"
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schemas.enable=true
value.converter.schema.registry.url=http://schema-registry:8081

Error/Success Reporter

Starting from version 8.1 as pilot release we give our customers ability to use functionality called Reporter which (if enabled) writes Success and Error processing reports to specified Kafka topic. Reports don't have key and you can find details about status in the message headers and value.

In order to enable this functionality we have to enable one (or both if we want full reporting) of the properties below:

connect.reporting.error.config.enabled=true
connect.reporting.success.config.enabled=true

These settings configure the Kafka producer for success and error reports. Full configuration options are available in the Success Reporter Properties and Error Reporter Properties sections. Three examples follow:

local/plain configuration
SASL configuration
SSL configuration

Plain Error Reporting

This is the most common scenario for on-premises Kafka Clusters used just for monitoring

connect.reporting.error.config.enabled=true
connect.reporting.error.config.bootstrap.servers=localhost:9094
connect.reporting.error.config.topic=http-monitoring

Error Reporting using SASL

Using SASL provides a secure and standardized method for authenticating connections to an external Kafka cluster. It is especially valuable when connecting to clusters that require secure communication, as it supports mechanisms like SCRAM, GSSAPI (Kerberos), and OAuth, ensuring that only authorized clients can access the cluster. Additionally, SASL can help safeguard credentials during transmission, reducing the risk of unauthorized access.

connect.reporting.error.config.enabled=true
connect.reporting.error.config.bootstrap.servers=my-kafka-cluster.com:9093
connect.reporting.error.config.security.protocol=SASL_SSL
connect.reporting.error.config.sasl.mechanism=PLAIN
connect.reporting.error.config.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="MYUSER" password="MYPASSWORD";

Error Reporting using SSL

Using SSL ensures secure communication between clients and the Kafka cluster by encrypting data in transit. This prevents unauthorized parties from intercepting or tampering with sensitive information. SSL also supports mutual authentication, allowing both the client and server to verify each other’s identities, which enhances trust and security in the connection.

connect.reporting.error.config.enabled=true
connect.reporting.error.config.bootstrap.servers=SSL://my-ssl-protected-cluster:9094
connect.reporting.error.config.security.protocol=SSL
connect.reporting.error.config.ssl.keystore.location=/path/to/my/keystore.p12
connect.reporting.error.config.ssl.keystore.type=PKCS12
connect.reporting.error.config.ssl.truststore.location=/path/to/my/truststore.p12
connect.reporting.error.config.ssl.truststore.password=************
connect.reporting.error.config.ssl.truststore.type=PKCS12
connect.reporting.error.config.topic=http-error-topic

Options

Configuration parameters

This sink connector supports the following options as part of its configuration:

Field

Type

Required

Values (Default)

connect.http.method

HttpMethod

Yes

POST, PUT, PATCH

connect.http.endpoint

String

Yes

URL Template

connect.http.request.content

String

Yes

Content Template

connect.http.authentication.type

Authentication

Authentication Options (none)

connect.http.request.headers

List[String]

Headers List

connect.http.batch.count

Int

The number of records to batch before sending the request, see Batch Configuration

connect.http.batch.size

Int

The size of the batch in bytes before sending the request, see Batch Configuration

connect.http.time.interval

Int

The time interval in milliseconds to wait before sending the request

connect.http.upload.sync.period

Int

Upload Sync Period (100) - polling time period for uploads in milliseconds

connect.http.error.threshold

Int

The number of errors to tolerate before failing the sink (5)

connect.http.retry.mode

String

The http retry mode. It can be one of : Fixed or Exponential(default)

connect.http.retries.on.status.codes

List[String]

The status codes to retry on (default codes are : 408,429,500,502,5003,504)

connect.http.retries.max.retries

Int

The maximum number of retries to attempt (default is 5)

connect.http.retry.fixed.interval.ms

Int

The set duration to wait before retrying HTTP requests. The default is 10000 (10 seconds)

connect.http.retries.max.timeout.ms

Int

The maximum time in milliseconds to retry a request when Exponential retry is set. Backoff is used to increase the time between retries, up to this maximum (30000)

connect.http.connection.timeout.ms

Int

The HTTP connection timeout in milliseconds (10000)

connect.http.max.queue.size

int

For each processed topic, the connector maintains an internal queue. This value specifies the maximum number of entries allowed in the queue before the enqueue operation blocks. The default is 100,000.

connect.http.max.queue.offer.timeout.ms

int

The maximum time window, specified in milliseconds, to wait for the internal queue to accept new records. The d

SSL Configuration Properties

Property Name

Description

ssl.truststore.location

Path to the truststore file containing the trusted CA certificates for verifying broker certificates.

ssl.truststore.password

Password for the truststore file to protect its integrity.

ssl.truststore.type

Type of the truststore (e.g., JKS, PKCS12). Default is JKS.

ssl.keystore.location

Path to the keystore file containing the client’s private key and certificate chain for client authentication.

ssl.keystore.password

Password for the keystore to protect the private key.

ssl.keystore.type

Type of the keystore (e.g., JKS, PKCS12). Default is JKS.

ssl.protocol

The SSL protocol used for secure connections (e.g., TLSv1.2, TLSv1.3). Default is TLSv1.3.

ssl.keymanager.algorithm

Algorithm used by the KeyManager to manage certificates. Default value is the key manager factory algorithm configured for the Java Virtual Machine.

ssl.trustmanager.algorithm

Algorithm used by the TrustManager to manage certificates. Default value is the key manager factory algorithm configured for the Java Virtual Machine.

Error Reporter Properties

Property Name

Description

connect.reporting.error.config.enabled

Specifies whether the reporter is enabled. false by default.

connect.reporting.error.config.bootstrap.servers

A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. Required if reporter is enabled.

connect.reporting.error.config.topic

Specifies the topic for Reporter to write to.

connect.reporting.error.config.location

SASL Mechanism used when connecting.

connect.reporting.error.config.sasl.jaas.config

JAAS login context parameters for SASL connections in the format used by JAAS configuration files.

connect.reporting.error.config.sasl.mechanism

SASL mechanism used for client connections. This may be any mechanism for which a security provider is available.

The error reporter can also be configured with SSL Properties. See the section SSL Configuration Properties. In this case all properties should be prefixed with connect.reporting.error.config to ensure they apply to the error reporter.

Success Reporter Properties

Property Name

Description

connect.reporting.success.config.enabled

Specifies whether the reporter is enabled. false by default.

connect.reporting.success.config.bootstrap.servers

A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. Required if reporter is enabled.

connect.reporting.success.config.topic

Specifies the topic for Reporter to write to.

connect.reporting.success.config.location

SASL Mechanism used when connecting.

connect.reporting.success.config.sasl.jaas.config

JAAS login context parameters for SASL connections in the format used by JAAS configuration files.

connect.reporting.success.config.sasl.mechanism

SASL mechanism used for client connections. This may be any mechanism for which a security provider is available.

The error reporter can also be configured with SSL Properties. See the section SSL Configuration Properties. In this case all properties should be prefixed with connect.reporting.success.config to ensure they apply to the success reporter.

InfluxDB

This page describes the usage of the Stream Reactor InfluxDB Sink Connector.

Connector Class

io.lenses.streamreactor.connect.influx.InfluxSinkConnector

Example

For more examples see the tutorials.

name=influxdb
connector.class=io.lenses.streamreactor.connect.influx.InfluxSinkConnector
tasks.max=1
topics=influx
connect.influx.url=http://influxdb:8086
connect.influx.db=mydb
connect.influx.username=admin
connect.influx.kcql=INSERT INTO influxMeasure SELECT * FROM influx WITHTIMESTAMP sys_time()

KCQL support

The following KCQL is supported:

INSERT INTO <your-measure>
SELECT FIELD, ...
FROM kafka_topic_name
[WITHTIMESTAMP FIELD|sys_time]
[WITHTAG(FIELD|(constant_key=constant_value)]

Examples:

-- Insert mode, select all fields from topicA and write to indexA
INSERT INTO measureA SELECT * FROM topicA

-- Insert mode, select 3 fields and rename from topicB and write to indexB,
-- use field Y as the point measurement
INSERT INTO measureB SELECT x AS a, y AS b, c FROM topicB WITHTIMESTAMP y

-- Insert mode, select 3 fields and rename from topicB and write to indexB,
-- use field Y as the current system time for Point measurement
INSERT INTO measureB SELECT x AS a, y AS b, z FROM topicB WITHTIMESTAMP sys_time()

-- Tagging using constants
INSERT INTO measureA SELECT * FROM topicA WITHTAG (DataMountaineer=awesome, Influx=rulz!)

-- Tagging using fields in the payload. Say we have a Payment structure
-- with these fields: amount, from, to, note
INSERT INTO measureA SELECT * FROM topicA WITHTAG (from, to)

-- Tagging using a combination of fields in the payload and constants.
-- Say we have a Payment structure with these fields: amount, from, to, note
INSERT INTO measureA SELECT * FROM topicA WITHTAG (from, to, provider=DataMountaineer)

Kafka payload support

This sink supports the following Kafka payloads:

Schema.Struct and Struct (Avro)
Schema.Struct and JSON
No Schema and JSON

Error policies

The connector supports Error policies.

Option Reference

Name

Description

Type

Default Value

connect.influx.url

The InfluxDB database url.

string

connect.influx.db

The database to store the values to.

string

connect.influx.username

The user to connect to the influx database

string

connect.influx.password

The password for the influxdb user.

password

connect.influx.kcql

KCQL expression describing field selection and target measurements.

string

connect.progress.enabled

Enables the output for how many records have been processed by the connector

boolean

false

connect.influx.error.policy

string

THROW

connect.influx.retry.interval

The time in milliseconds between retries.

int

60000

connect.influx.max.retries

The maximum number of times to try the write again.

int

connect.influx.retention.policy

Determines how long InfluxDB keeps the data - the options for specifying the duration of the retention policy are listed below. Note that the minimum retention period is one hour. DURATION determines how long InfluxDB keeps the data - the options for specifying the duration of the retention policy are listed below. Note that the minimum retention period is one hour. m minutes h hours d days w weeks INF infinite Default retention is autogen from 1.0 onwards or default for any previous version

string

autogen

connect.influx.consistency.level

Specifies the write consistency. If any write operations do not meet the configured consistency guarantees, an error will occur and the data will not be indexed. The default consistency-level is ALL.

string

ALL

JMS

This page describes the usage of the Stream Reactor JMS Sink Connector.

Connector Class

io.lenses.streamreactor.connect.jms.sink.JMSSinkConnector

Example

For more examples see the tutorials.

name=jms
connector.class=io.lenses.streamreactor.connect.jms.sink.JMSSinkConnector
tasks.max=1
topics=orders
connect.jms.url=tcp://activemq:61616
connect.jms.initial.context.factory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
connect.jms.connection.factory=ConnectionFactory
connect.jms.kcql=INSERT INTO orders SELECT * FROM orders WITHTYPE QUEUE WITHFORMAT JSON

KCQL support

The following KCQL is supported:

INSERT INTO <jms-destination>
SELECT FIELD, ...
FROM <your-kafka-topic>
[WITHFORMAT AVRO|JSON|MAP|OBJECT]
WITHTYPE TOPIC|QUEUE

Examples:

-- Select all fields from topicA and write to jmsA queue
INSERT INTO jmsA SELECT * FROM topicA WITHTYPE QUEUE

-- Select 3 fields and rename from topicB and write
-- to jmsB topic as JSON in a TextMessage
INSERT INTO jmsB SELECT x AS a, y, z FROM topicB WITHFORMAT JSON WITHTYPE TOPIC

JMS Topics and Queues

The sink can write to either topics or queues, specified by the WITHTYPE clause.

JMS Payload

When a message is sent to a JMS target it can be one of the following:

JSON - Send a TextMessage
AVRO - Send a BytesMessage
MAP - Send a MapMessage
OBJECT - Send an ObjectMessage

This is set by the WITHFORMAT keyword.

Kafka payload support

This sink supports the following Kafka payloads:

Schema.Struct and Struct (Avro)
Schema.Struct and JSON
No Schema and JSON

Error policies

The connector supports Error policies.

Option Reference

Name

Description

Type

Default Value

connect.jms.url

Provides the JMS broker url

string

connect.jms.initial.context.factory

Initial Context Factory, e.g: org.apache.activemq.jndi.ActiveMQInitialContextFactory

string

connect.jms.connection.factory

Provides the full class name for the ConnectionFactory compile to use, e.gorg.apache.activemq.ActiveMQConnectionFactory

string

ConnectionFactory

connect.jms.kcql

string

connect.jms.subscription.name

subscription name to use when subscribing to a topic, specifying this makes a durable subscription for topics

string

connect.jms.password

Provides the password for the JMS connection

password

connect.jms.username

Provides the user for the JMS connection

string

connect.jms.error.policy

string

THROW

connect.jms.retry.interval

The time in milliseconds between retries.

int

60000

connect.jms.max.retries

The maximum number of times to try the write again.

int

connect.jms.destination.selector

Selector to use for destination lookup. Either CDI or JNDI.

string

CDI

connect.jms.initial.context.extra.params

List (comma-separated) of extra properties as key/value pairs with a colon delimiter to supply to the initial context e.g. SOLACE_JMS_VPN:my_solace_vp

list

[]

connect.jms.batch.size

The number of records to poll for on the target JMS destination in each Connect poll.

int

100

connect.jms.polling.timeout

Provides the timeout to poll incoming messages

long

1000

connect.jms.source.default.converter

string

connect.jms.converter.throw.on.error

If set to false the conversion exception will be swallowed and everything carries on BUT the message is lost!!; true will throw the exception.Default is false.

boolean

false

connect.converter.avro.schemas

If the AvroConverter is used you need to provide an avro Schema to be able to read and translate the raw bytes to an avro record. The format is $MQTT_TOPIC=$PATH_TO_AVRO_SCHEMA_FILE

string

connect.jms.headers

string

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false

connect.jms.evict.interval.minutes

int

connect.jms.evict.threshold.minutes

The number of minutes after which an uncommitted entry becomes evictable from the connector cache.

int

connect.jms.scale.type

MongoDB

This page describes the usage of the Stream Reactor MongoDB Sink Connector.

Connector Class

io.lenses.streamreactor.connect.mongodb.sink.MongoSinkConnector

Example

For more examples see the tutorials.

name=mongo
connector.class=io.lenses.streamreactor.connect.mongodb.sink.MongoSinkConnector
tasks.max=1
topics=orders
connect.mongo.kcql=INSERT INTO orders SELECT * FROM orders
connect.mongo.db=connect
connect.mongo.connection=mongodb://mongo:27017

KCQL support

The following KCQL is supported:

INSERT | UPSERT
INTO <collection_name>
SELECT FIELD, ...
FROM <kafka-topic>
BATCH = 100

Examples:

--  Select all fields from topic fx_prices and insert into the fx collection
INSERT INTO fx SELECT * FROM fx_prices

--  Select all fields from topic fx_prices and upsert into the fx collection,
--  The assumption is there will be a ticker field in the incoming json:
UPSERT INTO fx SELECT * FROM fx_prices PK ticker

Insert Mode

Insert is the default write mode of the sink.

Upsert Mode

The connector supports Kudu upserts which replaces the existing row if a match is found on the primary keys. If records are delivered with the same field or group of fields that are used as the primary key on the target table, but different values, the existing record in the target table will be updated.

Batching

The BATCH clause controls the batching of writes to MongoDB.

TLS/SSL

TLS/SSL is supported by setting ?ssl=true in the connect.mongo.connection option. The MongoDB driver will then attempt to load the truststore and keystore using the JVM system properties.

You need to set JVM system properties to ensure that the client is able to validate the SSL certificate presented by the server:

javax.net.ssl.trustStore: the path to a trust store containing the certificate of the signing authority
javax.net.ssl.trustStorePassword: the password to access this trust store
javax.net.ssl.keyStore: the path to a key store containing the client’s SSL certificates
javax.net.ssl.keyStorePassword: the password to access this key store

Authentication Mechanism

All authentication methods are supported, X.509, LDAP Plain, Kerberos (GSSAPI), MongoDB-CR and SCRAM-SHA-1. The default as of MongoDB version 3.0 SCRAM-SHA-1. To set the authentication mechanism set the authMechanism in the connect.mongo.connection option.

The mechanism can either be set in the connection string but this requires the password to be in plain text in the connection string or via the connect.mongo.auth.mechanism option.

If the username is set it overrides the username/password set in the connection string and the connect.mongo.auth.mechanism has precedence.

e.g.

# default of scram
mongodb://host1/?authSource=db1
# scram explict
mongodb://host1/?authSource=db1&authMechanism=SCRAM-SHA-1
# mongo-cr
mongodb://host1/?authSource=db1&authMechanism=MONGODB-CR
# x.509
mongodb://host1/?authSource=db1&authMechanism=MONGODB-X509
# kerberos
mongodb://host1/?authSource=db1&authMechanism=GSSAPI
# ldap
mongodb://host1/?authSource=db1&authMechanism=PLAIN

JSON Field dates

List of fields that should be converted to ISO Date on MongoDB insertion (comma-separated field names), for JSON topics only. Field values may be an epoch time or an ISO8601 datetime string with an offset (offset or ‘Z’ required). If the string does not parse to ISO, it will be written as a string instead.

Subdocument fields can be referred to in the following examples:

topLevelFieldName
topLevelSubDocument.FieldName
topLevelParent.subDocument.subDocument2.FieldName

If a field is converted to ISODate and that same field is named as a PK, then the PK field is also written as an ISODate.

This is controlled via the connect.mongo.json_datetime_fields option.

Kafka payload support

This sink supports the following Kafka payloads:

Schema.Struct and Struct (Avro)
Schema.Struct and JSON
No Schema and JSON

Error policies

The connector supports Error policies.

Option Reference

Name

Description

Type

Default Value

ssl.cipher.suites

A list of cipher suites. This is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. By default all the available cipher suites are supported.

list

ssl.enabled.protocols

The list of protocols enabled for SSL connections.

list

[TLSv1.2, TLSv1.1, TLSv1]

ssl.keystore.password

The store password for the key store file. This is optional for client and only needed if ssl.keystore.location is configured.

password

ssl.key.password

The password of the private key in the key store file. This is optional for client.

password

ssl.keystore.type

The file format of the key store file. This is optional for client.

string

JKS

ssl.truststore.location

The location of the trust store file.

string

ssl.endpoint.identification.algorithm

The endpoint identification algorithm to validate server hostname using server certificate.

string

https

ssl.protocol

The SSL protocol used to generate the SSLContext. Default setting is TLS, which is fine for most cases. Allowed values in recent JVMs are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and SSLv3 may be supported in older JVMs, but their usage is discouraged due to known security vulnerabilities.

string

TLS

ssl.trustmanager.algorithm

The algorithm used by trust manager factory for SSL connections. Default value is the trust manager factory algorithm configured for the Java Virtual Machine.

string

PKIX

ssl.secure.random.implementation

The SecureRandom PRNG implementation to use for SSL cryptography operations.

string

ssl.truststore.type

The file format of the trust store file.

string

JKS

ssl.keymanager.algorithm

The algorithm used by key manager factory for SSL connections. Default value is the key manager factory algorithm configured for the Java Virtual Machine.

string

SunX509

ssl.provider

The name of the security provider used for SSL connections. Default value is the default security provider of the JVM.

string

ssl.keystore.location

The location of the key store file. This is optional for client and can be used for two-way authentication for client.

string

ssl.truststore.password

The password for the trust store file. If a password is not set access to the truststore is still available, but integrity checking is disabled.

password

connect.mongo.connection

The mongodb connection in the format mongodb://[username:password@]host1[:port1],host2[:port2],…[,hostN[:portN]]][/[database][?options]].

string

connect.mongo.db

The mongodb target database.

string

connect.mongo.username

The username to use when authenticating

string

connect.mongo.password

The password for the use when authenticating

password

connect.mongo.auth.mechanism

String

string

SCRAM-SHA-1

connect.mongo.error.policy

string

THROW

connect.mongo.max.retries

The maximum number of times to try the write again.

int

connect.mongo.retry.interval

The time in milliseconds between retries.

int

60000

connect.mongo.kcql

KCQL expression describing field selection and data routing to the target mongo db.

string

connect.mongo.json_datetime_fields

List of fields that should be converted to ISODate on Mongodb insertion (comma-separated field names). For JSON topics only. Field values may be an integral epoch time or an ISO8601 datetime string with an offset (offset or ‘Z’ required). If string does not parse to ISO, it will be written as a string instead. Subdocument fields can be referred to as in the following examples: “topLevelFieldName”, “topLevelSubDocument.FieldName”, “topLevelParent.subDocument.subDocument2.FieldName”, (etc.) If a field is converted to ISODate and that same field is named as a PK, then the PK field is also written as an ISODate.

list

[]

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false

MQTT

This page describes the usage of the Stream Reactor MQTT Sink Connector.

Connector Class

io.lenses.streamreactor.connect.mqtt.sink.MqttSinkConnector

Example

For more examples see the tutorials.

name=mqtt
connector.class=io.lenses.streamreactor.connect.mqtt.sink.MqttSinkConnector
tasks.max=1
topics=orders
connect.mqtt.hosts=tcp://mqtt:1883
connect.mqtt.clean=true
connect.mqtt.timeout=1000
connect.mqtt.keep.alive=1000
connect.mqtt.service.quality=1
connect.mqtt.client.id=dm_sink_id
connect.mqtt.kcql=INSERT INTO /lenses/orders SELECT * FROM orders

KCQL Support

The following KCQL is supported:

INSERT
INTO <mqtt-topic>
SELECT FIELD, ...
FROM <kafka-topic>

Examples:

-- Insert into /landoop/demo all fields from kafka_topicA
INSERT INTO /landoop/demo SELECT * FROM kafka_topicA

-- Insert into /landoop/demo all fields from dynamic field
INSERT INTO `$field` SELECT * FROM control.boxes.test WITHTARGET = field

Dynamic targets

The connector can dynamically write to MQTT topics determined by a field in the Kafka message value by using the WITHTARGET target clause and specifying $field as the target field to extract.

Kafka payload support

This sink supports the following Kafka payloads:

Schema.Struct and Struct (Avro)
Schema.Struct and JSON
No Schema and JSON

Error policies

The connector supports Error policies.

Option Reference

Name

Description

Type

Default Value

connect.mqtt.hosts

Contains the MQTT connection end points.

string

connect.mqtt.username

Contains the Mqtt connection user name

string

connect.mqtt.password

Contains the Mqtt connection password

password

connect.mqtt.service.quality

Specifies the Mqtt quality of service

int

connect.mqtt.timeout

Provides the time interval to establish the mqtt connection

int

3000

connect.mqtt.clean

boolean

true

connect.mqtt.keep.alive

int

5000

connect.mqtt.client.id

Contains the Mqtt session client id

string

connect.mqtt.error.policy

string

THROW

connect.mqtt.retry.interval

The time in milliseconds between retries.

int

60000

connect.mqtt.max.retries

The maximum number of times to try the write again.

int

connect.mqtt.retained.messages

Specifies the Mqtt retained flag.

boolean

false

connect.mqtt.converter.throw.on.error

If set to false the conversion exception will be swallowed and everything carries on BUT the message is lost!!; true will throw the exception.Default is false.

boolean

false

connect.converter.avro.schemas

string

connect.mqtt.kcql

Contains the Kafka Connect Query Language describing the sourced MQTT source and the target Kafka topics

string

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false

connect.mqtt.ssl.ca.cert

Provides the path to the CA certificate file to use with the Mqtt connection

string

connect.mqtt.ssl.cert

Provides the path to the certificate file to use with the Mqtt connection

string

connect.mqtt.ssl.key

Certificate private [config] key file path.

string

Redis

This page describes the usage of the Stream Reactor Redis Sink Connector.

Connector Class

io.lenses.streamreactor.connect.redis.sink.RedisSinkConnector

Example

For more examples see the tutorials.

name=redis
connector.class=io.lenses.streamreactor.connect.redis.sink.RedisSinkConnector
tasks.max=1
topics=redis
connect.redis.host=redis
connect.redis.port=6379
connect.redis.kcql=INSERT INTO lenses SELECT * FROM redis STOREAS STREAM

KCQL support

The following KCQL is supported:

[INSERT INTO <redis-cache>]
SELECT FIELD, ...
FROM <kafka-topic>
[PK FIELD]
[STOREAS SortedSet(key=FIELD)|GEOADD|STREAM]

Cache mode

The purpose of this mode is to cache in Redis [Key-Value] pairs. Imagine a Kafka topic with currency foreign exchange rate messages:

{ "symbol": "USDGBP" , "price": 0.7943 }
{ "symbol": "EURGBP" , "price": 0.8597 }

You may want to store in Redis: the symbol as the Key and the price as the Value. This will effectively make Redis a caching system, which multiple other applications can access to get the (latest) value. To achieve that using this particular Kafka Redis Sink Connector, you need to specify the KCQL as:

SELECT price from yahoo-fx PK symbol

This will update the keys USDGBP , EURGBP with the relevant price using the (default) JSON format:

Key=EURGBP  Value={ "price": 0.7943 }

Composite keys are supported with the PK clause, a delimiter can be set with the optional configuration property connect.redis.pk.delimiter.

Sorted Sets

To insert messages from a Kafka topic into 1 Sorted Set use the following KCQL syntax:

INSERT INTO cpu_stats SELECT * from cpuTopic STOREAS SortedSet(score=timestamp) TTL=60

This will create and add entries to the (sorted set) named cpu_stats. The entries will be ordered in the Redis set based on the score that we define it to be the value of the timestamp field of the AVRO message from Kafka. In the above example, we are selecting and storing all the fields of the Kafka message.

The TTL statement allows setting a time to live on the sorted set. If not specified TTL is set.

Multiple Sorted Sets

The connector can create multiple sorted sets by promoting each value of one field from the Kafka message into one Sorted Set and selecting which values to store in the sorted-sets. Set KCQL clause to define the filed using PK (primary key)

SELECT temperature, humidity FROM sensorsTopic PK sensorID STOREAS SortedSet(score=timestamp)

Notice we have dropped the INSERT clause.

The connector can also prefix the name of the Key using the INSERT statement for Multiple SortedSets:

INSERT INTO FX- SELECT price from yahoo-fx PK symbol STOREAS SortedSet(score=timestamp) TTL=60

This will create a key with names FX-USDGBP , FX-EURGBP etc.

The TTL statement allows setting a time to live on the sorted set. If not specified TTL is set.

Geospatial add

To insert messages from a Kafka topic with GEOADD use the following KCQL syntax:

INSERT INTO cpu_stats SELECT * from cpuTopic STOREAS GEOADD

Streams

To insert messages from a Kafka topic to a Redis Stream use the following KCQL syntax:

INSERT INTO redis_stream_name SELECT * FROM my-kafka-topic STOREAS STREAM

PubSub

To insert a message from a Kafka topic to a Redis PubSub use the following KCQL syntax:

SELECT * FROM topic STOREAS PubSub (channel=myfield)

The channel to write to in Redis is determined by a field in the payload of the Kafka message set in the KCQL statement, in this case, a field called myfield.

Kafka payload support

This sink supports the following Kafka payloads:

Schema.Struct and Struct (Avro)
Schema.Struct and JSON
No Schema and JSON

Error policies

The connector supports Error policies.

Option Reference

Name

Description

Type

Default Value

connect.redis.pk.delimiter

Specifies the redis primary key delimiter

string

ssl.provider

The name of the security provider used for SSL connections. Default value is the default security provider of the JVM.

string

ssl.protocol

string

TLS

ssl.truststore.location

The location of the trust store file.

string

ssl.keystore.password

The store password for the key store file. This is optional for client and only needed if ssl.keystore.location is configured.

password

ssl.keystore.location

The location of the key store file. This is optional for client and can be used for two-way authentication for client.

string

ssl.truststore.password

The password for the trust store file. If a password is not set access to the truststore is still available, but integrity checking is disabled.

password

ssl.keymanager.algorithm

The algorithm used by key manager factory for SSL connections. Default value is the key manager factory algorithm configured for the Java Virtual Machine.

string

SunX509

ssl.trustmanager.algorithm

The algorithm used by trust manager factory for SSL connections. Default value is the trust manager factory algorithm configured for the Java Virtual Machine.

string

PKIX

ssl.keystore.type

The file format of the key store file. This is optional for client.

string

JKS

ssl.cipher.suites

list

ssl.endpoint.identification.algorithm

The endpoint identification algorithm to validate server hostname using server certificate.

string

https

ssl.truststore.type

The file format of the trust store file.

string

JKS

ssl.enabled.protocols

The list of protocols enabled for SSL connections.

list

[TLSv1.2, TLSv1.1, TLSv1]

ssl.key.password

The password of the private key in the key store file. This is optional for client.

password

ssl.secure.random.implementation

The SecureRandom PRNG implementation to use for SSL cryptography operations.

string

connect.redis.kcql

KCQL expression describing field selection and routes.

string

connect.redis.host

Specifies the redis server

string

connect.redis.port

Specifies the redis connection port

int

connect.redis.password

Provides the password for the redis connection.

password

connect.redis.ssl.enabled

Enables ssl for the redis connection

boolean

false

connect.redis.error.policy

string

THROW

connect.redis.retry.interval

The time in milliseconds between retries.

int

60000

connect.redis.max.retries

The maximum number of times to try the write again.

int

connect.progress.enabled

Enables the output for how many records have been processed

boolean

false