GCP Storage

This page describes the usage of the Stream Reactor GCP Storage Sink Connector.

You can specify multiple KCQL statements separated by ; to have a connector sink multiple topics. The connector properties topics or topics.regex are required to be set to a value that matches the KCQL statements.

Connector Class

io.lenses.streamreactor.connect.gcp.storage.sink.GCPStorageSinkConnector

Example

For more examples see the tutorials.

connector.class=io.lenses.streamreactor.connect.gcp.storage.sink.GCPStorageSinkConnector
connect.gcpstorage.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.metadata_id, _value.customer_id, _header.ts, _header.wallclock STOREAS `JSON` PROPERTIES('flush.interval'=600, 'flush.size'=1000000, 'flush.count'=5000)
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH

KCQL Support

The connector uses KCQL to map topics to GCP Storage buckets and paths. The full KCQL syntax is:

INSERT INTO bucketAddress[:pathPrefix]
SELECT *
FROM kafka-topic
[[PARTITIONBY (partition[, partition] ...)] | NOPARTITION]
[STOREAS storage_format]
[PROPERTIES(
  'property.1'=x,
  'property.2'=x,
)]

Please note that you can employ escaping within KCQL for the INSERT INTO, SELECT * FROM, and PARTITIONBY clauses when necessary. For example, an incoming Kafka message stored as Json can use fields contaiing .:

{
  ...
  "a.b": "value",
  ...
}

In this case you can use the following KCQL statement:

INSERT INTO `container-name`:`prefix` SELECT * FROM `kafka-topic` PARTITIONBY `a.b`

Target Bucket and Path

The target bucket and path are specified in the INSERT INTO clause. The path is optional and if not specified, the connector will write to the root of the bucket and append the topic name to the path.

Here are a few examples:

INSERT INTO testcontainer:pathToWriteTo SELECT * FROM topicA;
INSERT INTO testcontainer SELECT * FROM topicA;
INSERT INTO testcontainer:path/To/Write/To SELECT * FROM topicA PARTITIONBY fieldA;

SQL Projection

Currently, the connector does not offer support for SQL projection; consequently, anything other than a SELECT * query is disregarded. The connector will faithfully write all fields from Kafka exactly as they are.

Source Topic

To avoid runtime errors, make sure the topics or topics.regex setting matches your KCQL statements. If the connector receives data for a topic without matching KCQL, it will throw an error. When using a regex to select topics, follow this KCQL pattern:

topics.regex = ^sensor_data_\d+$
connect.gcpstorage.kcql= INSERT INTO $target SELECT * FROM  `*` ....

In this case the topic name will be appended to the $target destination.

KCQL Properties

The PROPERTIES clause is optional and adds a layer of configurability to the connector. It enhances versatility by permitting the application of multiple configurations (delimited by ‘,’). The following properties are supported:

Name

Description

Type

Available Values

Default Value

padding.type

Specifies the type of padding to be applied.

LeftPad, RightPad, NoOp

LeftPad

padding.char

Defines the character used for padding.

Char

‘0’

padding.length.partition

Sets the padding length for the partition.

Int

padding.length.offset

Sets the padding length for the offset.

Int

partition.include.keys

Specifies whether partition keys are included.

Boolean

false Default (Custom Partitioning): true

store.envelope

Indicates whether to store the entire Kafka message

Boolean

store.envelope.fields.key

Indicates whether to store the envelope’s key.

Boolean

store.envelope.fields.headers

Indicates whether to store the envelope’s headers.

Boolean

store.envelope.fiels.value

Indicates whether to store the envelope’s value.

Boolean

store.envelope.fields.metadata

Indicates whether to store the envelope’s metadata.

Boolean

flush.size

Specifies the size (in bytes) for the flush operation.

Long

500000000 (500MB)

flush.count

Specifies the number of records for the flush operation.

Int

50000

flush.interval

Specifies the interval (in seconds) for the flush operation.

Long

3600 (1 hour)

key.suffix

When specified it appends the given value to the resulting object key before the "extension" (avro, json, etc) is added

String

<empty>

The sink connector optimizes performance by padding the output object names, a practice that proves beneficial when using the GCP Storage Source connector to restore data. This object name padding ensures that objects are ordered lexicographically, allowing the GCP Storage Source connector to skip the need for reading, sorting, and processing all objects, thereby enhancing efficiency.

Partitioning & Object Keys

The object key serves as the filename used to store data in GCP Storage. There are two options for configuring the object key:

Default: The object key is automatically generated by the connector and follows the Kafka topic-partition structure. The format is $container/[$prefix]/$topic/$partition/offset.extension. The extension is determined by the chosen storage format.
Custom: The object key is driven by the PARTITIONBY clause. The format is either $container/[$prefix]/$topic/customKey1=customValue1/customKey2=customValue2/topic(partition_offset).extension (GCP Athena naming style mimicking Hive-like data partitioning) or $container/[$prefix]/customValue/topic(partition_offset).ext. The extension is determined by the selected storage format.

The Connector automatically adds the topic name to the partition. There is no need to add it to the partition clause. If you want to explicitly add the topic or partition you can do so by using _topic and _partition.

The partition clause works on header, key and values fields of the Kafka message.

Custom keys and values can be extracted from the Kafka message key, message value, or message headers, as long as the headers are of types that can be converted to strings. There is no fixed limit to the number of elements that can form the object key, but you should be aware of GCP Storage key length restrictions.

To extract fields from the message values, simply use the field names in the PARTITIONBY clause. For example:

PARTITIONBY fieldA, fieldB

However, note that the message fields must be of primitive types (e.g., string, int, long) to be used for partitioning.

You can also use the entire message key as long as it can be coerced into a primitive type:

PARTITIONBY _key

In cases where the Kafka message Key is not a primitive but a complex object, you can use individual fields within the message Key to create the GCP Storage object key name:

PARTITIONBY _key.fieldA, _key.fieldB

Kafka message headers can also be used in the GCP Storage object key definition, provided the header values are of primitive types easily convertible to strings:

PARTITIONBY _header.<header_key1>[, _header.<header_key2>]

Customizing the object key can leverage various components of the Kafka message. For example:

PARTITIONBY fieldA, _key.fieldB, _headers.fieldC

This flexibility allows you to tailor the object key to your specific needs, extracting meaningful information from Kafka messages to structure GCP Storage object keys effectively.

To enable Athena-like partitioning, use the following syntax:

INSERT INTO $container[:$prefix]
SELECT * FROM $topic
PARTITIONBY fieldA, _key.fieldB, _headers.fieldC
STOREAS `AVRO`
PROPERTIES (
    'partition.include.keys'=true,
)

Rolling Windows

Storing data in GCP Storage and partitioning it by time is a common practice in data management. For instance, you may want to organize your GCP Storage data in hourly intervals. This partitioning can be seamlessly achieved using the PARTITIONBY clause in combination with specifying the relevant time field. However, it’s worth noting that the time field typically doesn’t adjust automatically.

To address this, we offer a Kafka Connect Single Message Transformer (SMT) designed to streamline this process. You can find the transformer plugin and documentation here.

Let’s consider an example where you need the object key to include the wallclock time (the time when the message was processed) and create an hourly window based on a field called timestamp. Here’s the connector configuration to achieve this:

connector.class=io.lenses.streamreactor.connect.gcp.storage.sink.GCPStorageSinkConnector
connect.gcpstorage.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.metadata_id, _value.customer_id, _header.ts, _header.wallclock STOREAS `JSON` PROPERTIES('flush.interval'=30, 'flush.size'=1000000, 'flush.count'=5000)
topics=demo
name=demo
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
transforms=insertFormattedTs,insertWallclock
transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter
transforms.insertFormattedTs.header.name=ts
transforms.insertFormattedTs.field=timestamp
transforms.insertFormattedTs.target.type=string
transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH
transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock
transforms.insertWallclock.header.name=wallclock
transforms.insertWallclock.value.type=format
transforms.insertWallclock.format=yyyy-MM-dd-HH

In this example, the incoming Kafka message’s Value content includes a field called timestamp, represented as a long value indicating the epoch time in milliseconds. The TimestampConverter SMT will expertly convert this into a string value according to the format specified in the format.to.pattern property. Additionally, the insertWallclock SMT will incorporate the current wallclock time in the format you specify in the format property.

The PARTITIONBY clause then leverages both the timestamp field and the wallclock header to craft the object key, providing you with precise control over data partitioning.

Data Storage Format

While the STOREAS clause is optional, it plays a pivotal role in determining the storage format within GCP Storage. It’s crucial to understand that this format is entirely independent of the data format stored in Kafka. The connector maintains its neutrality towards the storage format at the topic level and relies on the key.converter and value.converter settings to interpret the data.

Supported storage formats encompass:

AVRO
Parquet
JSON
CSV (including headers)
Text
BYTES

Opting for BYTES ensures that each record is stored in its own separate object. This feature proves particularly valuable for scenarios involving the storage of images or other binary data in GCP Storage. For cases where you prefer to consolidate multiple records into a single binary object, AVRO or Parquet are the recommended choices.

By default, the connector exclusively stores the Kafka message value. However, you can expand storage to encompass the entire message, including the key, headers, and metadata, by configuring the store.envelope property as true. This property operates as a boolean switch, with the default value being false. When the envelope is enabled, the data structure follows this format:

Not supported with a custom partition strategy.

{
  "key": <the message Key, which can be a primitive or a complex object>,
  "value": <the message Key, which can be a primitive or a complex object>,
  "headers": {
    "header1": "value1",
    "header2": "value2"
  },
  "metadata": {
    "offset": 0,
    "partition": 0,
    "timestamp": 0,
    "topic": "topic"
  }
}

Utilizing the envelope is particularly advantageous in scenarios such as backup and restore or replication, where comprehensive storage of the entire message in GCP Storage is desired.

Examples

Storing the message Value Avro data as Parquet in GCP Storage:

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` 
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
...

The converter also facilitates seamless JSON to AVRO/Parquet conversion, eliminating the need for an additional processing step before the data is stored in GCP Storage.

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` 
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
...

Enabling the full message stored as JSON in GCP Storage:

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `JSON` PROPERTIES('store.envelope'=true)
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
...

Enabling the full message stored as AVRO in GCP Storage:

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
...

If the restore (see the GCP Storage Source documentation) happens on the same cluster, then the most performant way is to use the ByteConverter for both Key and Value and store as AVRO or Parquet:

...
connect.gcpstorage.kcql=INSERT INTO lensesiogcpstorage:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true)
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
...

Flush Options

The connector offers three distinct flush options for data management:

Flush by Count - triggers an object flush after a specified number of records have been written to it.
Flush by Size - initiates an object flush once a predetermined size (in bytes) has been attained.
Flush by Interval - enforces an object flush after a defined time interval (in seconds).

It’s worth noting that the interval flush is a continuous process that acts as a fail-safe mechanism, ensuring that objects are periodically flushed, even if the other flush options are not configured or haven’t reached their thresholds.

Consider a scenario where the flush size is set to 10MB, and only 9.8MB of data has been written to the object, with no new Kafka messages arriving for an extended period of 6 hours. To prevent undue delays, the interval flush guarantees that the object is flushed after the specified time interval has elapsed. This ensures the timely management of data even in situations where other flush conditions are not met.

The flush options are configured using the flush.count, flush.size, and flush.interval KCQL Properties (see KCQL Propertiessection). The settings are optional and if not specified the defaults are:

flush.count = 50_000
flush.size = 500000000 (500MB)
flush.interval = 3600 (1 hour)

A connector instance can simultaneously operate on multiple topic partitions. When one partition triggers a flush, it will initiate a flush operation for all of them, even if the other partitions are not yet ready to flush.

When connect.gcpstorage.latest.schema.optimization.enabled is set to true, it reduces unnecessary data flushes when writing to Avro or Parquet formats. Specifically, it leverages schema compatibility to avoid flushing data when messages with older but backward-compatible schemas are encountered. Consider the following sequence of messages and their associated schemas:

pgsqlCopyEditmessage1 -> schema1  
message2 -> schema1  
  (No flush needed – same schema)

message3 -> schema2  
  (Flush occurs – new schema introduced)

message4 -> schema2  
  (No flush needed – same schema)

message5 -> schema1  
  Without optimization: would trigger a flush  
  With optimization: no flush – schema1 is backward-compatible with schema2

message6 -> schema2  
message7 -> schema2  
  (No flush needed – same schema, it would happen based on the flush thresholds)

Flushing By Interval

The next flush time is calculated based on the time the previous flush completed (the last modified time of the object written to GCP Storage). Therefore, by design, the sink connector’s behaviour will have a slight drift based on the time it takes to flush records and whether records are present or not. If Kafka Connect makes no calls to put records, the logic for flushing won't be executed. This ensures a more consistent number of records per object.

Compression

AVRO and Parquet offer the capability to compress files as they are written. The GCP Storage Sink connector provides advanced users with the flexibility to configure compression options.

Here are the available options for the connect.gcpstorage.compression.codec, along with indications of their support by Avro, Parquet and JSON writers:

Compression

Avro Support

Avro (requires Level)

Parquet Support

JSON

UNCOMPRESSED

✅

SNAPPY

✅

GZIP

✅

LZ0

✅

LZ4

✅

BROTLI

✅

BZIP2

✅

ZSTD

✅

⚙️

✅

DEFLATE

✅

⚙️

✅

⚙️

Please note that not all compression libraries are bundled with the GCP Storage connector. Therefore, you may need to manually add certain libraries to the classpath to ensure they function correctly.

Authentication

The connector offers two distinct authentication modes:

Default: This mode relies on the default GCP authentication chain, simplifying the authentication process.
File: This mode uses a local (to the connect worker) path for a file containing GCP authentication credentials.
Credentials: In this mode, explicit configuration of a GCP Credentials string is required for authentication.

The simplest example to configure in the connector is the "Default" mode, as this requires no other configuration. This is configured, as its name suggests, by default.

When selecting the "Credentials" mode, it is essential to provide the necessary credentials. Alternatively, if you prefer not to configure these properties explicitly, the connector will follow the credentials retrieval order as described here.

Here's an example configuration for the "Credentials" mode:

...
connect.gcpstorage.gcp.auth.mode=Credentials
connect.gcpstorage.gcp.credentials=$GCP_CREDENTIALS
connect.gcpstorage.gcp.project.id=$GCP_PROJECT_ID
...

And here is an example configuration using the "File" mode:

...
connect.gcpstorage.gcp.auth.mode=File
connect.gcpstorage.gcp.file=/home/secure-stuff/gcp-write-credential.txt
...

Remember when using file mode the file will need to exist on every worker node in your Kafka connect cluster and be readable by the Kafka Connect process.

For enhanced security and flexibility when using the “Credentials” mode, it is highly advisable to utilize Connect Secret Providers. You can find detailed information on how to use the Connect Secret Providers here. This approach ensures robust security practices while handling access credentials.

Error policies

The connector supports Error policies.

Indexes Prefix

The connector uses the concept of index objects that it writes to in order to store information about the latest offsets for Kafka topics and partitions as they are being processed. This allows the connector to quickly resume from the correct position when restarting and provides flexibility in naming the index objects.

By default, the prefix for these index objects is named .indexes for all connectors. However, each connector will create and store its index objects within its own nested prefix inside this .indexes prefix.

You can configure the root prefix for these index objects using the property connect.gcpstorage.indexes.name. This property specifies the path from the root of the GCS bucket. Note that even if you configure this property, the connector will still create a nested prefix within the specified prefix.

Examples

Index Name (connect.gcpstorage.indexes.name)

Resulting Indexes Prefix Structure

Description

.indexes (default)

.indexes/<connector_name>/

The default setup, where each connector uses its own nested prefix within .indexes.

custom-indexes

custom-indexes/<connector_name>/

Custom prefix custom-indexes, with a nested prefix for each connector.

indexes/gcs-connector-logs

indexes/gcs-connector-logs/<connector_name>/

Uses a custom nested prefix gcs-connector-logs within indexes, with a nested prefix for each connector.

logs/indexes

logs/indexes/<connector_name>/

Indexes are stored under logs/indexes, with a nested prefix for each connector.

Option Reference

Name

Description

Type

Available Values

Default Value

connect.gcpstorage.gcp.auth.mode

Specifies the authentication mode for connecting to GCP.

string

"Credentials", "File" or "Default"

"Default"

connect.gcpstorage.gcp.credentials

For "auth.mode" credentials: GCP Authentication credentials string.

string

(Empty)

connect.gcpstorage.gcp.file

For "auth.mode" file: Local file path for file containing GCP authentication credentials.

string

(Empty)

connect.gcpstorage.gcp.project.id

GCP Project ID.

string

(Empty)

connect.gcpstorage.gcp.quota.project.id

GCP Quota Project ID.

string

(Empty)

connect.gcpstorage.endpoint

Endpoint for GCP Storage.

string

(Empty)

connect.gcpstorage.error.policy

Defines the error handling policy when errors occur during data transfer to or from GCP Storage.

string

"NOOP", "THROW", "RETRY"

"THROW"

connect.gcpstorage.max.retries

Sets the maximum number of retries the connector will attempt before reporting an error.

int

connect.gcpstorage.retry.interval

Specifies the interval (in milliseconds) between retry attempts by the connector.

int

60000

connect.gcpstorage.http.max.retries

Sets the maximum number of retries for the underlying HTTP client when interacting with GCP.

long

connect.gcpstorage.http.retry.interval

Specifies the retry interval (in milliseconds) for the underlying HTTP client.

long

connect.gcpstorage.http.retry.timeout.multiplier

Specifies the change in delay before the next retry or poll

double

3.0

connect.gcpstorage.local.tmp.directory

Enables the use of a local folder as a staging area for data transfer operations.

string

(Empty)

connect.gcpstorage.kcql

A SQL-like configuration that defines the behavior of the connector.

string

(Empty)

connect.gcpstorage.compression.codec

Sets the Parquet compression codec to be used when writing data to GCP Storage.

string

"UNCOMPRESSED", "SNAPPY", "GZIP", "LZ0", "LZ4", "BROTLI", "BZIP2", "ZSTD", "DEFLATE", "XZ"

"UNCOMPRESSED"

connect.gcpstorage.compression.level

Sets the compression level when compression is enabled for data transfer to GCP Storage.

int

1-9

(Empty)

connect.gcpstorage.seek.max.files

Specifies the maximum threshold for the number of files the connector uses.

int

connect.gcpstorage.indexes.name

Configure the indexes prefix for this connector.

string

".indexes"

connect.gcpstorage.exactly.once.enable

By setting to 'false', disable exactly-once semantics, opting instead for Kafka Connect’s native at-least-once offset management

boolean

true, false

true

connect.gcpstorage.schema.change.detector

Configure how the file will roll over upon receiving a record with a schema different from the accumulated ones. This property configures schema change detection with default (object equality), version (version field comparison), or compatibility (Avro compatibility checking).

string

default, version, compatibility

default

connect.gcpstorage.skip.null.values

Skip records with null values (a.k.a. tombstone records).

boolean

true, false

false

connect.gcpstorage.latest.schema.optimization.enabled

When set to true, reduces unnecessary data flushes when writing to Avro or Parquet formats. Specifically, it leverages schema compatibility to avoid flushing data when messages with older but backward-compatible schemas are encountered.

boolean

true, false

false

PreviousElasticsearch NextHTTP

Last updated 8 months ago

Was this helpful?

Good afternoon

Connector Class

Example

KCQL Support

Target Bucket and Path

SQL Projection

Source Topic

KCQL Properties

Partitioning & Object Keys

Rolling Windows

Data Storage Format

Examples

Flush Options

Flushing By Interval

Compression

Authentication

Error policies

Indexes Prefix

Examples

Option Reference