Concepts

Learn the core concepts and glossary of Lenses Kafka to Kafka.

The K2K app orchestrates a robust, configurable pipeline for replicating and transforming Kafka data between clusters, with strong support for exactly-once semantics, flexible mapping, schema management, and operational observability.

Stages

K2K is composed of two stages, wrapped into a pipeline:

  • Processing Stages:

    • Modular stages for header handling, schema mapping, routing, serialization, and finalization.

  • Sink Stage:

    • Writes processed records to the target Kafka cluster, supporting both at-least-once and exactly-once semantics.

Before the stages are run, the app loads the pipeline definition from the configuration, validates it, and extracts all necessary settings for source, target, and coordination.

The configuration file path is passed as an argument to the binary.

k2k start -f my-pipeline.yaml
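
A pipeline definition is a YAML document with sections for the source, target, replication, and coordination settings. The sketch below is illustrative only: the section names come from this page, but the keys inside them are assumptions, so see Configuration for the authoritative schema.

# my-pipeline.yaml (illustrative sketch, not a complete reference)
source:
  kafka:
    bootstrap.servers: source-cluster:9092   # assumed key layout
target:
  kafka:
    bootstrap.servers: target-cluster:9092   # assumed key layout
replication:
  # which topics to copy; see "Partition and Topic Mapping" below
coordination:
  kafka:
    commit.group: my-pipeline                # isolates this pipeline's state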

Kafka Clients

Naturally, K2K utilizes several Kafka clients, both consumers and producers.

  • Kafka Consumers:

    • Source Consumer: Reads records from the source Kafka cluster.

    • Control Consumer: Used for coordination and control messages (e.g., offset management) on the source cluster.

  • Kafka Producers:

    • Target Producer: Writes records to the target Kafka cluster, supporting both transactional (exactly-once) and non-transactional modes.

How do I configure the Kafka clients?

You can override the consumer and producer properties in the coordination, source and target sections. See Configuration.
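
For example, standard Kafka client properties can be supplied as overrides. The property names below are standard Kafka consumer and producer settings; the exact nesting under each section is an assumption, so refer to Configuration.

source:
  kafka:
    consumer:
      max.poll.records: "500"      # standard Kafka consumer property
target:
  kafka:
    producer:
      compression.type: zstd       # standard Kafka producer property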

Partition and Topic Mapping

  • Partition Mapper: Determines how partitions are mapped from source to target.

  • Topic Mapper: Handles topic name mapping and routing logic.

How do I configure which topics are to be replicated?

The topics to be replicated from the source cluster are specified in the replication section. See Configuration for more details.
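
A hypothetical example (the replication section name is from this page; the key underneath it is an assumption):

replication:
  topics:        # assumed key name; see Configuration
    - orders
    - payments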

Offset Management

A critical aspect of the replicator's reliability and performance is its internal offset management strategy. Instead of committing consumer offsets back to the source cluster (the default behaviour for a standard Kafka consumer group), the replicator tracks them independently, using a dedicated, compacted Kafka topic within the target cluster as its backing store.

This deliberate design choice is foundational to the replicator's core features and provides several fundamental advantages:

Decoupling from the Source Cluster

Replication tools should be passive observers of the source cluster, imposing minimal performance overhead. By writing offset information exclusively to the target cluster, the replicator adds no write traffic or extra load to the source brokers. This "read-only" approach minimizes the impact on the source cluster and allows K2K to operate over a restricted Kafka connection that permits only read operations.

Enabling True Exactly-Once Semantics (EOS)

The exactly-once delivery guarantee is achieved using Kafka's atomic transaction capabilities. A transaction allows an application to group multiple write operations into a single, all-or-nothing unit. For the replicator, this means it can perform two actions atomically:

  1. Produce the replicated messages to the target topic.

  2. Commit the source topic offsets that have been successfully processed.

Since Kafka transactions are scoped to a single cluster, this atomicity is only possible if both the resulting data (the replicated messages) and the offset data reside on the same cluster. By managing offsets on the target cluster, the replicator can wrap the entire read-process-write cycle in a single transaction. If the replicator fails at any point after writing messages but before committing the offset, the transaction is aborted, ensuring each message is processed exactly once.
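
Conceptually, this maps onto the standard transactional API of the Kafka Java producer. The following is a minimal sketch, not K2K's actual implementation: the topic names, the offset record encoding, and the transactional.id are placeholders, but the atomic pairing of the two writes is the mechanism described above.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

import java.util.Properties;

public class TransactionalReplicationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // All writes target the destination cluster; offsets live there too.
        props.put("bootstrap.servers", "target-cluster:9092");
        // A stable transactional.id lets Kafka fence zombie producer instances.
        props.put("transactional.id", "k2k-runner-0");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // Dummy payloads standing in for a record consumed from the source cluster.
        byte[] key = "order-1".getBytes();
        byte[] value = "payload".getBytes();
        byte[] offsetKey = "source-topic/0".getBytes(); // hypothetical encoding
        byte[] offsetValue = "42".getBytes();           // hypothetical encoding

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                // 1. Produce the replicated message to the target topic.
                producer.send(new ProducerRecord<>("orders", key, value));
                // 2. Record the processed source offset in the internal,
                //    compacted commit topic on the same (target) cluster.
                producer.send(new ProducerRecord<>("__k2k_consumer-offsets",
                        offsetKey, offsetValue));
                // Both writes become visible atomically, or not at all.
                producer.commitTransaction();
            } catch (KafkaException e) {
                // On failure, neither the message nor the offset is committed,
                // so the record will be re-read and replicated again safely.
                producer.abortTransaction();
            }
        }
    }
}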

Optimizing for Write Performance and Deployment

All write operations from the replicator—including replicated messages and its own internal offsets—are directed to the target cluster. Kafka client writes are inherently sensitive to network latency, which makes deployment topology a key factor for performance and stability.

For this reason, a critical best practice is to deploy the replicator geographically and logically close to its target cluster. High latency between the replicator and the target can lead to request timeouts. While the Kafka client's built-in retry mechanisms provide resilience, frequent timeouts can have negative consequences.
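
Where some latency is unavoidable, the standard Kafka producer timeout settings can be raised through the target client overrides. The property names below are standard Kafka producer configuration; the YAML nesting is an assumption, so check Configuration for the exact structure.

target:
  kafka:
    producer:                        # assumed nesting; see Configuration
      request.timeout.ms: "60000"    # standard Kafka producer property
      delivery.timeout.ms: "180000"  # must exceed request.timeout.ms plus linger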

Coordination and Assignment

Assignment management handles partition assignment, fencing, and exclusive access for exactly-once semantics. K2K utilizes up to three internal Kafka topics, created on the target cluster, for state coordination, progress tracking, and metrics reporting. The names and properties of these topics are configurable.

  • __k2k-assignment

    • Purpose: Manages work distribution and leader election between multiple K2K instances (runners). This ensures each source partition is owned and processed by only one runner at a time, which is fundamental to the application's scalable and fault-tolerant design.

    • Usage context: Used for coordination in all multi-instance deployments. Its name is configured via the coordination.kafka.assignment.topic property.

  • __k2k_consumer-offsets

    • Purpose: Stores consumed offsets from the source cluster. This internal offset management mechanism is foundational to the application's recovery process and its implementation of exactly-once semantics (EOS).

    • Usage context: Used in all deployments for progress tracking and stateful recovery. Its name is configured via the coordination.kafka.commit.topic property.

  • __k2k-metrics

    • Purpose: Receives operational metrics from the application when the OpenTelemetry Kafka exporter is active.

    • Usage context: Used only when the OTEL_METRICS_EXPORTER environment variable is set to kafka. The topic name is configured via the relevant OpenTelemetry settings.

These topics can be provisioned in two ways:

  • Manual Creation: The topics can be created manually on the target Kafka cluster prior to application startup. This approach allows for precise administrative control over partition counts, replication factors, and retention policies; see the example after this list.

  • Automatic Creation: Alternatively, K2K can create these topics automatically at runtime. This requires the auto.create.topics.enable=true setting to be active on the Kafka brokers.
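
For manual creation, the standard Kafka CLI works. For example, the commit topic described above is compacted, so it could be created like this (the partition count and replication factor are placeholders to adjust for your environment):

kafka-topics.sh --bootstrap-server target-cluster:9092 \
  --create --topic __k2k_consumer-offsets \
  --partitions 1 --replication-factor 3 \
  --config cleanup.policy=compact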

Configuration and Sharing

  • Customization: The default names and any standard Kafka topic-level properties (e.g., partitions, cleanup.policy, retention.ms) for these internal topics are customizable within the K2K YAML configuration file.

  • Sharing: To reduce resource utilization on the Kafka cluster, the same set of internal topics can be shared by multiple K2K pipeline instances. When sharing topics, each distinct pipeline must be configured with a unique coordination.kafka.commit.group to ensure complete state isolation between them, as sketched below.
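
For instance, two pipelines sharing the same commit topic might differ only in their commit group. The YAML nesting shown is an assumption derived from the dotted property names on this page.

# pipeline-a.yaml
coordination:
  kafka:
    commit.topic: __k2k_consumer-offsets   # shared across pipelines
    commit.group: pipeline-a               # unique per pipeline

# pipeline-b.yaml
coordination:
  kafka:
    commit.topic: __k2k_consumer-offsets
    commit.group: pipeline-b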
