Concepts
Learn the core concepts and glossary of Lenses Kafka to Kafka.
The K2K app orchestrates a robust, configurable pipeline for replicating and transforming Kafka data between clusters, with strong support for exactly-once semantics, flexible mapping, schema management, and operational observability.
Stages
K2K comprises two stages, wrapped into a pipeline:
Processing Stages:
Modular stages for header handling, schema mapping, routing, serialization, and finalization.
Sink Stage:
Writes processed records to the target Kafka cluster, supporting both at-least-once and exactly-once semantics.
Before the stages are run, the app loads the pipeline definition from the configuration, validates it, and extracts all necessary settings for source, target, and coordination.
The configuration file path is passed as an argument to the binary.
k2k start -f my-pipeline.yaml
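The documented configuration describes source, target, coordination, and replication sections. A minimal pipeline file might look like the sketch below; any key names beyond those four documented sections are illustrative assumptions, not the authoritative schema:

```yaml
# Hypothetical minimal K2K pipeline definition. The four top-level
# sections (source, target, coordination, replication) are documented;
# the nesting beneath them is assumed for illustration.
source:
  kafka:
    bootstrap.servers: "source-broker:9092"
target:
  kafka:
    bootstrap.servers: "target-broker:9092"
coordination:
  kafka:
    commit.group: "my-pipeline"   # unique per pipeline
replication:
  topics:
    - orders
```

Consult the configuration reference for the exact schema before deploying.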
Kafka Clients
Naturally, K2K utilizes several Kafka clients, both consumers and producers.
Kafka Consumers:
Source Consumer: Reads records from the source Kafka cluster.
Control Consumer: Used for coordination and control messages (e.g., offset management), on the source cluster.
Kafka Producers:
Target Producer: Writes records to the target Kafka cluster, supporting both transactional (exactly-once) and non-transactional modes.
How do I configure the Kafka clients?
You can override the consumer and producer properties in the coordination, source and target sections. See Configuration.
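As a sketch of what such overrides could look like, the fragment below places standard Kafka client properties under each section. The property names (`session.timeout.ms`, `fetch.min.bytes`, `acks`, `linger.ms`) are standard Kafka consumer/producer settings; the exact nesting within each K2K section is an assumption:

```yaml
# Sketch: overriding standard Kafka client properties per section.
# The consumer/producer sub-keys are assumptions for illustration.
coordination:
  kafka:
    consumer:
      session.timeout.ms: 30000
source:
  kafka:
    consumer:
      fetch.min.bytes: 1048576   # batch fetches for throughput
target:
  kafka:
    producer:
      acks: all                  # durability on the target cluster
      linger.ms: 20
```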
Partition and Topic Mapping
Partition Mapper: Determines how partitions are mapped from source to target.
Topic Mapper: Handles topic name mapping and routing logic.
How do I configure which topics are to be replicated?
The topics to be replicated from the source cluster are specified in the replication section. See the configuration reference for more details.
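For orientation, a replication section selecting a few topics might look like the following; the exact shape of the section is an assumption, so treat this as a sketch rather than the authoritative schema:

```yaml
# Sketch: selecting source topics to replicate. Key names under
# "replication" are illustrative assumptions.
replication:
  topics:
    - orders
    - payments
```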
Offset Management
A critical aspect of the replicator's reliability and performance is its internal offset management strategy. Instead of committing consumer offsets back to the source cluster (the default behaviour for a standard Kafka consumer group), the replicator tracks them independently, using a dedicated, compacted Kafka topic within the target cluster as its backing store.
This deliberate design choice is foundational to the replicator's core features and provides several fundamental advantages:
Decoupling from the Source Cluster
Replication tools should be passive observers of the source cluster, imposing minimal performance overhead. By writing offset information exclusively to the target cluster, the replicator avoids generating any write traffic, network latency, or additional load on the source brokers. This "read-only" approach minimizes the impact on the source cluster and allows a restricted Kafka connection that permits only read operations.
Enabling True Exactly-Once Semantics (EOS)
The exactly-once delivery guarantee is achieved using Kafka's atomic transaction capabilities. A transaction allows an application to group multiple write operations into a single, all-or-nothing unit. For the replicator, this means it can perform two actions atomically:
Produce the replicated messages to the target topic.
Commit the source topic offsets that have been successfully processed.
Since Kafka transactions are scoped to a single cluster, this atomicity is only possible if both the resulting data (the replicated messages) and the offset data reside on the same cluster. By managing offsets on the target cluster, the replicator can wrap the entire read-process-write cycle in a single transaction. If the replicator fails at any point after writing messages but before committing the offset, the transaction is aborted, ensuring each message is processed exactly once.
Optimizing for Write Performance and Deployment
All write operations from the replicator—including replicated messages and its own internal offsets—are directed to the target cluster. Kafka client writes are inherently sensitive to network latency, which makes deployment topology a key factor for performance and stability.
For this reason, a critical best practice is to deploy the replicator geographically and logically close to its target cluster. High latency between the replicator and the target can lead to request timeouts. While the Kafka client's built-in retry mechanisms provide resilience, frequent timeouts can degrade throughput and increase replication lag.
Coordination and Assignment
Coordination runs against the source Kafka cluster. To override its consumer settings, set standard Kafka consumer properties under the coordination configuration entry.
Assignment management handles partition assignment, fencing, and exclusive access for exactly-once semantics. K2K utilizes up to three internal Kafka topics, created on the target cluster, for state coordination, progress tracking, and metrics reporting. The names and properties of these topics are configurable.
__k2k-assignment
Manages work distribution and leader election between multiple K2K instances (runners). This ensures each source partition is owned and processed by only one runner at a time, which is fundamental to the application's scalable and fault-tolerant design.
This topic is used for coordination in all multi-instance deployments. Its name is configured via the coordination.kafka.assignment.topic property.
__k2k_consumer-offsets
Stores consumed offsets from the source cluster. This internal offset management mechanism is foundational to the application's recovery process and its implementation of exactly-once semantics (EOS).
This topic is used in all deployments for progress tracking and stateful recovery. Its name is configured via the coordination.kafka.commit.topic property.
__k2k-metrics
Receives operational metrics from the application when the OpenTelemetry Kafka exporter is active.
This topic is used only when the OTEL_METRICS_EXPORTER environment variable is set to kafka. The topic name is configured via the relevant OpenTelemetry settings.
These topics can be provisioned in two ways:
Manual Creation: The topics can be created manually on the target Kafka cluster prior to application startup. This approach allows for precise administrative control over partition counts, replication factors, and retention policies.
Automatic Creation: Alternatively, K2K can create these topics automatically at runtime. This requires the auto.create.topics.enable=true setting to be active on the Kafka brokers.
Configuration and Sharing
Customization: The default names and any standard Kafka topic-level properties (e.g., partitions, cleanup.policy, retention.ms) for these internal topics are customizable within the K2K YAML configuration file.
Sharing: To reduce resource utilization on the Kafka cluster, the same set of internal topics can be shared by multiple K2K pipeline instances. When sharing topics, each distinct pipeline must be configured with a unique coordination.kafka.commit.group to ensure complete state isolation between them.
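The sharing arrangement can be sketched as follows. The coordination.kafka.assignment.topic, coordination.kafka.commit.topic, and coordination.kafka.commit.group keys are documented above; the surrounding structure is an illustrative assumption:

```yaml
# pipeline-a.yaml — shares internal topics, owns a unique commit group.
coordination:
  kafka:
    assignment.topic: "__k2k-assignment"
    commit.topic: "__k2k_consumer-offsets"
    commit.group: "pipeline-a"

# pipeline-b.yaml would reuse the same two topic names but must set
# coordination.kafka.commit.group to a distinct value, e.g. "pipeline-b",
# so each pipeline's state remains isolated.
```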