This page describes the usage of the Stream Reactor AWS S3 Source Connector.
This connector is also available on the AWS Marketplace.
Files that have been archived to the AWS Glacier storage class are skipped; to load these files you must manually restore them first. Skipped files are logged in the Connect worker log files.
For more examples see the tutorials.
You can specify multiple KCQL statements, separated by ;, to have the connector write to multiple topics.
The connector uses a SQL-like syntax to configure the connector behaviour. The full KCQL syntax is:
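A sketch of the full grammar is shown below; square brackets denote optional clauses, and the exact set and ordering of optional clauses may vary between connector versions:

INSERT INTO $kafka-topic
SELECT *
FROM bucketAddress:pathPrefix
[BATCH = batch]
[STOREAS storage_format]
[LIMIT limit]
[PROPERTIES('property.1'=x, 'property.2'=x)]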
Please note that you can employ escaping within KCQL for the INSERT INTO, SELECT * FROM, and PARTITIONBY clauses when necessary. For example, if you need to use a topic name that contains a hyphen, you can escape it as follows:
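A sketch, using a hypothetical topic my-topic-name and bucket mybucket; the topic name is wrapped in backticks to escape the hyphen:

INSERT INTO `my-topic-name` SELECT * FROM mybucket:prefix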
The S3 source location is defined within the FROM clause. The connector will read all files from the given location considering the data partitioning and ordering options. Each data partition will be read by a single connector task.
The FROM clause format is:
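In outline (the bucket name and optional path prefix below are placeholders):

FROM bucketAddress:pathPrefix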
If your data in AWS was not written by the Lenses AWS sink, set the connector to traverse the folder hierarchy in the bucket and load files based on their last modified timestamp:
connect.s3.source.partition.extractor.regex=none
connect.s3.source.ordering.type=LastModified
To load in alphanumeric order, set the ordering type to AlphaNumeric:
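For example, keeping the property name used above:

connect.s3.source.ordering.type=AlphaNumeric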
The target Kafka topic is specified via the INSERT INTO clause. The connector will write all the records to the given topic:
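For example, to write all records into a hypothetical topic named my-topic:

INSERT INTO `my-topic` SELECT * FROM mybucket:prefix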
The connector supports a range of storage formats, each with its own distinct functionality:
JSON: The connector will read files containing JSON content, each line representing a distinct record.
Avro: The connector will read Avro-stored messages from S3 and translate them into Kafka’s native format.
Parquet: The connector will read Parquet-stored messages from S3 and translate them into Kafka’s native format.
Text: The connector will read files containing lines of text, each line representing a distinct record.
CSV: The connector will read files containing comma-separated values, each line representing a distinct record.
CSV_WithHeaders: The connector will read files containing comma-separated values, skipping the header row; each subsequent line represents a distinct record.
Bytes: The connector will read files containing raw bytes; each file is translated into a single Kafka message.
Use the STOREAS clause to configure the storage format. The following options are available:
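The available STOREAS values correspond to the formats listed above: JSON, Avro, Parquet, Text, CSV, CSV_WithHeaders and Bytes. For example, to read Avro files (topic, bucket and prefix names are placeholders):

INSERT INTO `my-topic` SELECT * FROM mybucket:prefix STOREAS `AVRO`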
When using Text storage, the connector provides additional configuration options to finely control how text content is processed.
In Regex mode, the connector applies a regular expression pattern, and only when a line matches the pattern is it considered a record. For example, to include only lines that start with a number, you can use the following configuration:
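A sketch of such a statement, using the read.text.* keys from the PROPERTIES table further down; the pattern itself is illustrative:

INSERT INTO `my-topic` SELECT * FROM mybucket:prefix STOREAS `Text` PROPERTIES('read.text.mode'='Regex', 'read.text.regex'='^[0-9].*')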
In Start-End Line mode, the connector reads text content between specified start and end lines, inclusive. This mode is useful when you need to extract records that fall within defined boundaries. For instance, to read records where the first line is ‘SSM’ and the last line is an empty line (’’), you can configure it as follows:
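A sketch of such a statement (topic, bucket and prefix names are placeholders):

INSERT INTO `my-topic` SELECT * FROM mybucket:prefix STOREAS `Text` PROPERTIES('read.text.mode'='StartEndLine', 'read.text.start.line'='SSM', 'read.text.end.line'='')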
To trim the start and end lines, set the read.text.trim property to true:
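Adding the trim flag to the same statement:

INSERT INTO `my-topic` SELECT * FROM mybucket:prefix STOREAS `Text` PROPERTIES('read.text.mode'='StartEndLine', 'read.text.start.line'='SSM', 'read.text.end.line'='', 'read.text.trim'=true)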
In Start-End Tag mode, the connector reads text content between specified start and end tags, inclusive. This mode is particularly useful when a single line of text in S3 corresponds to multiple output Kafka messages. For example, to read XML records enclosed between given start and end tags, configure it as follows:
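A sketch using hypothetical <record> start and end tags:

INSERT INTO `my-topic` SELECT * FROM mybucket:prefix STOREAS `Text` PROPERTIES('read.text.mode'='StartEndTag', 'read.text.start.tag'='<record>', 'read.text.end.tag'='</record>')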
Depending on the storage format of Kafka topics’ messages, the need for replication to a different cluster, and the specific data analysis requirements, there exists a guideline on how to effectively utilize converters for both sink and source operations. This guidance aims to optimize performance and minimize unnecessary CPU and memory usage.
Currently, the connector does not offer support for SQL projection; consequently, anything other than a SELECT * query is disregarded. The connector will faithfully write all the record fields to Kafka exactly as they are.
The S3 sink employs zero-padding in file names to ensure precise ordering, leveraging optimizations offered by the S3 API, guaranteeing the accurate sequence of files.
When using the S3 source alongside the S3 sink, the connector can adopt the same ordering method, ensuring data processing follows the correct chronological order. However, there are scenarios where S3 data is generated by applications that do not maintain lexical file name order.
In such cases, to process files in the correct sequence, the source needs to list all files in the bucket and sort them based on their last modified timestamp. To enable this behavior, set connect.s3.source.ordering.type to LastModified. This ensures that the source correctly arranges and processes the data based on the timestamps of the files.
The number of file names the source reads from S3 in a single poll can also be limited. The default value, if not specified, is 1000:
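A sketch, assuming the BATCH clause from the KCQL grammar sketched above controls this limit:

INSERT INTO `my-topic` SELECT * FROM mybucket:prefix BATCH = 500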
To limit the number of result rows returned from the source in a single poll operation, you can use the LIMIT clause. The default value, if not specified, is 10000.
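For example:

INSERT INTO `my-topic` SELECT * FROM mybucket:prefix LIMIT 5000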
The AWS S3 Source Connector allows you to filter the files to be processed based on their extensions. This is controlled by two properties: connect.s3.source.extension.excludes and connect.s3.source.extension.includes.
The connect.s3.source.extension.excludes property is a comma-separated list of file extensions to exclude from the source file search. If this property is not configured, all files are considered. For example, to exclude .txt and .csv files, you would set this property as follows:
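A sketch (whether the leading dot is required may depend on the connector version; shown here without it):

connect.s3.source.extension.excludes=txt,csv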
The connect.s3.source.extension.includes property is a comma-separated list of file extensions to include in the source file search. If this property is not configured, all files are considered. For example, to include only .json and .xml files, you would set this property as follows:
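Similarly, a sketch:

connect.s3.source.extension.includes=json,xml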
Note: If both connect.s3.source.extension.excludes and connect.s3.source.extension.includes are set, the connector first applies the exclusion filter and then the inclusion filter.
The PROPERTIES clause is optional and adds a layer of configuration to the connector. It enhances versatility by permitting the application of multiple configurations (delimited by ‘,’). The following properties are supported:
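Its general shape, with placeholder keys and values, is:

PROPERTIES('property.1'='value1', 'property.2'='value2')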
The connector offers two distinct authentication modes:
Default: This mode relies on the default AWS authentication chain, simplifying the authentication process.
Credentials: In this mode, explicit configuration of AWS Access Key and Secret Key is required for authentication.
When selecting the “Credentials” mode, it is essential to provide the necessary access key and secret key properties. Alternatively, if you prefer not to configure these properties explicitly, the connector will fall back to the default AWS credentials retrieval chain.
Here’s an example configuration for the “Credentials” mode:
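A sketch; the key and region values are placeholders and should ideally be resolved through a secret provider, as noted below:

connect.s3.aws.auth.mode=Credentials
connect.s3.aws.access.key=$AWS_ACCESS_KEY_ID
connect.s3.aws.secret.key=$AWS_SECRET_ACCESS_KEY
connect.s3.aws.region=eu-west-1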
For enhanced security and flexibility when using the “Credentials” mode, it is highly advisable to utilize Connect Secret Providers. This approach ensures robust security practices while handling access credentials.
The connector can also be used against S3 API-compatible systems, provided they implement the following:
S3 Storage Format | Kafka Output Format | Restore or replicate cluster | Analytics | Sink Converter | Source Converter |
---|---|---|---|---|---|
JSON | STRING | Same, Other | Yes, No | StringConverter | StringConverter |
AVRO, Parquet | STRING | Same, Other | Yes | StringConverter | StringConverter |
AVRO, Parquet | STRING | Same, Other | No | ByteArrayConverter | ByteArrayConverter |
JSON | JSON | Same, Other | Yes | JsonConverter | StringConverter |
JSON | JSON | Same, Other | No | StringConverter | StringConverter |
AVRO, Parquet | JSON | Same, Other | Yes, No | JsonConverter | JsonConverter or Avro Converter (Glue, Confluent) |
AVRO, Parquet, JSON | BYTES | Same, Other | Yes, No | ByteArrayConverter | ByteArrayConverter |
AVRO, Parquet | AVRO | Same | Yes | Avro Converter (Glue, Confluent) | Avro Converter (Glue, Confluent) |
AVRO, Parquet | AVRO | Same | No | ByteArrayConverter | ByteArrayConverter |
AVRO, Parquet | AVRO | Other | Yes, No | Avro Converter (Glue, Confluent) | Avro Converter (Glue, Confluent) |
AVRO, Parquet | Protobuf | Same | Yes | Protobuf Converter (Glue, Confluent) | Protobuf Converter (Glue, Confluent) |
AVRO, Parquet | Protobuf | Same | No | ByteArrayConverter | ByteArrayConverter |
AVRO, Parquet | Protobuf | Other | Yes, No | Protobuf Converter (Glue, Confluent) | Protobuf Converter (Glue, Confluent) |
AVRO, Parquet, JSON | Other | Same, Other | Yes, No | ByteArrayConverter | ByteArrayConverter |
Name | Description | Type | Available Values |
---|---|---|---|
read.text.mode | Controls how Text content is read | Enum | Regex, StartEndTag, StartEndLine |
read.text.regex | Regular Expression for Text Reading (if applicable) | String | |
read.text.start.tag | Start Tag for Text Reading (if applicable) | String | |
read.text.end.tag | End Tag for Text Reading (if applicable) | String | |
read.text.buffer.size | Text Buffer Size (for optimization) | Int | |
read.text.start.line | Start Line for Text Reading (if applicable) | String | |
read.text.end.line | End Line for Text Reading (if applicable) | String | |
read.text.trim | Trim Text During Reading | Boolean | |
store.envelope | Messages are stored as “Envelope” | Boolean | |
Name | Description | Type | Available Values | Default Value |
---|---|---|---|---|
connect.s3.aws.auth.mode | Specifies the AWS authentication mode for connecting to S3. | string | "Credentials", "Default" | "Default" |
connect.s3.aws.access.key | Access Key for AWS S3 Credentials | string | | |
connect.s3.aws.secret.key | Secret Key for AWS S3 Credentials | string | | |
connect.s3.aws.region | AWS Region for S3 Bucket | string | | |
connect.s3.pool.max.connections | Maximum Connections in the Connection Pool | int | -1 (undefined) | 50 |
connect.s3.custom.endpoint | Custom Endpoint URL for S3 (if applicable) | string | | |
connect.s3.kcql | Kafka Connect Query Language (KCQL) Configuration to control the connector behaviour | string | | |
connect.s3.vhost.bucket | Enable Virtual Hosted-style Buckets for S3 | boolean | true, false | false |
connect.s3.source.extension.excludes | A comma-separated list of file extensions to exclude from the source file search; see [file extension filtering]({{< relref "#file-extension-filtering" >}}). | string | | |
connect.s3.source.extension.includes | A comma-separated list of file extensions to include in the source file search; see [file extension filtering]({{< relref "#file-extension-filtering" >}}). | string | | |
connect.s3.source.partition.extractor.type | Type of Partition Extractor (Hierarchical or Regex) | string | hierarchical, regex | |
connect.s3.source.partition.extractor.regex | Regex Pattern for Partition Extraction (if applicable) | string | | |
connect.s3.ordering.type | Type of ordering for the S3 file names to ensure the processing order. | string | AlphaNumeric, LastModified | AlphaNumeric |
connect.s3.source.partition.search.continuous | If set to true the connector will continuously search for new partitions. | boolean | true, false | true |
connect.s3.source.partition.search.excludes | A comma-separated list of paths to exclude from the partition search. | string | | ".indexes" |
connect.s3.source.partition.search.interval | The interval in milliseconds between searching for new partitions. | long | | 300000 |
connect.s3.source.partition.search.recurse.levels | Controls how many levels deep to recurse when searching for new partitions | int | | 0 |