April 28, 2022

Exploring Kafka Connect

Middleware

Apache Kafka is arguably the most robust, scalable, and highest-performing message streaming platform on the market today. Capable of handling millions of messages per second, Apache Kafka sits at the heart of many enterprise-scale data warehouses. But how do organizations connect Kafka to the disparate data sources and storage destinations that make up enterprise data systems? Kafka Connect.

In this blog, we give an overview of Apache Kafka Connect, how it works, why it's used, common use cases, and how it's used in Kubernetes.

Apache Kafka Connect: Overview

There is no doubt that, in and of itself, Apache Kafka is an impressive piece of technology, but what we do with the data traversing the cluster is where the magic really happens. Where is the data coming from? Where is it going, and what happens to it along the way? At the heart of the answer to these questions is Kafka Connect.

What Is Kafka Connect?

Kafka Connect provides a platform to reliably stream data between Apache Kafka and external data sources and destinations.

With its core concepts of Source and Sink connectors, Kafka Connect is an open source project that provides a centralized hub for basic data integration between data platforms such as databases, index engines, file stores, and key-value repositories. Kafka Connect provides a pluggable solution to integrate these data platforms by allowing developers to share and reuse connectors.

Diagram: simplified Kafka Connect architecture

Kafka Connectors

At the core of Kafka Connect are the connectors. Kafka connectors are ready-made components that define the data sources and destinations for external systems. Connectors are divided into Source Connectors and Sink Connectors: existing connectors can be used for common data sources or sinks, or new connectors can be developed for less common external systems.

Many connectors are community maintained while others are Confluent proprietary connectors, but most popular data platforms, such as Elasticsearch, S3, MongoDB, MySQL, and Redis, have an existing connector. Most community-available connectors can be found on a project’s Git repository or project website. For instance, the MongoDB connector can be found here. Most connectors are a short Google search away using the project name and “Kafka connector.”

Kafka Source Connectors

A Source Connector defines which data systems to collect data from; these can be databases, real-time data streams, message brokers, or application metrics. Once defined, the Source Connector connects to the source data platform and makes the data accessible to Kafka topics for stream processing.
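To make this concrete, here is a minimal sketch of registering a source connector with a running Connect worker through its REST API. It uses the FileStreamSourceConnector that ships with Apache Kafka; the connector name, file path, and topic are illustrative placeholders, and the sketch assumes a worker listening on the default port 8083.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterSourceConnector {
    public static void main(String[] args) throws Exception {
        // Illustrative connector config: tail a local file into a Kafka topic.
        String payload = """
            {
              "name": "demo-file-source",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "tasks.max": "1",
                "file": "/tmp/source-data.txt",
                "topic": "demo-topic"
              }
            }
            """;

        // POST the connector definition to the Connect worker's REST API.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once registered, Connect stores this configuration and spins up the task itself; updating or removing the connector is just another REST call.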

Kafka Sink Connectors

A Sink Connector, on the other hand, defines the destination data platform, or the data’s endpoint. Again, these endpoints can be any number of data platforms, such as index engines, other databases, or file stores.
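For comparison, a sink connector configuration looks nearly identical; it just points the data the other way. The sketch below uses the FileStreamSinkConnector bundled with Apache Kafka, with illustrative topic and file values, and would be registered through the same REST call shown in the source example above.

```java
import java.util.Map;

public class SinkConnectorConfig {
    public static void main(String[] args) {
        // Same registration flow as the source example; only the config changes.
        Map<String, String> config = Map.of(
                "connector.class", "org.apache.kafka.connect.file.FileStreamSinkConnector",
                "tasks.max", "1",
                "topics", "demo-topic",           // sink connectors consume from one or more topics
                "file", "/tmp/sink-output.txt"    // destination file the connector writes to
        );
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```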

How Does Kafka Connect Work?

Kafka Connect works by implementing the Kafka Connect API. To implement a connector, we extend SourceConnector or SinkConnector and then implement a corresponding SourceTask or SinkTask in the connector code. These define a connector’s task implementation as well as its parameters, connection details, and Kafka topic information. Once these are configured, Kafka Connect manages the tasks, freeing us from having to manage task instances and implementation ourselves.
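As a rough illustration, the following is a stripped-down source connector and task written against the Kafka Connect Java API. The HeartbeatSourceConnector name, its single "topic" setting, and the record it emits are all invented for this sketch; a real connector would poll an external system instead of sleeping and returning a constant value.

```java
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Minimal SourceConnector: declares its config and which Task class does the work.
public class HeartbeatSourceConnector extends SourceConnector {
    private Map<String, String> settings;

    @Override public String version() { return "0.1.0"; }
    @Override public void start(Map<String, String> props) { this.settings = props; }
    @Override public Class<? extends Task> taskClass() { return HeartbeatSourceTask.class; }
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Hand each task the same settings; real connectors partition work here.
        return Collections.nCopies(Math.max(1, maxTasks), settings);
    }
    @Override public void stop() { }
    @Override public ConfigDef config() {
        return new ConfigDef().define("topic", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "Target Kafka topic");
    }
}

// The task produces the records Kafka Connect writes to the topic.
// In a real plugin this would be a public class in its own file.
class HeartbeatSourceTask extends SourceTask {
    private String topic;

    @Override public String version() { return "0.1.0"; }
    @Override public void start(Map<String, String> props) { this.topic = props.get("topic"); }
    @Override public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000); // throttle the example
        SourceRecord record = new SourceRecord(
                Collections.singletonMap("source", "heartbeat"),                 // source partition
                Collections.singletonMap("offset", System.currentTimeMillis()),  // source offset
                topic, Schema.STRING_SCHEMA, "heartbeat");
        return Collections.singletonList(record);
    }
    @Override public void stop() { }
}
```

Once a class like this is packaged as a plugin and placed on the worker’s plugin path, Kafka Connect takes care of instantiating the tasks, distributing them across workers, and tracking offsets.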

Kafka Connect Use Cases

Some common use cases for Kafka Connect include ingesting application metrics from multiple sources into a single data lake, collecting click-stream activity across multiple web platforms into a single activity view, or simply moving data from one database to another. In all of these use cases, Kafka Connect gives us a platform to reuse connectors across the entire enterprise.

Another popular use case for Kafka Connect is the role it plays in schema management for data warehouses. Kafka Connect converters can be used to collect schema information from different connectors and serialize data into standardized formats such as Protobuf, Avro, or JSON Schema. Converters are separate from the connectors themselves, so they can be reused across multiple connectors. For instance, you could use the same converter for an HDFS sink as well as an S3 sink.
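As a sketch of what this looks like in practice, the settings below pair the plain string converter from Apache Kafka for keys with Confluent’s Avro converter for record values. The Avro converter and its schema.registry.url property come from Confluent’s Schema Registry distribution rather than Apache Kafka itself, and the registry URL is an illustrative placeholder.

```java
import java.util.Map;

public class ConverterSettings {
    public static void main(String[] args) {
        // Converter settings can live in the worker config or be overridden per connector.
        Map<String, String> converterConfig = Map.of(
                "key.converter", "org.apache.kafka.connect.storage.StringConverter",
                "value.converter", "io.confluent.connect.avro.AvroConverter",
                "value.converter.schema.registry.url", "http://schema-registry:8081"
        );
        converterConfig.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```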

Keep in mind, though, that while many Kafka Connect use cases appear very similar to traditional ETL use cases, developers and data engineers should use caution when attempting any complicated data transformation with Kafka Connect.

Single Message Transforms, or SMTs, are supported in Kafka Connect, but they come at a cost. The true power of the Kafka platform is its ability to move massive amounts of data at breakneck speeds, and injecting complicated data transformations into the platform can cripple this design. While simple SMTs are supported, it is generally considered best practice to perform more complicated data transformation before the data reaches Kafka Connect.
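For a sense of what a "simple" SMT looks like, the configuration fragment below uses the InsertField transform that ships with Apache Kafka to stamp each record value with a static field identifying its origin; the transform alias and field values are illustrative and would be added to a connector’s configuration.

```java
import java.util.Map;

public class SimpleSmtConfig {
    public static void main(String[] args) {
        // A lightweight SMT appended to a connector's config: it adds a static
        // field to every record value identifying where the data came from.
        Map<String, String> smt = Map.of(
                "transforms", "addSource",
                "transforms.addSource.type", "org.apache.kafka.connect.transforms.InsertField$Value",
                "transforms.addSource.static.field", "data_source",
                "transforms.addSource.static.value", "demo-file-source"
        );
        smt.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```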

Kafka Connect in Kubernetes

A great way to implement Kafka Connect is through the use of containers and Kubernetes. The Strimzi Kafka operator is an amazing project for rapidly deploying and managing your Kafka Connect infrastructure. It provides the ability to define Kafka Connect images with custom resources and deploy them quickly, which makes managing your Kafka Connect cluster a breeze. Using a standardized description language across all your connector implementations provides a huge degree of process reusability and standardization.

With Strimzi, you can define, download, and implement Source and Sink Connectors from repositories like Maven Central, GitHub, or other custom-defined artifact repositories and let Kubernetes handle their deployment. Strimzi lets you describe a Kafka Connect image in a CustomResourceDefinition whose build configuration lists the connector plugins and their configurations, and then deploy those CRDs across your Kubernetes infrastructure.

Final Thoughts

Kafka Connect is an integral tool in the Apache Kafka ecosystem for reliably moving data in the enterprise with scalability and reusability. Almost every Kafka implementation would benefit from its integration into the environment. And with tools like the Strimzi operator acting as a force multiplier, Kafka Connect’s ability to transform your data operations is even greater.

Need Support for Your Kafka Deployments?

Our team of enterprise architects has a wealth of experience integrating, configuring, optimizing, and supporting Kafka in enterprise applications. Talk to an expert today to see how we can help support your Kafka deployments.

Talk to an Expert

Additional Resources