What Is Apache Kafka?
With low end-to-end latency, exceptional durability, and the ability to handle massive amounts of streaming data, Apache Kafka has quickly become a go-to tool for stream processing.
In this blog, we give a high-level overview of Apache Kafka and how it works, cover Kafka topics, and discuss when Kafka should be used.
What Is Apache Kafka?
Apache Kafka is a popular open source stream processor / middleware tool that can also be used as a message broker. Kafka provides low end-to-end latency with exceptional durability (persistence).
Kafka is not only a message broker; it is first a stream processor. The difference matters: you can use Kafka in an application as a message handler, and it has a publish-subscribe feature like many message brokers, but unlike many message brokers, Kafka is a distributed streaming platform.
This means it can publish and subscribe to streams of records, store streams of records in a durable, fault-tolerant way, and process streams as they occur.
How Kafka Works
Kafka runs as a cluster of one or more brokers. Clusters can be local, or they can be far apart, on opposite sides of a state or of the world.
Records are stored in topics. A record has 3 parts:
- A Key
- A Value
- A Timestamp
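As a rough illustration, the three parts of a record could be modeled like this in Python. The `Record` class and its field names are hypothetical, chosen for this sketch; they are not part of any Kafka client library.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Illustrative model of a record (hypothetical, not a Kafka client class)."""
    key: Optional[bytes]   # may be absent; often used to pick a partition
    value: bytes           # the payload itself
    # Assumption for this sketch: default to "now" in milliseconds.
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


order = Record(key=b"customer-42", value=b'{"item": "book", "qty": 1}')
print(order.key, order.value, order.timestamp_ms)
```

The record is frozen because, once written to a topic, a record is not modified.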
Like Apache ActiveMQ / Artemis and other brokers, the Kafka platform provides a producer API and a consumer API. There is also a streams API and a connector API. The streams API assists in wiring applications to manipulate streams and act as a “stream processor”.
The producer and consumer APIs are self-explanatory: they allow applications to act as producers and consumers. There is one more API, the admin API, which gives management applications control over the stream processor cluster.
Message Broker Tool
If what Kafka offers doesn’t seem different from what a message broker offers, Kafka might not fit your system architecture, and a message broker might be the correct tool for you. Kafka provides extremely low latency (FAST) transfer of data (messages) between disconnected, distant parts of a system.
What Is a Kafka Topic?
Kafka breaks the data out into topics. Topics in Kafka allow multiple subscribers to connect, meaning a topic can have zero, one, or many subscribers.
Topics are divided into partitions. A partition is an ordered (ordered as records are published), immutable (unchangeable) sequence of records that is continuously built on. Each record is assigned a sequential id, called an offset, as it is added to its partition.
Kafka persists all records, whether they have been consumed or not. Records are persisted for the retention period defined in the broker configuration; the default is 7 days.
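A partition can be pictured as an append-only list where the offset is simply the position of a record. The sketch below is a toy model under that assumption; `PartitionLog` is a hypothetical name, not a Kafka class.

```python
from typing import List, Tuple


class PartitionLog:
    """Toy append-only log standing in for a single topic partition."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, value: bytes) -> int:
        """Append a record and return the offset it was assigned."""
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read_from(self, offset: int) -> List[Tuple[int, bytes]]:
        """Return (offset, value) pairs starting at the given offset."""
        return list(enumerate(self._records[offset:], start=offset))


log = PartitionLog()
first = log.append(b"M1")
second = log.append(b"M2")
print(first, second)      # 0 1
print(log.read_from(1))   # [(1, b'M2')]
```

Existing entries are never rewritten; new records only ever go on the end, which is what keeps offsets stable.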
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168
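To make the retention rule concrete, here is a minimal sketch of age-based cleanup. It is a simplification: Kafka removes whole log segment files rather than individual records, and the `apply_retention` function is a hypothetical name used only for this illustration.

```python
from typing import List, Tuple

DAY_MS = 24 * 60 * 60 * 1000
RETENTION_MS = 7 * DAY_MS  # 7 days, matching the default retention period


def apply_retention(records: List[Tuple[int, bytes]], now_ms: int,
                    retention_ms: int = RETENTION_MS) -> List[Tuple[int, bytes]]:
    """Keep only (timestamp_ms, value) records younger than the retention period.

    Whether a record was consumed plays no part in the decision; only age does.
    """
    cutoff = now_ms - retention_ms
    return [(ts, value) for ts, value in records if ts >= cutoff]


now = 30 * DAY_MS
records = [(now - 8 * DAY_MS, b"old"), (now - 1 * DAY_MS, b"recent")]
kept = apply_retention(records, now)
print(kept)  # only the one-day-old record remains
```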
Topic partitions are replicated across a number of brokers in the cluster for fault tolerance.
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1
Producers publish data to the topics, and the producer chooses which partition each record is placed in. There are multiple ways of doing this; “round-robin balancing” is the simplest.
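The sketch below shows two common ways a producer might pick a partition: round-robin for records without a key, and hashing for records with one. `SimplePartitioner` is hypothetical, and `zlib.crc32` is only a stand-in hash chosen for this example; real Kafka clients use their own partitioning logic.

```python
import itertools
import zlib
from typing import Iterator, Optional


class SimplePartitioner:
    """Hypothetical partition chooser: hash keyed records, round-robin the rest."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._next: Iterator[int] = itertools.cycle(range(num_partitions))

    def choose(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread records evenly across partitions.
            return next(self._next)
        # Keyed: the same key always maps to the same partition.
        return zlib.crc32(key) % self.num_partitions


partitioner = SimplePartitioner(num_partitions=3)
print([partitioner.choose(None) for _ in range(5)])  # [0, 1, 2, 0, 1]
print(partitioner.choose(b"customer-42") == partitioner.choose(b"customer-42"))  # True
```

Hashing on a key is what lets all records for one key share a partition, and therefore share that partition's ordering.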
Consuming Kafka Data
Consumers consume data from the topics. Beyond an offset, consumer information is not kept. This is important because it allows for stateless consumption, faster throughput, fewer errors, and more. Consumers have the offset of their consumption (the record they are currently consuming) and nothing more. Consumers are cheap: they do not use much memory (from the broker's perspective), and they can come and go with relative ease.
Consumers also have a lot of freedom with Kafka, freedom to process messages as they like. They can reprocess older messages by changing their offset or skip ahead to the current time to start processing the newest messages.
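Because a consumer's position is just an offset, replaying or skipping ahead amounts to changing one number. The following is a minimal sketch of that idea; `ToyConsumer` and its methods are hypothetical and only loosely mirror what real client libraries offer.

```python
from typing import List


class ToyConsumer:
    """Hypothetical consumer that tracks nothing but its own offset."""

    def __init__(self, log: List[bytes]) -> None:
        self._log = log    # shared, append-only list standing in for a partition
        self.offset = 0    # position of the next record to read

    def poll(self, max_records: int = 10) -> List[bytes]:
        """Read up to max_records from the current offset and advance past them."""
        batch = self._log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move to any offset: 0 to replay, len(log) to skip to the newest."""
        self.offset = max(0, min(offset, len(self._log)))


log = [b"M1", b"M2", b"M3"]
consumer = ToyConsumer(log)
print(consumer.poll())     # [b'M1', b'M2', b'M3']
consumer.seek(0)           # rewind and reprocess from the start
print(consumer.poll(2))    # [b'M1', b'M2']
consumer.seek(len(log))    # skip ahead; only later records are seen
log.append(b"M4")
print(consumer.poll())     # [b'M4']
```

Nothing in the log changes when the consumer moves, which is why several consumers can read the same records independently.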
With MirrorMaker you can replicate your cluster data across the globe to different clusters.
Why Use Apache Kafka?
Kafka offers some high-level guarantees. These are from the Kafka documentation at the Apache Software Foundation:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
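The replication guarantee above is simple arithmetic, sketched here for a few replication factors. The function names are hypothetical, and the sketch assumes every failed server held a replica of the partition in question.

```python
def tolerated_failures(replication_factor: int) -> int:
    """Server failures a partition can survive without losing committed records."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


def survives(replication_factor: int, failed_servers: int) -> bool:
    """True if at least one replica remains after the given number of failures."""
    return failed_servers <= tolerated_failures(replication_factor)


for n in (1, 2, 3):
    print(n, tolerated_failures(n))    # 1 0, then 2 1, then 3 2
print(survives(3, 2), survives(3, 3))  # True False
```

With a replication factor of 1 there is no spare copy, so a single server failure can lose data.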
Kafka is a stream processing platform that is distributable across clusters, near or far, providing reliable, FAST, and durable message processing for the entire enterprise stack. If you do not want to use it as a stream processor, it is an excellent message broker.
Get Help With Kafka
Implementing Kafka requires skill, fully realizing its capability for speed and reliability requires knowledge, and successfully maintaining Kafka implementations requires patience and expertise.
Our experts can help you implement Kafka. We'll bring the skill, knowledge, patience, and expertise to help you get the most out of Kafka.
And if you're already using it, we can provide the Kafka support you need with the very same skill, knowledge, patience, and expertise.
Talk to an expert and gain confidence in your middleware implementation today.