What Is Apache Kafka and How Does Kafka Work?
Wondering how Kafka works and why it is so popular? With low end-to-end latency, exceptional durability, and the ability to handle massive amounts of streaming data, Apache Kafka has quickly become a go-to tool for stream processing.
In this blog, we give a high-level overview of how Apache Kafka works, talk about Kafka topics, and discuss when Kafka should be used. For more in-depth information about Kafka in the enterprise, download the Decision Maker's Guide to Apache Kafka.
What Is Apache Kafka?
Apache Kafka is a popular open source stream processor / middleware tool that can also be used as a message broker. Kafka provides low end-to-end latency with exceptional durability (persistence).
Kafka is a stream processor, and while you can use Kafka in an application as a message handler, it is not technically a message broker.
Kafka has a publish-subscribe feature, like many message brokers, but it is a distributed streaming platform. This means Kafka has the ability to publish and subscribe to streams of records, store streams of records in a durable, fault-tolerant way, and process streams as they occur.
How Does Kafka Work?
Kafka can run as a cluster. These clusters can be local, or they can be geographically distributed, with brokers on separate sides of a state, or of the world.
Records are stored in topics. A record has 3 parts:
- A Key
- A Value
- A Timestamp
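The three parts above can be modeled as a simple record type. This is a conceptual sketch in Python, not Kafka's actual record class (the field names and the JSON payload are illustrative assumptions):

```python
import time
from typing import NamedTuple, Optional

class Record(NamedTuple):
    """Minimal model of a Kafka record: a key, a value, and a timestamp."""
    key: Optional[bytes]   # used for partition assignment; may be None
    value: bytes           # the message payload
    timestamp: float       # real Kafka uses epoch milliseconds; seconds here

rec = Record(key=b"user-42", value=b'{"event": "login"}', timestamp=time.time())
print(rec.key, rec.value)
```

In real clients the key is optional, but when present it typically determines which partition the record lands in, as discussed below.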
Like Apache ActiveMQ/Artemis and other brokers, there are producer and consumer APIs available with the Kafka platform. There are also a streams API and a connector API. The streams API assists in wiring applications together to manipulate streams and act as a "stream processor".
The producer and consumer APIs are self-explanatory: they allow applications to act as producers and consumers. There is one more API, the admin API, which gives management applications control over the stream processing cluster.
Message Broker Tool
If none of this sounds different from what a message broker offers, Kafka might not fit your system architecture, and a message broker might be the correct tool for you. Where Kafka stands out is in providing extremely low latency (FAST) transfer of data (messages) between disconnected, abstracted, distant parts of a system.
What Is a Kafka Topic?
Kafka breaks the data out into topics. Topics in Kafka allow multiple subscribers to connect; a topic can have zero, one, or many subscribers.
Topics are divided into partitions. A Kafka partition is an ordered (in the order records are published), immutable (unchangeable) sequence of records that is continually appended to. Each record is assigned a sequential ID, called an offset, as it is added to the partition.
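The append-only, offset-numbered structure of a partition can be sketched in a few lines of Python. This is an illustration of the concept, not Kafka's actual storage format:

```python
class Partition:
    """Toy model of a Kafka partition: an append-only sequence of records."""

    def __init__(self):
        self._log = []  # records in publish order; never mutated in place

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._log.append(record)
        return len(self._log) - 1  # offsets start at 0 and only grow

    def read(self, offset):
        """Records are addressed by offset; reading does not remove them."""
        return self._log[offset]

p = Partition()
assert p.append("first") == 0
assert p.append("second") == 1
assert p.read(0) == "first"  # earlier records remain available after reads
```

Note that reading a record does not delete it, which is what lets Kafka retain records for every consumer, as described next.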
Kafka persists all records, whether they have been consumed or not. Records are persisted for the retention period defined in the broker configuration; the default is 7 days (168 hours).
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168
Topic partitions are replicated across a number of brokers in the cluster for fault tolerance.
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1
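Replication can be illustrated with a toy cluster: each partition is copied to N brokers, so up to N-1 of them can fail without losing committed records. This is a conceptual sketch (the broker names are made up), not Kafka's actual replication protocol:

```python
def replicate(records, brokers, replication_factor):
    """Copy a partition's records to `replication_factor` brokers."""
    return {b: list(records) for b in brokers[:replication_factor]}

records = ["m0", "m1", "m2"]
replicas = replicate(records, ["broker-1", "broker-2", "broker-3"],
                     replication_factor=3)

# Simulate N-1 = 2 broker failures; a surviving replica still has every record.
for failed in ("broker-1", "broker-2"):
    del replicas[failed]
surviving = next(iter(replicas.values()))
assert surviving == records
```

In a real cluster one replica is the leader and the rest follow it, but the fault-tolerance arithmetic is the same: N copies tolerate N-1 failures.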
Producers publish the data to the topics. A producer can choose which partition to place each record in. There are multiple methods of doing this, but "round-robin balancing" is the simplest.
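Round-robin balancing simply cycles through the partitions; a common alternative is hashing the record key so that records with the same key always land in the same partition. A sketch of both strategies (illustrative only; this is not any Kafka client's actual hashing algorithm):

```python
import itertools

NUM_PARTITIONS = 3
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key=None):
    """Keyed records go to a stable partition; keyless records rotate."""
    if key is not None:
        return hash(key) % NUM_PARTITIONS  # same key -> same partition
    return next(_round_robin)

assert choose_partition(b"user-42") == choose_partition(b"user-42")
print([choose_partition() for _ in range(4)])  # -> [0, 1, 2, 0]
```

Key-based placement matters because ordering in Kafka is only guaranteed within a partition, so keeping one key's records together preserves their order.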
Consuming Kafka Data
Consumers consume data from the topics. The broker keeps almost no per-consumer state. This is important, because it allows for stateless consumption, faster throughput, fewer errors, and more. Consumers track the offset of their consumption (which record they are currently consuming) and nothing more. Consumers are cheap: they do not use much memory (from the broker's perspective), and they can come and go with relative ease.
Consumers also have a lot of freedom with Kafka: freedom to process messages as they like. They can reprocess older messages by rewinding their offset, or skip ahead to start processing the newest messages.
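Because a consumer's position is just an offset, replaying or skipping records amounts to moving that number. A toy sketch over the list-based log model (real clients expose the same idea through a seek operation, e.g. `KafkaConsumer.seek` in the kafka-python library):

```python
class Consumer:
    """Toy consumer: holds nothing but its position (offset) in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # the only state the consumer needs

    def poll(self):
        """Return the record at the current offset, then advance."""
        record = self.log[self.offset]
        self.offset += 1
        return record

    def seek(self, offset):
        """Rewind to reprocess old records, or jump ahead to newer ones."""
        self.offset = offset

log = ["m0", "m1", "m2", "m3"]
c = Consumer(log)
assert c.poll() == "m0"
c.seek(3)                 # skip ahead to the newest record
assert c.poll() == "m3"
c.seek(1)                 # rewind and reprocess from an earlier point
```

Because records persist regardless of consumption, two consumers at different offsets read the same log without interfering with each other.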
With MirrorMaker, you can replicate your cluster's data to other clusters across the globe.
Why Use Apache Kafka?
Kafka offers some high-level guarantees; these are from the Kafka documentation at the Apache Software Foundation:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
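The first two guarantees can be checked against the toy partition model: records from one producer keep their send order, and a consumer reads them back in that stored order (illustrative only):

```python
log = []  # stands in for one topic partition

def produce(record):
    """Append a record and return the offset it was assigned."""
    log.append(record)
    return len(log) - 1

off_m1 = produce("M1")  # sent first
off_m2 = produce("M2")  # sent second, by the same producer
assert off_m1 < off_m2                          # M1 gets the lower offset
assert log[off_m1:off_m2 + 1] == ["M1", "M2"]   # read back in stored order
```

The third guarantee is the replication-factor arithmetic sketched earlier: N copies of a partition tolerate N-1 broker failures.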
Kafka is a stream processing platform that can be distributed across clusters, near or far, providing reliable, FAST, and durable message processing for the entire enterprise stack. And if you do not want to use it as a stream processor, it can serve as an excellent message broker.
Hopefully this blog has given you a better understanding of how Kafka works and different applications in enterprise settings. For more, including comparisons of Kafka alternatives and configuration strategies, check out The Decision Maker's Guide to Apache Kafka.
Get Help With Apache Kafka
Implementing Kafka requires skill and successfully maintaining Kafka deployments requires patience and expertise. OpenLogic's enterprise architects can help your team get the most out of Kafka and provide 24/7/365 technical support, backed by SLAs.
- Case Study - Credit Card Processing Company Avoids Kafka Exploit
- Webinar - How to Use Kafka Data Lakes
- Blog - Kafka vs. RabbitMQ
- Blog - Using Apache Kafka for Stream Processing
- Blog - Using Kafka with ZooKeeper
- Blog - Exploring Kafka Connect
- White Paper - The New Stack: Cassandra, Kafka, and Spark
- Blog - 5 Apache Kafka Security Best Practices