What Is Apache Kafka?
What Is Apache Kafka and Why Should a Software Engineer or Architect Care?
Apache Kafka is a stream processor that can also be used as a message broker. If your architecture or software requires low end-to-end latency with exceptional durability (persistence), Kafka is the piece of software that provides this and other functionality.
Kafka is not merely a message broker; it is a stream processor. The distinction matters: you can use Kafka in an application as a message handler, and Kafka has a publish-subscribe feature like many message brokers, but unlike most message brokers, Kafka is a distributed streaming platform.
This means the ability to publish and subscribe to streams of records, store streams of records in a durable, fault-tolerant way, and process streams of records as they occur.
How Apache Kafka Works
Kafka can run as a cluster. These clusters can be local, or they can be spread across a state or across the world. Records are stored in topics. A record has three parts:
- A Key
- A Value
- A Timestamp
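The three parts above can be sketched as a simple data structure. This is an illustrative shape only (the `Record` class and field names here are hypothetical); real Kafka records are serialized bytes handled by the client libraries.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of a Kafka record's three parts.
@dataclass
class Record:
    key: Optional[str]   # used for partition assignment; may be None
    value: str           # the payload
    timestamp: float = field(default_factory=time.time)  # when the record was created

r = Record(key="user-42", value="clicked checkout")
```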
Like Apache ActiveMQ / Artemis and other brokers, there are producer and consumer APIs available with the Kafka platform. There are also a streams API and a connector API. The streams API assists in wiring up applications that manipulate streams, acting as a "stream processor".
The producer and consumer APIs are self-explanatory: they allow applications to act as producers and consumers. There is one more API, the admin API, which gives management applications control over the stream processor cluster.
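The division of labor between the producer and consumer APIs can be sketched with an in-memory stand-in for a broker. The `MiniBroker` class and its method names are illustrative, not Kafka's API; this only shows the publish-subscribe idea, not networking, persistence, or partitioning.

```python
from collections import defaultdict

# Illustrative in-memory stand-in for a broker (not Kafka's actual API).
class MiniBroker:
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only list of records

    def publish(self, topic, record):
        # Roughly what the producer API does: append a record to a topic.
        self.topics[topic].append(record)

    def fetch(self, topic, offset):
        # Roughly what the consumer API does: read records from an offset onward.
        return self.topics[topic][offset:]

broker = MiniBroker()
broker.publish("orders", "order-1")
broker.publish("orders", "order-2")
print(broker.fetch("orders", 0))  # ['order-1', 'order-2']
```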
Message Broker Tool
If these capabilities don’t seem different from those of a traditional message broker, Kafka might not fit your system architecture, and a message broker might be the correct tool for you. Kafka is going to provide extremely low latency (FAST) transfer of data (messages) between disconnected, abstract, distant parts of a system.
What Are Kafka Topics?
Kafka breaks the data out into topics. Topics in Kafka allow multiple subscribers to connect, meaning a topic can have zero, one, or many subscribers.
Topics are divided into partitions. A partition is an ordered (in the order records are published), immutable (unchangeable) sequence of records that is continuously appended to. Records are assigned a sequential id, called an offset, as they are added to the partition.
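The partition described above behaves like an append-only list whose index is the offset. The sketch below (with a hypothetical `Partition` class) illustrates that; real partitions are on-disk log segments replicated across brokers.

```python
# A partition sketched as an append-only list: each record's list index
# is its offset. Illustrative only; real partitions are on-disk segments.
class Partition:
    def __init__(self):
        self._log = []

    def append(self, record):
        self._log.append(record)
        return len(self._log) - 1  # the new record's offset

    def read(self, offset):
        return self._log[offset]

p = Partition()
assert p.append("first") == 0   # offsets are assigned in publish order
assert p.append("second") == 1
assert p.read(0) == "first"     # records are immutable once written
```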
Kafka persists all records, whether they have been consumed or not. Records are persisted for the retention period defined in the broker configuration, but the default is 7 days.
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168
Topic partitions are replicated across a number of brokers in the cluster for fault tolerance.
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1
Producers publish data to the topics. A producer can choose which partition within a topic to place each record in. There are multiple ways of doing this, but “round-robin balancing” is the simplest.
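Round-robin balancing can be sketched as cycling through partition numbers. This hypothetical `round_robin_partitioner` shows only that strategy; Kafka's real default partitioner also hashes the record key when one is present, so records with the same key land in the same partition.

```python
from itertools import count

# Sketch of round-robin balancing: each record goes to the next
# partition in turn, wrapping around. Function name is illustrative.
def round_robin_partitioner(num_partitions):
    counter = count()
    def choose(_record):
        return next(counter) % num_partitions
    return choose

choose = round_robin_partitioner(3)
print([choose(r) for r in ["a", "b", "c", "d"]])  # [0, 1, 2, 0]
```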
Consuming Kafka Data
Consumers consume data from the topics. Consumer state is not tracked by the broker. This is important because it allows for stateless consumption, faster throughput, fewer errors, and more. Consumers hold only the offset of their consumption (which record they are currently consuming) and nothing more. Consumers are cheap: they do not use much memory (from the broker's perspective), and they can come and go with relative ease.
Consumers also have a lot of freedom with Kafka: freedom to process messages as they like. They can reprocess older messages by rewinding their offset, or skip ahead to the present to start processing the newest messages.
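That offset-only state is what makes rewinding and skipping ahead cheap. The sketch below (a hypothetical `Consumer` class over a plain list) shows the idea; in the real Java client the analogous calls are `seek` and `seekToEnd`.

```python
# Sketch of a consumer that tracks nothing but its own offset:
# rewinding replays old records; jumping to the end skips to the newest.
class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        records = self.log[self.offset:]
        self.offset = len(self.log)  # advance past everything read
        return records

    def seek(self, offset):
        # Rewind (or fast-forward) to reprocess from a chosen point.
        self.offset = offset

    def seek_to_end(self):
        # Skip ahead to start with the newest records only.
        self.offset = len(self.log)

c = Consumer(["m1", "m2", "m3"])
assert c.poll() == ["m1", "m2", "m3"]
c.seek(1)
assert c.poll() == ["m2", "m3"]  # replayed from offset 1
```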
With MirrorMaker, you can replicate your cluster data across the globe to different clusters.
Kafka Developer Guarantees
Kafka offers some high-level guarantees; these are from the Kafka documentation at the Apache Software Foundation.
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
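The last guarantee is simple arithmetic worth making explicit: with replication factor N, committed records survive as long as at least one replica remains, so up to N-1 brokers can fail. A trivial sketch (the function name is illustrative):

```python
# The N-1 rule: a topic with replication factor N tolerates N-1
# broker failures without losing committed records.
def tolerable_failures(replication_factor):
    return replication_factor - 1

assert tolerable_failures(3) == 2  # a 3-way replicated topic survives 2 failures
```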
In summary, Kafka is a stream processing platform that can be distributed across clusters, near or far, providing reliable, FAST, and durable message processing for the entire enterprise stack. If you do not want to use it as a stream processor, it is an excellent message broker.
If you need help with implementing Kafka into your stack, connect with an OpenLogic expert. Try the OpenLogic free trial to open a consultative support ticket to work with an enterprise architect today!