Data Logistics using Apache Kafka, Spark, and Spring Boot

This post summarizes how Apache Kafka works and demonstrates how it can be used to ingest data and transfer it to another system. It therefore highlights only the fundamentals of Apache Kafka: why it exists, the contexts in which it is used, and how to work with its Producer and Consumer APIs.

What is Apache Kafka?

Kafka was initially described as a high-throughput distributed messaging system, but it has evolved over the years into a distributed streaming platform, as described on its official website. Regardless of which description you adopt, the important point is that it is a platform that moves data from one system to another rapidly, in a scalable and reliable way: a kind of data logistics. It is therefore safe to conclude that Kafka is a good fit whenever data needs to be transferred among multiple systems in a fast, reliable, and durable way.

The image above shows how Kafka works at the topmost level, leaving out its two other APIs (Streams and Connectors). Multiple applications produce data with different structures and at different rates and push it to specified Kafka Topics, while multiple applications consume the pushed data from the same or different Topics.

Topics are:

  • the central Kafka abstraction, as seen in the figure
  • named categories for messages, which are stored in a time-ordered sequence
  • logical entities, each physically represented as a log

Topics reside in a Kafka cluster, which may contain one or more brokers, and each topic has one or more partitions. Partitions are how Kafka achieves its high throughput: they allow the data for a topic to be split so that writes can happen simultaneously. Key-hash, round-robin, direct, or custom-partitioner approaches can be used to spread the data across the available partitions of each topic. Too many partitions, however, can lead to overhead on the Zookeeper side. Each partition is maintained on one or more brokers depending on the replication factor that is set. A broker is a typical Kafka server and is responsible for hosting the partition(s).
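To make the key-hash versus round-robin distinction concrete, here is a minimal producer sketch using the plain Java client. The broker address and the sensor-readings topic are assumptions for illustration, not part of the original post.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitioningDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the key is hashed, so all readings for "farm-7"
            // land on the same partition and keep their relative order.
            producer.send(new ProducerRecord<>("sensor-readings", "farm-7", "{\"tempC\": 28.4}"));

            // Un-keyed record: the default partitioner spreads these records
            // across partitions, trading per-key ordering for throughput.
            producer.send(new ProducerRecord<>("sensor-readings", "{\"tempC\": 30.1}"));
        }
    }
}
```

The choice between keyed and un-keyed sends is essentially a choice between ordering guarantees per key and even load distribution across partitions.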

A typical message has three main parts: a timestamp, assigned by Kafka and used for ordering the sequence; a unique identifier for referencing each message; and the payload, which is the binary value of the message. The payload is kept in binary form for efficient use of network and storage resources and to allow compression, among other reasons (a minimal consumer sketch after the list below shows how these fields surface on the consuming side). Apache Kafka was built on the foundation of transaction or commit logs with the following objectives:

  • High throughput
  • Horizontal scalability
  • Reliability and durability
  • Loosely coupled producers and consumers
  • Flexible publish-subscribe semantics
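Here is the consumer sketch referenced above: it prints the timestamp, offset, key, and value of each record it pulls. The broker address, the demo-group consumer group, and the sensor-readings topic are illustrative assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MessagePartsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("group.id", "demo-group");                // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-readings"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Each record exposes the parts described above: a timestamp,
                // a position in the log (offset), and the payload.
                System.out.printf("ts=%d offset=%d key=%s value=%s%n",
                        record.timestamp(), record.offset(), record.key(), record.value());
            }
        }
    }
}
```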

Two factors worth mentioning when working with Kafka are the number of nodes that make up the cluster and the replication factor. The number of nodes in the cluster denotes the number of brokers available to work on tasks, with one of them acting as the controller at any given time; they can reside on the same machine or on different machines. The replication factor controls the redundancy (duplication) of messages, and hence the cluster's resiliency and fault tolerance, and it can be configured on a per-topic basis. It is advisable to set the replication factor optimally, particularly with regard to network I/O usage.
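As a sketch of how partitions and the replication factor are set per topic, here is a minimal example using the Kafka AdminClient. The topic name and the specific counts (3 partitions, replication factor 2) are illustrative assumptions.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions allow parallel writes; replication factor 2 means
            // each partition is duplicated on a second broker for fault tolerance.
            NewTopic topic = new NewTopic("sensor-readings", 3, (short) 2);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```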

It is worth pointing out that Kafka achieves the distributed part of these objectives by relying on another Apache project, Zookeeper, which maintains its cluster of nodes with respect to configuration information, health status, and group membership.

Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

https://zookeeper.apache.org/

Why Apache Kafka?

Before the emergence of Kafka, large systems already handled and transferred data in various ways: data replication for relational database-powered systems, log shipping, custom extract-transform-load (ETL) setups, traditional messaging, or custom middleware logic.

However, these approaches were plagued with issues such as tight coupling to data schemas, technology lock-in (e.g. RDBMS-to-RDBMS replication), limited scalability, implementation complexity, limited reliability, and poor performance. Addressing these issues was central to Kafka's design objectives.

Apache Kafka API

There are four core APIs in the Kafka system:

  • Producer API – allows an application to push its data records to one or more Kafka topics.
  • Consumer API – allows an application to subscribe to one or more existing topics and pull data records from them.
  • Streams API – allows an application to act as a data stream processor, consuming from input topics and producing to output topics (a minimal sketch follows this list).
  • Connector API – allows the building of reusable producers and consumers that connect Kafka topics to other applications or data systems.
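Here is the Streams API sketch mentioned above: a trivial topology that reads one topic, applies a stand-in transformation, and writes to another. The application id, broker address, and topic names are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "readings-normalizer"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw readings, upper-case the values as a stand-in transformation,
        // and write the result to an output topic.
        KStream<String, String> raw = builder.stream("sensor-readings");
        raw.mapValues(value -> value.toUpperCase())
           .to("sensor-readings-normalized");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```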

Example of Data Logistics Use-Cases

  1. A farmer owns numerous farmlands across many states or time zones and needs to measure the weather conditions (or any other quantity) on these farms in near real-time to make business and farming decisions.
  2. A telecommunications marketing manager needs near real-time analytics on call trends and prepaid recharge patterns across a customer base of millions in order to make effective, data-driven decisions.

To install Kafka on your local machine, you can follow the instructions here.

The sample code here simulates the production of data using the IoT scenario. The simulated data is pushed to Kafka, streamed with Spark, transformed, and persisted to HBase in the preferred transformed format. It should be noted that the streaming aspect can also be achieved with Kafka itself, and the persistence layer can be any preferred storage, e.g. HBase, Cassandra, MongoDB, or Elasticsearch.
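As a rough illustration of the producing side with Spring Boot, here is a minimal sketch using Spring Kafka's KafkaTemplate. The class name, topic name, and JSON shape are assumptions for illustration, not taken from the linked sample; it also assumes the spring-kafka starter is on the classpath and that @EnableScheduling is set on the application class.

```java
import java.util.Random;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical simulator: periodically publishes a fake temperature reading
// per farm to the "sensor-readings" topic via Spring Kafka's KafkaTemplate.
@Component
public class WeatherReadingSimulator {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final Random random = new Random();

    public WeatherReadingSimulator(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @Scheduled(fixedRate = 1000) // publish one simulated reading per second
    public void publishReading() {
        String farmId = "farm-" + random.nextInt(10);
        String payload = String.format("{\"farmId\":\"%s\",\"tempC\":%.1f}",
                farmId, 20 + random.nextDouble() * 15);
        // The farm id is used as the key so readings from one farm stay ordered.
        kafkaTemplate.send("sensor-readings", farmId, payload);
    }
}
```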


PS: This post was migrated from my old blog. Its original version was published sometime in 2018.
