Day 16 : Kafka! The Streaming Hero!
Wishing you all a very Happy REPUBLIC DAY!! :)
In the spirit of celebration, let's explore and understand some buzzwords in the Big Data industry today. This will be helpful since we will be implementing and using these technologies to build the proposed real-time model.
What is Kafka? Where does it fit in the architecture?
To answer this question, let's dissect our problem statement into smaller pieces.
Since we have to capture network packets continuously, we need a platform that can stream this real-time data as it flows through the network. We will use Kafka for this streaming.
"Kafka is massively scalable pub/sub message queue designed as a distributed transaction log"
~ as quoted by Wikipedia
Let's decompose the sentence into units.
Kafka is massively scalable because it runs as a cluster of one or more servers. Consumer groups, which allow for load balancing across consumers, add to that scalability.
Kafka follows a publisher/subscriber model: a producer publishes data, and subscribers (consumers) consume it. Kafka can also provide a mix of the queuing and publish-subscribe models.
Consumer groups are another key concept, and they help explain why Kafka is more flexible and powerful than other messaging solutions like RabbitMQ. Every consumer is associated with a consumer group. If all consumers belong to the same consumer group, the topic's messages are load balanced across those consumers; that's called a 'queuing model'. By contrast, if every consumer belongs to a different consumer group, each consumer receives all the messages; that's called a 'publish-subscribe' model.
Understanding scalability and the pub/sub model using consumer groups
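To make the two models concrete, here is a minimal sketch in plain Python. It does not use Kafka at all; the `deliver` function, the group names and the consumer names are made up for illustration. It also simplifies one thing: it hands out messages round-robin inside a group, whereas Kafka actually balances load by assigning topic partitions to the consumers of a group.

```python
from collections import defaultdict


def deliver(messages, consumer_groups):
    """Simulate how a topic's messages reach consumers.

    consumer_groups maps a group name to the list of consumers in it.
    Every group sees every message; inside one group, each message goes
    to exactly one consumer (round-robin, purely for illustration).
    """
    received = defaultdict(list)
    for group, consumers in consumer_groups.items():
        for i, message in enumerate(messages):
            consumer = consumers[i % len(consumers)]
            received[consumer].append(message)
    return dict(received)


messages = ["m0", "m1", "m2", "m3"]

# Queuing model: both consumers share one group, so the load is split.
queuing = deliver(messages, {"group-a": ["c1", "c2"]})

# Publish-subscribe model: each consumer has its own group,
# so each consumer receives every message.
pubsub = deliver(messages, {"group-a": ["c1"], "group-b": ["c2"]})

print(queuing)
print(pubsub)
```

In the first call the two consumers split the four messages between them; in the second call both consumers get all four.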
'Distributed transaction log' refers to the topics. Kafka runs as a cluster on one or more servers that can span multiple datacenters. The Kafka cluster stores streams of records in categories called topics.
Each record consists of a key, a value, and a timestamp.
This article is a very good read. A shout-out to the author for a great explanation!
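The idea of a topic as a log of records can be sketched in a few lines of Python. This is only a toy, in-memory model and not how Kafka is implemented; the `Record` and `TopicLog` classes, the topic name `packets` and the sample values are assumptions made for illustration. The position of a record in the log is what Kafka calls its offset.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """One record: a key, a value, and a timestamp."""
    key: Optional[str]
    value: str
    timestamp: float


@dataclass
class TopicLog:
    """A toy append-only log standing in for a single topic."""
    name: str
    records: List[Record] = field(default_factory=list)

    def append(self, key, value, timestamp=None):
        """Append a record and return its offset (position in the log)."""
        ts = time.time() if timestamp is None else timestamp
        self.records.append(Record(key, value, ts))
        return len(self.records) - 1

    def read_from(self, offset):
        """Return all records at or after the given offset."""
        return self.records[offset:]


log = TopicLog("packets")  # hypothetical topic name
first = log.append("10.0.0.1", "SYN", timestamp=1.0)
second = log.append("10.0.0.2", "ACK", timestamp=2.0)

print(first, second)
print([r.value for r in log.read_from(1)])
```

Records are only ever added at the end, and a reader picks the offset it wants to start from, which is why the same log can serve many consumers.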
Now that we know what Kafka does, what we will need for the project is Kafka Connect. It is a tool built into Kafka that imports data into Kafka and exports data out of it.
It runs connectors, which provide the custom logic for interacting with an external system.
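As a rough idea of what a connector definition looks like, the sketch below builds the JSON for the simple file source connector that ships with Kafka as an example. The connector name, file path and topic name are hypothetical, and whether this connector class is available depends on your Kafka version and plugin path. A payload like this is typically sent with a POST request to the `/connectors` endpoint of a Kafka Connect worker's REST API; the code here only builds and prints it.

```python
import json

# Hypothetical connector definition: read lines from a file into a topic.
connector = {
    "name": "packet-file-source",  # hypothetical connector name
    "config": {
        # Example file source connector bundled with Kafka.
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/packets.txt",  # hypothetical input file
        "topic": "packets",          # hypothetical target topic
    },
}

# Serialize to the JSON body that a Connect worker would accept.
payload = json.dumps(connector, indent=2)
print(payload)
```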
More about how to use Kafka connectors will be up in next week's post. Stay tuned for more on how to create a Kafka connector and deploy it.
Happy learning!!
In the overall project, you will not be getting data directly from the switches, but from the controller. So, look at your data ingestion mechanism accordingly.