Day 27: Using Kafka to stream real time network data
Hey guys!
So today we will explore how Kafka can be used to stream real time network data.
We have learnt what Kafka is all about in the previous posts. Today, we will do some hands-on
To quickly summarize Kafka architecture:
Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic. A consumer pulls messages off of a Kafka topic while producers push messages into a Kafka topic. Lastly, Kafka, as a distributed system, runs in a cluster. Each node in the cluster is called a Kafka broker.
To capture network traffic, we will use tcpdump. This utility is extremely helpful because of the wide variety of options it provides.
To install kafka on my system, I followed this tutorial.
One of the important things to consider in this application is scalability. What will happen to your model if there is a huuuge amount of traffic suddenly flowing in? How will it react? Is your architecture built in such a way that you can accomodate for scalability in it?
Keeping that in mind, let's explore the features that kafka provides.
Kafka topics are divided into a number of partitions. Partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers — each partition can be placed on a separate machine to allow for multiple consumers to read from a topic in parallel. Consumers can also be parallelized so that multiple consumers can read from multiple partitions in a topic allowing for very high message processing throughput.
Each message within a partition has an identifier called its offset. The offset the ordering of messages as an immutable sequence. Kafka maintains this message ordering for you. Consumers can read messages starting from a specific offset and are allowed to read from any offset point they choose, allowing consumers to join the cluster at any point in time they see fit. Given these constraints, each specific message in a Kafka cluster can be uniquely identified by a tuple consisting of the message’s topic, partition, and offset within the partition.
For more info, refer to this.
To understand the basic operations of kafka, I found this extremely helpful.
Once you have kafka set up, run the following commands in order :
Start ZooKeeper:
Open a new terminal and type the following command −
bin/zookeeper-server-start.sh config/zookeeper.properties
To start Kafka Broker, type the following command −
bin/kafka-server-start.sh config/server.properties
So today we will explore how Kafka can be used to stream real time network data.
We have learnt what Kafka is all about in the previous posts. Today, we will do some hands-on
To quickly summarize Kafka architecture:
Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic. A consumer pulls messages off of a Kafka topic while producers push messages into a Kafka topic. Lastly, Kafka, as a distributed system, runs in a cluster. Each node in the cluster is called a Kafka broker.
To capture network traffic, we will use tcpdump. This utility is extremely helpful because of the wide variety of options it provides.
To install kafka on my system, I followed this tutorial.
One of the important things to consider in this application is scalability. What will happen to your model if there is a huuuge amount of traffic suddenly flowing in? How will it react? Is your architecture built in such a way that you can accomodate for scalability in it?
Keeping that in mind, let's explore the features that kafka provides.
Kafka topics are divided into a number of partitions. Partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers — each partition can be placed on a separate machine to allow for multiple consumers to read from a topic in parallel. Consumers can also be parallelized so that multiple consumers can read from multiple partitions in a topic allowing for very high message processing throughput.
Each message within a partition has an identifier called its offset. The offset the ordering of messages as an immutable sequence. Kafka maintains this message ordering for you. Consumers can read messages starting from a specific offset and are allowed to read from any offset point they choose, allowing consumers to join the cluster at any point in time they see fit. Given these constraints, each specific message in a Kafka cluster can be uniquely identified by a tuple consisting of the message’s topic, partition, and offset within the partition.
For more info, refer to this.
To understand the basic operations of kafka, I found this extremely helpful.
Once you have kafka set up, run the following commands in order :
Start ZooKeeper:
Open a new terminal and type the following command −
bin/zookeeper-server-start.sh config/zookeeper.properties
To start Kafka Broker, type the following command −
bin/kafka-server-start.sh config/server.properties
We will be setting up a single node configuration for simplicity.
Creating a Kafka Topic − Kafka provides a command line utility named kafka-topics.sh to create topics on the server. Open new terminal and type the below command
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1
--partitions 1 --topic test-topic
This tells us that zookeeper is running on localhost server on port 2181.
Topic Replication is the process to offer fail-over capability for a topic. Replication factor defines the number of copies of a topic in a Kafka cluster.
Since we are running a single node configuration, the number of partitions will be 1.
Once the topic has been created, you can get the notification in Kafka broker terminal window and the log for the created topic specified in “/tmp/kafka-logs/“ in the config/server.properties file.
Start Producer to Send Messages:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic
Broker-list − The list of brokers that we want to send the messages to. In this case we only have one broker. The Config/server.properties file contains broker port id, since we know our broker is listening on port 9092, so you can specify it directly.
The producer will wait on input from stdin and publishes to the Kafka cluster.
We will be writing a shell script to automate the network packet capture part:
There are 2 ways of doing this
First script is as follows
#!/bin/bash/
sudo tcpdump -l | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic
Second script:
#!/bin/bash/
sudo tcpdump -l -immediate-mode > tcpdump_packet_capture.txt
sleep 5
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic < tcpdump_packet_capture.txt
To see the messages in the test-topic,
Start Consumer to Receive Messages
Similar to producer, the default consumer properties are specified in config/consumer.proper-ties file. Open a new terminal and type the below for consuming messages.
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test-topic --from-beginning
Finally, you are able to enter messages(network capture packets) from the producer’s terminal and see them appearing in the consumer’s terminal.
Deleting a Topic
To delete the topic you have created, you can use the following
bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic test-topic
That's it for today folks! Next we have to figure out how to process these messages using storm. We need to preprocess the data using storm before feeding it to Machine Learning model.
Author - Swati N H
Co-author - Shravanya G
The article is very helpful for beginners like me to get the gist of various technologies and how to use them. I do have one question though - why is only one partition used? You have mentioned that parallelly running partitions increase scalability. Since the solution required is scalable, why not use more partitions? Or are partitions designed in a manner as to listen to a single port at a time? Can there be a situation where 2 partitions are listening to the same port? Also, I would really like you to publish another article on how to tackle failover and another topic picking the process up from where it had stopped. Excuse me if few of my questions are redundant :p
ReplyDelete