Setting up a Kafka bolt
Once the Storm topology was set up, there were two ways machine learning could be incorporated into the project:
1. ML inside Storm
2. ML outside the Storm topology
In order to choose a design for implementation, the following factors were considered:
1. Scalability of components
If the Storm component is separated from the machine learning component, then we can scale each component independently and individually. This gives us better flexibility when programming each component.
2. Ease of debugging
Errors that crop up in each individual component are raised and identified more easily when the components are disaggregated. We can work around bugs and fix them more efficiently in this setup.
3. Options and research scope that can be explored with each choice
Machine learning inside Storm has restrictions of its own: the built-in algorithms available are limited. If we want to explore different kinds of machine learning algorithms, we need to set them up outside the topology.
Hence, the choice of keeping machine learning outside the Storm topology is justified.
Now we need to figure out a way to connect the Storm topology to the machine learning server.
We can do this using a data messaging pipeline we have already used: Kafka!
The idea is:
1. Set up the Storm topology.
2. Stream/publish the output tuples from a bolt into a Kafka topic using a Kafka bolt (see the sketch after this list).
- The Kafka bolt lives within the topology.
- The tuple has to be serialized while writing it to the Kafka topic.
3. Subscribe to the data in the Kafka topic from the machine learning server (see the consumer sketch below).
- Since the tuple is serialized, consuming it directly will dump garbage values.
- We need deserializer code before consuming the data.
- The machine learning server code can then consume the data, train the model, and make useful predictions.
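Here is a minimal sketch of step 2, using the KafkaBolt shipped with Storm's Kafka integration. The bootstrap server address, the topic name events, the tuple field names key and message, and the upstream component id parser-bolt are all assumptions for illustration; swap in the values from your own topology.

```java
import java.util.Properties;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaBoltTopology {
    public static void main(String[] args) throws Exception {
        // Producer properties: the serializers turn the tuple's key and
        // message fields into bytes before they land in the Kafka topic.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // The mapper picks which tuple fields become the Kafka record's
        // key and value ("key" and "message" are assumed field names).
        KafkaBolt<String, String> kafkaBolt = new KafkaBolt<String, String>()
                .withProducerProperties(props)
                .withTopicSelector(new DefaultTopicSelector("events"))
                .withTupleToKafkaMapper(
                        new FieldNameBasedTupleToKafkaMapper<>("key", "message"));

        TopologyBuilder builder = new TopologyBuilder();
        // ... the existing spouts and processing bolts go here ...
        // Wire the Kafka bolt to the bolt whose output we want to publish
        // ("parser-bolt" is a placeholder id for that upstream bolt).
        builder.setBolt("kafka-bolt", kafkaBolt).shuffleGrouping("parser-bolt");

        StormSubmitter.submitTopology("ml-pipeline", new Config(),
                builder.createTopology());
    }
}
```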
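And a minimal sketch of step 3 on the machine learning server side, using the plain Kafka Java consumer. The deserializers must mirror the serializers used by the bolt; a missing or mismatched deserializer is exactly what produces the garbage values mentioned above. The handOffToModel method is a hypothetical stand-in for whatever training/prediction code the server actually runs.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MlServerConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ml-server");
        // Deserializers matching the StringSerializer used by the Kafka
        // bolt; without them the consumed bytes decode to garbage.
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand each deserialized tuple off to the ML code that
                    // trains the model and makes predictions.
                    handOffToModel(record.value());
                }
            }
        }
    }

    private static void handOffToModel(String message) {
        // Placeholder: parse the message and feed it to the model.
        System.out.println("received: " + message);
    }
}
```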