This is the third blog in our series on real-time stream processing at InMobi. The first blog described the variety of company-wide use-cases that motivated stream processing at InMobi. The second blog gave a brief overview of the Storm and Spark Streaming platforms, which is needed for an effective comparison.
This blog introduces the criteria that any enterprise planning to run massive-scale stream processing in production should use to evaluate a stream processing platform. It is important to clearly define each criterion and highlight why it matters.
This section lists the various criteria that one needs to consider while evaluating any stream processing platform.
1. Processing Model:
The processing model provided by the platform should be sufficient to handle all possible streaming use-cases. Some applications have strict low-latency requirements and need true event-driven processing, whereas others perform stream-based processing that can tolerate a less stringent latency SLA in exchange for high throughput.
2. Functional primitives:
The functional primitives exposed by the stream processing platform should provide rich functionality at the individual message level (for example, map and filter), across messages within a stream (for example, aggregation), and across streams (for example, join).
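As an illustration, the three classes of primitives can be sketched over plain Python lists standing in for streams. The helper names below are hypothetical, not any platform's actual API:

```python
# Sketch of the three primitive classes over in-memory "streams" (plain lists).
# These helpers are illustrative stand-ins, not any platform's actual API.

def map_stream(stream, fn):
    """Per-message transformation (e.g. parse, enrich)."""
    return [fn(msg) for msg in stream]

def filter_stream(stream, predicate):
    """Per-message filtering."""
    return [msg for msg in stream if predicate(msg)]

def aggregate_stream(stream, key_fn):
    """Across-message aggregation: count messages per key."""
    counts = {}
    for msg in stream:
        k = key_fn(msg)
        counts[k] = counts.get(k, 0) + 1
    return counts

def join_streams(left, right, key_fn):
    """Across-stream join: pair up messages from two streams on a shared key."""
    right_index = {}
    for msg in right:
        right_index.setdefault(key_fn(msg), []).append(msg)
    return [(l, r) for l in left for r in right_index.get(key_fn(l), [])]

# Hypothetical ad-serving events, in the spirit of the use-cases from part 1.
clicks = [{"ad": "a1"}, {"ad": "a2"}, {"ad": "a1"}]
wins = [{"ad": "a1", "price": 0.5}]

print(aggregate_stream(clicks, lambda m: m["ad"]))   # {'a1': 2, 'a2': 1}
print(join_streams(clicks, wins, lambda m: m["ad"]))
```

A real platform exposes these same shapes as operators over unbounded streams rather than finite lists; the point is that all three levels (per-message, across messages, across streams) need first-class support.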
3. State management:
Many applications have stateful processing logic that requires maintaining state information. In some scenarios, the current state is read as an input while processing the current message, and the result updates that state. Another typical scenario requiring state management is windowing operations. Hence, the platform needs to allow applications to maintain, access, and update sufficient state information.
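As a toy illustration of both patterns (running state updated per message, and state scoped to a window), the sketch below counts events per key inside fixed tumbling windows; the names are hypothetical, and a real platform would persist this state so it survives failures:

```python
# Toy stateful operator: per-key counts inside tumbling windows.
# State lives in a plain dict here; a real platform would persist it
# durably so it survives process failures. Names are illustrative only.

WINDOW_SECONDS = 60

def window_start(ts):
    """Align a timestamp to the start of its tumbling window."""
    return ts - (ts % WINDOW_SECONDS)

def process(state, msg):
    """Read current state, process one message, write the updated state."""
    key = (window_start(msg["ts"]), msg["key"])
    state[key] = state.get(key, 0) + 1
    return state

state = {}
for msg in [{"ts": 5, "key": "a"}, {"ts": 30, "key": "a"}, {"ts": 65, "key": "a"}]:
    process(state, msg)

print(state)  # {(0, 'a'): 2, (60, 'a'): 1}
```

The evaluation question is where this dict effectively lives on each platform (in-memory, checkpointed, external store) and what the access/update API looks like.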
4. Message delivery guarantees (Handling message level failures):
Different applications would need different levels of message delivery guarantees. For example:
- At least once: no data loss is allowed, but replays are acceptable.
- At most once: no data replay is allowed, but data loss is acceptable.
- Exactly once: neither data loss nor replays are acceptable; each message is processed exactly once.
Ideally, a stream processing platform should provide all of the message-level guarantees mentioned above. The aim is to explore exactly which guarantee levels each platform supports, and under what conditions.
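The first two guarantees differ only in when the consumer acknowledges a message relative to processing it. The simulation below (the broker and crash mechanics are simulated, not any real system's API) shows how ack ordering determines the guarantee when a crash hits between the two steps:

```python
# Simulated consumer showing why ack ordering determines the guarantee.
# The "broker" is a plain list; exactly one crash is simulated per run.

class Crash(Exception):
    pass

def run(messages, ack_before_processing):
    processed, pending = [], list(messages)
    crashed = False  # simulate exactly one crash
    while pending:
        msg = pending.pop(0)
        try:
            if ack_before_processing:          # at-most-once
                # message was acked before processing; a crash here
                # loses it forever (no replay, possible loss)
                if not crashed:
                    crashed = True
                    raise Crash()
                processed.append(msg)
            else:                               # at-least-once
                if not crashed:
                    crashed = True
                    processed.append(msg)       # processed once already...
                    raise Crash()               # ...but crashed before the ack
                processed.append(msg)
        except Crash:
            if not ack_before_processing:
                pending.append(msg)             # unacked => broker redelivers
    return processed

print(run(["m1"], ack_before_processing=False))  # ['m1', 'm1']: replayed, not lost
print(run(["m1"], ack_before_processing=True))   # []: lost, never replayed
```

Exactly-once is the hard case: it needs either idempotent/transactional writes or deduplication on top of at-least-once delivery, which is why it is worth probing each platform's exact conditions.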
5. Fault Tolerance (Handling process/platform level failures):
Failures can and do happen at various levels, for example a worker node or the master node going down. The platform should be able to recover from all such failures and resume applications from their last successful state. The aim is to find out the level of fault tolerance each platform currently supports at the system level.
6. Debuggability and monitoring:
The platform needs to emit sufficient counters and stats to measure the health of both the platform and the applications running on it. It should also be easy to debug any issue, whether at the platform level or at the application level, with sufficient logging to support this. The simplest way to find out is to actually run some applications, gauge the level of debuggability and insight the platform provides, identify any gaps, and assess whether they can be filled in the short or long term.
7. Auto scaling:
The platform needs to allow any application to scale out horizontally, either automatically or manually, for example when incoming traffic spikes or when an outage is followed by service restoration.
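Horizontal scale-out usually rests on key-based partitioning: work is routed to workers by a hash of the message key, so adding workers spreads the same keyspace over more instances. A minimal sketch (routing scheme and names are illustrative, not a specific platform's mechanism):

```python
# Sketch of horizontal scale-out via key-based partitioning.
# Routing scheme and names are illustrative, not any platform's mechanism.
import zlib

def route(key, num_workers):
    """Deterministically pick a worker for a message key."""
    return zlib.crc32(key.encode()) % num_workers

def distribute(keys, num_workers):
    """Group message keys by their assigned worker."""
    assignment = {w: [] for w in range(num_workers)}
    for k in keys:
        assignment[route(k, num_workers)].append(k)
    return assignment

keys = [f"user-{i}" for i in range(1000)]
for workers in (2, 4):
    loads = [len(v) for v in distribute(keys, workers).values()]
    print(workers, "workers, max load per worker:", max(loads))
```

Note that scaling changes where each key lands, which is exactly why scale-out interacts with the state management and replay criteria: the platform must move or rebuild per-key state when the partition count changes.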
8. Isolation:
Multiple applications should be able to run concurrently on any node without affecting each other at run-time. This requires isolation at various levels:
- JVM-level isolation: for example, no classpath or common-library collisions between applications
- Resource-level isolation: fair sharing of system-level resources such as CPU, memory, and network
9. YARN integration:
The stream processing platform should be able to leverage resource management provided by YARN. This allows us to efficiently re-use the underlying resources (as YARN containers) across various distributed systems running on the same set of machines (for example Streaming, Hadoop, Spark, and so on).
10. Active open-source community/adoption level:
A strong community is critical to the successful adoption of any open-source project: it constantly enhances the feature-set and fixes production bugs. This is visible through the project roadmap, release cycles, and current state of development. The extent to which a platform has been adopted in large-scale production is also a true testament to its stability and maturity.
11. Ease of development:
The platform should provide simple APIs for developing applications. A developer should be able to easily write an application and unit-test it in a standalone setup, rather than having to use a production cluster for unit tests or functionality experiments.
12. Ease of Operability:
It should be easy to set up the platform on a given set of machines and to perform rolling upgrades in production. The platform should provide enough visibility into system-level health and the health status of applications.
13. Application level Replay of Events:
The platform should allow any application to go back in time and replay events from its last successful state. This capability is needed in production for a variety of reasons: for example, when an application deployment fails, a bug fix needs to be applied, or a release simply needs to be rolled back.
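Conceptually, replay requires the platform to checkpoint how far an application has successfully read, so processing can be rewound to that point. A minimal sketch over an in-memory log (all names here are hypothetical):

```python
# Minimal replayable log: a consumer tracks a read offset and can rewind
# to a previously checkpointed position. Illustrative only.

class ReplayableLog:
    def __init__(self):
        self.events = []
        self.offset = 0      # next position to read
        self.checkpoint = 0  # last known-good position

    def append(self, event):
        self.events.append(event)

    def read(self):
        """Read all events from the current offset onward."""
        batch = self.events[self.offset:]
        self.offset = len(self.events)
        return batch

    def commit(self):
        """Mark everything read so far as successfully processed."""
        self.checkpoint = self.offset

    def rewind(self):
        """Go back in time: resume from the last successful state."""
        self.offset = self.checkpoint

log = ReplayableLog()
for e in ["e1", "e2", "e3"]:
    log.append(e)

print(log.read())   # ['e1', 'e2', 'e3']
log.commit()
log.append("e4")
print(log.read())   # ['e4']  -- suppose processing this batch failed...
log.rewind()
print(log.read())   # ['e4']  -- ...so we replay it after the fix or rollback
```

In practice this capability also depends on the upstream source retaining events long enough to replay them, which is worth checking alongside the platform itself.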
Stay tuned for the next and final edition of this blog series that describes our findings on evaluation criteria along with the final recommendation.
Previous Blogs: Real-Time Stream Processing at InMobi- Part 1