This is the second in a series of blogs about real-time stream processing at InMobi.
The first blog provided an introduction to the motivation behind stream processing at InMobi, and gave insights into a number of company-wide use cases for stream processing.
In this blog, we provide a brief overview of both the Storm and Spark streaming platforms. It is important to understand the internals of each platform to perform an effective comparison. Thus, we cover their overall design and architecture, and also briefly describe the various functional concepts exposed by each of them.
This section provides a brief overview of various concepts exposed by the Storm platform. For a rich and detailed explanation of these concepts, please see Storm’s documentation. There are also many online resources that provide a first-hand deployment experience.
- A topology is a graph of spouts and bolts that are connected with stream groupings. Spouts and bolts are explained below.
- It captures the logic of a real-time application.
- A stream is an unbounded sequence of tuples that is processed and created in parallel, in a distributed fashion.
- They are defined with a schema that names the fields in the stream's tuples.
- Every stream is given an ID when declared.
- Tuples can contain fields of different types: integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays.
- You can also define your own serializers so that custom types can be used natively within tuples.
- A spout is a source of streams in a topology. Generally, it reads tuples from an external source and emits them into the topology.
- Reliable spout: Capable of replaying a tuple if it failed to be processed by Storm.
- Unreliable spout: Forgets about the tuple as soon as it emits it.
- nextTuple(): Emits the next tuple into the topology, or simply returns if there are no new tuples to emit.
- ack(): Called when Storm detects that a tuple emitted from the spout has successfully completed through the topology.
- fail(): Called when Storm detects that a tuple has failed to be completed.
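The reliable-spout lifecycle described above can be sketched as a toy model. This is not the real Storm API (which is Java and emits through an output collector); the class and method names here are illustrative only, showing how tracking in-flight tuples by message ID makes replay on failure possible:

```python
import collections

class ToyReliableSpout:
    """Toy sketch of a reliable spout: in-flight tuples are remembered by
    message id so a failed tuple can be replayed. Illustrative only; the
    real Storm spout API is Java and emits via a SpoutOutputCollector."""

    def __init__(self, source):
        self.source = collections.deque(source)  # tuples waiting to be emitted
        self.in_flight = {}                      # msg_id -> tuple, until acked
        self.next_id = 0

    def next_tuple(self):
        """Emit the next tuple (with a message id), or None if none remain."""
        if not self.source:
            return None
        tup = self.source.popleft()
        msg_id = self.next_id
        self.next_id += 1
        self.in_flight[msg_id] = tup  # remember until acked or failed
        return msg_id, tup

    def ack(self, msg_id):
        """Tuple completed successfully through the topology: forget it."""
        self.in_flight.pop(msg_id, None)

    def fail(self, msg_id):
        """Tuple failed downstream: put it back in the queue for replay."""
        tup = self.in_flight.pop(msg_id, None)
        if tup is not None:
            self.source.append(tup)
```

An unreliable spout would simply skip the `in_flight` bookkeeping: once emitted, a tuple is forgotten and can never be replayed.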
- A bolt performs all types of processing: filtering, functions, aggregations, joins, and so on.
- A bolt can emit more than one stream, and can accept multiple input streams.
- execute(): Takes a tuple as input and emits new tuples.
- emit(): Performs anchoring when called with an input tuple. It tells Storm that a new edge is being created in the tuple tree.
- ack(): Must be called for each tuple once the bolt has fully processed it.
- fail(): Must be called if the bolt encounters a failure while processing a tuple.
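The execute/emit/ack contract can also be sketched as a toy model. Again, this is not the real Java API; the split-sentence bolt and its method names are hypothetical, chosen to show how anchoring records edges in the tuple tree:

```python
class ToySplitBolt:
    """Toy sketch of a bolt: execute() takes an input tuple, emits new
    tuples anchored to it, then acks the input. Illustrative only; the
    real Storm bolt API is Java and emits via an OutputCollector."""

    def __init__(self):
        self.emitted = []   # (anchor, new_tuple) edges of the tuple tree
        self.acked = []

    def emit(self, anchor, new_tuple):
        # Anchoring: new_tuple becomes a child of anchor in the tuple tree,
        # so a downstream failure of new_tuple also fails the spout tuple.
        self.emitted.append((anchor, new_tuple))

    def ack(self, tup):
        self.acked.append(tup)

    def execute(self, sentence):
        # Split a sentence into words, anchoring each word to the input.
        for word in sentence.split():
            self.emit(sentence, word)
        self.ack(sentence)  # every fully processed input tuple must be acked
```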
- Stream Groupings
- A stream grouping defines how a bolt’s input stream should be partitioned among the bolt's tasks.
- There are seven built-in stream groupings in Storm.
- Shuffle grouping: Tuples are randomly distributed among a bolt’s tasks, keeping load-balancing in mind.
- Fields grouping: Tuples with the same value for a given field (for example, ‘ID’) are consistently hashed to the same task.
- All grouping: The entire stream is replicated across all tasks.
- Global grouping: The entire stream goes to only one task.
- None grouping: Defaults to shuffle grouping.
- Direct grouping: The producer of the tuple decides which task of the consumer receives it.
- Local or shuffle grouping: If the target bolt has tasks in the same worker process, tuples are shuffled to those in-process tasks; otherwise, this behaves like an ordinary shuffle grouping.
- Custom groupings can also be added.
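The intuition behind the most common groupings can be captured in a few lines. This is a toy sketch (Storm's actual implementation is Java), but the routing logic is the same idea: each grouping is just a rule mapping a tuple to a task index.

```python
import random

def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: pick a task at random (load-balanced on average)."""
    return rng.randrange(num_tasks)

def fields_grouping(tup, field, num_tasks):
    """Fields grouping: hash the chosen field so tuples with equal values
    for that field always land on the same task."""
    return hash(tup[field]) % num_tasks

def global_grouping(tup, num_tasks):
    """Global grouping: the entire stream goes to a single task."""
    return 0
```

A custom grouping is simply another such rule, chosen by the application.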
- Trident is a high-level abstraction provided over Storm.
- Gives high throughput (millions of messages per second) and stateful stream processing with low-latency distributed querying.
- Allows stateful, incremental processing on top of any persistent store.
- Processes the stream as small batches of tuples (batches are time-based and each batch can potentially contain millions of messages, depending upon input stream throughput).
- Provides high-level batch-processing APIs similar to Pig/Cascading: joins, aggregations, grouping, functions, and filters.
- Cross-batch aggregation state can be persisted to stores such as in-memory maps, Memcached, Cassandra, or other key-value stores.
- Spout types:
- ITridentSpout: The most general API that can support transactional or opaque transactional semantics. Generally, you will use a partitioned flavor of this API rather than the whole API directly.
- IBatchSpout: A non-transactional spout that emits batches of tuples at a time.
- IPartitionedTridentSpout: A transactional spout that reads from a partitioned data source, such as a cluster of Kafka servers.
- IOpaquePartitionedTridentSpout: An opaque transactional spout that reads from a partitioned data source.
- Trident processes a single batch at a time, but batches can be pipelined for parallelism.
- Even while processing multiple batches simultaneously, Trident orders the state updates that take place in the topology across batches. This is essential for achieving exactly-once processing semantics.
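Why ordered, transaction-tagged state updates give exactly-once semantics can be seen in a small sketch. This is a toy model of the idea, not Trident's actual state API: each batch carries a transaction ID, updates commit strictly in order, and a replayed batch whose txid has already been applied is skipped.

```python
class ToyTridentState:
    """Toy sketch of transactional state in the spirit of Trident: batches
    commit in txid order, and a replayed batch is detected (its txid is not
    greater than the last applied one) and skipped, so each batch updates
    the state exactly once."""

    def __init__(self):
        self.count = 0
        self.last_txid = 0   # highest transaction id applied so far

    def update(self, txid, batch):
        if txid <= self.last_txid:
            return False              # replayed batch: already applied, skip
        assert txid == self.last_txid + 1, "batches must commit in order"
        self.count += len(batch)      # apply this batch's update
        self.last_txid = txid
        return True
```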
- A Storm cluster involves a master process (Nimbus), supervisor processes, worker processes, executors (threads), and tasks.
- Intra-worker communication in Storm (between threads on the same Storm node) is achieved through the high-performance LMAX Disruptor library.
- Inter-worker communication (node-to-node across the network): Uses Netty, or ZeroMQ (previous releases).
- Inter-topology communication: Application specific (for example, Kafka or a database).
This section provides a brief overview of various concepts exposed by Spark streaming.
- Spark streaming is a stream-processing system built on top of the core Spark API. It treats stream processing as a series of deterministic batch operations.
- The incoming stream is divided into batches of a fixed duration (for example, one second).
- Each batch is represented as an RDD, which is Spark’s abstraction of an immutable and distributed dataset.
- A never-ending sequence of these RDDs is called a Discretized Stream (DStream).
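The discretization step above can be sketched in a few lines. This is a toy model, not Spark's implementation: it simply chops a timestamp-ordered stream of records into fixed-duration batches, where each batch plays the role of one RDD and the sequence of batches plays the role of the DStream.

```python
def discretize(records, interval):
    """Toy sketch of Spark Streaming's discretization: chop a stream of
    (timestamp, value) records, assumed sorted by timestamp, into batches
    of fixed duration `interval`."""
    if not records:
        return []
    start = records[0][0]
    end = records[-1][0]
    num_batches = int((end - start) // interval) + 1
    batches = [[] for _ in range(num_batches)]
    for ts, value in records:
        # Each record falls into the batch covering its timestamp.
        batches[int((ts - start) // interval)].append(value)
    return batches
```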
- There are two main parts in a Spark streaming application:
- Receiver: Receives data from an external streaming source and stores it in Spark’s memory for processing. This represents the input DStream.
- Processing: Transforms the data stored in Spark by applying operations to the DStream.
- Operations: There are two main categories of operations that can be applied to a DStream: transformations and output operations.
- Stream Transformations: These transformations modify data from one DStream to another. They include:
- Standard RDD operations: filter, map, mapPartitions, union, distinct, groupByKey, reduceByKey, cogroup, and so on.
- Stateful window operations: countByWindow, reduceByWindow, and so on.
- Output operations: These operations write data to an external entity, for example, HDFS or Kafka.
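Because a DStream is just a sequence of batches, both kinds of transformations reduce to operations over those batches. The following toy sketch (the function names are ours, loosely echoing Spark's map and countByWindow, not the real API) makes that concrete:

```python
def map_batches(dstream, fn):
    """Stateless transformation: apply fn to every element of every batch,
    in the spirit of DStream map."""
    return [[fn(x) for x in batch] for batch in dstream]

def count_by_window(dstream, window_len):
    """Stateful window operation: for each batch index, count the elements
    across the last `window_len` batches (a toy countByWindow)."""
    return [sum(len(b) for b in dstream[max(0, i - window_len + 1):i + 1])
            for i in range(len(dstream))]
```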
- Allows the application to maintain arbitrary state and use it during computation.
- Allows batch processing to be combined with Spark streaming by performing arbitrary Spark RDD computations on a DStream, for example, joining a DStream with a static file.
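Maintaining state across batches can be sketched as carrying a value forward from one batch to the next. This toy function is in the spirit of Spark Streaming's updateStateByKey, but it is our own simplified illustration, not the real API:

```python
def running_count_by_key(dstream, state=None):
    """Toy sketch of cross-batch state: carry a running count per key
    across all batches of the DStream, starting from an optional
    previous state."""
    state = dict(state or {})
    for batch in dstream:
        for key in batch:
            state[key] = state.get(key, 0) + 1   # update state with this batch
    return state
```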
- Processing Flow:
- DStream Graph > RDD Graph > Spark Jobs
- For each interval, an RDD graph is computed from the DStream graph.
- For each output operation, a Spark action is created.
- For each action, a Spark job is created to compute it.
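The three steps of the processing flow amount to a simple nested loop: every batch interval, each output operation on the DStream graph becomes one Spark job. A toy sketch (the names here are illustrative, not Spark internals):

```python
def generate_jobs(output_ops, intervals):
    """Toy sketch of the scheduling flow: for each batch interval, each
    output operation triggers an action, and each action becomes one
    Spark job for that interval. `output_ops` is just a list of
    output-operation names here."""
    jobs = []
    for interval in intervals:
        for op in output_ops:
            jobs.append((interval, op))   # one job per action per interval
    return jobs
```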
- Spark Driver: Manages the availability of data for each batch from all external sources, and handles periodic job scheduling.
- Driver Components:
- Network Input Tracker: Keeps track of the data received by each network receiver and maps it to the corresponding input DStream.
- Job Scheduler: Periodically queries the DStream graph to generate jobs from the received data, and hands them to the Job Manager for execution.
- Job Manager: Maintains a job queue and executes jobs in Spark.
Stay tuned for subsequent blogs that list out various criteria to evaluate a processing platform, and finally, our findings based on these evaluation criteria along with the final recommendation.
Previous Blog: Real-Time Stream Processing at InMobi - Part 1
Next Blog: Real-Time Stream Processing at InMobi - Part 3