This is the first in a series of blogs that will detail the real-time stream processing at InMobi.
Many InMobi applications increasingly need to be able to process real-time data in a streaming fashion. The absence of a true streaming alternative has forced many of them to model themselves on ‘micro-batching’ and run on the Hadoop platform. This approach is not a natural fit for these kind of applications. There are two reasons for this. One, it leads to huge stress on the Hadoop platform in the form of a large number of short-lived jobs, and two, it does not necessarily provide the end latency expected by these applications. Hence, it takes a substantially long time to gather insights from the large amounts of data.
The natural approach would be to use a processing platform that is natively built for streaming applications. For this purpose, we decided to fix upon a real-time streaming platform that could cater to the varying functional, performance, reliability, and operational requirements from different company applications. This streaming platform would be exposed as a service that allows any developer to quickly develop, deploy, and on-board the platform.
To achieve the above goals, we set out to perform an elaborate exercise that consisted of:
- Identifying currently-running InMobi applications that require streaming requirements, or that can hugely benefit by stream processing.
- Identifying the various open-source platform technology choices available.
- Doing a comparative analysis of the different platforms available based on various evaluation criteria.
- Summarising findings and recommending a platform.
Next, we will cover the various steps of this exercise in more detail
Step 1: Identifying InMobi Use Cases
InMobi receives almost 8 billion daily requests from 175 countries, and has the ability to reach out to more than 1.4 billion users through its network. In the process, terabytes of data are generated daily by multiple geo-distributed serving systems. This data needs to be made available quickly for further processing that includes inference, analytics, and reporting systems. Serving systems then need to be quickly fed back with these inferences and insights.
It is evident that InMobi has a multitude of applications that will benefit from a real-time stream processing platform. Such a platform will allow:
- Real-time ingestion of network activity: Network activity streams like ad requests, impressions, clicks, and so on are generated in multiple geo-distributed clusters. These need to be stored in distributed storage systems. Performing this ingestion in real-time can, in turn, enable downstream processing pipelines to switch to stream processing.
- Real-time business intelligence: We need to be able to generate business reports that give advertisers and publishers an up-to-date status of conversions and monetization happening on the ad network. Real-time stream processing can empower real-time business intelligence.
- Real-time fraud detection: It is important to identify fraudulent activity on network and and take appropriate action. Performing fraud checks in real-time allows the network to identify potential sources of bad quality data and swiftly respond to them.
- Real-time user profile: An end-user performs various interactions with the network. Processing different activity streams can generate and update an end-user profile in real-time.
- Real-time serving feedback: Real-time analysis of various user interactions can give ad-serving systems necessary feedback. This will help to serve the most relevant ad to end user.
Step 2: Identifying Potential Technology Candidates
There are a number of popular stream processing platforms currently available, like Storm, Samza, Spark Streaming, and Flink. All of them offer exciting capabilities and appear promising. However in the interest of time, we chose to narrow down the detailed evaluation candidates to Storm and Spark Streaming.
Stay tuned for subsequent blogs that will: