The InMobi Technology Team submitted a total of 10 papers to Hadoop Summit this year. Here are their abstracts.
1. Driving 10x Efficiency in Testing Hadoop Applications through Automation
Abstract: Production data pipelines keep evolving rapidly as business and product needs evolve. One of the key challenges is validating these complex data pipelines after each change and certifying them against regressions. "Strider" is an automation framework, developed at InMobi, that solves regression automation challenges for Hadoop applications. "Strider" addresses some of the unique challenges in testing data applications, such as the variety of data sources and data formats. The framework is extensible, so custom data formats can be added.
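The abstract does not describe Strider's internals, but the core idea of regression-certifying a pipeline change can be illustrated with a minimal sketch: run the old and new pipeline versions on the same input and diff their keyed outputs. The function name and record shape below are hypothetical, not Strider's actual API.

```python
def diff_records(expected, actual, key):
    """Compare two pipeline outputs (lists of record dicts), keyed on `key`.

    Returns (missing, changed): keys present in the baseline run but absent
    from the candidate run, and keys whose records differ between the runs.
    """
    exp = {r[key]: r for r in expected}
    act = {r[key]: r for r in actual}
    missing = sorted(k for k in exp if k not in act)
    changed = sorted(k for k in exp if k in act and exp[k] != act[k])
    return missing, changed


baseline = [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 5}]
candidate = [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 7}]
missing, changed = diff_records(baseline, candidate, key="id")
# An empty `missing` plus a non-empty `changed` points at a behavioural
# regression in the candidate pipeline rather than dropped data.
```

In a real framework the comparison would run over HDFS outputs in whatever formats the pipeline emits; the keyed-diff principle is the same.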
2. Managing Data Quality in Petabyte Scale Hadoop Applications
Abstract: Data quality, being critical for any business, is core to managing Hadoop applications. Managing data quality at massive scale is fairly challenging. This session covers how InMobi has solved this challenge at petabyte scale by building a data pipeline validation automation framework. One of the key topics covered is how this framework generates test data intelligently.
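As a rough illustration of the kind of check such a validation framework automates (this is a generic sketch, not InMobi's framework), a per-batch quality report might track row counts and per-field missing-value rates against thresholds:

```python
def quality_report(records, required_fields, min_rows=1):
    """Compute simple quality stats for a batch of records: total row count,
    whether it meets a minimum, and the null/missing rate per required field."""
    n = len(records)
    report = {"row_count": n, "row_count_ok": n >= min_rows, "null_rate": {}}
    for field in required_fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        report["null_rate"][field] = missing / n if n else 1.0
    return report


batch = [{"user": "a", "geo": "US"}, {"user": "b", "geo": None}]
report = quality_report(batch, required_fields=["user", "geo"], min_rows=2)
# report["null_rate"]["geo"] is 0.5 -- half the batch lacks a geo signal,
# which a pipeline gate could flag before downstream jobs consume the data.
```

At petabyte scale these aggregates would be computed inside the cluster (e.g. as MapReduce or Hive jobs) rather than in-process, but the checks themselves look like this.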
3. Audience Targeting at a Billion User Scale through a Geo-distributed Hadoop System
Abstract: Building an audience targeting platform requires multiple stages and processes, from collecting data across data centers to shipping and processing it, each with different requirements with respect to scale, latency and processing complexity. In this talk, we detail these requirements and describe the solution that we, at InMobi, have implemented to address them.
4. Driving Enterprise Data Governance for Big Data systems through Apache Falcon
Abstract: Data governance is a fairly important element in the enterprise data management world. As Hadoop makes its way into enterprises, there is a pressing need for a comprehensive data governance solution in this space. Apache Falcon looks at big data management holistically, capturing metadata for governance policies and changes for every data asset and data application, thereby enabling comprehensive lineage, change management control and access control. In this talk we cover how Apache Falcon (incubating) addresses some of the key challenges in this area and discuss case studies of how Apache Falcon is used to implement data governance on enterprise big data platforms.
5. Modeling Location Information for Large Scale Data Processing and Analysis
Abstract: Modeling location information for large-scale analysis requires different considerations, from both storage and processing perspectives. Location signals can be ingested in multiple forms and stored at different granularities. Queries over this data are often nearness-based and rarely an equi-join. In this talk, we present the approach that we, at InMobi, have taken to model disparate location signals in a manner that enables efficient nearness queries.
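To make the distinction concrete: a nearness query asks "which points lie within r km of a center" rather than joining on exact coordinate equality. A minimal sketch (this shows the query semantics only, not InMobi's storage model, which would need spatial indexing such as grid cells or geohashes to avoid a full scan):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def near(points, center, radius_km):
    """Nearness filter: keep (lat, lon) points within radius_km of center."""
    clat, clon = center
    return [p for p in points if haversine_km(p[0], p[1], clat, clon) <= radius_km]


points = [(0.0, 0.0), (0.0, 1.0)]  # one degree of longitude ~ 111 km at the equator
# near(points, (0.0, 0.0), 100) keeps only the first point;
# widening the radius to 150 km keeps both.
```

In a production system the filter would be preceded by a coarse index lookup (e.g. candidate grid cells around the center), with the exact distance check applied only to those candidates.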
6. Apache Lens: Cut Data Analytics Silos in your Enterprise
Abstract: The Apache Lens project aims to ease analytical querying and cut data silos by providing a single view of data across multiple data stores. Conceiving of data as a cube with hierarchical dimensions leads to conceptually straightforward operations that facilitate analysis. Integrating Apache Hive with other traditional warehouses provides the opportunity to optimize query execution cost while maintaining latency SLAs. In this talk, Sharad and Amareshwari will discuss the current and upcoming features in Apache Lens. They will also discuss their experience running Apache Lens in production and give a live demonstration of Lens' salient features.
7. User Centric Analytics at a Billion User Scale
Abstract: At InMobi we have put on the user-first lens and designed an analytics platform, with the user as the primary axis, that can infer behavioural, contextual and intent attributes of the user in near real time. The platform uses Apache Lens for data modelling, Hive on Tez as the primary query engine and Spark for predictive modelling. In this session, we plan to talk about the details of the platform and its architecture, and share how this model answers a broad spectrum of user-related queries such as "show me the trend of ad requests by all users near a coffee shop" or advanced queries like "predict the lifetime value of a user given their app-session usage pattern".
8. Scalability and Reliability through Analysis and Interpretation of Metrics
Abstract: Hadoop and its associated stack form a highly measurable and observable system. How do we use the hundreds of metrics emitted by each component of the stack to build an early warning system that can warn us about a potential degradation of performance, one that might result in an outage? It is all about correlating the relationships between different metrics to analyse, visualise and predict the health of the system, and to tune the performance of each of its components. Which key metrics need to be observed, and which statistical functions should be used to make sense of what otherwise appears to be a totally obscure collection of metrics? It is the art of finding order in the chaos of metrics.
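One of the simplest statistical functions for relating two metric series is the Pearson correlation coefficient; the talk's actual toolkit is not specified in the abstract, but a minimal version of the idea looks like this:

```python
import math
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length metric series.

    Returns a value in [-1, 1]: near +1 means the metrics rise and fall
    together (e.g. queue depth vs. job latency), near -1 means they move
    in opposition, near 0 means no linear relationship.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


queue_depth = [10, 20, 30, 40]
job_latency = [1.0, 2.1, 2.9, 4.2]
# pearson(queue_depth, job_latency) is close to 1.0: when such a pair of
# metrics is known to track together, a sudden break in the correlation
# can itself serve as an early-warning signal.
```

An early-warning system would compute correlations like this over sliding windows across hundreds of metric pairs and alert when an established relationship degrades.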
9. Apache Falcon - Data Pipeline Operations
Abstract: Apache Falcon is used widely at InMobi to support all feedback/optimisation and reporting applications on Hadoop. Falcon helps solve challenges relating to data handling and management, scheduling and monitoring. In this talk we cover how Apache Falcon is deployed at InMobi across multiple colocation facilities and how aspects such as pipeline deployment, replication and data archival are achieved.
10. Shifting large scale data centers with business as usual
Abstract: There is a common management interview question: "How would you move Mount Fuji?" We had a mountain to move when our Hadoop cluster, holding multiple petabytes of data, had to be migrated to a new datacenter location. The cluster executed on the order of 100K jobs daily, with several business-critical, SLA-bound analytical applications running at different frequencies. Our primary goals were business continuity, no downtime and zero failures. Along with the cluster, all the endpoints that read from or write to the cluster had to be migrated to reduce latencies. We also took this opportunity to change some stacks and upgrade the OS and other components in the ecosystem. We will talk about how the entire migration of the cluster took place, the tools and techniques used for data replication during migration, and data/application validation post-migration.