Apache Lens Demystified


Jothi Padmanabhan on May 19, 2015

1. Introduction

The Apache Lens documentation describes Lens as a platform providing a “Unified Analytics Interface” -- a common view of data stored across multiple tiers and storage layers. In this article, we will break down that definition and explain, in layman’s terms, what Lens is all about -- the what and the why of it. The how (design and architecture) we defer to a later post; interested readers can look up the Lens home page in the interim.

2. Example Problem

Let us consider a simplified but real-world problem. InMobi, incidentally one of the more prolific users of Lens, is in the mobile advertising business. An important differentiator in this domain is the ability to develop a deep understanding of users so that relevant advertisements can be presented to a given user when the opportunity arises. To this end, InMobi collects a lot of user information, subject to PII and privacy constraints, and stores it for analytics and inference. In this article, let us focus on the analytics of user data.

InMobi receives about 5 billion advertisement requests a day. Each request is associated with a variety of signals, some relevant for user-based analytics and others not. InMobi responds to the majority of these requests with a suitable advertisement, and this information is captured in a request-response log. In ad parlance, the parameters that accompany the originating ad requests are called “supply side” parameters and those related to the served advertisement are called “demand side” parameters. In addition, some information associated with the request-response event is important for analytics in its own right -- for example, the dollar impact of the event and whether the advertisement was clicked.

2.1 Example Schemas

Let us presume that we get the following (simplified) parameters from the request-response log on a daily basis.

struct request-response-info {
	user-id
	timestamp
	location-id (lat-long)
	device-id
	requested-app-id
	served-ad-id
	clicked-or-not
	dollar-value
} 

Some of the parameters, like user and location, have additional information associated with them. Let us capture that as well:

struct user-info {
	user-id
	age-bucket
	gender
	interests
}
struct location-info {
	location-id
	latitude
	longitude
	zip
	city
	country
}
struct app-info {
	app-id
	app-category //gaming, finance, maps
}
struct device-info {
	device-id
	manufacturer
	os
}

2.2 Sample Queries

Let us list some of the queries that we are interested in. They broadly fall into two categories:

  1. User counting across the different slices: how many users visited from a given city? How many fall within a given age bracket? Or some combination of these?
  2. Measure reporting: what was the dollar impact across the different slices?

Each of the above categories of queries would need to be supported over multiple time windows (the last day, last week, last month, between specific dates, and so on). A sketch of such a query follows.
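
For concreteness, here is roughly what a query from the first category might look like in plain SQL, assuming the structures above were ordinary tables. This is only a sketch: hyphens are replaced with underscores (most SQL dialects disallow them in identifiers), the timestamp field is renamed event_time, and the dates are placeholders.

-- distinct users seen from a given city over a specific week (illustrative)
select count(distinct r.user_id)
from request_response_info r
join location_info l on (r.location_id = l.location_id)
where l.city = 'Bangalore'
  and r.event_time between '2015-05-11' and '2015-05-17';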

3. Data Modeling

At InMobi, Hadoop is the de-facto choice for big data processing and storage. Most of the ETL jobs are based on Pig/MR, and data is stored in HDFS. Given this background, it makes sense to choose Hive as the analytics engine. Hive allows data residing in HDFS to be queried with a SQL-like interface and fits well with the rest of the ecosystem.

Each of the structures above could be modelled as a Hive table and joined on the respective IDs for queries. This is no different from the 3NF or BCNF normalized forms in the traditional RDBMS world. Note that this is illustrative; in reality, people may opt to store data in a de-normalized form for performance reasons. To make queries a little faster, we could use the partitioning and bucketing features of Hive. The request-response table could be time-partitioned so that only data corresponding to the query period is consulted. Similarly, the user and request-response tables can be bucketed by user-id to allow for efficient querying on user-id.
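
As an illustration, a Hive definition for the request-response table along these lines might look as follows. This is only a sketch: the column types, the dt partition column and the bucket count are assumptions, and hyphens are again replaced with underscores because Hive identifiers cannot contain them.

create table request_response_info (
  user_id bigint,
  event_time timestamp,
  location_id int,
  device_id int,
  requested_app_id int,
  served_ad_id bigint,
  clicked_or_not tinyint,   -- 1 if the ad was clicked, 0 otherwise
  dollar_value double
)
partitioned by (dt string)                 -- one partition per day
clustered by (user_id) into 256 buckets;   -- for efficient joins on user_id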

Even with these optimizations, queries could take significant processing time to evaluate. This is primarily due to the sheer volume of data involved. With several billion request-response records a day and about a billion users, these joins need significant data churning. For example, consider the following query: calculate the total number of clicks from male users belonging to “Bangalore, India” over the previous week. The naive join for this query could take a pretty long time to complete, even after considering only the relevant time partitions.
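
Written naively against the normalized tables above, that query is a three-way join. Again a sketch, with assumed identifiers; the dt bounds stand in for the previous week.

-- total clicks from male users in Bangalore, India over the previous week
select sum(r.clicked_or_not)
from request_response_info r
join user_info u on (r.user_id = u.user_id)
join location_info l on (r.location_id = l.location_id)
where u.gender = 'male'
  and l.city = 'Bangalore' and l.country = 'India'
  and r.dt between '2015-05-11' and '2015-05-17';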

These kinds of queries are commonly referred to as multi-dimensional analytic queries and are typically handled by OLAP systems. The Wikipedia article on OLAP describes it as comprising three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing. With these operations supported natively, analytical queries have much quicker response times.

Let us see how we could use the “roll-up” operation to speed up the query we have just considered. We would create a weekly process that runs periodically, say every Monday, to aggregate the click count by city. Something like:

select sum(A.clicked_or_not) as clicks, B.city
from request_response_info A
join location_info B on (A.location_id = B.location_id)
where A.dt between 'last-week-monday' and 'this-sunday'
group by B.city

Once this data is materialised and available, queries can hit this table for city counts and get near-instantaneous answers. Note that results can also be produced for several weeks by running the query on the individual weekly tables and aggregating them. This is the essence of ‘roll-up’ in the OLAP world.
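
For instance, if each weekly run appended its output to a hypothetical city_clicks_weekly table keyed by week (names and the week format here are placeholders), a month’s worth of city counts would reduce to a cheap scan over a handful of pre-aggregated rows per city:

select city, sum(clicks)
from city_clicks_weekly
where week between '2015-17' and '2015-20'
group by city;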

In OLAP terminology, data is organized as numerical facts or measures (clicked-or-not and dollar-value in the above example) that are categorized by different dimensions (user, location, app and device in the above example). Roll-ups may be done on one or several dimensions and/or one or several measures.

However, just having the roll-up tables would not suffice; it would require users to have intimate knowledge of all these tables and to use them appropriately. The user would ideally like to issue a query against a single logical table, and the system should choose the appropriate roll-up table to serve it from, according to availability and performance. Traditional OLAP systems do this, but in the Hive world there is no direct equivalent. This is where Lens comes in. Lens provides the OLAP abstraction on top of Hive (and indeed other systems, as long as they are JDBC compliant). In the Lens world, users fire their query against one ‘logical table’ and Lens makes the appropriate choice of the actual physical table to query. Lens entities are described in the next section.

4. Lens Abstractions

Like traditional OLAP systems, Lens understands the primitives of facts and dimensions. Facts are immutable, time-series data of actual events and comprise dim-attributes and measures (the request-response table in the above example). Dim-attributes are the different domain attributes by which data needs to be analysed (user age, device type, location city, etc. in the above example).

Lens separates the query interface for facts and dimensions from the actual storage. Queries are fired against the Lens ‘Cube’ and ‘Dimension’ objects. Actual data is stored in abstractions called ‘fact tables’ and ‘dim tables’. Several fact tables can be associated with a Cube; similarly, several dim tables can be associated with a Dimension. When the user queries data against a cube, Lens provides the intelligence for picking the right fact table.

In our example, all the weekly roll-ups as well as the daily data would be stored as different fact tables associated with a request-response cube. Users are oblivious to the fact that several fact tables back the cube; they just query the cube, and that is the only abstraction they are interested in and exposed to.
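
A cube query reads much like HiveQL issued against the logical cube, with the time range called out explicitly. Something along these lines (a sketch modelled on the Lens CubeQL syntax, reusing the assumed names from our example):

cube select sum(clicked_or_not) from request_response
where time_range_in(event_time, '2015-05-11-00', '2015-05-18-00')

Lens would rewrite this against whichever registered fact table -- the daily data or a weekly roll-up -- can answer it most cheaply.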

A word of caution, though. While Lens makes the intelligent choice of fact table for optimal performance, it is beyond the scope of Lens to actually generate these roll-ups automatically. These tables need to be generated outside of Lens and the data sets registered with Lens so that it can use them appropriately. Typically, these roll-ups are produced by Apache Falcon processes that periodically run Pig/Hive scripts and register the results with Lens.

5. MOLAP/ROLAP/HOLAP

Typically OLAP systems are classified into one of three categories -- MOLAP, ROLAP and HOLAP.

Multidimensional online analytical processing (MOLAP) systems precompute the data in optimized storage before they answer queries. The advantage of these systems is that query latencies are small, but this comes at the cost of pre-processing. Also, since all data is prepared beforehand, MOLAP systems can handle only a small, anticipated subset of queries.

ROLAP systems, on the other hand, do not precompute data; rather, they query the underlying relational systems on demand when user requests arrive. As a consequence, ROLAP systems are relatively slow, but they need no preprocessing and can service a much wider range of queries.

HOLAP systems are a combination of both MOLAP and ROLAP systems.

Lens is more of a ROLAP system. It accesses data from the underlying relational storage systems like Hive (or any JDBC compliant system) through queries when a user queries its cubes or dimensions. However, if there is sufficient interest in the community, it is possible for Lens to evolve into a HOLAP system that supports pre-materialized views for a subset of frequently used queries for faster response times.

This is also the place to call out the difference between Apache Kylin and Lens: Apache Kylin is much closer to being a MOLAP system, as opposed to Lens, which is more like a ROLAP system.

6. Beyond Rollups and Aggregations

While we have focused on the OLAP features in the sections above, Lens has features beyond these. The following are some of them:

  • As mentioned above, Lens is a natural fit for the Hadoop ecosystem with its great integration with Hive.
  • Lens can use different storage platforms underneath, beyond Hive, as long as they are JDBC compliant. This means support for most of the commonly used RDBMSs as well as columnar stores like Redshift and Infobright. At InMobi, we use Infobright for storing data that needs very fast response times, and Lens prioritizes access from that store over Hive automatically.
  • Lens has a metastore that allows for easy exploration and discovery of all the information that it manages. Through a REST interface or a Lens shell, users can have easy access to the metadata of all the cubes, dimensions and the underlying fact tables and dimension tables.
  • Lens also supports query life-cycle management. Since Lens serves most requests by actually firing queries against its underlying storage and aggregating the results as required, there is a fair amount of delay between when a request is made and when the results are available. Lens helps manage this by accepting and tracking queries, and by persisting and formatting the results. Lens also provides a mechanism to share results via email.
  • In addition to query life-cycle management, Lens also allows for analytics over past queries. This is extremely useful not only for identifying anti-patterns in user queries (and subsequently optimizing them), but also for identifying the most frequently used queries, which can be optimized further by building roll-ups and aggregates as necessary.
  • The Lens DSL also has machine learning support over its cubes.

7. Road Ahead

Lens is under active development and is currently an incubating project in the Apache Software Foundation. While the project already sports a rich feature set, it is still in its early days and a lot of interesting new features are planned. Here is a small subset:

  • Richer DSL Support. Lens supports the Cube DSL natively. Work is underway to support extensions that let users plug their own DSLs into it. This facilitates a shared deployment model with user extensions, which is required to provide a single unified catalog to users and to share the execution engine and data across multiple querying mechanisms.
  • Unified Catalog. As mentioned earlier, Lens exposes all metadata through a relational interface (currently using HCatalog). If there are use cases where Lens needs to support a non-relational schema, this mechanism might be insufficient and a new means of exposing the catalog might be warranted.
  • Query Management Optimizations. This is a rich area of possibilities, including richer scheduling primitives and added features like throttling.
  • Query workflow optimizations. The main query engine itself can be optimized for different use cases to deliver much higher performance.

8. Summary

In this article, we gave a brief introduction to the analytics domain using a specific example and went on to establish the need for an OLAP system to solve such problems. We then introduced Lens as an OLAP system and showed how the above example can be modelled and solved using it. We went on to highlight some of Lens’s features and concluded with some thoughts on the road ahead.