InMobi is an ad network, an entity that matches users to advertisers in a manner that maximizes efficiency. Doing this is a complex, multi-objective optimization problem. One has to cater to the preferences of several entities in the ecosystem - the publishers, the advertisers, and most importantly, the users.
Advertisers are interested in reaching a favourable set of users at a low cost, and they identify these users by applying the right targeting filters. Users visit the network from all over the world, and each user has a multitude of properties or characteristics - geography, phone type, demographics, and so on. When advertisers know their target audience (for example, all weekend tablet users in New York), it is easy to reach it. But what if an advertiser doesn’t know what their ideal users look like?
The User Segments product at InMobi manages this complexity for advertisers. We create user segments tailored to specific interest categories - Fashion Lovers, Strategy Gamers, Frequent Travellers, and so on. These segments are built through analysis of users’ past behaviour and look-alike modeling, and they encompass all visible attributes of a user. The advertiser doesn’t have to worry about whether the user has a smartphone or a tablet; what matters is that the user is interested in what the advertiser has to offer.
In the rest of this post, we will look in detail at some of the nuances involved in building these segments, and the lessons we learned in the process.
Given a category, the task is to identify the ‘segment’: the set of users who are interested in that category. Two challenges crop up here. First, our understanding of each user is limited - we have only sporadic interactions with them, so our knowledge of any one user is shallow. Second, a user’s interests may change over time. Together, these challenges mean that we are predicting a changing signal under noisy conditions.
InMobi has data for around a billion users, so any system we build has to scale to repeated processing of a data set of that size. Building a segment starts with collecting the required data points for each user, divided into two sets corresponding to two different time periods. The first is the observation period, during which the required user attributes are collected - this forms our view of the user. The second is the action period, during which the user’s actions are observed - primarily, whether or not they performed a desired event. A model is then built to predict the probability of the desired event, given the attributes of the user.
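Concretely, the observation/action split can be sketched as follows. The dates, event names, and window lengths here are illustrative, not our actual configuration:

```python
from datetime import date

# Hypothetical event log: (user_id, event_date, event_type).
events = [
    ("u1", date(2014, 5, 3), "app_view"),
    ("u1", date(2014, 5, 20), "click"),
    ("u2", date(2014, 5, 5), "app_view"),
    ("u2", date(2014, 5, 21), "no_action"),
]

OBS_END = date(2014, 5, 14)   # observation period: everything up to this date
ACT_END = date(2014, 5, 28)   # action period: the following two weeks

features, labels = {}, {}
for user, day, etype in events:
    if day <= OBS_END:
        # Observation period: aggregate attributes into the user's feature view.
        features.setdefault(user, []).append(etype)
    elif day <= ACT_END:
        # Action period: record whether the desired event (here, a click) occurred.
        labels[user] = labels.get(user, 0) or int(etype == "click")

# Each training example pairs the observed view with the later action.
training = [(u, features[u], labels.get(u, 0)) for u in features]
```

The key property is that features are drawn strictly from before the cut-off, so the model never sees information from the period it is asked to predict.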
The data points about each user have to be aggregated across several days and require heavy processing. Pig is the default technology for such large-scale, SQL-like data transformations (Hive is another), and it is what we started with. However, a machine learning pipeline involves several custom transformations (for example, normalizing feature scores) that cannot be expressed in Pig’s native syntax and require user-defined functions (UDFs). This slows down development and adds friction to every modification of the pipeline. More recently, we have begun adopting Spark for these tasks, largely because of the flexibility it provides in mixing map-reduce-style transformations with custom computations on a single node. As we will see later, suitable transformations bring the training data down to a manageable size. Hence, we opted for a single-node model-building system in R, since R is very stable and has a rich set of models and algorithms built in.
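As an example of the kind of custom transformation that needs a UDF in Pig but is a plain function in Python (or a Spark map step), here is a min-max normalization of a feature column. The function is an illustrative sketch, not our pipeline code:

```python
# Min-max normalization: rescale a feature column to the [0, 1] range.
# In a Spark job this would typically run as a per-column transformation.
def normalize(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:                       # degenerate column: all values equal
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```

In Pig, even a transformation this small requires writing, packaging, and registering a Java UDF, which is exactly the development friction described above.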
We cast the segment identification task as a classification problem: given all the data points about a user, will they perform the desired action? We chose the logistic regression model (with a combination of L1 and L2 regularization) for this task, for a number of reasons:
- Scalability: The model scales well to handle large amounts of records and features. Model building and prediction time is minimal. This is essential in a scenario where several such models will be built on a regular basis for multiple segments, and predictions have to be performed on the entire set of users.
- Handling sparse features: In the ad network domain, most of the user data collected is sparse. Each feature alone covers only a small fraction of the population, so feature selection does not work very well. In such a scenario, regularization lets us use a large feature set while preventing the model from overfitting.
- Feature importance: The feature weight is a very intuitive indicator of the importance of the feature. Given the weight, one can immediately reason about the feature’s contribution to the prediction. We found this very useful for exploratory modeling.
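As a minimal sketch of the modeling choice above - not our production code, which relied on R’s packaged implementations - logistic regression with combined L1 and L2 (elastic-net) regularization can be fit with plain gradient descent:

```python
import math

# Minimal elastic-net logistic regression via batch gradient descent.
# Hyperparameter values are illustrative.
def train(X, y, l1=0.01, l2=0.01, lr=0.1, epochs=1000):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xij in enumerate(xi):
                grad[j] += (p - yi) * xij
        for j in range(len(w)):
            # L2 shrinks all weights smoothly; the L1 subgradient pushes
            # weights of uninformative features towards zero.
            reg = l2 * w[j] + l1 * ((w[j] > 0) - (w[j] < 0))
            w[j] -= lr * (grad[j] / len(X) + reg)
    return w

def predict(w, x):
    return 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
```

The learned weights double as feature-importance scores: a large positive weight marks a feature that raises the predicted probability, which is the exploratory-modeling benefit noted above.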
The desired events that drive segment building - a click or a download - are rare; positives are far outnumbered by negatives. In such a scenario, building a model on the entire dataset straight away may not give good results: the model will be heavily biased towards negative events, even though the positive events carry most of the information needed for a useful model. With this in mind, we subsampled negative records before building the model. The amount of subsampling is itself a parameter of interest. We found that equalizing the number of positives and negatives - a fully balanced dataset - gave the best results. It also has the added benefit that the training data becomes small enough for the training algorithm to run on a single powerful machine, greatly simplifying the model-build process.
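The subsampling step itself is simple; a sketch, with record structure and function name chosen here for illustration:

```python
import random

# Downsample negatives to match the positive count, producing the
# fully balanced training set described above.
def balance(records, seed=42):
    positives = [r for r in records if r["label"] == 1]
    negatives = [r for r in records if r["label"] == 0]
    rng = random.Random(seed)   # fixed seed keeps the build reproducible
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + sampled
```

Note that all positives are kept; only the abundant negatives are thrown away, so the information-rich minority class is untouched.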
The volume of data available to train the model, in particular the number of positive events, is an important contributing factor in model quality. We find that some segments are inherently low-volume, and building models for these segments is more challenging. In such a scenario, one option is to combine datasets going back in time. That is, to augment training data by using older examples. But, could we keep increasing data size this way? Not really. As observed earlier, we are building a model of an ever-changing system; the older the data is, the more it will deviate from the current behaviour of the users. After a while, the gains in performance by adding older records will be offset by the staleness of the data. We found that around 4 weeks of data is optimal for model building of low-volume segments.
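The look-back logic can be sketched as below; `load_week` is a hypothetical loader for one week of labelled examples, and the thresholds are illustrative, apart from the 4-week cap reflecting the staleness trade-off described above:

```python
from datetime import date, timedelta

# Pull in up to `max_weeks` of weekly training data, newest first,
# stopping early once enough positive examples have been collected.
def collect_training_weeks(load_week, end, min_positives=1000, max_weeks=4):
    examples = []
    for k in range(max_weeks):
        week_start = end - timedelta(weeks=k + 1)
        examples.extend(load_week(week_start))   # each example: (user, label)
        if sum(label for _, label in examples) >= min_positives:
            break                                # enough positives; avoid staler weeks
    return examples
```

Stopping as soon as the positive count is sufficient means high-volume segments train on fresh data only, while low-volume segments reach back just far enough.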
These are some of the nuances, practical issues, and lessons that came out of building a large-scale user interest prediction system. The models did not work out of the box; they required a lot of fine-tuning and domain-specific optimization before good performance was achieved. In addition, several engineering and architectural enhancements were necessary to productionize the segments effectively.
Thanks to the members of the Data Sciences team involved in this project - Satya Chilukuri, Michael McCarthy, and Varun Modi.