Understanding data patterns is key to growth and sustenance for any business. Outliers in data, also known as data anomalies, can be a real pain to deal with. Whether you are dealing with system logs, business metrics, stock market updates, or auctions in advertising, identifying deviations in data and detecting anomalies is critical. It helps you uncover the underlying root causes and rectify them before they become unmanageable and pose serious threats to your business.
At every auction, segments participate and end up winning or losing. Every win has revenue associated with it. The volumes of the auction measures (i.e., bids, wins, and revenue) are recorded at hourly intervals. In our case, auction data for 15 million segments, represented by six different dimensions, is gathered across two days. Any issue concerning the dimensions (for example, a security threat in a datacenter) could influence the measures. Given the dimensions and measures, we need to identify possible anomalies in the data and then pin them to the dimensions responsible for the outliers. Once the stakeholders are notified, they can investigate further and take appropriate action.
Pre-Processing Time Series Data
For the sake of simplicity, segments belonging to low-revenue advertisers are clubbed together, and segments with too few observations (fewer than 24) are filtered out. The time series data is smoothed using moving averages.
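As a minimal sketch of this filtering-and-smoothing step (the window size, segment names, and data below are illustrative, not the actual pipeline):

```python
def moving_average(values, window=5):
    """Trailing moving average; windows are truncated at the start
    so the output has the same length as the input."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        chunk = values[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Keep only segments with at least 24 hourly observations, then smooth.
segments = {"seg_a": [100, 120, 95, 130] * 12, "seg_b": [80, 90]}
smoothed = {
    name: moving_average(series)
    for name, series in segments.items()
    if len(series) >= 24
}
```

Here `seg_b` is dropped for having too few observations, while `seg_a` is smoothed in place.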
Non-Stationary Time Series Data
Time series data can be either stationary or non-stationary. In simple terms, if the mean and variance of a time series remain the same throughout the timeline, the data is stationary; if the mean and variance change over time, the data is non-stationary. The components of time series data that make it non-stationary are categorized as trend, seasonal, cyclical, and random.
Anomaly detection gets fairly challenging with non-stationary data. One way of dealing with the problem is to convert the non-stationary data into stationary data and then identify any abnormal values. While differencing can remove components like drift, it may not be able to eliminate time dependence from your data.
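For instance, first-order differencing removes a constant drift (the helper below is a sketch with made-up data):

```python
def difference(series, lag=1):
    """First-order differencing: y[t] - y[t-lag]. This removes drift,
    but not every form of time dependence (e.g. changing variance)."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# A series with constant drift becomes a constant after differencing.
drifting = [2 * t for t in range(10)]
print(difference(drifting))  # → [2, 2, 2, 2, 2, 2, 2, 2, 2]
```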
A more traditional way to deal with non-stationary data is to statistically decompose the time series into trend, seasonal, and residual components by applying Seasonal-Trend decomposition using Loess (STL). Anomaly detection is then performed on the residual component of the time series.
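STL itself is best taken from a statistics library (e.g. `STL` in statsmodels). As a rough sketch of the decomposition idea only, a classical additive decomposition separates the same three components; the Loess smoothing that gives STL its robustness is omitted here:

```python
def decompose(series, period):
    """Classical additive decomposition into trend, seasonal, and
    residual components (a simplified stand-in for STL):
    - trend: centered moving average,
    - seasonal: mean of the detrended values at each phase,
    - residual: observed - trend - seasonal."""
    n = len(series)
    half = period // 2
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    detrended = [x - t for x, t in zip(series, trend)]
    phase_means = [
        sum(detrended[p::period]) / len(detrended[p::period])
        for p in range(period)
    ]
    seasonal = [phase_means[i % period] for i in range(n)]
    residual = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual
```

For hourly data with a daily cycle, as in our case, `period=24`; the anomaly detection step then runs on `residual`.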
Spectral Residual
The Spectral Residual (SR) method is an anomaly detection algorithm devised in an unsupervised setting to identify saliency, i.e., something that stands out in an image or scene. It has since been applied in the time series domain to identify anomalies in data.
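A minimal sketch of the SR transform (following the published algorithm at a high level; the filter width `q` and the small floor added before taking the log are illustrative choices):

```python
import numpy as np

def spectral_residual_saliency(x, q=3):
    """Saliency map per the Spectral Residual method: subtract a
    locally averaged log-amplitude spectrum from the log-amplitude
    spectrum, keep the original phase, and transform back. Points
    that stand out in the series get large saliency values."""
    x = np.asarray(x, dtype=float)
    freq = np.fft.fft(x)
    amp = np.abs(freq)
    log_amp = np.log(amp + 1e-8)
    # Average-filter the log-amplitude spectrum with a box kernel.
    kernel = np.ones(q) / q
    avg_log_amp = np.convolve(log_amp, kernel, mode="same")
    spectral_residual = log_amp - avg_log_amp
    # Recombine the residual amplitude with the original phase.
    phase = np.angle(freq)
    saliency = np.abs(np.fft.ifft(np.exp(spectral_residual + 1j * phase)))
    return saliency
```

A sudden spike in an otherwise flat series produces the largest saliency value at the spike's position, which is then scored against a threshold.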
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are another saliency detection method, used in a supervised setting when sufficient labelled data is available.
Microsoft devised a novel approach in their Azure Anomaly Detector by combining the SR and CNN models: the saliency scores auto-generated by the SR model are used to train the CNN model to determine the anomaly threshold.
With only two days of data available, it is fairly difficult to learn such thresholds. So we resorted to a more traditional way of identifying the outliers.
The Three-Sigma Limits
As per the three-sigma rule, if a data point is more than three standard deviations away from the mean, it can be flagged as anomalous.
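A sketch of the rule applied to a residual series (the numbers are made up; in practice the mean and deviation would come from the residual component of the decomposition):

```python
from statistics import mean, stdev

def three_sigma_outliers(values):
    """Return indices of points more than three standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > 3 * sigma]

# 24 hourly residuals with one obvious spike at index 12.
residuals = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1,
             0.2, 0.1, 50.0, 0.3, 0.0, -0.1, 0.1, 0.2, -0.3, 0.0,
             0.1, -0.2, 0.3, -0.2]
print(three_sigma_outliers(residuals))  # → [12]
```

Note that the rule needs enough points to be meaningful: with very short series, a single extreme value inflates the standard deviation so much that nothing exceeds three sigma.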
Ideally, the SR algorithm should be able to identify anomalous points even when run on non-stationary data, but our results show that applying the SR algorithm to the residual component identifies anomalous points much more reliably.
Root Cause Analysis
Root Cause Analysis (RCA) is the second component of the problem: we need to identify the dimensions responsible for the anomalous data. With six dimensions, there can be at most 2^6 = 64 combinations of dimensions that could be responsible, and evaluating that many combinations is tedious and time consuming. One could cluster the anomalous data points instead, but most clustering algorithms lack a human-understandable description of the clusters.
Formal Concept Analysis
Formal Concept Analysis (FCA) is a mathematical model that helps us identify concepts in data. A domain comprises objects and attributes. A concept is a cluster of objects in the domain sharing common attributes, and it is represented by an object set and an attribute set.
Every concept must satisfy the closure property. A concept is closed if no other object in the domain shares all the attributes in the concept's attribute set, and no other attribute is shared by all the objects in the concept's object set.
The anomalous segments at a given timestamp can be treated as objects and their corresponding dimensions can be treated as attributes.
Each dimension takes multiple categorical values and is transformed into multiple binary variables. For example, the dimension Country can take several values; each country becomes a binary variable, with 1 indicating its presence and 0 its absence.
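For example (the segment records and dimension values here are hypothetical):

```python
def one_hot(rows, dimension):
    """Expand a categorical dimension into binary indicator columns,
    one per distinct value observed across the rows."""
    values = sorted({row[dimension] for row in rows})
    for row in rows:
        for v in values:
            row[f"{dimension}={v}"] = 1 if row[dimension] == v else 0
    return rows

segments = [{"Country": "US"}, {"Country": "IN"}, {"Country": "US"}]
one_hot(segments, "Country")
print(segments[0])  # → {'Country': 'US', 'Country=IN': 0, 'Country=US': 1}
```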
The concepts from the FCA lattice indicate the clusters along with the attributes that describe them.
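On a small context, the closed concepts can be enumerated directly. The brute-force sketch below is exponential in the number of objects, so it is illustrative only (segment names and dimension values are made up, and concepts with an empty extent are skipped):

```python
from itertools import combinations

def closed_concepts(context):
    """Enumerate the formal concepts of a small context.
    context: dict mapping object -> set of binary attributes.
    For each group of objects, take their common attributes (intent)
    and then the set of all objects having that intent (extent);
    the resulting (extent, intent) pairs are closed by construction."""
    objs = list(context)
    concepts = set()
    for r in range(1, len(objs) + 1):
        for group in combinations(objs, r):
            intent = set.intersection(*(context[o] for o in group))
            extent = frozenset(o for o in objs if intent <= context[o])
            concepts.add((extent, frozenset(intent)))
    return concepts

# Hypothetical anomalous segments described by binary dimension values.
context = {
    "seg1": {"Country=US", "DC=east"},
    "seg2": {"Country=US", "DC=east"},
    "seg3": {"Country=IN", "DC=east"},
}
for extent, intent in sorted(closed_concepts(context),
                             key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
# → ['seg1', 'seg2', 'seg3'] ['DC=east']
#   ['seg1', 'seg2'] ['Country=US', 'DC=east']
#   ['seg3'] ['Country=IN', 'DC=east']
```

Each printed concept is a cluster of anomalous segments together with the dimension values that describe it, e.g. two anomalous segments are jointly explained by Country=US in the east datacenter.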
Clustering of anomalies in other third-party systems is based on timestamps, but there is no clear description of each cluster. Usually, such systems only allow filtering anomalies by dimension, where the number of combinations to try is too high.