This article discusses how we can use autoencoder anomaly detection for automated data quality monitoring.
Companies strive to hoard and analyse massive amounts of data in order to derive business value from it. Regulations that enforce “data hygiene”, such as BCBS 239 and GDPR, add to the pressure to keep that data in good shape. However, we are sometimes mistaken about the data that we think we have. A big database with many tables, each with a lot of insightful attributes – great. But the reality can be different. There can be missing values, changing levels of categorical variables, or simply incorrect data that goes unnoticed. These are just some of the data quality issues that arise in big databases. They can lead to bad business decisions and serious malfunctions of business processes.
Classic data quality monitoring
Almost every company that deals with data has some data quality (DQ) monitoring in place. Some even have whole teams to deal with this. This is often very costly. Furthermore, most of the DQ checks are hard-coded and rule-based. They raise red flags when broken. Such rules are often business-critical. For example, we cannot have a missing client ID, or a variable “risk profile” with the wrong value. As data volume grows, writing a rule for every attribute becomes impractical, not to mention the complexity of hard-coded multivariate checks.
Wouldn’t it be great to have automated DQ checks that we don’t even need to explicitly program? Autoencoder anomaly detection enters the stage.
Autoencoder anomaly detection – solve data quality issues without specifying them
Autoencoders are neural networks that model their own input. In other words, autoencoders try to learn the identity function, but through a bottleneck. They consist of two parts. First, the encoder reduces the dimensionality of the input data. Then the decoder tries to reconstruct the input from these reduced dimensions. Autoencoders are relevant for DQ because they can model whole data tables. We map all relevant fields of a data table to both the input and the output layer of the network. The hidden layers in between learn the regular behaviour of the data. Of course, we need “good” data to train the model.
Imagine something unusual happens to the data that we have not anticipated. Even if the data passes all classic hard-coded rules, an autoencoder trained on regular data will reconstruct it poorly. We will observe a large reconstruction error – a significant difference between predicted and actual values – and detect a data anomaly. The data doesn’t behave in the way that the autoencoder learned.
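To make the encode, decode and compare idea concrete, here is a minimal sketch. It deliberately swaps the neural network for a linear autoencoder computed with PCA (via NumPy's SVD), which is the simplest model with the same structure. The synthetic data, the bottleneck size and the helper names `fit_linear_autoencoder`, `reconstruction_error` and `make_batch` are illustrative assumptions, not the setup used in our experiments.

```python
import numpy as np

def fit_linear_autoencoder(good_data, n_latent):
    """Fit a linear autoencoder (equivalent to PCA) on 'good' data."""
    mean = good_data.mean(axis=0)
    std = good_data.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    standardized = (good_data - mean) / std
    # Rows of vt are principal directions; the first n_latent form the bottleneck.
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    return mean, std, vt[:n_latent]

def reconstruction_error(model, data):
    """Squared reconstruction error per row, in standardized units."""
    mean, std, components = model
    standardized = (data - mean) / std
    code = standardized @ components.T   # encoder: reduce dimensionality
    reconstructed = code @ components    # decoder: map back to all columns
    return ((standardized - reconstructed) ** 2).sum(axis=1)

def make_batch(rng, n_rows):
    """Synthetic table (assumption): three columns driven by one hidden factor."""
    factor = rng.normal(size=n_rows)
    noise = 0.1 * rng.normal(size=(n_rows, 3))
    return np.column_stack([factor, 2.0 * factor, -factor]) + noise

rng = np.random.default_rng(0)
model = fit_linear_autoencoder(make_batch(rng, 1000), n_latent=1)

regular_batch = make_batch(rng, 200)
unusual_batch = make_batch(rng, 200)
unusual_batch[:, 2] *= -1.0  # values stay in their usual range, the relationship flips

regular_error = reconstruction_error(model, regular_batch).mean()
unusual_error = reconstruction_error(model, unusual_batch).mean()
print(f"mean error, regular batch: {regular_error:.3f}")
print(f"mean error, unusual batch: {unusual_error:.3f}")
```

A range check on the third column would pass for both batches, while the mean reconstruction error of the unusual batch is far larger. A real autoencoder adds non-linear layers, but the monitoring logic stays the same.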
This is very helpful whenever something novel happens in the data set. For instance, there are suddenly many more missing values (NAs) than usual, or the levels of categorical variables have changed or shifted. An autoencoder allows us to skip coding these rules. Furthermore, it helps us identify the source of the issues. These properties of autoencoders allow us to detect anomalies early and reduce the cost of manual investigations.
Detect anomalies in multiple dimensions
The most exciting aspect of the application of autoencoders is finding anomalies which rule-based DQ cannot identify. This includes issues that a human would not find without purposeful investigation. For example, there may be errors in data collection, or a problem in the system producing the data.
Autoencoders are sensitive to changes in the joint distribution of the data. Standard DQ checks typically investigate only univariate distributions. For instance, they check whether a variable is within its usual range or if the mean has shifted. Autoencoders, in contrast, also learn the nuances of the multivariate distribution. They can detect changes even when all univariate, marginal distributions are unchanged.
This means that autoencoders find changes in how variables behave in relation to each other. For example, the weight and height of people are usually positively correlated. Imagine that for a new load of data they have the same univariate distributions, but are suddenly negatively correlated. An autoencoder is typically a densely connected neural network, so each input variable influences the reconstruction of every other variable. As a result, the autoencoder will produce a large reconstruction error, thus capturing this irregularity.
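The height and weight example can be reproduced with a few lines of standard-library Python. This is a sketch on made-up numbers: the distribution parameters and the helper `pearson` are assumptions for illustration. It re-pairs the existing weights with the heights so that the set of weight values is unchanged, yet the correlation flips sign.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
heights = [rng.gauss(170, 10) for _ in range(500)]                    # cm, synthetic
weights = [70 + 0.9 * (h - 170) + rng.gauss(0, 5) for h in heights]   # kg, synthetic

# "New load": the shortest person receives the heaviest weight, and so on.
by_height = sorted(range(len(heights)), key=lambda i: heights[i])
heaviest_first = sorted(weights, reverse=True)
new_weights = [0.0] * len(weights)
for rank, person in enumerate(by_height):
    new_weights[person] = heaviest_first[rank]

corr_before = pearson(heights, weights)
corr_after = pearson(heights, new_weights)
same_marginal = sorted(weights) == sorted(new_weights)
print(f"correlation before: {corr_before:.2f}, after: {corr_after:.2f}")
print(f"weight values unchanged: {same_marginal}")
```

Any check that looks at weight alone (range, mean, quantiles) sees identical values in both loads. Only a check that looks at height and weight together can notice the change.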
Error detection and error isolation
We can apply autoencoders to various use cases. The goal is always to reconstruct the input as accurately as possible. Data quality presents us with a similar task. We want to learn the “normal” behaviour and then detect deviations from it. However, not all deviations are errors. Some are simply the result of the natural changes in the data. Indeed, this can also be something to capitalize on. Business value results from the recognition of changes and new trends. We will look into this topic in a further blog post. In any case, we are on alert for big reconstruction errors, to further investigate what causes them.
A DQ investigation can focus on different levels of granularity. We can look at individual data points or the data as a whole. Here we apply autoencoders to a whole batch of data. Slightly different methods apply to individual data points.
We use autoencoders and one-class classification on top of that to model the distribution of “good” data. With this we can detect whether a new load of data belongs to the good distribution (Error Detection). See the left chart in Figure 2 or read this article.
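As a rough illustration of the detection step, the sketch below replaces the one-class classifier with the simplest possible rule: a quantile threshold learned from the reconstruction errors of “good” data. The error values are simulated, and the names `fit_threshold`, `share_above`, `batch_is_anomalous` and the 5% tolerance are assumptions chosen for the example.

```python
import random

def fit_threshold(reference_errors, quantile=0.99):
    """Error level that only about (1 - quantile) of good data points exceed."""
    ordered = sorted(reference_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def share_above(errors, threshold):
    """Fraction of data points whose error exceeds the threshold."""
    return sum(e > threshold for e in errors) / len(errors)

def batch_is_anomalous(errors, threshold, max_share=0.05):
    """Flag a batch when far more points than expected exceed the threshold."""
    return share_above(errors, threshold) > max_share

rng = random.Random(1)
# Simulated stand-ins for per-row reconstruction errors (not real results).
reference_errors = [rng.expovariate(1.0) for _ in range(2000)]
good_batch = [rng.expovariate(1.0) for _ in range(500)]
bad_batch = [rng.expovariate(0.25) for _ in range(500)]  # errors are larger on average

threshold = fit_threshold(reference_errors)
good_flag = batch_is_anomalous(good_batch, threshold)
bad_flag = batch_is_anomalous(bad_batch, threshold)
print(f"threshold: {threshold:.2f}")
print(f"good batch flagged: {good_flag}, bad batch flagged: {bad_flag}")
```

A proper one-class model can capture more than a single cut-off, but the workflow is the same: learn what errors look like on good data, then test each new load against it.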
With a batch of data points we also have a batch of error measurements. We can examine the distribution of these errors. We can aggregate reconstruction errors of all variables (L2-norm, mean) or consider each separately. The latter is usually more precise as it provides information about where the anomaly occurred (Error Isolation). See the right chart in Figure 2.
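The difference between aggregated and per-variable errors is easy to see on a toy example. The values and the field names below are made up for illustration, and the helper names are our own.

```python
import math

columns = ["age", "income", "risk_profile"]  # hypothetical field names

# Toy values in scaled units: actual rows and their reconstructions.
actual = [
    [0.20, 0.50, 0.10],
    [0.40, 0.30, 0.90],
    [0.60, 0.70, 0.20],
    [0.80, 0.10, 0.80],
]
reconstructed = [
    [0.22, 0.48, 0.55],
    [0.37, 0.33, 0.40],
    [0.61, 0.69, 0.70],
    [0.78, 0.12, 0.35],
]

def l2_error_per_row(actual, reconstructed):
    """Aggregate over variables: one L2-norm error per data point."""
    return [
        math.sqrt(sum((a - r) ** 2 for a, r in zip(row_a, row_r)))
        for row_a, row_r in zip(actual, reconstructed)
    ]

def mean_squared_error_per_column(actual, reconstructed):
    """Keep variables separate: one mean squared error per column."""
    n_rows, n_cols = len(actual), len(actual[0])
    return [
        sum((actual[i][j] - reconstructed[i][j]) ** 2 for i in range(n_rows)) / n_rows
        for j in range(n_cols)
    ]

row_errors = l2_error_per_row(actual, reconstructed)
column_errors = mean_squared_error_per_column(actual, reconstructed)
worst_column = columns[column_errors.index(max(column_errors))]
print("error per row:", [round(e, 3) for e in row_errors])
print("error per column:", dict(zip(columns, (round(e, 4) for e in column_errors))))
print("largest error in column:", worst_column)
```

The per-row errors say that every row is reconstructed badly. The per-column errors say where to look.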
Univariate vs. multivariate data issues: an illustration
Autoencoder anomaly detection is most useful when monitoring joint data distributions. Figure 3 summarizes the results of permuting one of the variables. This leaves the univariate distribution unchanged. However, the multivariate distribution is affected. This is akin to the height and weight example above. If we randomly permute the weights, the relationship between height and weight is corrupted.
In our example, we trained the autoencoder on 4/5 of the data (training set). The graphs in Figure 3 show the remaining 1/5 (test set). We split the test data into two sets of equal size. The first set remained untouched, and we permuted the second set. The left part of Figure 3 shows the univariate distributions of the “good” and “bad” halves. The right part shows their reconstruction errors. It is clear that the univariate distributions provide no insight. However, the autoencoder reconstruction shows a marked discrepancy between the two sets.
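The procedure can be imitated on synthetic data. The sketch below follows the same steps (train on 4/5, split the rest in two, permute one variable in the second half). It uses a made-up four-column table and, as a stand-in for the neural network, a linear autoencoder computed with PCA. The column count, the bottleneck size and all names are assumptions, and the printed numbers are not the results shown in Figure 3.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in table: 4 columns driven by 2 hidden factors plus noise.
n_rows = 5000
factors = rng.normal(size=(n_rows, 2))
mixing = np.array([[1.0, 0.5, -0.8, 0.3],
                   [0.2, -1.0, 0.6, 0.9]])
table = factors @ mixing + 0.1 * rng.normal(size=(n_rows, 4))

# 4/5 for training, 1/5 for testing; the test set is split into two halves.
n_train = n_rows * 4 // 5
train_set, test_set = table[:n_train], table[n_train:]
half = len(test_set) // 2
good_half = test_set[:half].copy()
bad_half = test_set[half:].copy()

# Permute one variable in the "bad" half. Its values, and therefore its
# univariate distribution, are unchanged; only the row assignment changes.
permuted_col = 2
bad_half[:, permuted_col] = rng.permutation(bad_half[:, permuted_col])

# Linear autoencoder (PCA) fitted on the training set only.
mean, std = train_set.mean(axis=0), train_set.std(axis=0)
_, _, vt = np.linalg.svd((train_set - mean) / std, full_matrices=False)
components = vt[:2]  # bottleneck of size 2

def per_column_error(batch):
    """Mean squared reconstruction error per column, in standardized units."""
    z = (batch - mean) / std
    z_hat = z @ components.T @ components  # encode, then decode
    return ((z - z_hat) ** 2).mean(axis=0)

good_error = per_column_error(good_half)
bad_error = per_column_error(bad_half)
print("column mean, good vs bad half:",
      good_half[:, permuted_col].mean(), bad_half[:, permuted_col].mean())
print("reconstruction error per column, good half:", good_error.round(3))
print("reconstruction error per column, bad half: ", bad_error.round(3))
```

Because the network is dense, the error is not confined to the permuted column: the other columns are also reconstructed worse, since the permuted values feed into their reconstruction.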
Autoencoders do an impressive job of detecting data anomalies. However, not everything about using autoencoder anomaly detection for DQ inspection is straightforward. The following questions need to be addressed:
1) How should we treat missing values?
There are several ways of dealing with missing values (NAs) in neural networks. The most common approach is to impute missing values with the mean or median. This is often not helpful. It may be that the missing data is not random but displays a pattern. For example, NA in a “car age” variable might simply mean that a person does not have a car. In this case imputing the mean value is misleading.
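One common alternative to plain imputation is to add a missing-value indicator next to each imputed field, so that the network can learn the pattern of missingness itself. The sketch below shows this encoding for a toy “car age” column. It is one possible option rather than a recommendation, and the function name `encode_with_indicator` is our own.

```python
from statistics import median

def encode_with_indicator(values):
    """Turn one column with gaps (None) into (filled value, is-missing flag) pairs."""
    observed = [v for v in values if v is not None]
    fill = float(median(observed)) if observed else 0.0
    return [
        (fill, 1.0) if v is None else (float(v), 0.0)
        for v in values
    ]

car_age = [3, None, 12, 7, None]  # None: the person may simply not own a car
encoded = encode_with_indicator(car_age)
for original, (value, is_missing) in zip(car_age, encoded):
    print(original, "->", value, is_missing)
```

The filled value keeps the input numerically well-behaved, while the flag preserves the information that the value was absent. An unusual rise in missing values then changes the flag column, which the autoencoder can pick up.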
2) How should we model complete databases?
Modelling each table separately sounds reasonable, especially if there are many tables with many fields. Merging them together would create one monster table. But in practice fields in different tables are related, and modelling the tables separately overlooks this information. Here, some expert knowledge helps to improve the process. Also, frameworks like TensorFlow allow a neural network to take several inputs (tables). This can be used to model multi-table relationships. Alternatively, we can use transfer learning to apply what was learned in the model of one table to the model of another table.
3) What is ground truth?
Sometimes we don’t know if the data currently in the database is “healthy”. As a consequence, we are unsure if we can take it as a benchmark for new data.
4) What metrics should be monitored to detect anomalous data?
Several examples were mentioned above. There are also other evaluation methods that can work well for specific use cases.
Summary and outlook
Autoencoder anomaly detection can detect many kinds of DQ issues, including multivariate ones that rule-based checks miss. It therefore complements existing DQ checks well. Above all, autoencoders can significantly speed up the identification of errors in a database.
We at InCube are working on how to:
- Scale up autoencoder anomaly detection to whole database(s)
- Recognize multivariate trends and draw business value from autoencoders
- Compare DQ systems using autoencoder anomaly detection with more traditional rule-based DQ systems (A/B testing)
- Develop hybrid models that utilize rule-based systems and autoencoders, e.g. by using autoencoders to automatically set the rules
- Exploit the cost-saving potential of autoencoder-based DQ systems
If you want to learn more about autoencoder anomaly detection for data quality, contact us at firstname.lastname@example.org.
Note: The experiments described in this article took random samples from one database table.