In the eternal pursuit of optimized business processes, companies are discovering that they can leverage machine learning (ML) and AI in many areas of their business. One such area, still in its conception phase in the financial industry, is automated data quality (DQ) monitoring and error detection. Banks and insurance firms deal with piles of data that often come from unreliable and inconsistent sources. Despite the need to address this issue, they are reluctant to dive into something new or unconventional for their industry, such as using ML to detect data errors. However, ML models have been used for this purpose in other industries for some time now, and they have a good track record of detecting unusual data points (anomalies). We leverage this knowledge to build ML-based data quality monitoring systems for detecting anomalous points in banking data.
Battle of the ages: ML models vs. hard-coded rules
A skeptical person might ask: why use ML at all, when we can simply hard-code the rules for how the data must behave and get a guaranteed 100% accuracy? On the one hand, even simple hard-coded rules can contain errors; such rules may fail to detect incorrect data and cast doubt on the trustworthiness of hand-crafted rules in general. On the other hand, it can be shown that ML can compete even with correctly implemented rules.
Hard-coded rules also suffer from another disadvantage: it takes a lot of time for Subject Matter Experts (SMEs) to define the rules, and then for developers to code them. This effort scales linearly with the number of data fields that need to be DQ-controlled: for every field, the process starts from zero. With ML, we found that models applied to similar fields, or to fields of the same type, can be reused: the same architecture and setup are kept, and the model just needs to be retrained. This means that ML has the advantage of diminishing costs, i.e. the process of finding errors gets cheaper with each new data field.
Specialty of ML
Hard-coded rules are useful if they are simple, although we have seen in our client projects that even these can be successfully reproduced by ML models. Some DQ issues, however, are so non-obvious that nobody remembers them, and some are nearly impossible to formulate, so that even when people do know about them, they cannot be hard-coded. This is where ML models excel: they can discover anomalies in the feature space no matter how complex those anomalies are. This works especially well with more complex data types such as free-text fields, for which it is very tough to formulate clear, all-encompassing rules that cover every possible case. In our projects, ML models discovered errors that had not previously been defined by the SQL rules, which was even more exciting than the fact that the ML models could reproduce the SQL rules themselves.
Learnings from Unsupervised Learning
From data to features
In unsupervised learning everything is completely data-driven, which means one must take special care to make the data suitable to work with. This includes tweaking the data, up-sampling or re-sampling, various kinds of pre-processing and preparation, and feature engineering in general. As is common in data science, we found that most of the time has to be spent on preparing the data. With the right features, modelling becomes the easy part of the process; without them, the best model in the world would be useless.
Pre-processing differs greatly between data types. One challenge that always comes up is handling missing values: are they replaced with zeros, imputed, or left out? Like the other pre-processing steps, the handling of missing values depends on the type of feature. Another question is encoding: how do we represent text data, categorical data, dates? These data types have to be pre-processed correctly for the model to be able to learn useful information from them. Only after feature engineering comes the next step: choosing the appropriate model family, architecture and parameters, and classifying data into “good” and “bad”.
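As a minimal sketch of what such a pipeline might look like, the snippet below (using scikit-learn) imputes and scales numeric fields, encodes categorical fields with missing values treated as their own category, and turns dates into numbers. All column names are illustrative and not taken from a real banking schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Illustrative column names; a real banking extract will look different.
numeric_cols = ["amount", "balance"]
categorical_cols = ["currency", "booking_type"]
date_cols = ["value_date"]

def dates_to_numeric(df):
    # Encode dates as days since a fixed reference date, so they become plain numbers.
    return df.apply(lambda col: (pd.to_datetime(col) - pd.Timestamp("2000-01-01")).dt.days)

preprocessor = ColumnTransformer([
    # Numeric fields: impute missing values with the median, then standardise.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical fields: treat missing values as their own category, then one-hot encode.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="MISSING")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
    # Date fields: convert to days, then impute and scale like any other numeric field.
    ("date", Pipeline([("to_num", FunctionTransformer(dates_to_numeric)),
                       ("impute", SimpleImputer(strategy="median")),
                       ("scale", StandardScaler())]), date_cols),
])

# X = preprocessor.fit_transform(raw_df)  # raw_df holds the unlabelled banking data
```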
From features to information
The main reason DQ with ML looks challenging at first sight is that there is no ground truth: we don’t know in advance which points are errors, or which kinds of errors we are looking for. This is the essence of unsupervised learning – one tries to group data based on some common characteristics. It implies that the points we discover as anomalies are only candidates for errors; whether they are actual errors needs to be confirmed by an SME. If they are errors, they are corrected or taken out of the data set, and the model is retrained. This cycle of incorporating SME feedback brings the modelling into the realm of semi-supervised learning, where only some of the points are given labels and we attempt to infer labels for the rest.
To make the process of incorporating SME feedback as efficient as possible, we rely on the assumption that most of the data is correct, so that the “good” data points form the majority and the “bad” points can be isolated as outliers or anomalies. This might sound like a reasonable assumption, but it is not always fulfilled. In any case, since we don’t have labels in advance, training happens on both good and bad data, with the proportion of bad data unknown at training time. We use the reconstruction error of autoencoders or the score of isolation forests to decide whether a data point is good or bad; both metrics isolate points that are in some way very different from the rest. If “the rest” is bad, i.e. if we have many bad data points, then the good points will be isolated as the outliers. This can happen – and in these extreme cases the effort required from the SMEs is multiplied to the point where applying ML models to such data becomes inefficient.
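As a sketch of this scoring step, and assuming the pre-processed feature matrix X from the earlier snippet, an isolation forest can be fitted on all points at once (good and bad alike) and its score used to rank candidates for SME review; the number of candidates handed over per iteration is an arbitrary choice here:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Fit on the full, unlabelled data: we rely on errors being a small minority.
iso = IsolationForest(n_estimators=200, random_state=42)
iso.fit(X)

# score_samples is high for "normal" points, so we negate it: larger values
# then mean "more anomalous", which makes ranking candidates straightforward.
anomaly_score = -iso.score_samples(X)

# Hand the highest-scoring candidates to the SMEs for confirmation first.
candidate_idx = np.argsort(anomaly_score)[::-1][:50]
```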
To make the process as independent of human input as possible, the decision about the threshold (where to place the boundary between good and bad points) can be automated. One possibility is to estimate the density of the scores and look for a minimum in that density, which can be understood as a “gap” between the two classes of data points separated by the unsupervised model.
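A rough sketch of this idea: estimate the density of the anomaly scores with a kernel density estimate and place the threshold at the deepest local minimum between the two score clusters. The bandwidth and grid size below are illustrative and would need tuning to the actual score distribution:

```python
import numpy as np
from scipy.signal import argrelmin
from sklearn.neighbors import KernelDensity

def density_threshold(scores, bandwidth=0.05):
    """Place the good/bad boundary at a minimum of the score density, if one exists."""
    grid = np.linspace(scores.min(), scores.max(), 500)[:, None]
    kde = KernelDensity(bandwidth=bandwidth).fit(scores[:, None])
    density = np.exp(kde.score_samples(grid))

    # Local minima of the density are the "gaps" between clusters of scores.
    minima = argrelmin(density)[0]
    if len(minima) == 0:
        return None  # no gap found: see the stopping criterion below
    # Take the deepest gap as the boundary between good and bad points.
    return float(grid[minima[np.argmin(density[minima])], 0])

# threshold = density_threshold(anomaly_score)
# flagged = anomaly_score > threshold  # candidate errors, if a threshold was found
```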
The cycle of training models, calculating scores for each point, separating the “majority” points from the anomalies, getting feedback from SMEs and then retraining needs to end at some point. When do we say that we have found all the errors, or that we have had enough iterations? Normally we can say that we are done either when there is no clear distinction between the groups of points detected by the model (no minimum in the density of the scores), or when the two groups are of similar size. This signals that we have gone a step too far, and it is time to stop the iterations.
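These two conditions are straightforward to check automatically. A sketch, reusing the density_threshold helper from above; the 40% cut-off for “similar size” is an illustrative choice, not a rule from our projects:

```python
def should_stop(anomaly_score, threshold, balance_limit=0.4):
    """Stop iterating when no clear gap exists or the two groups are of similar size."""
    if threshold is None:
        return True  # no minimum in the score density: no clear distinction left
    anomalous_fraction = (anomaly_score > threshold).mean()
    # If the "anomalous" group approaches half the data, the split no longer looks
    # like majority-good vs. minority-bad, and the iterations should end.
    return anomalous_fraction > balance_limit

# if should_stop(anomaly_score, threshold): end the SME feedback loop here
```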
Error automatically detected – now what?
Since we are dealing with unlabelled data, we are never sure we have covered every type of error. One area with room for improvement is modelling multivariate relationships with more complex models. Instead of relatively simple models that only take a handful of features as input, we can construct composite network architectures that handle more input variables and can thus model more complicated relationships.
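As one possible direction, the sketch below shows a small autoencoder (in PyTorch; layer sizes and the bottleneck width are illustrative) that takes the full pre-processed feature vector as input, so the reconstruction error reflects relationships between many fields at once rather than a single field in isolation:

```python
import torch
import torch.nn as nn

class TabularAutoencoder(nn.Module):
    """Compresses many features into a small bottleneck and reconstructs them;
    a large reconstruction error marks a row as a candidate anomaly."""

    def __init__(self, n_features, bottleneck=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# After training, the per-row reconstruction error can replace the isolation
# forest score in the thresholding step above:
# x = torch.tensor(X_dense, dtype=torch.float32)   # X_dense: dense feature matrix
# error = ((model(x) - x) ** 2).mean(dim=1)
```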
Once errors have been discovered, the question arises: how do we fix them? It has been shown that ML models can help automate and improve the detection of errors, but that is only the first step. At present, detected errors still have to be processed manually. Depending on the number of errors, this is a laborious and costly process, with potential for yet more optimization and automation. Data remediation, i.e. correcting the errors, consists of many repetitive sub-processes, which can be automated with Robotic Process Automation (RPA). This would be an ideal next step to further optimize error detection and correction. We at InCube are developing solutions that go all the way from detecting errors to correcting them by employing ML and RPA. We are convinced that these two combined can greatly enhance current processes, making businesses more efficient and their data cleaner.