Photo by NASA on Unsplash

Why we want to save the world from bad data

(…and how we try to do it)

With several decades of combined hands-on experience in data science and data engineering, our founders have spent endless hours on the Sisyphean task of checking, cleaning, fixing, and validating data before being able to do anything remotely interesting with it. And they’re not alone: it’s hard to find a company anywhere these days that doesn’t strive to be more data-driven — with the implicit assumption that they want to be driven by good quality data.

They decided to do something about it. And now, with years of R&D and successful customer implementations under our belt, we are ready to show the world how we’ll blast through the data quality bottleneck!

For context: in the US alone, cleaning bad data was already costing an estimated USD 3+ trillion per year in 2016 — a huge number that doesn’t even reflect the bad decision-making, delays, and opportunity cost of not putting good data to use. As an example, the value of AI/ML applications in 2030 is estimated at north of USD 15 trillion. Those algorithms are, by definition, subject to “garbage in, garbage out” logic, so a key assumption behind realisation of such numbers is good quality data. And AI/ML is far from the only useful application of good quality data.

In short: good data is fundamental to a lot of current and future value creation.

So you might say it’s justified that data professionals — like data scientists, data engineers, and data analysts — spend on average 80% of their time fixing data quality issues. But these are highly qualified, sought-after, scarce individuals who are tasked with mundane cleaning jobs when they could be doing much more productive and value-adding work! (The vast majority of data professionals also think this is the most boring part of their job.) Data engineers are charged with ensuring that data can be trusted: that it’s delivered consistently, on time, with the expected quality, and so on. Although their role is inarguably critical, they are asked to build a figurative house using only a hammer, when what they urgently need is a full toolbox.

That is fundamentally wrong. So we are going to flip those numbers and put the data engineer in the front seat of data quality management.

Here is how:

We automate real-time validation, enabling proactive avoidance (or immediate identification and rectification) of data quality issues

As data professionals, we have countless experiences of spending crazy amounts of time and energy manually sifting through datasets, looking for inconsistencies, incompleteness, inaccuracies, biases, skewness, shifts in the data, or invalid entries, and amending them — often on a case-by-case basis. We’d much rather spend that time on more interesting parts of the machine learning pipeline, such as tweaking algorithms, training and deploying models, ensuring that data streams and models are robust, and getting the business side on board. By automating monitoring and validation, we relieve data engineers of time-consuming tasks generally perceived as dull — while speeding up the validation process by orders of magnitude.
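To make that contrast concrete, here is a minimal Python sketch (not Validio’s actual API) of the kind of checks such automation runs for you; the column names and rules are invented purely for illustration.

```python
import pandas as pd

# Hypothetical rules for an orders batch; in practice checks would be
# configured per dataset rather than hard-coded like this.
def validate_batch(df: pd.DataFrame) -> dict:
    """Run a few typical data quality checks and return a summary."""
    return {
        "row_count": len(df),
        "null_order_ids": int(df["order_id"].isna().sum()),
        "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
        "negative_amounts": int((df["amount"] < 0).sum()),
        "unknown_currencies": int((~df["currency"].isin({"USD", "EUR", "SEK"})).sum()),
    }

batch = pd.DataFrame({
    "order_id": [1, 2, 2, None],
    "amount": [10.0, -5.0, 7.5, 3.0],
    "currency": ["USD", "EUR", "XXX", "SEK"],
})
print(validate_batch(batch))
# {'row_count': 4, 'null_order_ids': 1, 'duplicate_order_ids': 1,
#  'negative_amounts': 1, 'unknown_currencies': 1}
```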

Another issue we have often faced when building and deploying machine learning models is finding optimal retraining intervals. Most of the time, we’d retrain our models either at a more or less arbitrarily defined frequency (e.g. daily, weekly, or monthly) or when we got some indication that the model was underperforming. The former is costly, since the frequency is normally set based on the average data relevance lifetime plus a buffer, and in the latter case we are, by definition, addressing the problem too late. Real-time validation, on the other hand, lets us recognise shifts in data streams or batches, e.g. as they start to diverge from the training data, and act on them immediately.
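As a rough sketch of what such a drift signal can look like, assuming a simple distributional test is an acceptable proxy (our illustration, not a description of Validio’s internals), consider comparing a live batch against the training data:

```python
import numpy as np
from scipy.stats import ks_2samp

# Reference: the feature distribution the model was trained on.
rng = np.random.default_rng(seed=0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)

def has_drifted(live_values: np.ndarray, reference: np.ndarray,
                p_threshold: float = 0.01) -> bool:
    """Flag drift when a two-sample KS test rejects 'same distribution'."""
    statistic, p_value = ks_2samp(live_values, reference)
    return p_value < p_threshold

# A live batch whose mean has shifted should trigger a retraining signal.
live_batch = rng.normal(loc=0.5, scale=1.0, size=2_000)
print(has_drifted(live_batch, training_feature))  # True (drift detected)
```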

Not only do shifts in data call for retraining: issues in data collection call for corrective action. At present, we are confined to reactively correcting data to ensure high quality. The need for correction is often discovered only when decisions start to look strange — but then the damage is already done. Proactive data quality management, on the other hand, involves monitoring data in real time and identifying issues in data collection as they occur. But why stop there? Real-time monitoring and validation enables real-time action. And the best way to minimise the need to fix bad data is to minimise the collection of bad data. We have built in functionality for real-time filtering, to exclude invalid or erroneous data points, and for corrective operations that fix issues in the data before it enters the main data pipeline.
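In spirit, such pre-pipeline filtering and correction can be thought of along the lines of this minimal sketch, where the schema, thresholds, and unit fix are all invented for illustration:

```python
from typing import Iterable, Iterator

# A minimal sketch of pre-pipeline filtering and correction. In practice
# the rules would come from configuration, not hard-coded values.
def filter_and_fix(readings: Iterable[dict]) -> Iterator[dict]:
    """Drop invalid readings and apply light corrections before they
    reach the main data pipeline."""
    for r in readings:
        temp, unit = r.get("temperature"), r.get("unit", "C")
        if temp is None:
            continue                                 # exclude incomplete records
        if unit == "F":                              # corrective operation: normalise units
            temp, unit = (temp - 32) * 5 / 9, "C"
        if not -50.0 <= temp <= 60.0:
            continue                                 # exclude physically implausible values
        yield {**r, "temperature": round(temp, 2), "unit": unit}

stream = [
    {"sensor": "a", "temperature": 21.5, "unit": "C"},
    {"sensor": "b", "temperature": None, "unit": "C"},   # incomplete -> dropped
    {"sensor": "c", "temperature": 999.0, "unit": "C"},  # implausible -> dropped
    {"sensor": "d", "temperature": 70.0, "unit": "F"},   # corrected to 21.11 C
]
print(list(filter_and_fix(stream)))
```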

When we or our executives rely on data streams from e.g. sensors or online customer behaviour to inform decision-making, it is beyond stressful to learn that something calling for urgent action happened hours, days, weeks, or even longer ago. So we think you should be able to get alerts when irregularities occur or new trends emerge. Which you now can, with Validio.

One of the most obvious use cases in which automated data validation can make a huge difference is throughout the machine learning lifecycle

We built in flexibility so that you can customise your validation tasks (if needed), visualise their output, and fiddle with them effortlessly

In all fairness, many validation tasks are fairly standard (which is why it’s so abhorrently boring to do them over and over again with little variation…). But at the same time, we know we haven’t captured everything (yet), and we also know from experience that a specific domain, dataset, or use case can sometimes call for some strange corner case of a test. We built a library of the most commonly used tasks — as well as functionality for you to build your own.
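As a rough, hypothetical illustration of the idea of a library of common tasks plus your own (not the platform’s actual interface), a domain-specific corner case can sit right next to a reusable check:

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import pandas as pd

# A minimal notion of a "validation task": a name plus a function that
# returns True when the dataset passes.
@dataclass
class ValidationTask:
    name: str
    check: Callable[[pd.DataFrame], bool]

# A reusable, library-style check.
def no_nulls(column: str) -> ValidationTask:
    return ValidationTask(f"no_nulls:{column}",
                          lambda df: not df[column].isna().any())

# Domain-specific corner case: negative amounts are allowed, but only for
# rows flagged as returns (column names invented for illustration).
refunds_only_on_returns = ValidationTask(
    "refunds_only_on_returns",
    lambda df: bool(((df["amount"] >= 0) | df["is_return"]).all()),
)

def run_tasks(df: pd.DataFrame, tasks: Sequence[ValidationTask]) -> dict:
    return {task.name: task.check(df) for task in tasks}

df = pd.DataFrame({"order_id": [1, 2, 3],
                   "amount": [10.0, -4.0, -2.0],
                   "is_return": [False, True, False]})
print(run_tasks(df, [no_nulls("order_id"), refunds_only_on_returns]))
# {'no_nulls:order_id': True, 'refunds_only_on_returns': False}
```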

When you’re performing a number of validation tasks in sequence, keeping track of their relative results and potential interrelations can be tricky. If you’re running them in parallel, it’s positively messy and unwieldy. To make this easier, we visualise test results continuously through customisable dashboards.

In addition to enabling time and cost savings, as well as higher-quality decisions, we support compliance through good data quality and good data transparency

As data professionals, we know very well that the devil often hides in the details. So the dashboard doesn’t just illustrate metadata, patterns in your dataset, and validation test results for data points: you can also filter the dataset to show the same results for a particular part of the set — like all values in a class, all values during a time period, or all values in another class that exhibit a certain feature value.

Even when you know you’ve done a good job pre-processing and validating your data, plenty of other stakeholders may ask for evidence of what has been done and how the data looked before and after. This can be for compliance reasons, e.g. to enable examination of the structure of the data underlying a given prediction or decision, or because internal stakeholders want to review the quality of the data (before and/or after pre-processing). In addition to the advantages of automation already mentioned, all digitally executed operations can be made to leave a trace, which we of course want to leverage: on the Validio platform, you can keep track of metadata and data composition over time, and generate automatic reports that present key statistical characteristics of the dataset(s) for a given time period.
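In spirit, the kind of periodic statistical summary such a report contains could be sketched like this (a hypothetical profile over a time window, not the platform’s actual report format):

```python
import pandas as pd

# A minimal sketch of a periodic data-profile report: key statistics per
# numeric column for a given time window (column names invented for illustration).
def profile_report(df: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    window = df[(df["timestamp"] >= start) & (df["timestamp"] < end)]
    numeric = window.select_dtypes("number")
    return pd.DataFrame({
        "count": numeric.count(),           # non-null values in the window
        "null_fraction": numeric.isna().mean(),
        "mean": numeric.mean(),
        "std": numeric.std(),
        "min": numeric.min(),
        "max": numeric.max(),
    })

events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-02-01"]),
    "amount": [10.0, None, 99.0],
})
print(profile_report(events, "2021-01-01", "2021-02-01"))
```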

Lastly, we made sure to lower the threshold for jumping onboard by making our platform adaptable and non-intrusive

Given the importance and sensitivity of data assets and infrastructure, we built the platform as a service that is deployed at your selected cloud provider (or on-prem, if you prefer), so that data never leaves your environment. Furthermore, it integrates with your infrastructure, so you won’t need to compromise on your preferred data environment.

Conclusions?

As you might have guessed, Validio is built for data professionals, by data professionals. We have struggled with data quality for too long — and we have seen our friends and colleagues struggle all the same. With Validio, we want to save the world from bad data. And we hope to share our journey with you — as readers, customers, or potentially even future colleagues!

So follow us here and here — and don’t hesitate to contact us if you’re interested in our platform, our company, or plainly in discussing data quality. After all, not only do we want to save the world from bad data: we want good data to save the world!
