When it comes to creating insights from data, a common saying is "garbage in, garbage out." Rightly so, because the consequences of poor data quality can be severe:
* management makes the wrong decisions based on "polluted" KPIs on a Power BI dashboard
* shareholders lose confidence in the company due to inflated profit figures
* customers file a claim because they pay more (or receive less) than previously sent communication indicated
We regularly see these kinds of errors come to light only at the end of the process (jargon:
"downstream"). At that point, the harm has already been done: the decision has been taken, trust has been damaged and/or the claim has been filed. Errors in the data should therefore be caught as early in the process as possible (jargon: "upstream").
Data validation involves testing each data field against a number of criteria. In theory, this can become quite complex, for example by checking the plausibility of various dependencies between data fields. There are many tools on the market for this, often equipped with artificial intelligence, and these tools can certainly add value. In practice, however, we find that a very large proportion of errors can already be caught by running three simple tests on each data column:
- Is the data field filled?
- Is the data field of the correct data type?
- Does the data field fall within the expected range?
Take a customer's date of birth as an example. To calculate a premium, this field must always be populated. In addition, the field must always be of type "date" in a consistent format, for example the ISO format (year-month-day). Finally, we want an error to be raised when the date of birth lies in the future or more than 120 years in the past.
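As an illustration, below is a minimal sketch in Python of what these three checks could look like for a date-of-birth field. The ISO format and the 120-year bound follow the example above; the function names and error messages are illustrative assumptions, not a prescribed implementation.

```python
from datetime import date

MAX_AGE_YEARS = 120  # upper bound taken from the example above


def check_filled(value) -> bool:
    """Check 1: is the field populated at all?"""
    return value is not None and str(value).strip() != ""


def check_type(value) -> bool:
    """Check 2: does the field parse as an ISO-formatted date (YYYY-MM-DD)?"""
    try:
        date.fromisoformat(str(value))
        return True
    except ValueError:
        return False


def check_range(value) -> bool:
    """Check 3: the date may not lie in the future or more than 120 years back."""
    parsed = date.fromisoformat(str(value))
    age_in_years = (date.today() - parsed).days / 365.25  # approximation suffices here
    return 0 <= age_in_years <= MAX_AGE_YEARS


def validate_date_of_birth(value) -> list[str]:
    """Run the three checks in order and collect human-readable error messages."""
    if not check_filled(value):
        return ["date of birth is missing"]
    if not check_type(value):
        return ["date of birth is not an ISO date (YYYY-MM-DD)"]
    if not check_range(value):
        return ["date of birth lies in the future or more than 120 years in the past"]
    return []


print(validate_date_of_birth("1985-06-15"))  # [] -> valid
print(validate_date_of_birth("15-06-1985"))  # wrong format
print(validate_date_of_birth("2150-01-01"))  # out of range
```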
A validator that consistently performs these three tests on every data field turns out to be relatively easy to implement, while the payoff is surprisingly large: it eliminates a great deal of rework, frustration and financial risk.
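To show why the implementation effort stays modest, the sketch below generalizes the same three checks into a small rule table that is applied column by column. It assumes records arrive as dictionaries; the column names, rules and thresholds are purely hypothetical.

```python
from datetime import date

# Per-column rules: is the field required, how should it parse, and what range is plausible?
RULES: dict[str, dict] = {
    "date_of_birth": {
        "required": True,
        "parse": date.fromisoformat,
        "in_range": lambda d: date(1900, 1, 1) <= d <= date.today(),
    },
    "annual_premium": {
        "required": True,
        "parse": float,
        "in_range": lambda x: 0 < x < 100_000,
    },
}


def validate_row(row: dict) -> list[str]:
    """Apply the three checks to every configured column of one record."""
    errors = []
    for column, rule in RULES.items():
        raw = row.get(column)
        if raw in (None, ""):                       # check 1: is the field filled?
            if rule["required"]:
                errors.append(f"{column}: missing value")
            continue
        try:                                        # check 2: correct data type?
            parsed = rule["parse"](raw)
        except (ValueError, TypeError):
            errors.append(f"{column}: cannot be parsed ({raw!r})")
            continue
        if not rule["in_range"](parsed):            # check 3: within the expected range?
            errors.append(f"{column}: outside expected range ({parsed!r})")
    return errors


print(validate_row({"date_of_birth": "1985-06-15", "annual_premium": "750"}))   # []
print(validate_row({"date_of_birth": "2150-01-01", "annual_premium": "-10"}))   # two errors
```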