How to measure data quality across multiple dimensions?

Yannick Saillet
Jun 17, 2024

How to measure all the facets of your data quality with IBM Knowledge Catalog

In a previous article, I suggested a theoretical foundation for computing data quality scores that are consistent and comparable, no matter what technique is used to find the data quality issues. In a later article, I presented how this computation method has been used in combination with a set of out-of-the-box data quality checks implemented in a product that was called IBM Watson Knowledge Catalog at the time.

Since I wrote these articles, a lot has changed in this product. Besides renaming it to IBM Knowledge Catalog some time ago, IBM has invested a lot of time and effort in redesigning the architecture, the user experience, and the way you can manage data quality at scale with it.

With the release of IBM Knowledge Catalog 5.0, a new major version, I'd like to give an update on how the concepts introduced in my earlier articles have evolved into a more mature platform for managing your data quality centrally and at scale.

It will take more than one article to go through all the novelties, so expect more articles in the coming weeks. In this one, I will focus on an update on the data quality scores and how they are made explainable to the user.

A bit of terminology: scores, dimensions, checks, and issues

To better understand how some of the concepts I introduced in my previous articles have evolved, let's first look at the main screen where recent versions of IBM Knowledge Catalog present the data quality results for a given data asset:

Data Quality results of a data asset in IBM Knowledge Catalog

The first thing to notice is that the overall normalized data quality score, as introduced in “How to quantify Data Quality?”, is still there, but it is no longer the only score on this screen. Next to it you will notice so-called dimension scores, which break the overall score down into the different aspects of data quality.

As mentioned in my previous articles, there are different facets of data quality, and not all of them are equally important for a given user; it depends on the intended usage of the data. A data quality score is the result of a series of data quality checks that verify the data against a set of expectations. However, there may be many data quality checks specialized in different aspects of data quality, and looking only at the results of the individual checks or at the overall score may not be enough to recognize at first glance whether the data fits the purpose or not.

For these reasons, IBM Knowledge Catalog computes not only a single data quality score per data asset, but also a score per data quality dimension.

A data quality dimension is simply a logical grouping of data quality checks that are all specialized in the same aspect of data quality. Over time, the industry has converged on a kind of unofficial standard set of data quality dimensions, such as Completeness, Accuracy, Consistency, Validity, Uniqueness, Integrity, and so on, but you don't have to stick to any convention: you can define your own list of data quality dimensions.

During the data quality analysis (which doesn't have to be a single analysis using a single technology), experiments are run on the data to evaluate data quality checks that enforce data quality expectations or constraints. Each of these checks searches for one or multiple types of data quality issues and reports them to the system. Each check can be mapped to a data quality dimension, so that the issues it identifies only affect the score of its dimension and the overall score, but not the other dimensions.

The lower part of the screen shows the actual data quality checks that have been applied to the data set, together with their results. You can see that each check is mapped to a dimension. When the data quality analysis runs on the data, all these checks are evaluated and their results are aggregated into a single data quality score for their respective dimension. The computation of the dimension scores from the results of the checks follows the same mathematical formulas that I introduced in “How to quantify Data Quality?”. The dimension scores are then aggregated into a single overall score, as shown in the following illustration.

DQ checks contribute to dimension scores which contribute to the overall score
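
To make this roll-up concrete, here is a minimal Python sketch of the aggregation. It is not the product's implementation: it assumes that each check simply reports the fraction of records it found affected by an issue, and that issues found by different checks of the same dimension affect disjoint records, whereas IBM Knowledge Catalog uses the formulas described in “How to quantify Data Quality?”. All check names and numbers are made up.

```python
from collections import defaultdict

# Hypothetical check results: each check is mapped to exactly one dimension
# and reports the fraction of records for which it found an issue.
check_results = [
    {"check": "missing values in EMAIL",    "dimension": "Completeness", "issue_rate": 0.02},
    {"check": "EMAIL format",               "dimension": "Validity",     "issue_rate": 0.05},
    {"check": "COUNTRY in reference list",  "dimension": "Validity",     "issue_rate": 0.01},
]

def dimension_scores(results):
    """Aggregate check results into one score per dimension.

    Simplifying assumption: issue rates of checks belonging to the same
    dimension are summed (capped at 1.0), i.e. the issues they find do not
    overlap. Checks mapped to other dimensions have no influence.
    """
    issue_rate = defaultdict(float)
    for result in results:
        issue_rate[result["dimension"]] += result["issue_rate"]
    return {dim: 1.0 - min(rate, 1.0) for dim, rate in issue_rate.items()}

print(dimension_scores(check_results))
# e.g. {'Completeness': 0.98, 'Validity': 0.94} (up to float rounding)
```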

Overall data quality score revisited

Now that we have an intermediate layer of dimension scores, the question is how to compute the overall score.

The solution here is simple:

The overall score is the average of the dimension scores.

This formula has the advantage of being easy to understand, but it also brings other benefits compared to a solution where the DQ check results would directly feed the overall score:

First, it avoids overweighting dimensions that have more DQ checks: the weight of a dimension in the overall score shouldn't depend on the number or complexity of the checks necessary to measure that dimension. Some dimensions may be easy to cover with a single rule, while others may require multiple rules to cover all their requirements.

For instance, the completeness dimension may only require a simple check for missing values, while a domain validity dimension may have to be measured by many rules, each of them capturing a different type of data. But at the end of the day, the data has a certain percentage of complete values and a certain percentage of domain-compliant values, and it should not matter how many checks were necessary to find these issues.

That’s why the overall score should only depend on the dimension scores but not on the details of how these dimension scores were computed.

Another advantage of this method is that it would allow weighting the impact of each dimension on the overall score without having to know the details of which data quality check contributed to which dimension. Note that at the time of this writing, IBM Knowledge Catalog doesn't yet support specifying dimension weights, but the formula used to compute the overall score would make it easy to add.
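
Here is a minimal sketch of this last aggregation step, assuming dimension scores between 0 and 1; the `weights` parameter illustrates the hypothetical extension mentioned above and is not a product feature:

```python
def overall_score(dim_scores, weights=None):
    """Average the dimension scores into a single overall score.

    dim_scores maps dimension names to scores between 0 and 1; dimensions
    that are not measured are simply absent. weights is a hypothetical
    extension: the product currently uses the plain, unweighted average.
    """
    if not dim_scores:
        return 1.0  # no check searched for issues, so no issue is known
    if weights is None:
        return sum(dim_scores.values()) / len(dim_scores)
    total = sum(weights.get(dim, 1.0) for dim in dim_scores)
    return sum(s * weights.get(dim, 1.0) for dim, s in dim_scores.items()) / total

print(overall_score({"Completeness": 0.98, "Validity": 0.94, "Uniqueness": 1.0}))
# ~0.973: each dimension counts once, regardless of how many checks fed into it
```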

Column scores and data quality coverage

We have seen how the overall data quality score of an asset can be decomposed into sub-scores for the different dimensions. But there is another way you may want to drill down into the details of the overall score: by column.

IBM Knowledge Catalog lets you see the dimension scores for each field or column of the data asset at a glance, as shown in the following screen:

Data Quality dimension scores by column

This representation has two advantages:

  • It shows on a single screen which columns of the data set may have the most problems for the different dimensions.
  • It gives you an overview of the coverage of your data quality checks.

Let me elaborate on the second point as it is a new idea compared to my previous articles and has an impact on how we want to compute the data quality scores:

In my older article, I wrote that a data asset without any quality check should by definition have a score of 100%, because it has no known issue. While this is still true for the overall score, the dimension scores don't follow that principle: a dimension score only exists for an asset, or for a field of an asset, if there is at least one data quality check belonging to this dimension searching for issues on it.

The reason is that you may have a large number of different dimensions, and those dimensions may not be relevant for all assets or all types of columns. If every dimension reported a score of 100% for all assets, even when there is no data quality check for it, adding a new dimension would artificially improve the overall data quality scores of all existing assets.

To avoid this undesirable effect, we introduced this new rule:

A dimension score is only computed where issue types belonging to this dimension are actively searched for by data quality checks.

The absence of a score for certain dimensions on certain fields is easy to see on the previous screen, and this representation gives a rudimentary idea of the data quality check coverage per dimension.
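
Expressed with the `overall_score` sketch from the previous section (and purely illustrative numbers), this rule simply means that uncovered dimensions never appear in the input:

```python
measured = {"Completeness": 0.98, "Validity": 0.94}
print(overall_score(measured))                         # ~0.96

# If an uncovered "Timeliness" dimension defaulted to 100% instead of being
# skipped, merely defining it would artificially improve the overall score:
print(overall_score({**measured, "Timeliness": 1.0}))  # ~0.973
```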

The data quality check coverage is inspired by the test coverage concept used in software testing, where ensuring that the tests cover all possible code paths of a program is as important as the results of the tests themselves.

Obviously we don't have code paths when testing data, but we have data quality dimensions. If a data quality dimension has been identified as relevant for a certain data set, it is important to know whether the data quality checks used to measure the data quality for this dimension are sufficient or not. This is the equivalent of test coverage for data quality.

This also means that if a dimension is only measured on a few fields, these fields will have a higher impact on the overall score than other fields. This effect appears to be useful for certain dimensions like uniqueness:

If a table is supposed to contain unique entities, you will usually want to measure the uniqueness of a few key columns rather than test that all columns in the data asset have unique values. In this case, the uniqueness score should only be measured on the key fields used to identify records; all the other fields play no role in the identity of records and should therefore play no role in the score of the uniqueness dimension.
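
As a sketch of what that means in practice, the following hypothetical uniqueness check only looks at the declared key fields and ignores every other column (the record layout and field names are invented for the example):

```python
from collections import Counter

def uniqueness_score(records, key_fields):
    """Fraction of records whose combination of key fields occurs exactly once."""
    keys = [tuple(rec[f] for f in key_fields) for rec in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records) if records else 1.0

rows = [
    {"CUSTOMER_ID": 1, "NAME": "Ada", "CITY": "Berlin"},
    {"CUSTOMER_ID": 2, "NAME": "Bob", "CITY": "Berlin"},  # repeated CITY: irrelevant
    {"CUSTOMER_ID": 2, "NAME": "Bob", "CITY": "Munich"},  # repeated key: an issue
]
print(uniqueness_score(rows, key_fields=["CUSTOMER_ID"]))  # ~0.33
```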

Drilling down into fields

Coming back to the UI, the user can drill down from the previous screen into a single field. The data quality results for a field contain exactly the same kind of information as the results for the overall data set, but they only focus on the scores, dimensions, and data quality checks relevant to the selected field:

Data quality scores and checks for a specific column

As you can see, all previous concepts seen at the level of a data asset are still valid at the level of a single field of this data asset.

A field has an overall data quality score computed as the average of all dimension scores for this field. The dimension scores themselves are calculated from the results of the data quality checks applied to this field for each dimension.

A field may have fewer dimension scores than the overall asset if some dimensions are only measured on other fields but not on this one.

The user can choose to ignore individual data quality checks for certain fields if those checks are irrelevant in the context of these fields. In the same way, individual fields can be ignored when computing the overall data quality score. This flexibility is useful when data sets have hundreds of fields but only a small portion of them is relevant for the intended usage. In this case, you want to focus the overall score on the important fields only and not let the irrelevant fields make the score appear better than it should be.
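
Here is a rough sketch of how such exclusions could enter the computation, building on the earlier `overall_score` function. It assumes, only for the sake of the example, that an asset-level dimension score is the plain average of that dimension's field-level scores; the product's actual aggregation follows the formulas of the earlier articles:

```python
from collections import defaultdict

def asset_dimension_scores(field_dim_scores, ignored_fields=()):
    """Asset-level dimension scores from field-level ones, skipping ignored fields."""
    total, count = defaultdict(float), defaultdict(int)
    for field, dims in field_dim_scores.items():
        if field in ignored_fields:
            continue  # ignored fields do not influence the asset scores at all
        for dim, score in dims.items():
            total[dim] += score
            count[dim] += 1
    return {dim: total[dim] / count[dim] for dim in total}

fields = {
    "EMAIL":    {"Completeness": 0.98, "Validity": 0.94},
    "COUNTRY":  {"Completeness": 1.00, "Validity": 0.90},
    "FILLER_1": {"Completeness": 0.10},  # irrelevant legacy column
}
print(overall_score(asset_dimension_scores(fields, ignored_fields={"FILLER_1"})))
# ~0.955 without the noise from FILLER_1
```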

Adding additional data quality checks

Once you have identified with the help of the previous screens that there is a gap in the coverage of the data quality checks for certain dimensions on certain fields, you can create additional data quality checks.

The technology used to implement the checks is not important. IBM Knowledge Catalog provides different ways of implementing them. One of them is creating data quality rules, which can be expressed either in SQL or in a proprietary rule language. These rules can implement any data quality expectation, no matter how complex its logic is.

The basic requirements for those checks are:

  • They need to be able to find issues that can be mapped to a dimension.
  • They need to be able to measure the percentage of the data affected by the issues they search for.

If a technology fulfills these two requirements, it can be used to implement a data quality check.
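
A minimal sketch of that contract in Python; the names are illustrative and not the product's API:

```python
from typing import Any, Mapping, Protocol, Sequence

class DataQualityCheck(Protocol):
    """What any technology must provide to act as a data quality check."""

    # the data quality dimension the issues found by this check are mapped to
    dimension: str

    def issue_rate(self, records: Sequence[Mapping[str, Any]]) -> float:
        """Return the fraction of records affected by the issue this check searches for."""
        ...
```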

The following screen shows an example of a custom rule created to verify that the combination of name and address is unique for each row of the analyzed table.

As you can see, when such a rule is created, the user can map it to one of the existing data quality dimensions. A set of predefined dimensions is installed by default with the product, but the list can be customized.

Create a new data quality rule and assign it to a dimension
Define the rule logic
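
For illustration only, here is the same logic as a small Python function: it flags every record whose combination of name and address occurs more than once and returns the resulting issue rate. In the product, the rule itself would be expressed in SQL or in the rule language, and the column names below are assumptions:

```python
from collections import Counter

def duplicate_name_address_rate(records):
    """Fraction of records violating the 'NAME + ADDRESS must be unique' expectation."""
    combos = [(rec["NAME"], rec["ADDRESS"]) for rec in records]
    counts = Counter(combos)
    violating = sum(1 for combo in combos if counts[combo] > 1)
    return violating / len(records) if records else 0.0
```

That issue rate is what then feeds the score of the dimension assigned to the rule, as described below.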

By creating a new rule and binding it to a particular data asset, we added a new data quality check to the list of data quality experiments running against this data asset. The additional check increased the coverage of the specified data quality dimension for the two fields that play an important role in the uniqueness criterion of this table.

When the rule runs, it identifies any record that doesn't match its condition as an issue. These identified issues only impact the data quality score of the dimension assigned to the rule, and they only affect the scores of the two fields bound to the rule.

This will also impact the data asset score for the same dimension and, of course, to a lesser extent, the overall score of the data asset. However, the scores of the other dimensions and fields will remain unchanged.

A quick look at the consolidated data quality results confirms that the test coverage for the uniqueness dimension was extended to the two fields where the rule was applied.

Improved test coverage for the Uniqueness dimension after the addition of a new rule

You can also observe that this rule identified 50% duplicates, which led to a drop in the score of the Uniqueness dimension while the other dimensions remained unchanged. The overall score was also degraded, but to a lesser extent, because the scores of the other dimensions did not change.

The uniqueness score dropped after the execution of the new rule.
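
To make this “lesser extent” concrete with purely illustrative numbers: if an asset is measured on five dimensions and four of them stay at 100% while Uniqueness falls from 100% to 50%, the overall score only drops from 100% to (4 × 100% + 50%) / 5 = 90%.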

This differentiation of the scores allows for better alerting and lets you react earlier to a quality drop in a single dimension, even if the overall score is still high. It also makes it possible to define different types of alerts and remediation processes for the different dimensions.

Summary

We have seen how the introduction of dimension scores as an intermediate layer between the scores of the individual data quality checks and the overall data quality score of the asset enables a more differentiated data quality measurement. We have also seen how the different scores are computed from each other, and how this can be used to get a basic notion of coverage of the data quality checks. Finally, we have seen how IBM Knowledge Catalog computes and displays these scores to give a 360° view of the data quality of each cataloged data asset, no matter what technology was used to identify the issues.

In a future article, I will provide a deeper dive into the flexibility of this framework and how it can be used to automate the data quality process.

Yannick Saillet

Software Architect, Master Inventor @IBM — Architect for Data Profiling and Data Quality in IBM Knowledge Catalog on IBM Cloud Pak for Data, IBM Cloud.