Nov 16, 2020

Right, your conclusion is correct. There should not be one universal data quality score, but one score per usage intent. The formula would be the same, but the constraints and weights that you put on the different types of problems for the different columns should differ. Poor quality in a particular column is not an issue if you use the data set to build a model that doesn't use that column at all. For another user it may be a showstopper.
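To make this concrete, here is a minimal sketch of one formula producing a different score per usage intent. Everything in it is an assumption for illustration: the column names, the problem rates, the weights, and the formula itself (one minus a weighted average of per-column problem rates) are not taken from any product.

```python
from typing import Dict

# Hypothetical fraction of rows with a detected problem, per column.
problem_rates: Dict[str, float] = {
    "email": 0.40,
    "age": 0.05,
    "income": 0.10,
}

# Hypothetical weights per usage intent; a weight of 0 means the intent
# does not use that column at all.
intent_weights: Dict[str, Dict[str, float]] = {
    "churn_model": {"email": 0.0, "age": 1.0, "income": 1.0},
    "mail_campaign": {"email": 1.0, "age": 0.0, "income": 0.0},
}


def quality_score(rates: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return 1 minus the weighted average problem rate (assumed formula)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        # No column matters for this intent, so nothing can lower the score.
        return 1.0
    penalty = sum(weights.get(col, 0.0) * rate for col, rate in rates.items())
    return 1.0 - penalty / total_weight


# Same data set, same formula, one score per intent.
scores = {
    intent: quality_score(problem_rates, weights)
    for intent, weights in intent_weights.items()
}
```

With these made-up numbers, the poor `email` column has no effect on the `churn_model` score but dominates the `mail_campaign` score, which is the point: the same data set can be fine for one user and unusable for another.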

One way we solve this in the IBM products is that you can add the same data set to different "projects", each with its own data rules and data quality configuration, to better reflect the analysis intent. We also have other features that automatically apply data rules (constraints) based on the data classes or terms identified for each column.
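The idea of applying rules automatically based on the detected data class can be sketched roughly as follows. This is not the IBM products' API; `RULES_BY_CLASS`, `violations`, and the two example rules are hypothetical names and checks chosen only to illustrate the mechanism.

```python
import re
from typing import Callable, Dict, List

# Hypothetical mapping from a data class to the rules that any column
# of that class should satisfy.
RULES_BY_CLASS: Dict[str, List[Callable[[str], bool]]] = {
    "email": [
        lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    ],
    "age": [
        lambda v: v.isdigit() and 0 <= int(v) <= 120,
    ],
}


def violations(column_values: List[str], data_class: str) -> List[str]:
    """Return the values that break at least one rule of the data class."""
    rules = RULES_BY_CLASS.get(data_class, [])
    return [v for v in column_values if not all(rule(v) for rule in rules)]


bad_emails = violations(["a@b.com", "bad"], "email")
bad_ages = violations(["34", "200", "x"], "age")
```

Once a column is classified, its rules come for free; a column whose class has no registered rules simply yields no violations.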

Ideally the constraints should come together with an industry model corresponding to your specific industry. Industry models contain the business terms and policies that are relevant for an industry, and the data classes that make it possible to detect those terms in the data (see one of my other articles showing how these play together). Ideally the model should also contain the data quality constraints, so that the most important constraints are in place before the data sets are analyzed.

Written by Yannick Saillet
