Data Quality Overview

Before working on data quality, one has to appreciate what data quality is, why it is so important, and what the reasons for data quality failures are.

Data Quality Definition: What is Data Quality?

Data quality is not linear and has many dimensions, such as Accuracy, Completeness, Consistency, Timeliness and Auditability. Having data quality on only one dimension is as good as 'no quality'.

None of the Data Quality dimensions is complete by itself, and the dimensions often overlap.

Data Accuracy dimension of Data Quality:

Accuracy of data is the degree to which data correctly reflects the real-world object or event being described.

Examples of Data Accuracy:

  • The address of a customer in the customer database is the real address.
  • The temperature recorded by the thermometer is the real temperature.
  • The bank balance in the customer's account is the real value the customer is due from the bank.

Data Completeness dimension of Data Quality

Completeness of data is the extent to which the expected attributes of data are provided.

For example, customer data is considered complete if:

  • All customer addresses, contact details and other information are available.
  • Data of all customers is available.

The definition of Data Completeness is 'expected completeness'. It is possible that some data is not available, yet the data is still considered complete because it meets the expectations of the user. Every data requirement has 'mandatory' and 'optional' aspects. For example, if a customer's mailing address is mandatory and it is available, the record is complete; because the customer's office address is optional, it is OK if it is not available.
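The idea of 'expected completeness' can be sketched as a small check. This is only an illustration: the field names and the choice of which fields are mandatory are assumptions, not part of any particular system.

```python
# Hypothetical field names. Which fields are mandatory is an assumption
# made for illustration; optional fields (e.g. office_address) are simply
# not listed here and are never reported as missing.
MANDATORY_FIELDS = {"name", "mailing_address"}


def missing_mandatory(record):
    """Return the sorted mandatory fields that are absent or blank in a record."""
    return sorted(
        field
        for field in MANDATORY_FIELDS
        if not str(record.get(field) or "").strip()
    )


def is_complete(record):
    """A record is complete when every mandatory field has a value."""
    return not missing_mandatory(record)


customer = {"name": "A. Customer", "mailing_address": "12 High Street"}
print(is_complete(customer))  # office_address is absent, but it is optional
```

A record without an office address still passes, while a record with a blank mailing address does not.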

Data can be complete, but inaccurate:

  • All the customers' addresses are available, but many of them are not correct.
  • The health records of all patients have a 'last visit' date, but some of them contain future dates.
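A check for the second case might look like the sketch below. The record layout (patient_id, last_visit) is hypothetical, and a future date is only one of many possible signs of inaccuracy.

```python
from datetime import date


def future_last_visits(records, today):
    """Return the patient IDs whose 'last_visit' date lies after 'today'."""
    return [r["patient_id"] for r in records if r["last_visit"] > today]


# Hypothetical records: every record has a last_visit (complete),
# but one of them cannot be right (inaccurate).
records = [
    {"patient_id": "P1", "last_visit": date(2024, 1, 5)},
    {"patient_id": "P2", "last_visit": date(2030, 1, 1)},
]
print(future_last_visits(records, today=date(2024, 6, 1)))
```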

Data Consistency dimension of Data Quality

Consistency of data means that data across the enterprise should be in sync with each other.

Examples of data inconsistency are:

  • An agent is inactive, but the agent's disbursement account is still active.
  • A credit card is cancelled, and inactive, but the card billing status shows 'due'.
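A consistency rule like the first example can be sketched as a cross-check between two data sets. The record layouts and status values here are assumptions for illustration only.

```python
def inactive_agents_with_active_accounts(agents, accounts):
    """Return IDs of inactive agents that still hold an active disbursement account."""
    inactive = {a["agent_id"] for a in agents if a["status"] == "inactive"}
    return sorted(
        acc["agent_id"]
        for acc in accounts
        if acc["agent_id"] in inactive and acc["status"] == "active"
    )


# Hypothetical data from two separate systems.
agents = [
    {"agent_id": "A1", "status": "active"},
    {"agent_id": "A2", "status": "inactive"},
]
accounts = [
    {"agent_id": "A1", "status": "active"},
    {"agent_id": "A2", "status": "active"},
]
print(inactive_agents_with_active_accounts(agents, accounts))
```

Each record may be valid on its own; the inconsistency only shows up when the two sources are compared.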

Data can be accurate (i.e., it represents what happened in the real world), but still inconsistent:

  • An airline promotion campaign's closure date is Jan 31, yet there is a passenger ticket booked under the campaign on Feb 2.

Data is inconsistent when it is in sync within a narrow domain of an organization, but not in sync across the organization. For example:

  • The collection management system has the cheque status as 'cleared', but in the accounting system the money is not shown as credited to the bank account. The reason for this kind of inconsistency is that the system interfaces are synchronized during the end-of-day batch runs.

Data can be complete, but inconsistent:

  • Data for all the packages dispatched from New York to Chicago is available, but some of the packages are also shown with an 'under bar-coding' status.

Data Timeliness

'Data delayed' is 'Data Denied'

The timeliness of data is extremely important. This is reflected in:

  • Companies being required to publish their quarterly results within a given time frame.
  • Customer service providing up-to-date information to customers.
  • Credit system checking on the credit card account activity.

Timeliness depends on user expectation. Online availability of data could be required for a room allocation system in hospitality, but overnight data is fine for a billing system.

Examples of data not being timely:

  • The courier package has been delivered, but its status will be updated in the system only in the night batch run. This means that the online status will not be available until then.
  • The financial statements of a company are published one month after the year-end.
  • The census data is available two years after the census is done.
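Because timeliness depends on user expectation, one way to express it is as the lag between an event and the moment its data becomes available, compared against an acceptable limit. The sketch below assumes illustrative timestamps and limits; none of the figures come from a real system.

```python
from datetime import datetime, timedelta


def is_timely(event_time, available_time, max_lag):
    """Data is timely when it becomes available within max_lag of the event."""
    return available_time - event_time <= max_lag


# Hypothetical courier example: delivered in the afternoon,
# status loaded by the night batch run.
delivered = datetime(2024, 3, 1, 14, 0)
loaded = datetime(2024, 3, 1, 23, 30)

# An online tracking user expects the status within minutes.
print(is_timely(delivered, loaded, max_lag=timedelta(minutes=15)))
# A billing user is content with overnight data.
print(is_timely(delivered, loaded, max_lag=timedelta(hours=24)))
```

The same data is late for one user and on time for another, which is why the limit is a parameter rather than a constant.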

Data Auditability

Auditability means that any transaction, report, accounting entry, bank statement, etc. can be traced back to its originating transaction. This needs a common identifier, which should stay with a transaction as it undergoes transformation, aggregation and reporting.

Examples of non-auditable data:

  • A car chassis number cannot be linked to the part number supplied by an ancillary.
  • A surgery report cannot be linked to the ID of the doctor who made the preliminary diagnosis or to the pathologist ID.
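Keeping a common identifier through aggregation can be sketched as follows. The transaction layout (txn_id, account, amount) is hypothetical; the point is only that each aggregated figure keeps the IDs of the transactions it was built from.

```python
from collections import defaultdict


def aggregate_with_lineage(transactions):
    """Sum amounts per account while keeping the IDs of the source transactions."""
    totals = defaultdict(lambda: {"total": 0, "source_ids": []})
    for txn in transactions:
        entry = totals[txn["account"]]
        entry["total"] += txn["amount"]
        entry["source_ids"].append(txn["txn_id"])
    return dict(totals)


# Hypothetical transactions feeding a per-account report.
transactions = [
    {"txn_id": "T1", "account": "A", "amount": 100},
    {"txn_id": "T2", "account": "A", "amount": 50},
    {"txn_id": "T3", "account": "B", "amount": 20},
]
report = aggregate_with_lineage(transactions)
print(report["A"]["total"], report["A"]["source_ids"])
```

If the aggregation step dropped the transaction IDs, the report totals could no longer be tracked to their originating transactions.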