Beryl Analytics Blog

Data Quality Fundamentals: The Six Dimensions and How to Test Them

Bad data is expensive in a way that hides from the budget. Nobody files an invoice for the marketing spend wasted on duplicate contacts, the forecast that missed because half the orders had null timestamps, or the executive trust that quietly evaporated after the third dashboard that disagreed with itself. Data quality is the discipline that prevents all of that, and it is far more concrete than its reputation suggests. There is a standard framework of six dimensions, and each one can be tested automatically. This article covers all six and shows how to turn them into tests that run on every pipeline load.

Why data quality is an engineering problem, not a vibe

Teams often treat data quality as something you sense rather than measure. A report looks off, someone investigates, a fix is applied, and everyone moves on until the next surprise. That reactive loop is the disease. The cure is to define what good means precisely enough to test it, then run those tests on every load so that broken data is caught at the source rather than discovered downstream by a stakeholder. The six dimensions give you that vocabulary.

The six dimensions and how to test each

1. Accuracy

Accuracy is whether the data matches reality. A customer record with the wrong address is inaccurate even if every field is filled in and well formatted. Accuracy is the hardest dimension to test automatically because it requires a source of truth to compare against. In practice you test it through reconciliation: does the total revenue in the warehouse match the total in the source system? Does the row count after a load match the source? Where an authoritative reference exists, sample against it. Where it does not, reconciliation against the system of record is your best proxy.
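As a rough illustration of reconciliation, here is a minimal Python sketch that compares row counts and summed revenue between a source extract and a warehouse extract. The function name `reconcile`, the `amount` field, and the sample rows are all hypothetical; a real check would query both systems rather than use in-memory lists.

```python
from decimal import Decimal


def reconcile(source_rows, warehouse_rows, amount_field="amount", tolerance=Decimal("0.00")):
    """Compare row counts and summed amounts between two extracts.

    Amounts are parsed as Decimal so that currency totals are compared exactly.
    """
    source_total = sum((Decimal(str(r[amount_field])) for r in source_rows), Decimal("0"))
    warehouse_total = sum((Decimal(str(r[amount_field])) for r in warehouse_rows), Decimal("0"))
    return {
        "row_count_matches": len(source_rows) == len(warehouse_rows),
        "total_matches": abs(source_total - warehouse_total) <= tolerance,
        "difference": warehouse_total - source_total,
    }


# Hypothetical sample data: one order lost a cent somewhere in the load.
source = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": "5.01"}]
warehouse = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": "5.00"}]

result = reconcile(source, warehouse)
print(result)
```

The row counts agree here but the totals do not, which is exactly the kind of drift a count-only check would miss.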

2. Completeness

Completeness is whether the data that should be there is there. Two flavours matter: missing rows (did the load drop records) and missing values (are required fields null). Test row completeness with count comparisons against the source and against historical norms, so a load that brings in 60 percent of yesterday's volume raises a flag. Test value completeness with not-null assertions on the fields that downstream logic depends on. A null in an optional notes field is fine; a null in the order amount is a defect.
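Both flavours can be sketched in a few lines of Python. The helper names (`row_volume_ok`, `null_violations`), the 80 percent threshold, and the sample rows are assumptions for illustration; pick thresholds and required fields that match your own pipelines.

```python
def row_volume_ok(todays_count, baseline_count, min_ratio=0.8):
    """Return True if today's row count is at least min_ratio of the baseline."""
    if baseline_count <= 0:
        # No usable baseline yet, so there is nothing to compare against.
        return True
    return todays_count / baseline_count >= min_ratio


def null_violations(rows, required_fields):
    """Return (row_index, field) pairs where a required field is missing or None."""
    return [
        (index, field)
        for index, row in enumerate(rows)
        for field in required_fields
        if row.get(field) is None
    ]


# Hypothetical sample: the notes field is optional, the amount is not.
rows = [
    {"order_id": 1, "amount": 10.0, "notes": None},
    {"order_id": 2, "amount": None, "notes": "gift"},
]

missing = null_violations(rows, required_fields=["order_id", "amount"])
print(row_volume_ok(600, 1000))  # a load at 60 percent of the baseline volume
print(missing)
```

Only the null amount is reported; the null in the optional notes field is deliberately left out of `required_fields`.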

3. Consistency

Consistency is whether the data agrees with itself across systems and tables. If the customer table says a user is active and the subscription table says they cancelled, you have an inconsistency. Test it with cross-table assertions: every order references a customer that exists, the sum of line items equals the order total, the same entity carries the same status everywhere. Inconsistency is the root cause of the classic "two dashboards disagree" problem, and catching it at load time is far cheaper than debugging it in a board meeting.
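Two of those cross-table assertions might look like the following in Python. This is a sketch over in-memory rows with made-up table and column names (`customer_id`, `total_cents`, `amount_cents`); in a warehouse the same logic would usually be a join or an anti-join.

```python
def orphaned_orders(orders, customers):
    """Return ids of orders whose customer_id does not exist in customers."""
    known_customers = {c["customer_id"] for c in customers}
    return [o["order_id"] for o in orders if o["customer_id"] not in known_customers]


def total_mismatches(orders, line_items):
    """Return ids of orders whose total differs from the sum of their line items."""
    sums = {}
    for item in line_items:
        sums[item["order_id"]] = sums.get(item["order_id"], 0) + item["amount_cents"]
    return [o["order_id"] for o in orders if sums.get(o["order_id"], 0) != o["total_cents"]]


# Hypothetical sample data; amounts are integer cents to avoid float rounding.
customers = [{"customer_id": "c1"}]
orders = [
    {"order_id": "o1", "customer_id": "c1", "total_cents": 1500},
    {"order_id": "o2", "customer_id": "c9", "total_cents": 700},
]
line_items = [
    {"order_id": "o1", "amount_cents": 1000},
    {"order_id": "o1", "amount_cents": 500},
    {"order_id": "o2", "amount_cents": 650},
]

orphans = orphaned_orders(orders, customers)
mismatched = total_mismatches(orders, line_items)
print(orphans, mismatched)
```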

4. Timeliness

Timeliness is whether the data is fresh enough for the decision. A perfectly accurate dataset that is two days stale is useless for a real-time operational decision and fine for a quarterly review. Test it with freshness checks: assert that the most recent record is no older than your agreed threshold, and alert when a pipeline has not loaded on schedule. Stale data that looks current is one of the most dangerous failures because nothing on the dashboard signals that the numbers stopped updating.
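A freshness check reduces to comparing the newest timestamp against an agreed maximum age. The sketch below assumes timezone-aware UTC timestamps and a hypothetical helper called `is_fresh`; the two-hour and four-hour thresholds are placeholders for whatever your stakeholders have agreed.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(latest_record_at, max_age, now=None):
    """Return True if the newest record is no older than max_age.

    Both timestamps must be timezone-aware; `now` is injectable for testing.
    """
    now = now or datetime.now(timezone.utc)
    return now - latest_record_at <= max_age


# Hypothetical example: the last record arrived three and a half hours ago.
latest = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
check_time = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)

fresh_for_ops = is_fresh(latest, timedelta(hours=2), now=check_time)
fresh_for_daily = is_fresh(latest, timedelta(hours=4), now=check_time)
print(fresh_for_ops, fresh_for_daily)
```

The same data fails the tighter threshold and passes the looser one, which mirrors the point that freshness is relative to the decision being made.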

5. Validity

Validity is whether the data conforms to its rules and formats. An email field that contains text with no @ sign is invalid. A status field containing a value outside the allowed set is invalid. A negative quantity where only positives make sense is invalid. Test validity with format checks (regular expressions for emails, phone numbers, identifiers), range checks (amounts within plausible bounds), and accepted-value checks (the status is one of the known states). Validity tests are cheap to write and catch a large share of upstream entry errors.
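The three kinds of validity check can be combined into one row-level function. This is a sketch only: the email pattern is deliberately loose (something, an @ sign, something, a dot, something) and is not a full address validator, and the allowed statuses and field names are invented for the example.

```python
import re

# Loose format check, an assumption for illustration rather than a complete email grammar.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical accepted values for the status field.
ALLOWED_STATUSES = {"pending", "shipped", "cancelled"}


def validity_errors(row):
    """Return the names of the fields in this row that break a validity rule."""
    errors = []
    if not EMAIL_PATTERN.match(row.get("email") or ""):
        errors.append("email")  # format check
    if row.get("status") not in ALLOWED_STATUSES:
        errors.append("status")  # accepted-value check
    quantity = row.get("quantity")
    if not isinstance(quantity, int) or quantity <= 0:
        errors.append("quantity")  # range check; assumes zero is also invalid
    return errors


good = validity_errors({"email": "ana@example.com", "status": "shipped", "quantity": 2})
bad = validity_errors({"email": "ana.example.com", "status": "lost", "quantity": -1})
print(good, bad)
```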

6. Uniqueness

Uniqueness is whether each real-world thing appears exactly once. Duplicate customer records inflate counts, waste marketing spend, and corrupt aggregates. Test it with uniqueness assertions on primary keys and with duplicate detection on natural keys like email plus name. Duplicates often creep in through repeated loads or upstream merges, so test on every load rather than once at setup.
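Here is one way to sketch both checks in Python: count how often each key occurs and report anything seen more than once. The `duplicate_keys` helper and its `normalise` option (trim and lowercase strings before comparing) are assumptions for illustration; real duplicate detection on natural keys often needs fuzzier matching than this.

```python
from collections import Counter


def duplicate_keys(rows, key_fields, normalise=False):
    """Return {key_tuple: count} for every key that appears more than once."""

    def key(row):
        values = tuple(row[field] for field in key_fields)
        if normalise:
            # Trim and lowercase strings so trivially different spellings collide.
            values = tuple(v.strip().lower() if isinstance(v, str) else v for v in values)
        return values

    counts = Counter(key(row) for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Hypothetical sample: distinct primary keys, but two rows describe the same person.
customers = [
    {"customer_id": 1, "email": "ana@example.com", "name": "Ana Silva"},
    {"customer_id": 2, "email": "Ana@Example.com ", "name": "ana silva"},
    {"customer_id": 3, "email": "bo@example.com", "name": "Bo Chen"},
]

primary_key_dupes = duplicate_keys(customers, ["customer_id"])
natural_key_dupes = duplicate_keys(customers, ["email", "name"], normalise=True)
print(primary_key_dupes, natural_key_dupes)
```

The primary key check passes while the natural key check finds the duplicate, which is why both are worth running.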

Turning dimensions into automated tests

The point of the framework is operational. Each dimension maps to a class of assertion that should run automatically on every pipeline run: not-null and accepted-value checks for completeness and validity, uniqueness checks for keys, freshness checks for timeliness, reconciliation and row-count checks for accuracy, and cross-table referential checks for consistency. Modern transformation tooling lets you declare these tests alongside your models so they run as part of the build and fail loudly before bad data reaches a dashboard. The discipline is to treat a failed data test exactly like a failed unit test in software: a blocker, not a warning to be ignored. If you want a structured assessment of where your pipelines stand across all six dimensions, Beryl Analytics runs data quality audits that produce a prioritised remediation plan.
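To make the "blocker, not a warning" idea concrete, here is a minimal Python sketch of a check runner that raises when any check fails, so the load stops instead of continuing. `DataTestFailure`, `run_checks`, and the sample checks are hypothetical names; dedicated transformation and testing tools provide the same behaviour with far more features.

```python
class DataTestFailure(Exception):
    """Raised when one or more data checks fail; meant to stop the load."""


def run_checks(checks):
    """Run every (name, callable) pair and raise if any callable returns False."""
    failed = [name for name, check in checks if not check()]
    if failed:
        raise DataTestFailure("failed checks: " + ", ".join(failed))
    return len(checks)


# Hypothetical batch with one defect: a null order amount.
rows = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": None}]

checks = [
    ("order_id is unique", lambda: len({r["order_id"] for r in rows}) == len(rows)),
    ("amount is not null", lambda: all(r["amount"] is not None for r in rows)),
]

try:
    run_checks(checks)
    outcome = "passed"
except DataTestFailure as exc:
    # In a real pipeline you would let this propagate and fail the run.
    outcome = str(exc)

print(outcome)
```

Raising an exception rather than logging a warning is the design choice that matters: the pipeline cannot publish the batch until someone deals with the failure.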

The downstream cost of bad data

Every dimension maps to a real cost. Inaccuracy and inconsistency erode trust, and once leadership stops trusting the dashboard they go back to gut feel, which makes the entire analytics investment worthless. Incompleteness and invalid values silently bias models and forecasts, so the prediction is wrong in ways nobody can see. Stale data drives decisions on a world that no longer exists. Duplicates waste spend and corrupt every count. The compounding cost is that bad data does not announce itself; it just makes every decision slightly worse until someone notices the cumulative damage. Testing at the source is dramatically cheaper than absorbing that drift.

Frequently asked questions

Which dimension should I test first? Start with the failures that hurt most for your business. For most teams that is completeness and timeliness (did the data arrive, and is it fresh), because those failures break reports outright.

How many tests is enough? Enough to cover the fields and relationships that real decisions depend on. Do not aim for total coverage of every column; aim for full coverage of the load-bearing ones. Reach out if you want help drawing that line.

Tags: data quality, data quality dimensions, data quality testing, data reliability

© Beryl Analytics 2026. AI-powered data analytics for New Zealand and Australia.