docs: Add content on validation and additional checks #65

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account


Open · wants to merge 2 commits into base: live

33 changes: 29 additions & 4 deletions docs/components/validator.md
@@ -1,11 +1,36 @@
# Validator and Quality Tool

## Summary
A data validator and quality tool checks that data conforms to a standard, providing both pass/fail validation against the standard's schema and codelists, and additional checks on data quality, coverage, and adherence to best practices.

-Providing a report on technical validity of data against the schema. Providing feedback on the content of datasets, based on a set of data quality rules. Machine and human-readable rules used to check data quality.
+Implementers can use a validator to get feedback on the quality of their draft and published data. They can also integrate validation into their data publication pipelines. Data users can use a validator to identify data quality issues that might impact their analysis. Similarly, data registries can incorporate validation results to provide a summary of quality issues in each dataset. Furthermore, support staff can use a validator to provide feedback and guidance to implementers.

## Description
To cater to different audiences, validators can offer various interfaces: for example, a user-friendly web application for implementers to upload data and receive immediate feedback, a command-line tool for developers to run local checks, and a software library that developers can embed within their data pipelines.
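
As a rough sketch of how a single codebase might serve several of these interfaces — `validate_dataset` and its message format are hypothetical — the same library function can be imported by a web application or wrapped in a thin command-line tool:

```python
import argparse
import json


def validate_dataset(data):
    """Hypothetical library entry point: return a list of quality messages."""
    messages = []
    if "publisher" not in data:  # illustrative check only
        messages.append("Missing recommended field: publisher")
    return messages


def main():
    # A thin command-line wrapper around the same function that a web
    # application or data pipeline could import and call directly.
    parser = argparse.ArgumentParser(description="Validate a JSON dataset")
    parser.add_argument("path", help="path to a JSON data file")
    args = parser.parse_args()
    with open(args.path) as f:
        data = json.load(f)
    for message in validate_dataset(data):
        print(message)


if __name__ == "__main__":
    main()
```

Shipping the library and the command-line tool together helps keep the checks identical across interfaces.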

-Part of a standard is often schema, and reporting on technical validity against the schema is a way of programmatically checking that the data conforms to the schema and can be used by other tools that expect data to conform to the schema. By providing validation as an online service, implementers can validate their data without
+For more information about how schema validation relates to additional checks, see [author your schema, codelists and additional rules](../development/schema.md#author-your-schema-codelists-and-additional-rules).
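
For illustration, here is a minimal sketch of reporting technical validity programmatically, using the Python `jsonschema` library; the schema and data are invented for the example:

```python
from jsonschema import Draft7Validator

# Invented schema: a grant must have a string id and a numeric amount.
schema = {
    "type": "object",
    "required": ["id", "amount"],
    "properties": {
        "id": {"type": "string"},
        "amount": {"type": "number"},
    },
}

data = {"id": 123, "amount": "ten"}  # deliberately invalid

for error in Draft7Validator(schema).iter_errors(data):
    # Each error carries a path into the data and a human-readable
    # message, which a validator can surface in its reports.
    print(list(error.absolute_path), error.message)
```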

## Prioritisation Factors

* Specific error reporting and user experience: If implementers need context-specific error messages and guidance, targeted feedback, or multiple output formats (e.g. human-readable reports and machine-readable JSON for integration with other tools).
* Complexity beyond the schema language: If the standard involves additional rules that cannot be expressed in its schema language, validation of codelists specified outside the schema, or semantic validation of the data beyond its structure and format (see the sketch below for an example of such a rule).
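
For instance, a cross-field rule — here a hypothetical check that an end date does not precede a start date — cannot be expressed in standard JSON Schema, but is straightforward as an additional check:

```python
from datetime import date


def check_date_order(record):
    """Hypothetical additional rule: endDate must not precede startDate.

    JSON Schema can constrain each date's format individually, but not
    the relationship between the two values.
    """
    start = date.fromisoformat(record["startDate"])
    end = date.fromisoformat(record["endDate"])
    if end < start:
        return ["endDate is earlier than startDate"]
    return []


print(check_date_order({"startDate": "2024-06-01", "endDate": "2024-01-31"}))
```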

## Deprioritisation Factors

* Simplicity: If the standard is purely structural, can be fully expressed in a schema language, and validated by existing tooling, an 'off-the-shelf' validator might be sufficient.
* Technical audience: If the standard's audience is developers with experience of standardising data, existing validation libraries or command-line tools might be sufficient.

## Examples

The Open Contracting Data Standard (OCDS) provides a web-based validator (the [OCDS Data Review Tool](https://review.standard.open-contracting.org/)) and a command-line tool and Python library ([Lib CoVE OCDS](https://github.com/open-contracting/lib-cove-ocds)).

360Giving provides a web-based [Data Quality Tool](https://dataquality.threesixtygiving.org/).

## Related components

* [Schema](schema)
* [Required fields](required_fields)
* [Codelists](codelists)
* [Registry of datasets](registry_of_datasets)

## Related patterns

* [Permissive schema](../patterns/schema.md#permissive-schema)
23 changes: 13 additions & 10 deletions docs/development/schema.md
@@ -54,14 +54,14 @@ The 360Giving Data Standard supports both spreadsheet and JSON formats, but most

```{seealso}

-* 🧩 [Conversion tools](../components/conversion_tools)
-* 💡 [Spreadsheet first schema design](../patterns/schema.md#spreadsheet-first)
+🧩 [Conversion tools](../components/conversion_tools)
+💡 [Spreadsheet first schema design](../patterns/schema.md#spreadsheet-first)

```

## Choose a schema language

-A schema defines the meaning, structure and format of data.
+A schema defines the meaning, structure and format of data. It can be used to validate that data is correctly structured and formatted.
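
For example (an illustrative fragment using the Python `jsonschema` package; the field is invented), the same schema that documents a field's meaning also supports automated validation:

```python
import jsonschema

# The schema documents meaning (title) and constrains structure and format.
schema = {
    "type": "object",
    "properties": {
        "awardDate": {
            "title": "Award date",
            "type": "string",
            "format": "date",
        }
    },
}

jsonschema.validate({"awardDate": "2024-01-31"}, schema)  # conforms: no error

try:
    jsonschema.validate({"awardDate": 20240131}, schema)
except jsonschema.ValidationError as error:
    print(error.message)  # e.g. "20240131 is not of type 'string'"
```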

Based on your chosen publication formats, you need to decide on a language in which to document the schema for a standard,

@@ -88,7 +88,7 @@ A codelist defines a set of permissible values for a field.
The recommended approach is to document codes, titles and descriptions in a CSV file, according to the [Open Data Services Codelist Schema](https://codelist-schema.readthedocs.io/).

```{seealso}
-* 💡 [CSV codelists](../patterns/schema.md#csv-codelists)
+💡 [CSV codelists](../patterns/schema.md#csv-codelists)
```
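
As an illustration of how a CSV codelist can drive an automated check — the codes shown are invented, and the column headings assume the Code/Title/Description convention from the codelist schema:

```python
import csv
import io

# A codelist as it might appear in a CSV file; codes are invented.
codelist_csv = """Code,Title,Description
open,Open,Anyone may apply
invitation,Invitation,Applications are by invitation only
"""

codes = {row["Code"] for row in csv.DictReader(io.StringIO(codelist_csv))}

# A validator or quality tool can then check field values against the codelist.
value = "Open"
if value not in codes:
    print(f"'{value}' is not in the codelist; expected one of {sorted(codes)}")
```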

## Choose your packaging formats
@@ -117,20 +117,23 @@ GeoJSON | GeoJSON [feature collections](https://datatracker.ietf.org/doc/html/rf
```

```{seealso}
-* 💡 [Packaging](../patterns/schema.md#packaging)
-* 💬 [Packaging multiple networks · Issue #51 · Open-Telecoms-Data/open-fibre-data-standard](https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/51)
-* 💬 [Deprecate remaining package metadata and add bulk data format · Issue #1084 · open-contracting/standard](https://github.com/open-contracting/standard/issues/1084)
-* 💬 [Add a metadata package schema · Issue #200 · GFDRR/rdl-standard](https://github.com/GFDRR/rdl-standard/issues/200)
+💡 [Packaging](../patterns/schema.md#packaging)
+💬 [Packaging multiple networks · Issue #51 · Open-Telecoms-Data/open-fibre-data-standard](https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/51)
+💬 [Deprecate remaining package metadata and add bulk data format · Issue #1084 · open-contracting/standard](https://github.com/open-contracting/standard/issues/1084)
+💬 [Add a metadata package schema · Issue #200 · GFDRR/rdl-standard](https://github.com/GFDRR/rdl-standard/issues/200)
```

> **Contributor:** Any reason to remove the bullets here? I think the structure is good for this longer list

-## Author your schema and codelists
+## Author your schema, codelists and additional rules

Authoring the schema and codelists for a standard involves documenting the standard's data model in your chosen schema language and codelist formats.

JSON Schema specifies a number of keywords to describe and constrain JSON data. For example, the `type` keyword is used to restrict a field to a specific type, like "string" or "number", whilst the `title` keyword is used to provide a human-readable title for a field.
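
For example, an illustrative field definition using both keywords (the field name is invented), shown here as a Python literal:

```python
# An invented field definition using the `type` and `title` keywords.
field_definition = {
    "amountAwarded": {
        "title": "Amount awarded",
        "type": "number",
    }
}
```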

As well as the keywords specified in JSON Schema, the [Open Data Services JSON Schema Extension](https://json-schema-extension.readthedocs.io/) specifies additional keywords for linking fields to [CSV codelists](../patterns/schema.md#csv-codelists), and providing information about [deprecated fields](../patterns/schema.md#deprecated-fields).
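
As a sketch of how these keywords might appear in a field definition — the keyword spellings (`codelist`, `openCodelist`, `deprecated`) and the field names here are illustrative, so refer to the extension's documentation for the authoritative forms:

```python
# Invented field definitions using the extension's keywords (as we read
# them): `codelist` links a field to a CSV codelist, `openCodelist` says
# whether values outside the codelist are allowed, and `deprecated`
# records a deprecation.
fields = {
    "grantProgramme": {
        "title": "Grant programme",
        "type": "string",
        "codelist": "grantProgramme.csv",
        "openCodelist": False,
    },
    "fundingOrg": {
        "title": "Funding organisation",
        "type": "string",
        "deprecated": {
            "description": "This field was replaced by fundingOrganization.",
            "deprecatedVersion": "1.2",
        },
    },
}
```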

Constraints expressed in a schema are requirements that data must conform to in order to be considered valid. However, you might wish to impose less stringent rules related to data quality, coverage, or best practices. If your chosen schema language cannot express a rule that you need to impose, or if the rule is intentionally less strict than a requirement, consider providing structured documentation of these additional rules and implementing them as additional checks in a [validator and quality tool](../components/validator). For example, you might recommend and check that data includes geographic coordinates, even if they are not required in the schema. Clearly specifying additional rules and implementing additional checks makes it easier for data publishers to identify data quality issues.
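
A minimal sketch of the coordinates example as an additional check (field names invented):

```python
def check_has_coordinates(record):
    """Additional check, intentionally weaker than a schema requirement.

    The schema does not require a location, but data without coordinates
    is flagged so that publishers can improve coverage.
    """
    location = record.get("location", {})
    if "latitude" not in location or "longitude" not in location:
        return ["Consider adding geographic coordinates to the location"]
    return []


print(check_has_coordinates({"id": "grant-1", "location": {"latitude": 51.5}}))
```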

```{seealso}
💡 [Schema patterns](../patterns/schema)
🧩 [Validator and quality tools](../components/validator)
```