Commit 1706ce1

Merge pull request #49 from OpenDataServices/schema
Rewrite docs/development/schema.md and related components and patterns
2 parents 3b28453 + ebec2f6 commit 1706ce1

10 files changed: 326 additions & 53 deletions

codelist-schema.json

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
{
2+
"$defs": {
3+
"Row": {
4+
"properties": {
5+
"Deprecated": {
6+
"$ref": "https://codelist-schema.readthedocs.io/1__0__0/codelist-schema.json#/$defs/Row/properties/Deprecated"
7+
},
8+
"Deprecation note": {
9+
"$ref": "https://codelist-schema.readthedocs.io/1__0__0/codelist-schema.json#/$defs/Row/properties/Deprecation note"
10+
}
11+
}
12+
}
13+
}
14+
}

docs/components/index.md

Lines changed: 6 additions & 0 deletions
@@ -515,6 +515,12 @@ Allowing conversion between serialization formats (e.g. CSV -> XML; JSON -> XLS)
515515

516516
Data standards often use structured data formats such as JSON or XML to give more flexibility in modelling and to allow validation against schema. Typically, developers prefer to work with structured data formats as they are easier to work with in programs. However, JSON and XML aren't very human-friendly, and people working with data in many domains prefer to use flat representations of data such as CSV and XLSX spreadsheets, both for publishing and manipulating data. Conversion tools allow conversion between the formats, to allow the standard and developers to retain the benefits of a structured data format and users to continue to be able to engage with the data in a way that they're comfortable with.
517517

518+
#### Examples
519+
520+
##### Flatten Tool
521+
522+
We maintain [Flatten Tool](http://flatten-tool.readthedocs.io), a Python library and command-line interface for converting data between structured formats like JSON and XML and tabular formats like CSV and XLSX. It can use a standard's schema to handle data types correctly, to produce human-readable column headings, and to structure tabular data helpfully.
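
To illustrate the general idea of converting structured data to a tabular format, here is a minimal sketch using only the Python standard library. It is not Flatten Tool's implementation or API: the `flatten` and `to_csv` helpers, the slash-separated column paths and the sample records are all assumptions made for illustration.

```python
import csv
import io


def flatten(obj, parent=""):
    """Flatten nested dicts into one dict keyed by slash-separated paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{parent}/{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_csv(records):
    """Serialize a list of nested records as CSV text."""
    rows = [flatten(record) for record in records]
    # Collect column headings in first-seen order.
    headings = []
    for row in rows:
        for heading in row:
            if heading not in headings:
                headings.append(heading)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headings, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data, not taken from any real standard.
grants = [
    {"id": "g1", "amount": 100, "recipient": {"name": "A"}},
    {"id": "g2", "amount": 250, "recipient": {"name": "B"}},
]
print(to_csv(grants))
```

A schema-aware tool can go further than this sketch, for example by replacing path-based headings with the human-readable titles defined in the schema.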
523+
518524
#### Prioritisation Factors
519525

520526
- If the standard uses a structured data format, while data publishers and/or users prefer flat representations.

docs/conf.py

Lines changed: 2 additions & 2 deletions
@@ -30,7 +30,7 @@
3030
# Add any Sphinx extension module names here, as strings. They can be
3131
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
3232
# ones.
33-
extensions = ['sphinxcontrib.mermaid', 'myst_parser', 'sphinxcontrib.mermaid']
33+
extensions = ['sphinxcontrib.mermaid', 'myst_parser', 'sphinxcontrib.mermaid', 'sphinx_togglebutton', 'sphinx_design', 'sphinxcontrib.jsonschema']
3434

3535
# Myst parser configuration
3636

@@ -359,6 +359,6 @@
359359

360360

361361
linkcheck_ignore = [
362-
# The ODI is now behind a Clouflare challenge that we can't check.
362+
# The ODI is now behind a Cloudflare challenge that we can't check.
363363
'http://www.theodi.org'
364364
]

docs/development/schema.md

Lines changed: 103 additions & 42 deletions
@@ -1,75 +1,136 @@
1-
# Schema development
1+
# Data modelling and schema development
22

3-
This section outlines a number of the pattens we commonly use to develop the schema for an open data standard.
3+
This page provides an overview of the steps involved in data modelling and schema development.
44

5-
```{admonition} Learning / reflection
6-
---
7-
class: note
8-
---
9-
We currently jump straight from the conceptual framework document, to working up the data model and schema for a standard in a schema language.
5+
## Document a data model
106

11-
This differs [from the approach proposed here](https://github.com/open-contracting-archive/technical-approach#data-model) of maintaining the **data model** as a narrative document, only then given form by a schema as a reference implementation.
7+
A data model is an abstract model that organizes elements of data and standardises how they relate to one another and to the properties of real-world entities. A data model focuses on what data represents rather than how it is stored or exchanged.
8+
9+
Before authoring the schema for a standard and committing to specific implementation details, it is recommended to document a data model to help stakeholders align on definitions and relationships.
10+
11+
The data model for a standard should be based on [research](research.md) into the related policy area and a thorough understanding of the concepts which underpin it (a conceptual model). Documenting the data model for a standard involves identifying and defining the entities (classes), attributes (properties), relationships and permissible values (codelists) needed to satisfy the requirements, user stories and use cases for the standard.
12+
13+
Developing a good data model is an art as much as a science. It requires sensitivity to the needs of both data producers and data users, and an understanding of the incentive structures that will drive adoption of a standard.
14+
15+
The recommended approach is to document the data model using the [standard development template](../tools.md#standard-development-template-airtable), which ensures that the data model is grounded through explicit links to the requirements, user stories and use cases.
16+
17+
```{admonition} History
18+
:class: dropdown
19+
20+
Previously, we moved straight from documenting a conceptual framework to documenting a schema. The reasons for documenting a data model are explored in the [technical scoping](https://github.com/open-contracting-archive/technical-approach?tab=readme-ov-file#data-model) for the Open Contracting Data Standard.
1221
1322
```
1423

15-
## Schema language
24+
## Choose your publication formats
25+
26+
A publication format is a format in which data can be published by implementers of a standard. Common publication formats include:
27+
28+
* [JSON](https://www.json.org/json-en.html)
29+
* [GeoJSON](https://datatracker.ietf.org/doc/html/rfc7946)
30+
* [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) and other tabular formats, such as XLSX and ODS.
31+
* [XML](https://www.w3.org/TR/xml/)
32+
33+
Based on your [research](research.md), you need to decide which publication formats to support.
1634

17-
Our preferred schema language is [JSON Schema v0.4](https://tools.ietf.org/html/draft-zyp-json-schema-04).
35+
It is [best practice](https://www.w3.org/TR/dwbp/#MultipleFormats) for data publishers to provide data in multiple formats, so that as many users as possible can use the data without first having to transform it to their preferred format. Therefore, you should consider how to support publication in multiple formats.
1836

19-
This allows us to provide field structures and definitions. Although less expressive than other schema languages, the constraints of JSON Schema enable us to focus on keeping data simple enough for a wide range of users.
37+
On a technical level, the recommended approach is to use JSON as the primary format around which a standard's tools are built, and to provide support for other formats through conversion tooling. Depending on the user needs identified in your research, a standard's documentation site and tooling might present an alternative format, such as CSV, as the primary format.
2038

21-
We generally use simple CSV files to represent codelists.
39+
Open Data Services' reusable tools for documenting, converting and validating data are built around JSON. If your research surfaces demand for a different primary format that cannot be supported through conversion to JSON, you should consider the potential costs associated with authoring new tooling.
2240

23-
We have a number of extensions to JSON Schema 0.4 we use (documented below).
41+
```{admonition} Example: The Open Contracting Data Standard
42+
:class: note
2443
25-
## Serializations
44+
The primary publication format of the Open Contracting Data Standard is JSON, but CSV and spreadsheet formats are also supported via conversion tooling. For more information, see [Serialization (Open Contracting Data Standard Documentation)](https://standard.open-contracting.org/latest/en/guidance/build/serialization/#serialization).
2645
27-
We design with a range of serializations in mind, and, where possible, to enable round-tripping of data between different serializations.
46+
```
2847

29-
In particular, through [flatten-tool](http://flatten-tool.readthedocs.io) we design with support for:
48+
```{admonition} Example: 360Giving
49+
:class: note
3050
31-
- Structured JSON serialization;
32-
- Excel serialization;
33-
- CSV serialization.
51+
The 360Giving Data Standard supports both spreadsheet and JSON formats, but most 360Giving data is published in spreadsheet format. Therefore, the documentation for the standard is primarily focussed on the spreadsheet format. For more information, see [Choosing your file format (360Giving Data Standard Documentation)](https://standard.threesixtygiving.org/en/latest/guidance/prepare-data/#choosing-your-file-format).
3452
35-
[Flatten-tool](http://flatten-tool.readthedocs.io/en/latest/unflatten/#human-friendly-headings-using-a-json-schema-with-titles) can use the titles in a schema to provide 'friendly' column headings, and with use of a [metatab](http://flatten-tool.readthedocs.io/en/latest/unflatten/#metadata-tab) also supports packaging meta-data and options to control how spreadsheets are parsed.
53+
```
3654

37-
## Extended JSON schema
55+
```{seealso}
3856
39-
We use a number of custom properties in our JSON Schema implementation. A [patch against JSON Schema 0.4 to include these is found here](https://github.com/open-contracting/standard/blob/6e538252dd08344222b5cd16b864ed0a2a866197/standard/schema/metaschema/meta-schema-patch.json).
57+
* 🧩 [Conversion tools](../components/index.md#conversion-tools)
58+
* 💡 [Spreadsheet first schema design](../patterns/schema.md#spreadsheet-first)
4059
41-
### Codelist properties
60+
```
4261

43-
- `codelist` - the filename of a .csv file that contains at least a `Code` column. Used by the CoVE validator to check for acceptable values.
44-
- `openCodelist` - a boolean value to indicate whether values can **only** come from the codelist, or whether additional values not on the codelist are permitted. When `openCodelist` = 'true' then encountering a value not on the codelist should generate a warning. When `openCodelist` = 'false' then encountering a value not on the codelist should generate an error.
62+
## Choose a schema language
4563

46-
### Deprecation properties
64+
A schema defines the meaning, structure and format of data.
4765

48-
> "Deprecation is the discouragement of use of some terminology, feature, design, or practice; typically because it has been superseded or is no longer considered efficient or safe – but without completely removing it or prohibiting its use."
66+
Based on your chosen publication formats, you need to decide on a language in which to document the schema for a standard.
4967

50-
See: [Deprecation (Wikipedia)](https://en.wikipedia.org/wiki/Deprecation)
68+
For standards that support JSON as a publication format, the preferred approach is to use [JSON Schema](https://json-schema.org/) to document the canonical schema for the standard, specifically [JSON Schema Draft 2020-12](https://json-schema.org/draft/2020-12). Although less expressive than other schema languages, the constraints of JSON Schema enable a focus on keeping data simple enough for a wide range of users.
5169

52-
- `deprecated` - and object to indicate that the field is deprecated, consisting of fields for:
53-
- `description` - a message that explains the deprecation, and that should be presented by validators to any publisher using this field.
54-
- `deprecatedVersion` - a string indicating the version in which the field was first deprecated.
70+
If you choose to support other publication formats alongside JSON, you should consider whether to provide secondary, derived schema for those formats.
5571

56-
We also use the column title `Deprecated` with a version number as the cell value in codelist CSV files when a code has been deprecated.
72+
```{admonition} Example: Open Referral
73+
:class: note
5774
58-
### Merge strategies
75+
The canonical schema for the Open Referral Data Specifications is documented using JSON Schema. However, a secondary schema is provided for the Tabular Data Package format, which is derived from the canonical schema. For more information, see [Serialization and Publication Formats (Open Referral Data Specifications Documentation)](http://docs.openreferral.org/en/latest/hsds/serialization.html).
5976
60-
The Open Contracting Data Standard describes an approach to merge together releases of data from different point in time. We add a number of properties to indicate how merging should be approached.
77+
```
6178

62-
- `omitWhemMerged`
63-
- `wholeListMerge`
64-
- `versionId`
79+
```{admonition} History
80+
:class: dropdown
81+
Previously, the recommended approach was to use [JSON Schema Draft 4](https://json-schema.org/draft-04/draft-zyp-json-schema-04). However, Draft 2020-12 contains several useful features not available in Draft 4.
82+
```
6583

66-
Behaviour for these is [described in the OCDS documentation](http://standard.open-contracting.org/1.1/en/schema/merging/#merging-rules).
84+
## Choose a codelist format
6785

68-
## Design patterns
86+
A codelist defines a set of permissible values for a field.
6987

70-
Developing a good schema is an art as much as a science. It requires sensitivity to the needs of both data producers and data users, and an understanding of the incentive structures that will drive adoption of a standard.
88+
The recommended approach is to document codes, titles and descriptions in a CSV file, according to the [Open Data Services Codelist Schema](https://codelist-schema.readthedocs.io/).
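
As a rough sketch of how a CSV codelist might be consumed, the following example loads a codelist with Python's standard `csv` module and checks values against it. The column names `Code`, `Title` and `Description`, the sample codes and the helper names are assumptions for illustration; consult the Codelist Schema documentation for the authoritative column definitions.

```python
import csv
import io

# Hypothetical codelist content; a real codelist would live in its own .csv file.
CODELIST_CSV = """Code,Title,Description
open,Open,The process is open to all.
closed,Closed,The process is restricted.
"""


def load_codelist(text):
    """Return a dict mapping each code to its full CSV row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Code"]: row for row in reader}


def is_valid_code(value, codes):
    """Check whether a value appears in the codelist."""
    return value in codes


codes = load_codelist(CODELIST_CSV)
print(is_valid_code("open", codes))
print(is_valid_code("unknown", codes))
```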
7189

7290
```{seealso}
73-
[Schema patterns](../patterns/schema.md)
74-
The following section provides links to a non-exhaustive set of design patterns that can be drawn upon when developing a schema.
91+
* 💡 [CSV codelists](../patterns/schema.md#csv-codelists)
7592
```
93+
94+
## Choose your packaging formats
95+
96+
A packaging format is a structured way of bundling together data and, sometimes, metadata. You can think of a packaging format as a container for multiple records, texts or documents.
97+
98+
Packaging formats aid interoperability and reuse by providing tool developers and analysts with predictable and consistent approaches to grouping, streaming and pagination.
99+
100+
Based on your chosen publication formats and the requirements identified in your research, you need to decide on a packaging format or formats for each publication format.
101+
102+
The recommended approach is to consider providing:
103+
104+
* A small file and API response format for files that are small enough to fit into memory or are published via API.
105+
* A bulk download format for files that are too large to fit into memory.
106+
107+
```{admonition} Example: The Open Fibre Data Standard
108+
:class: note
109+
110+
The Open Fibre Data Standard supports publication in JSON, GeoJSON and CSV formats. For the JSON and GeoJSON formats, it provides containers for publishing one or more networks and options to support pagination and streaming:
111+
112+
Format | Small files and API responses | Streaming
113+
--- | --- | ---
114+
JSON | A JSON object with an embedded array of `Network` objects, with an optional `.links` object for pagination | A [JSON Lines](https://jsonlines.org/) file in which each line is a `Network` object.
115+
GeoJSON | GeoJSON [feature collections](https://datatracker.ietf.org/doc/html/rfc7946#section-3.3), with an optional `.links` object for pagination | [Newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) files
116+
117+
```
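
The difference between the two kinds of packaging format can be sketched in a few lines of Python. This is an illustrative example only: the `networks` key, the helper names and the sample records are assumptions, not taken from the Open Fibre Data Standard schema.

```python
import io
import json


def iter_small_file(text, key="networks"):
    """Parse the whole package into memory, then yield each embedded record."""
    package = json.loads(text)
    yield from package.get(key, [])


def iter_json_lines(stream):
    """Yield one record per non-empty line, without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample data standing in for real files.
small = json.dumps({"networks": [{"id": "a"}, {"id": "b"}]})
bulk = io.StringIO('{"id": "a"}\n{"id": "b"}\n')

ids_small = [network["id"] for network in iter_small_file(small)]
ids_bulk = [network["id"] for network in iter_json_lines(bulk)]
print(ids_small, ids_bulk)
```

The streaming reader only ever holds one record in memory at a time, which is what makes a line-delimited format suitable for bulk downloads.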
118+
119+
```{seealso}
120+
* 💡 [Packaging](../patterns/schema.md#packaging)
121+
* 💬 [Packaging multiple networks · Issue #51 · Open-Telecoms-Data/open-fibre-data-standard](https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/51)
122+
* 💬 [Deprecate remaining package metadata and add bulk data format · Issue #1084 · open-contracting/standard](https://github.com/open-contracting/standard/issues/1084)
123+
* 💬 [Add a metadata package schema · Issue #200 · GFDRR/rdl-standard](https://github.com/GFDRR/rdl-standard/issues/200)
124+
```
125+
126+
## Author your schema and codelists
127+
128+
Authoring the schema and codelists for a standard involves documenting the standard's data model in your chosen schema language and codelist formats.
129+
130+
JSON Schema specifies a number of keywords to describe and constrain JSON data. For example, the `type` keyword is used to restrict a field to a specific type, like "string" or "number", whilst the `title` keyword is used to provide a human-readable title for a field.
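
As a small illustration of the `type` and `title` keywords, the sketch below defines a schema as a Python dict and applies a deliberately simplified type check. The field names and the `type_errors` helper are hypothetical, and the check covers only a fraction of JSON Schema; in practice you would use a full validator such as the `jsonschema` library.

```python
# A hypothetical schema fragment using the `type` and `title` keywords.
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "name": {"title": "Name", "type": "string"},
        "amount": {"title": "Amount", "type": "number"},
    },
}

# Mapping from the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "number": (int, float)}


def type_errors(instance, schema):
    """Return a message for each property whose value has the wrong type."""
    errors = []
    for name, subschema in schema["properties"].items():
        if name not in instance:
            continue
        value = instance[name]
        expected = JSON_TYPES[subschema["type"]]
        # bool is a subclass of int in Python, but it is neither a JSON
        # string nor a JSON number, so it is always rejected here.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{subschema['title']}: expected {subschema['type']}")
    return errors


print(type_errors({"name": "A", "amount": 5}, schema))
print(type_errors({"name": "A", "amount": "5"}, schema))
```

Note how the `title` keyword lets the error message use a human-readable label rather than the raw field name.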
131+
132+
As well as the keywords specified in JSON Schema, the [Open Data Services JSON Schema Extension](https://json-schema-extension.readthedocs.io/) specifies additional keywords for linking fields to [CSV codelists](../patterns/schema.md#csv-codelists), and providing information about [deprecated fields](../patterns/schema.md#deprecated-fields).
133+
134+
```{seealso}
135+
💡 [Schema patterns](../patterns/schema.md)
136+
```

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ adoption/index
4141
learning/index
4242
components/index
4343
patterns/index
44+
tools
4445
about/index
4546
meta/index
4647
