|
1 |
| -# Schema development |
| 1 | +# Data modelling and schema development |
2 | 2 |
|
3 |
| -This section outlines a number of the pattens we commonly use to develop the schema for an open data standard. |
| 3 | +This page provides an overview of the steps involved in data modelling and schema development. |
4 | 4 |
|
5 |
| -```{admonition} Learning / reflection |
6 |
| ---- |
7 |
| -class: note |
8 |
| ---- |
9 |
| -We currently jump straight from the conceptual framework document, to working up the data model and schema for a standard in a schema language. |
| 5 | +## Document a data model |
10 | 6 |
|
11 |
| -This differs [from the approach proposed here](https://github.com/open-contracting-archive/technical-approach#data-model) of maintaining the **data model** as a narrative document, only then given form by a schema as a reference implementation. |
| 7 | +A data model is an abstract model that organizes elements of data and standardises how they relate to one another and to the properties of real-world entities. A data model focuses on what data represents rather than how it is stored or exchanged. |
| 8 | + |
| 9 | +Before authoring the schema for a standard and committing to specific implementation details, it is recommended to document a data model to help stakeholders align on definitions and relationships. |
| 10 | + |
| 11 | +The data model for a standard should be based on [research](research.md) into the related policy area and a thorough understanding of the concepts which underpin it (a conceptual model). Documenting the data model for a standard involves identifying and defining the entities (classes), attributes (properties), relationships and permissible values (codelists) needed to satisfy the requirements, user stories and use cases for the standard. |
| 12 | + |
| 13 | +Developing a good data model is an art as much as a science. It requires sensitivity to the needs of both data producers and data users, and an understanding of the incentive structures that will drive adoption of a standard. |
| 14 | + |
| 15 | +The recommended approach is to document the data model using the [standard development template](../tools.md#standard-development-template-airtable), which ensures that the data model is grounded through explicit links to the requirements, user stories and use cases. |
| 16 | + |
| 17 | +```{admonition} History |
| 18 | +:class: dropdown |
| 19 | +
|
| 20 | +Previously, we moved straight from documenting a conceptual framework to documenting a schema. The reasons for documenting a data model are explored in the [technical scoping](https://github.com/open-contracting-archive/technical-approach?tab=readme-ov-file#data-model) for the Open Contracting Data Standard. |
12 | 21 |
|
13 | 22 | ```
|
14 | 23 |
|
15 |
| -## Schema language |
| 24 | +## Choose your publication formats |
| 25 | + |
| 26 | +A publication format is a format in which data can be published by implementers of a standard. Common publication formats include: |
| 27 | + |
| 28 | +* [JSON](https://www.json.org/json-en.html) |
| 29 | +* [GeoJSON](https://datatracker.ietf.org/doc/html/rfc7946) |
| 30 | +* [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) and other tabular formats, such as XLSX and ODS. |
| 31 | +* [XML](https://www.w3.org/TR/xml/) |
| 32 | + |
| 33 | +Based on your [research](research.md), you need to decide which publication formats to support. |
16 | 34 |
|
17 |
| -Our preferred schema language is [JSON Schema v0.4](https://tools.ietf.org/html/draft-zyp-json-schema-04). |
| 35 | +It is [best practice](https://www.w3.org/TR/dwbp/#MultipleFormats) for data publishers to provide data in multiple formats, so that as many users as possible can use the data without first having to transform it to their preferred format. Therefore, you should consider how to support publication in multiple formats. |
18 | 36 |
|
19 |
| -This allows us to provide field structures and definitions. Although less expressive than other schema languages, the constraints of JSON Schema enable us to focus on keeping data simple enough for a wide range of users. |
| 37 | +On a technical level, the recommended approach is to use JSON as the primary format around which a standard's tools are built, and to provide support for other formats through conversion tooling. Depending on the user needs identified in your research, a standard's documentation site and tooling might present an alternative format, such as CSV, as the primary format. |
20 | 38 |
|
21 |
| -We generally use simple CSV files to represent codelists. |
| 39 | +Open Data Services' reusable tools for documenting, converting and validating data are built around JSON. If your research surfaces demand for a different primary format that cannot be supported through conversion to JSON, you should consider the potential costs associated with authoring new tooling. |
22 | 40 |
|
23 |
| -We have a number of extensions to JSON Schema 0.4 we use (documented below). |
| 41 | +```{admonition} Example: The Open Contracting Data Standard |
| 42 | +:class: note |
24 | 43 |
|
25 |
| -## Serializations |
| 44 | +The primary publication format of the Open Contracting Data Standard is JSON, but CSV and spreadsheet formats are also supported via conversion tooling. For more information, see [Serialization (Open Contracting Data Standard Documentation)](https://standard.open-contracting.org/latest/en/guidance/build/serialization/#serialization). |
26 | 45 |
|
27 |
| -We design with a range of serializations in mind, and, where possible, to enable round-tripping of data between different serializations. |
| 46 | +``` |
28 | 47 |
|
29 |
| -In particular, through [flatten-tool](http://flatten-tool.readthedocs.io) we design with support for: |
| 48 | +```{admonition} Example: 360Giving |
| 49 | +:class: note |
30 | 50 |
|
31 |
| -- Structured JSON serialization; |
32 |
| -- Excel serialization; |
33 |
| -- CSV serialization. |
| 51 | +The 360Giving Data Standard supports both spreadsheet and JSON formats, but most 360Giving data is published in spreadsheet format. Therefore, the documentation for the standard is primarily focussed on the spreadsheet format. For more information, see [Choosing your file format (360Giving Data Standard Documentation)](https://standard.threesixtygiving.org/en/latest/guidance/prepare-data/#choosing-your-file-format). |
34 | 52 |
|
35 |
| -[Flatten-tool](http://flatten-tool.readthedocs.io/en/latest/unflatten/#human-friendly-headings-using-a-json-schema-with-titles) can use the titles in a schema to provide 'friendly' column headings, and with use of a [metatab](http://flatten-tool.readthedocs.io/en/latest/unflatten/#metadata-tab) also supports packaging meta-data and options to control how spreadsheets are parsed. |
| 53 | +``` |
36 | 54 |
|
37 |
| -## Extended JSON schema |
| 55 | +```{seealso} |
38 | 56 |
|
39 |
| -We use a number of custom properties in our JSON Schema implementation. A [patch against JSON Schema 0.4 to include these is found here](https://github.com/open-contracting/standard/blob/6e538252dd08344222b5cd16b864ed0a2a866197/standard/schema/metaschema/meta-schema-patch.json). |
| 57 | +* 🧩 [Conversion tools](../components/index.md#conversion-tools) |
| 58 | +* 💡 [Spreadsheet first schema design](../patterns/schema.md#spreadsheet-first) |
40 | 59 |
|
41 |
| -### Codelist properties |
| 60 | +``` |
42 | 61 |
|
43 |
| -- `codelist` - the filename of a .csv file that contains at least a `Code` column. Used by the CoVE validator to check for acceptable values. |
44 |
| -- `openCodelist` - a boolean value to indicate whether values can **only** come from the codelist, or whether additional values not on the codelist are permitted. When `openCodelist` = 'true' then encountering a value not on the codelist should generate a warning. When `openCodelist` = 'false' then encountering a value not on the codelist should generate an error. |
| 62 | +## Choose a schema language |
45 | 63 |
|
46 |
| -### Deprecation properties |
| 64 | +A schema defines the meaning, structure and format of data. |
47 | 65 |
|
48 |
| -> "Deprecation is the discouragement of use of some terminology, feature, design, or practice; typically because it has been superseded or is no longer considered efficient or safe – but without completely removing it or prohibiting its use." |
| 66 | +Based on your chosen publication formats, you need to decide on a language in which to document the schema for a standard. |
49 | 67 |
|
50 |
| -See: [Deprecation (Wikipedia)](https://en.wikipedia.org/wiki/Deprecation) |
| 68 | +For standards that support JSON as a publication format, the preferred approach is to use [JSON Schema](https://json-schema.org/) to document the canonical schema for the standard, specifically [JSON Schema Draft 2020-12](https://json-schema.org/draft/2020-12). Although less expressive than other schema languages, the constraints of JSON Schema enable a focus on keeping data simple enough for a wide range of users. |
51 | 69 |
|
52 |
| -- `deprecated` - and object to indicate that the field is deprecated, consisting of fields for: |
53 |
| - - `description` - a message that explains the deprecation, and that should be presented by validators to any publisher using this field. |
54 |
| - - `deprecatedVersion` - a string indicating the version in which the field was first deprecated. |
| 70 | +If you choose to support other publication formats alongside JSON, you should consider whether to provide secondary, derived schemas for those formats. |
55 | 71 |
|
56 |
| -We also use the column title `Deprecated` with a version number as the cell value in codelist CSV files when a code has been deprecated. |
| 72 | +```{admonition} Example: Open Referral |
| 73 | +:class: note |
57 | 74 |
|
58 |
| -### Merge strategies |
| 75 | +The canonical schema for the Open Referral Data Specifications is documented using JSON Schema. However, a secondary schema is provided for the Tabular Data Package format, which is derived from the canonical schema. For more information, see [Serialization and Publication Formats (Open Referral Data Specifications Documentation)](http://docs.openreferral.org/en/latest/hsds/serialization.html). |
59 | 76 |
|
60 |
| -The Open Contracting Data Standard describes an approach to merge together releases of data from different point in time. We add a number of properties to indicate how merging should be approached. |
| 77 | +``` |
61 | 78 |
|
62 |
| -- `omitWhemMerged` |
63 |
| -- `wholeListMerge` |
64 |
| -- `versionId` |
| 79 | +```{admonition} History |
| 80 | +:class: dropdown |
| 81 | +Previously, the recommended approach was to use [JSON Schema Draft 4](https://json-schema.org/draft-04/draft-zyp-json-schema-04). However, Draft 2020-12 contains several useful features not available in Draft 4. |
| 82 | +``` |
65 | 83 |
|
66 |
| -Behaviour for these is [described in the OCDS documentation](http://standard.open-contracting.org/1.1/en/schema/merging/#merging-rules). |
| 84 | +## Choose a codelist format |
67 | 85 |
|
68 |
| -## Design patterns |
| 86 | +A codelist defines a set of permissible values for a field. |
69 | 87 |
|
70 |
| -Developing a good schema is an art as much as a science. It requires sensitivity to the needs of both data producers and data users, and an understanding of the incentive structures that will drive adoption of a standard. |
| 88 | +The recommended approach is to document codes, titles and descriptions in a CSV file, according to the [Open Data Services Codelist Schema](https://codelist-schema.readthedocs.io/). |
71 | 89 |
|
72 | 90 | ```{seealso}
|
73 |
| -[Schema patterns](../patterns/schema.md) |
74 |
| -The following section provides links to a non-exhaustive set of design patterns that can be drawn upon when developing a schema. |
| 91 | +* 💡 [CSV codelists](../patterns/schema.md#csv-codelists) |
75 | 92 | ```
|
| 93 | + |
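As a minimal sketch of how a CSV codelist might be consumed, the following Python reads a codelist and checks whether a value appears on it. The `Code`, `Title` and `Description` column names and the sample codes are assumptions made for illustration; the Open Data Services Codelist Schema remains the authority on the required columns.

```python
import csv
import io

# Hypothetical codelist content. The column names and codes are assumed
# for illustration and are not taken from any real standard.
CODELIST_CSV = """Code,Title,Description
open,Open,Any supplier can bid.
selective,Selective,Only qualified suppliers are invited to bid.
"""


def load_codes(text):
    """Return the set of values found in the Code column of a CSV codelist."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Code"] for row in reader}


def is_on_codelist(value, codes):
    """Report whether a field value is one of the codes on the codelist."""
    return value in codes


codes = load_codes(CODELIST_CSV)
```

Keeping codes in a plain CSV file means non-technical contributors can review and edit them in a spreadsheet, while tools can still load them with a standard CSV parser.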
| 94 | +## Choose your packaging formats |
| 95 | + |
| 96 | +A packaging format is a structured way of bundling together data and, sometimes, metadata. You can think of a packaging format as a container for multiple records, texts or documents. |
| 97 | + |
| 98 | +Packaging formats aid interoperability and reuse by providing tool developers and analysts with predictable and consistent approaches to grouping, streaming and pagination. |
| 99 | + |
| 100 | +Based on your chosen publication formats and the requirements identified in your research, you need to decide on a packaging format or formats for each publication format. |
| 101 | + |
| 102 | +The recommended approach is to consider providing: |
| 103 | + |
| 104 | +* A small file and API response format for files that are small enough to fit into memory or are published via API. |
| 105 | +* A bulk download format for files that are too large to fit into memory. |
| 106 | + |
| 107 | +```{admonition} Example: The Open Fibre Data Standard |
| 108 | +:class: note |
| 109 | +
|
| 110 | +The Open Fibre Data Standard supports publication in JSON, GeoJSON and CSV formats. For the JSON and GeoJSON formats, it provides containers for publishing one or more networks and options to support pagination and streaming: |
| 111 | +
|
| 112 | +Format | Small files and API responses | Streaming |
| 113 | +--- | --- | --- |
| 114 | +JSON | A JSON object with an embedded array of `Network` objects, with an optional `.links` object for pagination | A [JSON Lines](https://jsonlines.org/) file in which each line is a `Network` object. |
| 115 | +GeoJSON | GeoJSON [feature collections](https://datatracker.ietf.org/doc/html/rfc7946#section-3.3), with an optional `.links` object for pagination | [Newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) files |
| 116 | +
|
| 117 | +``` |
| 118 | + |
| 119 | +```{seealso} |
| 120 | +* 💡 [Packaging](../patterns/schema.md#packaging) |
| 121 | +* 💬 [Packaging multiple networks · Issue #51 · Open-Telecoms-Data/open-fibre-data-standard](https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/51) |
| 122 | +* 💬 [Deprecate remaining package metadata and add bulk data format · Issue #1084 · open-contracting/standard](https://github.com/open-contracting/standard/issues/1084) |
| 123 | +* 💬 [Add a metadata package schema · Issue #200 · GFDRR/rdl-standard](https://github.com/GFDRR/rdl-standard/issues/200) |
| 124 | +``` |
| 125 | + |
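The difference between a small file format and a bulk download format can be sketched in Python as follows. This is an illustration under stated assumptions rather than the packaging used by any particular standard: the `networks` key, the record fields and the function names are all hypothetical.

```python
import io
import json

# Hypothetical records standing in for the objects a standard describes.
networks = [
    {"id": "a", "name": "Network A"},
    {"id": "b", "name": "Network B"},
]


def to_small_file(records):
    """Package records as a single JSON object with an embedded array.

    The "networks" key is an assumed name for the embedded array.
    """
    return json.dumps({"networks": records})


def to_json_lines(records):
    """Package records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def read_json_lines(stream):
    """Yield one record at a time, so the whole file need not be in memory."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


small = to_small_file(networks)
bulk = to_json_lines(networks)
streamed = list(read_json_lines(io.StringIO(bulk)))
```

The single-object package must be parsed as a whole before any record can be read, whereas the JSON Lines package can be processed record by record, which is why it suits files that are too large to fit into memory.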
| 126 | +## Author your schema and codelists |
| 127 | + |
| 128 | +Authoring the schema and codelists for a standard involves documenting the standard's data model in your chosen schema language and codelist formats. |
| 129 | + |
| 130 | +JSON Schema specifies a number of keywords to describe and constrain JSON data. For example, the `type` keyword is used to restrict a field to a specific type, like "string" or "number", whilst the `title` keyword is used to provide a human-readable title for a field. |
| 131 | + |
| 132 | +As well as the keywords specified in JSON Schema, the [Open Data Services JSON Schema Extension](https://json-schema-extension.readthedocs.io/) specifies additional keywords for linking fields to [CSV codelists](../patterns/schema.md#csv-codelists), and providing information about [deprecated fields](../patterns/schema.md#deprecated-fields). |
| 133 | + |
| 134 | +```{seealso} |
| 135 | +💡 [Schema patterns](../patterns/schema.md) |
| 136 | +``` |
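To illustrate how keywords describe and constrain data, the sketch below expresses a small schema as a Python dictionary and applies a hand-written check of the `type` keyword. The field names and the `status.csv` filename are hypothetical, and the `codelist` and `openCodelist` keywords are shown on the assumption that the Open Data Services JSON Schema Extension defines them as in the Open Contracting Data Standard; check the extension's documentation before relying on them. The checker is deliberately tiny and is not a substitute for a JSON Schema validation library.

```python
# A hypothetical schema fragment; field names and filenames are illustrative.
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "status": {
            "title": "Status",
            "type": "string",
            # Assumed extension keywords linking the field to a CSV codelist.
            "codelist": "status.csv",
            "openCodelist": False,
        },
        "amount": {"title": "Amount", "type": "number"},
    },
}

# Python types corresponding to the JSON Schema types used above.
JSON_TYPES = {"string": str, "number": (int, float)}


def type_errors(instance, schema):
    """Return a message for each present field whose value has the wrong type."""
    errors = []
    for name, subschema in schema["properties"].items():
        if name not in instance:
            continue
        value = instance[name]
        expected = JSON_TYPES[subschema["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{subschema['title']} must be of type {subschema['type']}"
            )
    return errors


valid = type_errors({"status": "active", "amount": 10}, schema)
invalid = type_errors({"status": 5}, schema)
```

Here `title` supplies the human-readable name used in the error message, while `type` supplies the constraint being checked.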