
feat: Adopt UAX-31 compliant dataset names #702

Merged
merged 4 commits into main from the uax-31 branch on Jul 5, 2025

Conversation

@mattijn (Contributor) commented Jun 1, 2025

This PR updates dataset names in datapackage.json to be UAX-31 compliant, making them valid Python identifiers. The changes only affect the name fields while keeping the path fields unchanged.

I'm not sure if this is sufficient to close #695, but otherwise it can serve as a starting point for a discussion towards closing the related issue.

Changes

  • Converted hyphens to underscores in names
  • Removed file extensions (.json, .csv, .png) from all dataset names
  • Added icon_ prefix to image files
  • Ensured all names are valid Python identifiers

Examples of Changes

- "name": "7zip.png"
+ "name": "icon_7zip"

- "name": "annual-precip.json"
+ "name": "annual_precip"

The only oddities relate to the arrow and parquet extensions:

- "name": "flights_200k.arrow"
+ "name": "flights_200k_arrow"

- "name": "flights_200k.json"
+ "name": "flights_200k"

- "name": "flights_3m.parquet"
+ "name": "flights_3m_parquet"

Regarding flights_200k, there are both an arrow and a json variant. I therefore opted to add _arrow to the name of the arrow variant, since arrow is not a common file type; the same rationale was applied to the parquet file of flights_3m.
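
A minimal sketch of these renaming rules, assuming a hypothetical helper function (this PR only edits datapackage.json by hand, so this is an illustration rather than actual code):

from pathlib import Path

def proposed_name(filename: str) -> str:
    # Hypothetical illustration of the renaming rules described above.
    path = Path(filename)
    stem = path.stem.replace("-", "_")            # hyphens -> underscores
    if path.suffix == ".png":                     # image files get an icon_ prefix
        stem = f"icon_{stem}"
    if path.suffix in {".arrow", ".parquet"}:     # uncommon formats keep the extension as a suffix
        stem = f"{stem}_{path.suffix.lstrip('.')}"
    assert stem.isidentifier(), stem              # must be a valid Python identifier
    return stem

proposed_name("7zip.png")            # -> "icon_7zip"
proposed_name("annual-precip.json")  # -> "annual_precip"
proposed_name("flights_200k.arrow")  # -> "flights_200k_arrow"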

Rationale

  • Makes dataset names valid Python identifiers
  • Improves consistency in naming convention
  • Keeps file paths unchanged
  • Follows UAX-31 standard for general-purpose identifiers

Testing

  • Verified all new names are valid Python identifiers
  • Confirmed all path fields remain unchanged
  • Checked that all names are unique

Checklist

  • All dataset names are UAX-31 compliant
  • File paths remain unchanged
  • Image files have appropriate prefixes
  • No duplicate names

@mattijn changed the title from "Adopt UAX-31 compliant dataset names" to "feat: Adopt UAX-31 compliant dataset names" on Jun 1, 2025
@dsmedia (Collaborator) commented Jun 5, 2025

Hi @mattijn - thanks for tackling this. datapackage.json is a generated file, and the resource.name is derived from the filename. I think if we manually change datapackage.json it may get overwritten the next time the build_datapackage.py script is run. @dangotbanned's comprehensive script builds off the source file, _data/datapackage_additions.toml, and generates the resource name from the filename (path).

Would it make sense here to incorporate the logic you propose (when to prefix, etc.) into build_datapackage.py directly?

I'm wondering if we should avoid tying the dataset name to the existence of a collision in the filename root, because it could create temporal instability in the names (e.g., if weather.json is added tomorrow, does the name of weather.csv need to change from weather to weather_csv?) and also because the chosen name may end up in a different state depending on the order in which the script processes files when looking for conflicts.

Also, the prefix rule could be made clearer: if the prefix icon_ is being added because 7zip starts with a numeral, what prefix would 7zip.csv get?

Perhaps we could establish deterministic rules based solely on each file's properties (path, mediatype) rather than context-dependent collision detection?

@domoritz - As a side note, I noticed vega/vega-lite#4942 involves Unicode characters (μ, σ) in field names causing problems. I believe UAX-31 explicitly supports these as valid identifiers, so perhaps ecosystem-wide adoption would provide a consistent standard for identifier handling?

@domoritz (Member) commented:
Anything needed here?

@dsmedia self-requested a review on June 16, 2025
@dsmedia (Collaborator) commented Jun 16, 2025

Anything needed here?

@domoritz I think we're on a good track here, but we need to address two issues first:

  1. These changes will be overwritten
    The datapackage.json file is auto-generated by build_datapackage.py, which derives resource names directly from filenames. Any manual edits to datapackage.json (like in this PR) will be lost the next time the build script runs. The solution needs to be implemented in build_datapackage.py itself.

  2. The collision-based naming approach lacks consistency

  • The current approach creates temporal instability and order-dependency issues:
    • Temporal instability: If weather.json is added tomorrow, would weather.csv suddenly need to change from weather to weather_csv? This would break existing code.
    • Order dependency: The final names depend on the order files are processed when checking for collisions.
    • Unclear prefix rules: If icon_ is added to 7zip (for starting with a numeral), what happens to 7zip.csv?

My suggestion above was:

Perhaps we could establish deterministic rules based solely on each file's properties (path, mediatype) rather than context-dependent collision detection?

Any thoughts here, @mattijn?

@mattijn (Contributor, Author) commented Jun 17, 2025

The thought I have is: I agree completely :), but I have not been able to prioritise time for this yet.

This commit introduces a new naming strategy to ensure every resource has a unique, UAX-31 compliant identifier.

The new implementation works as follows:
- A preliminary scan of the `/data` directory identifies dataset basenames that have multiple file extensions.
- A new `make_uax31_name` function sanitizes the filename to create a valid Python identifier (replaces hyphens, prefixes numbers).
- For datasets with multiple formats, the file format is appended as a suffix to the name to guarantee uniqueness.
- Note: Adding a new format for an existing dataset will rename the original resource to include a suffix.
@dsmedia (Collaborator) commented Jul 1, 2025

@dangotbanned, @mattijn: In 255690a I've tried to incorporate a slightly modified version of the logic above into the build_datapackage.py script itself. The new naming strategy ensures every resource has a unique, UAX-31 compliant identifier.

The new implementation works as follows:

  • A preliminary scan of the /data directory identifies dataset basenames that have multiple file extensions.
  • A new make_uax31_name function sanitizes the filename to create a valid Python identifier (replaces hyphens, prefixes numbers).
  • For datasets with multiple formats, the file format is appended as a suffix to the name to guarantee uniqueness.
  • Note: Adding a new format for an existing dataset will rename the original resource to include a suffix, so this would be something to watch for. One way to prevent this would be to include a filetype suffix in every name, but that seemed to come at a cost to clarity. Given the infrequency of dataset additions and the rarity of the edge case, it seems an acceptable compromise.
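
A rough sketch of the logic described above, assuming it lives in build_datapackage.py (only make_uax31_name is named in this thread; the other helpers and the exact digit-prefix rule are placeholders, not the actual implementation):

from collections import Counter
from pathlib import Path

DATA_DIR = Path("data")  # assumed location of the dataset files

def find_multi_format_stems(data_dir: Path) -> set[str]:
    # Preliminary scan: basenames that exist with more than one file extension.
    counts = Counter(p.stem for p in data_dir.iterdir() if p.is_file())
    return {stem for stem, n in counts.items() if n > 1}

def make_uax31_name(path: Path, multi_format: set[str]) -> str:
    # Sanitize the filename into a valid (UAX-31 / Python) identifier.
    name = path.stem.replace("-", "_").replace(".", "_")
    if name[0].isdigit():              # identifiers may not start with a digit;
        name = f"icon_{name}" if path.suffix == ".png" else f"_{name}"  # this prefix rule is an assumption
    if path.stem in multi_format:      # disambiguate datasets published in several formats
        name = f"{name}_{path.suffix.lstrip('.')}"
    assert name.isidentifier(), f"Generated name '{name}' is not a valid identifier"
    return name

# A dataset that ships as both .json and .arrow would get an "_arrow" suffix on its arrow variant.
multi_format = find_multi_format_stems(DATA_DIR)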

Are there other edge cases we need to consider here? Is the implementation OK?

@dsmedia requested a review from @dangotbanned on July 1, 2025
@mattijn (Contributor, Author) commented Jul 2, 2025

Nice! One observation while reading the code diff: I see a few camelCase names (londonBoroughs and londonCentroids) where all the others use snake_case names.

@dsmedia (Collaborator) commented Jul 3, 2025

Great catch, @mattijn. Given Altair is a key downstream partner, which case convention would be most consistent with the Altair style? My assumption is snake_case, but I wanted to get your recommendation.

@mattijn (Contributor, Author) commented Jul 3, 2025

Python ecosystem applications would indeed use snake_case. If that is a possibility for the naming convention, that would be great.

@domoritz (Member) commented Jul 3, 2025

+1 to snake case. It's the superior convention.

Updates the `build_datapackage.py` script to ensure all generated dataset names are `snake_case`.

The changes include:
- A new `to_snake_case` helper function to convert camelCase strings.
- The `make_uax31_name` function now uses this helper to sanitize all dataset names before they are written to `datapackage.json`.
- This resolves issues where filenames like `londonBoroughs.json` would result in a non-standard `camelCase` identifier.
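
A sketch of what such a to_snake_case helper could look like (the actual implementation in build_datapackage.py may differ):

import re

def to_snake_case(name: str) -> str:
    # Insert an underscore before each uppercase letter that follows a lowercase
    # letter or digit, then lowercase the result.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

to_snake_case("londonBoroughs")   # -> "london_boroughs"
to_snake_case("londonCentroids")  # -> "london_centroids"
to_snake_case("annual_precip")    # already snake_case -> "annual_precip"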
@dsmedia (Collaborator) commented Jul 3, 2025

Should we add a CI check to ensure all dataset names are UAX-31 compliant, since downstream libraries like Altair will eventually depend on these names being valid identifiers? There may be several ways to get this done. One idea is for a small script to be run after the build to validate the generated datapackage.json. This would automatically catch any invalid names and prevent them from being merged in the future.
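
Such a post-build check could be as small as the following sketch (a hypothetical script, not something added in this PR):

import json
from pathlib import Path

# Fail the build if any generated resource name is not a valid Python identifier.
resources = json.loads(Path("datapackage.json").read_text())["resources"]
invalid = [r["name"] for r in resources if not r["name"].isidentifier()]
if invalid:
    raise SystemExit(f"Non-compliant resource names: {invalid}")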

@domoritz (Member) commented Jul 4, 2025

Tests sound good.

@dsmedia (Collaborator) commented Jul 4, 2025

Actually, I think we may already be fine here: I've included a basic but effective check. The script will fail loudly if it generates a non-compliant name, thanks to this assertion:

# Validate the name is a valid identifier
assert name.isidentifier(), f"Generated name '{name}' is not a valid identifier"

I believe this handles the immediate concern for this PR, as it would cause the build to fail.

@domoritz merged commit ada523a into main on Jul 5, 2025 (2 checks passed)
@domoritz deleted the uax-31 branch on July 5, 2025
Successfully merging this pull request may close these issues.

Adopt general-purpose-identifiers as dataset names