
feat: Adopt UAX-31 compliant dataset names #702

Merged
merged 4 commits into main from the uax-31 branch on Jul 5, 2025

Conversation

@mattijn (Contributor) commented Jun 1, 2025

This PR updates dataset names in datapackage.json to be UAX-31 compliant, making them valid Python identifiers. The changes only affect the name fields while keeping the path fields unchanged.

I'm not sure if this is sufficient to close #695, but otherwise it can serve as a starting point for a discussion towards closing the related issue.

Changes

  • Converted hyphens to underscores in names
  • Removed file extensions (.json, .csv, .png) from all dataset names
  • Added icon_ prefix to image files
  • Ensured all names are valid Python identifiers

Examples of Changes

- "name": "7zip.png"
+ "name": "icon_7zip"

- "name": "annual-precip.json"
+ "name": "annual_precip"

The only oddities relate to the arrow and parquet extensions:

- "name": "flights_200k.arrow"
+ "name": "flights_200k_arrow"

- "name": "flights_200k.json"
+ "name": "flights_200k"

- "name": "flights_3m.parquet"
+ "name": "flights_3m_parquet"

Regarding flights_200k, there are both an arrow and a json variant. I therefore opted to add _arrow to the name of the arrow variant, since arrow is not a common file type; the same rationale was applied to the parquet file of flights_3m.
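
A minimal sketch of these renaming rules, assuming a hypothetical helper function (this PR only edits datapackage.json by hand, so this is an illustration rather than actual code):

from pathlib import Path

def proposed_name(filename: str) -> str:
    # Hypothetical illustration of the renaming rules described above.
    path = Path(filename)
    stem = path.stem.replace("-", "_")            # hyphens -> underscores
    if path.suffix == ".png":                     # image files get an icon_ prefix
        stem = f"icon_{stem}"
    if path.suffix in {".arrow", ".parquet"}:     # uncommon formats keep the extension as a suffix
        stem = f"{stem}_{path.suffix.lstrip('.')}"
    assert stem.isidentifier(), stem              # must be a valid Python identifier
    return stem

proposed_name("7zip.png")            # -> "icon_7zip"
proposed_name("annual-precip.json")  # -> "annual_precip"
proposed_name("flights_200k.arrow")  # -> "flights_200k_arrow"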

Rationale

  • Makes dataset names valid Python identifiers
  • Improves consistency in naming convention
  • Keeps file paths unchanged
  • Follows UAX-31 standard for general-purpose identifiers

Testing

  • Verified all new names are valid Python identifiers
  • Confirmed all path fields remain unchanged
  • Checked that all names are unique

Checklist

  • All dataset names are UAX-31 compliant
  • File paths remain unchanged
  • Image files have appropriate prefixes
  • No duplicate names

@mattijn changed the title from "Adopt UAX-31 compliant dataset names" to "feat: Adopt UAX-31 compliant dataset names" on Jun 1, 2025
@dsmedia (Collaborator) commented Jun 5, 2025

Hi @mattijn - thanks for tackling this. datapackage.json is a generated file, and the resource.name is derived from the filename. I think if we manually change datapackage.json it may get overwritten the next time the build_datapackage.py script is run. @dangotbanned's comprehensive script builds off the source file, _data/datapackage_additions.toml, and generates the resource name from the filename (path).

Would it make sense here to incorporate the logic you propose (when to prefix, etc.) into build_datapackage.py directly?

I'm wondering if we should avoid tying the dataset name to the existence of a collision in the filename root, because it could create temporal instability in the names (e.g., if weather.json is added tomorrow, does the name of weather.csv need to change from weather to weather_csv?) and also because the chosen name may end up in a different state depending on the order in which the script processes files when looking for conflicts.

Also, the prefix rule could be made clearer: if the prefix icon_ is being added because 7zip starts with a numeral, what prefix would 7zip.csv get?

Perhaps we could establish deterministic rules based solely on each file's properties (path, mediatype) rather than context-dependent collision detection?

@domoritz - As a side note, I noticed vega/vega-lite#4942 involves Unicode characters (μ, σ) in field names causing problems. I believe UAX-31 explicitly supports these as valid identifiers, so perhaps ecosystem-wide adoption would provide a consistent standard for identifier handling?

@domoritz (Member) commented:
Anything needed here?

@dsmedia self-requested a review on June 16, 2025
@dsmedia (Collaborator) commented Jun 16, 2025

Anything needed here?

@domoritz I think we're on a good track here, but we need to address two issues first:

  1. These changes will be overwritten
    The datapackage.json file is auto-generated by build_datapackage.py, which derives resource names directly from filenames. Any manual edits to datapackage.json (like in this PR) will be lost the next time the build script runs. The solution needs to be implemented in build_datapackage.py itself.

  2. The collision-based naming approach lacks consistency

  • The current approach creates temporal instability and order-dependency issues:
    • Temporal instability: If weather.json is added tomorrow, would weather.csv suddenly need to change from weather to weather_csv? This would break existing code.
    • Order dependency: The final names depend on the order files are processed when checking for collisions.
    • Unclear prefix rules: If icon_ is added to 7zip (for starting with a numeral), what happens to 7zip.csv?

My suggestion above was:

Perhaps we could establish deterministic rules based solely on each file's properties (path, mediatype) rather than context-dependent collision detection?

Any thoughts here, @mattijn?

@mattijn (Contributor, Author) commented Jun 17, 2025

The thought I have is: I agree completely :), but I have not been able to prioritise time for this yet.

This commit introduces a new naming strategy to ensure every resource has a unique, UAX-31 compliant identifier.

The new implementation works as follows:
- A preliminary scan of the `/data` directory identifies dataset basenames that have multiple file extensions.
- A new `make_uax31_name` function sanitizes the filename to create a valid Python identifier (replaces hyphens, prefixes numbers).
- For datasets with multiple formats, the file format is appended as a suffix to the name to guarantee uniqueness.
- Note: Adding a new format for an existing dataset will rename the original resource to include a suffix.
@dsmedia (Collaborator) commented Jul 1, 2025

@dangotbanned, @mattijn: In 255690a I've tried to incorporate a slightly modified version of the logic above into the build_datapackage.py script itself. The new naming strategy ensures every resource has a unique, UAX-31 compliant identifier.

The new implementation works as follows:

  • A preliminary scan of the /data directory identifies dataset basenames that have multiple file extensions.
  • A new make_uax31_name function sanitizes the filename to create a valid Python identifier (replaces hyphens, prefixes numbers).
  • For datasets with multiple formats, the file format is appended as a suffix to the name to guarantee uniqueness.
  • Note: Adding a new format for an existing dataset will rename the original resource to include a suffix, so this would be something to watch for. One way to prevent this would be to include a filetype suffix in every name, but that seemed to come at a cost to clarity. Given the infrequency of dataset additions and the rarity of the edge case, it seems an acceptable compromise.
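
A rough sketch of the logic described above, assuming it lives in build_datapackage.py (only make_uax31_name is named in this thread; the other helpers and the exact digit-prefix rule are placeholders, not the actual implementation):

from collections import Counter
from pathlib import Path

DATA_DIR = Path("data")  # assumed location of the dataset files

def find_multi_format_stems(data_dir: Path) -> set[str]:
    # Preliminary scan: basenames that exist with more than one file extension.
    counts = Counter(p.stem for p in data_dir.iterdir() if p.is_file())
    return {stem for stem, n in counts.items() if n > 1}

def make_uax31_name(path: Path, multi_format: set[str]) -> str:
    # Sanitize the filename into a valid (UAX-31 / Python) identifier.
    name = path.stem.replace("-", "_").replace(".", "_")
    if name[0].isdigit():              # identifiers may not start with a digit;
        name = f"icon_{name}" if path.suffix == ".png" else f"_{name}"  # this prefix rule is an assumption
    if path.stem in multi_format:      # disambiguate datasets published in several formats
        name = f"{name}_{path.suffix.lstrip('.')}"
    assert name.isidentifier(), f"Generated name '{name}' is not a valid identifier"
    return name

# A dataset that ships as both .json and .arrow would get an "_arrow" suffix on its arrow variant.
multi_format = find_multi_format_stems(DATA_DIR)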

Are there other edge cases we need to consider here? Is the implementation OK?

@dsmedia requested a review from @dangotbanned on July 1, 2025
@mattijn (Contributor, Author) commented Jul 2, 2025

Nice! One observation while reading the code diff: I see a few camelCase names (londonBoroughs and londonCentroids) where all the others use snake_case names.

@dsmedia (Collaborator) commented Jul 3, 2025

Great catch, @mattijn. Given Altair is a key downstream partner, which case convention would be most consistent with the Altair style? My assumption is snake_case, but I wanted to get your recommendation.

@mattijn (Contributor, Author) commented Jul 3, 2025

Python ecosystem applications would indeed use snake_case. If that is a possibility for the naming convention, that would be great.

@domoritz (Member) commented Jul 3, 2025

+1 to snake case. It's the superior convention.

Updates the `build_datapackage.py` script to ensure all generated dataset names are `snake_case`.

The changes include:
- A new `to_snake_case` helper function to convert camelCase strings.
- The `make_uax31_name` function now uses this helper to sanitize all dataset names before they are written to `datapackage.json`.
- This resolves issues where filenames like `londonBoroughs.json` would result in a non-standard `camelCase` identifier.
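
A sketch of what such a to_snake_case helper could look like (the actual implementation in build_datapackage.py may differ):

import re

def to_snake_case(name: str) -> str:
    # Insert an underscore before each uppercase letter that follows a lowercase
    # letter or digit, then lowercase the result.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

to_snake_case("londonBoroughs")   # -> "london_boroughs"
to_snake_case("londonCentroids")  # -> "london_centroids"
to_snake_case("annual_precip")    # already snake_case -> "annual_precip"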
@dsmedia (Collaborator) commented Jul 3, 2025

Should we add a CI check to ensure all dataset names are UAX-31 compliant, since downstream libraries like Altair will eventually depend on these names being valid identifiers? There may be several ways to get this done. One idea is for a small script to be run after the build to validate the generated datapackage.json. This would automatically catch any invalid names and prevent them from being merged in the future.
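
Such a post-build check could be as small as the following sketch (a hypothetical script, not something added in this PR):

import json
from pathlib import Path

# Fail the build if any generated resource name is not a valid Python identifier.
resources = json.loads(Path("datapackage.json").read_text())["resources"]
invalid = [r["name"] for r in resources if not r["name"].isidentifier()]
if invalid:
    raise SystemExit(f"Non-compliant resource names: {invalid}")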

@domoritz (Member) commented Jul 4, 2025

Tests sound good.

@dsmedia (Collaborator) commented Jul 4, 2025

Actually, I think we may already be fine here: I've included a basic but effective check. The script will fail loudly if it generates a non-compliant name, thanks to this assertion:

# Validate the name is a valid identifier
assert name.isidentifier(), f"Generated name '{name}' is not a valid identifier"

I believe this handles the immediate concern for this PR, as it would cause the build to fail.

@domoritz merged commit ada523a into main on Jul 5, 2025 (2 checks passed)
@domoritz deleted the uax-31 branch on July 5, 2025
Successfully merging this pull request may close these issues.

Adopt general-purpose-identifiers as dataset names