-
-
Notifications
You must be signed in to change notification settings - Fork 214
feat: Adopt UAX-31 compliant dataset names #702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @mattijn - thanks for tackling this. Would it make sense here to incorporate the logic you propose (when to prefix, etc.) into I'm wondering if we should avoid tying the dataset name to the existence of a collision in the filename root, because it could create temporal instability in the names (e.g. if Also, the prefix rule could be made clearer. if the prefix Perhaps we could establish deterministic rules based solely on each file's properties (path, mediatype) rather than context-dependent collision detection? @domoritz - As a side note, I noticed vega/vega-lite#4942 involves Unicode characters (μ, σ) in field names causing problems. I believe UAX-31 explicitly supports these as valid identifiers, so perhaps ecosystem-wide adoption would provide a consistent standard for identifier handling? |
Anything needed here? |
@domoritz I think we're on a good track here but I think we need to address two issues first:
My suggestion above is
any thoughts here @mattijn ? |
The thought I have is.. I agree completely:), but I have not been able to prioritise time for this yet.. |
This commit introduces a new naming strategy to ensure every resource has a unique, UAX-31 compliant identifier. The new implementation works as follows: - A preliminary scan of the `/data` directory identifies dataset basenames that have multiple file extensions. - A new `make_uax31_name` function sanitizes the filename to create a valid Python identifier (replaces hyphens, prefixes numbers). - For datasets with multiple formats, the file format is appended as a suffix to the name to guarantee uniqueness. - Note: Adding a new format for an existing dataset will rename the original resource to include a suffix.
@dangotbanned, @mattijn: In 255690a I've tried to incorporate a slightly modified logic from above into the The new implementation works as follows:
Are there other edge cases we need to consider here? Is the implementation OK? |
Nice! One observation while reading the code diff: I see a few camelCase names (londonBoroughs and londonCentroids) where others are all using snake_case names. |
Great catch, @mattijn. Given Altair is a key downstream partner, which case convention would be most consistent with the Altair style? My assumption is snake_case, but I wanted to get your recommendation. |
Python ecosystem applications would perform snake case indeed. If that is a possibility for the naming convention, that would be great. |
+1 to snake case. It's the superior convention. |
Updates the `build_datapackage.py` script to ensure all generated dataset names are `snake_case`. The changes include: - A new `to_snake_case` helper function to convert camelCase strings. - The `make_uax31_name` function now uses this helper to sanitize all dataset names before they are written to `datapackage.json`. - This resolves issues where filenames like `londonBoroughs.json` would result in a non-standard `camelCase` identifier.
Should we add a CI check to ensure all dataset names are UAX-31 compliant, since downstream libraries like Altair will eventually depend on these names being valid identifiers? There may be several ways to get this done. One idea is for a small script to be run after the build to validate the generated datapackage.json. This would automatically catch any invalid names and prevent them from being merged in the future. |
Tests sound good. |
Actually, I think we may already be fine here. I've already included a basic but effective check here. The script will fail loudly if it generates a non-compliant name thanks to this assertion: vega-datasets/scripts/build_datapackage.py Lines 266 to 267 in 73056e4
I believe this handles the immediate concern for this PR, as it would cause the build to fail. |
This PR updates dataset names in
datapackage.json
to be UAX-31 compliant, making them valid Python identifiers. The changes only affect thename
fields while keeping thepath
fields unchanged.I'm not sure if this is sufficient to close #695, but otherwise it can serve as a starting point for a discussion towards closing the related issue.
Changes
.json
,.csv
,.png
) from all dataset namesicon_
prefix to image filesExamples of Changes
Only oddities are related to extension
arrow
andparquet
:Regarding
flights_200k
, there is aarrow
andjson
variant. Therefor opt to add_arrow
to thename
for thearrow
variant, since it is not a common file type, same rationale was applied for theparquet
file offlights_3m
.Rationale
Testing
path
fields remain unchangedChecklist