DRILL-8474: Add Daffodil Format Plugin to Drill: Phase 1 #2989

Open: wants to merge 6 commits into master

cgivre (Contributor) commented May 8, 2025

DRILL-8474: Add Daffodil Format Plugin to Drill

Description

This PR replaces #2836, which is closed. That was done to retain history/comments while squashing numerous debug-related commits together into this PR. This PR also replaces #2909.

Documentation

New format-daffodil module created

Still uses absolute paths for the schemaFileURI, which is cheating: it wouldn't work in a true distributed Drill environment.
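To make the concern concrete, here is a minimal sketch of a check that flags schema URIs that only resolve on the node evaluating them. It is not code from this PR: `SchemaUriSketch` and `isNodeLocal` are hypothetical names, and treating "no scheme" and the `file` scheme as node-local is an assumption made for illustration.

```java
import java.net.URI;

// Hypothetical helper, not part of the PR.
final class SchemaUriSketch {
  private SchemaUriSketch() {}

  // Returns true when the URI has no scheme or uses the "file" scheme, i.e. it
  // points at the filesystem of whichever node happens to resolve it.
  static boolean isNodeLocal(String schemaUri) {
    String scheme = URI.create(schemaUri).getScheme();
    return scheme == null || "file".equalsIgnoreCase(scheme);
  }
}
```

Under these assumptions, `isNodeLocal("file:///tmp/schema.dfdl.xsd")` and `isNodeLocal("/tmp/schema.dfdl.xsd")` both return true, while a URI with another scheme, such as `hdfs://namenode/schemas/schema.dfdl.xsd`, returns false. A node-local schema path only works if every node has the same file at the same location.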

We have yet to work out how Drill can provide access to DFDL schemas in XML form so that their include/import references can be resolved.

The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.

Tests show that this works for data as complex as nested repeating sub-records.

These DFDL types are supported:

  • int
  • long
  • short
  • byte
  • boolean
  • double
  • float (does not work; see bug DAFFODIL-2367)
  • hexBinary
  • string
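As a rough illustration of what supporting these types implies on the Drill side, the sketch below pairs each DFDL simple type in the list with the name of a Drill minor type that could hold it. The pairing is an assumption made for illustration and is not taken from this PR's code; `DfdlTypeSketch` and `drillTypeFor` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of the PR: maps the DFDL simple types listed
// above to the name of a Drill minor type that could plausibly hold each one.
final class DfdlTypeSketch {
  private static final Map<String, String> DFDL_TO_DRILL = new LinkedHashMap<>();

  static {
    DFDL_TO_DRILL.put("int", "INT");
    DFDL_TO_DRILL.put("long", "BIGINT");
    DFDL_TO_DRILL.put("short", "SMALLINT");
    DFDL_TO_DRILL.put("byte", "TINYINT");
    DFDL_TO_DRILL.put("boolean", "BIT");
    DFDL_TO_DRILL.put("double", "FLOAT8");
    // Listed in the PR, but reported there as not working (DAFFODIL-2367).
    DFDL_TO_DRILL.put("float", "FLOAT4");
    DFDL_TO_DRILL.put("hexBinary", "VARBINARY");
    DFDL_TO_DRILL.put("string", "VARCHAR");
  }

  private DfdlTypeSketch() {}

  // Returns the assumed Drill type name, or empty for a type not in the list.
  static Optional<String> drillTypeFor(String dfdlType) {
    return Optional.ofNullable(DFDL_TO_DRILL.get(dfdlType));
  }
}
```

Returning an `Optional` keeps unsupported types explicit: a caller has to decide what to do with a DFDL type that is not in the list rather than receive a silent default.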

Testing

See tests under src/test in the new daffodil contrib module.

mbeckerle and others added 4 commits May 8, 2025 18:45
Requires Daffodil version 3.7.0 or higher.

Note: Fix boxed Boolean vs. boolean problem. Don't use
boxed primitives in Format config objects.
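The note about boxed primitives can be illustrated with a small, self-contained sketch. It does not use Drill's or Jackson's classes: `BoxedConfig`, `PrimitiveConfig`, and the `validationMode` field are hypothetical stand-ins for a format config object. What it shows is general Java behaviour rather than the specific bug fixed here: a boxed `Boolean` that was never set is `null` and fails on unboxing, while a primitive `boolean` defaults to `false`.

```java
// Hypothetical stand-in for a config that stores a flag as a boxed Boolean.
final class BoxedConfig {
  Boolean validationMode; // stays null when the option is not supplied

  boolean isValidationOn() {
    return validationMode; // auto-unboxing throws NullPointerException if null
  }
}

// Hypothetical stand-in for the same config using a primitive boolean.
final class PrimitiveConfig {
  final boolean validationMode;

  PrimitiveConfig() {
    this(false); // an unset flag has a well-defined value
  }

  PrimitiveConfig(boolean validationMode) {
    this.validationMode = validationMode;
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof PrimitiveConfig
        && ((PrimitiveConfig) other).validationMode == validationMode;
  }

  @Override
  public int hashCode() {
    return Boolean.hashCode(validationMode);
  }
}
```

With the primitive field, `equals` and `hashCode` need no null handling, and two configs built from the same value compare equal.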

apache#2835
@cgivre cgivre marked this pull request as draft May 8, 2025 23:11
@cgivre cgivre self-assigned this May 8, 2025
@cgivre cgivre added the enhancement (PRs that add a new functionality to Drill) and new-format (New Format Plugin) labels May 8, 2025
@cgivre cgivre requested a review from jnturton May 9, 2025 15:06
@cgivre cgivre marked this pull request as ready for review May 9, 2025 15:06
@cgivre cgivre changed the title from "DRILL-8474: Add Daffodil Format Plugin to Drill" to "DRILL-8474: Add Daffodil Format Plugin to Drill: Phase 1" May 9, 2025
cgivre (Contributor, Author) commented May 9, 2025

@mbeckerle I'm working on the logic to add queries, similar to the Dynamic UDF capabilities, which would allow a user to import the DFDL files. That will be a separate PR once this is merged.
