Skip to content

GH-731: Avro adapter, output dictionary-encoded fields as enums #779

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

martin-traverse
Copy link
Contributor

What's Changed

Updated ArrowToAvro to output dictionary-encoded string vectors as Avro enums, where possible. Apologies for the delay - busy as usual!

To output dict encoded vectors as enums, a dictionary provider must be supplied to the top level methods with all the required dictionaries. All dictionary values must be present when the schema is written, i.e. before the data blocks are produced. If data is being written as a schema followed by multiple blocks, values added to a dictionary in between blocks will not be included in the schema resulting in an invalid Avro file (in general supply an invalid dictionary mapping will result in invalid output).

Dictionary encoded fields are checked to ensure they are valid Avro enums. If the dictionary encoded field is not a string field, or the string values are not valid Avro enums, the field is decoded and output as literal values. This is done by calling DictionaryEncoder.decode(vector, dictionary), which will consume memory for the vector. An alternative approach would be to decode values one-by-one, however this would require a significant change to the producer pattern since the current producers expect concrete vectors of the output type. Another option would be to throw an error if there are dictionary-encoded vectors that are not string types, i.e. push the responsibility onto client code. I'm not sure which approach is best - happy to take any guidance and I will update the code accordingly.

To read enums back the current approach for decoding is unchanged (the AvroToArrow config has to be set up with a MapDictionaryProvider which is populated when data is read). The last part of the Avro work is to add the capability for reading / writing whole files block-by-block, so there is an opportunity to do something with the top level APIs there, for now the current API works and I've used it in the round trip tests.

Please let me know any feedback, happy to update as needed!

Closes #731.

This comment has been minimized.

@martin-traverse
Copy link
Contributor Author

On reflection, simply decoding a dict-encoded vector is not an option because it will leak. I have pushed an update that will reject dict-encoded fields that are not valid Avro enums (check is during schema production).

I do think automatically decoding dict-encoded fields would be a nicer behaviour. Should we perhaps think about updating the producer pattern to allow for this? E.g. every producer could take an optional index vector, of type BaseIntVector. If that is present, current index is mapped through the index vector which resolves the dictionary index. Would this be ok?

@lidavidm lidavidm added the enhancement PRs that add or improve features. label Jun 2, 2025
@github-actions github-actions bot added this to the 18.4.0 milestone Jun 2, 2025
@lidavidm
Copy link
Member

lidavidm commented Jun 3, 2025

On reflection, simply decoding a dict-encoded vector is not an option because it will leak. I have pushed an update that will reject dict-encoded fields that are not valid Avro enums (check is during schema production).

I do think automatically decoding dict-encoded fields would be a nicer behaviour. Should we perhaps think about updating the producer pattern to allow for this? E.g. every producer could take an optional index vector, of type BaseIntVector. If that is present, current index is mapped through the index vector which resolves the dictionary index. Would this be ok?

So essentially, the "actual" vector would be the dictionary values, and the index vector would contain the "original" encoded vector? I think that sounds OK to me.

@lidavidm
Copy link
Member

lidavidm commented Jun 3, 2025

(You could also perhaps create a non-owning vector implementation that does this unwrapping? Though likely that's more work.)

@martin-traverse
Copy link
Contributor Author

So essentially, the "actual" vector would be the dictionary values, and the index vector would contain the "original" encoded vector? I think that sounds OK to me.

Yes - essentially I want to avoid replicating the type-specific logic. I tried it out and looked ok, just not sure what happens with resetValueVector() though. Another option.... could be a DictionaryDecodingProducer, then the vector type for that producer would be the index type, with a concrete producer as a child. Dictionaries can't be updated in Avro anyway.... I think it works :) Will update the PR later for review.

@martin-traverse martin-traverse force-pushed the feature/avro-dict-encoding branch from c6afc1b to 8a69d12 Compare June 7, 2025 20:50
@martin-traverse
Copy link
Contributor Author

Hi @lidavidm - I have added the dictionary decoding producer, which turned out to be very simple. Now any dictionary encoded fields that are not valid Avro enums will be automatically decoded and output as their concrete type. This does require running a regex over the dictionary entries, but that only has to happen once when the producers are set up.

I do think we will need to change the enum read, so dictionaries are populated during the schema phase rather than the data phase in order to read whole files with multiple blocks. I'd like to keep that change back and do it as part of the next PR, which will be read / write for whole files.

Assuming you are happy with both these points then I think this PR is ready for review :-)

@lidavidm
Copy link
Member

Thanks - I will try to get to reviewing this

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (just one nit), thanks! Sorry for the long wait

import org.apache.avro.io.Encoder;

/**
* Producer that produces decoded values from a dictionary-encoded {@link BaseIntVector}, writes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording nit, but this sounds like the encoded values are ints; instead the encoded values are arbitrary and the encoding is ints, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement PRs that add or improve features.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avro Adapter - Dictionary Encoding
2 participants