feat: deterministic metadata encoding #7437

Merged
merged 3 commits into from
May 1, 2025

timsaucer
Contributor

@timsaucer timsaucer commented Apr 23, 2025

Which issue does this PR close?

Rationale for this change

The ordering of metadata is not consistent because it is stored in a HashMap. In unit tests it can be useful to verify an output against a known hash of its serialized values. With metadata present, that hash is not consistent from run to run.

What changes are included in this PR?

Orders the HashMap keys when encoding, so that the serialized metadata is deterministic.
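As a rough illustration of the idea (not the code in this PR), the sketch below orders the keys of a metadata map before writing them out. The function name `encode_metadata_sorted` and the length-prefixed byte layout are assumptions made for this example only; they are not part of arrow-rs or the Arrow IPC format.

```rust
use std::collections::HashMap;

/// Hypothetical helper, not the arrow-rs implementation: encodes metadata as
/// length-prefixed key/value pairs, visiting keys in sorted order.
fn encode_metadata_sorted(metadata: &HashMap<String, String>) -> Vec<u8> {
    // Collect the entries and sort them by key so the output does not depend
    // on the HashMap's iteration order.
    let mut entries: Vec<(&String, &String)> = metadata.iter().collect();
    entries.sort_by(|a, b| a.0.cmp(b.0));

    let mut out = Vec::new();
    for (key, value) in entries {
        for field in [key, value] {
            // Assumed layout: little-endian u32 length, then the UTF-8 bytes.
            out.extend_from_slice(&(field.len() as u32).to_le_bytes());
            out.extend_from_slice(field.as_bytes());
        }
    }
    out
}

fn main() {
    // Same contents, inserted in opposite orders.
    let forward: HashMap<String, String> = [("a", "1"), ("b", "2"), ("c", "3")]
        .iter()
        .map(|&(k, v)| (k.to_owned(), v.to_owned()))
        .collect();
    let reverse: HashMap<String, String> = [("c", "3"), ("b", "2"), ("a", "1")]
        .iter()
        .map(|&(k, v)| (k.to_owned(), v.to_owned()))
        .collect();

    // With sorted keys, both maps encode to identical bytes.
    assert_eq!(
        encode_metadata_sorted(&forward),
        encode_metadata_sorted(&reverse)
    );
}
```

Sorting at encode time leaves the in-memory type unchanged, which is consistent with the PR having no user-facing changes.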

Are there any user-facing changes?

No.

Example

If you run this example multiple times, you will see the encoding changes from run to run based on the non-deterministic ordering of the hashmap iterator.

use std::{hash::Hasher, sync::Arc};

use arrow::{array::RecordBatch, datatypes::Schema};

fn main() {
    let schema = Arc::new(
        Schema::empty().with_metadata(
            [
                ("a".to_owned(), "1".to_owned()), //
                ("b".to_owned(), "2".to_owned()), //
                ("c".to_owned(), "3".to_owned()), //
                ("d".to_owned(), "4".to_owned()), //
                ("e".to_owned(), "5".to_owned()), //
            ]
            .into_iter()
            .collect(),
        ),
    );
    let batch = RecordBatch::new_empty(schema.clone());

    dbg!(&batch.schema().metadata().keys());

    let mut bytes = Vec::new();
    let mut w = arrow::ipc::writer::StreamWriter::try_new(&mut bytes, &schema).unwrap();
    w.write(&batch).unwrap();
    w.finish().unwrap();

    let mut h = std::hash::DefaultHasher::new();
    h.write(&bytes);
    let h = h.finish();

    eprintln!("{} bytes -- h = {h:x}", bytes.len());
}
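For comparison, here is a standard-library-only sketch of how a test could hash metadata in a way that does not depend on HashMap iteration order. It is a caller-side technique, different from the encoder change in this PR: the entries are copied into a `BTreeMap`, which iterates in key order and implements `Hash`. The helper names `stable_metadata_hash` and `build` are hypothetical.

```rust
// `std::collections::hash_map::DefaultHasher` is the same type as the
// `std::hash::DefaultHasher` used in the example above.
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

/// Hypothetical test helper: hashes metadata independently of the HashMap's
/// iteration order by viewing it through a key-ordered BTreeMap.
fn stable_metadata_hash(metadata: &HashMap<String, String>) -> u64 {
    let ordered: BTreeMap<&String, &String> = metadata.iter().collect();
    let mut hasher = DefaultHasher::new();
    ordered.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical helper that builds a metadata map from string pairs.
fn build(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|&(k, v)| (k.to_owned(), v.to_owned()))
        .collect()
}

fn main() {
    let forward = build(&[("a", "1"), ("b", "2"), ("c", "3")]);
    let reverse = build(&[("c", "3"), ("b", "2"), ("a", "1")]);

    // Equal contents give equal hashes, whatever the insertion order.
    assert_eq!(stable_metadata_hash(&forward), stable_metadata_hash(&reverse));
    eprintln!("h = {:x}", stable_metadata_hash(&forward));
}
```

Note that the algorithm behind `DefaultHasher` is not guaranteed to stay the same across Rust releases, so a hard-coded expected hash may need updating after a toolchain upgrade.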

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Apr 23, 2025
@timsaucer
Contributor Author

I have currently based this off 54.3.1, but I will update it to main after we have completed internal testing.

@timsaucer timsaucer force-pushed the feat/deterministic-metadata-encoding branch from ee273f6 to 5027767 Compare April 24, 2025 11:21
@timsaucer timsaucer marked this pull request as ready for review April 24, 2025 11:21
@github-actions github-actions bot removed the parquet Changes to the parquet crate label Apr 24, 2025
@timsaucer
Contributor Author

Rebased on main, ready for review.

Contributor

@alamb alamb left a comment

Thanks @timsaucer -- I think this PR makes sense to me

Can you please file a ticket? It makes for better release notes and makes it easier to understand what changed.

@timsaucer timsaucer changed the title Feat/deterministic metadata encoding feat: deterministic metadata encoding Apr 28, 2025
@timsaucer
Contributor Author

@alamb Thank you for the review. I have added an issue as requested, but as I am not a committer on arrow-rs I will need someone else to merge this.

@alamb
Contributor

alamb commented Apr 28, 2025

@alamb Thank you for the review. I have added an issue as requested, but as I am not a committer on arrow-rs I will need someone else to merge this.

I will merge it in a day or two to give others a chance to comment if they desire.

FYI @etseidl and @tustvold

@alamb alamb merged commit 880be2f into apache:main May 1, 2025
28 checks passed
@alamb
Contributor

alamb commented May 1, 2025

Thanks again @timsaucer and @etseidl

Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deterministic metadata encoding
3 participants