
support two more calib datasets and fix embedding layer bug #653


Merged
13 commits merged into main from data_200k on Jul 10, 2025

Conversation

wenhuach21
Contributor

No description provided.

@wenhuach21 changed the title from "support ultrachat_200k dataset" to "support ultrachat_200k dataset and fix embedding layer bug" on Jul 9, 2025

Copilot AI left a comment


Pull Request Overview

This PR introduces support for the ultrachat_200k dataset, extends dataset registration to multiple aliases, and refines the embedding quantization logic.

  • Extend the register_dataset decorator to accept multiple dataset names and integrate ultrachat_200k (see the registration sketch below)
  • Import and standardize load_dataset usage across the existing dataset functions
  • Modify quantize_embedding_layer to return whether any layers were actually quantized
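
As context for the first bullet, here is a minimal sketch of how a multi-alias register_dataset decorator and an ultrachat_200k loader could look. The registry name, loader signature, and message-field handling below are assumptions for illustration, not the actual code in auto_round/calib_dataset.py.

    from datasets import load_dataset

    CALIB_DATASETS = {}  # illustrative stand-in for the real registry

    def register_dataset(*names):
        """Register one loader callable under several dataset aliases."""
        def decorator(fn):
            for name in names:
                CALIB_DATASETS[name] = fn
            return fn
        return decorator

    @register_dataset("ultrachat_200k", "HuggingFaceH4/ultrachat_200k")
    def get_ultrachat_200k(tokenizer, seqlen=2048, split="train_sft", nsamples=128):
        """Pull calibration text from ultrachat_200k (field names are assumptions)."""
        ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=split)
        samples = []
        for example in ds:
            # Each record holds a list of chat messages; join their contents into one string.
            text = " ".join(msg["content"] for msg in example.get("messages", []))
            enc = tokenizer(text, truncation=True, max_length=seqlen, return_tensors="pt")
            samples.append(enc["input_ids"])
            if len(samples) >= nsamples:
                break
        return samples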

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File descriptions:
  • auto_round/utils.py: sort GGUF_CONFIG keys for deterministic ordering in _gguf_format
  • auto_round/calib_dataset.py: add the load_dataset import, multi-name registration, ultrachat_200k support, and fixes for hardcoded datasets
  • auto_round/autoround.py: introduce a to_quantize flag and change the return value of quantize_embedding_layer
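
The utils.py change is purely about determinism. A small sketch of the idea, where the contents of GGUF_CONFIG and the helper function are illustrative stand-ins rather than the real _gguf_format logic:

    # Illustrative stand-in for the real GGUF_CONFIG mapping in auto_round/utils.py.
    GGUF_CONFIG = {
        "gguf:q4_k_m": {"bits": 4},
        "gguf:q8_0": {"bits": 8},
        "gguf:q2_k_s": {"bits": 2},
    }

    def candidate_gguf_formats(bits):
        # sorted() pins the iteration order, so repeated runs return candidates
        # in the same order regardless of how the dict was populated.
        return [name for name in sorted(GGUF_CONFIG) if GGUF_CONFIG[name]["bits"] == bits]

    print(candidate_gguf_formats(4))  # ['gguf:q4_k_m']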
Comments suppressed due to low confidence (1)

auto_round/autoround.py:783

  • Changing the return value to to_quantize alters the previous always-True behavior. Downstream callers expecting True on completion may now misinterpret False as failure. Either update callers or restore the original return semantics and expose to_quantize via a separate API.
        return to_quantize
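
To make the concern concrete, here is a hedged sketch of the changed contract; only the to_quantize flag semantics come from the PR, while the quantization body and the example model are illustrative.

    import torch
    import torch.nn as nn

    def quantize_embedding_layer(model):
        """Return True only if at least one embedding layer was actually quantized."""
        to_quantize = False
        for module in model.modules():
            if isinstance(module, nn.Embedding):
                # Simplified stand-in for the real quantization: int8 round-trip in place.
                with torch.no_grad():
                    scale = module.weight.abs().max() / 127.0
                    module.weight.copy_((module.weight / scale).round().clamp(-128, 127) * scale)
                to_quantize = True
        return to_quantize

    # Caller side: False now means "nothing needed quantizing", not "quantization failed".
    model = nn.Sequential(nn.Linear(8, 8))  # no embedding layers, so the flag stays False
    assert quantize_embedding_layer(model) is False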

@wenhuach21 changed the title from "support ultrachat_200k dataset and fix embedding layer bug" to "support two more calib datasets and fix embedding layer bug" on Jul 10, 2025
@wenhuach21
Contributor Author

The lambada_openai evaluation appears to have some issues, possibly caused by an update to the datasets library, and the failure could not be reproduced locally. Merging for now; the problem will be addressed in a future fix.

wenhuach21 merged commit 7d72403 into main on Jul 10, 2025
6 of 7 checks passed
wenhuach21 deleted the data_200k branch on July 10, 2025 at 07:39