
Conversation

@NuojCheng
Collaborator

@NuojCheng NuojCheng commented Oct 22, 2025

Description

This PR adds support for ramp-up batch size, a feature originally proposed in the GPT-3 paper and implemented in Megatron.

When enabled, the per-device batch size starts at a smaller value (per_device_batch_size_start) and gradually increases by per_device_batch_size_increment until it reaches the target per_device_batch_size over a specified number of rampup_samples. This can help improve training stability, especially during the early phases of training.

This feature introduces four new configuration parameters, which align with the Megatron implementation:

  • enable_rampup_batch_size: (default: False) Set to True to enable the ramp-up feature.
  • per_device_batch_size_start: The per-device batch size to use at the beginning of training.
  • per_device_batch_size_increment: The amount to increase the per-device batch size at each ramp-up step.
  • rampup_samples: The total number of samples to process before reaching the full target batch size.
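The schedule implied by these four parameters can be sketched as follows. This is a hypothetical illustration of the Megatron-style ramp-up (an even split of rampup_samples across the intermediate batch sizes), not the actual MaxText implementation; the function name and argument names mirror the config parameters for readability.

```python
def rampup_batch_size(samples_consumed: int,
                      per_device_batch_size_start: int,
                      per_device_batch_size_increment: int,
                      per_device_batch_size: int,
                      rampup_samples: int) -> int:
    """Return the per-device batch size for the current point in training.

    Hypothetical sketch: the ramp-up window is split evenly across the
    intermediate batch sizes, as in Megatron's --rampup-batch-size.
    """
    if samples_consumed >= rampup_samples:
        return per_device_batch_size
    num_increments = ((per_device_batch_size - per_device_batch_size_start)
                      // per_device_batch_size_increment)
    # Samples allotted to each intermediate batch size.
    samples_per_increment = rampup_samples / num_increments
    steps = int(samples_consumed / samples_per_increment)
    return min(per_device_batch_size_start + steps * per_device_batch_size_increment,
               per_device_batch_size)
```

For example, with start=4, increment=4, target=16, and rampup_samples=300, the batch size would hold at 4 for the first 100 samples, then step to 8, then 12, and reach 16 once 300 samples have been consumed.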

The PR includes the following changes:

  • RampupDataLoader: Adds a new RampupDataLoader class that inherits from the base DataLoader. Its primary responsibility is to truncate the input data to match the correct ramp-up shape for the current training step.
  • Metric Logger: Updates the metric logger to prevent flops and token counts associated with metadata from being logged.
  • Config Updates: Modifies pyconfig.py to register and validate the new ramp-up configuration parameters.
  • Testing: Adds new tests to data_loader_tests.py to verify the RampupDataLoader's slicing and increment logic.
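The truncation responsibility described for RampupDataLoader amounts to slicing a full-size batch down to the current ramp-up shape. A minimal sketch, assuming batches are dicts of arrays with the batch dimension first (truncate_batch is a hypothetical helper, not the actual class method):

```python
import numpy as np

def truncate_batch(batch: dict, current_batch_size: int) -> dict:
    """Slice every array in the batch to its first `current_batch_size` rows.

    Hypothetical illustration of the RampupDataLoader truncation step.
    """
    return {k: v[:current_batch_size] for k, v in batch.items()}

# A full batch of 16 sequences truncated to the current ramp-up size of 4.
full_batch = {"inputs": np.zeros((16, 128)), "targets": np.zeros((16, 128))}
small_batch = truncate_batch(full_batch, 4)
```

Slicing the leading dimension keeps the per-example data intact while matching the smaller batch shape expected at the current step.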

FIXES: b/452468482

Tests

New tests in data_loader_tests.py.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have added necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@NuojCheng NuojCheng added the draft Draft PR label Oct 22, 2025
@NuojCheng NuojCheng changed the title Add rampup batch size support in MaxText [WIP] Add rampup batch size support in MaxText Oct 22, 2025
@NuojCheng NuojCheng force-pushed the chengnuojin-rampup-batch branch 4 times, most recently from 57ff3e8 to 842193d Compare October 24, 2025 00:53
@NuojCheng NuojCheng changed the title [WIP] Add rampup batch size support in MaxText Add rampup batch size support in MaxText Oct 24, 2025
@NuojCheng NuojCheng added gemini-review and removed draft Draft PR labels Oct 24, 2025
@github-actions

🤖 Hi @NuojCheng, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@RissyRan
Collaborator

🤖 Hi @NuojCheng, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

It seems out of quota for free tier. We are going to update the Tier 1, should be better soon.

Attempt 1 failed with status 429. Retrying with backoff... ApiError: {"error":{"message":"{\n  \"error\": {\n    \"code\": 429,\n    \"message\": \"You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit.\\n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 2\\nPlease retry in 20.529613201s.\",\n    \"status\": \"RESOURCE_EXHAUSTED\",\n    \"details\": [\n      {\n        \"@type\": \"type.googleapis.com/google.rpc.QuotaFailure\",\n        \"violations\": [\n          {\n            \"quotaMetric\": \"generativelanguage.googleapis.com/generate_content_free_tier_requests\",\n            \"quotaId\": \"GenerateRequestsPerMinutePerProjectPerModel-FreeTier\",\n            \"quotaDimensions\": {\n              \"location\": \"global\",\n              \"model\": \"gemini-2.5-pro\"\n            },\n            \"quotaValue\": \"2\"\n          }\n        ]\n      },\n      {\n        \"@type\": \"type.googleapis.com/google.rpc.Help\",\n        \"links\": [\n          {\n            \"description\": \"Learn more about Gemini API quotas\",\n            \"url\": \"https://ai.google.dev/gemini-api/docs/rate-limits\"\n          }\n        ]\n      },\n      {\n        \"@type\": \"type.googleapis.com/google.rpc.RetryInfo\",\n        \"retryDelay\": \"20s\"\n      }\n    ]\n  }\n}\n","code":429,"status":"Too Many Requests"}}

@github-actions

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
