[Draft] Comparing toolu with main #963
Draft: shatu wants to merge 268 commits into toolu from main (base: toolu)
Conversation
* Fix misnamed variables. * Ran linter.
Co-authored-by: Hamish Ivison <[email protected]>
Adds new olmo-core-compatible chat templates. Includes: * New olmo template with support for function-calling. Includes a basic hard-coded system prompt, and appends "You do not have access to any functions" to any SFT examples that do not include functions. * Thinker version of the above template, has <think> included in the generation prompt * R1-style thinker template. These 3 templates mirror our current Tulu templates. Also includes some necessary changes to the --add_bos logic, to handle the new chat template which does not have a bos token. Includes a few other QoL fixes: * Fixes a bug in the olmocore tokenization script re: label mask * Logs dataset-level statistics during data mixing and tokenization * Supports easy upsampling during data mixing
* fix up my (jacob's) slightly broken pr --------- Co-authored-by: jacob-morrison <[email protected]>
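A minimal sketch of the template plumbing described in the commit above, under stated assumptions: the helper names, the exact system-prompt wording, and the BOS check are illustrative, not the actual open-instruct implementation.

```python
# Illustrative sketch only; names and the BOS heuristic are assumptions.
NO_FUNCTIONS_NOTICE = "You do not have access to any functions."

def build_system_message(base_prompt: str, functions: list | None) -> str:
    """Append the no-functions notice to SFT examples that define no functions."""
    return base_prompt if functions else f"{base_prompt} {NO_FUNCTIONS_NOTICE}"

def maybe_prepend_bos(token_ids: list[int], bos_token_id: int | None) -> list[int]:
    """Only add BOS when the chat template itself did not emit one (mirrors the --add_bos change)."""
    if bos_token_id is None or (token_ids and token_ids[0] == bos_token_id):
        return token_ids
    return [bos_token_id, *token_ids]

print(build_system_message("You are OLMo.", functions=None))
print(maybe_prepend_bos([42, 7], bos_token_id=1))
```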
* remove moar things * create on pr * don't create on pr
* Moved init to the main thread, before the queues. * Undid changes to test.
Co-authored-by: Saurabh Shah <[email protected]>
* init branch * it works * nits * false positive regex * manual filtering * fix manual label issue * another fix * minor fixes * final tweaks * clean up * cleaning * cleanup and docs * try removing tests spec * reset pyproject
Co-authored-by: Hamish Ivison <[email protected]>
* look at changes * tweak * style * fix * fix * one more fix * fix? * verbose is way too verbose?? * update * fix * correct logging * fix small wandb logging bug --------- Co-authored-by: Hamish Ivison <[email protected]>
* A bunch of minor changes. * Undo changes to script. * Added some tests which now pass. * Fixing indexing issue. * Claude tried to fix the indexing error we were running into. * Fix indexing issue. * Another attempt to fix the index bug. * We now have a failing test. * Added failing tests. * Now, all tests pass. * Tests pass. Launching. * Added a bunch of logging. * Removed most of the logging code. * Ran linter. * Created stripped down version of the tests. * Ran linter. * Add comprehensive logging to diagnose hanging issue - Added logging to vLLM engine process_from_queue method - Added logging to split_and_insert_batch to track queue operations - Added logging to accumulate_inference_batches to track result collection - Added logging to sync_weights_and_prepare_prompts to track batch flow This will help identify where the system is getting stuck on training step 2. * Fix vLLM engine queue processing and batch duplication - Added comprehensive error handling and logging to vLLM process_from_queue - Fixed batch sending logic to match original implementation (only send when training_step != 1) - Added early failure detection for vLLM engines - Added queue existence checks to catch initialization issues The main issue was that the vLLM engine's process_from_queue method was not running, and the batch sending logic was incorrect causing duplicate batches. * Switched to use new WeightUpdater system. * Claude tried to fix everything. * Cleaned up PR significantly. * Cleaned up code. * Cleaned up vllm_utils3.py. * Ran linter. * Expected bug in accumulate. * Fixed args. * Modified run single gpu.sh * Cleaned up eval results. * Adds git commit hash to experiment. * Ran linter. * Cleaned up PR. * Added debug scripts. * Removed debugging statements. * All tests and linter passes. * Cleaned up loop. * Cleaned up code. * Removed whitespace. * A bunch of clean up. * Ran linter. * Changed reference tracking to remove bug. * Now, code runs again. * Change test config to ignore oe-eval-internal. * Reset to c6ac09d. * Remove flags from addopts. * Remove flags from addopts. * Add back flag * Fix spacing issue. * Temporarily comment out pytest addopts to debug GitHub checks * Removed pytest block from pyproject. * Fix GitHub checks by updating uv version in quality workflow - Update uv from 0.5.11 to 0.7.17 in quality.yml to match tests.yml - Re-enable pytest configuration in pyproject.toml - Old uv version was likely unable to parse [tool.pytest.ini_options] * What about an empty block? * Most minimal pytest config. * Removes pyproject config. * More verbose tests. * Now, we use a thread to manage generation. * Modified grpo_fast to use a thread to manage generation. * Updated eval processing. * Fixed assert statements. * Removed asserts, as they don't work with the new threading model. * Fixed error. * Switched logger.fatal to raise ValueError. * Cleaned up code. * Ran linter. * Fixed cleanup issues. * Ran linter. * removed old tests. * Added more cleanup. * Ran formatter. * Added leak detection. * maybe fixed issue? * Ran linter. * Tries to fix the tests. * Ran linter. * Removed differences to pyproject.toml. * Another attempt to fix speed. * Ran linter. * Updated tests. * Hopefully fixed synchronization issue. * Ran linter. * Updated script to have good clusters. * Added back accumulate_inference_batches. * Tests should pass now. * Added performance diagnostics via tqdm. * Linter passes. * Now, shellcheck passes build_image_and_launch.sh. * Single GPU script passes shellcheck.
* Fixed library reference. * Removed unneeded test. * Fixed conversion from ref to futures. * Fixed variable naming issue. * Cleaned up PR by removing old files. * Moved runtime leak detection to utils.py. * Ran linter. * Changed training resource cleanup to be a function. * Added generation timer. * Ran linter.
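The "thread to manage generation" pattern referenced in the commits above can be sketched roughly as below. This is only an illustration of the queue-based flow (prompts in, completions out), not the grpo_fast implementation; all names are assumptions.

```python
# Hedged sketch: a background thread pulls prompt batches from one queue,
# drives generation, and hands results back on a second queue.
import queue
import threading

def generation_worker(prompt_q: queue.Queue, results_q: queue.Queue, generate_fn, stop_event: threading.Event):
    while not stop_event.is_set():
        try:
            batch = prompt_q.get(timeout=1.0)
        except queue.Empty:
            continue  # nothing to do yet; check the stop flag again
        results_q.put(generate_fn(batch))  # hand completed generations back to the trainer

prompt_q, results_q = queue.Queue(), queue.Queue()
stop_event = threading.Event()
thread = threading.Thread(
    target=generation_worker,
    args=(prompt_q, results_q, lambda batch: [p + " ...generated" for p in batch], stop_event),
    daemon=True,
)
thread.start()
prompt_q.put(["Example prompt"])
print(results_q.get())
stop_event.set()
```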
* perf penalty * style * fix tests * style * remove scripts * delete more scripts * one more
…s `grpo_fast` on a single GPU. (#830) * Added a beaker experiment action. * Refactors conditional job execution. * Makes it possible to pass images through. * Updated experiment workflow. * Copies the trigger from push-image. * add trigger on PR for testing * trigger test * Add the script we need to run the test. * Updated the workflow. * Update script path. * Added script. * Removed change to script. * Updated to use newly built image. * Now, should have a private key. * Remove big files. * Fix spacing issue. * Cleaned up a bunch of space. * An attempt to fix the oom issue. * Added back ssh agent setup. * Added back ssh agent setup. * Added back ssh agent setup. * Uses a shallow clone. * More attempts to fix the no space left on device error by mimicking push-image.yml. * Fix ssh key. * Another attempt to fix building. * Another attempt at fixing it... * Another attempt at building the image. * Adding python dependencies. * Update status code matching. * Fixed tail usage. * Remove ceres as a cluster. * Now, only runs on every commit pushed to main. * Removed dockerfile changes. * Cleaned up the script. * Final cleanup. * Fixed whitespace issue. * Added back trigger for PRs. * Adding a comment to trigger run. * Fix tab characters. * Added uv back to single gpu script. * Commented out push trigger. --------- Co-authored-by: Hamish Ivison <[email protected]>
* Added comments explaining why we set NCCL_CUMEM_ENABLE. * Ran linter.
…or quality. (#844) * Adds pytest-xdist for parallel tests, and only install needed deps. * Make quality faster by removing uv run. * Fix quality check.
Co-authored-by: Saurabh Shah <[email protected]>
* add verify script * logger, move constants to top of file
…#823) * Fixed TORCH_CUDA_ARCH_LIST warning. * Ran linter. * Change default name for single_gpu_on_beaker script.
Co-authored-by: Saurabh Shah <[email protected]>
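One common way to silence the TORCH_CUDA_ARCH_LIST warning mentioned above is to set the variable from the detected compute capability before any CUDA extension is compiled. This is a hedged sketch of that general approach; the actual fix in the referenced PR may differ.

```python
# Hedged sketch: pin TORCH_CUDA_ARCH_LIST to the local GPU's compute capability
# so torch extension builds do not warn about guessing architectures.
import os
import torch

if torch.cuda.is_available() and "TORCH_CUDA_ARCH_LIST" not in os.environ:
    major, minor = torch.cuda.get_device_capability()
    os.environ["TORCH_CUDA_ARCH_LIST"] = f"{major}.{minor}"
```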
* skip_oi_evals skips everything including print statements
previously --skip_oi_evals would still print Submitting {oi_eval} etc.
* revert
* only print if we're not skipping evals
---------
Co-authored-by: Michael Noukhovitch <[email protected]>
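The behavior change described above amounts to moving the print inside the same guard as the submission. A minimal sketch, with args.skip_oi_evals and the submit callable assumed:

```python
# Sketch of the fix: when --skip_oi_evals is set, skip both the submission and
# the "Submitting ..." print (previously only the submission was skipped).
from types import SimpleNamespace

def submit_evals(args, eval_jobs, submit):
    for oi_eval in eval_jobs:
        if args.skip_oi_evals:
            continue
        print(f"Submitting {oi_eval}")
        submit(oi_eval)

submit_evals(SimpleNamespace(skip_oi_evals=False), ["task_a", "task_b"], submit=lambda name: None)
```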
* Add system prompt + streamline tokenize * address comments * rename * remove old arg * whoops forgot file * debug system prompt * finbarr comments * cute system prompt * cute system prompt debug * move debug system prompt * no more max token length for RL scripts * fix
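A hedged sketch of the "add system prompt + streamline tokenize" step above: insert a system turn if one is missing, then tokenize through the chat template. The prompt text and the insert-only-if-absent rule are assumptions, not the repo's exact logic.

```python
# Illustrative only; uses the standard Hugging Face apply_chat_template API.
def tokenize_with_system_prompt(tokenizer, messages, system_prompt, add_generation_prompt=False):
    if not messages or messages[0].get("role") != "system":
        messages = [{"role": "system", "content": system_prompt}, *messages]
    return tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=add_generation_prompt
    )

# Usage (tokenizer loading elided):
# ids = tokenize_with_system_prompt(tokenizer, msgs, "You are a helpful assistant.")
```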
* fix attn and mlp flop calculation * take into account causal attention * fix weight_memory_bytes * fix head_dim * fix ModelDims creation from LLMRayActor * fix kv_proj: in prefill instead of decode * remove prefill/decode distinction for attn * remove note * fix things (again)
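The "take causal attention into account" point above is the observation that, during prefill, query token i only attends to i+1 keys, so the average key/value length is roughly (seq_len + 1) / 2 rather than seq_len. The sketch below is a back-of-the-envelope illustration of per-layer attention and MLP FLOP counts under that assumption; it is not the repo's ModelDims implementation.

```python
# Per-layer FLOP estimates; multiply-adds count as 2 FLOPs. Illustration only.
def attn_score_flops(seq_len: int, hidden_size: int, causal: bool = True) -> int:
    avg_kv = (seq_len + 1) / 2 if causal else seq_len
    # QK^T plus the attention-weighted sum over V, for every query position
    return int(2 * 2 * seq_len * avg_kv * hidden_size)

def mlp_flops(seq_len: int, hidden_size: int, intermediate_size: int, gated: bool = True) -> int:
    n_proj = 3 if gated else 2  # gate/up/down vs. up/down projections
    return 2 * n_proj * seq_len * hidden_size * intermediate_size

print(attn_score_flops(seq_len=2048, hidden_size=4096))
print(mlp_flops(seq_len=2048, hidden_size=4096, intermediate_size=11008))
```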
* Update script to call new oe-eval safety evals * Add num gpu constraints * Add handling for alternative safety beaker image * typos in script, add hf key to gantry args * move safety eval call into a task suite * update num_gpu calculation * typo * update to match oe-eval naming and engine requirements * typo * update with safety eval version
* Use epoch number instead of training step more * updated code * updated code * Removed merge conflict markers * Removed merge conflict markers * renamed test_vllm_utils3.py * Ran linter * Updated code to remove tracking argument
* fix scripts referencing old DPO script. * Updated code to use config file var instead of positional parameter * Fixed shellcheck error.
* Simplify * Fixed type error * Removed strict=False * Update code to remove strict=False
* Moved tests into separate files * Added tests * Updates to code * update tests * updated * Fixed tests * Cleaned up tests * Fix error in tests.
* changes to mason and eval scripts to not override args * undo changes to mason * Update scripts/submit_eval_jobs.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update scripts/eval/oe-eval.sh Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Simplify * Fixed type error * Removed strict=False * Update code to remove strict=False * change default LCB to fixed prompt (#1118) * linter passes * fix scripts referencing old DPO script. (#1121) * fix scripts referencing old DPO script. * Updated code to use config file var instead of positional parameter * Fixed shellcheck error. --------- Co-authored-by: Saurabh Shah <[email protected]>
* Simplify * Removed strict=False * Update code to remove strict=False * Now, added another linter command. * Ran linter. * Removed strict=False
… Beaker description. (#1127) * Now, we get num_attention_heads from the hf config. * Update code * Added test that we match manual values * Updated calculations * Updated code with check_calculation * Updated code * Now, tests pass. * Updated code to normalize properly * Added some fixes * Updated code * Updated code * Another fix * Cleaned up tests. * Cleaned up PR * Update MFU/MBU code. * Now, mbu tests pass. * Moved to json file * Added test data * undid changes and simplified test function. * An attempt at a fix * Update code with patches * now, tests pass * Added MFU to DPO * updated script * uses uv for dpo * Added a chat template to the DPO script. * Added tracking * Updated code to handle tracking when none * Added description updates * undid changes * Check out dpo script * updated script * Update code to remove whitespace * fix finetune timing * Fixed bugs pointed out by cursor.
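For context on the MFU metric added above: model FLOPs utilization is the fraction of the accelerator's peak throughput a run actually achieves. The sketch below uses the common 6 * n_params FLOPs-per-token rule of thumb (forward + backward, attention ignored) and an illustrative peak number; both are assumptions, not values from the repo.

```python
# Hedged sketch of the MFU calculation being tracked.
def mfu(tokens_per_second: float, n_params: float, peak_flops_per_second: float) -> float:
    achieved = 6.0 * n_params * tokens_per_second  # rule-of-thumb training FLOPs/token
    return achieved / peak_flops_per_second

# e.g. a 7B model at 3k tokens/s per GPU against ~312 TFLOP/s of bf16 peak
print(f"{mfu(3_000, 7e9, 312e12):.1%}")
```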
…epo. (#1128) * Set load_parallel * Cleaned up PR.
* Delete open_instruct/ppo2.py * deleted all mentions of ppo_fast.py
* Update filtering script to include new patterns and args * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update scripts/data/filtering_and_updates/filter_dataset_by_keywords.py * Refine filtering patterns and dataset processing Updated regex patterns for filtering AI model mentions and improved dataset filtering logic. * Update filter_dataset_by_keywords.py * Optimize dataset filtering and loading processes * Refactor dataset loading and filtering processes --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
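A hedged sketch of the keyword/regex filtering described above, in the spirit of filter_dataset_by_keywords.py; the patterns and column name are illustrative, not the script's actual ones.

```python
# Illustrative dataset filtering with regex patterns for AI-model mentions.
import re
from datasets import Dataset

PATTERNS = [
    re.compile(r"\bas an AI (?:language )?model\b", re.IGNORECASE),
    re.compile(r"\bOpenAI\b"),
]

def keep_example(example: dict) -> bool:
    text = " ".join(m.get("content", "") for m in example["messages"])
    return not any(p.search(text) for p in PATTERNS)

ds = Dataset.from_dict({"messages": [
    [{"role": "assistant", "content": "As an AI language model, I cannot..."}],
    [{"role": "assistant", "content": "Sure, here is the answer."}],
]})
print(ds.filter(keep_example))  # keeps only the second example
```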
* Added launch benchmark scripts * Fixed priority * Updated scripts to remove max_token_length * Renamed filter from rlvr_filter_v1 to rlvr_max_length_filter_v1 * Updated code * Add comprehensive logging to debug benchmark hanging issue Added detailed logging throughout the vLLM generation pipeline to diagnose why the benchmark script hangs: - Log prompt submission and queue sizes in benchmark_generators.py - Log actor_manager should_stop status and engine ready status - Log _prefetch_worker iterations and request processing in vllm_utils.py - Log async task creation, completion accumulation, and result queue operations This will help identify where in the pipeline the hang occurs (prompt queue, async tasks, completion queue, or results queue). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Updated script to use stego. * Updated script * updated script to add tokenizer * Fixed tokenizer flag * added more clusters * Added titan * Updated code * Updated scripts * Gather whole model False * Now we use saturn * Now, we let the engines take 10 minutes to init. * Updated code * Updated script. * Updated code * Updated benchmark * updated script * Updated script * Updated code * Added logs * Add comprehensive diagnostic logging to identify benchmark hang location Added detailed logging at critical points in the vLLM engine initialization pipeline to diagnose where the script hangs: - benchmark_generators.py: Log ActorManager creation and vLLM engine setup - vllm_utils.py create_vllm_engines: Log engine creation loop progress - LLMRayActor.__init__: Log each initialization step - _setup_and_start_async_engine: Log thread creation and startup sequence This will help identify whether the hang occurs during Ray actor creation, actor initialization, or vLLM AsyncLLMEngine startup. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Fix placement group deadlock in benchmark_generators The script was hanging because two placement groups were being created: 1. One in setup_vllm_engines() that allocated the GPU 2. Another in create_vllm_engines() that waited forever for the same GPU The placement group created in setup_vllm_engines was only passed to create_vllm_engines when single_gpu_mode=True, causing create_vllm_engines to create a second placement group when single_gpu_mode=False. Fix: Always pass the placement group to create_vllm_engines, preventing the duplicate allocation attempt. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * way more logs * set enforce eager in benchmark script * Update code * Update code * Updated code * updated logging * more logging * updated epoch number * update * more logs * updated code * update code * Set min number of tokens * now removing stop string * trying again * Set batch invariant flag. * Set batch invariant flag. * Undid typo * now, we pass env variables along * remove batch invariant * Trying on augusta... * Reset to weka * Added gs_model_name * Updated script * Update code * More samples * Now, use 3 nodes for training * larger batch size * larger batch * Do 20 steps * Less samples * Smaller patch * more checkpointing * More samples * debug flag * More activation checkpointing * Now, we clear memory after each step. * Now trying with 5 nodes * Using config from https://beaker.allen.ai/orgs/ai2/workspaces/open-instruct-dev/work/01K89DYQFDYTS50SYG89YNHM8G?
* Added assert for batch invariant * Updated code * Fix benchmark * Updated code * Set batch invariant flag * Updated code * Remove eval queue * Update env var code * updated code * Now try 2 nodes * updated code * Updated code * Undid logging changes * updated descriptions for benchmarks * Undid changes to vllm_utils.py * Cleaned up PR * cleaned up code * removed flag * Fixed default config * Updated tests * Added tests * Update gold dataset * tests pass * uses a smaller split when no cache * Now, linter passes * Tests pass locally. * Update code to clean up * Added image pruning * use larger runner * Added hf token as an env variable * Added asserts. * Update open_instruct/dataset_transformation.py Co-authored-by: Hamish Ivison <[email protected]> * Update open_instruct/dataset_transformation.py Co-authored-by: Hamish Ivison <[email protected]> --------- Co-authored-by: Claude <[email protected]> Co-authored-by: Hamish Ivison <[email protected]>
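The placement-group deadlock fix described in the commit above boils down to creating the placement group once and always passing it down, rather than letting the engine-creation path build a second group that waits forever for the same GPU. A hedged sketch of that shape, where everything except the Ray APIs is illustrative:

```python
# Sketch of the fix: one placement group, passed unconditionally (not only in
# single_gpu_mode), so the engine factory never allocates a competing group.
import ray
from ray.util.placement_group import placement_group

def setup_engines(create_engines_fn, num_gpus: int = 1):
    pg = placement_group([{"GPU": 1, "CPU": 1}] * num_gpus)
    ray.get(pg.ready())  # wait for the bundles to be scheduled exactly once
    return create_engines_fn(placement_group=pg)  # reuse the existing pg downstream
```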
* Now, we get num_attention_heads from the hf config. * Update code * Added test that we match manual values * Updated calculations * Updated code with check_calculation * Updated code * Now, tests pass. * Updated code to normalize properly * Added some fixes * Updated code * Updated code * Another fix * Updated code to fix errors from cursor review * Cleaned up tests. * cleaned up code * Cleaned up PR * Restore docstrings and inline comments to ModelDims methods Added back all docstrings and inline comments that were removed during the sliding window implementation. These comments explain the assumptions, calculations, and design decisions in the FLOP and memory bandwidth estimation code. Changes: - Restored docstrings for all ModelDims methods (attn_flops, mlp_flops, prefill_flops, decode_flops, flops, weight_memory_bytes, kv_cache_write_bytes, kv_cache_read_bytes, prefill_memory_bytes, decode_memory_bytes, memory_bytes) - Restored inline comments explaining calculation details - Kept all functionality changes (sliding window support, A100 bandwidth fix) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Refactor attn_flops to use sliding_window parameter directly Changed attn_flops signature from using a boolean use_sliding_window flag to accepting the sliding_window value directly as an Optional[int]. This makes the API cleaner and more explicit. Changes: - attn_flops now takes sliding_window: Optional[int] = None instead of use_sliding_window: bool = False - Uses kv_len = min(kv_len, sliding_window or float("inf")) to handle None case elegantly - Updated all call sites in prefill_flops and decode_flops to pass sliding_window=None for full attention layers and sliding_window=self.sliding_window for sliding window layers 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * updated code * Fixed bug in tests * Updates code * Now, linter passes. * Update MFU/MBU code. * Now, mbu tests pass. * Moved to json file * Added test data * undid changes and simplified test function. * Updated code. * Updated code * test passes * An attempt at a fix * Update code with patches * now, tests pass * Cleaned up code. * Ran linter * Ran linter * linter passes --------- Co-authored-by: Claude <[email protected]>
* Now, we get num_attention_heads from the hf config. * Update code * Added test that we match manual values * Updated calculations * Updated code with check_calculation * Updated code * Now, tests pass. * Updated code to normalize properly * Added some fixes * Updated code * Updated code * Another fix * Updated code to fix errors from cursor review * Cleaned up tests. * cleaned up code * Cleaned up PR * Restore docstrings and inline comments to ModelDims methods Added back all docstrings and inline comments that were removed during the sliding window implementation. These comments explain the assumptions, calculations, and design decisions in the FLOP and memory bandwidth estimation code. Changes: - Restored docstrings for all ModelDims methods (attn_flops, mlp_flops, prefill_flops, decode_flops, flops, weight_memory_bytes, kv_cache_write_bytes, kv_cache_read_bytes, prefill_memory_bytes, decode_memory_bytes, memory_bytes) - Restored inline comments explaining calculation details - Kept all functionality changes (sliding window support, A100 bandwidth fix) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Refactor attn_flops to use sliding_window parameter directly Changed attn_flops signature from using a boolean use_sliding_window flag to accepting the sliding_window value directly as an Optional[int]. This makes the API cleaner and more explicit. Changes: - attn_flops now takes sliding_window: Optional[int] = None instead of use_sliding_window: bool = False - Uses kv_len = min(kv_len, sliding_window or float("inf")) to handle None case elegantly - Updated all call sites in prefill_flops and decode_flops to pass sliding_window=None for full attention layers and sliding_window=self.sliding_window for sliding window layers 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * updated code * Fixed bug in tests * Updates code * Now, linter passes. * Update MFU/MBU code. * Now, mbu tests pass. * Moved to json file * Added test data * undid changes and simplified test function. * Updated code. * Updated code * test passes * An attempt at a fix * Update code with patches * now, tests pass * Added MFU to DPO * updated script * uses uv for dpo * Added a chat template to the DPO script. * Added tracking * Updated code to handle tracking when none * Running without tracking * Cleaned up PR. * Undid changes. * Fixed bug * added logging * fixed softmax issue * Use wandb * removed logs --------- Co-authored-by: Claude <[email protected]>
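The attn_flops refactor quoted above clamps the effective key/value length with `sliding_window or float("inf")`. A sketch of that signature, with the surrounding FLOP formula simplified for illustration rather than taken from the repo's ModelDims code:

```python
# Sketch of the described signature: sliding_window=None means full attention,
# otherwise the kv length is clamped to the window size.
from typing import Optional

def attn_flops(query_len: int, kv_len: int, hidden_size: int, sliding_window: Optional[int] = None) -> int:
    kv_len = min(kv_len, sliding_window or float("inf"))
    # QK^T and attention @ V, with multiply-adds counted as 2 FLOPs
    return int(2 * 2 * query_len * kv_len * hidden_size)

full = attn_flops(query_len=1, kv_len=8192, hidden_size=4096)                           # full-attention layer
windowed = attn_flops(query_len=1, kv_len=8192, hidden_size=4096, sliding_window=4096)  # sliding-window layer
print(full, windowed)
```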
Draft PR for a branch -- No review needed