[Draft] Comparing toolu with main #963
Draft: shatu wants to merge 268 commits into toolu from main (base: toolu)
Conversation
* Fix misnamed variables. * Ran linter.
Co-authored-by: Hamish Ivison <[email protected]>
Adds new olmo-core-compatible chat templates. Includes: * New olmo template with support for function-calling. Includes a basic hard-coded system prompt, and appends "You do not have access to any functions" to any SFT examples that do not include functions. * Thinker version of the above template, has <think> included in the generation prompt * R1-style thinker template. These 3 templates mirror our current Tulu templates. Also includes some necessary changes to the --add_bos logic, to handle the new chat template which does not have a bos token. Includes a few other QoL fixes: * Fixes a bug in the olmocore tokenization script re: label mask * Logs dataset-level statistics during data mixing and tokenization * Supports easy upsampling during data mixing
* fix up my (jacob's) slightly broken pr --------- Co-authored-by: jacob-morrison <[email protected]>
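A minimal sketch of the template plumbing described in the commit above, under stated assumptions: the helper names, the exact system-prompt wording, and the BOS check are illustrative, not the actual open-instruct implementation.

```python
# Illustrative sketch only; names and the BOS heuristic are assumptions.
NO_FUNCTIONS_NOTICE = "You do not have access to any functions."

def build_system_message(base_prompt: str, functions: list | None) -> str:
    """Append the no-functions notice to SFT examples that define no functions."""
    return base_prompt if functions else f"{base_prompt} {NO_FUNCTIONS_NOTICE}"

def maybe_prepend_bos(token_ids: list[int], bos_token_id: int | None) -> list[int]:
    """Only add BOS when the chat template itself did not emit one (mirrors the --add_bos change)."""
    if bos_token_id is None or (token_ids and token_ids[0] == bos_token_id):
        return token_ids
    return [bos_token_id, *token_ids]

print(build_system_message("You are OLMo.", functions=None))
print(maybe_prepend_bos([42, 7], bos_token_id=1))
```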
* remove moar things * create on pr * don't create on pr
* Moved init to the main thread, before the queues. * Undid changes to test.
Co-authored-by: Saurabh Shah <[email protected]>
* init branch * it works * nits * false positive regex * manual filtering * fix manual label issue * another fix * minor fixes * final tweaks * clean up * cleaning * cleanup and docs * try removing tests spec * reset pyproject
Co-authored-by: Hamish Ivison <[email protected]>
* look at changes * tweak * style * fix * fix * one more fix * fix? * verbose is way too verbose?? * update * fix * correct logging * fix small wandb logging bug --------- Co-authored-by: Hamish Ivison <[email protected]>
* A bunch of minor changes. * Undo changes to script. * Added some tests which now pass. * Fixing indexing issue. * Claude tried to fix the indexing error we were running into. * Fix indexing issue. * Another attempt to fix the index bug. * We now have a failing test. * Added failing tests. * Now, all tests pass. * Tests pass. Launching. * Added a bunch of logging. * Removed most of the logging code. * Ran linter. * Created stripped down version of the tests. * Ran linter. * Add comprehensive logging to diagnose hanging issue - Added logging to vLLM engine process_from_queue method - Added logging to split_and_insert_batch to track queue operations - Added logging to accumulate_inference_batches to track result collection - Added logging to sync_weights_and_prepare_prompts to track batch flow This will help identify where the system is getting stuck on training step 2. * Fix vLLM engine queue processing and batch duplication - Added comprehensive error handling and logging to vLLM process_from_queue - Fixed batch sending logic to match original implementation (only send when training_step != 1) - Added early failure detection for vLLM engines - Added queue existence checks to catch initialization issues The main issue was that the vLLM engine's process_from_queue method was not running, and the batch sending logic was incorrect causing duplicate batches. * Switched to use new WeightUpdater system. * Claude tried to fix everything. * Cleaned up PR significantly. * Cleaned up code. * Cleaned up vllm_utils3.py. * Ran linter. * Expected bug in accumulate. * Fixed args. * Modified run single gpu.sh * Cleaned up eval results. * Adds git commit hash to experiment. * Ran linter. * Cleaned up PR. * Added debug scripts. * Removed debugging statements. * All tests and linter passes. * Cleaned up loop. * Cleaned up code. * Removed whitespace. * A bunch of clean up. * Ran linter. * Changed reference tracking to remove bug. * Now, code runs again. * Change test config to ignore oe-eval-internal. * Reset to c6ac09d. * Remove flags from addopts. * Remove flags from addopts. * Add back flag * Fix spacing issue. * Temporarily comment out pytest addopts to debug GitHub checks * Removed pytest block from pyproject. * Fix GitHub checks by updating uv version in quality workflow - Update uv from 0.5.11 to 0.7.17 in quality.yml to match tests.yml - Re-enable pytest configuration in pyproject.toml - Old uv version was likely unable to parse [tool.pytest.ini_options] * What about an empty block? * Most minimal pytest config. * Removes pyproject config. * More verbose tests. * Now, we use a thread to manage generation. * Modified grpo_fast to use a thread to manage generation. * Updated eval processing. * Fixed assert statements. * Removed asserts, as they don't work with the new threading model. * Fixed error. * Switched logger.fatal to raise ValueError. * Cleaned up code. * Ran linter. * Fixed cleanup issues. * Ran linter. * removed old tests. * Added more cleanup. * Ran formatter. * Added leak detection. * maybe fixed issue? * Ran linter. * Tries to fix the tests. * Ran linter. * Removed differences to pyproject.toml. * Another attempt to fix speed. * Ran linter. * Updated tests. * Hopefully fixed synchronization issue. * Ran linter. * Updated script to have good clusters. * Added back accumulate_inference_batches. * Tests should pass now. * Added performance diagnostics via tqdm. * Linter passes. * Now, shellcheck passes build_image_and_launch.sh. * Single GPU script passes shellcheck.
* Fixed library reference. * Removed unneeded test. * Fixed conversion from ref to futures. * Fixed variable naming issue. * Cleaned up PR by removing old files. * Moved runtime leak detection to utils.py. * Ran linter. * Changed training resource cleanup to be a function. * Added generation timer. * Ran linter.
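The "thread to manage generation" pattern referenced in the commits above can be sketched roughly as below. This is only an illustration of the queue-based flow (prompts in, completions out), not the grpo_fast implementation; all names are assumptions.

```python
# Hedged sketch: a background thread pulls prompt batches from one queue,
# drives generation, and hands results back on a second queue.
import queue
import threading

def generation_worker(prompt_q: queue.Queue, results_q: queue.Queue, generate_fn, stop_event: threading.Event):
    while not stop_event.is_set():
        try:
            batch = prompt_q.get(timeout=1.0)
        except queue.Empty:
            continue  # nothing to do yet; check the stop flag again
        results_q.put(generate_fn(batch))  # hand completed generations back to the trainer

prompt_q, results_q = queue.Queue(), queue.Queue()
stop_event = threading.Event()
thread = threading.Thread(
    target=generation_worker,
    args=(prompt_q, results_q, lambda batch: [p + " ...generated" for p in batch], stop_event),
    daemon=True,
)
thread.start()
prompt_q.put(["Example prompt"])
print(results_q.get())
stop_event.set()
```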
* perf penalty * style * fix tests * style * remove scripts * delete more scripts * one more
…s `grpo_fast` on a single GPU. (#830) * Added a beaker experiment action. * Refactors conditional job execution. * Makes it possible to pass images through. * Updated experiment workflow. * Copies the trigger from push-image. * add trigger on PR for testing * trigger test * Add the script we need to run the test. * Updated the workflow. * Update script path. * Added script. * Removed change to script. * Updated to use newly built image. * Now, should have a private key. * Remove big files. * Fix spacing issue. * Cleaned up a bunch of space. * An attempt to fix the oom issue. * Added back ssh agent setup. * Added back ssh agent setup. * Added back ssh agent setup. * Uses a shallow clone. * More attempts to fix the no space left on device error by mimicking push-image.yml. * Fix ssh key. * Another attempt to fix building. * Another attempt at fixing it... * Another attempt at building the image. * Adding python dependencies. * Update status code matching. * Fixed tail usage. * Remove ceres as a cluster. * Now, only runs on every commit pushed to main. * Removed dockerfile changes. * Cleaned up the script. * Final cleanup. * Fixed whitespace issue. * Added back trigger for PRs. * Adding a comment to trigger run. * Fix tab characters. * Added uv back to single gpu script. * Commented out push trigger. --------- Co-authored-by: Hamish Ivison <[email protected]>
* Added comments explaining why we set NCCL_CUMEM_ENABLE. * Ran linter.
…or quality. (#844) * Adds pytest-xdist for parallel tests, and only install needed deps. * Make quality faster by removing uv run. * Fix quality check.
Co-authored-by: Saurabh Shah <[email protected]>
* add verify script * logger, move constants to top of file
…#823) * Fixed TORCH_CUDA_ARCH_LIST warning. * Ran linter. * Change default name for single_gpu_on_beaker script.
Co-authored-by: Saurabh Shah <[email protected]>
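One common way to silence the TORCH_CUDA_ARCH_LIST warning mentioned above is to set the variable from the detected compute capability before any CUDA extension is compiled. This is a hedged sketch of that general approach; the actual fix in the referenced PR may differ.

```python
# Hedged sketch: pin TORCH_CUDA_ARCH_LIST to the local GPU's compute capability
# so torch extension builds do not warn about guessing architectures.
import os
import torch

if torch.cuda.is_available() and "TORCH_CUDA_ARCH_LIST" not in os.environ:
    major, minor = torch.cuda.get_device_capability()
    os.environ["TORCH_CUDA_ARCH_LIST"] = f"{major}.{minor}"
```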
* skip_oi_evals skips everything including print statements
previously --skip_oi_evals would still print Submitting {oi_eval} etc.
* revert
* only print if we're not skipping evals
---------
Co-authored-by: Michael Noukhovitch <[email protected]>
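The behavior change described above amounts to moving the print inside the same guard as the submission. A minimal sketch, with args.skip_oi_evals and the submit callable assumed:

```python
# Sketch of the fix: when --skip_oi_evals is set, skip both the submission and
# the "Submitting ..." print (previously only the submission was skipped).
from types import SimpleNamespace

def submit_evals(args, eval_jobs, submit):
    for oi_eval in eval_jobs:
        if args.skip_oi_evals:
            continue
        print(f"Submitting {oi_eval}")
        submit(oi_eval)

submit_evals(SimpleNamespace(skip_oi_evals=False), ["task_a", "task_b"], submit=lambda name: None)
```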
* Add system prompt + streamline tokenize * address comments * rename * remove old arg * whoops forgot file * debug system prompt * finbarr comments * cute system prompt * cute system prompt debug * move debug system prompt * no more max token length for RL scripts * fix
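A hedged sketch of the "add system prompt + streamline tokenize" step above: insert a system turn if one is missing, then tokenize through the chat template. The prompt text and the insert-only-if-absent rule are assumptions, not the repo's exact logic.

```python
# Illustrative only; uses the standard Hugging Face apply_chat_template API.
def tokenize_with_system_prompt(tokenizer, messages, system_prompt, add_generation_prompt=False):
    if not messages or messages[0].get("role") != "system":
        messages = [{"role": "system", "content": system_prompt}, *messages]
    return tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=add_generation_prompt
    )

# Usage (tokenizer loading elided):
# ids = tokenize_with_system_prompt(tokenizer, msgs, "You are a helpful assistant.")
```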
* fix attn and mlp flop calculation * take into account causal attention * fix weight_memory_bytes * fix head_dim * fix ModelDims creation from LLMRayActor * fix kv_proj: in prefill instead of decode * remove prefill/decode distinction for attn * remove note * fix things (again)
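The "take causal attention into account" point above is the observation that, during prefill, query token i only attends to i+1 keys, so the average key/value length is roughly (seq_len + 1) / 2 rather than seq_len. The sketch below is a back-of-the-envelope illustration of per-layer attention and MLP FLOP counts under that assumption; it is not the repo's ModelDims implementation.

```python
# Per-layer FLOP estimates; multiply-adds count as 2 FLOPs. Illustration only.
def attn_score_flops(seq_len: int, hidden_size: int, causal: bool = True) -> int:
    avg_kv = (seq_len + 1) / 2 if causal else seq_len
    # QK^T plus the attention-weighted sum over V, for every query position
    return int(2 * 2 * seq_len * avg_kv * hidden_size)

def mlp_flops(seq_len: int, hidden_size: int, intermediate_size: int, gated: bool = True) -> int:
    n_proj = 3 if gated else 2  # gate/up/down vs. up/down projections
    return 2 * n_proj * seq_len * hidden_size * intermediate_size

print(attn_score_flops(seq_len=2048, hidden_size=4096))
print(mlp_flops(seq_len=2048, hidden_size=4096, intermediate_size=11008))
```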
* Update script to call new oe-eval safety evals * Add num gpu constraints * Add handling for alternative safety beaker image * typos in script, add hf key to gantry args * move safety eval call into a task suite * update num_gpu calculation * typo * update to match oe-eval naming and engine requirements * typo * update with safety eval version
* Use epoch number instead of training step more * updated code * updated code * Removed merge conflict markers * Removed merge conflict markers * renamed test_vllm_utils3.py * Ran linter * Updated code to remove tracking argument
* fix scripts referencing old DPO script. * Updated code to use config file var instead of positional parameter * Fixed shellcheck error.
* Simplify * Fixed type error * Removed strict=False * Update code to remove strict=False
* Moved tests into separate files * Added tests * Updates to code * update tests * updated * Fixed tests * Cleaned up tests * Fix error in tests.
* changes to mason and eval scripts to not override args * undo changes to mason * Update scripts/submit_eval_jobs.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update scripts/eval/oe-eval.sh Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Simplify * Fixed type error * Removed strict=False * Update code to remove strict=False * change default LCB to fixed prompt (#1118) * linter passes * fix scripts referencing old DPO script. (#1121) * fix scripts referencing old DPO script. * Updated code to use config file var instead of positional parameter * Fixed shellcheck error. --------- Co-authored-by: Saurabh Shah <[email protected]>
* Simplify * Removed strict=False * Update code to remove strict=False * Now, added another linter command. * Ran linter. * Removed strict=False
… Beaker description. (#1127) * Now, we get num_attention_heads from the hf config. * Update code * Added test that we match manual values * Updated calculations * Updated code with check_calculation * Updated code * Now, tests pass. * Updated code to normalize properly * Added some fixes * Updated code * Updated code * Another fix * Cleaned up tests. * Cleaned up PR * Update MFU/MBU code. * Now, mbu tests pass. * Moved to json file * Added test data * undid changes and simplified test function. * An attempt at a fix * Update code with patches * now, tests pass * Added MFU to DPO * updated script * uses uv for dpo * Added a chat template to the DPO script. * Added tracking * Updated code to handle tracking when none * Added description updates * undid changes * Check out dpo script * updated script * Update code to remove whitespace * fix finetune timing * Fixed bugs pointed out by cursor.
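For context on the MFU metric added above: model FLOPs utilization is the fraction of the accelerator's peak throughput a run actually achieves. The sketch below uses the common 6 * n_params FLOPs-per-token rule of thumb (forward + backward, attention ignored) and an illustrative peak number; both are assumptions, not values from the repo.

```python
# Hedged sketch of the MFU calculation being tracked.
def mfu(tokens_per_second: float, n_params: float, peak_flops_per_second: float) -> float:
    achieved = 6.0 * n_params * tokens_per_second  # rule-of-thumb training FLOPs/token
    return achieved / peak_flops_per_second

# e.g. a 7B model at 3k tokens/s per GPU against ~312 TFLOP/s of bf16 peak
print(f"{mfu(3_000, 7e9, 312e12):.1%}")
```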
…epo. (#1128) * Set load_parallel * Cleaned up PR.
* Delete open_instruct/ppo2.py * deleted all mentions of ppo_fast.py
* Update filtering script to include new patterns and args * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update scripts/data/filtering_and_updates/filter_dataset_by_keywords.py * Refine filtering patterns and dataset processing Updated regex patterns for filtering AI model mentions and improved dataset filtering logic. * Update filter_dataset_by_keywords.py * Optimize dataset filtering and loading processes * Refactor dataset loading and filtering processes --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
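A hedged sketch of the keyword/regex filtering described above, in the spirit of filter_dataset_by_keywords.py; the patterns and column name are illustrative, not the script's actual ones.

```python
# Illustrative dataset filtering with regex patterns for AI-model mentions.
import re
from datasets import Dataset

PATTERNS = [
    re.compile(r"\bas an AI (?:language )?model\b", re.IGNORECASE),
    re.compile(r"\bOpenAI\b"),
]

def keep_example(example: dict) -> bool:
    text = " ".join(m.get("content", "") for m in example["messages"])
    return not any(p.search(text) for p in PATTERNS)

ds = Dataset.from_dict({"messages": [
    [{"role": "assistant", "content": "As an AI language model, I cannot..."}],
    [{"role": "assistant", "content": "Sure, here is the answer."}],
]})
print(ds.filter(keep_example))  # keeps only the second example
```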
* Added launch benchmark scripts * Fixed priority * Updated scripts to remove max_token_length * Renamed filter from rlvr_filter_v1 to rlvr_max_length_filter_v1 * Updated code * Add comprehensive logging to debug benchmark hanging issue Added detailed logging throughout the vLLM generation pipeline to diagnose why the benchmark script hangs: - Log prompt submission and queue sizes in benchmark_generators.py - Log actor_manager should_stop status and engine ready status - Log _prefetch_worker iterations and request processing in vllm_utils.py - Log async task creation, completion accumulation, and result queue operations This will help identify where in the pipeline the hang occurs (prompt queue, async tasks, completion queue, or results queue). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Updated script to use stego. * Updated script * updated script to add tokenizer * Fixed tokenizer flag * added more clusters * Added titan * Updated code * Updated scripts * Gather whole model False * Now we use saturn * Now, we let the engines take 10 minutes to init. * Updated code * Updated script. * Updated code * Updated benchmark * updated script * Updated script * Updated code * Added logs * Add comprehensive diagnostic logging to identify benchmark hang location Added detailed logging at critical points in the vLLM engine initialization pipeline to diagnose where the script hangs: - benchmark_generators.py: Log ActorManager creation and vLLM engine setup - vllm_utils.py create_vllm_engines: Log engine creation loop progress - LLMRayActor.__init__: Log each initialization step - _setup_and_start_async_engine: Log thread creation and startup sequence This will help identify whether the hang occurs during Ray actor creation, actor initialization, or vLLM AsyncLLMEngine startup. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Fix placement group deadlock in benchmark_generators The script was hanging because two placement groups were being created: 1. One in setup_vllm_engines() that allocated the GPU 2. Another in create_vllm_engines() that waited forever for the same GPU The placement group created in setup_vllm_engines was only passed to create_vllm_engines when single_gpu_mode=True, causing create_vllm_engines to create a second placement group when single_gpu_mode=False. Fix: Always pass the placement group to create_vllm_engines, preventing the duplicate allocation attempt. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * way more logs * set enforce eager in benchmark script * Update code * Update code * Updated code * updated logging * more logging * updated epoch number * update * more logs * updated code * update code * Set min number of tokens * now removing stop string * trying again * Set batch invariant flag. * Set batch invariant flag. * Undid typo * now, we pass env variables along * remove batch invariant * Trying on augusta... * Reset to weka * Added gs_model_name * Updated script * Update code * More samples * Now, use 3 nodes for training * larger batch size * larger batch * Do 20 steps * Less samples * Smaller patch * more checkpointing * More samples * debug flag * More activation checkpointing * Now, we clear memory after each step. * Now trying with 5 nodes * Using config from https://beaker.allen.ai/orgs/ai2/workspaces/open-instruct-dev/work/01K89DYQFDYTS50SYG89YNHM8G?
* Added assert for batch invariant * Updated code * Fix benchmark * Updated code * Set batch invariant flag * Updated code * Remove eval queue * Update env var code * updated code * Now try 2 nodes * updated code * Updated code * Undid logging changes * updated descriptions for benchmarks * Undid changes to vllm_utils.py * Cleaned up PR * cleaned up code * removed flag * Fixed default config * Updated tests * Added tests * Update gold dataset * tests pass * uses a smaller split when no cache * Now, linter passes * Tests pass locally. * Update code to clean up * Added image pruning * use larger runner * Added hf token as an env variable * Added asserts. * Update open_instruct/dataset_transformation.py Co-authored-by: Hamish Ivison <[email protected]> * Update open_instruct/dataset_transformation.py Co-authored-by: Hamish Ivison <[email protected]> --------- Co-authored-by: Claude <[email protected]> Co-authored-by: Hamish Ivison <[email protected]>
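The placement-group deadlock fix described in the commit above boils down to creating the placement group once and always passing it down, rather than letting the engine-creation path build a second group that waits forever for the same GPU. A hedged sketch of that shape, where everything except the Ray APIs is illustrative:

```python
# Sketch of the fix: one placement group, passed unconditionally (not only in
# single_gpu_mode), so the engine factory never allocates a competing group.
import ray
from ray.util.placement_group import placement_group

def setup_engines(create_engines_fn, num_gpus: int = 1):
    pg = placement_group([{"GPU": 1, "CPU": 1}] * num_gpus)
    ray.get(pg.ready())  # wait for the bundles to be scheduled exactly once
    return create_engines_fn(placement_group=pg)  # reuse the existing pg downstream
```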
* Now, we get num_attention_heads from the hf config. * Update code * Added test that we match manual values * Updated calculations * Updated code with check_calculation * Updated code * Now, tests pass. * Updated code to normalize properly * Added some fixes * Updated code * Updated code * Another fix * Updated code to fix errors from cursor review * Cleaned up tests. * cleaned up code * Cleaned up PR * Restore docstrings and inline comments to ModelDims methods Added back all docstrings and inline comments that were removed during the sliding window implementation. These comments explain the assumptions, calculations, and design decisions in the FLOP and memory bandwidth estimation code. Changes: - Restored docstrings for all ModelDims methods (attn_flops, mlp_flops, prefill_flops, decode_flops, flops, weight_memory_bytes, kv_cache_write_bytes, kv_cache_read_bytes, prefill_memory_bytes, decode_memory_bytes, memory_bytes) - Restored inline comments explaining calculation details - Kept all functionality changes (sliding window support, A100 bandwidth fix) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Refactor attn_flops to use sliding_window parameter directly Changed attn_flops signature from using a boolean use_sliding_window flag to accepting the sliding_window value directly as an Optional[int]. This makes the API cleaner and more explicit. Changes: - attn_flops now takes sliding_window: Optional[int] = None instead of use_sliding_window: bool = False - Uses kv_len = min(kv_len, sliding_window or float("inf")) to handle None case elegantly - Updated all call sites in prefill_flops and decode_flops to pass sliding_window=None for full attention layers and sliding_window=self.sliding_window for sliding window layers 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * updated code * Fixed bug in tests * Updates code * Now, linter passes. * Update MFU/MBU code. * Now, mbu tests pass. * Moved to json file * Added test data * undid changes and simplified test function. * Updated code. * Updated code * test passes * An attempt at a fix * Update code with patches * now, tests pass * Cleaned up code. * Ran linter * Ran linter * linter passes --------- Co-authored-by: Claude <[email protected]>
* Now, we get num_attention_heads from the hf config. * Update code * Added test that we match manual values * Updated calculations * Updated code with check_calculation * Updated code * Now, tests pass. * Updated code to normalize properly * Added some fixes * Updated code * Updated code * Another fix * Updated code to fix errors from cursor review * Cleaned up tests. * cleaned up code * Cleaned up PR * Restore docstrings and inline comments to ModelDims methods Added back all docstrings and inline comments that were removed during the sliding window implementation. These comments explain the assumptions, calculations, and design decisions in the FLOP and memory bandwidth estimation code. Changes: - Restored docstrings for all ModelDims methods (attn_flops, mlp_flops, prefill_flops, decode_flops, flops, weight_memory_bytes, kv_cache_write_bytes, kv_cache_read_bytes, prefill_memory_bytes, decode_memory_bytes, memory_bytes) - Restored inline comments explaining calculation details - Kept all functionality changes (sliding window support, A100 bandwidth fix) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Refactor attn_flops to use sliding_window parameter directly Changed attn_flops signature from using a boolean use_sliding_window flag to accepting the sliding_window value directly as an Optional[int]. This makes the API cleaner and more explicit. Changes: - attn_flops now takes sliding_window: Optional[int] = None instead of use_sliding_window: bool = False - Uses kv_len = min(kv_len, sliding_window or float("inf")) to handle None case elegantly - Updated all call sites in prefill_flops and decode_flops to pass sliding_window=None for full attention layers and sliding_window=self.sliding_window for sliding window layers 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * updated code * Fixed bug in tests * Updates code * Now, linter passes. * Update MFU/MBU code. * Now, mbu tests pass. * Moved to json file * Added test data * undid changes and simplified test function. * Updated code. * Updated code * test passes * An attempt at a fix * Update code with patches * now, tests pass * Added MFU to DPO * updated script * uses uv for dpo * Added a chat template to the DPO script. * Added tracking * Updated code to handle tracking when none * Running without tracking * Cleaned up PR. * Undid changes. * Fixed bug * added logging * fixed softmax issue * Use wandb * removed logs --------- Co-authored-by: Claude <[email protected]>
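The attn_flops refactor quoted above clamps the effective key/value length with `sliding_window or float("inf")`. A sketch of that signature, with the surrounding FLOP formula simplified for illustration rather than taken from the repo's ModelDims code:

```python
# Sketch of the described signature: sliding_window=None means full attention,
# otherwise the kv length is clamped to the window size.
from typing import Optional

def attn_flops(query_len: int, kv_len: int, hidden_size: int, sliding_window: Optional[int] = None) -> int:
    kv_len = min(kv_len, sliding_window or float("inf"))
    # QK^T and attention @ V, with multiply-adds counted as 2 FLOPs
    return int(2 * 2 * query_len * kv_len * hidden_size)

full = attn_flops(query_len=1, kv_len=8192, hidden_size=4096)                           # full-attention layer
windowed = attn_flops(query_len=1, kv_len=8192, hidden_size=4096, sliding_window=4096)  # sliding-window layer
print(full, windowed)
```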
Draft PR for a branch -- No review needed