Draft
148 commits
2740d49
add round1 anneal configs
aetting Jun 26, 2025
39da9f2
Pin to swafix and add smoketest config
undfined Jun 26, 2025
e765967
Trainer config updates
undfined Jun 26, 2025
14233fa
More fixes
undfined Jun 26, 2025
d753873
oops
undfined Jun 26, 2025
528bb19
More config tweaks
undfined Jun 26, 2025
610a9de
Imports
undfined Jun 26, 2025
2a3e8cb
Fix for WSD class bug
undfined Jun 26, 2025
068b776
Match ac_config from swafix
undfined Jun 26, 2025
613ab1a
Match sliding window changes
undfined Jun 27, 2025
4bbb982
More shenans
undfined Jun 27, 2025
65eb9e4
Typo
undfined Jun 27, 2025
a1ac0da
comment
undfined Jun 27, 2025
1353c5a
Can't load state with new dataset
undfined Jun 27, 2025
7c523ee
OOM
undfined Jun 27, 2025
b9df024
olmo3 settings and new paths
aetting Jun 30, 2025
09f3a38
resources and web name
aetting Jun 30, 2025
13ef591
new web paths
aetting Jun 30, 2025
7b2afe8
Use improved scheduler branch
undfined Jul 1, 2025
1b35a4d
Merge branch 'undfined/swafix-core' into olmo3-anneals
aetting Jul 1, 2025
335283b
update round1 anneal paths (missing two)
aetting Jul 1, 2025
8d834bb
update example configs
aetting Jul 1, 2025
657084e
add web paths
aetting Jul 1, 2025
c167043
Merge branch 'undfined/swafix-core' into olmo3-anneals
aetting Jul 1, 2025
de07e7a
consistency updates
aetting Jul 1, 2025
71cf247
Use new dolmino math and update weights
undfined Jul 1, 2025
6104825
Allow repetitions in hqweb
undfined Jul 1, 2025
953ae4a
Not enough tokens for dolmino
undfined Jul 1, 2025
e614cae
Adjust reddit target
undfined Jul 2, 2025
70684e2
Try double rbz
undfined Jul 2, 2025
95859f9
oops
undfined Jul 2, 2025
ec6a6bc
Back to 8192 rbz
undfined Jul 2, 2025
106f175
try with float8
undfined Jul 2, 2025
035d19a
Newer torch
undfined Jul 2, 2025
36cc983
dp tweaks
undfined Jul 2, 2025
2520bdb
match pretrain
undfined Jul 2, 2025
aee96c2
More tweaks for large job
undfined Jul 2, 2025
8b085bd
baseline dolmino anneal config
aetting Jul 2, 2025
d897712
paths bucket and format fix
aetting Jul 2, 2025
95ef206
mj anneals rd1
Jul 3, 2025
4339fdd
Tweaks for mj anneals
undfined Jul 3, 2025
a0544c2
Rejiggered ratios for OMR rewrites
Jul 3, 2025
e3b8261
merge
Jul 3, 2025
a562e4c
restore trainer state from save folder
epwalsh Jul 3, 2025
8520a13
Merge pull request #127 from allenai/epwalsh/olmo3-anneals
aetting Jul 3, 2025
dba258a
update example priority
aetting Jul 3, 2025
2979e40
restore model_and_optim
aetting Jul 3, 2025
a209758
add lr-test-config
aetting Jul 3, 2025
ba57e1f
Added a bunch of nanoanneals
Jul 3, 2025
e595476
Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
Jul 3, 2025
7983868
fixed typo
Jul 3, 2025
9fe0df4
added submodular dolmino math curves
Jul 3, 2025
70a1a94
path format consistency
aetting Jul 4, 2025
a11ea98
fix name
aetting Jul 4, 2025
47211e3
added gs->weka tool
Jul 7, 2025
14d46c9
Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
Jul 7, 2025
148b539
idk this is some luca thing maybe\?
Jul 7, 2025
6cbe734
Added convert from config
Jul 7, 2025
c78dc1e
diff convert
Jul 7, 2025
acbd449
tyler wanted me to do this, idk
Jul 7, 2025
fd233ad
Merge branch 'main' into olmo3-anneals
Jul 7, 2025
de49cd6
Adds v2 hq fim stackedu microanneal
undfined Jul 7, 2025
cd707d9
convert with custom branching
Jul 7, 2025
14b95d3
Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
Jul 7, 2025
90ac849
Adds v2++ hq fim stackedu microanneal
undfined Jul 7, 2025
1670251
Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
undfined Jul 7, 2025
ba62835
add 7T anneals
aetting Jul 8, 2025
b41ca95
step update
aetting Jul 8, 2025
d06c704
fix run names
aetting Jul 8, 2025
12befcd
add ae microanneals
aetting Jul 9, 2025
6d42d1d
bump up rank microbatch size
aetting Jul 9, 2025
e7e55ac
Added eval script for midtraining
Jul 9, 2025
40408cf
uncomment
Jul 9, 2025
5ed2c7b
merge
Jul 9, 2025
d849517
Update README.md
revbucket Jul 9, 2025
9dc008a
add testrun
aetting Jul 9, 2025
d5f4d88
Rename olmo2 anneals and add olmo3-fim-code configs
undfined Jul 9, 2025
b1dd917
Added 'missing eval' stuff
Jul 9, 2025
f0eeaa6
Too many workers counting tokens
undfined Jul 9, 2025
8eb33fc
Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
Jul 9, 2025
c471589
merged davidhs backfill stuff
Jul 9, 2025
8645829
increment eval version
Jul 9, 2025
0657fb0
Wrong weight for hqweb
undfined Jul 10, 2025
29a3653
Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
undfined Jul 10, 2025
38aeeea
add highthresh diverse qa config
aetting Jul 10, 2025
6fd7250
Added mjnewmath-bestof
Jul 10, 2025
aba11c2
add wip anneal round 2 config
aetting Jul 11, 2025
2144071
added kodkode mjicroanneals
Jul 11, 2025
3f32ad1
Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
Jul 11, 2025
079cdf5
update reasoning and math ratios
aetting Jul 11, 2025
8ae28c7
add round2 8T
aetting Jul 11, 2025
371e551
8 nodes
aetting Jul 11, 2025
ed104ab
update reasoning paths and run names
aetting Jul 12, 2025
687bce5
updated code path
aetting Jul 12, 2025
fc81e62
updated paths
soldni Jul 12, 2025
830c5c1
adjusting ratios
soldni Jul 12, 2025
35d7cde
merged main
Jul 14, 2025
d2cd5a9
add follow-up reasoning microanneals
aetting Jul 14, 2025
e8dc411
Adds 10b anneal with 35/30/35 web/code/etc ratios
undfined Jul 14, 2025
984e751
add reddit lowthresh663 microanneal
aetting Jul 14, 2025
5773e19
added megamath-web-pro-max anneals
Jul 15, 2025
7182650
more reddit lowthresh microanneals
aetting Jul 15, 2025
647294d
fix nonmc name
aetting Jul 16, 2025
bee78f9
cleaned up
Jul 16, 2025
0b90f6b
Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
Jul 16, 2025
b108729
remove model_and_optim in load_path
aetting Jul 16, 2025
c18c541
add round 3 macroanneal configs
aetting Jul 18, 2025
bc2c081
add 12T configs
aetting Jul 21, 2025
31e480b
lowthresh mcplusfull
aetting Jul 21, 2025
bcf3459
added rewrite checks
Jul 21, 2025
aa985bf
convert from config hashes updated
Jul 21, 2025
d187de5
Added swallow anneals
Jul 21, 2025
e66c626
lowthresh add context v1
aetting Jul 22, 2025
0825df8
fix path
aetting Jul 22, 2025
68c6fee
add 200B round3 config
aetting Jul 25, 2025
0401ec8
more web paths
aetting Jul 25, 2025
6ef5068
adjust math ratios and code path
aetting Jul 26, 2025
a9de822
adjust reasoning ratios
aetting Jul 26, 2025
795e26d
adjust reasoning ratios
aetting Jul 26, 2025
2cf4c9c
update name
aetting Jul 28, 2025
b12faae
16 nodes
aetting Jul 28, 2025
59d42fc
add omr fullthoughts baseline
aetting Jul 28, 2025
fafdcc5
psgqa microanneal
aetting Jul 30, 2025
fa8d330
psgqa microanneal name
aetting Jul 30, 2025
da8433d
psgqa microanneal name
aetting Jul 30, 2025
30d25b3
add no reasoning no instruct
aetting Jul 31, 2025
fbf6ff9
add no reasoning no instruct
aetting Jul 31, 2025
b50797f
add nodes
aetting Jul 31, 2025
a5d0d3b
fix dolmino ratio
aetting Jul 31, 2025
65f7ebf
add sub8k llamanemotron
aetting Jul 31, 2025
cc1f690
Added check of swallowmatt stuff
Jul 31, 2025
d79785a
bumped nodes on fm4p
Jul 31, 2025
c71cd1a
Added megamatt test anneals
Jul 31, 2025
05249d8
changed names
Jul 31, 2025
78bb0e4
some more swallowmath diversity experiments
Aug 1, 2025
5193269
correct token counts
Aug 1, 2025
ba1f52d
Adds changes to support continual pretrain of olmo3
undfined Aug 1, 2025
c6f53f2
Fix config
undfined Aug 1, 2025
33a52e2
Try with diff path
undfined Aug 1, 2025
e24a6d6
Can't load state with new dataset
undfined Aug 1, 2025
8703207
OOM fix maybe
undfined Aug 1, 2025
620e3ef
Div by 0 not gud
undfined Aug 1, 2025
98352fc
Must set warmup
undfined Aug 1, 2025
a814101
Tweaks
undfined Aug 1, 2025
336347e
fix model validation
undfined Aug 1, 2025
b5e84a7
Try decay for a single token
undfined Aug 1, 2025
14a8dd4
set lr floor for decay
undfined Aug 1, 2025
b048115
duh
undfined Aug 1, 2025
7 changes: 7 additions & 0 deletions README.md
@@ -315,3 +315,10 @@ All PMR CLI commands support the following options:
| `--script` | `-s` | None | Path to script file or directory to execute |

Note that you can provide either `--command` or `--script`, but not both. When using `--script` with a directory path, all executable files in that directory will be distributed across the instances.


# Midtraining utilities
I (mj) built a few utilities to cut down on the manual work of common midtraining tasks.
- [`scripts/gs2weka.py`](scripts/gs2weka.py): Finds the latest checkpoint for a given model configuration and copies it from Google Cloud Storage to Weka storage using olmo-cookbook. Run `python scripts/gs2weka.py <yaml_file>` to detect your Beaker account automatically and process the latest checkpoint, or pass `--beaker-name` to use a different account name.
- [`scripts/convert_from_config.py`](scripts/convert_from_config.py): Finds the latest checkpoint in Weka storage for a given model configuration and converts it to HuggingFace format using olmo-cookbook-eval. Run `python scripts/convert_from_config.py <yaml_file>` to detect your Beaker account automatically and convert the latest checkpoint; pass the optional `--overwrite` flag to reconvert a checkpoint that has already been converted.
- [`scripts/olmo3_midtrain_eval.sh`](scripts/olmo3_midtrain_eval.sh): Runs the OLMo3 midtraining evaluations on a checkpoint using two task suites (midtrain and main). Run `bash scripts/olmo3_midtrain_eval.sh <checkpoint_path>`, where the checkpoint path points to a converted HuggingFace-format checkpoint (e.g., one ending in `-hf`).
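
The three utilities are meant to be run in the order listed. As a rough sketch only, the snippet below builds the corresponding command lines and, by default, just prints them. The helper names `midtrain_commands` and `run_all`, the sample YAML file name, and the sample checkpoint path are hypothetical and not part of the repository; it also assumes the commands are run from the repository root.

```python
import subprocess
from typing import List, Optional


def midtrain_commands(
    yaml_file: str, hf_checkpoint: str, beaker_name: Optional[str] = None
) -> List[List[str]]:
    """Build the copy, convert, and eval command lines in the order they are run."""
    # --beaker-name is optional; without it the scripts detect the account themselves.
    account = ["--beaker-name", beaker_name] if beaker_name else []
    return [
        ["python", "scripts/gs2weka.py", yaml_file, *account],
        ["python", "scripts/convert_from_config.py", yaml_file, *account],
        ["bash", "scripts/olmo3_midtrain_eval.sh", hf_checkpoint],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical inputs, printed as a dry run.
run_all(midtrain_commands("my-anneal.yaml", "/path/to/checkpoint/step1000-hf"))
```

The conversion step submits a Beaker job, so in practice the eval command should only be run once the `-hf` directory actually exists; the sketch does not wait for that.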
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -51,7 +51,7 @@ checkpoints = [
"boto3"
]
all = [
- "ai2-olmo-core @ git+https://github.com/allenai/OLMo-core.git@c779ca546cc3194e73e7491aaefcdffbed042c65",
+ "ai2-olmo-core @ git+https://github.com/allenai/OLMo-core.git@tylerr/olmo3-scripts-swafix-foreachopt",
"beaker-py>=1,<2",
"GitPython>=3.0,<4.0",
"wandb",
266 changes: 266 additions & 0 deletions scripts/convert_from_config.py
@@ -0,0 +1,266 @@
#!/usr/bin/env python3
"""
Script to process YAML file and run olmo-cookbook command with latest checkpoint
"""
import argparse
import re
import subprocess
import sys
from pathlib import Path

import yaml


def run_command(cmd, shell=False, errs_okay=False):
    """Run a command and return its stripped stdout.

    On failure, exit the process unless errs_okay is set, in which case the
    CalledProcessError is re-raised for the caller to handle.
    """
    try:
        result = subprocess.run(cmd, shell=shell, capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        print(f"Error running command: {' '.join(cmd) if isinstance(cmd, list) else cmd}")
        print(f"Error: {e.stderr}")
        if not errs_okay:
            sys.exit(1)
        raise


def get_yaml_name(yaml_file):
    """Extract the 'name' attribute from YAML file"""
    try:
        with open(yaml_file, "r") as f:
            data = yaml.safe_load(f)

        if "name" not in data:
            print(f"Error: 'name' attribute not found in {yaml_file}")
            sys.exit(1)

        return data["name"]
    except Exception as e:
        print(f"Error reading YAML file {yaml_file}: {e}")
        sys.exit(1)


def get_beaker_name():
    """Get the NAME from 'beaker account whoami' output"""
    output = run_command(["beaker", "account", "whoami"])

    # Parse the table output to extract NAME
    lines = output.strip().split("\n")
    if len(lines) < 2:
        print("Error: Unexpected output from 'beaker account whoami'")
        sys.exit(1)

    # Look for the data row (skip header)
    for line in lines[1:]:
        parts = line.split()
        if len(parts) >= 2:
            return parts[1]  # NAME is the second column

    print("Error: Could not extract NAME from beaker account whoami output")
    sys.exit(1)


def find_latest_checkpoint(beaker_name, yaml_name):
    """Find the latest checkpoint directory in weka"""
    weka_path = f"weka://oe-training-default/ai2-llm/checkpoints/{beaker_name}/{yaml_name}-*"

    # Convert weka:// path to s3:// path for s5cmd
    s3_path = weka_path.replace("weka://oe-training-default/", "s3://oe-training-default/")

    # Add wildcard to check for any files in the directory
    s3_path_wildcard = f"{s3_path}/*"

    print(f"Checking if weka path exists: {weka_path}")
    print(f"Using s5cmd to check: {s3_path_wildcard}")

    cmd = [
        "s5cmd",
        "--profile",
        "WEKA",
        "--endpoint-url",
        "https://weka-aus.beaker.org:9000",
        "ls",
        s3_path_wildcard,
    ]

    try:
        output = run_command(cmd, errs_okay=True)
        if not output:
            print(f"No checkpoints found matching: {weka_path}")
            sys.exit(1)

        # The last space-separated field of each `s5cmd ls` line is the path
        paths = [line.split(" ")[-1].strip() for line in output.strip().split("\n")]

        # Collect the unique "<name>-<hash>/step<N>/" prefixes with their step numbers
        ckpt_pattern = re.compile(re.escape(yaml_name) + r"-[0-9a-f]{8}/step(\d+)/")
        ckpts = {}
        for p in paths:
            m = ckpt_pattern.match(p)
            if m:
                ckpts[m.group()] = int(m.group(1))
        assert len(ckpts) > 0, "No valid checkpoints found (this assert should never fail if we got here)"

        # Pick the latest: run directory first (string order, as before), then the
        # step compared numerically so that step1000 ranks above step999
        max_ckpt = max(ckpts, key=lambda c: (c.split("/")[0], ckpts[c]))
        print(max_ckpt)
        return "weka://oe-training-default/ai2-llm/checkpoints/%s/%s" % (beaker_name, max_ckpt)

    except subprocess.CalledProcessError:
        print("No weka paths found!")
        print(
            f"Make sure you have access to weka://oe-training-default/ai2-llm/checkpoints/{beaker_name}/{yaml_name}-* directories"
        )
        raise
    except Exception as e:
        print("ERR CODE ", e)
        raise


def check_hf_path_exists(latest_ckpt):
    """Check if the converted (-hf) checkpoint already exists in weka"""
    # The converted checkpoint sits next to the source one, with an "-hf" suffix
    hf_path = latest_ckpt.rstrip("/") + "-hf/*"

    # Convert weka:// path to s3:// path for s5cmd
    hf_path = hf_path.replace("weka://oe-training-default/", "s3://oe-training-default/")

    print(f"Checking if weka path exists: {hf_path}")
    cmd = [
        "s5cmd",
        "--profile",
        "WEKA",
        "--endpoint-url",
        "https://weka-aus.beaker.org:9000",
        "ls",
        hf_path,
    ]

    try:
        # Run the command - if it succeeds, the path exists
        output = run_command(cmd, errs_okay=True)
        print(f"✅ Weka path exists - found {len(output.splitlines())} files")
        return True
    except subprocess.CalledProcessError:
        # If the command fails, the path doesn't exist
        print("❌ Weka path does not exist (s5cmd failed as expected)")
        return False


def run_olmo_cookbook(weka_path):
    """Run the olmo-cookbook-eval convert command on the checkpoint's weka path"""
    print("Converting %s" % weka_path)
    weka_path = weka_path.replace("weka://", "/").rstrip("/")
    cmd = [
        "olmo-cookbook-eval",
        "convert",
        weka_path,
        "-t",
        "olmo-core-v2",
        "--use-beaker",
        "--huggingface-transformers-git-url",
        "https://github.com/2015aroras/transformers.git",
        "--huggingface-transformers-commit-hash",
        "ae3889ced6ed7362e5883671fc6dc4cb4fece5fa",
        "--olmo-core-v2-commit-hash",
        "57a04d0b69047d797c96eede056a211e75b5914a",
    ]
    print(f"Running: {' '.join(cmd)}")

    try:
        # Run the command and stream its output line by line (stderr merged into stdout)
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1)

        beaker_url = None
        beaker_url_pattern = re.compile(r"https://beaker\.org/ex/[A-Z0-9]+")

        for line in process.stdout:
            print(line, end="")

            # Look for the beaker URL in the output
            match = beaker_url_pattern.search(line)
            if match:
                beaker_url = match.group(0)

        process.wait()

        if process.returncode != 0:
            print(f"Error: olmo-cookbook command failed with return code {process.returncode}")
            sys.exit(1)

        # Print the extracted Beaker URL
        if beaker_url:
            print("\n" + "=" * 60)
            print(f"🔗 Beaker Experiment URL: {beaker_url}")
            print("=" * 60)
            return beaker_url
        else:
            print("\nWarning: Could not extract Beaker experiment URL from output")

    except Exception as e:
        print(f"Error running olmo-cookbook: {e}")
        sys.exit(1)


def main():
    parser = argparse.ArgumentParser(description="Process YAML file and run olmo-cookbook with latest checkpoint")
    parser.add_argument("yaml_file", help="Path to the YAML file")
    parser.add_argument("--beaker-name", required=False, default=None)
    # store_true makes --overwrite a real flag; type=bool would treat any non-empty value as True
    parser.add_argument("--overwrite", action="store_true", help="Reconvert even if the -hf checkpoint exists")
    args = parser.parse_args()

    # Validate input file exists
    if not Path(args.yaml_file).exists():
        print(f"Error: YAML file {args.yaml_file} does not exist")
        sys.exit(1)

    print(f"Processing YAML file: {args.yaml_file}")

    # Step 1: Get name from YAML
    yaml_name = get_yaml_name(args.yaml_file)
    print(f"YAML name: {yaml_name}")

    # Step 2: Get beaker name
    if args.beaker_name is None:
        beaker_name = get_beaker_name()
    else:
        beaker_name = args.beaker_name
    print(f"Beaker name: {beaker_name}")

    # Step 3: Find latest checkpoint
    print(
        f"Searching for checkpoints with prefix: weka://oe-training-default/ai2-llm/checkpoints/{beaker_name}/{yaml_name}-"
    )
    latest_checkpoint = find_latest_checkpoint(beaker_name, yaml_name)
    print(f"Latest checkpoint: {latest_checkpoint}")

    # Step 4: Skip the conversion if the -hf checkpoint already exists
    if check_hf_path_exists(latest_checkpoint) and not args.overwrite:
        print("\n🚫 Converted checkpoint already exists in weka storage. Skipping cookbook command.")
        print(f"Found existing output at {latest_checkpoint.rstrip('/')}-hf; pass --overwrite to reconvert.")
        return

    # Step 5: Run olmo-cookbook command
    run_olmo_cookbook(latest_checkpoint)


if __name__ == "__main__":
    main()
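
Which checkpoint counts as "latest" among names like `step999` and `step1000` depends on whether the step is compared as a string or as a number. The sketch below illustrates the difference, assuming directory names of the form `<name>-<8 hex chars>/step<N>/` as matched by the regex in `find_latest_checkpoint`. The helper `latest_step_dir`, the `STEP_DIR` pattern, and the sample directory names are hypothetical and for illustration only.

```python
import re

# Hypothetical directory names in the "<run>/step<N>/" shape the script's regex matches.
STEP_DIR = re.compile(r"(?P<run>.+-[0-9a-f]{8})/step(?P<step>\d+)/")


def latest_step_dir(dirs):
    """Return the entry with the highest (run directory, numeric step) pair."""

    def sort_key(d):
        m = STEP_DIR.fullmatch(d)
        if m is None:
            raise ValueError(f"unexpected checkpoint directory: {d}")
        return (m.group("run"), int(m.group("step")))

    return max(dirs, key=sort_key)


dirs = ["demo-run-0a1b2c3d/step999/", "demo-run-0a1b2c3d/step1000/"]
print(max(dirs))              # string comparison picks demo-run-0a1b2c3d/step999/
print(latest_step_dir(dirs))  # numeric comparison picks demo-run-0a1b2c3d/step1000/
```

Comparing the run directory as a string and only the step as an integer keeps the ordering across runs unchanged while ranking steps within a run by their actual value.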