Merged
33 commits
d65171c
debug: sampling if we have 100 millions of examples
Essoz Apr 7, 2025
da05c80
add: improve infer engine logs overall
Essoz Apr 7, 2025
9abc315
Merge branch 'main' into eval_fp
Essoz Apr 7, 2025
77b756c
add pid to infer_engine log file name to break ties
Essoz Apr 8, 2025
50c4fa3
add script to run infer at relation-parallelism
Essoz Apr 8, 2025
c2db82a
only parallelize to 3 groups (FunctionCover, FunctionLead and others)…
Essoz Apr 8, 2025
2394aa3
add: cache stage traces so we don't recreate them for every hypothesis
Essoz Apr 8, 2025
9677e87
remove redundant error log
Essoz Apr 8, 2025
b698004
sampling API calls
Essoz Apr 8, 2025
5fca394
YOLO: rename mldaikon, ml_daikon, ml-daikon to traincheck
Essoz Apr 17, 2025
36141ac
Use pyproject.toml to install traincheck
Essoz Apr 17, 2025
db0bf3d
doc: summary of TrainCheck
Essoz Apr 18, 2025
cf410af
add status section and Discord server links
Essoz Apr 18, 2025
167a5ce
doc: skeleton for Try TrainCheck
Essoz Apr 18, 2025
867464b
add: default non-cuda tensors to CPU-based hashing method
Essoz Apr 18, 2025
52ff91c
doc: installation guide
Essoz Apr 18, 2025
56a56fb
doc: update link to sub docs in README
Essoz Apr 18, 2025
bd2df13
doc: add CPU installation instructions
Essoz Apr 18, 2025
1a655d1
fix installation guide directory name
Essoz Apr 18, 2025
10274aa
fix: handle ModuleNotFoundError in safe_getattr
Essoz Apr 20, 2025
54b25b1
WIP: quick start tutorial
Essoz Apr 20, 2025
6494f89
fix: refine handling for missing property fields in ConsistentOutputR…
Essoz Apr 20, 2025
e38b608
WIP: 5-min tutorial
Essoz Apr 21, 2025
b613c51
doc: 5 min tutorial
Essoz Apr 21, 2025
d0157dc
doc: fix wording in 5-min tutorial
Essoz Apr 22, 2025
c127408
doc: inference and invariant related concepts
Essoz Apr 22, 2025
25e2b77
AE GUIDE: perf benchmark
Essoz Apr 22, 2025
6330572
AE GUIDE: add links to guide in main README
Essoz Apr 22, 2025
2b72777
doc: traincheck-collect (basic usage instructions)
Essoz Apr 28, 2025
c2daaef
doc: traincheck-check (basic usage instructions)
Essoz Apr 28, 2025
615c8c1
fix: handle scalar and vectors in hashing
Essoz Apr 28, 2025
97ea4b5
add: fp evaluate doc
Essoz Apr 30, 2025
82e04b4
add Apache 2.0 License
Essoz Apr 30, 2025
12 changes: 6 additions & 6 deletions .github/workflows/eval-overhead-e2e.yml
Original file line number Diff line number Diff line change
@@ -6,16 +6,16 @@ on:
paths:
- '.github/workflows/**'
- 'eval_scripts/perf_benchmark/**'
- 'mldaikon/instrumentor/**'
- 'mldaikon/proxy_wrapper/**'
- 'mldaikon/collect_trace.py'
- 'traincheck/instrumentor/**'
- 'traincheck/proxy_wrapper/**'
- 'traincheck/collect_trace.py'
pull_request:
paths:
- '.github/workflows/**'
- 'eval_scripts/perf_benchmark/**'
- 'mldaikon/instrumentor/**'
- 'mldaikon/proxy_wrapper/**'
- 'mldaikon/collect_trace.py'
- 'traincheck/instrumentor/**'
- 'traincheck/proxy_wrapper/**'
- 'traincheck/collect_trace.py'


permissions:
12 changes: 6 additions & 6 deletions .github/workflows/pre-commit-checks.yml
@@ -6,12 +6,12 @@ on:
- main
paths:
- '.github/workflows/**'
- 'mldaikon/**'
- 'traincheck/**'
- 'tests/**'
pull_request:
paths:
- '.github/workflows/**'
- 'mldaikon/**'
- 'traincheck/**'
- 'tests/**'

jobs:
@@ -42,19 +42,19 @@ jobs:

- name: Run black
id: black
run: black --check mldaikon --exclude tests
run: black --check traincheck --exclude tests

- name: Run mypy on main source code folder
id: mypy
run: mypy mldaikon --install-types --non-interactive --ignore-missing-imports
run: mypy traincheck --install-types --non-interactive --ignore-missing-imports

- name: Run isort
id: isort
run: isort --check --profile=black mldaikon --skip tests
run: isort --check --profile=black traincheck --skip tests

- name: Run ruff
id: ruff
run: ruff check mldaikon
run: ruff check traincheck

- name: Check if any checks failed
if: failure()
7 changes: 3 additions & 4 deletions .gitignore
@@ -26,7 +26,7 @@ experiments/*

*.pt
*.pstats
_ml_daikon_*
_traincheck_*
test_meta_hypothesis_combination.py
!call_graph.json

@@ -44,7 +44,7 @@ torch_wrapper.py
trace-analyzer.ipynb
instrumented_84911_watch.py
results/1_0.01/case_1_confusion_matrix.csv
!/mldaikon/static_analyzer/func_level/*.log
!/traincheck/static_analyzer/func_level/*.log

*.json
*.prof
@@ -67,6 +67,5 @@ eval_scripts/perf_benchmark/overhead-e2e/*/traincheck*
*.pth*
*ubyte*
eval_scripts/**/*.png
traincheck
mldaikon_run*
traincheck_run*
trace_*
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -10,7 +10,7 @@ repos:
rev: v1.9.0
hooks:
- id: mypy
files: ^mldaikon/
files: ^traincheck/
types: [python]
args: [--install-types, --non-interactive, --ignore-missing-imports]
# args: [--strict]
13 changes: 13 additions & 0 deletions LICENSE
@@ -0,0 +1,13 @@
Copyright 2025 OrderLab and University of Michigan

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
86 changes: 39 additions & 47 deletions README.md
@@ -1,48 +1,40 @@

# ML-DAIKON
[![Pre-commit checks](https://github.com/OrderLab/ml-daikon/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/ml-daikon/actions/workflows/pre-commit-checks.yml)

Instrumentor Performance Benchmark Results: http://orderlab.io/ml-daikon/dev/bench/

## Instrumentator Usage
ML-Daikon performs automatic instrumentation of programs and supports out-of-tree execution. To use the instrumentor, please install mldaikon as a pip package in the desired python environment where the example pipeline should be run in.

To install the instrumentor:
```shell
git clone git@github.com:OrderLab/ml-daikon.git
cd ml-daikon
pip3 install -e .
conda install cudatoolkit
```

A typical instrumentor invocation looks like
```bash
python3 -m mldaikon.collect_trace \
-p <path to your python script> \
-s <optional path to sh script that invokes the python script> \
-t [names of the module to be instrumented, e.g. torch, megatron] \ # `torch` is the default value here so you probably don't need to set it
--scan_proxy_in_args \ # dynamic analysis for APIContainRelation in 84911, keep it on
--allow_disable_dump \ # skip instrumentation for functions in modules specified in config.WRAP_WITHOUT_DUMP, keep it on for instrumentor overhead, inform @Essoz if you need those functions for invariant inference
-d # enabling debug logging, if you are not debugging the trace collector, you probably don't need it
```

The instrumentor will dump the collected trace to the folder where you invoked the command. There should be one trace per thread and the names of trace files follow the pattern:
```bash
_ml_daikon_<pyscript-file-name>_mldaikon_trace_API_<time-of-instrumentor-invocation>_<process-id>_<thread-id>.log
```
After execution completion, you can also look at `program_output.txt` for the stdout and stderr of the pipeline being executed.

## Infer Engine Usage

```bash
python3 -m mldaikon.infer_engine \
-t <path to your trace files> \
-d \ # enable debug logging
-o invariant.json \ # name of the file to dump the inferred invariants to
```

There are two other arguments that you might need.
```bash
--disable_precond_sampling \ # by default we enable sampling of examples to be used in precondition inference when the number of examples exceeds 10000. Sampling might cause us to lose information and you can disable this behavior by setting this flag.
--precond_sampling_threshold \ # the default threshold to sample examples is 10000, change this if you need to
```
[![format and types](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml)
[![Chat on Discord](https://img.shields.io/discord/1362661016760090736?label=Discord&logo=discord&style=flat)](https://discord.gg/DPEd7Xeg)

# TrainCheck

TrainCheck is a lightweight, extensible tool for runtime monitoring of “silent” bugs in deep‑learning training pipelines. Instead of waiting for a crash or a bad model, TrainCheck:
1. **Automatically instruments** your existing training scripts (e.g., from [pytorch/examples](https://github.com/pytorch/examples) or [huggingface/transformers/examples](https://github.com/huggingface/transformers/tree/main/examples)), inserting tracing hooks with minimal code changes.
2. **Learns precise invariants**, properties that should hold during training across API calls and model updates, by analyzing executions of known-good runs.
3. **Catches silent issues early** by checking invariants on new or modified training jobs, alerting you immediately if something didn't happen as expected (e.g., model weight inconsistency, mixed precision not applied successfully, unexpected tensor shapes). On violation, TrainCheck flags the point of divergence so you can diagnose silent issues before they derail your model.

![Workflow](docs/assets/images/workflow.png)

Under the hood, TrainCheck decomposes into three CLI tools:
- **Instrumentor** (`traincheck-collect`)
Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics.
- **Inference Engine** (`traincheck-infer`)
Consumes one or more trace logs from successful runs to infer low‑level invariants.
- **Checker** (`traincheck-check`)
Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants.
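
The infer-then-check idea behind the three tools above can be illustrated with a toy sketch. This is not TrainCheck's implementation or trace format: the event layout (`api` and `attrs` keys), the function names `infer_constant_fields` and `check_trace`, and the single "field is constant" invariant type are all assumptions made up for illustration.

```python
from collections import defaultdict


def infer_constant_fields(traces):
    """Toy inference: keep every (api, field) pair that takes exactly one
    value across all events of all known-good traces."""
    seen = defaultdict(set)
    for trace in traces:
        for event in trace:
            for field, value in event["attrs"].items():
                seen[(event["api"], field)].add(value)
    return {key: next(iter(values)) for key, values in seen.items() if len(values) == 1}


def check_trace(trace, invariants):
    """Toy checker: report (event index, api, field, expected, actual) for
    every event whose field value differs from the inferred constant."""
    violations = []
    for index, event in enumerate(trace):
        for field, actual in event["attrs"].items():
            key = (event["api"], field)
            if key in invariants and invariants[key] != actual:
                violations.append((index, event["api"], field, invariants[key], actual))
    return violations


# Hypothetical traces: two known-good runs and one run where mixed
# precision silently failed to apply to a layer's output.
good_runs = [
    [
        {"api": "autocast", "attrs": {"dtype": "float16"}},
        {"api": "linear", "attrs": {"out_dtype": "float16"}},
    ],
    [
        {"api": "autocast", "attrs": {"dtype": "float16"}},
        {"api": "linear", "attrs": {"out_dtype": "float16"}},
    ],
]
new_run = [
    {"api": "autocast", "attrs": {"dtype": "float16"}},
    {"api": "linear", "attrs": {"out_dtype": "float32"}},
]

invariants = infer_constant_fields(good_runs)
violations = check_trace(new_run, invariants)
print(violations)
```

In this sketch the checker reports the index of the first diverging event, which mirrors the "point of divergence" described earlier. Real invariants relate API calls and model state in richer ways than a single constant field.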

## Status

TrainCheck is under active development. Features may be incomplete and the documentation is evolving—if you give it a try, please join our 💬 [Discord server](https://discord.gg/DPEd7Xeg) or file a GitHub issue for support. Currently, the **Checker** operates in a semi‑online mode: you invoke it against the live, growing trace output to catch silent bugs as they appear. Fully automatic monitoring is on the roadmap, and we welcome feedback and contributions from early adopters.

## Try TrainCheck

1. **Install**
Follow the [Installation Guide](./docs/installation-guide.md) to get TrainCheck set up on your machine.

2. **Explore**
Work through our "[5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md)" tutorial. You’ll learn how to:
- Instrument a training script and collect a trace
- Automatically infer low‑level invariants
- Run the Checker in semi‑online mode to uncover silent bugs

## Documentation
Please visit [TrainCheck Technical Doc](./docs/technical-doc.md).

🕵️‍♀️ OSDI AE members, please see [TrainCheck AE Guide](./docs/ae.md).