OrderLab · Essoz · Apr 30, 2025 · Apr 7, 2025
diff --git a/.github/workflows/eval-overhead-e2e.yml b/.github/workflows/eval-overhead-e2e.yml
@@ -6,16 +6,16 @@ on:
     paths:
       - '.github/workflows/**'
       - 'eval_scripts/perf_benchmark/**'
-      - 'mldaikon/instrumentor/**'
-      - 'mldaikon/proxy_wrapper/**'
-      - 'mldaikon/collect_trace.py'
+      - 'traincheck/instrumentor/**'
+      - 'traincheck/proxy_wrapper/**'
+      - 'traincheck/collect_trace.py'
   pull_request:
     paths:
       - '.github/workflows/**'
       - 'eval_scripts/perf_benchmark/**'
-      - 'mldaikon/instrumentor/**'
-      - 'mldaikon/proxy_wrapper/**'
-      - 'mldaikon/collect_trace.py'
+      - 'traincheck/instrumentor/**'
+      - 'traincheck/proxy_wrapper/**'
+      - 'traincheck/collect_trace.py'
 
 
 permissions:

diff --git a/.github/workflows/pre-commit-checks.yml b/.github/workflows/pre-commit-checks.yml
@@ -6,12 +6,12 @@ on:
       - main
     paths:
       - '.github/workflows/**'
-      - 'mldaikon/**'
+      - 'traincheck/**'
       - 'tests/**'
   pull_request:
     paths:
       - '.github/workflows/**'
-      - 'mldaikon/**'
+      - 'traincheck/**'
       - 'tests/**'
 
 jobs:
@@ -42,19 +42,19 @@ jobs:
 
       - name: Run black
         id: black
-        run: black --check mldaikon --exclude tests
+        run: black --check traincheck --exclude tests
 
       - name: Run mypy on main source code folder
         id: mypy
-        run: mypy mldaikon --install-types --non-interactive --ignore-missing-imports
+        run: mypy traincheck --install-types --non-interactive --ignore-missing-imports
 
       - name: Run isort
         id: isort
-        run: isort --check --profile=black mldaikon --skip tests
+        run: isort --check --profile=black traincheck --skip tests
 
       - name: Run ruff
         id: ruff
-        run: ruff check mldaikon
+        run: ruff check traincheck
 
       - name: Check if any checks failed
         if: failure()

diff --git a/.gitignore b/.gitignore
@@ -26,7 +26,7 @@ experiments/*
 
 *.pt
 *.pstats
-_ml_daikon_*
+_traincheck_*
 test_meta_hypothesis_combination.py
 !call_graph.json
 
@@ -44,7 +44,7 @@ torch_wrapper.py
 trace-analyzer.ipynb
 instrumented_84911_watch.py
 results/1_0.01/case_1_confusion_matrix.csv
-!/mldaikon/static_analyzer/func_level/*.log
+!/traincheck/static_analyzer/func_level/*.log
 
 *.json
 *.prof
@@ -67,6 +67,5 @@ eval_scripts/perf_benchmark/overhead-e2e/*/traincheck*
 *.pth*
 *ubyte*
 eval_scripts/**/*.png
-traincheck
-mldaikon_run*
+traincheck_run*
 trace_*
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -10,7 +10,7 @@ repos:
     rev: v1.9.0
     hooks:
       - id: mypy
-        files: ^mldaikon/
+        files: ^traincheck/
         types: [python]
         args: [--install-types, --non-interactive, --ignore-missing-imports] 
         # args: [--strict]
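
Review note: the hunks above rename `mldaikon`, `_ml_daikon_`, and `ml-daikon` paths by hand across workflows, `.gitignore`, and the pre-commit config, so a leftover reference is easy to miss. A minimal sketch of a stale-name scan follows. The helper name `find_stale_references`, the regular expression, and the list of file suffixes are assumptions made for illustration and are not part of this PR.

```python
import re
from pathlib import Path

# Old-name spellings seen in the removed lines of this diff
# (mldaikon, ml-daikon, _ml_daikon_, ML-DAIKON). Assumed pattern, adjust as needed.
STALE = re.compile(r"ml[-_]?daikon", re.IGNORECASE)

# File types worth scanning; this list is an assumption, not a project convention.
SUFFIXES = (".py", ".yml", ".yaml", ".md", ".toml", ".cfg", ".sh")


def find_stale_references(root, suffixes=SUFFIXES):
    """Return (path, line_number, line) for every line that still uses an old name."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Skip git's own metadata directory.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.suffix not in suffixes and path.name != ".gitignore":
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # unreadable or binary file, ignore it
        for lineno, line in enumerate(text.splitlines(), start=1):
            if STALE.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_stale_references(".")` from the repository root and printing each tuple gives a quick list of places that may still need the rename.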

diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,13 @@
+Copyright 2025 OrderLab and University of Michigan
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
diff --git a/README.md b/README.md
@@ -1,48 +1,40 @@
 
-# ML-DAIKON
-[![Pre-commit checks](https://github.com/OrderLab/ml-daikon/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/ml-daikon/actions/workflows/pre-commit-checks.yml)
-
-Instrumentor Performance Benchmark Results: http://orderlab.io/ml-daikon/dev/bench/
-
-## Instrumentator Usage
-ML-Daikon performs automatic instrumentation of programs and supports out-of-tree execution. To use the instrumentor, please install mldaikon as a pip package in the desired python environment where the example pipeline should be run in.
-
-To install the instrumentor:
-```shell
-git clone git@github.com:OrderLab/ml-daikon.git
-cd ml-daikon
-pip3 install -e .
-conda install cudatoolkit
-```
-
-A typical instrumentor invocation looks like
-```bash
-python3 -m mldaikon.collect_trace \
-  -p <path to your python script> \
-  -s <optional path to sh script that invokes the python script> \
-  -t [names of the module to be instrumented, e.g. torch, megatron] \ # `torch` is the default value here so you probably don't need to set it
-  --scan_proxy_in_args \ # dynamic analysis for APIContainRelation in 84911, keep it on
-  --allow_disable_dump \ # skip instrumentation for functions in modules specified in config.WRAP_WITHOUT_DUMP, keep it on for instrumentor overhead, inform @Essoz if you need those functions for invariant inference
-  -d # enabling debug logging, if you are not debugging the trace collector, you probably don't need it
-```
-
-The instrumentor will dump the collected trace to the folder where you invoked the command. There should be one trace per thread and the names of trace files follow the pattern:
-```bash
-_ml_daikon_<pyscript-file-name>_mldaikon_trace_API_<time-of-instrumentor-invocation>_<process-id>_<thread-id>.log
-```
-After execution completion, you can also look at `program_output.txt` for the stdout and stderr of the pipeline being executed.
-
-## Infer Engine Usage
-
-```bash
-python3 -m mldaikon.infer_engine \
-  -t <path to your trace files> \
-  -d \ # enable debug logging 
-  -o invariant.json \ # name of the file to dump the inferred invariants to
-```
-
-There are two other arguments that you might need.
-```bash
---disable_precond_sampling \ # by default we enable sampling of examples to be used in precondition inference when the number of examples exceeds 10000. Sampling might cause us to lose information and you can disable this behavior by setting this flag.
---precond_sampling_threshold \ # the default threshold to sample examples is 10000, change this if you need to
-```
+[![format and types](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml)
+[![Chat on Discord](https://img.shields.io/discord/1362661016760090736?label=Discord&logo=discord&style=flat)](https://discord.gg/DPEd7Xeg)
+
+# TrainCheck
+
+TrainCheck is a lightweight, extensible tool for runtime monitoring of “silent” bugs in deep‑learning training pipelines. Instead of waiting for a crash or a bad model, TrainCheck:
+1. **Automatically instruments** your existing training scripts (e.g., from [pytorch/examples](https://github.com/pytorch/examples) or [huggingface/transformers/examples](https://github.com/huggingface/transformers/tree/main/examples)), inserting tracing hooks with minimal code changes.
+2. **Learns precise invariants** (properties that should hold during training across API calls and model updates) by analyzing executions of known-good runs.
+3. **Catches silent issues early** by checking invariants on new or modified training jobs, alerting you immediately if something didn't happen as expected (e.g., model weight inconsistency, mixed precision not applied successfully, unexpected tensor shapes). On violation, TrainCheck flags the point of divergence, so you can diagnose silent issues before they derail your model.
+
+![Workflow](docs/assets/images/workflow.png)
+
+Under the hood, TrainCheck decomposes into three CLI tools:
+- **Instrumentor** (`traincheck-collect`)
+  Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics.
+- **Inference Engine** (`traincheck-infer`)
+  Consumes one or more trace logs from successful runs to infer low‑level invariants.
+- **Checker** (`traincheck-check`)
+  Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants.
+
+## Status
+
+TrainCheck is under active development. Features may be incomplete and the documentation is evolving—if you give it a try, please join our 💬 [Discord server](https://discord.gg/DPEd7Xeg) or file a GitHub issue for support. Currently, the **Checker** operates in a semi‑online mode: you invoke it against the live, growing trace output to catch silent bugs as they appear. Fully automatic monitoring is on the roadmap, and we welcome feedback and contributions from early adopters.
+
+## Try TrainCheck
+
+1. **Install**  
+   Follow the [Installation Guide](./docs/installation-guide.md) to get TrainCheck set up on your machine.
+
+2. **Explore**  
+   Work through our "[5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md)" tutorial. You’ll learn how to:
+   - Instrument a training script and collect a trace  
+   - Automatically infer low‑level invariants  
+   - Run the Checker in semi‑online mode to uncover silent bugs
+
+## Documentation
+Please visit [TrainCheck Technical Doc](./docs/technical-doc.md).
+
+🕵️‍♀️ OSDI AE members, please see [TrainCheck AE Guide](./docs/ae.md).
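
Review note: the new README describes the workflow as learning invariants from known-good runs and then checking new traces against them. The toy sketch below illustrates that idea only. The event format (dictionaries with an `api` key), the "field is constant per API" invariant form, and the example field names are all hypothetical; this is not TrainCheck's actual trace format or inference algorithm.

```python
from collections import defaultdict


def infer_invariants(good_traces):
    """Keep (api, field) -> value pairs that took one single value across all known-good traces.

    Field values must be hashable (strings, ints, tuples) in this sketch.
    """
    seen = defaultdict(set)
    for trace in good_traces:
        for event in trace:
            for field, value in event.items():
                if field != "api":
                    seen[(event["api"], field)].add(value)
    # A field that varied between events is not a constant invariant, so drop it.
    return {key: next(iter(values)) for key, values in seen.items() if len(values) == 1}


def check_trace(trace, invariants):
    """Return (event_index, api, field, expected, actual) for each violation, in trace order."""
    violations = []
    for index, event in enumerate(trace):
        for field, value in event.items():
            key = (event["api"], field)
            if key in invariants and invariants[key] != value:
                violations.append((index, event["api"], field, invariants[key], value))
    return violations


# Hypothetical traces: two good runs where gradients are float16 at every optimizer step.
good_runs = [
    [{"api": "model.forward", "input_rank": 4},
     {"api": "optimizer.step", "grad_dtype": "float16", "step": 1}],
    [{"api": "model.forward", "input_rank": 4},
     {"api": "optimizer.step", "grad_dtype": "float16", "step": 2}],
]
# A new run where mixed precision silently failed to apply.
new_run = [
    {"api": "model.forward", "input_rank": 4},
    {"api": "optimizer.step", "grad_dtype": "float32", "step": 1},
]

invariants = infer_invariants(good_runs)
for violation in check_trace(new_run, invariants):
    print(violation)
```

The first violating event index plays the role of the "point of divergence" mentioned in the README; the varying `step` field is dropped during inference because it never held a single value.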