Commit

add svgs for docs
amakelov committed Jul 2, 2024
1 parent 28fa53b commit 7d80f5e
Showing 23 changed files with 2,432 additions and 974 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -10,7 +10,7 @@
*.png
*.tif
*.gif
-*.svg
+# *.svg
*.dot
*.gv
# *.mp4
13 changes: 8 additions & 5 deletions docs/docs/01_storage_and_ops.md
@@ -17,7 +17,7 @@ from mandala._next.imports import Storage
storage = Storage(
# omit for an in-memory storage
db_path='my_persistent_storage.db',
-# omit to disable automatic dependency tracking
+# omit to disable automatic dependency tracking & versioning
# use "__main__" to only track functions defined in the current session
deps_path='__main__',
)
@@ -63,8 +63,10 @@ The objects (e.g. `s`) returned by `@op`s are always instances of a subclass of
composition of `@op`s that created this ref.

Two `Ref`s with the same `cid` may have different `hid`s, and `hid` is the
-unique identifier of `Ref`s in the storage.
+unique identifier of `Ref`s in the storage. However, only 1 copy per unique
+`cid` is stored to avoid duplication in the storage.
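The `cid`/`hid` split described in this hunk can be sketched in plain Python. This is a toy `Ref` and `digest` of my own, not mandala's real classes: hashing only the value yields a shared `cid`, hashing the value together with its computational history yields distinct `hid`s, and a store keyed by `cid` keeps a single copy.

```python
import hashlib
from dataclasses import dataclass

def digest(s: str) -> str:
    # short stable hash, standing in for mandala's real ID scheme
    return hashlib.sha256(s.encode()).hexdigest()[:8]

@dataclass
class Ref:
    cid: str  # content ID: hash of the value itself
    hid: str  # history ID: hash of the value *and* how it was computed

def make_ref(value, history: str) -> Ref:
    return Ref(cid=digest(repr(value)), hid=digest(repr(value) + history))

# the same value produced by two different computations
a = make_ref(42, "inc(41)")
b = make_ref(42, "double(21)")
assert a.cid == b.cid   # same content...
assert a.hid != b.hid   # ...different provenance

# a storage keyed by cid holds only one copy of the value
store = {r.cid: 42 for r in (a, b)}
assert len(store) == 1
```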

+### `Ref`s can be in memory or not
Additionally, `Ref`s have the `in_memory` property, which indicates if the
underlying object is present in the `Ref` or if this is a "lazy" `Ref` which
only contains metadata. **`Ref`s are only loaded in memory when needed for a new
@@ -94,7 +96,7 @@ storage.unwrap(s) # loads from storage only if necessary
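The lazy-loading behavior of `unwrap` can be imitated with a toy class (hypothetical names of my own; mandala's actual `Ref`/`Storage` machinery is more involved): the value is fetched from the backing store only on first access.

```python
class LazyRef:
    """Toy stand-in for a Ref that may or may not hold its value in memory."""
    def __init__(self, key, store):
        self.key = key
        self._store = store    # key -> value mapping, standing in for the db
        self._value = None
        self.in_memory = False

    def unwrap(self):
        # load from "storage" only if the value is not already in memory
        if not self.in_memory:
            self._value = self._store[self.key]
            self.in_memory = True
        return self._value

backing = {"s": 0.84}
ref = LazyRef("s", backing)    # starts out as metadata only
assert not ref.in_memory
assert ref.unwrap() == 0.84    # first access loads the value
assert ref.in_memory
```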



-Other useful methods of the `Storage` include:
+### Other useful `Storage` methods

- `Storage.attach(inplace: bool)`: like `unwrap`, but puts the objects in the
`Ref`s if they are not in-memory.
@@ -121,8 +123,9 @@ version at the time of the call, and the `cid`s of the inputs
- `Call.hid`: a history ID for the call, the same as `Call.cid`, but using the
`hid`s of the inputs.

-**Every `Ref` history ID has at most one `Call` that it is an output of**, and
-if it exists, this call can be found by calling `storage.get_ref_creator`:
+**For every `Ref` history ID, there's at most one `Call` that has an output with
+this history ID**, and if it exists, this call can be found by calling
+`storage.get_ref_creator()`:


```python
6 changes: 3 additions & 3 deletions docs/docs/02_retracing.md
@@ -89,7 +89,7 @@ with storage:
```

AtomRef(hid='d0f...', cid='908...', in_memory=False) AtomRef(hid='f1a...', cid='69f...', in_memory=False)
-AtomRef(hid='caf...', cid='d80...', in_memory=False)
+AtomRef(hid='caf...', cid='f35...', in_memory=False)
AtomRef(hid='d16...', cid='12a...', in_memory=False)


@@ -134,7 +134,7 @@ with storage:
Loading data
Training model
Getting accuracy
-AtomRef(0.84, hid='158...', cid='6c4...')
+AtomRef(0.82, hid='158...', cid='238...')
Training model
Getting accuracy
AtomRef(0.9, hid='214...', cid='24c...')
@@ -185,5 +185,5 @@ with storage:
print(storage.unwrap(acc), storage.unwrap(model))
```

-0.84 RandomForestClassifier(max_depth=2, n_estimators=5)
+0.82 RandomForestClassifier(max_depth=2, n_estimators=5)

42 changes: 21 additions & 21 deletions docs/docs/03_cf.md
@@ -1,7 +1,9 @@
# Query the Storage with `ComputationFrame`s
## Why `ComputationFrame`s?
The `ComputationFrame` data structure **formalizes the natural/intuitive way you
-think of the "web" of saved `@op` calls**.
+think of the "web" of saved `@op` calls**. It gives you a "grammar" in which
+operations over persisted computation graphs that are easy to think of are also
+easy to implement.

In computational projects, all queries boil down to how some variables depend on
other variables: e.g., in ML you often care about what input parameters lead to
@@ -12,9 +14,9 @@ represents the "web" of saved `@op` calls, linked by how the outputs of one

The `ComputationFrame` (CF) is the data structure used to explore and query this
web of calls. It's a high-level view of a collection of `@op` calls, so that
-calls that serve the same role are grouped together. The groups of calls form a
+calls that serve the same role are grouped together. **The groups of calls form a
computational graph of variables and functions, which enables effective &
-natural high-level operations over storage.
+natural high-level operations over storage**.

This section covers basic tools to get up to speed with CFs. For more advanced
usage, see [Advanced `ComputationFrame` tools](06_advanced_cf.md)
@@ -29,9 +31,9 @@ limited view of storage because it will involve few (0 or 1) `@op`s
context to the CF by adding new function nodes containing the calls that
produced/used some variable(s). The goal of this stage is to incorporate in the
CF all variables whose relationships you're interested in.
-- **selection**: restrict the values of the variables in the CF
-variables by some predicates. This lets you focus on specific parameters before
-making expensive calls to the storage.
+- **combination & restriction**: merge multiple CFs, restrict to subgraphs or
+specific values of the variables along some predicates. This lets you focus on
+the computations you want before making expensive calls to the storage.
- **[conversion to a `pandas.DataFrame`](#extracting-dataframes-from-computationframes)**: finally,
extract a table representing the relationships between the variables in the CF
for downstream analysis.
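As a rough illustration of this exploration-then-extraction workflow (toy dicts of my own invention standing in for `Call`s and `Ref`s; the real `cf.df()` returns a `pandas.DataFrame`), joining the saved calls of two `@op`s on their shared variable yields one row per end-to-end computation:

```python
# Saved calls of two hypothetical @ops, recorded as variable -> value.
generate_calls = [{"random_seed": 42, "X_train": "X@42"}]
train_calls = [
    {"X_train": "X@42", "n_estimators": 10, "model": "rf_10"},
    {"X_train": "X@42", "n_estimators": 80, "model": "rf_80"},
]

# Link calls where an output of generate_dataset is an input of train_model;
# each merged dict is one row of the extracted table.
rows = [
    {**g, **t}
    for g in generate_calls
    for t in train_calls
    if g["X_train"] == t["X_train"]
]
assert len(rows) == 2
assert rows[0] == {"random_seed": 42, "X_train": "X@42",
                   "n_estimators": 10, "model": "rf_10"}
```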
@@ -244,13 +246,13 @@ print(cf.df(values='refs').to_markdown())
```

Extracting tuples from the computation graph:
-output_0@output_0, output_1@output_1 = train_model(n_estimators=n_estimators, X_train=X_train, y_train=y_train)
-| | n_estimators | y_train | X_train | train_model | output_0 | output_1 |
+output_0@output_0, output_1@output_1 = train_model(n_estimators=n_estimators, y_train=y_train, X_train=X_train)
+| | y_train | X_train | n_estimators | train_model | output_1 | output_0 |
|---:|:-----------------------------------------------------|:-----------------------------------------------------|:-----------------------------------------------------|:----------------------------------------------|:-----------------------------------------------------|:-----------------------------------------------------|
-| 0 | AtomRef(hid='98c...', cid='29d...', in_memory=False) | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | Call(train_model, cid='c4f...', hid='5f7...') | AtomRef(hid='b25...', cid='462...', in_memory=False) | AtomRef(hid='760...', cid='46b...', in_memory=False) |
-| 1 | AtomRef(hid='120...', cid='9bc...', in_memory=False) | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | Call(train_model, cid='3be...', hid='e60...') | AtomRef(hid='522...', cid='d5a...', in_memory=False) | AtomRef(hid='646...', cid='acb...', in_memory=False) |
-| 2 | AtomRef(hid='235...', cid='c04...', in_memory=False) | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | Call(train_model, cid='204...', hid='c55...') | AtomRef(hid='208...', cid='c75...', in_memory=False) | AtomRef(hid='5b7...', cid='f0a...', in_memory=False) |
-| 3 | AtomRef(hid='9fd...', cid='4ac...', in_memory=False) | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | Call(train_model, cid='5af...', hid='514...') | AtomRef(hid='331...', cid='e64...', in_memory=False) | AtomRef(hid='784...', cid='238...', in_memory=False) |
+| 0 | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | AtomRef(hid='9fd...', cid='4ac...', in_memory=False) | Call(train_model, cid='5af...', hid='514...') | AtomRef(hid='784...', cid='238...', in_memory=False) | AtomRef(hid='331...', cid='e64...', in_memory=False) |
+| 1 | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | AtomRef(hid='235...', cid='c04...', in_memory=False) | Call(train_model, cid='204...', hid='c55...') | AtomRef(hid='5b7...', cid='f0a...', in_memory=False) | AtomRef(hid='208...', cid='c75...', in_memory=False) |
+| 2 | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | AtomRef(hid='120...', cid='9bc...', in_memory=False) | Call(train_model, cid='3be...', hid='e60...') | AtomRef(hid='646...', cid='acb...', in_memory=False) | AtomRef(hid='522...', cid='d5a...', in_memory=False) |
+| 3 | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | AtomRef(hid='98c...', cid='29d...', in_memory=False) | Call(train_model, cid='c4f...', hid='5f7...') | AtomRef(hid='760...', cid='46b...', in_memory=False) | AtomRef(hid='b25...', cid='462...', in_memory=False) |


##
@@ -494,16 +496,14 @@ print(cf.df().drop(columns=['X_train', 'y_train']).to_markdown())

Extracting tuples from the computation graph:
X_train@output_0, y_train@output_2 = generate_dataset(random_seed=random_seed)
-output_0@output_0, output_1@output_1 = train_model(n_estimators=n_estimators, X_train=X_train, y_train=y_train)
+output_0@output_0, output_1@output_1 = train_model(n_estimators=n_estimators, y_train=y_train, X_train=X_train)
output_0_0@output_0 = eval_model(model=output_0)
-| | n_estimators | random_seed | generate_dataset | train_model | output_0 | eval_model | output_0_0 | output_1 |
-|---:|---------------:|--------------:|:---------------------------------------------------|:----------------------------------------------|:-----------------------------------------------------|:---------------------------------------------|-------------:|-----------:|
-| 0 | 10 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='c4f...', hid='5f7...') | | | nan | 0.74 |
-| 1 | 80 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='3be...', hid='e60...') | RandomForestClassifier(max_depth=2, n_estimators=80) | Call(eval_model, cid='137...', hid='d32...') | 0.82 | 0.83 |
-| 2 | 20 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='204...', hid='c55...') | | | nan | 0.8 |
-| 3 | 40 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='5af...', hid='514...') | RandomForestClassifier(max_depth=2, n_estimators=40) | Call(eval_model, cid='38f...', hid='5d3...') | 0.81 | 0.82 |
-| 4 | 20 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='204...', hid='c55...') | RandomForestClassifier(max_depth=2, n_estimators=20) | | nan | nan |
-| 5 | 10 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='c4f...', hid='5f7...') | RandomForestClassifier(max_depth=2, n_estimators=10) | | nan | nan |
+| | random_seed | generate_dataset | n_estimators | train_model | output_1 | output_0 | eval_model | output_0_0 |
+|---:|--------------:|:---------------------------------------------------|---------------:|:----------------------------------------------|-----------:|:-----------------------------------------------------|:---------------------------------------------|-------------:|
+| 0 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | 80 | Call(train_model, cid='3be...', hid='e60...') | 0.83 | RandomForestClassifier(max_depth=2, n_estimators=80) | Call(eval_model, cid='137...', hid='d32...') | 0.82 |
+| 1 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | 20 | Call(train_model, cid='204...', hid='c55...') | 0.8 | RandomForestClassifier(max_depth=2, n_estimators=20) | | nan |
+| 2 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | 40 | Call(train_model, cid='5af...', hid='514...') | 0.82 | RandomForestClassifier(max_depth=2, n_estimators=40) | Call(eval_model, cid='38f...', hid='5d3...') | 0.81 |
+| 3 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | 10 | Call(train_model, cid='c4f...', hid='5f7...') | 0.74 | RandomForestClassifier(max_depth=2, n_estimators=10) | | nan |


Importantly, we see that some computations only partially follow the full

