ZnTrack (zɪŋk træk) is a lightweight and easy-to-use Python package for
converting your existing Python code into reproducible workflows. By structuring
your code as a directed graph with well-defined inputs and outputs, ZnTrack
ensures reproducibility, scalability, and ease of collaboration.
- Reproducible Workflows: Convert Python scripts into reproducible workflows with minimal effort.
- Parameter, Output, and Metric Tracking: Easily track parameters, outputs, and metrics in your Python code.
- Shareable and Collaborative: Collaborate with your team through Git. Share your workflows, reuse parts of them in other projects, or distribute them as Python packages.
- DVC Integration: ZnTrack is built on top of DVC for version control and experiment management and seamlessly integrates into the DVC ecosystem.
Let’s take a workflow that constructs a periodic, atomistic system of ethanol and runs a geometry optimization using MACE-MP-0.
from ase.optimize import LBFGS
from mace.calculators import mace_mp
from rdkit2ase import pack, smiles2conformers
model = mace_mp()
frames = smiles2conformers(smiles="CCO", numConfs=32)
box = pack(data=[frames], counts=[32], density=789)
box.calc = model
dyn = LBFGS(box, trajectory="optim.traj")
dyn.run(fmax=0.5)
Dependencies
For this example to work, you will need the packages imported in the script above (ase, mace, and rdkit2ase).

To make this workflow reproducible, we convert it into a directed graph structure where each step is represented as a Node. Nodes define their inputs, outputs, and the computational logic to execute. Here's the graph structure for our example:
flowchart LR
Smiles2Conformers --> Pack --> StructureOptimization
MACE_MP --> StructureOptimization
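The execution order implied by this graph can be sketched with the Python standard library alone. The snippet below only illustrates the dependency structure; it is not how ZnTrack or DVC schedule work internally, and the graph dictionary and variable names are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on (its predecessors).
graph = {
    "Smiles2Conformers": set(),
    "MACE_MP": set(),
    "Pack": {"Smiles2Conformers"},
    "StructureOptimization": {"Pack", "MACE_MP"},
}

# static_order() yields every node only after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Any order that places Smiles2Conformers before Pack, and both Pack and MACE_MP before StructureOptimization, is valid; the two root nodes may appear in either order.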
In ZnTrack, each Node is defined as a Python class. The class attributes
define the inputs (parameters and dependencies) and outputs, while the run
method contains the computational logic to be executed.
Note
ZnTrack uses Python dataclasses under the hood, providing an automatic
__init__ method. Starting from Python 3.11, most IDEs should reliably
provide type hints for ZnTrack Nodes.
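As a rough illustration of what an automatic __init__ means, here is a plain standard-library dataclass. It is not a ZnTrack Node: the class name ConformerSettings is hypothetical, and its fields merely mirror the arguments used in the ethanol script above.

```python
from dataclasses import dataclass, fields


@dataclass
class ConformerSettings:
    # Hypothetical class for illustration; not part of ZnTrack.
    smiles: str         # required: no default value
    numConfs: int = 32  # optional: falls back to 32


# The generated __init__ accepts the declared fields as keyword arguments.
settings = ConformerSettings(smiles="CCO")
print(settings)
print([f.name for f in fields(settings)])
```

Because the constructor signature is derived from the annotated fields, editors can offer completion and type checking for it without any hand-written __init__.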
Tip
For files produced during the run method, ZnTrack provides a unique
Node Working Directory (zntrack.nwd). Always use this directory to store
files to ensure reproducibility and avoid conflicts.
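The idea behind a per-Node working directory can be sketched without ZnTrack. The helper node_working_dir below is hypothetical and only mimics a nodes/<NodeName>/ layout like the one used for the results in this README; the real zntrack.nwd is resolved by the library itself.

```python
import tempfile
from pathlib import Path


def node_working_dir(project_root: Path, node_name: str) -> Path:
    """Return (and create) a directory reserved for one node's files.

    Hypothetical helper for illustration; not part of ZnTrack.
    """
    nwd = project_root / "nodes" / node_name
    nwd.mkdir(parents=True, exist_ok=True)
    return nwd


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two nodes can use the same file name without overwriting each other.
    first = node_working_dir(root, "Smiles2Conformers") / "frames.xyz"
    second = node_working_dir(root, "Pack") / "frames.xyz"
    first.write_text("conformers")
    second.write_text("packed box")
    contents = (first.read_text(), second.read_text())
    relative = first.relative_to(root).as_posix()

print(contents)
print(relative)
```

Keeping each node's files in its own directory is what lets several Nodes write a file named frames.xyz, as the Node definitions in this example do, without conflicts.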
from dataclasses import dataclass
from pathlib import Path
import ase.io
from ase.optimize import LBFGS
from mace.calculators import mace_mp
from rdkit2ase import pack, smiles2conformers
import zntrack

class Smiles2Conformers(zntrack.Node):
    smiles: str = zntrack.params()  # A required parameter
    numConfs: int = zntrack.params(32)  # A parameter with a default value

    frames_path: Path = zntrack.outs_path(zntrack.nwd / "frames.xyz")  # Output file path

    def run(self) -> None:
        frames = smiles2conformers(smiles=self.smiles, numConfs=self.numConfs)
        ase.io.write(self.frames_path, frames)

    @property
    def frames(self) -> list[ase.Atoms]:
        # Load the frames from the output file using the node's filesystem
        with self.state.fs.open(self.frames_path, "r") as f:
            return list(ase.io.iread(f, ":", format="extxyz"))

class Pack(zntrack.Node):
    data: list[list[ase.Atoms]] = zntrack.deps()  # Input dependency (list of ASE Atoms)
    counts: list[int] = zntrack.params()  # Parameter (list of counts)
    density: float = zntrack.params()  # Parameter (density value)

    frames_path: Path = zntrack.outs_path(zntrack.nwd / "frames.xyz")  # Output file path

    def run(self) -> None:
        box = pack(data=self.data, counts=self.counts, density=self.density)
        ase.io.write(self.frames_path, box)

    @property
    def frames(self) -> list[ase.Atoms]:
        # Load the packed structure from the output file
        with self.state.fs.open(self.frames_path, "r") as f:
            return list(ase.io.iread(f, ":", format="extxyz"))

# We could hardcode the MACE_MP model into the StructureOptimization Node, but we
# can also define it as a dependency. Since the model doesn't require a `run` method,
# we define it as a `@dataclass`.
@dataclass
class MACE_MP:
    model: str = "medium"  # Default model type

    def get_calculator(self, **kwargs):
        # Forward any extra keyword arguments instead of silently dropping them
        return mace_mp(model=self.model, **kwargs)

class StructureOptimization(zntrack.Node):
    model: MACE_MP = zntrack.deps()  # Dependency (MACE_MP model)
    data: list[ase.Atoms] = zntrack.deps()  # Dependency (list of ASE Atoms)
    data_id: int = zntrack.params()  # Parameter (index of the structure to optimize)
    fmax: float = zntrack.params(0.05)  # Parameter (force convergence threshold)

    frames_path: Path = zntrack.outs_path(zntrack.nwd / "frames.traj")  # Output file path

    def run(self) -> None:
        atoms = self.data[self.data_id]
        atoms.calc = self.model.get_calculator()
        dyn = LBFGS(atoms, trajectory=self.frames_path.as_posix())
        dyn.run(fmax=self.fmax)  # Use the fmax parameter rather than a hardcoded value

    @property
    def frames(self) -> list[ase.Atoms]:
        # Load the optimization trajectory from the output file
        with self.state.fs.open(self.frames_path, "rb") as f:
            return list(ase.io.iread(f, ":", format="traj"))
Now that we’ve defined all the necessary Nodes, we can build and execute the workflow. Follow these steps:
- Initialize a new directory for your project:

      git init
      dvc init

- Create a Python module for the Node definitions:
  - Create a file src/__init__.py and place the Node definitions inside it.

- Define and execute the workflow in a main.py file:

      from src import MACE_MP, Pack, Smiles2Conformers, StructureOptimization
      import zntrack

      # Initialize the ZnTrack project
      project = zntrack.Project()

      # Define the MACE-MP model
      model = MACE_MP()

      # Build the workflow graph
      with project:
          etoh = Smiles2Conformers(smiles="CCO", numConfs=32)
          box = Pack(data=[etoh.frames], counts=[32], density=789)
          optm = StructureOptimization(model=model, data=box.frames, data_id=-1, fmax=0.5)

      # Execute the workflow
      project.repro()

Tip
If you don’t want to execute the graph immediately, use project.build()
instead. You can run the graph later using dvc repro or the paraffin package.
Once the workflow has been executed, the results are stored in the respective
files. For example, the optimized trajectory is saved in
nodes/StructureOptimization/frames.traj.
You can load the results directly using ZnTrack, without worrying about file paths or formats:
import zntrack
# Load the StructureOptimization Node
optm = zntrack.from_rev(name="StructureOptimization")
# you can pass `remote: str` and `rev: str` to access data from
# a different commit or a remote repository.
# Access the optimization trajectory
print(optm.frames)
For additional examples and advanced use cases, check out these packages built on top of ZnTrack:
- mlipx - Machine Learned Interatomic Potential eXploration.
- IPSuite - Machine Learned Interatomic Potential Tools.
If you use ZnTrack in your research, please cite us:
@misc{zillsZnTrackDataCode2024,
  title = {{{ZnTrack}} -- {{Data}} as {{Code}}},
  author = {Zills, Fabian and Sch{\"a}fer, Moritz and Tovey, Samuel and K{\"a}stner, Johannes and Holm, Christian},
  year = {2024},
  eprint = {2401.10603},
  archivePrefix = {arXiv},
}
This project is distributed under the Apache License Version 2.0.
Here’s a list of other projects that either work together with ZnTrack or achieve similar results with slightly different goals or programming languages:
- DVC - Main dependency of ZnTrack for Data Version Control.
- dvthis - Introduce DVC to R.
- DAGsHub Client - Logging parameters from within Python.
- MLflow - A Machine Learning Lifecycle Platform.
- Metaflow - A framework for real-life data science.
- Hydra - A framework for elegantly configuring complex applications.
- Snakemake - Workflow management system for reproducible and scalable data analyses.