Commit 874ef22

introduce document for instructlab-sdk
Signed-off-by: Charlie Doern <[email protected]>
1 parent e12aeea commit 874ef22

2 files changed

+175
-0
lines changed

.spellcheck-en-custom.txt

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,8 @@ agentic
 Akash
 AMDGPU
 Anil
+API
+api
 arge
 args
 arXiv
@@ -215,6 +217,8 @@ Salawu
 scalable
 SDG
 sdg
+SDK
+sdk
 semvar
 sexualized
 SHA

docs/sdk/instructlab-sdk.md

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# InstructLab Python SDK

## Motivation

Today, the only way to "drive" the opinionated InstructLab workflow is via the `ilab` CLI. While this process provides a succinct way for everyday users to initialize a config, generate synthetic data, train a model, and evaluate it, the guardrails are quite limiting, both in what a user can do and in what the development team can expose directly to the user as features over time.

Additionally, current consumers of InstructLab find themselves importing a mix of private and public library APIs in combination with CLI functionality to achieve the workflows they want. While these more advanced usage patterns are not for everyone, providing the community with a standardized and safe way to run bespoke and piecemeal workflows is a necessity.

Unifying this range of advanced workflows under an overarching `InstructLab SDK API` will allow for new usage patterns and a clearer story on what InstructLab can and should provide as user-accessible endpoints.

While each library can and _should_ have its own publicly accessible SDK, not all functionality being added to SDG, Training, and Eval needs to be tied directly to the "InstructLab workflow". This Python SDK should, as the CLI does, expose an opinionated flow that uses functionality from the various libraries. The InstructLab SDK should be derived from the library APIs, not the other way around. SDG, for example, currently has a `generate_data` method meant to be accessed only by InstructLab. This method simply calls other publicly available SDG functionality. Orchestration of the InstructLab flow like this should not be a concern of the individual libraries; it should instead be handled by the overarching InstructLab SDK, which will maintain the user contracts. The InstructLab SDK will need to work within the bounds of what the libraries expose as public APIs.

The benefit of the above is that the opinionated flow can be accessed in a more nuanced and piecemeal way while also gaining the potential for more advanced features. Say a consumer wants to:

1. Set up a custom config file for ilab (optional)
2. Initialize a taxonomy
3. Ensure their taxonomy is valid
4. Ingest some data for RAG and SDG (SDG coming soon)
5. Generate synthetic data using an InstructLab pipeline
6. Do some custom handling per their use case
7. Fine-tune a model using the custom config they initialized for their hardware
8. Evaluate their model after training using various benchmarks

A user could do the following to accomplish this if they had an SDK:

```python
ilab_client = InstructLab(config_path="", taxonomy="", auto_detect=True)
if ilab_client.taxonomy.diff():
    rag_ingestor = RAG.Ingestor(...)
    rag_ingestor.ingest()

    data_ingestor = Data.Ingestor()
    data_ingestor.ingest()

    data_generator = Data.Generator(...)
    data_paths = data_generator.generate_data(...)

    some_custom_handling(data_paths)

    model_trainer = Model.Trainer(...)
    # potentially could expose other things besides training
    training_data = model_trainer.process_data(...)
    model_path = model_trainer.train_model(training_data=training_data)

    eval_client = Model.Evaluator(model_path=model_path)
    eval_client.mt_bench(...)
    eval_client.dk_bench(...)
```

(The structure of the SDK and the actual arguments are discussed below.)

Today, however, users are forced to run a sequence of commands that only works with the expected directory structure on the system.

## Major Goals

1. Modularize the InstructLab workflow such that any part can be run independently.
2. Allow users to choose whether or not to take advantage of the config/system-profile method of running InstructLab, meaning they do not need any pre-existing configuration to run the SDK.
3. Standardize user contracts for the existing functionality of the InstructLab workflow. Existing CLI commands should use the SDK once past click parsing, not separate code.
4. Define contracts loose enough that functionality can be expanded as more advanced features are released.
5. Document SDK usage in upcoming InstructLab releases.

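
As a rough illustration of goal 2, the sketch below shows one way configuration could be made optional: with no path, defaults are used; with a path, values from the file are overlaid on those defaults. `SDKConfig`, its fields, and `load_config` are hypothetical names, and JSON stands in for the real config format only to keep the example dependency-free.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SDKConfig:
    """Hypothetical settings object; the field names are illustrative only."""

    taxonomy_path: str = "taxonomy"
    num_cpus: int = 1


def load_config(config_path: Optional[str] = None) -> SDKConfig:
    """Return defaults when no path is given, else overlay values from the file."""
    if not config_path:
        # No pre-existing configuration is required to use the SDK.
        return SDKConfig()
    # Assumption: the file holds a flat JSON object keyed by field name.
    overrides = json.loads(Path(config_path).read_text())
    return SDKConfig(**overrides)
```

Under these assumptions, `load_config()` works on a fresh system, while `load_config("config.json")` picks up a user's existing settings.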
## Non-Goals

1. Exposing all library functionality immediately
2. Replacing the CLI
3. Shipping an SDK that is generally available as opposed to v1alpha1 or v1beta1

## Design

### Versioning

The SDK would start at version v1alpha1 so that it can change or break at any time during the first few iterations as libraries adjust their API surface.

### Structure

This SDK should preferably live in a net-new package inside `instructlab/instructlab`, rather than in a new repository, to limit unnecessary imports. The SDK could be imported as `instructlab.core...`

The user surface initially should look like this:

`instructlab.core` contains all SDK definitions. Users can `from instructlab.core import...` to use specific SDK classes.

For most of the existing InstructLab command groups, there should be a class:

`from instructlab.core import InstructLab, Data, Model, RAG`

However, things like configuration initialization, taxonomy management, and system info are utilities, so they can live in a centralized `InstructLab` client class.

```python
ilab_client = InstructLab(config_path="", taxonomy="", auto_detect=True)

diff = ilab_client.taxonomy.diff()

if diff:
    # not in v1alpha1
    data_ingestor = Data.Ingestor(data_path="", ...)
    data_path = data_ingestor.ingest()
    # not in v1alpha1

    sdg_client = Data.Generator(client=openai_compat_client, data=data_path, teacher_model=path_to_model)

    sdg_client.generate_data()

    # you can either use a config obj or pass trainer args
    trainer = Model.Trainer(student_model=path_to_student_model, configuration=configuration_object, training_args=...)

    model_path = trainer.train_model()

    # you can either use a config obj or each arg manually
    eval_client = Model.Evaluator(model_path=model_path, configuration=configuration_object, ...)

    eval_client.mt_bench()
```

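
The "config object or explicit arguments" choice in the example above implies some precedence rule. A minimal sketch of one possible rule follows, assuming explicit arguments win over the configuration object, which in turn wins over defaults. `TrainingArgs`, its fields, the `"train"` key, and `resolve_training_args` are all hypothetical and not part of any existing InstructLab API.

```python
from dataclasses import dataclass, replace
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class TrainingArgs:
    """Hypothetical training settings; the fields are placeholders."""

    num_epochs: int = 1
    learning_rate: float = 2e-5


def resolve_training_args(
    configuration: Optional[Mapping[str, Any]] = None,
    training_args: Optional[TrainingArgs] = None,
) -> TrainingArgs:
    """Prefer explicit args, then the config's "train" section, then defaults."""
    if training_args is not None:
        return training_args
    if configuration is not None:
        # Overlay only the keys the configuration actually provides.
        return replace(TrainingArgs(), **configuration.get("train", {}))
    return TrainingArgs()
```

With this rule, a caller who passes nothing still gets a usable set of arguments, which keeps the SDK runnable without a pre-existing config.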
So the full list:

```console
instructlab.core.InstructLab
instructlab.core.InstructLab.taxonomy.diff
instructlab.core.InstructLab.system.info
instructlab.core.InstructLab.config.show
instructlab.core.InstructLab.config.get
instructlab.core.Data.Ingestor
instructlab.core.Data.Ingestor.ingest
instructlab.core.Data.Generator
instructlab.core.Data.Generator.generate_data
instructlab.core.Model.Server.serve
instructlab.core.Model.Trainer
instructlab.core.Model.Trainer.train_model
instructlab.core.Model.Trainer.process_data
instructlab.core.Model.Evaluator
instructlab.core.Model.Evaluator.mt_bench
instructlab.core.Model.Evaluator.dk_bench
instructlab.core.Model.Evaluator.mmlu_bench
instructlab.core.RAG.Ingestor.ingest
instructlab.core.RAG.Converter.convert
```

Presumably, the distinct methods under each class will grow, which is why I am opting to make very distinct classes per operation.

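
One way to picture the "one class per operation" layout is a namespace class that holds nested operation classes. The sketch below stubs out only the `Data` group and adds a helper, `public_surface`, that derives the dotted names from the class layout. The helper and the stub bodies are illustrative assumptions; the real package might use modules instead of nested classes.

```python
import inspect
from typing import List


class Data:
    """Namespace for data operations; one nested class per operation."""

    class Ingestor:
        def ingest(self) -> str:
            raise NotImplementedError("stub for illustration")

    class Generator:
        def generate_data(self) -> List[str]:
            raise NotImplementedError("stub for illustration")


def public_surface(namespace: type, prefix: str = "instructlab.core") -> List[str]:
    """List dotted names of public nested classes and their public methods."""
    path = f"{prefix}.{namespace.__name__}"
    names: List[str] = []
    for cls_name, member in vars(namespace).items():
        if cls_name.startswith("_") or not inspect.isclass(member):
            continue
        names.append(f"{path}.{cls_name}")
        for attr, value in vars(member).items():
            if not attr.startswith("_") and inspect.isfunction(value):
                names.append(f"{path}.{cls_name}.{attr}")
    return names
```

Calling `public_surface(Data)` yields the four `instructlab.core.Data.*` entries from the list above, so adding a new operation class extends the surface without touching existing classes.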
These initial exposed functions can expand to include any new functionality from the various libraries that is more SDK oriented. For example, if SDG adds something like subset selection, teacher as annotator, data mixing, etc., we could expose an `instructlab.core.Data.Annotator.annotate` or `instructlab.core.Data.Mixer.mix` that could be invoked in sequence in a user's script with other parts of the ilab workflow. Some things make _less_ sense to expose via a CLI but are still critical to ensuring users get a good model and properly generated data.

Certain things currently exist only in `ilab`, and some functionality is going to be moving there, such as data ingestion and RAG. Forming an SDK for `instructlab` allows us to capture all of these concerns under one API.

These endpoints, in combination with the curated InstructLab config file, will open up these workflows to users and allow InstructLab to be easily incorporated into other projects. Allowing people to run things like data generation and full fine-tuning via an SDK that pulls in their pre-existing config.yaml, but that can also be run independently, will open new avenues for InstructLab adoption and extensibility.

## Changes to the CLI

The `ilab` CLI will need to adapt to this new structure. Commands like `ilab data generate` should, in terms of code, follow this flow:

1. `src/instructlab/cli/data/generate.py`
2. `src/instructlab/data/generate.py`
3. `src/instructlab/process.py`
4. `src/instructlab/core/data/generator.py`

So, generally, the flow is: cli -> internal handling package -> process management package -> core SDK -> library code.

The flow of the CLI today is such that the cli package for a command (`src/instructlab/cli/data/generate.py`) parses the command-line options and passes control to the core code (`src/instructlab/data/generate.py`). This code then, if applicable, yields control to the `process` package, which kicks off the necessary processes for the given command.

The difference with an SDK is that we would eventually want to end up executing `core/data/generator.py`, the actual publicly consumable Python SDK. This will ensure that the CLI can do whatever custom handling it needs to do on top, but eventually it must boil down to the `core` package, which uses publicly available methods from the various libraries.

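
The layering described in this section can be sketched in a single file, with one function or class standing in for each layer. This is only an illustration: `argparse` is used in place of click to keep the example free of third-party dependencies, the `--teacher-model` and `--data` flags are hypothetical, and the bodies are placeholders rather than real InstructLab code.

```python
import argparse
from typing import Callable, List, Optional, Sequence


class Generator:
    """Core SDK layer, standing in for src/instructlab/core/data/generator.py."""

    def __init__(self, teacher_model: str, data: str) -> None:
        self.teacher_model = teacher_model
        self.data = data

    def generate_data(self) -> List[str]:
        # Placeholder: a real implementation would call the SDG library's public API.
        return [f"{self.data}/generated.jsonl"]


def run_process(target: Callable[[], List[str]]) -> List[str]:
    """Process management layer; this stand-in simply calls the target in-process."""
    return target()


def generate(teacher_model: str, data: str) -> List[str]:
    """Internal handling layer: CLI-specific handling, then the core SDK."""
    generator = Generator(teacher_model=teacher_model, data=data)
    return run_process(generator.generate_data)


def main(argv: Optional[Sequence[str]] = None) -> List[str]:
    """CLI layer: parse options only, then hand off to the internal package."""
    parser = argparse.ArgumentParser(prog="ilab data generate")
    parser.add_argument("--teacher-model", required=True)  # hypothetical flag
    parser.add_argument("--data", required=True)  # hypothetical flag
    args = parser.parse_args(argv)
    return generate(teacher_model=args.teacher_model, data=args.data)
```

An SDK user would skip `main` and `generate` entirely and construct `Generator` directly, which is the point of routing the CLI through the same core class.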
## Scope of work

In upcoming releases, the InstructLab team should aim to:

1. Design the SDK given the structure above
2. Converse with library maintainers to negotiate user contracts
3. Begin work to re-architect how the CLI works using the SDK
4. Publish an alpha SDK for public consumption

After this initial work, the team can scope adding net-new functionality to the SDK that is not in the CLI.
