introduce document for instructlab-sdk #184
base: main
Conversation
Additionally, current consumers of InstructLab are finding themselves importing different library private and public APIs in combination with CLI functionality to achieve the workflows they want. While these more advanced usage patterns are not for everyone, providing ways to run bespoke and piecemeal workflows in a standardized and safe way for the community is starting to seem like a necessity.
Unifying these various ranges of advanced workflows under an overarching `InstructLab SDK API` will allow for new usage patterns and a clearer story on what InstructLab can and should provide as user accessible endpoints.
SDKs and APIs are two different things. I'm assuming here you meant SDK.
Suggested change:

Unifying these various ranges of advanced workflows under an overarching `InstructLab SDK` will allow for new usage patterns and a clearer story on what InstructLab can and should provide as user accessible endpoints.
good point, folks have been colloquially calling this "The SDK API for InstructLab", but I will use just SDK to keep this simple
1. Modularize the InstructLab workflow such that any part can be run independently
2. Allow users to choose whether or not to take advantage of the config/system-profile method of running InstructLab. Meaning they do not need any pre-existing configuration to run the SDK.
3. Standardize user contracts for the existing functionality of the InstructLab workflow. Existing CLI commands should be using the SDK once past click parsing, not separate code (see the sketch after this list).
Yes!
should we consider `core` to be the actual main library instead of having a subcomponent of `core`? Like `from instructlab import InstructLab`, and then have things like config init as part of that initialization.
4. Define contracts loosely enough that functionality can be expanded as more advanced features are released.
5. Document SDK usage in upcoming InstructLab releases.
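As a minimal sketch of goal 3 (hypothetical module and function names; nothing here is the final design), the CLI layer would reduce to click parsing plus a delegation call into the SDK:

```python
# Hypothetical sketch: the click command stops at option parsing and then
# delegates to the SDK. instructlab.core.data and the generate_data
# signature are assumptions, not settled API.
import click

from instructlab.core.data import generate_data  # assumed SDK entry point


@click.command()
@click.option("--taxonomy-path", type=click.Path(exists=True), required=True)
@click.option("--num-instructions", type=int, default=100)
def generate(taxonomy_path: str, num_instructions: int) -> None:
    """CLI wrapper: no business logic lives here, only option parsing."""
    generate_data(taxonomy_path=taxonomy_path, num_instructions=num_instructions)
```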
## Non-Goals |
I love that you've included this as part of the dev-doc.
thanks!
docs/sdk/instructlab-sdk.md
Outdated
```console
instructlab.api.data.list
instructlab.api.model.train
instructlab.api.model.evaluate
instructlab.api.model.chat (one shot request probably)
```
I feel like `ilab chat` today provides an unmatched experience for having a local chat with models. But we shouldn't expose this through the SDK. The reason is that we'd simply be wrapping around LlamaCPP or vLLM, all for a single message request. At that point, it would be faster for them to just load transformers directly and manually generate from the model.
In particular, whenever I'm testing models programmatically, my workflow looks like:
1. Spin up a vLLM server with the model I want
2. Run a script with whatever pre-existing messages I've written
The script itself is super simple too, like:
```python
import openai

# point an OpenAI-compatible client at the local vLLM server
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # the v1 client uses base_url, not api_base
    api_key="empty",
)

messages = [
    {"role": "system", "content": "You are an AI assistant"},
    {"role": "user", "content": "Why is Linux better than Windows?"},
]

# create() takes keyword arguments; the reply text lives on .message.content
response = client.chat.completions.create(messages=messages, model="my-model")
assistant_response = response.choices[0].message.content
```
So if you're already at the level of consuming an SDK, I feel like it's a very simple task, so exposing it might not give much benefit.
This is a good document overall and a step in the right direction.
Just one question on my end :) And it may be too far out of scope; feel free to ignore it!
docs/sdk/instructlab-sdk.md
Outdated
```console
instructlab.api.configuration.init
instructlab.api.taxonomy.diff
```
As this structure is what the current CLI has, I may be asking a larger question and won't be offended if you close this as resolved to handle it somewhere else. Would it make more sense for the `.taxonomy.diff` tree to be under `.data`, to eventually leave room for alternative forms of data sorting and ingestion into the pipeline (i.e., is taxonomy a subfield of data management overall, with SDG as another subfield)?
I think grouping taxonomy utilities under the general `InstructLab` class makes the most sense. I have updated the doc with a new design, lmk what you think
This seems like a good idea. A lot of work though, of course.
Overall looking good to me, a few comments
Is this really an SDK? Is this not just systematization of interfaces in Python modules, that is, Python APIs? For example, there is no "SDK" artifact to download.
This is actually about cleanly modularizing the codebase, isn't it?
docs/sdk/instructlab-sdk.md
Outdated
```python
instructlab.api.data.ingest(...)
data_paths = instructlab.api.data.generate(...)
some_custom_handling(data_paths)
instructlab.api.model.train(data_path=..., strategy=lab_multiphase)
```
Having `*.api.*` submodules makes this seem a bit like defining interfaces, which isn't really a pattern in Python. What does having `.api.` subpackages provide? Would they provide some sort of stability guarantee?
I am actually thinking of renaming this `.core....` since this isn't an API. Grouping the code that we want folks to publicly consume is a common trend.

The goal of this is to give folks who are currently haphazardly consuming different parts of InstructLab a centralized "core" of code that is versioned and static. What we call it, I don't feel strongly about, but it is pretty close to an SDK I believe. If there is some sort of standardization we need to fall into, we can. Other projects like llama stack do something similar, but they do package it separately. We could scope packaging it separately but having it live within the
```console
instructlab.core.system.info
instructlab.core.rag.ingest
instructlab.core.rag.convert
```
The intent with the Python SDK should be to serve the data scientists and AI engineers doing experiments as the primary use, then the constructs required for embedding or extending InstructLab from other products and platforms.

For a data science experience, I propose taking inspiration from the high-level structure that sklearn, keras, or pytorch provide for this persona.

For example, what would it take to have an experience similar to:
```python
from instructlab import InstructLab, SDG, Train, Eval

ilab = InstructLab(taxonomy=<path_to_taxonomy>)  # on initialization, the taxonomy is diffed & loaded

# like Pandas DataFrame.describe()
ilab.describe()  # prints a summary of the taxonomy diff and attributes

ilab_df = ilab.data.ingest()  # ingest the data from the taxonomy and return a dataset as a Pandas DataFrame

oa_client = {
    "url": "https://api.openactive.io/v1",
    "api_key": "YOUR_API_KEY",
}

##
# SDG interactions
##
ilab_sdg = SDG(
    client=oa_client,
    data=ilab_df,
    teacher_model={<model_and_attributes>},
)  # ilab SDG class

ilab_sdg.load_pipeline(<path_to_pipeline>)  # load a pipeline from a file

for block in ilab_sdg.pipeline.get_blocks():
    # do block customization
    ilab_sdg.pipeline[block].something()
    ilab_sdg.pipeline[block].block.foo = "bar"

ilab_sdg.run(
    callback=<callback_function>,  # called after executing each block (e.g. to report progress, or save intermediate results)
)  # generate the SDG data

##
# Training interactions
##
ilab_train = Train(
    dataset=ilab_sdg.data,
    student_model={<model_and_attributes>},
)  # ilab Train class

ilab_train.run(
    callback=<callback_function>,  # called after each cycle/iteration/epoch (e.g. to report progress, save intermediate results, or stop the training if certain stopping criteria are met)
)  # execute the training loop

ilab_train.model_save(<path_to_save>, resolution={}, model_format={gguf|safe_tensors|onnx|etc})  # save the model to a file in the specified format

##
# Evaluation
##
ilab_eval = Eval(
    dataset=[<path_to_eval_dataset>],
    endpoints=[<path_to_endpoints_for_eval>],
)  # ilab Eval class

ilab_eval.evals(
    [list_of_evaluations],  # a list of evaluations to run (e.g. accuracy, precision, dk-bench, etc.)
)

ilab_eval.run(
    callback=<callback_function>,  # called after each evaluation is run (e.g. to report progress, or save intermediate results)
)

ilab_eval.summary()  # print a summary of the evaluation results
ilab_eval.save(<path_to_save>, format={jsonl|parquet|csv|xls})  # save the evaluation results to a file
ilab_eval.export_report(<path_to_save>, format={html|pdf})  # report with the evaluation results
```
I would propose focusing on accelerating the goal of the persona (the data scientist), and not on mapping or exposing the InstructLab internal architecture.
I think this is possibly a good end goal, but it would require significant re-architecture of InstructLab and the libraries (what they expose).
For the purposes of this dev-doc, I can incorporate some of this into what I am proposing, but for an alpha SDK, keeping the interactions as simple yet expandable as possible is my goal.
So we should aim to not require library adjustments/changes at first, and then we can add functionality once the structure is in place.
Let me incorporate some of this and I will update the PR
I think @williamcaban has a point here about the persona we're building this SDK for. Who are we making this SDK for? Data scientists? People trying to build REST-API-based InstructLab services? The design of the SDK might vary based on who our target user of the SDK would be.
It's definitely important to understand the persona and goal of the SDK. William is proposing a fairly large re-architecting of the APIs, while this dev doc seems mostly focused on exposing existing CLI flows and functionality via a Python SDK. Is our goal to expose the existing end-to-end flow via a Python SDK? Or to provide a new way to interact with various InstructLab components that's more granular and flexible in how you compose your own end-to-end workflow from the pieces we're exposing?
I am making some updates to the PR (some are already up), aiming to meld the two approaches a bit, lmk what you think
I think the updates look like a reasonable enough place for work to start.

The actual Python code and list of APIs is only illustrative, right? From reading this, it's my understanding that the actual parameters and methods on individual classes shown here are just an example, since you call out that future work is to design the actual SDK based on the structure above and negotiate user contracts with library maintainers. For example, if SDG points out that the `Data.Generator` constructor needs different params, or that we have to actually use the taxonomy diff results as input to data generation as opposed to just checking if there is a diff, this kind of thing can be ironed out later?
One additional point I want to raise - what we definitely don't want is to be exposing pure library functionalities through a class in the Core repo - users that want that should just be importing the libraries directly.

As Charlie notes above, what we want here is an SDK for the opinionated `ilab` workflow that this package is centered around - for example, our Data class wouldn't expose every public function in the SDG library, but it would expose functions we have within `src/instructlab/data` such as `generate_data` and `list_data`.
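A minimal sketch of that boundary, using only the operations named in this thread (the signatures are assumptions):

```python
# Hypothetical sketch: the SDK's Data class exposes the ilab-workflow
# operations only, not the whole SDG library surface.
class Data:
    def __init__(self, cfg):
        self.cfg = cfg  # InstructLab config context

    def generate_data(self, **kwargs):
        """Delegate to SDG's public generate_data entry point."""
        ...

    def list_data(self, **kwargs):
        """List previously generated datasets."""
        ...
```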
This is a great idea. We should incorporate real experience from current consumers into the design of the Python public API.
docs/sdk/instructlab-sdk.md
Outdated
# InstructLab SDK |
probably obvious to you, but consumers would be better served if we refer to and publish the SDK as a "Python" SDK (to contrast with whatever other bindings we could have in the future).
Today, the only way to "drive" the InstructLab opinionated workflow is via the `ilab` CLI. While this process provides a succinct way for everyday users to initialize a config, generate synthetic data, train a model, and evaluate it, the guardrails are quite limiting both in what a user can do and what the development team can add as features exposed directly to the user over time.
Additionally, current consumers of InstructLab are finding themselves importing different library private and public APIs in combination with CLI functionality to achieve the workflows they want. While these more advanced usage patterns are not for everyone, providing ways to run bespoke and piecemeal workflows in a standardized and safe way for the community is a necessity.
What's the mechanism to collect these real use patterns from external consumers, to make sure their needs are properly served?
good question. Most of our usage information is coming from internal use cases combined with reports of what people are asking for colloquially. We have no request gathering system as of yet
Understood. I'd advise then to at least have several interview sessions with potential consumers, collecting requests / feedback / pain points.
docs/sdk/instructlab-sdk.md
Outdated
Unifying these various ranges of advanced workflows under an overarching `InstructLab SDK API` will allow for new usage patterns and a clearer story on what InstructLab can and should provide as user accessible endpoints.
While each library can and _should_ have their own publicly accessible API, not all functionality being added to SDG, Training, and Eval needs to be correlated directly to the "InstructLab workflow". This Python SDK should, as the CLI does, expose an opinionated flow that uses functionality from the various libraries. The InstructLab SDK should be derived from the library APIs, not the other way around. SDG, for example, currently has a `generate_data` method, meant to only be accessed by InstructLab. This method simply calls other publicly available SDG functionality. Orchestration of the InstructLab flow like this should not be of concern to the individual libraries and instead be handled by the overarching InstructLab SDK which will maintain the user contracts. The InstructLab SDK will need to work within the bounds of what the libraries expose as public APIs.
(Thinking while reading) "an opinionated" (but perhaps LESS opinionated?) flow.
within the bounds of what the Libraries expose as public APIs
What's the current mechanism in which libraries communicate which APIs are public and which are not? Or is the definition of a "public API" here, loosely, "whatever ilab already pulls from the library"? Or, alternatively, that will be the result of "contract negotiation" that you mention below?
most libraries have made APIs over time and are just now standardizing what they want public vs private. Part of the process of building this SDK will be making final decisions on that front.
In regard to opinionated: yes, this will be less opinionated and more user driven
I think the CLI will always be more "opinionated" than the SDK - in terms of functionality, we're exposing the same functionality to both "consumers" but the nature of the different interfaces will likely always allow for more flexibility with the SDK (since it's easier to orchestrate Python classes than CLI commands)
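To make the doc's "derived from library APIs" point concrete, here is a hedged sketch of the orchestration layer (the wrapper function, import path, and argument names are assumptions; only `generate_data` is named in this doc as an existing SDG entry point):

```python
# Hypothetical sketch: the SDK composes the libraries' public APIs; the
# libraries carry no knowledge of the end-to-end InstructLab flow.
from instructlab.sdg import generate_data  # assumed import path


def generate(taxonomy_path: str, output_dir: str) -> None:
    """SDK-level orchestration: InstructLab-specific defaulting and
    validation happen here, not inside the SDG library."""
    generate_data(taxonomy=taxonomy_path, output_dir=output_dir)  # illustrative args
```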
docs/sdk/instructlab-sdk.md
Outdated
```python
data_paths = instructlab.core.data.generate(...)
some_custom_handling(data_paths)
instructlab.core.model.train(data_path=..., strategy=lab_multiphase)
```
Not an issue with the proposal itself since the snippet is only illustrative, but I'd consider whether ilab SDK should promote using file paths for any artifacts. I'd rather have ilab encapsulate the knowledge about storage options (for models and datasets) and then allow SDK users to refer to the artifacts by their names. (This would be in line with the spirit of instructlab/instructlab#1832 and some other issues under instructlab/instructlab#1871)
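A purely hypothetical illustration of the name-based pattern suggested here (none of these helpers exist today):

```python
# Hypothetical sketch only: the SDK resolves artifacts by name through an
# internal storage layer, so SDK users never pass raw file paths around.
model = ilab.models.get("granite-7b-lab")         # assumed lookup helper
dataset = ilab.datasets.get("generated-2024-06")  # assumed lookup helper
ilab.model.train(model=model, dataset=dataset)    # assumed training call
```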
docs/sdk/instructlab-sdk.md
Outdated
3. `src/instructlab/process.py`
4. `src/instructlab/api/data/generate.py`

So generally the flow is: cli -> internal handling package -> process management package -> api -> library code.
api -> core? (Though perhaps "api" is a better name for a module that is meant to be consumed externally. That said, API may also refer to the upcoming REST API initiative, so perhaps it's better to avoid this word altogether. `public` or `sdk` would be another option with clear intent.) Whatever the final name, these should be unified to use just one across the doc.
`core` is my preference, personally
docs/sdk/instructlab-sdk.md
Outdated
The flow of the CLI today is such that the cli package for a command (`src/instructlab/cli/data/generate.py`) parses the command line options and passes control to the core code (`/src/instructlab/data/generate.py`). This then (if applicable) yields control to the `process` package, which kicks off necessary processes for the given command.
The difference with an SDK is that we would eventually want to end up executing `api/data/generate.py`, the actual publicly consumable Python SDK. This will ensure that the CLI can do whatever custom handling it needs to do on top, but eventually it must boil down to the `api` package, which uses publicly available methods from the various libraries.
If I understand this right, `src/instructlab/data/generate.py` would just kick off processes using the code in `api/data/generate.py`? Is there a reason that couldn't be done in `cli/data/generate.py`? Is there a reason we need to go from 3 files -> 4 to do the same thing?
@cdoern talked about this in-person - basically this decision boils down to whether or not we need process management for the SDK
@cdoern I want to revisit and re-emphasize the point I raised earlier. I really do not see how this is an SDK. This is about Python API design and modularization. The act of logically grouping operations, giving them good names, and understandable function signatures - of defining the "nouns" and "verbs" in the application - is an act of... wait for it... domain modeling! The work being proposed here is fundamentally a modeling and modularization effort - it is not about creating an SDK.
docs/sdk/instructlab-sdk.md
Outdated
```console
instructlab.core.InstructLab
instructlab.core.InstructLab.taxonomy.diff
instructlab.core.InstructLab.system.info
instructlab.core.InstructLab.config.show
instructlab.core.InstructLab.config.get
```
To illustrate my point above, I would prefer:

```console
instructlab.core.Taxonomy.diff
instructlab.core.System.info
instructlab.core.Config.show
instructlab.core.Config.get
```
This could be accomplished with `__init__.py` module visibility too.
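For example, a sketch of what that `__init__.py` re-export could look like (module names assumed):

```python
# instructlab/core/__init__.py -- hypothetical sketch: the package's public
# surface is whatever is re-exported here; everything else stays internal.
from .config import Config
from .system import System
from .taxonomy import Taxonomy

__all__ = ["Config", "System", "Taxonomy"]
```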
just as a note, these classes are departing from the structure of the other ones, where it's `Model.Trainer.train_model`, for example. Is this something we are ok with?
Is this a departure from that? I don't see how
There are certain things that only exist in `ilab` currently, and functionality that is going to be moving there, such as data ingestion, RAG, etc. Forming an SDK for `instructlab` allows us to capture all of these concerns under one API.
These endpoints in combination with the curated InstructLab Config File will open up these workflows to users and allow InstructLab to be easily incorporated into other projects. Allowing people to run things like data generation, and full fine-tuning via an SDK that pulls in their pre-existing config.yaml but also can be run independently will open new avenues for InstructLab adoption and extensibility.
Suggested change:

These endpoints in combination with the curated InstructLab Config File will open up these workflows to users and allow InstructLab to be easily incorporated into other projects. Allowing people to run things like data generation, and full fine-tuning via an SDK that pulls in their pre-existing `config.yaml` but also can be run independently will open new avenues for InstructLab adoption and extensibility.
docs/sdk/instructlab-sdk.md
Outdated
## Changes to the CLI

The `ilab` CLI will need to adapt to this new structure. Commands like `ilab data generate` should, in terms of code, follow this flow:

1. `src/instructlab/cli/data/generate.py`
2. `src/instructlab/data/generate.py`
3. `src/instructlab/process.py`
4. `src/instructlab/core/data/generator.py`

So generally the flow is: cli -> internal handling package -> process management package -> core SDK -> library code.

The flow of the CLI today is such that the cli package for a command (`src/instructlab/cli/data/generate.py`) parses the command line options and passes control to the core code (`/src/instructlab/data/generate.py`). This then (if applicable) yields control to the `process` package, which kicks off necessary processes for the given command.

The difference with an SDK is that we would eventually want to end up executing `core/data/generator.py`, the actual publicly consumable Python SDK. This will ensure that the CLI can do whatever custom handling it needs to do on top, but eventually it must boil down to the `core` package, which uses publicly available methods from the various libraries.
Would it not make more sense for both the CLI and "Core" to be consuming from the generation and process logic separately? Why couple the CLI with the SDK at all?
It's possible we'd want this in some situations, but I think it's a good design decision to plop the CLI on top of the SDK as much as possible. For one thing, it minimizes duplicate code doing similar things. Another good thing is that we'd be testing our own SDK by wrapping it- testing the CLI would also be (only slightly indirectly) testing the SDK.
I don't think core should consume process management at all. Vice versa, the CLI - through the process management layer, if needed - will consume the SDK. The user of the SDK should not be required to use our particular process management model. (Nor do I think the process management module is interesting to general consumers per se as a separate library. Prefer keeping it private to the ilab CLI flow.)
Hmmm it's an interesting point @JamesKunstle @booxter - but isn't the difference here that the CLI and the SDK are consuming the same logic but exposing it in different ways, i.e. via the command line versus importable classes? I suppose the CLI could utilize the SDK classes, but at that point it seems to me that the logic itself could simply expose importable classes and that would cover the use case?
Indeed the process management piece shapes a lot of this conversation
As far as I understand so far, process management is applicable where we anticipate long-running processes (generate, train, etc.). Some commands are like that and will go through the chain of calls as:
click -> (cli handler) - ...schedules process... -> process management -> ...spawns process... -> (process handler) -> sdk -> libraries
while for processes that don't need background execution, it would be:
click -> (cli handler) -> sdk -> libraries
In the scheme above, sdk never consumes process management (because it's not expressed in terms of separate async processes).
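A hedged sketch of how a CLI handler could branch between those two chains (module and function names are assumptions):

```python
# Hypothetical sketch: only the CLI layer knows about process management;
# in both chains the SDK itself is a plain synchronous call.
from instructlab import core, process  # assumed module names


def cli_data_generate(opts) -> None:
    if opts.detach:
        # long-running: schedule a background process that itself calls the SDK
        process.schedule(target=core.data.generate_data, kwargs=vars(opts))
    else:
        # short-lived: call the SDK directly, no process layer involved
        core.data.generate_data(**vars(opts))
```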
docs/sdk/instructlab-sdk.md
Outdated
```python
data_path = data_ingestor.ingest()
# not in v1alpha1

sdg_client = Data.Generator(client=openai_compat_client, data=data_path, teacher_model=path_to_model)
```
It'd be really cool if we could use the vLLM offline inference features in an SDK scenario like this. My understanding is that that's the highest throughput calling pattern if requests are well-organized to begin with.
I really like the design changes represented in this document.
One consideration I'd raise as we proceed is that we may want to avoid the object-oriented pattern that the HF Trainer object is representative of, and prefer a functional programming convention.
The ravioli-code "Trainer" / "Generator" classes are a problematic design pattern elsewhere in this problem space. `instructlab/training` has avoided duplicating this convention because it's harder to reason about and encourages large procedural instance methods.
Instead, it's more readable, hackable, and composable to have an SDK expose functional building blocks w/ attention to the interfaces informing the call.
For illustration, why ever use:
```python
trainer = Trainer(
    data=data,
    model=model,
    epochs=5,
    lr=1e-4,
    ...
)
trainer.train()
```
when one could just run:
```python
train(
    data=data,
    model=model,
    epochs=5,
    lr=1e-4,
    ...
)
```
Agreed with @JamesKunstle on NOT adopting object-oriented primitives to describe process flow (the ilab methodology being a pipeline of processes with - hopefully pure - inputs and outputs). (As an anecdote: we have a Java-style RAG implementation in tree, with factories and all, which is very hard to follow and unravel due to all the layers of indirection, and I don't think this is a good pattern to continue in this code base, esp. for an "SDK".)

@anastasds I wonder what makes something an SDK (SDK being Software Development Kit) if not a "pack of compute primitives meant to be consumed externally", which this proposal is(?) (Obviously, proper modeling is part of any good API and SDK design; but this seems wider.) AFAIU a primary problem being solved here is "current consumers of InstructLab are finding themselves importing different library private and public APIs", which is being solved by giving these consumers something to consume that is meant to be consumed. This is an SDK.
@booxter in that general of a sense, every source codebase is an SDK.
@booxter @JamesKunstle given the consumption patterns of who is going to use this, while also trying to the best of our ability to align with InstructLab's core values, here is what I propose: rather than a "spaghetti" structure like:

```python
from instructlab.core import Data

sdg_client = Data.Generator(<args>)
sdg_client.generate()
```

I propose instead to at least do:

```python
from instructlab.core import Data, Config

config_object = Config.init(auto_detect=True)
data_client = Data(taxonomy_path="", cfg=config_object)
data_client.generate_data()
```

I think this structure at a minimum is what we need for the following reasons: in general, usability, searchability (is that a word?), and our inheritance structure would suffer from not having at least top-level command classes. @nathan-weinberg do you agree with the above?
Indeed - I think from a usability perspective it makes sense that the SDK classes would align with the CLI command groups - ultimately we're exposing the same (if not more) functionality through a different interface, but keeping the naming conventions of our original interface (the CLI) is a good thing IMO. I also echo what @JamesKunstle says above about having sub-classes like
+1. Keeping the existing click command group hierarchy reflected in the "SDK" while avoiding popping up classes is a good balance. Represent entities with objects and processes with methods / functions (not with a "factory" to execute processes or to produce objects).

On the question of whether something is an SDK or not - the main point is the intent. If the intent is for external consumers to take it and integrate with it, then it's an SDK. Whether it's shipped in the same artifact is an implementation detail. That said, as long as the name for this "thing" is not just "API" (which would be ambiguous with the upcoming REST API), but e.g. "Python API", then I am ok with renaming too.
This is correctly directed. We can flesh out details (verbiage etc.) during the development phase. What I'd encourage is to always refer to this as the Python API/SDK going forward.
`instructlab.core` contains all SDK definitions. Users can `from instructlab.core import ...` to use specific SDK classes.
For most of the existing InstructLab command groups, there should be a class:
The actual groups are:

```console
Commands:
  config    Command group for interacting with the config of InstructLab.
  data      Command group for interacting with the data generated by...
  model     Command group for interacting with the models in InstructLab.
  process   Command group for interacting with the processes run by...
  rag       Command group for interacting with the RAG for InstructLab.
  system    Command group for all system-related command calls.
  taxonomy  Command group for interacting with the taxonomy of InstructLab.
```

I don't think Eval (process) makes sense as a class. What makes sense is `Model.eval()`, though.

Note each click command group is tied to a "thing" (object), not a verb (except RAG? Which perhaps should've been moved under `data`?).

This is probably just an oversight though, since I don't see Eval listed in the examples below, only in line 60.
```python
openai_compat_client = some_server()

data_jsonls = data_client.generate_data(client=openai_compat_client, data=data_path)
```
I wonder how you distinguish between parameters passed into verb methods (`client`, `data` here) and parameters passed to `class.__init__` (line 100).
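One hedged convention that would answer this (illustrative only; not settled anywhere in this doc): durable context goes to `__init__`, per-invocation inputs go to the verb method:

```python
# Hypothetical convention sketch: construction captures long-lived context,
# while the verb method takes only what varies per call.
data_client = Data(taxonomy_path="taxonomy/", cfg=config_object)  # durable context

# per-call inputs: which server to talk to and which data to feed it
data_jsonls = data_client.generate_data(client=openai_compat_client, data=data_path)
```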
The internal handling package is necessary as it allows us to split off a sub-process when it makes the most sense for us, before calling the library code directly. This is how the CLI works today.
The difference with an SDK is that we would eventually want to end up executing `core/data/generator.py`, the actual publicly consumable Python SDK. This will ensure that the CLI can do whatever custom handling it needs to do on top, but eventually it must boil down to the `core` package, which uses publicly available methods from the various libraries.
Good goal. Eventually we can programmatically enforce that no library imports happen except under the sdk package.
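As a hedged sketch of that enforcement (package layout and library names are assumptions), a test could walk the tree and assert that library imports only appear under `core/`:

```python
# Hypothetical sketch: fail CI if the wrapped libraries are imported
# anywhere outside the core/ (SDK) package.
import ast
import pathlib

ALLOWED = pathlib.Path("src/instructlab/core")
LIBS = ("instructlab.sdg", "instructlab.training", "instructlab.eval")


def test_libraries_only_imported_under_core():
    for path in pathlib.Path("src/instructlab").rglob("*.py"):
        if ALLOWED in path.parents:
            continue  # imports under core/ are allowed
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                assert not name.startswith(LIBS), f"{path} imports {name}"
```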