LLMBlock concurrency #157

Merged: 7 commits merged into instructlab:main on Jul 19, 2024

Conversation

gabe-l-hart (Contributor)

Description

Supports #135

This PR ports the parallel generation PR from the research fork and adds unit tests plus arg plumbing up to generate_data.

@mergify bot added the testing (Relates to testing) label on Jul 16, 2024
@markmc (Contributor) commented Jul 17, 2024

I'd like to see info like below (from aakankshaduggal#8) included in one of the commit messages

Overview

The current implementation of LLMBlock sends the entire dataset as a single large batch of requests to the OpenAI server. This may lead to some requests waiting too long for the response, resulting in timeout errors and potentially overloading the backend server with extremely large batches.

Proposed Changes

Use concurrent processing via Python’s concurrent.futures package in LLMBlock. The key changes are:
  • Uses concurrent.futures with a thread pool to launch and manage parallel tasks.
  • Allows users to specify the number of requests to send in each batch.
  • Allows users to specify the number of concurrent worker threads that handle batches.

Example Usage

If the user sets the concurrency to 8 and the batch size to 32, the system will run 8 concurrent threads, each sending a batch of 32 prompts, resulting in a total of 256 requests processed simultaneously by the backend server.
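For illustration, here is a minimal sketch of the pattern being described (not the actual LLMBlock code; generate_batch stands in for the real per-batch request to the OpenAI-compatible server):

from concurrent.futures import ThreadPoolExecutor, as_completed

def split_into_batches(rows, batch_size):
    # Yield successive fixed-size slices of the input rows
    for start in range(0, len(rows), batch_size):
        yield rows[start : start + batch_size]

def generate_batch(batch):
    # Placeholder for the real per-batch call to the server
    return [f"completion for {row}" for row in batch]

def generate_all(rows, batch_size=32, num_workers=8):
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(generate_batch, batch)
            for batch in split_into_batches(rows, batch_size)
        ]
        for future in as_completed(futures):
            results.extend(future.result())
    return results

With batch_size=32 and num_workers=8, at most 8 batches (256 prompts) are in flight at once, matching the example above. Note that as_completed yields results out of input order, which is why the PR's diff (further down) writes each result into an indexed output_splits list.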

@markmc (Contributor) left a comment

Thanks, Gabe!

src/instructlab/sdg/generate_data.py (resolved review thread)
@@ -290,6 +311,8 @@ def generate_data(
model_family,
model_name,
num_instructions_to_generate,
num_workers=num_cpus,
batch_size=batch_size,
Contributor

Since no concurrency is used unless batch size is set ... I feel like we need a sensible default here. We certainly shouldn't require users to set it to get sensible behavior

Contributor Author

Hm, good point. This could get tricky to estimate. I think the primary elements of a heuristic would be expected memory overhead per batch element (likely low for IO-bound work), and expected concurrency limits on the client-side for IO-bound work. I'm not sure we would have any reasonable way of knowing these though. I have much less familiarity with the "real" workload, though. Is there some kind of fixed number that we think would make sense?

The other concern I have about enabling concurrency by default is the possibility for the sharding of the dataset to actually change the results. I think most blocks will treat rows of the Dataset as independent entities and process them as such, but I could certainly imagine a case where the logic of a block does not treat the rows as independent. In this case, enabling concurrency might degrade the overall results if the individual block executions have less full context to work with.

Contributor

Our users certainly aren't going to be able to do a better job of judging these questions (whether sharding is safe, what batch size to use).

If we can't come up with a sane default, it should be in the pipeline config somehow

@shivchander @xukai92 @aakankshaduggal @npalaska we need your help making a call on an appropriate default batch size or whether it needs to be configured per-pipeline or per-block

Contributor

From @shivchander

we did some tests with the vllm endpoint and found that low batch size and increasing the number of workers worked best from a throughput perspective. So I’d recommend a batch size of 8 and setting num workers to num cpus.

Contributor

Further from @shivchander - it can be tricky to debug with this concurrency enabled, so would be nice to have an easy way to disable this for debugging (as a runtime parameter, not a pipeline config). It might make sense for --num-cpus=1 to disable this?

Contributor Author

Ok, so it sounds like we want:

  • batch_size=8
  • num_cpus=None (<- falls through to the Python default for ThreadPoolExecutor when max_workers=None)
  • If the user sets num_cpus=1, we fully disable using ThreadPoolExecutor
    • This one can get annoying since there is no out-of-the-box SynchronousExecutor that offers the same API. I wrote one in a different project that we can probably borrow (here); a minimal sketch of the idea is included below. The alternative is to maintain separate code paths for the synchronous version, which can be pretty error-prone.
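For reference, a minimal sketch of such a SynchronousExecutor (an assumption about the shape of the borrowed class, covering only the slice of the concurrent.futures.Executor API the batching code needs: context management and submit):

from concurrent.futures import Future

class SynchronousExecutor:
    """Stand-in for ThreadPoolExecutor that runs submitted work inline."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False

    def submit(self, fn, *args, **kwargs):
        # Run immediately in the calling thread, but wrap the outcome in a
        # Future so callers keep the same result()/exception() API
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as err:
            future.set_exception(err)
        return future

Swapping this in when num_cpus=1 keeps a single code path while making execution strictly sequential and easy to step through in a debugger.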

Contributor

Yep - another reason would be that for smaller datasets and small knowledge sources (say a 1-page document), batching is overkill and would probably end up taking more time due to I/O and regrouping. So I would lean towards --num-cpus=1 as the default behavior and triggering the batching only when needed

My sense is that this will inevitably lead to users experiencing sub-par performance, complaining (or just walking away), and us telling them "oh, you need this go-faster flag"

I don't think we should be taking that route out of a concern that it might be a little slower with small datasets

Unless I'm misunderstanding something, I think we should enable the batching by default

Member

That's a fair point - I'm with you on making the distributed setting the default

batch_size = 8
num_cpus =  mp.cpu_count()

Contributor

Also, just a note: The combination of batch size and concurrency also depends on the underlying accelerator being used and how the accelerators can handle the batches. (The total number of batches the server can handle depends on the compute profile and the memory available on the accelerator).

The general consensus is to have smaller batches per thread and maximize the number of threads to the point where we are not adding a lot of requests to the backend vLLM queue. As the requests start to queue up on the backend vLLM server, the time to first token will shoot up and some requests will result in timeout errors. However, having a small batch means only a few requests get penalized in case of an error.

For example, using a batch of 8 and num_workers 32 (256 total concurrent requests) on 4x H100s yielded the best performance for me.

Contributor Author

Thanks for the input! This makes sense and absolutely lines up with my assumptions about why a heuristic for this would be tough to nail down. Since we do have some strong ideas about the default environment where this would run, I think batch_size = 8 and num_cpus = None (thereby using the python default of min(32, os.cpu_count() + 4)) is probably a safe place to start.
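For context, that default is documented ThreadPoolExecutor behavior on Python 3.8+; roughly:

import os
from concurrent.futures import ThreadPoolExecutor

# With max_workers=None, ThreadPoolExecutor uses min(32, os.cpu_count() + 4)
# threads (Python 3.8+): plenty of concurrency for IO-bound request fan-out,
# but capped so machines with many cores don't spawn hundreds of threads.
print("effective default:", min(32, (os.cpu_count() or 1) + 4))

with ThreadPoolExecutor(max_workers=None) as executor:
    pass  # uses the same worker count as the expression above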

src/instructlab/sdg/sdg.py (3 resolved review threads)
tests/test_sdg.py (resolved review thread)
input_splits = self._split_dataset(dataset)
output_splits = [None] * len(input_splits)

with ThreadPoolExecutor(max_workers=self.num_workers) as executor:
Contributor

This has me wondering:

All threads enqueued to ThreadPoolExecutor will be joined before the interpreter can exit. Note that the exit handler which does this is executed before any exit handlers added using atexit. This means exceptions in the main thread must be caught and handled in order to signal threads to exit gracefully. For this reason, it is recommended that ThreadPoolExecutor not be used for long-running tasks.

Could you experiment with killing the parent while the pipeline is running, to check the behavior is graceful?

On the other hand, if we used ProcessPoolExecutor, I suspect we might run into pickling issues like I fixed in commit 7cfbaa9

Contributor Author

Yeah, ProcessPoolExecutor would very likely cause pickling problems. I thought about making it configurable, but decided against it to keep it simple. Probably worth a simple experiment to see if a Dataset can easily pass between processes (my money is on "no").

Contributor Author

Ok, looks like Dataset itself is actually pretty happy to pass between processes:

tmp.py

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from datasets import Dataset
import os

ds = Dataset.from_list([{"foo": 1}, {"foo": 2}])

pool_type = ThreadPoolExecutor if int(os.getenv("THREADS", "0")) else ProcessPoolExecutor

def doit(dataset):
    return dataset.map(lambda r: {"foo": r["foo"] * 2})

if __name__ == "__main__":
    print(f"Pool Type: {pool_type}")
    with pool_type() as pool:
        fut = pool.submit(doit, ds)
        print(fut.result().to_list())
?> python tmp.py
Pool Type: <class 'concurrent.futures.process.ProcessPoolExecutor'>
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2649.59 examples/s]
[{'foo': 2}, {'foo': 4}]

?> THREADS=1 python tmp.py
Pool Type: <class 'concurrent.futures.thread.ThreadPoolExecutor'>
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1261.26 examples/s]
[{'foo': 2}, {'foo': 4}]

Contributor Author

I do suspect that if we put any non-primitives into the Dataset (not sure if this is even allowed based on using pyarrow behind the scenes), we'd hit pickling problems like you did with SSLContext.

Contributor Author

Could you experiment with killing the parent while the pipeline is running, to check the behavior is graceful?

I was thinking about this a little. I'll see what I can do to drum up a full test of this. I suspect the behavior would be to block termination (short of SIGKILL) until all non-error threads have fully terminated.

In caikit, we built a wrapper (DestroyableThread) to handle this so that a given piece of work running in a thread could be cancelled without destroying the thread itself from the pool. This is probably overkill since caikit is designed to be a long-running server and a single dead task should not cause harm to the central thread pool. In the case of SDG, it could still be useful to signal that other future shards should stop early if one of the shards has failed, though I suspect that a simple SIGKILL to the parent process would do the trick since this will all be driven by the CLI.
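For what it's worth, one lightweight way to get "stop the remaining shards early if one fails" with plain concurrent.futures (a sketch of the idea, not what this PR implements; requires Python 3.9+ for cancel_futures) is:

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_shards(fn, shards, max_workers=None):
    results = [None] * len(shards)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fn, shard): i for i, shard in enumerate(shards)}
        try:
            for future in as_completed(futures):
                results[futures[future]] = future.result()
        except Exception:
            # One shard failed: cancel anything not yet started so the pool
            # drains as quickly as possible (already-running shards still finish)
            executor.shutdown(cancel_futures=True)
            raise
    return results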

@gabe-l-hart (Contributor Author) Jul 17, 2024

Ok, as suspected, it looks like SIGTERM will wait for a "graceful" exit which means finishing all running futures, while SIGKILL will unceremoniously kill everything:

tmp.py
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from datasets import Dataset
import os
import time
import atexit
from functools import partial

pool_type = ThreadPoolExecutor if int(os.getenv("THREADS", "0")) else ProcessPoolExecutor

def doit(dataset, sleep_time=0, should_raise=False):
    if sleep_time:
        time.sleep(sleep_time)
    if should_raise:
        raise RuntimeError("Yikes")
    return dataset.map(lambda r: {"foo": r["foo"] * 2})

def end_report(futures):
    print("Done: " + str([f"{i}: {fut.done()}" for i, fut in enumerate(futures)]))
    print("Cancelled: " + str([f"{i}: {fut.cancelled()}" for i, fut in enumerate(futures)]))
    print("Exception: " + str([f"{i}: {fut.exception(timeout=0.01)}" for i, fut in enumerate(futures)]))

if __name__ == "__main__":
    print(f"Pool Type: {pool_type}")
    ds1 = Dataset.from_list([{"foo": 1}, {"foo": 2}])
    ds2 = Dataset.from_list([{"foo": 3}, {"foo": 4}])
    ds3 = Dataset.from_list([{"foo": 5}, {"foo": 6}])
    with pool_type() as pool:
        futures = [
            pool.submit(doit, ds, sleep_time=i * 3, should_raise=not i)
            for i, ds in enumerate([ds1, ds2, ds3])
        ]
        atexit.register(partial(end_report, futures))
        for future in as_completed(futures):
            print(future.result().to_list())

With SIGTERM

?> time THREADS=1 python tmp.py & proc=$! && kill $proc
[1] 24619
Pool Type: <class 'concurrent.futures.thread.ThreadPoolExecutor'>
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 857.29 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 818.96 examples/s]
Traceback (most recent call last):
  File "/Users/ghart/Projects/github/instructlab/sdg/tmp.py", line 34, in <module>
    print(future.result().to_list())
          ^^^^^^^^^^^^^^^
  File "/Users/ghart/mambaforge/envs/instructlab-sdg/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/ghart/mambaforge/envs/instructlab-sdg/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/ghart/mambaforge/envs/instructlab-sdg/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ghart/Projects/github/instructlab/sdg/tmp.py", line 14, in doit
    raise RuntimeError("Yikes")
RuntimeError: Yikes
Done: ['0: True', '1: True', '2: True']
Cancelled: ['0: False', '1: False', '2: False']
Exception: ['0: Yikes', '1: None', '2: None']

real	0m6.612s
user	0m0.741s
sys	0m3.233s

[1]+  Exit 1                  time THREADS=1 python tmp.py

With SIGKILL
(NOTE: This also kills time, but it all happens "instantly")

?> time THREADS=1 python tmp.py & proc=$! && kill -9 $proc
[1] 24510
[1]+  Killed: 9               time THREADS=1 python tmp.py

@gabe-l-hart (Contributor Author) Jul 17, 2024

Also, letting it all run, including the exception raised in the first future, waits for the others to complete (as it should):

?> time THREADS=1 python tmp.py 
Pool Type: <class 'concurrent.futures.thread.ThreadPoolExecutor'>
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1006.07 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 842.74 examples/s]
Traceback (most recent call last):
  File "/Users/ghart/Projects/github/instructlab/sdg/tmp.py", line 34, in <module>
    print(future.result().to_list())
          ^^^^^^^^^^^^^^^
  File "/Users/ghart/mambaforge/envs/instructlab-sdg/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/ghart/mambaforge/envs/instructlab-sdg/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/ghart/mambaforge/envs/instructlab-sdg/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ghart/Projects/github/instructlab/sdg/tmp.py", line 14, in doit
    raise RuntimeError("Yikes")
RuntimeError: Yikes
Done: ['0: True', '1: True', '2: True']
Cancelled: ['0: False', '1: False', '2: False']
Exception: ['0: Yikes', '1: None', '2: None']

real	0m6.584s
user	0m0.672s
sys	0m3.332s

Contributor

Ok, as suspected, it looks like SIGTERM will wait for a "graceful" exit which means finishing all running futures

Ok, thanks. I guess that works as a reasonable UX for the CLI

Member

@gabe-l-hart I am seeing inconsistent behaviour in the test above. It can be graceful for both term and kill, and it can also be ungraceful for both.

@markmc (Contributor) commented Jul 17, 2024

I'd like to see info like below (from aakankshaduggal#8) included in one of the commit messages

Also, one commit should be fine - please squash, and add:

Co-authored-by: Nikhil Palaskar <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Kai Xu <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>

@gabe-l-hart (Contributor Author)

Ah, very good call on the Co-Authored-By. Will definitely add that and do the squash rebase.

@gabe-l-hart force-pushed the LLMBlockConcurrency-135 branch from bd3cf54 to 7243cb5 on July 17, 2024 14:22
@derekhiggins (Contributor)

@gabe-l-hart is it expected that I would be able to run this with a locally running ilab server?

When I attempt to, I get "openai.InternalServerError: Service Unavailable"

I had set batch_size to 2 in the simple pipeline with 3 samples in the taxonomy

@gabe-l-hart (Contributor Author) commented Jul 17, 2024

@markmc @russellb @shivchander Question for you: I see that there is already a num_procs in PipelineContext (here). It looks like this is used in two places, both as an argument to Dataset.map. This goes down the rabbit hole of process-based parallelism in Dataset, which ends up in arrow_dataset.py where the value is actually used to spawn processes with multiprocess.Pool in several places. Having both num_procs and max_workers in that object being used for different concurrency methods will probably be pretty confusing to anyone looking at it, so I want to get your thoughts on the following options:

  1. Rename them to more explicit names that say what they're used for (e.g. max_batch_workers, dataset_map_num_procs) and keep them separate
  2. Have a single value that is used in both places, potentially resulting in 2X system threads being scheduled if all threads are occupied with batching AND a map operation is called with num_proc
  3. Try to figure out a way to centralize the resource pool (possibly as a ProcessPoolExecutor) and get map to lease those processes so we control the upper bound on concurrency explicitly, but I don't think that's possible with Dataset
  4. Don't do anything and just write comments explaining why they're both there

My inclination is to go with (1) for now, but I'm curious about your opinions.

@gabe-l-hart force-pushed the LLMBlockConcurrency-135 branch from 7243cb5 to d6eefca on July 17, 2024 17:30
@shivchander (Member) commented Jul 17, 2024

(Quoting @gabe-l-hart's num_procs / max_workers question above.)

I would also lean towards (1) - it seems like the path of least resistance

@gabe-l-hart force-pushed the LLMBlockConcurrency-135 branch 2 times, most recently from 782c284 to dab7222 on July 17, 2024 20:43
@gabe-l-hart (Contributor Author)

(Quoting @derekhiggins' question above about running against a locally running ilab server.)

@derekhiggins I don't think the batching should have any impact on this, but it's always possible there's a bug! Can you try the updated code and set the batch_size to 1? This should be equivalent to running with no batching. If that passes and running with batching fails, there may be an issue with the openai client and thread safety, which would be troubling.

@markmc (Contributor) commented Jul 18, 2024

I kept it in multiple atomic commits to show the history of the port plus refactor. I'm personally a bit of a "always preserve history" zealot, but I'm happy to conform to the community standards here

Community standards developing here: instructlab/dev-docs#110

e.g. for the error handling changes, I'm not a fan of keeping a long-term history of the 4 different iterations of this ... especially since the commit messages don't capture any of the "why" for the different approaches. But I do hugely value splitting logically separate changes into their own commits

That said, it's not such a big deal here ... it's not like this is a string of fix typo, update thing, etc. commits 👍

@markmc (Contributor) left a comment

Overall looks good, but I'd like to see what Derek finds with his testing before we merge

model_family: The family identifier for the model being updated
num_instructions_to_generate: The total number of instructions the user
wants to generate during this run
batch_size: The size of the dataset batches for parallel generation
Contributor

There's definitely going to be confusion here about this batching versus the batching in LLMBlock - see server_supports_batching, batch_kwargs, etc.

(I don't have an obvious and quick solution, so I'm fine with it for now)

batch_num_workers: Optional[int] = None

@property
def batching_enabled(self) -> bool:
Contributor

Another example of that confusion ... I want to move server_supports_batched from the OpenAI client to PipelineContext ... and that sounds an awful lot like "batching enabled"

Contributor Author

Yes, this is definitely confusing and requires realizing the client/server relationship between this SDG code and the server.
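To make the relationship concrete, here is a rough sketch of how the context fields discussed in this thread could fit together (only batch_num_workers and the batching_enabled name appear in the diff above; the other field names and the property body are assumptions about the intent that batching only kicks in when a batch size is set):

from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineContext:
    # ... client, model_family, model_id, num_instructions_to_generate ...
    dataset_num_procs: Optional[int] = None   # forwarded to Dataset.map(num_proc=...)
    batch_size: int = 0                       # 0 means "do not shard the dataset"
    batch_num_workers: Optional[int] = None   # max_workers for the ThreadPoolExecutor

    @property
    def batching_enabled(self) -> bool:
        # Assumed intent: concurrency is only used when a batch size is set
        return self.batch_size > 0

Keeping the two knobs under distinct names (option 1 above) avoids conflating thread-based request batching with Dataset.map's process-based parallelism.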

src/instructlab/sdg/pipeline.py (resolved review thread)
@derekhiggins (Contributor)

(Quoting the exchange above about the local ilab server and batch_size.)

Back several months ago the local server (started with "ilab serve") used to crash if multiple clients connected at the same time. Our solution was to restrict the server to only a single worker (rather than have it crash after somebody had been running a generate for multiple hours): https://github.com/instructlab/instructlab/blob/3b2e7916c38179c9749b8b293e1448aa380b3868/src/instructlab/model/backends/backends.py#L399

This restriction is still in place (and I believe it's the reason I'm seeing the Service Unavailable errors), i.e. the local server as started by ilab can only handle a single request at a time.

instructlab.sdg.pipeline.PipelineBlockError: PipelineBlockError(<class 'instructlab.sdg.llmblock.LLMBlock'>/gen_skill_freeform): Service Unavailable

@markmc (Contributor) commented Jul 18, 2024

(Quoting @derekhiggins' explanation above of the single-worker restriction in the ilab local server.)

Ok, and that is specific to the llama-cpp backend (get_uvicorn_config() is only used in llama_cpp.py) whereas this concurrency code was written with vllm in mind

@gabe-l-hart I think we need to reconcile this with server_supports_batched in LLMBlock - basically that code is detecting llama-cpp vs vllm

@markmc (Contributor) left a comment

Needs to be disabled for llama-cpp. Great catch, @derekhiggins!

@gabe-l-hart (Contributor Author)

Very good catch. I'll look into how to make the defaults conditional on the server.

@gabe-l-hart (Contributor Author)

@markmc Thanks for the pointer on the standards. I really like the articulation of commits as revertable atomic units (though my personal preference is "legible trail of what happened within reason"). I'll refactor the history to reflect this.

Also, as a side note, I think you're spot on that this really has to do with whether the repo is git-centric versus GitHub-centric. The lack of a good history view in GH, particularly one that supports --first-parent, has always been a huge sore spot for me!

@gabe-l-hart force-pushed the LLMBlockConcurrency-135 branch from dab7222 to 7cdfade on July 18, 2024 17:06
mergify bot commented Jul 18, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @gabe-l-hart please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Jul 18, 2024
gabe-l-hart and others added 2 commits July 18, 2024 12:48
Problem statement from [email protected]:

Overview

The current implementation of LLMBlock sends the entire dataset as a single
large batch of requests to the OpenAI server. This may lead to some
requests waiting too long for the response, resulting in timeout errors and
potentially overloading the backend server with extremely large batches.

Proposed Changes

Use concurrent processing via Python’s concurrent.futures package in
LLMBlock. The key changes are:

* Uses concurrent.futures with a thread pool to launch and manage
parallel tasks.
* Allows users to specify the number of requests to send in each batch.
* Allows users to specify the number of concurrent worker threads that
handle batches.

Example Usage

If the user sets the concurrency to 8 and the batch size to 32, the system
will run 8 concurrent threads, each sending a batch of 32 prompts,
resulting in a total of 256 requests processed simultaneously by the
backend server.


Ref: aakankshaduggal#6

instructlab#135

Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: Nikhil Palaskar <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Kai Xu <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
@gabe-l-hart force-pushed the LLMBlockConcurrency-135 branch from 7cdfade to afe12a6 on July 18, 2024 18:49
@mergify bot added the ci-failure label and removed the needs-rebase label on Jul 18, 2024
@gabe-l-hart (Contributor Author)

@derekhiggins @markmc It looks like the decision of which backend to use is made here in instructlab and by the time generate_data is called (here) the information about which backend is being used is lost (as it should be since SDG doesn't care). I think the cleanest way to support disabling concurrency for llama.cpp would be to put conditional logic in instructlab/data/generate.py to set the batch_num_workers argument correctly. We're going to need a PR there anyway to expose these as CLI args, so I could make the change there if that makes sense to you.
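A hypothetical sketch of what that conditional could look like (the helper name, backend strings, and defaults are placeholders, not the actual instructlab CLI code):

def _sdg_batch_kwargs(backend_name: str) -> dict:
    # llama-cpp as started by `ilab serve` handles one request at a time, so
    # disable dataset sharding entirely; vLLM can absorb many concurrent
    # requests, so use the defaults discussed in this PR.
    if backend_name == "llama-cpp":
        return {"batch_size": 0}
    return {"batch_size": 8, "batch_num_workers": None}

# generate_data(..., **_sdg_batch_kwargs(backend_name))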

@gabe-l-hart (Contributor Author)

Draft PR for changes in instructlab once this is merged: instructlab/instructlab#1797

@gabe-l-hart force-pushed the LLMBlockConcurrency-135 branch 2 times, most recently from 65005ff to 432a842 on July 18, 2024 23:09
@mergify bot added the ci-failure label on Jul 19, 2024
@markmc (Contributor) commented Jul 19, 2024

I've filed #174 to capture some of the stuff we've discussed here; I'm fine with capturing it as a TODO for the future

@derekhiggins (Contributor)

Tested with both llama_cpp and vllm: llama_cpp consistently hits the "Service Unavailable" issue described above, while vllm seems to work fine.

Do we want to address disabling concurrency with llama_cpp here, or address it later?

@markmc (Contributor) commented Jul 19, 2024

Derek also pointed out that CI isn't passing, my bad for not seeing that!

Do we want to address disabling concurrency with llama_cpp here, or address it later?

Something like, in generate_data():

    # FIXME: remove this when ilab knows to pass batch_size=0 with llama
    if batch_size is None:
        batch_size = 0

could be a good way to get all this staged, ready to be enabled safely (by deleting those lines) once the necessary ilab change merges

@derekhiggins (Contributor) commented Jul 19, 2024

Ya, the e2e job is reporting success but there is a traceback in the logs

TypeError: %d format: a real number is required, not NoneType

https://github.com/instructlab/sdg/actions/runs/9999879504/job/27641546924?pr=157#step:9:120

@derekhiggins (Contributor)

ok, there is nothing wrong with the CI job; the generate command spits out the traceback about the log string formatting and then continues and succeeds

@gabe-l-hart (Contributor Author)

🤦 bad logging. Will fix now

@gabe-l-hart force-pushed the LLMBlockConcurrency-135 branch from 432a842 to 68303f2 on July 19, 2024 16:08
@mergify bot removed the ci-failure label on Jul 19, 2024
@gabe-l-hart (Contributor Author)

(Quoting @markmc's batch_size mitigation suggestion above.)

That makes sense because of the sequence of the updates. Will add that now.

This is a mitigation to allow the `instructlab-sdg` library to merge and
release before `instructlab` has updated the CLI invocation of
generate_data to properly distinguish between backend types. It should be
reverted once that change is made in the CLI.

instructlab#135

Signed-off-by: Gabe Goodhart <[email protected]>
@markmc (Contributor) commented Jul 19, 2024

lgtm when CI passes

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

e2e workflow succeeded on this PR: View run, congrats!

@derekhiggins (Contributor) left a comment

lgtm, thanks

@gabe-l-hart (Contributor Author)

Thanks for the thorough reviews @derekhiggins @markmc!

@derekhiggins merged commit 30fb578 into instructlab:main on Jul 19, 2024
11 checks passed
Labels: testing (Relates to testing)
6 participants