
Issues loading model shards #140

Open
barsuna opened this issue Aug 11, 2024 · 4 comments


barsuna commented Aug 11, 2024

Folks, thank you for a very interesting project! I'm trying some basic scenarios and have hit a few snags. Happy to split these into separate issues if needed.

Setup:

node/host1, ubuntu 24.04, 96gb ram, nvidia 4090 (24gb)
node/host2, ubuntu 24.04, 64gb ram, 2x nvidia titan-v (12gb each)

  • I can run llama3.1 8B successfully when host1 works alone (about 8 tokens/sec)

  • If I start the 2nd host, there are a number of issues:

  1. Node 1 not only finds its GPU, it also reports FLOPS for the various precisions, while node 2 doesn't. And I'm not sure it even supports 2 GPUs in 1 node?

i.e.:

Collected topology: Topology(Nodes: {f6a7175c-f785-40fb-9bc5-dd24f2fc05a8: Model: Linux Box (NVIDIA GEFORCE RTX 4090). Chip: NVIDIA GEFORCE RTX 4090. Memory: 24564MB. Flops: fp32: 82.58 TFLOPS, fp16: 165.16 TFLOPS, int8: 330.32 TFLOPS, 11551e73-088b-4022-b116-4f952ca03891: Model: Linux Box (NVIDIA TITAN V). Chip: NVIDIA TITAN V. Memory: 12288MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS}, Edges: {f6a7175c-f785-40fb-9bc5-dd24f2fc05a8: {'11551e73-088b-4022-b116-4f952ca03891'}, 11551e73-088b-4022-b116-4f952ca03891: {'f6a7175c-f785-40fb-9bc5-dd24f2fc05a8'}})
  2. When starting node 2, it fails to load llama3.1-8b (16GB). I see the following error:
Error processing tensor for shard Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=21, end_layer=31, n_layers=32): attempted to cast disk buffer (bitcast only)
Traceback (most recent call last):
  File "/home/x/exo/exo/orchestration/standard_node.py", line 221, in _process_tensor
    result, inference_state, is_finished = await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/exo/exo/inference/tinygrad/inference.py", line 77, in infer_tensor
    await self.ensure_shard(shard)
  File "/home/x/exo/exo/inference/tinygrad/inference.py", line 95, in ensure_shard
    self.model = build_transformer(model_path, shard, model_size="8B" if "8b" in shard.model_id.lower() else "70B")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/exo/exo/inference/tinygrad/inference.py", line 48, in build_transformer
    weights = fix_bf16(weights)
              ^^^^^^^^^^^^^^^^^
  File "/home/x/exo/exo/inference/tinygrad/models/llama.py", line 221, in fix_bf16
    return {k:v.cast(dtypes.float16) if v.dtype == dtypes.bfloat16 else v for k,v in weights.items()}
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/tensor.py", line 3166, in _wrapper
    ret = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/tensor.py", line 2966, in cast
    return self if self.dtype == dtype else F.Cast.apply(self, dtype=dtype)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/tensor.py", line 38, in apply
    ret.lazydata, ret.requires_grad, ret.grad = ctx.forward(*[t.lazydata for t in x], **kwargs), ctx.requires_grad, None
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/function.py", line 22, in forward
    return x.cast(dtype, bitcast)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/lazy.py", line 96, in cast
    if self.device.startswith("DISK") and not bitcast: raise RuntimeError("attempted to cast disk buffer (bitcast only)")
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: attempted to cast disk buffer (bitcast only)
[]
  3. The NVIDIA compute capabilities of the GPUs in host1 and host2 are different: the 4090 has cc 8.9, while the Titan V has cc 7.0. This seems to prompt tinygrad to start in NV mode for the 4090 and in CUDA mode for the Titan V. Does this matter to exo, or does one need to force both sides to CUDA?

If I force both sides to CUDA mode, I get the error above (point 2); if I do not force them and node 2 runs in CUDA mode, then I get the error below:

Excluded model param keys for shard=Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=0, end_layer=31, n_layers=32): set()
ram used: 11.98 GB, layers.27.feed_forward.w2.weight                  :  85%|██████████████████████████████████████████████████████████████████▌           | 249/292 [00:05<00:00, 45.44it/s]loaded weights in 5553.64 ms, 12.21 GB loaded at 2.20 GB/s
Error processing tensor for shard Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=0, end_layer=31, n_layers=32): CUDA Error 2, out of memory
Traceback (most recent call last):
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/device.py", line 152, in alloc
    try: return super().alloc(size, options)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/device.py", line 136, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in _alloc
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/helpers.py", line 291, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
                                                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in <lambda>
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 13, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA Error 2, out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/x/exo/exo/orchestration/standard_node.py", line 221, in _process_tensor
    result, inference_state, is_finished = await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/exo/exo/inference/tinygrad/inference.py", line 77, in infer_tensor
    await self.ensure_shard(shard)
  File "/home/x/exo/exo/inference/tinygrad/inference.py", line 95, in ensure_shard
    self.model = build_transformer(model_path, shard, model_size="8B" if "8b" in shard.model_id.lower() else "70B")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/exo/exo/inference/tinygrad/inference.py", line 52, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False) # consume=True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/tensor.py", line 3166, in _wrapper
    ret = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/tensor.py", line 203, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 225, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 175, in run
    bufs = [cast(Buffer, x) for x in self.bufs] if jit else [cast(Buffer, x).ensure_allocated() for x in self.bufs]
                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/device.py", line 78, in ensure_allocated
    def ensure_allocated(self) -> Buffer: return self.allocate() if not hasattr(self, '_buf') else self
                                                 ^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/device.py", line 87, in allocate
    self._buf = opaque if opaque is not None else self.allocator.alloc(self.nbytes, self.options)
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/device.py", line 155, in alloc
    return super().alloc(size, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/device.py", line 136, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in _alloc
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/helpers.py", line 291, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
                                                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in <lambda>
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/penv/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 13, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA Error 2, out of memory

  4. In the tinygrad repo the llama3 example supports some quantization (int8 and nf4) - does/can exo support this too? Loading models at 16 bits is a luxury few of us can afford :) I can confirm that I can run llama3.1 70B quantized to 4 bits fully on these 3 GPUs if I put them into a single node - it is my hope to get this running on exo as well (I understand llama.cpp support is in the works for now). On the same topic: the MLX engine mostly refers to quantized models, while tinygrad seems to pull 16-bit weights - any reason for this?

  5. This is more of a question: is there a fallback to CPU, or does exo expect everything to load into GPUs? It seems to be the latter, but I wanted to confirm so I understand the logic better.

Thank you again!

@AlexCheema
Contributor

  1. Not showing FLOPs is a visual bug. I made an issue to fix this here: automatically determine estimate of device FLOPs #149
  2. Try running with SUPPORT_BF16=0 e.g. SUPPORT_BF16=0 python3 main.py
  3. Possibly the same as 2? Or it could actually be that you're running out of memory, in which case perhaps it's worth waiting for [BOUNTY - $500] Add support for quantized models with tinygrad #148 where we will introduce quantized models.
  4. Made an issue for this here: [BOUNTY - $500] Add support for quantized models with tinygrad #148
  5. Right now we use the tinygrad default device. We can add more robust logic there to fail over. For now I'd like to iron out all the issues with the default device before introducing fallbacks that are relied on when they shouldn't be.

Thanks for the detailed issue - this really helps. Please continue to make issues / comments so we can improve exo to fulfil your use-case.

@barsuna
Author

barsuna commented Aug 17, 2024

Thank you @AlexCheema!

Ack on 1. I realize now these are static numbers. If determining these dynamically, it would seem sensible to also measure bus bandwidth and GPU memory bandwidth - I imagine the overall performance would influence how big a shard each device gets?
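
For illustration, here is a minimal sketch of what dynamic probing could look like with pynvml (purely an assumption on my side - exo does not necessarily use NVML, and NVML does not report peak FLOPS, so that would still need a lookup table keyed on the device name):

import pynvml

# Query per-GPU memory, compute capability and a rough memory-bandwidth estimate.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    mem_mb = pynvml.nvmlDeviceGetMemoryInfo(h).total // (1024 * 1024)
    major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(h)
    # Rough peak bandwidth: bus width (bits) / 8 * memory clock (MHz) * 2 (DDR), in GB/s
    bus_bits = pynvml.nvmlDeviceGetMemoryBusWidth(h)
    mem_clock_mhz = pynvml.nvmlDeviceGetMaxClockInfo(h, pynvml.NVML_CLOCK_MEM)
    bw_gbs = bus_bits / 8 * mem_clock_mhz * 2 / 1000
    print(f"GPU {i}: {name}, {mem_mb} MB, cc {major}.{minor}, ~{bw_gbs:.0f} GB/s")
pynvml.nvmlShutdown()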

On 2: the problem seems to have actually gone away (I'm at c4b261d) with the original command line:
CUDA=1 python main.py --max-parallel-downloads 1 --disable-tui --wait-for-peers 1

and now I am able to use the 2 nodes together, but if I actually try SUPPORT_BF16=0, it causes python3.12 to segfault. But at least I'm further along than I was.
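
For reference, the "attempted to cast disk buffer (bitcast only)" error in point 2 seems to come from fix_bf16 casting weights that still live on tinygrad's DISK device. A minimal sketch of one possible workaround (my assumption, not exo's actual fix) would be to move each tensor to the compute device before casting, which of course assumes that device can hold bf16 buffers:

from tinygrad import Device, dtypes

# Cast bf16 weights only after moving them off the DISK device,
# since tinygrad allows only bitcasts on disk buffers.
def fix_bf16(weights):
  return {
    k: v.to(Device.DEFAULT).cast(dtypes.float16) if v.dtype == dtypes.bfloat16 else v
    for k, v in weights.items()
  }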

  3. It is not the same as 2 - that one seems to be due to exo only using a single GPU per instance: a single Titan V has only 12GB of memory, but the model is 16GB, so we run out of memory if only one card is used. There are 2 cards with 12GB each, though, so I was hoping both would be used; it seems that is not the case. I think I saw an issue suggesting that one exo instance per GPU is needed.

I've managed to get 2 instances on the same host to more or less work in the following way:

(instance 1)

export CUDA_VISIBLE_DEVICES=0
CUDA=1 VISIBLE_DEVICES=0 python main.py --max-parallel-downloads 1 --disable-tui --wait-for-peers 1 --node-id 1111 --broadcast-port 55555

(instance 2)

export CUDA_VISIBLE_DEVICES=1
CUDA=1 VISIBLE_DEVICES=1 python main.py --max-parallel-downloads 1 --disable-tui --wait-for-peers 1 --node-id 2222 --listen-port 55555

The listen-port is needed because otherwise there is a port conflict (that port is already used by instance 1), and the broadcast port is needed because otherwise node 2 doesn't hear from node 1 (though the opposite is not true - it seems gRPC allows such asymmetric/unidirectional communication).

Unless I'm off in some completely wrong direction, we may want a cleaner way to run multiple instances per host (there are still issues, e.g. both instances trying to bind to the same port for the API, etc.).
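
As a stopgap, something like the sketch below (purely my own idea, not part of exo; it only uses the flags already shown above) could start one instance per GPU with distinct ports. It has the same limitation as the manual setup: pairing listen/broadcast ports only works cleanly for 2 instances, and the API port conflict is not handled.

import os
import subprocess

NUM_GPUS = 2          # one exo instance per GPU
BASE_PORT = 55555     # discovery ports: 55555, 55556, ...

procs = []
for gpu in range(NUM_GPUS):
    # Pin each instance to one GPU and give it its own node id and ports.
    env = {**os.environ, "CUDA": "1", "CUDA_VISIBLE_DEVICES": str(gpu)}
    cmd = [
        "python", "main.py",
        "--max-parallel-downloads", "1",
        "--disable-tui",
        "--node-id", f"node{gpu}",
        "--listen-port", str(BASE_PORT + gpu),
        # point each node's broadcast at the other node's listen port
        "--broadcast-port", str(BASE_PORT + (gpu + 1) % NUM_GPUS),
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()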

  4. Looking forward to quantization on tinygrad, thank you!

  5. Got it, thank you!

@AlexCheema
Contributor

AlexCheema commented Aug 26, 2024

  1. Yep we should do this dynamically. Made an issue: Dynamic device capabilities #177
  2. These SUPPORT_BF16 shenanigans need to be fixed properly. It should be as simple as running one command to run exo - no configuration. I'm working on improving this and could use help with it!
  3. This looks almost correct. You might want to set --broadcast-port too. There's an example here where we run 2 nodes on the same host:

    exo/.circleci/config.yml

    Lines 19 to 25 in 3949357

    # Start first instance
    HF_HOME="$(pwd)/.hf_cache_node1" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node1" --listen-port 5678 --broadcast-port 5679 --chatgpt-api-port 8000 --chatgpt-api-response-timeout-secs 900 > output1.log 2>&1 &
    PID1=$!
    # Start second instance
    HF_HOME="$(pwd)/.hf_cache_node2" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node2" --listen-port 5679 --broadcast-port 5678 --chatgpt-api-port 8001 --chatgpt-api-response-timeout-secs 900 > output2.log 2>&1 &
    PID2=$!

@barsuna
Author

barsuna commented Aug 30, 2024

Thank you @AlexCheema!

On 3: this approach seems limited to 2 processes; we still need something different for when there are more than 2 instances. I tried to put each instance in a Docker container, but wasn't able to get everything working quickly - and it is a little limiting anyway.

Do you plan to stay with 1 instance per GPU and shard between the local GPUs, or are you perhaps considering updating the discovery to handle same-host instances?
