Model Config settings for Llama-based architectures #22

Open
jamesoneill12 opened this issue Apr 5, 2024 · 8 comments

Comments

@jamesoneill12

jamesoneill12 commented Apr 5, 2024

Hi there,

Thanks for creating this repo.
I wanted to know what the config should be for Llama-2-7b-chat-hf, given that it's the following for the gpt and opt architectures:

    "gpt": {
        "path_to_blocks": ["transformer", "h"],
        "child_ref_in_parent_forward": ["transformer", "block"],
    },
    "opt": {
        "path_to_blocks": ["model", "decoder", "layers"],
        "child_ref_in_parent_forward": ["model.decoder", "decoder", "decoder_layer"],
    }

I think it's something close to

    "llama": {
        "path_to_blocks": ["model", "layers"],
        "child_ref_in_parent_forward": ["model", "decoder_layer"], 
    }

but I'm running into the following error:

File "/GPTFast/Helpers/Class/add_str_as_func.py", line 9, in add_str_as_func
func_code = compile(complete_func_str, "", "exec")
File "", line 19
input_pos: Optional[torch.Tensor] = None

So the parsing of the code string is somehow getting incorrectly matched at "decoder_layer".
Any help getting this to work on the Llama architectures with this code would be appreciated.
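
For reference, here's how I've been sanity-checking the path_to_blocks guess by walking the Hugging Face module tree (a rough sketch, assuming transformers is installed and the standard LlamaForCausalLM layout; the printed attribute names are what path_to_blocks has to match):

    # Rough sketch: confirm where the decoder blocks live on a Llama checkpoint.
    # Assumes transformers is installed; any Llama-architecture checkpoint works the
    # same way if Llama-2-7b-chat-hf is gated for you (e.g. a TinyLlama checkpoint).
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    # Top level: LlamaForCausalLM exposes "model" and "lm_head".
    print([name for name, _ in model.named_children()])

    # One level down: LlamaModel exposes "embed_tokens", "layers", "norm", ...
    print([name for name, _ in model.model.named_children()])

    # So the blocks sit at model.model.layers, i.e. path_to_blocks = ["model", "layers"],
    # and LlamaModel.forward loops over them via a variable named "decoder_layer".
    print(type(model.model.layers[0]).__name__)  # LlamaDecoderLayer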

@MDK8888
Owner

MDK8888 commented Apr 5, 2024

Hey James, Llama actually already supports static key-value caching natively within transformers. Will put up a fix in the next few days so that models with static key-value caching natively enabled can also integrate into GPTFast.
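
For anyone following along, here is a minimal sketch of what that native static cache looks like in recent transformers releases (assuming transformers >= 4.38 and the cache_implementation generation setting; this is separate from the GPTFast integration itself):

    # Minimal sketch: Llama with the native static KV cache in transformers.
    # Assumes a recent transformers release (>= 4.38) and a CUDA GPU; not GPTFast itself.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )

    # Ask generate() to allocate a fixed-size KV cache instead of a dynamically
    # growing one, which is what makes compile-friendly decoding possible.
    model.generation_config.cache_implementation = "static"

    inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(out[0], skip_special_tokens=True))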

@jamesoneill12
Author

Oh that's awesome! Not completely related, but I've noticed meta-llama/LlamaGuard-7b is super fast out of the box for guardrailing (0.09-0.13 seconds of inference for 100 max new tokens with an input length of 400 tokens, for a single sample on an A100 80GB GPU with bfloat16 dtype), but I'm not seeing the same on other Llama architectures such as Llama-2-7b-chat-hf. Do you know if some of the Llama architectures have some inference optimization behind the scenes apart from KV caching?
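
For context, a rough sketch of how a timing like that can be reproduced (assuming a single sample, roughly 400 input tokens, and bfloat16 on a CUDA GPU; not an exact script):

    # Rough sketch of the single-sample timing described above; assumes a CUDA GPU,
    # transformers, and a prompt of roughly 400 input tokens.
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/LlamaGuard-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )

    prompt = "some moderation prompt to classify " * 80  # stand-in for a ~400-token input
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm-up run so one-off CUDA allocation/compilation costs don't skew the timing.
    model.generate(**inputs, max_new_tokens=100)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    print(f"{time.perf_counter() - start:.3f}s for 100 new tokens")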

@MDK8888
Owner

MDK8888 commented Apr 6, 2024

Hey, apologies for the late response; that is very interesting indeed! I would have to investigate how LlamaGuard-7b works under the hood to answer :)

@jamesoneill12
Author

No problem! That would be great actually, even if it's supported in Transformers.

@MDK8888
Owner

MDK8888 commented Apr 10, 2024

Hey James, this week is incredibly busy for me. I will do my best to have a new branch with the fixes up this weekend; if not, early next week.

@jamesoneill12
Author

No problem at all, can't wait for the release!

@MDK8888
Owner

MDK8888 commented Apr 15, 2024

Hey James, I just pushed up my changes on the branch LlamaIntegration. The example of how it works with TinyLlama is under Examples.llama, but I don't have the GPU bandwidth to test on larger models. Let me know if my changes work with the specific Llama model that you had in mind, and I'll fix it asap if not. Thanks once again for pointing this out to me :)

@jamesoneill12
Author

Fantastic @MDK8888 !! Can't wait to try this out, I'll let you know if there's anything to report on the larger Llama-based architectures.
