Model Config settings for Llama-based architectures #22

Open
jamesoneill12 opened this issue Apr 5, 2024 · 8 comments

Comments

@jamesoneill12

jamesoneill12 commented Apr 5, 2024

Hi there,

Thanks for creating this repo.
I wanted to know what the config should be for Llama-2-7b-chat-hf, given that it's the following for the gpt and opt architectures:

    "gpt": {
        "path_to_blocks": ["transformer", "h"],
        "child_ref_in_parent_forward": ["transformer", "block"],
    },
    "opt": {
        "path_to_blocks": ["model", "decoder", "layers"],
        "child_ref_in_parent_forward": ["model.decoder", "decoder", "decoder_layer"],
    }

I think it's something close to

    "llama": {
        "path_to_blocks": ["model", "layers"],
        "child_ref_in_parent_forward": ["model", "decoder_layer"], 
    }

but I'm running into the following error:

File "/GPTFast/Helpers/Class/add_str_as_func.py", line 9, in add_str_as_func
func_code = compile(complete_func_str, "", "exec")
File "", line 19
input_pos: Optional[torch.Tensor] = None

So the parsing of the code string is somehow getting incorrectly matched at "decoder_layer".
Any help getting this to work on the Llama architectures with this code would be appreciated.
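
For reference, here's how I've been sanity-checking the path_to_blocks guess by walking the Hugging Face module tree (a rough sketch, assuming transformers is installed and the standard LlamaForCausalLM layout; the printed attribute names are what path_to_blocks has to match):

    # Rough sketch: confirm where the decoder blocks live on a Llama checkpoint.
    # Assumes transformers is installed; any Llama-architecture checkpoint works the
    # same way if Llama-2-7b-chat-hf is gated for you (e.g. a TinyLlama checkpoint).
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    # Top level: LlamaForCausalLM exposes "model" and "lm_head".
    print([name for name, _ in model.named_children()])

    # One level down: LlamaModel exposes "embed_tokens", "layers", "norm", ...
    print([name for name, _ in model.model.named_children()])

    # So the blocks sit at model.model.layers, i.e. path_to_blocks = ["model", "layers"],
    # and LlamaModel.forward loops over them via a variable named "decoder_layer".
    print(type(model.model.layers[0]).__name__)  # LlamaDecoderLayer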

@MDK8888
Owner

MDK8888 commented Apr 5, 2024

Hey James, Llama actually already supports static key-value caching natively within transformers. Will put up a fix in the next few days so that models with static key-value caching natively enabled can also integrate into GPTFast.
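
For anyone following along, here is a minimal sketch of what that native static cache looks like in recent transformers releases (assuming transformers >= 4.38 and the cache_implementation generation setting; this is separate from the GPTFast integration itself):

    # Minimal sketch: Llama with the native static KV cache in transformers.
    # Assumes a recent transformers release (>= 4.38) and a CUDA GPU; not GPTFast itself.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )

    # Ask generate() to allocate a fixed-size KV cache instead of a dynamically
    # growing one, which is what makes compile-friendly decoding possible.
    model.generation_config.cache_implementation = "static"

    inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(out[0], skip_special_tokens=True))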

@jamesoneill12
Author

Oh that's awesome! Not completely related, but I've noticed meta-llama/LlamaGuard-7b is super fast out of the box for guardrailing (0.09-0.13 seconds of inference for 100 max new tokens with an input length of 400 tokens, for a single sample on an A100 80GB GPU with bfloat16 dtype), but I'm not seeing the same on other Llama architectures such as Llama-2-7b-chat-hf. Do you know if some of the Llama architectures have some inference optimization behind the scenes apart from KV caching?
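
For context, a rough sketch of how a timing like that can be reproduced (assuming a single sample, roughly 400 input tokens, and bfloat16 on a CUDA GPU; not an exact script):

    # Rough sketch of the single-sample timing described above; assumes a CUDA GPU,
    # transformers, and a prompt of roughly 400 input tokens.
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/LlamaGuard-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )

    prompt = "some moderation prompt to classify " * 80  # stand-in for a ~400-token input
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm-up run so one-off CUDA allocation/compilation costs don't skew the timing.
    model.generate(**inputs, max_new_tokens=100)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    print(f"{time.perf_counter() - start:.3f}s for 100 new tokens")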

@MDK8888
Owner

MDK8888 commented Apr 6, 2024

Hey, apologies for the late response; that is very interesting indeed! I would have to investigate how LlamaGuard-7b works under the hood to answer :)

@jamesoneill12
Author

No problem! That would be great actually, even if it's supported in Transformers.

@MDK8888
Owner

MDK8888 commented Apr 10, 2024

Hey James, this week is incredibly busy for me. I will do my best to have a new branch with the fixes up this weekend; if not, early next week.

@jamesoneill12
Author

No problem at all, can't wait for the release!

@MDK8888
Owner

MDK8888 commented Apr 15, 2024

Hey James, I just pushed up my changes on the branch LlamaIntegration. The example of how it works with TinyLlama is under Examples.llama, but I don't have the GPU bandwidth to test on larger models. Let me know if my changes work with the specific Llama model that you had in mind, and I'll fix it asap if not. Thanks once again for pointing this out to me :)

@jamesoneill12
Author

Fantastic @MDK8888 !! Can't wait to try this out, I'll let you know if there's anything to report on the larger Llama-based architectures.
