wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ

This model is quantized to 4 bits with AWQ; the original model is yentinglin/Llama-3-Taiwan-8B-Instruct-128k.

Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'yentinglin/Llama-3-Taiwan-8B-Instruct-128k'
quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM", "modules_to_not_convert": []}

# Load the full-precision model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ quantization with the config above (AutoAWQ uses its default calibration dataset)
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
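
After saving, the quantized checkpoint can be reloaded locally to sanity-check it before uploading (a minimal sketch; fuse_layers=True is optional and assumes a recent AutoAWQ release):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# Reload the 4-bit weights; fuse_layers fuses attention/MLP modules for faster inference
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)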

Inference with vLLM

from vllm import LLM, SamplingParams

llm = LLM(model='wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ', 
          quantization="AWQ",
          tensor_parallel_size=2, # number of gpus
          gpu_memory_utilization=0.9,
          dtype='half'
         )

tokenizer = llm.get_tokenizer()
# Format the user turn with the model's chat template and append the assistant header
conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        min_tokens=20,
        max_tokens=1024,
    )
)
 
for output in outputs:
    generated_ids = output.outputs[0].token_ids
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)
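
If vLLM is not available, the repository can also be loaded with plain transformers, which dispatches to the AWQ kernels when the autoawq package is installed (a sketch under that assumption; sampling settings mirror the vLLM example above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ'

tokenizer = AutoTokenizer.from_pretrained(model_id)
# AWQ checkpoints load in fp16 by default; device_map="auto" places weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build the chat prompt and generate
input_ids = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    max_new_tokens=256,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))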