wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ
This is a 4-bit AWQ quantization of yentinglin/Llama-3-Taiwan-8B-Instruct-128k.
quantize
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'yentinglin/Llama-3-Taiwan-8B-Instruct-128k'
quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# 4-bit weights, group size 128, GEMM kernel
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    "modules_to_not_convert": [],
}

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize (AutoAWQ runs calibration with its default dataset)
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
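To sanity-check the saved artifact, the quantized model can be loaded back with AutoAWQForCausalLM.from_quantized and prompted directly. This is a minimal sketch (not part of the original card) that reuses quant_path from the script above and assumes a CUDA GPU is available:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# Load the quantized weights; fuse_layers speeds up inference
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Render the chat template; add_generation_prompt=True appends the
# assistant header so the model answers rather than continuing the turn
prompt = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)
tokens = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output_ids = model.generate(tokens, max_new_tokens=128)

# Strip the prompt tokens before decoding
print(tokenizer.decode(output_ids[0][tokens.shape[1]:], skip_special_tokens=True))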
inference with vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model='wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ',
    quantization="AWQ",
    tensor_parallel_size=2,  # number of GPUs
    gpu_memory_utilization=0.9,
    dtype='half',
)

tokenizer = llm.get_tokenizer()

# Render the chat template into a plain prompt string;
# add_generation_prompt=True appends the assistant header so the
# model answers instead of continuing the user turn
conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        min_tokens=20,
        max_tokens=1024,
    ),
)

for output in outputs:
    generated_ids = output.outputs[0].token_ids
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)
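Since llm.generate accepts a list of prompts, several conversations can be batched in one call and vLLM will schedule them together. A small sketch continuing from the code above (the questions are illustrative; the second is a Traditional Chinese prompt, which suits this model):

questions = [
    "how tall is taipei 101",
    "介紹一下台灣的夜市文化",  # "introduce Taiwan's night-market culture"
]
prompts = [
    tokenizer.apply_chat_template(
        [{'role': 'user', 'content': q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.5, top_p=0.9, max_tokens=512))
for q, output in zip(questions, outputs):
    # each RequestOutput carries the generated text directly
    print(q, '->', output.outputs[0].text.strip())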