arxiv:2402.17764

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Published on Feb 27, 2024
· Submitted by akhaliq on Feb 28, 2024
#1 Paper of the day

Abstract

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

Community

Very nice paper that introduces a new paradigm for LLM quantization: ternary weights {-1, 0, 1} for linear layers, which removes the need for multiplications in matmul, plus int8 activations.
It seems that the method cannot be used for post-training quantization; instead, a 1.58-bit model has to be trained from scratch. I believe the code will be shared here: https://github.com/microsoft/unilm/tree/master/bitnet - would be curious to see if the authors will share the quantized models on the Hub!
I also wonder if the lm_head is quantized as well, since leaving the lm_head unquantized helps preserve generation quality in quantized language models.

·
Paper author

We would definitely be happy to open-source the models for future research. Please stay tuned!

The lm_head is not quantized because the language models have to use high-precision probabilities to perform sampling, and it only takes a very small proportion of the cost, especially when the model is large.

This is incredible! Like the other commenter here one of my first thoughts goes immediately to existing LLMs and whether they can be converted to 1.58bit LLMs somehow. @shumingma Did you conduct any experiments in this area? Either via some finetuning method or even distillation?

·
Paper author

Unfortunately, the conversion or post-training quantization from existing LLMs doesn't help. This is why we train the models from scratch.

Amazing work!
This method is likely compatible with PowerInfer (as long as the activation function is replaced by ReLU or squared ReLU), which would make it even faster on a mixed setup with, for example, 64GB RAM + 24GB VRAM (which would then support a 400B model at decent speeds).

It would also be interesting to see this combined with some of these papers: (I think all of them are compatible with each other)

switchhead
fast feed forward
pause tokens
EAGLE
KIVI

·

Fast feed forward doesn't replicate. Worked on that for a few weeks.

Hi, very exciting work!
I have a few questions on the zero-shot performance on the language tasks.
Did you also compute the evaluation with "BitNet b1.58 70B" ? I'm very curious about these results. I'm referring to something like Table 3.

·
Paper author

We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready.

Really interesting work! Are there any major drawbacks or are we all just starting over using this?

Hi. Great work! I wanted to ask how long the 1B or 700M parameter variants took to train. I couldn't find it in the paper.

·

I would also be interested to know whether it is more efficient to train models using this method than to train a more traditional model.

The trend of perplexity becoming better with a larger parameter count compared to the 700m and 1.3b is... perplexing.
Did you guys study how it impacted models with a very small parameter count (e.g., 100M)?
Is it reasonable to conclude that "under-parameterized" Transformers tend to use the full precision to better represent individual neurons, but that this property seems to fade with scaling, which makes the technique more effective w.r.t large models?

·

From what I understand, as models become larger, sparsity emerges, e.g. https://openreview.net/forum?id=TJ2nxciYCk-

This is great news! Could you share the training code so we can experiment with pre-training smaller models?

I think the name "Ternary LLM" makes more sense than "BitNet b1.58"

·

Or "TritNet" if they prefer to keep with their existing naming scheme.

I wonder if you could quantize layers one by one, with calibration, to 1 bit. I know that's not the point of this, as the models were all trained from scratch, but it would be pretty interesting. Something similar to LASER.

I know purely quantizing existing models does not work, but are there plans to try some distillation procedure or possibly slow-walk existing model parameters into this quantized state?

·

Well, it's not quantized.
It builds tensors with 1.58 bits in mind instead of FP16.

Very interesting approach! One question I still have is what's the integer layout to store the third state (0). Since it is not 1-bit, I am guessing that's where the 1.58 comes from, but I am unclear on what's the representation in the binary form. Do you use one bit for the sign and another one for the value?

This research direction is starting to remind me of Hyperdimensional Computing / Vector Symbolic Architectures, which also typically use 1-bit or ternary representations, but take the approach of building explicit knowledge structures by combining concept vectors using a set of basic operations.
I wonder if both HDC/VSA and LLMs end up doing ultimately the same things at their core. It would be really cool if they turned out to be special cases of a single unified framework that combined the former's interpretability with the latter's trainability/scalability :-)

Missed an opportunity to name the title "ternary weights is all you need.".

This paper is very surprising to me. I would have thought that you could have a model with {-1, 0, 1} match the capability of an FP model by being significantly larger than it. You would be making up for the loss of "descriptiveness" of FP by increasing the number of less descriptive weights. However, if I am following correctly, you've found that you actually don't need to scale up the number of weights at all. Do you have any ideas as to why that might be? It kind of shatters my understanding of what weights were even doing in the first place.

·

I agree with this. I'd be interested if someone has an idea or intuition to offer here. Is it perhaps that at these high dimensions the addition of the weights isn't so valuable (ostensibly another dimension)?

The memory savings and throughput results in the paper are for inference, right? Are you seeing the same or similar gains during training, or are the training gains different?

·

I believe during training the model keeps full-precision master weights, and the low-bit weights are used for the forward and backward computation.
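
To make the reply above concrete, here is a minimal pure-Python sketch of one training step that keeps full-precision master weights and uses a straight-through estimator. It assumes the absmean rounding described in the paper (scale by the mean absolute weight, round, clip to {-1, 0, 1}). The helper names `absmean_ternarize` and `ste_step`, the single-neuron model, and the squared-error loss are illustrative assumptions, not the authors' code.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Map full-precision weights to {-1, 0, 1} plus one shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


def ste_step(master, x, target, lr=0.1):
    """One update of the master weights for a single linear neuron."""
    ternary, scale = absmean_ternarize(master)
    # Forward pass uses only the low-bit weights (times the shared scale).
    y = scale * sum(t * xi for t, xi in zip(ternary, x))
    err = y - target
    loss = 0.5 * err * err
    # Gradient of the loss with respect to the effective (quantized) weights.
    grads = [err * xi for xi in x]
    # Straight-through estimator: treat the quantizer as identity and apply
    # the same gradient to the full-precision master weights.
    new_master = [w - lr * g for w, g in zip(master, grads)]
    return new_master, ternary, loss


master = [0.9, -0.05, -0.4]
new_master, ternary, loss = ste_step(master, x=[1.0, 2.0, -1.0], target=1.0)
print(ternary, new_master, loss)
```

Small updates accumulate in the master copy and only flip a ternary value once they cross a rounding boundary, which is why the full-precision copy is needed during training.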

This kind of feels too good to be true. Please prove me wrong, I'd be happy if you do so and prove the results are true.

My main concerns:

  1. Why don't you at least train the 7B version of BitNet on 2T tokens so it can be easily comparable on OpenLLM benchmark? It's easy to show that a 7B model performs well in a setting where it's trained only on 100B tokens, as there is a potential maximum information capacity, which is far below an fp16 alternative.

  2. What is the StableLM 3B trained on 2T tokens you are talking about? I could not find such a model. Stability has StableLM 3B trained on 1T tokens and a StableLM 2 1.6B trained on 2T tokens. The benchmarks of either of these models don't correspond to the benchmark you provide, and are better.

·

My main concerns:

  1. Why don't you at least train the 7B version of BitNet on 2T tokens so it can be easily comparable on OpenLLM benchmark?

They said in a previous thread that they hadn't finished training the models larger than 3.9b yet, because of the compute involved. I think the numbers for those in the paper like 70b are inferred from the current trends, but sounds like they do plan to train them.

  2. What is the StableLM 3B trained on 2T tokens you are talking about? I could not find such a model. Stability has StableLM 3B trained on 1T tokens and a StableLM 2 1.6B trained on 2T tokens.

It might be a mistake, assuming that the 3B had the same number of tokens as the 1.6B. There are also the Zephyr versions of each, though I'm not sure how many more tokens were used for those fine-tunings.


Please open-source the model weights for further research @shumingma

Cool work! I want to know why the model after ternary QAT optimization is less than 4x smaller. Shouldn't it be at least 8x smaller compared to FP16?

If it is only 4x smaller, it looks more like a 4-bit quantized model, and as we all know 4-bit is almost lossless for current LLMs. @shumingma

·

I think the following explains it. At smaller sizes, the full-precision embedding takes up more of the model. They estimate that at 70B it will take 1/7 the VRAM of a normal 70B model.

"We further scaled up the model size to 7B, 13B, and 70B and evaluated the cost. Figure 2 illustrates the trends of latency and memory, showing that the speed-up increases as the model size scales. In particular, BitNet b1.58 70B is 4.1 times faster than the LLaMA LLM baseline. This is because the time cost for nn.Linear grows with the model size. The memory consumption follows a similar trend, as the embedding remains full precision and its memory proportion is smaller for larger models. Both latency and memory were measured with a 2-bit kernel, so there is still room for optimization to further reduce the cost."

Though keep in mind those are extrapolations since they haven't actually trained above 3.9b yet.
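
A rough back-of-the-envelope sketch of that argument in Python. The parameter counts below are made-up round numbers (the 0.1B and 0.5B embedding sizes are hypothetical, not taken from the paper), it assumes 2 bits per ternary weight as in the quoted 2-bit kernel, and it counts weights only (no activations or KV cache), so it will not reproduce the paper's measured figures.

```python
def compression_ratio(total_params, embed_params, low_bits=2.0, full_bits=16.0):
    """FP16 weight memory divided by memory with FP16 embeddings + low-bit rest."""
    full = total_params * full_bits
    mixed = embed_params * full_bits + (total_params - embed_params) * low_bits
    return full / mixed


# Hypothetical sizes: embeddings are a large share of a small model
# and a tiny share of a large one.
small = compression_ratio(total_params=0.7e9, embed_params=0.1e9)
large = compression_ratio(total_params=70e9, embed_params=0.5e9)
print(f"small model: {small:.1f}x, large model: {large:.1f}x")
```

With these assumed numbers the small model lands near 4x and the large one approaches, but never reaches, the 8x ceiling set by 16 bits versus 2 bits.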

Interesting work, but doesn't the improvement in PPL of the quantized models vs their fp16 counterparts signal that they (the fp16 models) were not properly trained to begin with? (Intuitively, it should be impossible for the 1-bit model to find a point in the weight-dimension that has lower loss than the point found by fp16, right?)

·

Exactly that, these models are under-trained for the number of parameters they have.

How are you able to represent 3 states using 1.58 bits? Don't you need at least 2 bits to represent more than 2 states?

·

Technically they are BCT-encoding the ternary values anyway, so it's actually 2 bits, averaging out to 1.58.
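
For reference, 1.58 is the information content of one three-valued weight, log2(3), rather than a physical storage width. A quick check in Python (a sketch; the variable names are illustrative):

```python
import math

# Information content of one value drawn from {-1, 0, 1}.
bits_per_trit = math.log2(3)

# Overhead of storing each trit in its own 2-bit field.
two_bit_overhead = 2 / bits_per_trit

# Five trits have 3**5 = 243 states, so they need this many whole bits.
min_bits_for_5 = math.ceil(5 * bits_per_trit)

print(f"{bits_per_trit:.3f} bits per trit")
print(f"2-bit storage uses {two_bit_overhead:.2f}x the information-theoretic minimum")
print(f"5 trits fit in {min_bits_for_5} bits")
```

So a plain 2-bit field per weight spends about 26% more bits than the theoretical minimum, which is the gap the question is pointing at.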

Great work! Can you expand on the quantization details?

  1. What's the granularity for weights and activations? (per-tensor, per-channel, per-token, etc.)
  2. Are activation scales calculated statically or dynamically?

·

It's not a quantization method.

Would not 2 bits (and quaternary instead of ternary) be more efficient when implemented on a binary processor?

·

The performance optimization is in the math. When you do matrix multiplication with ternary weights, the multiplications disappear: -1 x anything is a sign flip, 0 x anything is 0, and 1 x anything is unchanged.
In all cases, the answer is almost instantaneous, even without any specialized hardware. It will be great to see how well this runs on regular CPUs.
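
A minimal sketch of that idea in plain Python: a matrix-vector product with ternary weights needs only additions and subtractions. The function name `ternary_matvec` and the toy values are illustrative, and a real kernel would work on packed weights rather than Python lists.

```python
def ternary_matvec(weights, activations):
    """Multiply a ternary matrix by a vector using only adds and subtracts."""
    out = []
    for row in weights:
        acc = 0
        for w, a in zip(row, activations):
            if w == 1:
                acc += a   # 1 * a: add the activation
            elif w == -1:
                acc -= a   # -1 * a: subtract the activation
            # w == 0 contributes nothing, so it is skipped
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [5, -3, 2]
y = ternary_matvec(W, x)

# Reference result computed with ordinary multiplication.
reference = [sum(w * a for w, a in zip(row, x)) for row in W]
print(y, reference)
```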

What does this ternary representation look like? Is it int2, where the first bit is the sign?

·

Good point - interested in this

If you do choose to continue training the larger models, could you use the data used to train Phi-2? I imagine it would scale significantly better than standard data. And potentially 5GB of the deduped StarCoder dataset and 5GB of SlimPajama 🙏🙏 just a hopeful request!

Also, is there really currently no way to quantize the models down to 1.58 bits and use a recovery LoRA, kind of like in y'all's "transformer compression" paper?

I'm particularly curious about how the model size is kept consistent in the table. So, how is the model size of the b1.58 model calculated? From my understanding, if the model size remains consistent, does it imply more parameters, especially compared to quantization? Especially, I noticed that in the paper, the 1-bit BitNet compares models with different numbers of bits, while keeping the model size consistent. Personally, I believe this approach is less promising than quantization because it does not reduce the model size.

·

By "model size" they just mean the number of parameters in the model, not the physical size on disk. Generally the memory limitation is from loading all the data of the model into memory, so that is more representative of the size in the sense you mean.

And for that, they didn't make the embeddings smaller, so it makes a bigger difference the larger the model. You can see that by the time it gets up to 70B params, they estimate 1/7 the RAM, so the file size would be around that much smaller (depending on how the trits are actually encoded into bits).

Will the training code be made public? That would actually be awesome and then we "gpu poor" will be able to have true mixture of experts with 10s of models trained on trillions of token and hence agi. Also, have you guys thought about doing this for pictures and videos to train models in similar fashion?


I was expecting to see the original 1bit BitNet in the perplexity table. Was curious just how much adding that zero weight improved the model.

·

Yes, that would be helpful. I think this paper needs work. Great initial result, but lots of loose ends.

Hey I'm just a curious newb. But I'm wondering could we have a 1 byte mamba? Also spiking neural networks are binary-like and capable of real time learning (that's why they are sometimes called liquid neural nets right?) and ternary is just binary with negatives... so... might there be a way to record the activation of neurons in response to a prompt and do that 3 times with a different seed each, and use a graph pruning algorithm to help it learn? And likewise use some kind of associative reinforcement algorithm to make new graph connections between concepts that get brought up together in context?

Could we also use this system in a just-bytes/encoderless multimodal model?

import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import layers
import numpy as np

class BitNet(tf.keras.Model):

    def __init__(self, num_layers, hidden_size, num_heads, vocab_size):
        super().__init__()

        self.embeddings = tf.keras.layers.Embedding(vocab_size, hidden_size)

        # Named `blocks` because `layers` is a read-only property on tf.keras.Model
        self.blocks = [
            BitLinearBlock(hidden_size, num_heads)
            for _ in range(num_layers)
        ]

        self.ln = LayerNormalization(hidden_size)
        self.lm_head = tf.keras.layers.Dense(vocab_size, dtype=tf.float32)  # Use higher precision for lm_head

    def call(self, inputs, training=True):
        x = self.embeddings(inputs)

        for block in self.blocks:
            x = block(x, training=training)

        x = self.ln(x)
        return self.lm_head(x)

class BitLinearBlock(tf.keras.layers.Layer):

    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.atten = BitAttention(hidden_size, num_heads)
        # Assuming the implementation of FeedForward is complete
        self.mlp = FeedForward(hidden_size)

    def call(self, inputs, training):
        att = self.atten(inputs, training)
        return self.mlp(att)

class BitAttention(tf.keras.layers.Layer):

    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.hidden_size = hidden_size
        # call() applies layer normalization, so the layer is created here
        self.layer_norm = LayerNormalization(hidden_size)

    def build(self, input_shape):
        # Initialize 1-bit weights etc
        self.q_weight = self.add_weight(
            shape=(input_shape[-1], self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal,
            dtype=tf.float32
        )

        self.kv_weight = self.add_weight(
            shape=(input_shape[-1], 2 * self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal,
            dtype=tf.float32
        )

        # Convert weights to ternary representation
        self.q_weight = tf.sign(self.q_weight)
        self.kv_weight = tf.sign(self.kv_weight)

        # Centralize weights
        self.q_weight_mean = tf.reduce_mean(self.q_weight)
        self.q_weight -= self.q_weight_mean

        self.kv_weight_mean = tf.reduce_mean(self.kv_weight)
        self.kv_weight -= self.kv_weight_mean

        # Scale factor
        self.q_scale = 1 / tf.reduce_sum(
            tf.cast(tf.abs(self.q_weight), tf.float32))

        self.kv_scale = 1 / tf.reduce_sum(
            tf.cast(tf.abs(self.kv_weight), tf.float32))

    def call(self, inputs, training):

        # Absmax quantize activations
        inputs = quantize(inputs)

        # Multi-head attention
        queries = tf.matmul(inputs, self.q_weight * self.q_scale)
        keys = tf.matmul(inputs, self.kv_weight[:, :self.hidden_size] * self.kv_scale)
        values = tf.matmul(inputs, self.kv_weight[:, self.hidden_size:] * self.kv_scale)

        qk_aproduct = tf.matmul(queries, keys, transpose_b=True) / np.sqrt(self.hidden_size)
        attn_weights = tf.nn.softmax(qk_aproduct)

        attn_out = tf.matmul(attn_weights, values)

        # Residual connection
        output = inputs + attn_out

        # Layer normalization
        output = self.layer_norm(output)

        return output

    # Manual gradient sketch: Keras does not call this method (gradients come
    # from tf.GradientTape), and it assumes `inputs` and `attn_weights` from
    # call() are available here.
    def backward(self, grad):
        # Sign grad
        grad_queries = tf.matmul(grad, attn_weights, transpose_a=True)

        # Backprop queries
        grad_queries = quantize(grad_queries)
        grad_q_weight = tf.matmul(inputs, grad_queries, transpose_b=True) * self.q_scale

        # Backprop keys
        grad_keys = tf.matmul(attn_weights, grad, transpose_a=True)
        grad_kv_weight = tf.matmul(inputs, grad_keys, transpose_b=True)[:, :self.hidden_size] * self.kv_scale

        # Backprop values
        grad_values = tf.matmul(attn_weights, grad, transpose_b=True)
        grad_kv_weight = tf.concat([grad_kv_weight, tf.matmul(inputs, grad_values, transpose_b=True)],
                                   axis=1) * self.kv_scale

        return grad

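class FeedForward(tf.keras.layers.Layer):

    def __init__(self, hidden_size):
        super().__init__()
        # Create the sublayers once so their weights are reused across calls
        self.dense_in = tf.keras.layers.Dense(units=hidden_size, activation=tf.nn.relu)
        self.dense_out = tf.keras.layers.Dense(units=hidden_size)

    def call(self, inputs):
        x = self.dense_in(inputs)
        return self.dense_out(x)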

class LayerNormalization(layers.Layer):

    def __init__(self, hidden_size, epsilon=1e-6):
        super().__init__()
        self.gamma = self.add_weight(shape=(hidden_size,), initializer='ones', trainable=True)
        self.beta = self.add_weight(shape=(hidden_size,), initializer='zeros', trainable=True)
        self.epsilon = epsilon

    def call(self, x):
        mean = tf.reduce_mean(x, axis=-1, keepdims=True)
        variance = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
        normalized = (x - mean) * tf.math.rsqrt(variance + self.epsilon)
        return self.gamma * normalized + self.beta

# Placeholder for the quantize function

def quantize(x):
    abs_max = tf.math.reduce_max(tf.math.abs(x))
    quantized = x / abs_max
    return tf.clip_by_value(quantized, -1, 1)

# Assuming ce (cross-entropy) and lr (learning rate) are defined elsewhere

ce = tf.keras.losses.CategoricalCrossentropy()
lr = 0.001

# Instantiate the model and compile

model = BitNet(
    num_layers=12,
    hidden_size=768,
    num_heads=12,
    vocab_size=30000
)
model.compile(optimizer=Adam(lr), loss=ce)

@tf.function
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        outs = model

·

In your BitAttention class you use fp32 for the weights. When are these weights converted into the ternary representation (-1, 0, 1)? I might be blind, but I just can't see it.

Updated BitAttention
Maintain both high-precision master weights and quantized low-bit weights.
For the forward pass, use the low-bit weights for efficiency.
For the backward pass, calculate gradients with respect to the low-bit weights.
Then apply the straight-through estimator: accumulate those gradients directly onto the high-precision master weights, bypassing the non-differentiable quantization.

class BitAttention(tf.keras.layers.Layer):

    def __init__(self, hidden_size, num_heads, quantization_bits=1):
        super().__init__()
        self.num_heads = num_heads
        self.hidden_size = hidden_size
        self.quantization_bits = quantization_bits
        # call() applies layer normalization, so the layer is created here
        self.layer_norm = LayerNormalization(hidden_size)

    def build(self, input_shape):
        # Initialize high-precision master weights
        self.q_weight_master = self.add_weight(
            shape=(input_shape[-1], self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal,
            dtype=tf.float32,
            name='q_weight_master'
        )

        self.kv_weight_master = self.add_weight(
            shape=(input_shape[-1], 2 * self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal,
            dtype=tf.float32,
            name='kv_weight_master'
        )

        # Initialize low-bit quantized weights
        self.q_weight = self.add_weight(
            shape=(input_shape[-1], self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal,
            dtype=tf.float32,
            trainable=False,
            name='q_weight'
        )

        self.kv_weight = self.add_weight(
            shape=(input_shape[-1], 2 * self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal,
            dtype=tf.float32,
            trainable=False,
            name='kv_weight'
        )

    def call(self, inputs, training):
        # Use low-bit weights for forward pass
        queries = tf.matmul(inputs, self.q_weight)
        keys = tf.matmul(inputs, self.kv_weight[:, :self.hidden_size])
        values = tf.matmul(inputs, self.kv_weight[:, self.hidden_size:])

        qk_aproduct = tf.matmul(queries, keys, transpose_b=True) / np.sqrt(self.hidden_size)
        attn_weights = tf.nn.softmax(qk_aproduct)

        attn_out = tf.matmul(attn_weights, values)

        # Residual connection
        output = inputs + attn_out

        # Layer normalization
        output = self.layer_norm(output)

        return output

    # Manual gradient sketch: Keras does not call this method, and it assumes
    # `inputs`, `self.attn_weights`, `training`, SYNC_INTERVAL, and a global
    # step counter are provided elsewhere.
    def backward(self, grad):
        # Sign grad
        grad_queries = tf.matmul(grad, self.attn_weights, transpose_a=True)

        # Backprop queries
        grad_queries = quantize(grad_queries)
        grad_q_weight = tf.matmul(inputs, grad_queries, transpose_b=True)

        # Backprop keys
        grad_keys = tf.matmul(self.attn_weights, grad, transpose_a=True)
        grad_kv_weight = tf.matmul(inputs, grad_keys, transpose_b=True)[:, :self.hidden_size]

        # Backprop values
        grad_values = tf.matmul(self.attn_weights, grad, transpose_b=True)
        grad_kv_weight = tf.concat([grad_kv_weight, tf.matmul(inputs, grad_values, transpose_b=True)],
                                   axis=1)

        # Use straight-through estimator
        self.q_weight_master.assign_add(grad_q_weight)
        self.kv_weight_master.assign_add(grad_kv_weight)

        # Sync quantized weights from masters periodically
        if training and self.quantization_bits < 32:
            if tf.equal(tf.math.mod(tf.train.get_global_step(), SYNC_INTERVAL), 0):
                self.q_weight.assign(quantize(self.q_weight_master))
                self.kv_weight.assign(quantize(self.kv_weight_master))

        return grad

Is this really 1.58bits or is this 2bits with some waste?

Unless the future hardware has ternary memory, it's still going to be stored in binary. The simplest encoding would be 2bits (maybe sign -1,1 & mag 0, 1), but that's pretty far from 1.58 bits. You could encode 5 ternary bits with 8 binary bits for storage (1.6bits/weight) but then you need some decoder (like a lookup table), and I'm not sure if that was factored into the efficiency/power graphs.

So if we assume it's actually 2-bit storage, it raises the question of why not quantize to all 4 values instead of just 3. At first glance it may seem that using only 3 is required to avoid the multiplication, but if I understood correctly the activations are int8, so the 4th weight value could have been 0.5 and the hardware can simply right shift instead of multiply, which is just as "free" as the other 3 values (-1, 0, 1).

Am I missing something here @shumingma ?
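
The 5-trits-in-8-bits packing mentioned above works because 3^5 = 243 fits in one byte. A minimal sketch in Python, treating the five weights as base-3 digits; the function names `pack5` and `unpack5` are illustrative, and this is not claimed to be the encoding used for the paper's measurements.

```python
def pack5(trits):
    """Pack five values from {-1, 0, 1} into one integer in 0..242."""
    assert len(trits) == 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, 1 to base-3 digits 0, 1, 2
    return value


def unpack5(byte):
    """Recover the five ternary values from a packed byte."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)      # map digits 0, 1, 2 back to -1, 0, 1
    return trits


weights = [1, -1, 0, 0, 1]
packed = pack5(weights)
restored = unpack5(packed)
print(packed, restored)
```

Decoding here costs a few divisions per byte; a 256-entry lookup table would be the usual way to avoid that, which is the decoder cost the comment asks about.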

·

Noticed that a post-training quantization work seemed similar to this. https://huggingface.co/papers/2402.11960

@brandf It doesn't address the packing question, but now that you say, with practically free bit shifts, one could avoid multiplication up to (-2, -1, 0, 1, 2) with evenly spaced weights, and even (-4, -2, -1, 0, 1, 2, 4) doesn't look too bad.

·

Any weight of the form 1/2^x can also be handled with shifts. It doesn't even have to be symmetric, so for example with 3-bit quantization you get 8 values and you could map them to (-1, -0.5, 0, 0.25, 0.5, 1, 2, 4).

When signed integers are represented in the standard two's complement way, the right shifts need to preserve the high-order bit, but again that's free in hardware.

This shift trick doesn't work unless the activations are integers, though there are similar bit-level tricks that can be done to avoid a full multiply.
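
A small Python sketch of the shift trick for integer activations. Weights are stored as a hypothetical (sign, shift) pair meaning sign * 2**shift; the function name and the encoding are illustrative assumptions. Python's `>>` on negative integers is an arithmetic shift, so it rounds toward negative infinity, like a sign-preserving hardware shift.

```python
def apply_pow2_weight(activation, sign, shift):
    """Multiply an integer activation by sign * 2**shift without a multiply."""
    if sign == 0:
        return 0
    if shift >= 0:
        shifted = activation << shift      # times 2**shift
    else:
        shifted = activation >> -shift     # divided by 2**(-shift), floored
    return shifted if sign > 0 else -shifted


activations = [12, -8, 5]
# (sign, shift) pairs standing for the weights 0.5, -2 and 0.
weights = [(1, -1), (-1, 1), (0, 0)]
dot = sum(apply_pow2_weight(a, s, k) for a, (s, k) in zip(activations, weights))
print(dot)
```

Note that right shifts lose the low-order bits (for example, -7 shifted right by one gives -4), so fractional weights introduce rounding that the ternary values -1, 0, and 1 do not.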

I'm a beginner college student. When I first saw ReLU, I was like, "Was there such a simple way?" but this time I feel similar. This weighting seemed like a W with ReLU.

I hope there's a code that I can experiment with or recreate.

Does this mean that ternary computers are making a comeback?

·

More like ternary accelerator :)

Paper author

Thank you so much for your interest in our work! I'm delighted to see such insightful discussions taking place around our 1-bit LLMs. We truly appreciate the engagement from the community.

I'm excited to share that we will be releasing a detailed note paper this week, which will provide in-depth coverage of the implementation details and experiments discussed in the initial paper. Additionally, we plan to address the questions and comments raised here within the note paper itself.

The note paper is expected to be published this week, hopefully as early as tomorrow. We can't wait to continue the discussions and receive further feedback from all of you once the paper is out.

Stay tuned for the upcoming release, and please feel free to keep the insightful questions and comments coming!

·

I hope there will be good results!!

Paper author

A new paper providing training details, code, and FAQ is available at https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
(It's not on arXiv for some inexplicable reason.)

We welcome any questions or comments you may have regarding this paper and the information it covers. Feel free to share your thoughts and inquiries!

·

Will the toy models we see trained in the paper (the 3b variants especially) be released on HuggingFace so that llama.cpp and other software can add support for the modified arch? It would be interesting to see how the community optimizes / takes advantage of this on current hardware too.

Someone wrote a critical blog post (saw it on HN), but I'm not experienced enough to know whether the criticisms have merit: https://huggingface.co/blog/joey00072/experiments-with-bitnet-1-5

·

The paper says that the discrepancy with FP16 gets reduced when the models are larger.

In the blog, the models are only 15M parameters, so I don't think it proves anything.

But that said, we still don't know what happens when a 70B ternary model is trained on a very large dataset with 4-8T tokens. Perhaps the ternary model's loss will saturate a lot earlier than the FP16 model.

We have successfully reproduced the results shown in the paper! All models are trained with 100B tokens on RedPajama. The weights can be quantized to ternary values offline. We have released the 700M, 1.3B, and 3B models and the evaluation results at https://huggingface.co/1bitLLM

·

That's awesome, can you share some info on the training compute requirements?

Hi all, first of all, what an exciting result @shumingma ! Very excited to see your followup work, plus of course model weights and code. I wrote a blog post about the paper(s) here: https://learning-exhaust.hashnode.dev/are-all-large-language-models-really-in-158-bits

I hope this helps people pick apart the details and understand what may be going on under the hood. @shumingma I would love to hear your feedback on the blog.

·

Thanks, this was a very nice writeup!

Hello there!

I am excited about the work you have done, congratulations!

I just have a small question. For the PyTorch implementation that you have provided here, you mention that it is necessary to remove the RMSNorm layers that precede the Attention and MLP calculations. This is because the new BitLinear layer is responsible for performing this operation.

Considering that the RMSNorm contains parameters that are learned during training:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(model.model.layers[0].input_layernorm.weight) # RMSNorm example
Parameter containing:
tensor([0.0535, 0.2080, 0.4473, ..., 0.0854, 0.0435, 0.0289],
       requires_grad=True)

In the case of the BitLinear layers: do these RMSNorm layers contain such parameters, or are they parameter-free RMSNorm layers? Must we change the original forward operation to something along these lines?

Revolutionize LLMs: BitNet b1.58 Brings 1.58-bit Efficiency!

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

@shumingma

llama.cpp supports running the models reproduced by @1bitLLM !

Any plans to release the 3B model trained with 2T tokens? It would be a step up in model quality!

Hi, has the code to train the model from scratch for 1.5-bit been made public yet? If so, I would appreciate it if anyone could share the link.

Paper author

Hi all,

We have released the inference code for BitNet b1.58 models. The current release is optimized for CPU devices (both x86 and ARM), and will support GPU and NPU in the coming releases.

👉 https://github.com/microsoft/BitNet

Features:

  • 🔥 Seamlessly support the 1-bit models on Hugging Face
  • 🚀 Running a 100B BitNet b1.58 model on a single CPU with speeds comparable to human reading
  • 🤖 Deploying on various platforms (Windows, Linux, Mac, Android, etc) and different architectures (x86 and ARM)

Have fun!
