21 votes

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

15 comments

  1. [10]
    creesch
    (edited )
    Link

    The abstract was still a bit jargony for me.
    But, as far as my understanding goes, this introduces a new type of LLM where every parameter is stored as either a 0, 1 or -1 (hence the 1.58-bit name, I suppose).
    Current LLMs typically store each parameter as a 16-bit floating point number. That does allow them to hold much more fine-grained information, but it also means each parameter takes up more space, even in cases where that precision isn't really needed.

    So the exciting bit here, at least if I understand correctly, is that with the much more limited parameter storage of the 1.58-bit format they have achieved the same or similar reasoning performance?

    That in turn seems to mean that less memory and energy is needed, which influences scalability and the hardware that can potentially be used.
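
    A rough back-of-the-envelope of the memory side, with my own made-up numbers for a hypothetical 3B-parameter model (ignoring activations, embeddings and packing overhead):

    ```python
    import math

    # Each ternary weight carries log2(3) ≈ 1.58 bits of information,
    # which is presumably where the "1.58-bit" name comes from.
    bits_ternary = math.log2(3)   # ~1.585
    bits_fp16 = 16

    params = 3e9  # hypothetical 3B-parameter model

    def gigabytes(total_bits):
        return total_bits / 8 / 1e9

    print(f"fp16 weights:    {gigabytes(params * bits_fp16):.2f} GB")    # ~6.00 GB
    print(f"ternary weights: {gigabytes(params * bits_ternary):.2f} GB") # ~0.59 GB
    ```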

    Edit:

    Reading the linked issue, it looks like a lot is still unknown, and I am seeing a healthy amount of skepticism as well, which is good given the craze around LLMs, I suppose.

    So I guess we have to wait and see if the claimed results can actually be achieved. It would be extremely nice if they can be. For various projects I have been experimenting with the OpenAI API (and more recently OpenRouter, to use other LLMs as well). These projects have been fun and promising, but at any scale they get expensive fast.
    I also don't like the fact that I am tied to third-party services.

    Being able to scale down the resources needed for models will hopefully make it much more practical to self-host them in a more affordable manner.

    But at this point that is still wishful thinking. Let's see if any of the promises in the paper actually pan out.

    12 votes
    1. [3]
      Greg
      Link Parent

      I've only skimmed the paper so far, but that sounds like the core of it - although my own area is visual models rather than language, so happy to be corrected by anyone who knows more here!

      BitNet kicked off the idea of training LLMs from scratch using extremely low precision parameters, and this work extends the basis slightly (ternary rather than binary) and shows what the technique can do at scale when compared to equivalent higher precision architectures.

      Often models are trained in 32 and/or 16 bit precision and then quantised afterwards, whereas these are trained in quantised form right off the bat - which seems to make intuitive sense: parameters on the margin of being rounded one way or the other will be determined and refined in the context of the training dataset as a whole, rather than in the much less stable context of the final model weights alone. I couldn't immediately see any info on the training cost, but I'd assume that fully quantising from the start would significantly reduce the overhead there as well.

      There's been some previous work on training/fine tuning in a way that's aware a model has been or will be quantised, but from these newer papers it's looking like there could be some real advantages in treating the quantisation as a core part of the architecture rather than a post hoc compression of the full model.
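
      For anyone curious what training in quantised form looks like mechanically, here's a minimal PyTorch sketch of the usual straight-through estimator trick - my own reading of the general approach, not necessarily this paper's exact recipe:

      ```python
      import torch

      def ternarize(w: torch.Tensor) -> torch.Tensor:
          # Normalise by the mean absolute value, round each weight to -1, 0 or 1,
          # then restore the scale (roughly the shape of the paper's quantisation;
          # the exact details may differ).
          scale = w.abs().mean().clamp(min=1e-5)
          return (w / scale).round().clamp(-1, 1) * scale

      def ternarize_ste(w: torch.Tensor) -> torch.Tensor:
          # Straight-through estimator: the forward pass sees the ternary weights,
          # the backward pass treats the quantisation as identity, so the
          # full-precision "shadow" weights keep receiving useful gradients.
          return w + (ternarize(w) - w).detach()
      ```

      In a layer like this, the forward pass would use `ternarize_ste(self.weight)` instead of the raw weights, so the network learns around the rounding from day one rather than having it bolted on afterwards.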

      7 votes
      1. [2]
        balooga
        Link Parent

        I wonder if the same technique could be used for image models?

        2 votes
        1. Greg
          Link Parent

          Just what I was thinking! I doubt it’d do much good in a unet, those get unstable even training in bf16, but I’m very tempted to do some tests with a diffusion transformer…

          2 votes
    2. vord
      Link Parent

      So, I'm by no means a domain expert, and this is purely speculation.

      I wager it's something along the lines of significant figures. Everything is ultimately getting reduced to a yes/no decision. It doesn't really matter if the weight is 0.000487654 if it's getting rounded to 1. So if the training data can be more condensed, it will allow for a far greater token count.

      1 vote
    3. [5]
      pete_the_paper_boat
      (edited )
      Link Parent

      I'm no domain expert here, but I share the same concerns; until I can try it out in llama.cpp I am skeptical. Still pretty excited to see if I could run a 30B model in under 12GBs of VRAM!

      It seems like the models conform to the format's limitations during training. That would make sense (to me). I would guess models trained with floating-point weights come to rely on those decimals, which is why post-training quantization takes such a hit on performance.

      Sounds like the previous BitNet used {-1, 1}, so maybe the 0 was very important.

      1 vote
      1. [4]
        vord
        Link Parent

        Makes sense; no/maybe/yes gives a lot more flexibility than just yes/no.

        3 votes
        1. [3]
          pete_the_paper_boat
          (edited )
          Link Parent

          I don't think it's "maybe"; I think it represents sparsity. By not propagating a value through a connection (multiplying by 0), it's essentially disconnected.

          The paper also mentions they only use addition when applying the weights; I think what they mean is that at this precision, the multiplications can be performed using only addition.
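
          As a toy illustration of the arithmetic (nothing to do with the actual kernels, just the idea):

          ```python
          def ternary_matvec(weights, x):
              """Dot products against ternary weights need no multiplications:
              a +1 adds the input, a -1 subtracts it, and a 0 skips it entirely,
              which is the "disconnected" case."""
              out = []
              for row in weights:              # each row holds values from {-1, 0, 1}
                  acc = 0.0
                  for w, xi in zip(row, x):
                      if w == 1:
                          acc += xi
                      elif w == -1:
                          acc -= xi
                      # w == 0: the connection is effectively pruned, nothing to add
                  out.append(acc)
              return out

          W = [[1, 0, -1],
               [0, 1,  1]]
          print(ternary_matvec(W, [0.5, 2.0, -1.0]))  # [1.5, 1.0]
          ```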

          3 votes
          1. [2]
            vord
            (edited )
            Link Parent

            So in that case it would be more like 'decreases, leaves unchanged, increases'.

            In cases like this it could well be that logic checks against nulls may be straight up worse than addition across the board. Integer addition is simpler than integer multiplication by a decent factor IIRC.

            Add everything. Cut lowest values. Repeat.

            2 votes
            1. pete_the_paper_boat
              Link Parent

              In cases like this it could well be that logic checks against nulls may be straight up worse

              Yes! It's like branchless programming. They probably use SIMD instructions to calculate the dot products efficiently.
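
              Something like this, maybe, with numpy standing in for what SIMD would do (whether the real kernels look anything like it is pure guesswork on my part):

              ```python
              import numpy as np

              def ternary_matvec_branchless(W: np.ndarray, x: np.ndarray) -> np.ndarray:
                  # No per-element if/else: select +x, -x or 0 for every weight
                  # with masks, then reduce each row. A vectorised select-and-sum
                  # like this is exactly the shape of work SIMD units are good at.
                  contrib = np.where(W == 1, x, 0.0) - np.where(W == -1, x, 0.0)
                  return contrib.sum(axis=1)

              W = np.array([[1, 0, -1], [0, 1, 1]])
              print(ternary_matvec_branchless(W, np.array([0.5, 2.0, -1.0])))  # [1.5 1. ]
              ```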

              2 votes
  2. [3]
    pete_the_paper_boat
    Link

    The authors answered some questions on their paper's huggingface page.

    5 votes
    1. [2]
      pete_the_paper_boat
      Link Parent

      UPDATE:

      They've since released a PDF on GitHub with training tips and a PyTorch implementation.

      link

      4 votes
      1. Wes
        Link Parent

        Thanks for sharing the update. I've been waiting earnestly for more information on this one. Of course being a layman, I found the FAQ the most interesting part.

        Will BitNet work for larger models?

        As shown in Tables 1 and 2 in the "The Era of 1-bit LLMs" paper, there is a clear trend indicating that the gap between full-precision Transformer LLMs and BitNet b1.58 (and also b1) becomes smaller as the model size increases. This suggests that BitNet and 1-bit language models are even more effective for larger models. Scaling is one of the primary goals of our research on 1-bit LLMs, as we eventually need to scale up the model size (and training tokens) to train practical LLMs.

        Very exciting!

        1 vote
  3. Greg
    (edited )
    Link

    Looks like someone’s already implemented a version of this yesterday: https://github.com/kyegomez/BitNet/blob/main/bitnet/bitbnet_b158.py

    Only 80 lines of code, which makes sense given it’s just a drop-in replacement for nn.Linear - seems like figuring out that it was an idea worth testing at scale in the first place was the hard part!


    [Edit] Having had a chance to look at this properly, it's worth mentioning that it's storing the weights in full precision and applying the quantisation on every call, so unless I'm missing something significant it's not going to get the speed or VRAM advantages from the paper. Nice proof of concept for testing results side by side, though!


    [Edit 2] It works enough to get some toy examples to start converging, which is exciting. Getting the efficiency gains is going to involve hooking the whole thing into some deeper torch APIs to spoof the ternary weights at training time and then actually compile them to run that way "for real" on the GPU at inference, which will likely take a bit of time to figure out, but it looks very doable.
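
    To give a sense of what realising the VRAM savings would involve, here's a rough sketch of packing the ternary weights for storage - entirely my own illustration, not something from that repo. Two bits per weight is the lazy version (4 weights per byte), which already gets you an 8x saving over fp16 even before chasing the theoretical 1.58 bits:

    ```python
    import numpy as np

    def pack_ternary(w: np.ndarray) -> np.ndarray:
        """Pack a 1-D array of {-1, 0, 1} weights into 2 bits each, 4 per byte."""
        codes = (w.astype(np.int8) + 1).astype(np.uint8)   # {-1, 0, 1} -> {0, 1, 2}
        codes = np.pad(codes, (0, (-len(codes)) % 4))      # pad to a multiple of 4
        codes = codes.reshape(-1, 4)
        return (codes[:, 0] | (codes[:, 1] << 2) |
                (codes[:, 2] << 4) | (codes[:, 3] << 6))

    def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
        codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
        return codes.reshape(-1).astype(np.int8)[:n] - 1   # {0, 1, 2} -> {-1, 0, 1}

    w = np.array([1, -1, 0, 0, 1, 1, -1], dtype=np.int8)
    assert np.array_equal(unpack_ternary(pack_ternary(w), len(w)), w)
    ```

    The harder part is what Edit 2 above is getting at: having the GPU run the matmuls against the packed representation directly, rather than unpacking back to full precision first.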

    4 votes
  4. pete_the_paper_boat
    (edited )
    Link

    Abstract: Recent research, such as BitNet [23], is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

    There's currently an issue open on the llama.cpp repository to add support. But there is no code publicly available as of now.

    1 vote