Mystery Gift Box #046 | Quantization: Making LLMs More Accessible
The best hidden gems I've found; interesting ideas and concepts, thought-provoking questions, mind-blowing books/podcasts, cool animes/films, and other mysteries ❤️
Hey friends,
Large Language Models.
They are inaccessible for most. If you want to do inference on BLOOM-176B, you would need EIGHT 80GB A100 GPUs, which cost about $15k each; to fine-tune BLOOM-176B, you need 72 of these GPUs. So it's important to find methods that reduce these heavy compute requirements while preserving the model's performance.
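To see where numbers like these come from, here's a rough back-of-envelope sketch (weights only, ignoring activations, the KV cache, and framework overhead, which push the real requirement higher) of the GPU memory needed just to hold 176B parameters at different precisions:

```python
# Rough memory estimate for BLOOM-176B weights (illustrative, weights only)
params = 176e9  # 176 billion parameters

for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    total_gb = params * bytes_per_param / 1e9
    a100s = total_gb / 80  # how many 80GB A100s just to hold the weights
    print(f"{name}: ~{total_gb:,.0f} GB of weights (~{a100s:.1f} x 80GB A100s)")
```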
There are different methods to shrink the model size: Quantization, Knowledge Distillation, Weight Pruning, Parallelism, and others.
Let’s focus on Quantization.
😈 Quantization
Quantization refers to methods for converting (“rounding”) the weights and activations of a neural network from high precision floating point values like float32 (FP32) to lower precision values like int8 (INT8). This allows models to use less memory and be faster while trying to maintain the same accuracy.
Okay, so what are these data types float32 (FP32), FP16, BFLOAT16, TF32, INT8? Well, think of these as different ways to write and store numbers.
For example, in float32 (FP32), imagine there are 32 "slots" to write a number (float). 8 "slots" say how big the number is, 23 "slots" hold the digits of the number, and 1 slot tells us if the number is positive or negative. This type is used a lot because it works well with most hardware.
An FP32 float can store values as large as roughly 3.4 × 10^38… An INT8 integer can only store 2^8 = 256 distinct values (e.g. −128 to 127).
People often use FP32 because it is very accurate, but it is also slower and more memory-intensive than smaller types like FP16; 32 slots vs 16 slots…
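If you want to poke at these "slots" and ranges yourself, NumPy and PyTorch will tell you directly (a quick sketch):

```python
import numpy as np
import torch

# Bit layout and representable range of the common data types
print(np.finfo(np.float32))         # 32 bits: 1 sign + 8 exponent + 23 mantissa, max ~3.4e38
print(np.finfo(np.float16))         # 16 bits: 1 sign + 5 exponent + 10 mantissa, max ~65504
print(torch.finfo(torch.bfloat16))  # 16 bits: 1 sign + 8 exponent + 7 mantissa (FP32-like range)
print(np.iinfo(np.int8))            # 8 bits: only 256 integers, from -128 to 127
```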
Model Quantization: FP32 → FP16 → INT8 with a twist
As one would expect, as you lower the precision of these weights and activations, the model performance would start to drop. The question is how far can we lower the precision and yet still maintain model performance?
Researchers found that they can get near-identical model performance using just FP16 half-precision; half the model size with virtually no loss in performance!
Then we have 8-bit quantization, which uses a quarter of the bits of FP32 (and thus roughly 1/4th of the model size).
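Loading a model in FP16 is a one-liner these days. Here's a minimal sketch using transformers, with the small bigscience/bloom-560m standing in for its much larger siblings:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the model with FP16 weights: half the memory of the default FP32
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    torch_dtype=torch.float16,
)
print(next(model.parameters()).dtype)  # torch.float16
```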
The two most common 8-bit quantization techniques are:
Zero-point quantization
Absolute maximum (absmax) quantization
While the two quantization methods above allow us to shrink the model to 1/4th of its original size, this usually comes at the cost of model performance…
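To make this concrete, here's a minimal sketch of absmax quantization (zero-point quantization works similarly, but also shifts values by an offset so the full INT8 range is used for asymmetric data):

```python
import torch

def absmax_quantize(x: torch.Tensor):
    # Symmetric 8-bit quantization: scale so the largest |value| maps to 127
    scale = 127 / x.abs().max()
    x_int8 = (x * scale).round().to(torch.int8)
    return x_int8, scale

def absmax_dequantize(x_int8: torch.Tensor, scale: torch.Tensor):
    # Map the INT8 values back to floating point (with some rounding error)
    return x_int8.float() / scale

x = torch.randn(5)
x_q, scale = absmax_quantize(x)
print(x)                              # original FP32 values
print(x_q)                            # INT8 values in [-127, 127]
print(absmax_dequantize(x_q, scale))  # close to the original, but not exact
```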
The HuggingFace and BigScience teams have come up with LLM.int8(), which allows LLMs to run in INT8 precision AND with similar model performance.
LLM.int8()
The LLM.int8() implementation in the Hugging Face Transformers and Accelerate libraries is the first technique that does not degrade model performance, even for large models such as BLOOM-176B.
The team found that traditional quantization fails for large models due to outlier features in the model. Trying to squeeze a model with many outlier features into a constrained 8-bit precision produces bad results, as errors propagate across layers.
So how does LLM.int8() fix this problem?
Mixed-precision decomposition, meaning it uses multiple precision data types. Here's how LLM.int8() works in three simple steps (there's a toy sketch after the list):
Look at the features in the model and extract the outlier features such that you now have two groups of features. Let’s call them “normal” and “outlier” features
Perform the matrix multiplication of the outlier features in FP16 and the normal features in INT8
Dequantize the INT8 results of the normal features, then add them to the FP16 results of the outlier features to get the final result in FP16
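Here's a toy, FP32-simulated sketch of that decomposition (the real bitsandbytes kernels run the two paths in actual FP16 and INT8; the threshold and names below are just illustrative):

```python
import torch

def llm_int8_matmul_sketch(X: torch.Tensor, W: torch.Tensor, threshold: float = 6.0):
    """Toy sketch of mixed-precision decomposition, simulated in FP32."""
    # Step 1: split feature columns into "outlier" and "normal" groups
    outlier_cols = (X.abs() > threshold).any(dim=0)
    X_out, W_out = X[:, outlier_cols], W[outlier_cols, :]
    X_norm, W_norm = X[:, ~outlier_cols], W[~outlier_cols, :]

    # Step 2a: outlier features -> matmul kept at high precision (FP16 in the real kernel)
    out_high = X_out @ W_out

    # Step 2b: normal features -> absmax-quantize to INT8, matmul, then dequantize
    sx = 127 / X_norm.abs().max().clamp(min=1e-8)
    sw = 127 / W_norm.abs().max().clamp(min=1e-8)
    Xq = (X_norm * sx).round().clamp(-127, 127)
    Wq = (W_norm * sw).round().clamp(-127, 127)
    out_low = (Xq @ Wq) / (sx * sw)  # dequantize back to floating point

    # Step 3: add the two partial results for the final output
    return out_high + out_low

X = torch.randn(4, 8); X[:, 2] *= 10  # force one feature column to be an "outlier"
W = torch.randn(8, 3)
print(llm_int8_matmul_sketch(X, W))
print(X @ W)                          # reference result: very close
```

The point is that only a handful of outlier feature columns stay in high precision, so almost all of the matrix multiplication still happens in cheap INT8.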
The results?
The performance of BLOOM-176B in BF16 and in LLM.int8() precision on different benchmarks shows near-ZERO degradation! And although LLM.int8() inference is slightly slower than BF16, it requires significantly less hardware to run (half of BF16!).
In this newsletter, I have summarised the key lessons from this HuggingFace article. Check it out for more explanation and coding details on using LLM.int8() 🔥
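For reference, this is roughly what using it looks like in code (bigscience/bloom-560m as a small stand-in; newer transformers versions may ask for a BitsAndBytesConfig instead of the load_in_8bit flag):

```python
# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",   # let Accelerate spread the layers across available GPUs
    load_in_8bit=True,   # enable LLM.int8() quantization via bitsandbytes
)

inputs = tokenizer("Quantization makes LLMs", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```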
⛰ 4-4-4 Exploration Project
Each month, I would explore one new thing; a skill, a subject, or an experience.
January 2023: Writing and Storytelling (Subject) ✅
February 2023: KURIOS – Cabinet of Curiosities (Cirque Du Soleil — Experience) ✅
March 2023: 28 Days of Cold Exposure (Subject and Experience) ✅
April 2023: Complete Growth / Product Marketing Course (Subject) 🟥
May 2023: LL project (Skill + Subject) ✅
June 2023: LL project 2 (Skill + Subject) ✅
July 2023: LL project 3 (Skill + Subject) 🟧
📚 This week, I finished reading…
Few books in progress 🤓
Have interesting gems you want to share with me and others? Share them by replying to this email and I will include them in the next gift box :)
With love,
Ryan O. 🎮
😈 Connect with me on:
🎬 YouTube, 🐦 Twitter, 👨🏻💻 LinkedIn, 🌍 Personal Website, and 📸 Instagram