CUDA
Blogs
Quantization
- Tim Dettmers:
- Y. Belkada, T. Dettmers: A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
- While ideally training and inference should be done in FP32, it is two times slower than FP16/BF16
- Experimentally, we have discovered that instead of using the 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half-precision, which halves the model size (see the half-precision loading sketch after this list)
- LLM.int8(): zero degradation matrix multiplication for Large Language Models
- In essence, LLM.int8() seeks to complete the matrix multiplication computation in three steps (a toy PyTorch sketch follows this list):
- From the input hidden states, extract the outliers (i.e. values that are larger than a certain threshold) by column.
- Perform the matrix multiplication of the outliers in FP16 and the non-outliers in int8.
- Dequantize the non-outlier results and add both outlier and non-outlier results together to receive the full result in FP16.
- Hugging Face Python implementation (see the `load_in_8bit` example below)
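
A minimal sketch of loading a checkpoint in half precision with Transformers; the checkpoint name `bigscience/bloom-1b7` is purely illustrative, not prescribed by the post:

```python
# Load a model in 2-byte FP16 instead of 4-byte FP32, halving its memory footprint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-1b7"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # 2-byte half precision weights
)
```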
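
A toy re-implementation of the three LLM.int8() steps above in plain PyTorch, only to make the flow concrete; the function name `llm_int8_matmul`, the threshold value, and the absmax scaling scheme are simplifying assumptions, and the real bitsandbytes kernels work differently (fused int8 matmul with int32 accumulation on GPU):

```python
# Toy sketch of the LLM.int8() decomposition; not the bitsandbytes implementation.
import torch

def llm_int8_matmul(x_fp16, w_fp16, threshold=6.0):
    # Step 1: find outlier feature columns of the hidden states (shape [tokens, features]).
    outlier_cols = (x_fp16.abs() > threshold).any(dim=0)

    # Step 2a: multiply the outlier columns in FP16.
    out_outliers = x_fp16[:, outlier_cols] @ w_fp16[outlier_cols, :]

    # Step 2b: absmax-quantize the remaining columns to int8 (per row of x, per column of w).
    x_sub, w_sub = x_fp16[:, ~outlier_cols], w_fp16[~outlier_cols, :]
    x_scale = x_sub.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / 127.0
    w_scale = w_sub.abs().amax(dim=0, keepdim=True).clamp(min=1e-5) / 127.0
    x_int8 = torch.round(x_sub / x_scale).to(torch.int8)
    w_int8 = torch.round(w_sub / w_scale).to(torch.int8)
    # The real kernel accumulates int8 x int8 into int32; emulated in FP32 here for portability.
    acc = x_int8.to(torch.float32) @ w_int8.to(torch.float32)

    # Step 3: dequantize the non-outlier result and add the FP16 outlier result.
    out_non_outliers = (acc * x_scale.float() * w_scale.float()).to(torch.float16)
    return out_outliers + out_non_outliers

x = torch.randn(4, 16, dtype=torch.float16)
w = torch.randn(16, 8, dtype=torch.float16)
print(llm_int8_matmul(x, w).shape)  # torch.Size([4, 8])
```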
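
And a minimal sketch of the Hugging Face integration referenced above: passing `load_in_8bit=True` to `from_pretrained` routes the linear layers through the bitsandbytes LLM.int8() kernels (requires `bitsandbytes`, `accelerate`, and a CUDA GPU; the checkpoint name is again just an illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-1b7"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # dispatch weights across available devices via Accelerate
    load_in_8bit=True,   # quantize linear layers to int8 with the LLM.int8() scheme
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```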
Companies
Other