Quantization
TO READ
2017-Mixed precision training
See [MNA+17].
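The key idea there is to keep an FP32 master copy of the weights and scale the loss so that small FP16 gradients survive backpropagation. A minimal numpy sketch of that loss-scaling idea (the scale factor 65536 and the toy gradient 1e-8 are illustrative choices, not values from the paper):
#!/usr/bin/env python3
import numpy as np


def loss_scaling_demo():
    # Mixed precision training computes forward/backward in float16 but
    # multiplies the loss by a constant S before backpropagation, so that
    # gradients too small for float16 stay representable; they are divided
    # by S again before the float32 master weights are updated.
    scale = 65536.0     # illustrative loss-scaling factor S
    true_grad = 1e-8    # toy gradient below float16's smallest subnormal
    unscaled = np.float16(true_grad)        # underflows to 0.0
    scaled = np.float16(true_grad * scale)  # representable in float16
    recovered = float(scaled) / scale       # unscale before the update
    print("without loss scaling:", unscaled)
    print("with loss scaling   :", recovered)


if __name__ == "__main__":
    loss_scaling_demo()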
Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT
2019-Neural network distiller: A python package for dnn compression research
See :cite:p:`zmora2019neural`. It is from Intel.
It also has a github repo at https://github.com/IntelLabs/distiller
Read its quantization code! The repo has 4.3k stars!
2011-Improving the speed of neural networks on CPUs
See [VSM+11]. It is from Google.
It covers SIMD optimizations (SSE/SSSE3) and 8-bit fixed-point arithmetic for speeding up inference on CPUs.
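As a rough scalar illustration of the kind of 8-bit fixed-point arithmetic that paper maps onto SSE, here is a numpy sketch of a quantized dot product with a wide accumulator; the function name quantized_dot and the per-tensor scale of 127 for inputs in [-1, 1] are assumptions for the example, not details from the paper.
#!/usr/bin/env python3
import numpy as np


def quantized_dot(x, w, s_x, s_w):
    # Quantize both operands to int8; on a CPU this step maps to SIMD
    # (SSE/SSSE3) integer instructions.
    xq = np.clip(np.round(x * s_x), -128, 127).astype(np.int8)
    wq = np.clip(np.round(w * s_w), -128, 127).astype(np.int8)
    # Accumulate the products in int32 so the sum cannot overflow,
    # then rescale back to floating point.
    acc = np.dot(xq.astype(np.int32), wq.astype(np.int32))
    return acc / (s_x * s_w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=256)
    w = rng.uniform(-1, 1, size=256)
    s = 127.0  # assumed per-tensor scale for values in [-1, 1]
    print("float dot    :", np.dot(x, w))
    print("quantized dot:", quantized_dot(x, w, s, s))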
2020-Integer quantization for deep learning inference: Principles and empirical evaluation
See [WJZ+20].
This paper is from NVIDIA. It includes a table comparing the throughput of FP16, INT8, INT4, and INT1 on NVIDIA Turing GPUs.
An example:
#!/usr/bin/env python3


def test_asymmetric_quant():
    # We use the notation from
    # https://arxiv.org/pdf/2004.09602
    # equations 1 and 2:
    #
    #   f(x) = s*x + z
    #
    # It is called asymmetric because
    # float_low != -float_high
    # and
    # int_low != -int_high.
    # The paper also calls this affine quantization.
    float_low = -3
    float_high = 4
    int_low = -128
    int_high = 127
    s = (int_high - int_low) / (float_high - float_low)  # 255 / 7 = 36.43
    # The zero point z maps float_low to int_low.
    z1 = -float_low * s + int_low  # unrounded zero point, -18.71
    z = -round(float_low * s) + int_low  # -19
    print(s, z1, z)  # 36.43 -18.71 -19
    f = lambda x: round(s * x) + z
    print(f(float_low), f(float_high), f(1))  # -128, 127, 17


def test_symmetric_quant():
    # We use the notation from
    # https://arxiv.org/pdf/2004.09602
    #
    #   f(x) = s*x + z, where z is 0
    #
    # It is called symmetric because
    # float_low == -float_high
    # and
    # int_low == -int_high.
    # The paper also calls this scale quantization.
    float_low = -4
    float_high = 4
    int_low = -127
    int_high = 127
    s = (int_high - int_low) / (float_high - float_low)  # 254 / 8 = 31.75
    z1 = -float_low * s + int_low  # 0.0
    z = -round(float_low * s) + int_low  # 0
    print(s, z1, z)  # 31.75 0.0 0
    f = lambda x: round(s * x) + z
    print(f(float_low), f(float_high), f(1))  # -127, 127, 32


def main():
    print("---asymmetric---")
    test_asymmetric_quant()
    print("---symmetric---")
    test_symmetric_quant()


if __name__ == "__main__":
    main()
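Going the other way, the approximate real value is recovered by dequantization, f_inv(x_q) = (x_q - z) / s. A small follow-up sketch of the quantize/dequantize round trip, reusing the asymmetric parameters from the example above (the sample inputs are arbitrary):
#!/usr/bin/env python3


def test_quant_dequant_roundtrip():
    # Same asymmetric (affine) parameters as test_asymmetric_quant above.
    float_low, float_high = -3, 4
    int_low, int_high = -128, 127
    s = (int_high - int_low) / (float_high - float_low)
    z = -round(float_low * s) + int_low
    # Quantize with clipping to the integer range, then dequantize.
    quant = lambda x: min(max(round(s * x) + z, int_low), int_high)
    dequant = lambda xq: (xq - z) / s
    for x in (-3, -1.234, 0.0, 2.5, 4):
        xq = quant(x)
        print(f"{x:7.3f} -> {xq:4d} -> {dequant(xq):7.3f}")


if __name__ == "__main__":
    test_quant_dequant_roundtrip()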
2022-A survey of quantization methods for efficient neural network inference
See [GKD+22]
1998-Quantization
See [GN98].
References
- GKD+22
Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022. URL: https://arxiv.org/pdf/2103.13630.
- GN98
Robert M. Gray and David L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=720541.
- MNA+17
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and others. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017. URL: https://openreview.net/pdf?id=r1gs9JgRZ.
- VSM+11
Vincent Vanhoucke, Andrew Senior, Mark Z Mao, and others. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, 4. 2011. URL: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37631.pdf.
- WJZ+20
Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020. URL: https://arxiv.org/pdf/2004.09602.