Quantization
TO READ
2017-Mixed precision training
See [MNA+17].
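The main tricks in this paper are keeping an FP32 master copy of the weights and scaling the loss so that small FP16 gradients do not underflow to zero. A minimal sketch of the loss-scaling idea (the gradient value 1e-8 and the scale 1024 are made up for illustration; numpy is used only to get an FP16 type):
import numpy as np
def test_loss_scaling():
    # A small gradient underflows to zero when cast to FP16, but survives
    # if the loss (and therefore every gradient) is first multiplied by a
    # scale factor and divided out again in FP32 before the weight update.
    grad = 1e-8                             # a "true" gradient value
    scale = 1024.0                          # loss scale factor
    unscaled = np.float16(grad)             # underflows to 0.0
    scaled = np.float16(grad * scale)       # about 1.02e-05, representable
    recovered = np.float32(scaled) / scale  # back to roughly 1e-08
    print(unscaled, scaled, recovered)
test_loss_scaling()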
Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT
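The core idea of quantization aware training is to insert a fake-quantize step (quantize, then immediately dequantize) into the forward pass so the network learns to tolerate the rounding error; gradients are typically passed through the rounding with a straight-through estimator. A minimal sketch of the fake-quantize step only, not the TensorRT API, using the affine notation from the example later in this section (the scale and zero-point values are illustrative):
def fake_quantize(x, s, z, int_low=-128, int_high=127):
    # quantize: clip(round(s * x) + z) to the int8 range
    q = min(max(round(s * x) + z, int_low), int_high)
    # dequantize back to float; training sees this perturbed value
    return (q - z) / s
# illustrative parameters for a [-3, 4] float range mapped to int8
s, z = 255 / 7, -18
print(fake_quantize(1.0, s, z))  # about 0.988, i.e. 1.0 with a small rounding error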
2019-Neural network distiller: A Python package for DNN compression research
See [zmora2019neural]. It is from Intel.
It also has a GitHub repo at https://github.com/IntelLabs/distiller
Read its quantization code; the repo has about 4.3k stars.
2011-Improving the speed of neural networks on CPUs
See [VSM+11]. It is from Google.
It describes 8-bit fixed-point quantization and SSE SIMD instructions for speeding up inference on x86 CPUs.
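The basic recipe there is to quantize weights and activations to 8-bit integers, do the multiply-accumulate in wider integer registers, and rescale the result back to float. A minimal numpy sketch of that idea (the data and scales are made up; the paper does this with SSE intrinsics, not numpy):
import numpy as np
def test_int8_dot_product():
    x = np.array([0.5, -1.25, 2.0, 0.1], dtype=np.float32)
    w = np.array([1.0, 0.25, -0.75, 3.0], dtype=np.float32)
    # symmetric per-tensor scales mapping each float range to int8
    sx = 127 / np.abs(x).max()
    sw = 127 / np.abs(w).max()
    xq = np.round(x * sx).astype(np.int8)
    wq = np.round(w * sw).astype(np.int8)
    # accumulate in int32 to avoid overflow, then rescale to float
    acc = np.dot(xq.astype(np.int32), wq.astype(np.int32))
    # exact: -1.0125, quantized: about -1.05 (the gap is quantization error)
    print(float(np.dot(x, w)), float(acc) / float(sx * sw))
test_int8_dot_product()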
2020-Integer quantization for deep learning inference: Principles and empirical evaluation
See [WJZ+20].
This paper is from NVIDIA. It includes a table comparing the throughput of FP16, INT8, INT4, and INT1 on NVIDIA Turing GPUs.
An example:
#!/usr/bin/env python3
def test_asymmetric_quant():
    # We use the notation from
    # https://arxiv.org/pdf/2004.09602
    # equation 1 and 2
    #
    # f(x) = sx + z
    # it is called asymmetric because
    # float_low != -float_high
    # and
    # int_low != -int_high
    #
    # It is also called affine quantization in this paper
    float_low = -3
    float_high = 4
    int_low = -128
    int_high = 127
    s = (int_high - int_low) / (float_high - float_low)  # 36.4285
    z1 = -float_low * s - 127  # -17.71
    z = -round(float_low * s) - 127  # -18
    print(s, z1, z)
    f = lambda x: round(s * x) + z
    print(f(float_low), f(float_high), f(1))  # -127, 128, 18
    # note: f(float_high) == 128 falls outside [int_low, int_high];
    # a real quantizer would clip the result to int_high (127)
def test_symmetric_quant():
    # We use the notation from
    # https://arxiv.org/pdf/2004.09602
    # equation 1 and 2
    #
    # f(x) = sx + z, where z is 0
    # it is called symmetric because
    # float_low == -float_high
    # and
    # int_low == -int_high
    #
    # It is also called scale quantization in this paper
    float_low = -4
    float_high = 4
    int_low = -127
    int_high = 127
    s = (int_high - int_low) / (float_high - float_low)  # 31.75
    z1 = -float_low * s - 127  # 0.0
    z = -round(float_low * s) - 127  # 0
    print(s, z1, z)  # 31.75  0.0 0
    f = lambda x: round(s * x) + z
    print(f(float_low), f(float_high), f(1))  # -127, 127, 32
def main():
    print("---asymmetric---")
    test_asymmetric_quant()
    print("---symmetric---")
    test_symmetric_quant()
if __name__ == "__main__":
    main()
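A natural follow-up to the example above is dequantization, the inverse map f^-1(q) = (q - z) / s. A small separate sketch, reusing the asymmetric parameters from test_asymmetric_quant and adding clipping, showing that the quantize/dequantize round trip stays within half a quantization step (0.5 / s) for in-range inputs:
def test_dequant_roundtrip():
    # same asymmetric parameters as test_asymmetric_quant above
    float_low, float_high = -3, 4
    int_low, int_high = -128, 127
    s = (int_high - int_low) / (float_high - float_low)
    z = -round(float_low * s) - 127
    quant = lambda x: min(max(round(s * x) + z, int_low), int_high)
    dequant = lambda q: (q - z) / s
    for x in (-3, 0, 1, 3.99):
        q = quant(x)
        x_back = dequant(q)
        print(x, q, x_back, abs(x - x_back) <= 0.5 / s)  # the check is always True
test_dequant_roundtrip()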
2022-A survey of quantization methods for efficient neural network inference
See [GKD+22].
1998-Quantization
See [GN98].
References
- GKD+22
- Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022. URL: https://arxiv.org/pdf/2103.13630. 
- GN98
- Robert M. Gray and David L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=720541.
- MNA+17
- Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and others. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017. URL: https://openreview.net/pdf?id=r1gs9JgRZ. 
- VSM+11
- Vincent Vanhoucke, Andrew Senior, Mark Z Mao, and others. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, 4. 2011. URL: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37631.pdf.
- WJZ+20
- Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020. URL: https://arxiv.org/pdf/2004.09602.