Quantization

TO READ

2020-Integer quantization for deep learning inference: Principles and empirical evaluation

See [WJZ+20].

This paper is from NVIDIA. It includes a table comparing the throughput of fp16, int8, int4, and int1 operations on NVIDIA Turing GPUs.

A worked example of the paper's two schemes, affine (asymmetric) and scale (symmetric) quantization:

#!/usr/bin/env python3


def test_asymmetric_quant():
    # We use the notation from
    # https://arxiv.org/pdf/2004.09602
    # equations 1 and 2:
    #
    #     f(x) = s*x + z
    #
    # It is called asymmetric because
    # float_low != -float_high
    # and
    # int_low != -int_high
    #
    # The paper also calls this affine quantization.

    float_low = -3
    float_high = 4

    int_low = -128
    int_high = 127

    s = (int_high - int_low) / (float_high - float_low)  # 36.4286
    # The zero point z must be an integer; z1 is its unrounded value.
    # Equation 2 writes the offset as -2**(b-1), which equals int_low here.
    z1 = -float_low * s + int_low  # -18.71
    z = -round(float_low * s) + int_low  # -19
    print(s, z1, z)

    def f(x):
        # Quantize: scale, round, shift, then clip to the integer range.
        return min(max(round(s * x) + z, int_low), int_high)

    print(f(float_low), f(float_high), f(1))  # -128, 127, 17


def test_symmetric_quant():
    # We use the notation from
    # https://arxiv.org/pdf/2004.09602
    # equations 1 and 2:
    #
    #     f(x) = s*x + z, where z is 0
    #
    # It is called symmetric because
    # float_low == -float_high
    # and
    # int_low == -int_high
    #
    # The paper calls this scale quantization.

    float_low = -4
    float_high = 4

    # Restricted integer range: -128 is unused, so that
    # int_low == -int_high.
    int_low = -127
    int_high = 127

    s = (int_high - int_low) / (float_high - float_low)  # 31.75
    z1 = -float_low * s + int_low  # 0.0
    z = -round(float_low * s) + int_low  # 0
    print(s, z1, z)  # 31.75 0.0 0

    def f(x):
        # Quantize: scale, round, shift, then clip to the integer range.
        return min(max(round(s * x) + z, int_low), int_high)

    print(f(float_low), f(float_high), f(1))  # -127, 127, 32


def main():
    print("---asymmetric---")
    test_asymmetric_quant()
    print("---symmetric---")
    test_symmetric_quant()


if __name__ == "__main__":
    main()
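
The paper also defines the inverse mapping, dequantization: x_hat = (x_q - z) / s for affine quantization. Below is a minimal round-trip sketch reusing the asymmetric parameters from above; the quantize/dequantize helper names are mine, not the paper's:

def test_round_trip():
    # Asymmetric (affine) parameters from test_asymmetric_quant above.
    float_low, float_high = -3, 4
    int_low, int_high = -128, 127

    s = (int_high - int_low) / (float_high - float_low)
    z = -round(float_low * s) + int_low

    def quantize(x):
        return min(max(round(s * x) + z, int_low), int_high)

    def dequantize(x_q):
        # Inverse of f(x) = s*x + z, ignoring the rounding.
        return (x_q - z) / s

    for x in (-3, -1.5, 0, 1, 2.5, 4):
        x_hat = dequantize(quantize(x))
        # For x inside [float_low, float_high] (no clipping), the
        # round-trip error is at most half a step: 1 / (2 * s).
        print(x, x_hat, abs(x - x_hat))


test_round_trip()

Running it shows every round-trip error staying below 1 / (2 * s), which is roughly 0.0137 for this range.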

2022-A survey of quantization methods for efficient neural network inference

See [GKD+22]

1998-Quantization

See [GN98].

References

GKD+22

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022. URL: https://arxiv.org/pdf/2103.13630.

GN98

Robert M. Gray and David L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=720541.

MNA+17

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and others. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017. URL: https://openreview.net/pdf?id=r1gs9JgRZ.

VSM+11

Vincent Vanhoucke, Andrew Senior, Mark Z Mao, and others. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011. URL: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37631.pdf.

WJZ+20

Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020. URL: https://arxiv.org/pdf/2004.09602.