Internals

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/core/QuantizerBase.h defines the base class Quantizer.

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/quantized/Quantizer.h defines the subclasses of Quantizer, such as

  • PerTensorAffineQuantizer - qscheme is kPerTensorAffine.

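A quantizer is usually created through the factory functions declared in that header. The following minimal sketch, assuming it is compiled and linked against libtorch, builds a PerTensorAffineQuantizer with make_per_tensor_affine_quantizer and inspects it:

  #include "ATen/quantized/Quantizer.h"
  #include "torch/script.h"

  int main() {
    // make_per_tensor_affine_quantizer returns a QuantizerPtr, i.e. an
    // intrusive_ptr<Quantizer> pointing at a PerTensorAffineQuantizer.
    at::QuantizerPtr q = at::make_per_tensor_affine_quantizer(
        /*scale=*/0.1, /*zero_point=*/10, at::kQUInt8);
    TORCH_CHECK(q->qscheme() == at::kPerTensorAffine);
    return 0;
  }
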
QScheme

See https://github.com/pytorch/pytorch/blob/master/c10/core/QScheme.h

./code/qscheme/main.cc
#include "torch/script.h"

static void TestQScheme() {
  TORCH_CHECK(torch::toString(torch::kPerTensorAffine) == "per_tensor_affine");

  TORCH_CHECK(torch::toString(torch::kPerChannelAffine) ==
              "per_channel_affine");

  TORCH_CHECK(torch::toString(torch::kPerTensorSymmetric) ==
              "per_tensor_symmetric");

  TORCH_CHECK(torch::toString(torch::kPerChannelSymmetric) ==
              "per_channel_symmetric");

  TORCH_CHECK(torch::toString(torch::kPerChannelAffineFloatQParams) ==
              "per_channel_affine_float_qparams");
}

int main() {
  TestQScheme();
  return 0;
}

PerTensorAffineQuantizer

It has four important methods (see the sketch after this list):

  • QScheme qscheme() const, which always returns kPerTensorAffine.

  • double scale() const

  • int64_t zero_point() const

  • ScalarType scalar_type() const

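These accessors are reflected on quantized tensors through the wrappers Tensor::qscheme(), Tensor::q_scale(), Tensor::q_zero_point(), and Tensor::scalar_type(). A minimal sketch exercising them:

  #include "torch/script.h"

  int main() {
    torch::Tensor x = torch::randn({2, 3});
    // quantize_per_tensor attaches a PerTensorAffineQuantizer to the result.
    torch::Tensor q = torch::quantize_per_tensor(
        x, /*scale=*/0.1, /*zero_point=*/10, torch::kQUInt8);

    TORCH_CHECK(q.qscheme() == torch::kPerTensorAffine);
    TORCH_CHECK(q.q_scale() == 0.1);
    TORCH_CHECK(q.q_zero_point() == 10);
    TORCH_CHECK(q.scalar_type() == torch::kQUInt8);
    return 0;
  }
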
When FBGEMM is available, it uses quantize_tensor_per_tensor_affine_cpu.

Otherwise, it uses https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp#L3533, which dispatches as follows:

  • For ARM, it uses quantize_tensor_arm, a template with many specializations.

  • For x86, it uses quantize_val:

    • If FBGEMM is available, it uses the FBGEMM implementation of quantize_val.

    • Otherwise, it uses the native implementation at https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/AffineQuantizerBase.cpp#L100:

      template <typename T>
      T quantize_val(double scale, int64_t zero_point, float value) {
        // std::nearbyint results in nearest integer value according to the current
        // rounding mode and the default rounding mode is rounds to even in half-way
        // cases in most popular processor architectures like x86 and ARM. This is
        // typically faster than an alternatives like std::round that rounds half-way
        // cases away from zero, and can be consistent with SIMD implementations for
        // example in x86 using _mm512_cvtps_epi32 or mm512_round_ps with
        // _MM_FROUND_CUR_DIRECTION option that also follow the current rounding mode.
        int64_t qvalue;
        constexpr int64_t qmin = std::numeric_limits<typename T::underlying>::min();
        constexpr int64_t qmax = std::numeric_limits<typename T::underlying>::max();
        float inv_scale = 1.0f / static_cast<float>(scale);
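        // Round is a helper defined earlier in AffineQuantizerBase.cpp that
        // forwards to std::nearbyint.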
        qvalue = static_cast<int64_t>(zero_point + Round(value * inv_scale));
        qvalue = std::max<int64_t>(qvalue, qmin);
        qvalue = std::min<int64_t>(qvalue, qmax);
        return static_cast<T>(qvalue);
      }
      

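To see the formula end to end, here is a standalone sketch that reproduces the arithmetic above for quint8 (so qmin = 0 and qmax = 255) without any PyTorch dependency; QuantizeQUInt8 is a hypothetical helper written only for this illustration:

  #include <algorithm>
  #include <cassert>
  #include <cmath>
  #include <cstdint>

  // Re-implementation of quantize_val's arithmetic for quint8
  // (qmin = 0, qmax = 255), for illustration only.
  static std::uint8_t QuantizeQUInt8(double scale, std::int64_t zero_point,
                                     float value) {
    float inv_scale = 1.0f / static_cast<float>(scale);
    std::int64_t q = zero_point +
        static_cast<std::int64_t>(std::nearbyint(value * inv_scale));
    q = std::max<std::int64_t>(q, 0);
    q = std::min<std::int64_t>(q, 255);
    return static_cast<std::uint8_t>(q);
  }

  int main() {
    // 0.25 / 0.1 = 2.5; std::nearbyint rounds ties to even under the default
    // rounding mode, giving 2, so the result is 10 + 2 = 12.
    assert(QuantizeQUInt8(0.1, 10, 0.25f) == 12);
    // Values far below the representable range clamp to qmin = 0.
    assert(QuantizeQUInt8(0.1, 10, -100.0f) == 0);
    // Values far above the representable range clamp to qmax = 255.
    assert(QuantizeQUInt8(0.1, 10, 100.0f) == 255);
    return 0;
  }
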
dequantize_val, defined in the same file, is:

template <typename T>
TORCH_API float dequantize_val(double scale, int64_t zero_point, T value) {
  return static_cast<float>(scale) * (value.val_ - static_cast<int32_t>(zero_point));
}
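
Combining the two, a quantize/dequantize round trip recovers the input only up to the rounding above. A minimal sketch using the tensor-level wrappers quantize_per_tensor and Tensor::dequantize:

  #include <cmath>

  #include "torch/script.h"

  int main() {
    torch::Tensor x = torch::tensor({0.25f});
    torch::Tensor q = torch::quantize_per_tensor(
        x, /*scale=*/0.1, /*zero_point=*/10, torch::kQUInt8);
    // quantize_val: 0.25 / 0.1 = 2.5, which rounds to 2 (ties to even),
    // so the stored value is 10 + 2 = 12.
    // dequantize_val: 0.1 * (12 - 10) = 0.2, not the original 0.25.
    torch::Tensor y = q.dequantize();
    TORCH_CHECK(std::abs(y.item<float>() - 0.2f) < 1e-6f);
    return 0;
  }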