Quantization (Beta)
Qualcomm® AI Hub enables converting floating point models to fixed point in a process called quantization. Fixed point representations map real numbers to integers (e.g., int8, int16), allowing faster and more memory-efficient inference. Quantized models can be compiled to all supported target runtimes on Qualcomm® AI Hub and can have up to a 3x improvement in performance. The Snapdragon® Hexagon Tensor Processor performs best with quantized operations.
To capture these performance improvements while retaining model accuracy, quantized models need to be calibrated with unlabeled sample input data. Calibration is the process of determining the fixed point mapping (scales and zero points) between floating point and its quantized integer representation. With an unquantized source model and calibration data, Qualcomm® AI Hub produces a quantized model asset that can be compiled to run on device.
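To make the idea concrete, the sketch below shows the basic asymmetric quantization math for a single tensor: a scale and zero point are derived from the observed range of sample values, then used to map floats to int8 and back. This is only an illustration of the concept, not how Qualcomm® AI Hub computes its quantization parameters internally; the sample data and int8 range are placeholders.
import numpy as np
# Toy calibration for one tensor: derive an int8 scale and zero point
# from the observed min/max of placeholder sample data.
samples = np.random.randn(1000).astype(np.float32) * 3.0
qmin, qmax = -128, 127  # int8 range
rmin, rmax = float(samples.min()), float(samples.max())
scale = (rmax - rmin) / (qmax - qmin)
zero_point = int(round(qmin - rmin / scale))
# Quantize to int8, then dequantize back to an approximate float value.
quantized = np.clip(np.round(samples / scale) + zero_point, qmin, qmax).astype(np.int8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale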
Please note that quantization in Qualcomm® AI Hub is still in Beta. We are actively developing this feature to be more robust across different types of models. If you encounter issues, please reach out on our Slack channel.
Overview
The Qualcomm® AI Hub quantize job takes an unquantized ONNX model as input and produces a quantized ONNX model as output. It can be used even when the source model is PyTorch and the deployment target is TensorFlow Lite or Qualcomm® AI Engine Direct, by building an end-to-end workflow that combines compile jobs with the quantize job. We will walk through the following example:
- Preparing the source model
  - Load the PyTorch model.
  - Trace the PyTorch model to TorchScript format.
  - Call submit_compile_job() to compile to ONNX.
- Quantizing an ONNX model
  - Load and pre-process the calibration data.
  - Call submit_quantize_job() to quantize the model.
- Compiling a quantized model
  - Call submit_compile_job() to compile to TensorFlow Lite.
Preparing the source model
The first step is to trace the model and compile it to ONNX. We recommend compiling it to ONNX even if the source model is already ONNX, since it allows the compiler to run optimization passes prior to quantization. This will ensure that un-optimized patterns that may otherwise cause issues during quantization are addressed.
This step is done using a call to submit_compile_job() with the option --target_runtime onnx. Please refer to Compiling Models for more information.
import os
import numpy as np
import torch
import torchvision
from PIL import Image
import qai_hub as hub
# 1. Load pre-trained PyTorch model from torchvision
torch_model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").eval()
# 2. Trace the model to TorchScript format
input_shape = (1, 3, 224, 224)
pt_model = torch.jit.trace(torch_model, torch.rand(input_shape))
# 3. Compile the model to ONNX
device = hub.Device("Samsung Galaxy S24 (Family)")
compile_onnx_job = hub.submit_compile_job(
    model=pt_model,
    device=device,
    input_specs=dict(image_tensor=input_shape),
    options="--target_runtime onnx",
)
assert isinstance(compile_onnx_job, hub.CompileJob)
unquantized_onnx_model = compile_onnx_job.get_target_model()
assert isinstance(unquantized_onnx_model, hub.Model)
Quantizing an ONNX model
The function submit_quantize_job()
can be used to quantize an
ONNX model. The function takes an ONNX model and calibration data as input,
quantizes the model, and returns the quantized ONNX model.
The resulting quantized model is in ONNX “fake quantization” format. In this quantization representation, ops technically have floating point inputs/outputs, and quantization bottlenecks are represented separately with QuantizeLinear/DequantizeLinear pairs. This is similar to the static ONNX QDQ format, except that weights are still stored as floating point followed by QuantizeLinear. Note that this is the only ONNX quantization format that Qualcomm® AI Hub officially supports as input to compile jobs. A quick way to inspect this structure is shown after the quantization example below.
For calibration data we will use imagenette_samples.zip. Download the file and unzip it in your local directory before running the code below. This tutorial uses 100 samples. In general, we recommend using 500-1000 samples.
In this example, which is a continuation of the example above, we choose to quantize to 8-bit integer weights and 8-bit integer activations (i.e., w8a8).
# 4. Load and pre-process downloaded calibration data
# This transform is required for PyTorch imagenet classifiers
# Source: https://pytorch.org/hub/pytorch_vision_resnet/
mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
std = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
sample_inputs = []
images_dir = "imagenette_samples/images"
for image_path in os.listdir(images_dir):
    image = Image.open(os.path.join(images_dir, image_path))
    image = image.convert("RGB").resize(input_shape[2:])
    sample_input = np.array(image).astype(np.float32) / 255.0
    sample_input = np.expand_dims(np.transpose(sample_input, (2, 0, 1)), 0)
    sample_inputs.append(((sample_input - mean) / std).astype(np.float32))
calibration_data = dict(image_tensor=sample_inputs)
# 5. Quantize the model
quantize_job = hub.submit_quantize_job(
    model=unquantized_onnx_model,
    calibration_data=calibration_data,
    weights_dtype=hub.QuantizeDtype.INT8,
    activations_dtype=hub.QuantizeDtype.INT8,
)
quantized_onnx_model = quantize_job.get_target_model()
assert isinstance(quantized_onnx_model, hub.Model)
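If you want to sanity-check the fake-quantization structure described above, you can count the op types in the exported graph with the onnx package. This is only a sketch: it assumes the quantized model has already been downloaded locally, and the file name quantized_model.onnx is a placeholder.
import onnx
# Load the locally downloaded quantized model (placeholder path) and
# tally the op types that appear in its graph.
onnx_model = onnx.load("quantized_model.onnx")
op_counts = {}
for node in onnx_model.graph.node:
    op_counts[node.op_type] = op_counts.get(node.op_type, 0) + 1
# A fake-quantized graph contains QuantizeLinear/DequantizeLinear pairs at the
# quantization bottlenecks, while compute ops keep floating point inputs/outputs.
print(op_counts.get("QuantizeLinear", 0), op_counts.get("DequantizeLinear", 0))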
Compiling a quantized model
The quantized ONNX model can be compiled further to TensorFlow Lite or Qualcomm® AI Engine Direct. The quantized ops in the ONNX model will become quantized ops in the target runtime asset, ready to better utilize available hardware.
By default, the compiled model keeps its inputs and outputs in float32. This may add some
overhead on platforms that support both integer and floating point math, and it
can lead to more severe issues on platforms that do not support floating point
math at all. To remedy this, we can tell the compiler to respect the
quantization even at the IO boundary using the --quantize_io compile option
(see Compile Options). In this case, conversion to and from the integer types
needs to happen outside the model, in the integration code (a sketch of this is
shown after the compile example below).
# 6. Compile to target runtime (TFLite)
compile_tflite_job = hub.submit_compile_job(
    model=quantized_onnx_model,
    device=device,
    options="--target_runtime tflite --quantize_io",
)
assert isinstance(compile_tflite_job, hub.CompileJob)
Please refer to Compiling ONNX models to TensorFlow Lite or QNN for more information.
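As a rough illustration of the integration code mentioned above, the sketch below quantizes an input array to int8 and dequantizes an int8 output back to float32 using per-tensor scales and zero points. The parameter values are hypothetical placeholders; in practice you would read the actual quantization parameters of the compiled model's inputs and outputs from your runtime's tooling.
import numpy as np
# Hypothetical quantization parameters for the model's input and output tensors.
input_scale, input_zero_point = 0.0182, 12
output_scale, output_zero_point = 0.0039, 0
def quantize_input(x: np.ndarray) -> np.ndarray:
    # Map float32 values to int8 using the input's scale and zero point.
    q = np.round(x / input_scale) + input_zero_point
    return np.clip(q, -128, 127).astype(np.int8)
def dequantize_output(q: np.ndarray) -> np.ndarray:
    # Map int8 values back to float32 using the output's scale and zero point.
    return (q.astype(np.float32) - output_zero_point) * output_scale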
Quantization options
The table below shows which precisions are supported for each runtime.
| Runtime | Weights | Activations | Mixed Precision* |
|---|---|---|---|
| TFLite | int8 | int8 | Not supported |
| QNN | int8 | int8, int16 | Not supported |
| ONNX | int8 | int8, int16 | Not supported |
*Mixed precision allows running different ops with different precisions in the same network. In the long-term, all runtimes should support mixed precision in addition to int4, int8, and int16 for weights and activations.
Please refer to submit_quantize_job()
and Quantize Options for additional options.
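For example, the table above indicates that the QNN and ONNX runtimes support int16 activations. Continuing the earlier example, a quantize job targeting 8-bit weights and 16-bit activations (w8a16) might look like the sketch below; it assumes a hub.QuantizeDtype.INT16 value is available in your qai_hub client version.
# Sketch: quantize to 8-bit weights and 16-bit activations (w8a16).
# Assumes hub.QuantizeDtype.INT16 exists in your qai_hub client version.
quantize_w8a16_job = hub.submit_quantize_job(
    model=unquantized_onnx_model,
    calibration_data=calibration_data,
    weights_dtype=hub.QuantizeDtype.INT8,
    activations_dtype=hub.QuantizeDtype.INT16,
)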
Benchmarking quantized model performance
Sometimes it is useful to quickly benchmark the model latency before going through the process of getting sample input data. For this use case, the calibration data can be a single random sample. While the resulting quantized model’s accuracy will be poor, it will have the same on-device latency as an accurate model.
import numpy as np
import qai_hub as hub
device = hub.Device("Samsung Galaxy S24 (Family)")
calibration_data = dict(
    image_tensor=[np.random.randn(1, 3, 224, 224).astype(np.float32)]
)
# Convert the input ONNX to optimized ONNX, then quantize to ONNX QDQ format
compile_onnx_job = hub.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=device,
    input_specs=dict(image_tensor=(1, 3, 224, 224)),
    options="--target_runtime onnx",
)
assert isinstance(compile_onnx_job, hub.CompileJob)
unquantized_onnx_model = compile_onnx_job.get_target_model()
assert isinstance(unquantized_onnx_model, hub.Model)
quantize_job = hub.submit_quantize_job(
    model=unquantized_onnx_model,
    calibration_data=calibration_data,
    weights_dtype=hub.QuantizeDtype.INT8,
    activations_dtype=hub.QuantizeDtype.INT8,
)
assert isinstance(quantize_job, hub.QuantizeJob)
quantized_onnx_model = quantize_job.get_target_model()
assert isinstance(quantized_onnx_model, hub.Model)
# Model can be compiled to tflite, qnn, or onnx format
compile_qnn_job = hub.submit_compile_job(
    model=quantized_onnx_model,
    device=device,
    options="--target_runtime qnn_context_binary --quantize_io",
)
assert isinstance(compile_qnn_job, hub.CompileJob)
compiled_model = compile_qnn_job.get_target_model()
assert isinstance(compiled_model, hub.Model)
profile_job = hub.submit_profile_job(
    model=compiled_model,
    device=device,
)
assert isinstance(profile_job, hub.ProfileJob)
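Once the profile job completes, you can pull down its metrics to check latency. The sketch below assumes the client's download_profile() method returns the profiling results as a dictionary; the exact structure of that dictionary is not covered here, so we simply list its top-level keys.
# Retrieve the profiling results as a dictionary and inspect what it contains.
profile_results = profile_job.download_profile()
print(profile_results.keys())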