Quantization (Beta)
Qualcomm® AI Hub enables converting floating point models to fixed point in a process called quantization. Fixed point representations map real numbers to integers (e.g., int8, int16), allowing faster and more memory-efficient inference. Quantized models can be compiled to all supported target runtimes on Qualcomm® AI Hub and can have up to a 3x improvement in performance. The Snapdragon® Hexagon Tensor Processor performs best with quantized operations.
To capture these performance improvements while retaining model accuracy, quantized models need to be calibrated with unlabeled sample input data. Calibration is the process of determining the fixed point mapping (scales and zero points) between floating point and its quantized integer representation. With an unquantized source model and calibration data, Qualcomm® AI Hub produces a quantized model asset that can be compiled to run on device.
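To make the idea concrete, the sketch below shows the basic asymmetric quantization math for a single tensor: a scale and zero point are derived from the observed range of sample values, then used to map floats to int8 and back. This is only an illustration of the concept, not how Qualcomm® AI Hub computes its quantization parameters internally; the sample data and int8 range are placeholders.
import numpy as np
# Toy calibration for one tensor: derive an int8 scale and zero point
# from the observed min/max of placeholder sample data.
samples = np.random.randn(1000).astype(np.float32) * 3.0
qmin, qmax = -128, 127  # int8 range
rmin, rmax = float(samples.min()), float(samples.max())
scale = (rmax - rmin) / (qmax - qmin)
zero_point = int(round(qmin - rmin / scale))
# Quantize to int8, then dequantize back to an approximate float value.
quantized = np.clip(np.round(samples / scale) + zero_point, qmin, qmax).astype(np.int8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale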
Please note that quantization in Qualcomm® AI Hub is still in Beta. We are actively developing this feature to be more robust across different types of models. If you encounter issues, please reach out on our Slack channel.
Overview
The Qualcomm® AI Hub quantize job takes an unquantized ONNX model as input and produces a quantized ONNX model as output. It can be used even when the source model is PyTorch and the deployment target is TensorFlow Lite or Qualcomm® AI Engine Direct, by building an end-to-end workflow that combines compile jobs with the quantize job. We will walk through the following example:
- Preparing the source model
  - Load the PyTorch model.
  - Trace the PyTorch model to TorchScript format.
  - Call submit_compile_job() to compile to ONNX.
- Quantizing an ONNX model
  - Load and pre-process the calibration data.
  - Call submit_quantize_job() to quantize the model.
- Compiling a quantized model
  - Call submit_compile_job() to compile to TensorFlow Lite.
Preparing the source model
The first step is to trace the model and compile it to ONNX. We recommend compiling it to ONNX even if the source model is already ONNX, since it allows the compiler to run optimization passes prior to quantization. This will ensure that un-optimized patterns that may otherwise cause issues during quantization are addressed.
This step is done using a call to submit_compile_job() with the option --target_runtime onnx. Please refer to Compiling Models for more information.
import os
import numpy as np
import torch
import torchvision
from PIL import Image
import qai_hub as hub
# 1. Load pre-trained PyTorch model from torchvision
torch_model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").eval()
# 2. Trace the model to TorchScript format
input_shape = (1, 3, 224, 224)
pt_model = torch.jit.trace(torch_model, torch.rand(input_shape))
# 3. Compile the model to ONNX
device = hub.Device("Samsung Galaxy S24 (Family)")
compile_onnx_job = hub.submit_compile_job(
    model=pt_model,
    device=device,
    input_specs=dict(image_tensor=input_shape),
    options="--target_runtime onnx",
)
assert isinstance(compile_onnx_job, hub.CompileJob)
unquantized_onnx_model = compile_onnx_job.get_target_model()
assert isinstance(unquantized_onnx_model, hub.Model)
Quantizing an ONNX model
The function submit_quantize_job()
can be used to quantize an
ONNX model. The function takes an ONNX model and calibration data as input,
quantizes the model, and returns the quantized ONNX model.
The resulting quantized model is in ONNX “fake quantization” format. In this quantization representation, ops technically have floating point inputs/outputs, and quantization bottlenecks are represented separately with QuantizeLinear/DequantizeLinear pairs. This is similar to the static ONNX QDQ format, except that weights are still stored as floating point followed by QuantizeLinear. Note that this is the only ONNX quantization format that Qualcomm® AI Hub officially supports as input to compile jobs. A quick way to inspect this structure is shown after the quantization example below.
For calibration data we will use imagenette_samples.zip. Download the file and unzip it in your local directory before running the code below. This tutorial uses 100 samples. In general, we recommend using 500-1000 samples.
In this example, which is a continuation of the example above, we choose to quantize to 8-bit integer weights and 8-bit integer activations (i.e., w8a8).
# 4. Load and pre-process downloaded calibration data
# This transform is required for PyTorch imagenet classifiers
# Source: https://pytorch.org/hub/pytorch_vision_resnet/
mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
std = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
sample_inputs = []
images_dir = "imagenette_samples/images"
for image_path in os.listdir(images_dir):
    image = Image.open(os.path.join(images_dir, image_path))
    image = image.convert("RGB").resize(input_shape[2:])
    sample_input = np.array(image).astype(np.float32) / 255.0
    sample_input = np.expand_dims(np.transpose(sample_input, (2, 0, 1)), 0)
    sample_inputs.append(((sample_input - mean) / std).astype(np.float32))
calibration_data = dict(image_tensor=sample_inputs)
# 5. Quantize the model
quantize_job = hub.submit_quantize_job(
    model=unquantized_onnx_model,
    calibration_data=calibration_data,
    weights_dtype=hub.QuantizeDtype.INT8,
    activations_dtype=hub.QuantizeDtype.INT8,
)
quantized_onnx_model = quantize_job.get_target_model()
assert isinstance(quantized_onnx_model, hub.Model)
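If you want to sanity-check the fake-quantization structure described above, you can count the op types in the exported graph with the onnx package. This is only a sketch: it assumes the quantized model has already been downloaded locally, and the file name quantized_model.onnx is a placeholder.
import onnx
# Load the locally downloaded quantized model (placeholder path) and
# tally the op types that appear in its graph.
onnx_model = onnx.load("quantized_model.onnx")
op_counts = {}
for node in onnx_model.graph.node:
    op_counts[node.op_type] = op_counts.get(node.op_type, 0) + 1
# A fake-quantized graph contains QuantizeLinear/DequantizeLinear pairs at the
# quantization bottlenecks, while compute ops keep floating point inputs/outputs.
print(op_counts.get("QuantizeLinear", 0), op_counts.get("DequantizeLinear", 0))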
Compiling a quantized model
The quantized ONNX model can be compiled further to TensorFlow Lite or Qualcomm® AI Engine Direct. The quantized ops in the ONNX model will become quantized ops in the target runtime asset, ready to better utilize available hardware.
By default, the compiled model keeps its inputs and outputs in float32. This may add some
overhead on platforms that support both integer and floating point math, and it
can lead to more severe issues on platforms that do not support floating point
math at all. To remedy this, we can tell the compiler to respect the
quantization even at the IO boundary using the --quantize_io compile option
(see Compile Options). In this case, conversion to and from the integer types
needs to happen outside the model, in the integration code (a sketch of this is
shown after the compile example below).
# 6. Compile to target runtime (TFLite)
compile_tflite_job = hub.submit_compile_job(
    model=quantized_onnx_model,
    device=device,
    options="--target_runtime tflite --quantize_io",
)
assert isinstance(compile_tflite_job, hub.CompileJob)
Please refer to Compiling ONNX models to TensorFlow Lite or QNN for more information.
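As a rough illustration of the integration code mentioned above, the sketch below quantizes an input array to int8 and dequantizes an int8 output back to float32 using per-tensor scales and zero points. The parameter values are hypothetical placeholders; in practice you would read the actual quantization parameters of the compiled model's inputs and outputs from your runtime's tooling.
import numpy as np
# Hypothetical quantization parameters for the model's input and output tensors.
input_scale, input_zero_point = 0.0182, 12
output_scale, output_zero_point = 0.0039, 0
def quantize_input(x: np.ndarray) -> np.ndarray:
    # Map float32 values to int8 using the input's scale and zero point.
    q = np.round(x / input_scale) + input_zero_point
    return np.clip(q, -128, 127).astype(np.int8)
def dequantize_output(q: np.ndarray) -> np.ndarray:
    # Map int8 values back to float32 using the output's scale and zero point.
    return (q.astype(np.float32) - output_zero_point) * output_scale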
Quantization options
The table below shows which precisions are supported for each runtime.
| Runtime | Weights | Activations | Mixed Precision* |
|---|---|---|---|
| TFLite | int8 | int8 | Not supported |
| QNN | int8 | int8, int16 | Not supported |
| ONNX | int8 | int8, int16 | Not supported |
*Mixed precision allows running different ops with different precisions in the same network. In the long-term, all runtimes should support mixed precision in addition to int4, int8, and int16 for weights and activations.
Please refer to submit_quantize_job()
and Quantize Options for additional options.
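For example, the table above indicates that the QNN and ONNX runtimes support int16 activations. Continuing the earlier example, a quantize job targeting 8-bit weights and 16-bit activations (w8a16) might look like the sketch below; it assumes a hub.QuantizeDtype.INT16 value is available in your qai_hub client version.
# Sketch: quantize to 8-bit weights and 16-bit activations (w8a16).
# Assumes hub.QuantizeDtype.INT16 exists in your qai_hub client version.
quantize_w8a16_job = hub.submit_quantize_job(
    model=unquantized_onnx_model,
    calibration_data=calibration_data,
    weights_dtype=hub.QuantizeDtype.INT8,
    activations_dtype=hub.QuantizeDtype.INT16,
)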
Benchmarking quantized model performance
Sometimes it is useful to quickly benchmark the model latency before going through the process of getting sample input data. For this use case, the calibration data can be a single random sample. While the resulting quantized model’s accuracy will be poor, it will have the same on-device latency as an accurate model.
import numpy as np
import qai_hub as hub
device = hub.Device("Samsung Galaxy S24 (Family)")
calibration_data = dict(
    image_tensor=[np.random.randn(1, 3, 224, 224).astype(np.float32)]
)
# Convert the input ONNX to optimized ONNX, then quantize to ONNX QDQ format
compile_onnx_job = hub.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=device,
    input_specs=dict(image_tensor=(1, 3, 224, 224)),
    options="--target_runtime onnx",
)
assert isinstance(compile_onnx_job, hub.CompileJob)
unquantized_onnx_model = compile_onnx_job.get_target_model()
assert isinstance(unquantized_onnx_model, hub.Model)
quantize_job = hub.submit_quantize_job(
    model=unquantized_onnx_model,
    calibration_data=calibration_data,
    weights_dtype=hub.QuantizeDtype.INT8,
    activations_dtype=hub.QuantizeDtype.INT8,
)
assert isinstance(quantize_job, hub.QuantizeJob)
quantized_onnx_model = quantize_job.get_target_model()
assert isinstance(quantized_onnx_model, hub.Model)
# Model can be compiled to tflite, qnn, or onnx format
compile_qnn_job = hub.submit_compile_job(
    model=quantized_onnx_model,
    device=device,
    options="--target_runtime qnn_context_binary --quantize_io",
)
assert isinstance(compile_qnn_job, hub.CompileJob)
compiled_model = compile_qnn_job.get_target_model()
assert isinstance(compiled_model, hub.Model)
profile_job = hub.submit_profile_job(
    model=compiled_model,
    device=device,
)
assert isinstance(profile_job, hub.ProfileJob)
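Once the profile job completes, you can pull down its metrics to check latency. The sketch below assumes the client's download_profile() method returns the profiling results as a dictionary; the exact structure of that dictionary is not covered here, so we simply list its top-level keys.
# Retrieve the profiling results as a dictionary and inspect what it contains.
profile_results = profile_job.download_profile()
print(profile_results.keys())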