Compiling Models

Qualcomm® AI Hub supports the compilation of models trained using:

  • PyTorch

  • ONNX

  • AI Model Efficiency Toolkit (AIMET) quantized models

  • TensorFlow (via ONNX)

Any of the above models can be compiled for the following target runtimes:

  • TensorFlow Lite

  • QNN model library

  • QNN context binary

  • ONNX Runtime

  • Precompiled QNN ONNX (a QNN context binary embedded within an ONNX model)

Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. See Qualcomm® AI Engine Direct Options.

Compiling PyTorch to TensorFlow Lite

To compile a PyTorch model, first generate a TorchScript model in memory using PyTorch's torch.jit.trace method. Once the model is traced, you can compile it using the submit_compile_job() API.

TensorFlow Lite models can run on the CPU, GPU (using GPU delegation), or the NPU (using QNN delegation).

from typing import Tuple

import torch
import torchvision

import qai_hub as hub

# Using pre-trained MobileNet
torch_model = torchvision.models.mobilenet_v2(pretrained=True)
torch_model.eval()

# Trace model
input_shape: Tuple[int, ...] = (1, 3, 224, 224)
example_input = torch.rand(input_shape)
pt_model = torch.jit.trace(torch_model, example_input)

# Compile model on a specific device
compile_job = hub.submit_compile_job(
    pt_model,
    name="MyMobileNet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=input_shape),
)

assert isinstance(compile_job, hub.CompileJob)
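
To tie the delegation options above to a concrete call, the sketch below profiles the compiled model while restricting execution to the NPU. The --compute_unit option is assumed to be applicable here; treat this as an illustrative sketch rather than a required step.

import qai_hub as hub

# Sketch: profile the compiled TensorFlow Lite model on the NPU (QNN delegation).
# The --compute_unit option is an assumption in this sketch; omit it to let the
# profiler choose compute units automatically.
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--compute_unit npu",
)
assert isinstance(profile_job, hub.ProfileJob)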

If you already have a traced or scripted TorchScript model saved with torch.jit.save, you can submit it directly. The example below uses mobilenet_v2.pt and also profiles the compiled model:

import qai_hub as hub

# Compile a model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

# Profile the compiled model
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)
assert isinstance(profile_job, hub.ProfileJob)
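
Once the compile job succeeds, the compiled asset can be pulled down for deployment. This is a minimal sketch; the output filename below is illustrative.

# Download the compiled TensorFlow Lite model to a local file.
# The filename is illustrative; choose any path you like.
target_model = compile_job.get_target_model()
assert target_model is not None
target_model.download("mobilenet_v2.tflite")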

Compiling PyTorch model to a QNN Model Library

Qualcomm® AI Hub supports compiling and profiling a PyTorch model as a QNN model library. In this example, we will use mobilenet_v2.pt and compile it to a QNN Model Library (.so file) for the ARM64 Android platform (aarch64_android).

The model library is an operating system-specific deployment mechanism that is SOC agnostic. Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. This means a model compiled with one version of the SDK is not guaranteed to run on another version of the SDK. See Qualcomm® AI Engine Direct Options.

import qai_hub as hub

# Compile a model to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_lib_aarch64_android",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. Take a look at this example to learn how to profile this model on the Snapdragon® Neural Processing Unit (NPU).
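
As a minimal sketch of that workflow, the compiled model library can be profiled on a device in the same family:

# Profile the compiled QNN model library on a physical device.
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
assert isinstance(profile_job, hub.ProfileJob)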

Compiling PyTorch model to a QNN Context Binary

Qualcomm® AI Hub supports compiling and profiling a PyTorch model to a QNN context binary. In this example, we will use mobilenet_v2.pt and compile it to a QNN context binary optimized to run on a specific device. Since context binaries are optimized for specific target hardware, a model can only be compiled for a single device.

The context binary is an SOC-specific deployment mechanism. When compiled for a device, it is expected that the model will be deployed to the same device. The format is operating system agnostic so the same model can be deployed on Android, Linux, or Windows. The context binary is designed only for the NPU.

import qai_hub as hub

# Compile a model to QNN context binary
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_context_binary",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. Take a look at this example to learn how to profile this model on the Snapdragon® Neural Processing Unit (NPU).
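
As a hedged sketch, the compiled context binary can also be exercised end to end with an inference job; the random input below is purely illustrative.

import numpy as np

# Run inference on the compiled context binary with random data.
# The input name "image" matches the input_specs used at compile time.
sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    inputs=dict(image=[sample]),
)
assert isinstance(inference_job, hub.InferenceJob)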

The QNN context binary can also be embedded within an ONNX model.

Compiling Precompiled QNN ONNX

Qualcomm® AI Hub supports compiling and profiling a pre-compiled ONNX Runtime model. This is an ONNX Runtime-compatible model that contains a pre-compiled QNN context binary and can be run with ONNX Runtime on Snapdragon devices. More details are documented here.

The advantages of using precompiled QNN ONNX:

  • Ease of deployment: Works across Android, Linux, or Windows.

  • Performance gain: Equivalent to QNN context binary.

  • Simple inference code: ONNX Runtime uses QNN Execution Provider to run inference on the compiled model.

  • Large Models: Works for large models (>1GB) like LLMs, Stable Diffusion, etc.

Please note that the QNN context binary is operating system agnostic, but device specific. Additionally, context binaries are designed only for the NPU. In this example, let us assume we want to target the Snapdragon® 8 Elite:

import qai_hub as hub

# Compile a model to a precompiled QNN ONNX model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Snapdragon 8 Elite QRD"),
    options="--target_runtime precompiled_qnn_onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The compiled model is a zipped directory (with the extension .onnx) containing an ONNX file and a QNN context binary file. If you upload a pre-compiled ONNX Runtime model that you compiled yourself, it should conform to the following folder structure:

<model>.onnx
   ├── <model>.onnx
   └── <model>.bin

Note that the ONNX model references the QNN context binary by a relative path, so be mindful of that reference if you rename or move the .bin file.
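
If you assemble such a directory yourself, it can be submitted to Qualcomm® AI Hub like any other model. The sketch below is illustrative; my_precompiled.onnx is a hypothetical placeholder for your own model directory.

import qai_hub as hub

# Profile a locally assembled precompiled QNN ONNX model.
# "my_precompiled.onnx" is a hypothetical directory name used only for illustration.
profile_job = hub.submit_profile_job(
    model="my_precompiled.onnx",
    device=hub.Device("Snapdragon 8 Elite QRD"),
)
assert isinstance(profile_job, hub.ProfileJob)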

Compiling PyTorch model for the ONNX Runtime

Qualcomm® AI Hub supports compiling a PyTorch model for the ONNX Runtime. In this example, we will use mobilenet_v2.pt and compile it to an ONNX model. This model can be profiled using the ONNX Runtime.

ONNX Runtime supports execution on the CPU, the GPU (using the DML execution provider), or the NPU (using the QNN execution provider):

import qai_hub as hub

# Compile a model to an ONNX model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

Compiling ONNX models to TensorFlow Lite or QNN

Qualcomm® AI Hub also supports the compilation of ONNX models to TensorFlow Lite or a QNN model library. We will use mobilenet_v2.onnx as an example.

import qai_hub as hub

# Compile a model to TensorFlow Lite
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
assert isinstance(compile_job, hub.CompileJob)

# Compile a model to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_lib_aarch64_android",
)
assert isinstance(compile_job, hub.CompileJob)

Note that the ONNX model may be unquantized (as in the above example) or it may be quantized (as we will see in Quantization (Beta)). If the source model is quantized, the quantization parameters will be respected to produce a quantized deployable asset.

Compiling models quantized with AIMET to TensorFlow Lite or QNN

AI Model Efficiency Toolkit (AIMET) is an open-source library that provides advanced model quantization and compression techniques for trained neural network models. AIMET’s QuantizationSimModel can be exported to one of the following:

  • Recommended: ONNX models (.onnx) and an encodings file (.encodings) with the quantization parameters.

  • TorchScript (.pt) and an encodings file (.encodings) with the quantization parameters.

To use these models, create a directory with .aimet in the name. It should contain one .pt or .onnx model and the corresponding encodings file,

<model>.aimet
   ├── <model>.onnx (or <model>.pt)
   └── <model>.encodings

where <model> can be any name.

Let’s use mobilenet_v2_onnx.aimet.zip as an example. After unzipping it to a mobilenet_v2_onnx.aimet directory, we can submit a compile job:

import qai_hub as hub

# Compile to TensorFlow Lite
compile_job = hub.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)
assert isinstance(compile_job, hub.CompileJob)

# Compile to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_lib_aarch64_android --quantize_full_type int8",
)
assert isinstance(compile_job, hub.CompileJob)