Compiling Models

Qualcomm® AI Hub supports the compilation of models trained using:

  • PyTorch

  • ONNX

  • AI Model Efficiency Toolkit (AIMET) quantized models

  • TensorFlow (via ONNX)

Any of the above models can be compiled for the following target runtimes:

  • TensorFlow Lite

  • QNN model library

  • QNN context binary

  • ONNX Runtime

  • Precompiled QNN ONNX (a QNN context binary embedded within an ONNX model)

Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. See Qualcomm® AI Engine Direct Options.

Compiling PyTorch to TensorFlow Lite

To compile a PyTorch model, first generate a TorchScript model in memory using PyTorch's torch.jit.trace method. Once the model is traced, you can compile it using the submit_compile_job() API.

TensorFlow Lite models can run on the CPU, GPU (using GPU delegation), or the NPU (using QNN delegation).

from typing import Tuple

import torch
import torchvision

import qai_hub as hub

# Using pre-trained MobileNet
torch_model = torchvision.models.mobilenet_v2(pretrained=True)
torch_model.eval()

# Trace model
input_shape: Tuple[int, ...] = (1, 3, 224, 224)
example_input = torch.rand(input_shape)
pt_model = torch.jit.trace(torch_model, example_input)

# Compile model on a specific device
compile_job = hub.submit_compile_job(
    pt_model,
    name="MyMobileNet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=input_shape),
)

assert isinstance(compile_job, hub.CompileJob)
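
To tie the delegation options above to a concrete call, the sketch below profiles the compiled model while restricting execution to the NPU. The --compute_unit option is assumed to be applicable here; treat this as an illustrative sketch rather than a required step.

import qai_hub as hub

# Sketch: profile the compiled TensorFlow Lite model on the NPU (QNN delegation).
# The --compute_unit option is an assumption in this sketch; omit it to let the
# profiler choose compute units automatically.
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--compute_unit npu",
)
assert isinstance(profile_job, hub.ProfileJob)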

If you already have a traced or scripted TorchScript model saved with torch.jit.save, you can submit it directly. The example below uses mobilenet_v2.pt and also profiles the compiled model:

import qai_hub as hub

# Compile a model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

# Profile the compiled model
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)
assert isinstance(profile_job, hub.ProfileJob)
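
Once the compile job succeeds, the compiled asset can be pulled down for deployment. This is a minimal sketch; the output filename below is illustrative.

# Download the compiled TensorFlow Lite model to a local file.
# The filename is illustrative; choose any path you like.
target_model = compile_job.get_target_model()
assert target_model is not None
target_model.download("mobilenet_v2.tflite")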

Compiling PyTorch model to a QNN Model Library

Qualcomm® AI Hub supports compiling and profiling a PyTorch model as a QNN model library. In this example, we will use mobilenet_v2.pt and compile it to a QNN Model Library (.so file) for the ARM64 Android platform (aarch64_android).

The model library is an operating system-specific deployment mechanism that is SOC agnostic. Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. This means a model compiled with one version of the SDK is not guaranteed to run on another version of the SDK. See Qualcomm® AI Engine Direct Options.

import qai_hub as hub

# Compile a model to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_lib_aarch64_android",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. Take a look at this example to learn how to profile this model on the Snapdragon® Neural Processing Unit (NPU).
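
As a minimal sketch of that workflow, the compiled model library can be profiled on a device in the same family:

# Profile the compiled QNN model library on a physical device.
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
assert isinstance(profile_job, hub.ProfileJob)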

Compiling PyTorch model to a QNN Context Binary

Qualcomm® AI Hub supports compiling and profiling a PyTorch model to a QNN context binary. In this example, we will use mobilenet_v2.pt and compile it to a QNN context binary optimized to run on a specific device. Since context binaries are optimized for specific target hardware, a model can only be compiled for a single device.

The context binary is an SOC-specific deployment mechanism. When compiled for a device, it is expected that the model will be deployed to the same device. The format is operating system agnostic so the same model can be deployed on Android, Linux, or Windows. The context binary is designed only for the NPU.

import qai_hub as hub

# Compile a model to QNN context binary
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_context_binary",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. Take a look at this example to learn how to profile this model on the Snapdragon® Neural Processing Unit (NPU).
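
As a hedged sketch, the compiled context binary can also be exercised end to end with an inference job; the random input below is purely illustrative.

import numpy as np

# Run inference on the compiled context binary with random data.
# The input name "image" matches the input_specs used at compile time.
sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    inputs=dict(image=[sample]),
)
assert isinstance(inference_job, hub.InferenceJob)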

The QNN context binary can also be embedded within an ONNX model.

Compiling Precompiled QNN ONNX

Qualcomm® AI Hub supports compiling and profiling a pre-compiled ONNX Runtime model. This is an ONNX Runtime-compatible model that contains a pre-compiled QNN context binary and can be run with ONNX Runtime on Snapdragon devices. More details are documented here.

The advantages of using precompiled QNN ONNX:

  • Ease of deployment: Works across Android, Linux, or Windows.

  • Performance gain: Equivalent to QNN context binary.

  • Simple inference code: ONNX Runtime uses QNN Execution Provider to run inference on the compiled model.

  • Large Models: Works for large models (>1GB) like LLMs, Stable Diffusion, etc.

Please note that the QNN context binary is operating system agnostic, but device specific. Additionally, context binaries are designed only for the NPU. In this example, let us assume we want to target the Snapdragon® 8 Elite:

import qai_hub as hub

# Compile a model to a precompiled QNN ONNX model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Snapdragon 8 Elite QRD"),
    options="--target_runtime precompiled_qnn_onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The compiled model is a zipped directory (with the extension .onnx) containing an ONNX file and a QNN context binary file. If you upload a pre-compiled ONNX Runtime model that you compiled yourself, it should conform to the following folder structure:

<model>.onnx
   ├── <model>.onnx
   └── <model>.bin

Note that the ONNX model references the QNN context binary by a relative path, so be mindful of that reference if you rename or move the .bin file.
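
If you assemble such a directory yourself, it can be submitted to Qualcomm® AI Hub like any other model. The sketch below is illustrative; my_precompiled.onnx is a hypothetical placeholder for your own model directory.

import qai_hub as hub

# Profile a locally assembled precompiled QNN ONNX model.
# "my_precompiled.onnx" is a hypothetical directory name used only for illustration.
profile_job = hub.submit_profile_job(
    model="my_precompiled.onnx",
    device=hub.Device("Snapdragon 8 Elite QRD"),
)
assert isinstance(profile_job, hub.ProfileJob)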

Compiling PyTorch model for the ONNX Runtime

Qualcomm® AI Hub supports compiling a PyTorch model for the ONNX Runtime. In this example, we will use mobilenet_v2.pt and compile it to an ONNX model. This model can be profiled using the ONNX Runtime.

ONNX Runtime supports execution on the CPU, the GPU (using the DML execution provider), or the NPU (using the QNN execution provider):

import qai_hub as hub

# Compile a model to an ONNX model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

Compiling ONNX models to TensorFlow Lite or QNN

Qualcomm® AI Hub also supports the compilation of ONNX models to TensorFlow Lite or a QNN model library. We will use mobilenet_v2.onnx as an example.

import qai_hub as hub

# Compile a model to TensorFlow Lite
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
assert isinstance(compile_job, hub.CompileJob)

# Compile a model to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_lib_aarch64_android",
)
assert isinstance(compile_job, hub.CompileJob)

Note that the ONNX model may be unquantized (as in the above example) or it may be quantized (as we will see in Quantization (Beta)). If the source model is quantized, the quantization parameters will be respected to produce a quantized deployable asset.

Compiling models quantized with AIMET to TensorFlow Lite or QNN

AI Model Efficiency Toolkit (AIMET) is an open-source library that provides advanced model quantization and compression techniques for trained neural network models. AIMET’s QuantizationSimModel can be exported to one of the following:

  • Recommended: ONNX models (.onnx) and an encodings file (.encodings) with the quantization parameters.

  • TorchScript (.pt) and an encodings file (.encodings) with the quantization parameters.

To use these models, create a directory with .aimet in the name. It should contain one .pt or .onnx model and the corresponding encodings file,

<model>.aimet
   ├── <model>.onnx (or <model>.pt)
   └── <model>.encodings

where <model> can be any name.

Let’s use mobilenet_v2_onnx.aimet.zip as an example. After unzipping it to a mobilenet_v2_onnx.aimet directory, we can submit a compile job:

import qai_hub as hub

# Compile to TensorFlow Lite
compile_job = hub.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)
assert isinstance(compile_job, hub.CompileJob)

# Compile to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_lib_aarch64_android --quantize_full_type int8",
)
assert isinstance(compile_job, hub.CompileJob)