Compiling Models

Qualcomm® AI Hub supports the compilation of models trained using:

  • PyTorch

  • ONNX

  • AI Model Efficiency Toolkit (AIMET) quantized models

  • TensorFlow (via ONNX)

Any of the above models can be compiled for the following target runtimes:

  • TensorFlow Lite

  • QNN Model Library

  • QNN Context Binary

  • Precompiled QNN ONNX

  • ONNX Runtime

Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. See Qualcomm® AI Engine Direct Options.

Compiling PyTorch to TensorFlow Lite

To compile a PyTorch model, we must first generate a TorchScript model in memory using the jit.trace method in PyTorch. Once traced, you can compile the model using the submit_compile_job() API.

TensorFlow Lite models can run on the CPU, GPU (using GPU delegation), or the NPU (using QNN delegation).

from typing import Tuple

import torch
import torchvision

import qai_hub as hub

# Using pre-trained MobileNet
torch_model = torchvision.models.mobilenet_v2(pretrained=True)
torch_model.eval()

# Trace model
input_shape: Tuple[int, ...] = (1, 3, 224, 224)
example_input = torch.rand(input_shape)
pt_model = torch.jit.trace(torch_model, example_input)

# Compile model on a specific device
compile_job = hub.submit_compile_job(
    pt_model,
    name="MyMobileNet",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    input_specs=dict(image=input_shape),
)

assert isinstance(compile_job, hub.CompileJob)

If you already have a saved traced or scripted torch model (saved with torch.jit.save), you can submit it directly. We will use mobilenet_v2.pt as an example.

import qai_hub as hub

# Compile a model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

# Profile the compiled model
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
assert isinstance(profile_job, hub.ProfileJob)
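
Beyond profiling, the compiled model can be run on a real device with concrete inputs. The following is a minimal sketch, assuming the submit_inference_job() API and the hub.InferenceJob class from the qai_hub client, and using a random tensor in place of a real image:

import numpy as np
import qai_hub as hub

# Run on-device inference with the compiled model using a random input.
# The input name "image" matches the input_specs used at compile time.
inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    inputs=dict(image=[np.random.rand(1, 3, 224, 224).astype(np.float32)]),
)
assert isinstance(inference_job, hub.InferenceJob)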

Compiling PyTorch model to a QNN Model Library

Qualcomm® AI Hub supports compiling and profiling a PyTorch model to a QNN model library. In this example, we will use mobilenet_v2.pt and compile it to a QNN Model Library (.so file) for the ARM64 Android platform (aarch64_android).

The model library is an operating system-specific deployment mechanism that is SOC-agnostic. Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. See Qualcomm® AI Engine Direct Options.

import qai_hub as hub

# Compile a model to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23"),
    options="--target_runtime qnn_lib_aarch64_android",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. Take a look at the example here to learn how to compile this model for the Snapdragon® Neural Processing Unit (NPU).
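
The compiled model library can then be pulled down for local deployment. A minimal sketch, assuming a Model.download() helper on the object returned by get_target_model():

# Download the compiled QNN model library (.so) for local deployment.
# Model.download() is assumed here; adjust to your qai_hub client version.
target_model = compile_job.get_target_model()
target_model.download("mobilenet_v2.so")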

Compiling PyTorch model to a QNN Context Binary

Qualcomm® AI Hub supports compiling and profiling a PyTorch model to a QNN context binary. In this example, we will use mobilenet_v2.pt and compile it to a QNN context binary optimized to run on a specific device. Since context binaries are optimized for specific target hardware, the model can only be compiled for a single device.

The context binary is an SOC-specific deployment mechanism. When a model is compiled for a device, it is expected to be deployed on that same device. The format is operating system agnostic, so the same model can be deployed on Android, Linux, or Windows. The context binary is designed only for the NPU.

import qai_hub as hub

# Compile a model to QNN context binary
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23"),
    options="--target_runtime qnn_context_binary",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. Take a look at the example here to learn how to compile this model for the Snapdragon® Neural Processing Unit (NPU).
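
For example, the context binary can be profiled on the device it was compiled for, following the same pattern as the earlier profiling example:

# Profile the compiled QNN context binary on the device it was compiled for
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23"),
)
assert isinstance(profile_job, hub.ProfileJob)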

The QNN context binary can also be embedded with an ONNX model.

Compiling Precompiled QNN ONNX

Qualcomm® AI Hub supports compiling and profiling a pre-compiled ONNX Runtime model. This is an ONNX Runtime compatible model that contains a pre-compiled QNN context binary, runnable with ONNX Runtime on Snapdragon devices. More details are documented here.

The advantages of using precompiled QNN ONNX:

  • Ease of deployment: Works across Android, Linux, or Windows.

  • Performance gain: Equivalent to QNN context binary.

  • Simple inference code: ONNX Runtime uses the QNN Execution Provider to run inference on the compiled model (see the inference sketch after the directory layout below).

  • Large Models: Works for large models (>2GB) like LLMs, Stable Diffusion, etc.

Please note that the QNN context binary is operating system agnostic, but device specific. Additionally, the context binary is designed only for the NPU.

import qai_hub as hub

# Compile a model to a precompiled QNN ONNX model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23"),
    options="--target_runtime precompiled_qnn_onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The compiled model is a zipped directory (with extension .onnx) containing an ONNX model and a QNN context binary file. When uploading for profiling, the ONNX model and the QNN context binary file must be in the same directory. Also, since the relative file path is embedded in the ONNX model, changing it would cause unexpected failures.

<model>.onnx
   |-- <model>.onnx
   +-- <model>.bin
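
As a rough sketch of the inference side referenced in the list above, the unzipped <model>.onnx can be loaded with ONNX Runtime and the QNN execution provider. The file names, the input name "image", and the backend_path value below are assumptions based on the standard ONNX Runtime QNN EP configuration and the compile options used earlier; adjust them for your platform:

import numpy as np
import onnxruntime as ort

# Load the precompiled QNN ONNX model with the QNN execution provider.
# "QnnHtp.dll" targets the NPU on Windows; use "libQnnHtp.so" on Android/Linux.
session = ort.InferenceSession(
    "mobilenet_v2.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],
)

# Run inference with a dummy input matching the compiled input spec
# (input name "image" is assumed from the input_specs used at compile time)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"image": image})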

Compiling PyTorch model for the ONNX Runtime

Qualcomm® AI Hub supports compiling a PyTorch model for the ONNX Runtime. In this example, we will use mobilenet_v2.pt and compile it to an ONNX model. This model can be profiled using the ONNX Runtime.

ONNX Runtime supports execution on the CPU, GPU (using the DML execution provider), or the NPU (using the QNN execution provider).

import qai_hub as hub

# Compile a model to an ONNX model
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23"),
    options="--target_runtime onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

Compiling ONNX models to TensorFlow Lite or QNN

Qualcomm® AI Hub also supports the compilation of ONNX models to TensorFlow Lite or a QNN model library. We will use mobilenet_v2.onnx as an example.

import qai_hub as hub

# Compile a model to TensorFlow Lite
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
assert isinstance(compile_job, hub.CompileJob)

# Compile a model to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_lib_aarch64_android",
)
assert isinstance(compile_job, hub.CompileJob)

Compiling models quantized with AIMET to TensorFlow Lite or QNN

AI Model Efficiency Toolkit (AIMET) is an open source library that provides advanced model quantization and compression techniques for trained neural network models. AIMET’s QuantizationSimModel can be exported to one of the following:

  • Recommended: ONNX models (.onnx) and an encodings file (.encodings) with the quantization parameters.

  • TorchScript (.pt) and an encodings file (.encodings) with the quantization parameters.

To use these models, create a directory with .aimet in the name. It should contain one .pt or .onnx model and the corresponding encodings file.

<model>.aimet
   |-- <model>.onnx  or .pt file
   +-- <model>.encodings

where <model> can be any name.

Let’s use mobilenet_v2_onnx.aimet.zip as an example. After unzipping it to the mobilenet_v2_onnx.aimet directory, we can submit a compile job:

import qai_hub as hub

# Compile to TensorFlow Lite
compile_job = hub.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S23"),
)
assert isinstance(compile_job, hub.CompileJob)

# Compile to a QNN Model Library
compile_job = hub.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S23"),
    options="--target_runtime qnn_lib_aarch64_android --quantize_full_type int8",
)
assert isinstance(compile_job, hub.CompileJob)
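
As with the earlier examples, either compiled model can then be profiled on the target device:

# Profile the compiled model on the target device
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23"),
)
assert isinstance(profile_job, hub.ProfileJob)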