Deployment
Once you have a deployable asset, you may want to integrate it into an application. This process will depend on the target runtime, so please refer to its documentation:
Note: Deploying quantized ONNX models requires a few additional steps, which must be followed to achieve improved on-device performance and a reduced memory footprint.
Create Deployable ONNX models
ONNX Graph after Quantize & Compile Job: The combination of quantize and compile jobs in AI Hub produces ONNX graphs with an edge-centric quantized representation, in which tensors (edges) pass through fake quantization (a QuantizeLinear + DequantizeLinear, or Q + DQ, pair). All weights are kept in fp32 and all ops operate on fp32.
ONNX Graph used in Profile & Inference Job: As part of profile and inference jobs in AI Hub, the ONNX graph is transformed to an op-centric quantized representation that has a one-to-one mapping with a QOp representation. The benefit of QDQ over QOp is that only two additional ops (Q, DQ) are needed in the opset to represent a fully quantized graph. In the transformed graph, weights are stored as quantized values, which reduces the model size and also yields a cleaner mapping to QOp. The performance metrics are reported using this graph representation. This deployable ONNX asset cannot be downloaded directly through AI Hub.
To create this deployable asset:

* Download the target model from AI Hub.
* Run this script on the downloaded model.
Qualcomm® AI Hub Apps
This process can be daunting, with a steep learning curve. To help you get started, we provide a repository of sample apps and tutorials: