Deployment

Once you have a deployable asset, you may want to integrate it into an application. This process depends on the target runtime, so please refer to the documentation for your target runtime.

Note: Deploying quantized ONNX models involves a few additional steps that are required for good on-device performance and a reduced memory footprint.

Create Deployable ONNX Models

ONNX Graph after Quantize & Compile Job: The combination of quantize and compile jobs in AI Hub produces an ONNX graph in an edge-centric quantized representation, where each edge passes through fake quantization (a Q + DQ pair). All weights are kept in fp32, and all ops operate on fp32 tensors.
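The fake-quantization (Q + DQ) round trip described above can be sketched in a few lines of numpy. This is an illustrative model of the arithmetic only; the scale and zero-point values are made up, not taken from a real AI Hub model:

```python
import numpy as np

def fake_quantize(x, scale, zero_point):
    """Q + DQ: quantize fp32 to the int8 grid, then immediately dequantize.

    The result is still an fp32 tensor, so downstream ops keep operating
    on fp32 values -- but those values are restricted to the int8 grid.
    """
    # Q: fp32 -> int8
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    # DQ: int8 -> fp32
    return (q.astype(np.float32) - zero_point) * scale

# Illustrative weights and quantization parameters (assumptions, not real)
weights = np.array([0.1, -0.25, 0.5], dtype=np.float32)
fq = fake_quantize(weights, scale=0.01, zero_point=0)
print(fq.dtype)  # float32: the graph still runs entirely in fp32
```

This mirrors the edge-centric representation: the quantization parameters live on the edges (as Q/DQ node attributes), while the op implementations and stored weights remain fp32.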

ONNX Graph used in Profile & Inference Job: As part of profile and inference jobs in AI Hub, the ONNX graph is transformed into an op-centric quantized representation that has a one-to-one mapping with a QOp representation. The benefit of QDQ over QOp is that only two additional ops (Q and DQ) are needed in the opset to represent a fully quantized graph. In this form, the weights are stored as quantized values, which reduces the model size and also contributes to a cleaner mapping to QOp. The performance metrics are reported using this graph representation. This deployable ONNX asset cannot be downloaded directly through AI Hub.

To create this deployable asset:

* Download the target model from AI Hub.
* Run this script on the downloaded model.

Qualcomm® AI Hub Apps

This process can be daunting, with a steep learning curve. To help you get started, we provide a repository of sample apps and tutorials: