Deployment
Once you have a deployable asset, you may want to integrate it into an application. This process will depend on the target runtime, so please refer to its documentation:
Note: Deploying quantized ONNX models requires a few additional steps, which must be followed to achieve improved on-device performance and a reduced memory footprint.
Create Deployable ONNX models
ONNX Graph after Quantize & Compile Job: The combination of quantize and compile jobs in AI Hub produces ONNX graphs with an edge-centric quantized representation, in which tensors (edges) pass through fake quantization (a QuantizeLinear + DequantizeLinear, or Q + DQ, pair). All weights are kept in fp32 and all ops operate on fp32.
ONNX Graph used in Profile & Inference Job: As part of profile and inference jobs in AI Hub, the ONNX graph is transformed to an op-centric quantized representation that has a one-to-one mapping with a QOp representation. The benefit of QDQ over QOp is that only two additional ops (Q, DQ) are needed in the opset to represent a fully quantized graph. In the transformed graph, weights are stored as quantized values, which reduces the model size and also yields a cleaner mapping to QOp. The performance metrics are reported using this graph representation. This deployable ONNX asset cannot be downloaded directly through AI Hub.
To create this deployable asset:

* Download the target model from AI Hub.
* Run this script on the downloaded model.
Qualcomm® AI Hub Apps
This process can be daunting, with a steep learning curve. To help you get started, we provide a repository of sample apps and tutorials: