API documentation
Core API
Core functionality of the API, available directly via qai_hub.
Function | Description |
---|---|
upload_dataset() | Upload a dataset that expires in 30 days. |
upload_model() | Uploads a model. |
get_dataset() | Returns a dataset for a given id. |
get_datasets() | Returns a list of datasets visible to you. |
get_devices() | Returns a list of available devices. |
get_device_attributes() | Returns the superset of available device attributes. |
get_job() | Returns a job for a given id. |
get_job_summaries() | Returns summary information for jobs matching the specified filters. |
get_model() | Returns a model for a given id. |
get_models() | Returns a list of models. |
set_verbose() | If true, API calls may print progress to standard output. |
submit_compile_job() | Submits a compile job. |
submit_profile_job() | Submits a profile job. |
submit_inference_job() | Submits an inference job. |
submit_compile_and_profile_jobs() | Submits a compile job and a profile job. |
get_jobs() | Deprecated. |
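For context, here is a minimal sketch that strings several of these calls together. The model path, device name, and input shape are placeholders; substitute a device returned by get_devices() and your own model.

```python
import qai_hub as hub

# Upload a source model (path is a placeholder).
model = hub.upload_model("mobilenet_v2.pt")

# Pick a target device; get_devices() lists everything available.
device = hub.Device("Samsung Galaxy S24 (Family)")

# Compile for the device, then profile the compiled model.
compile_job = hub.submit_compile_job(
    model=model,
    device=device,
    input_specs={"image": (1, 3, 224, 224)},
)
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=device,
)
```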
Managed objects
Classes returned by the API. These objects are managed by the API and should
not be instantiated directly. For example, to construct a
Model
instance, call upload_model()
or get_model()
.
Class | Description |
---|---|
Dataset | A dataset should not be constructed directly. |
Device | Create a target device representation. |
Model | Neural network model object. |
Job | Abstract Job base class. |
JobSummary | Summary information about a job and its current status. |
CompileJobSummary | Summary information about a compile job and its current status. |
QuantizeJobSummary | Summary information about a quantize job and its current status. |
ProfileJobSummary | Summary information about a profile job and its current status. |
InferenceJobSummary | Summary information about an inference job and its current status. |
CompileJob | Compile job for a model, a set of input specs, and a set of devices. |
ProfileJob | Profile job for a model, a set of input specs, and a device. |
InferenceJob | Inference job for a model, user-provided inputs, and a device. |
CompileJobResult | Compile job result structure. |
ProfileJobResult | Profile job result structure. |
InferenceJobResult | Inference job result structure. |
JobStatus | Status of a job. |
JobType | The type of a job (compile, profile, etc.). |
SourceModelType | Set of supported input model types. |
ModelMetadataKey | Model metadata key. |
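As a brief illustration, managed objects are retrieved from the API rather than constructed directly. The sketch below assumes an existing job ID (the ID shown is a placeholder) and that JobStatus exposes success and code attributes.

```python
import qai_hub as hub

# Retrieve a managed Job object by ID rather than constructing it.
job = hub.get_job("jq9o3x8dm")   # placeholder job ID
status = job.wait()              # block until the job finishes
print(status.success, status.code)
```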
Exceptions
Exceptions thrown by the API:
Exception | Description |
---|---|
Error | Base class for all exceptions explicitly thrown by the API. |
UserError | Something in the user input caused a failure; you may need to adjust your input. |
InternalError | Internal API failure; please contact ai-hub-support@qti.qualcomm.com for assistance. |
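A hedged sketch of error handling follows; it assumes the exception classes above are importable from qai_hub.client under these names.

```python
import qai_hub as hub
from qai_hub.client import Error, UserError, InternalError  # assumed import path

try:
    model = hub.get_model("m0123456")  # placeholder model ID
except UserError:
    # Something in the user input (e.g. a bad ID) caused the failure; adjust and retry.
    raise
except InternalError:
    # Internal API failure; contact ai-hub-support@qti.qualcomm.com.
    raise
except Error:
    # Any other exception explicitly raised by the API.
    raise
```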
Common Options
Options for submit_compile_job(), submit_profile_job(), and submit_inference_job():
Option |
Description |
---|---|
|
Specifies the target compute unit(s). When used in a compile job, this can optimize the code generation of various layers. When used in a profile or inference job, this targets the specified compute unit(s). Implicit fallback to CPU is determined by the options of the selected target runtime.
Default:
Example:
|
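For example, compute units are selected through the options string of any submit call. A minimal sketch follows; the flag name --compute_unit and the value npu are assumptions, since the literal option name is not shown above.

```python
import qai_hub as hub

# Target the NPU explicitly; flag name and value are assumptions.
job = hub.submit_profile_job(
    model=hub.get_model("m0123456"),                  # placeholder model ID
    device=hub.Device("Samsung Galaxy S24 (Family)"),  # placeholder device
    options="--compute_unit npu",
)
```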
Compile Options
Options for submit_compile_job():
Compile option |
Description |
---|---|
|
Overrides the default target runtime. The default target runtime is derived from the source model and the target device. Options:
The default target is decided as follows:
Examples:
|
|
Overrides the default output names. By default, the output names of a model are
Requirements:
Default:
Examples:
|
|
Instructs the compiler to treat intermediate int64 tensors as int32. Options:
Requirements:
Default:
Examples:
|
|
Instructs the compiler to treat int64 input and output tensors as int32. Options:
Requirements:
Default:
Examples:
|
|
Overrides the channel layout. By default, the compiler maintains the layout of the model’s input and output tensors. This results in additional transpose layers in the final model, most commonly at the beginning and end. These options instruct the compiler to generate code in such a way that the generated model can consume inputs in channel-last layout and/or produce outputs in channel-last layout. Options:
Requirements:
Default:
Examples:
|
|
Quantizes an unquantized model to the specified type. It quantizes both activations and weights using a representative dataset. If no such dataset is provided, a randomly generated one is used; in that case, the generated model can only be used as a proxy for achievable performance, as it will not produce accurate results. Options:
Requirements:
Default:
Examples:
|
|
Quantizes the inputs and outputs when quantizing a model. Options:
Default:
Examples:
|
|
Specify the input/output type if
Requirements:
Default:
Examples:
|
|
Specify the amount of VTCM memory (in MB) to reserve and utilize.
Examples:
|
|
Only applicable when compiling to a QNN context binary. Sets the graph optimization value, in the range 1 to 3.
Examples:
|
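The sketch below shows how several compile options might be combined in one options string. The literal flag names (--target_runtime, --quantize_full_type, --quantize_io) are assumptions, since they are not shown in the flattened table above.

```python
import qai_hub as hub

# Compile to TensorFlow Lite and quantize to int8, including the I/O.
# Flag names are assumptions; verify against the current options reference.
compile_job = hub.submit_compile_job(
    model=hub.get_model("m0123456"),                  # placeholder model ID
    device=hub.Device("Samsung Galaxy S24 (Family)"),  # placeholder device
    input_specs={"image": (1, 3, 224, 224)},
    options="--target_runtime tflite --quantize_full_type int8 --quantize_io",
)
```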
Profile & Inference Options
Options for submit_profile_job() and submit_inference_job():
Profile option |
Description |
---|---|
|
Dequantize the output. Default is
Default:
Examples:
|
|
Specifies the TensorFlow Lite delegates to load and target. Multiple delegates can be specified; they will be used in the order given. Choose the delegate(s) best suited to running on the desired compute units of the device. Options:
Default:
Examples:
|
|
Specify behavior for the TensorFlow Lite delegates. This option is specified as a semicolon-separated list of sub-options. The full set of sub-options is listed in the TensorFlow Lite Options, TensorFlow Lite Delegate Options for Qualcomm® AI Engine Direct, TensorFlow Lite Delegate Options for GPUv2, and TensorFlow Lite Delegate Options for NNAPI tables below. Examples:
|
|
Specify behavior when targeting a Qualcomm® AI Engine Direct model library. This option is specified as a semicolon-separated list of sub-options. The full set of sub-options is listed in the Qualcomm® AI Engine Direct Options table below. Requirements:
Examples:
|
|
Specify behavior for the ONNX Runtime and ONNX Runtime execution providers. This option is specified as a semicolon-separated list of sub-options. The full set of sub-options is listed in the ONNX Runtime Options and ONNX Runtime QNN Execution Provider options tables below. Examples:
|
|
Specifies the ONNX Runtime execution providers to load and target. This option can be used with
Default:
Examples:
|
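As an illustration, runtime-specific options are passed through the same options string. The flags --tflite_delegates and --tflite_options appear in the sections below; the sub-option name allow_fp16_precision_for_fp32 is an assumption.

```python
import qai_hub as hub

# Profile with the TensorFlow Lite GPU delegate plus one delegate sub-option.
# "--tflite_delegates gpu" is shown in the GPUv2 section below; the
# sub-option name allow_fp16_precision_for_fp32 is an assumption.
profile_job = hub.submit_profile_job(
    model=hub.get_model("m0123456"),                  # placeholder model ID
    device=hub.Device("Samsung Galaxy S24 (Family)"),  # placeholder device
    options="--tflite_delegates gpu --tflite_options allow_fp16_precision_for_fp32=true",
)
```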
Profile Options
Options for submit_profile_job():
Profile option |
Description |
---|---|
|
Specifies the maximum number of profile iterations. Fewer iterations may be run if their cumulative runtime
is predicted to exceed the execution timeout defined by
Default:
Examples:
|
|
Specifies the maximum amount of time to spend performing inference. At runtime, the number of
iterations adapts to complete approximately within this limit, never exceeding the value specified
by
Requirements:
Default:
Examples:
|
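A sketch of limiting profiling work follows; the flag names --max_profiler_iterations and --max_profiler_time are assumptions inferred from the descriptions above.

```python
import qai_hub as hub

# Cap profiling at 50 iterations or roughly 300 seconds, whichever comes first.
# Both flag names are assumptions; the flattened table above omits them.
profile_job = hub.submit_profile_job(
    model=hub.get_model("m0123456"),                  # placeholder model ID
    device=hub.Device("Samsung Galaxy S24 (Family)"),  # placeholder device
    options="--max_profiler_iterations 50 --max_profiler_time 300",
)
```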
ONNX Runtime Options
Sub-options for --onnx_options that are applicable to all execution providers, although some execution providers require certain values.
ONNX Runtime option |
Description |
---|---|
|
Sets the execution mode for the session. Default is
|
|
Set the number of threads used to parallelize the execution within nodes. Default is |
|
Set the number of threads used to parallelize the execution of the graph. If nodes can be run in parallel,
this sets the maximum number of threads to use to run them in parallel. If sequential execution is enabled
this option is ignored. Default is |
|
Enable a memory pattern optimization. If the input shapes are the same, generate a memory pattern for a
future request so that just one allocation can handle all internal memory allocation. Default is |
|
Enable a memory arena on the CPU for future usage. Default is |
|
Set the optimization level to apply when loading a graph. See <https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html> for details.
Default is
|
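For example, these sub-options are joined with semicolons inside --onnx_options. The execution_mode and enable_memory_pattern keys appear in the DirectML note below; intra_op_num_threads is an assumption.

```python
import qai_hub as hub

# ONNX Runtime session settings via --onnx_options (semicolon-separated).
# intra_op_num_threads is an assumed sub-option name; execution_mode and
# enable_memory_pattern are shown in the DirectML note below.
profile_job = hub.submit_profile_job(
    model=hub.get_model("m0123456"),           # placeholder model ID
    device=hub.Device("Snapdragon X Elite CRD"),  # placeholder device
    options="--onnx_options execution_mode=SEQUENTIAL;enable_memory_pattern=false;intra_op_num_threads=4",
)
```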
ONNX Runtime QNN Execution Provider options
Sub-options for --onnx_options specific to the QNN execution provider.
For additional information on the QNN execution provider options see <https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#configuration-options>.
ONNX Runtime option (QNN) |
Description |
---|---|
|
HTP performance mode. Default is
|
|
Set the graph optimization level. Default is
|
|
Enable an fp32 model to be inferenced with fp16 precision. This option greatly improves the performance of models with fp32 inputs.
Default is |
ONNX Runtime DirectML Execution Provider options
There are no user options for the DirectML execution provider. However, the following options will be automatically set when using DirectML: execution_mode=SEQUENTIAL;enable_memory_pattern=false.
TensorFlow Lite Options
Sub-options for --tflite_options that are applicable to all delegates.
TensorFlow Lite option |
Description |
---|---|
|
Fall back to lower-priority delegates when one fails to prepare. When this is false, any delegate failure will fail the entire job. Default is |
|
Run the interpreter during cold load. This validates that delegates can execute successfully.
If an error occurs, failed delegates will be eliminated and the model will be reloaded until a working configuration can be found.
Subsequent loads will not re-run the interpreter. Default is |
|
Allow delegates to use reduced fp16 precision for fp32 operations. Delegate options may override this. Default is |
|
Force the use of OpenGL when possible. This exists for debugging purposes. Default is |
|
Number of threads to use. Default is |
|
Force all intermediate dynamic tensors to be released once they are not used by the model.
Please use this configuration with caution, since it might reduce the peak memory usage of the model at the cost of a slower
inference speed. Default is |
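A short sketch follows; the sub-option names enable_fallback and number_of_threads are assumptions inferred from the rows above.

```python
import qai_hub as hub

# TensorFlow Lite interpreter settings via --tflite_options.
# Sub-option names are assumptions; the flattened table above omits them.
profile_job = hub.submit_profile_job(
    model=hub.get_model("m0123456"),                  # placeholder model ID
    device=hub.Device("Samsung Galaxy S24 (Family)"),  # placeholder device
    options="--tflite_options enable_fallback=false;number_of_threads=4",
)
```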
TensorFlow Lite Delegate Options for Qualcomm® AI Engine Direct
Sub-options for --tflite_options specific to the QNN delegate.
For additional information on the QNN options see the QNN GPU backend and QNN HTP backend pages in the Qualcomm® AI Engine Direct Documentation.
Note that not all options are available for all backends.
TfLite option (QNN) |
Description |
---|---|
|
Set the log level for QNN.
Default:
|
|
Set the underlying graph priority.
Default:
|
|
Set the precision for the GPU backend, which defines the optimization levels of graph tensors that are neither input nor output tensors. Default is
|
|
Performance mode for the GPU. Higher performance consumes more power. Default is
|
|
DSP performance mode.
Default is
|
|
Specify the origin of the quantization parameters. Default is
|
|
HTP performance mode.
Default is
|
|
HTP precision mode. Only applicable on 8gen1 and newer. Default is
|
|
Optimize for load time or inference time. Default is
|
|
Using short conv hmx may lead to better performance. However, convolutions that have short depth and/or weights
that are not symmetric could exhibit inaccurate results. Default is |
|
Using fold relu may lead to better performance. Quantization ranges for the convolution should be equal to, or a subset of, those of the Relu operation for correct results. Default is |
|
Set the VTCM size in MB. If the VTCM size is not set, the default VTCM size will be used. The VTCM size must be greater than 0; if it is set larger than the VTCM size available on this device, the available size will be used. |
|
Set the number of HVX threads. If |
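As an illustration, QNN delegate sub-options are passed through the same --tflite_options string. Every sub-option name below (qnn_htp_performance_mode, qnn_htp_vtcm_size) and the delegate value qnn are assumptions, since the literal keys are not shown above.

```python
import qai_hub as hub

# QNN delegate tuning via --tflite_options; the delegate value and both
# sub-option names are assumptions inferred from the rows above.
profile_job = hub.submit_profile_job(
    model=hub.get_model("m0123456"),                  # placeholder model ID
    device=hub.Device("Samsung Galaxy S24 (Family)"),  # placeholder device
    options="--tflite_delegates qnn --tflite_options qnn_htp_performance_mode=burst;qnn_htp_vtcm_size=4",
)
```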
TensorFlow Lite Delegate Options for GPUv2
Sub-options for --tflite_options specific to the GPUv2 delegate. The GPUv2 delegate is chosen with --tflite_delegates gpu.
For additional information on the TensorFlow Lite GPU options see delegate_options.h.
TfLite option (GPUv2) |
Description |
---|---|
|
Specify the compilation/runtime trade-off. Default is
|
|
|
|
A graph could have multiple partitions that can be delegated.
This limits the maximum number of partitions to be delegated. Default is |
TensorFlow Lite Delegate Options for NNAPI
Sub-options for --tflite_options specific to the NNAPI delegate.
For additional information on the NNAPI specific options see nnapi_delegate_c_api.h.
TensorFlow Lite option (NNAPI) |
Description |
---|---|
|
Power/performance trade-off. Default is
|
|
Set the maximum number of partitions. A value less than or equal to zero means no limit.
If the delegation of the full set of supported nodes would generate a
number of partitions greater than this parameter, only |
|
Allow fp32 to run as fp16. Default is |
Qualcomm® AI Engine Direct Options
Sub-options for --qnn_options. These options are applicable when the target runtime is qnn_lib_aarch64_android.
For additional information on the Qualcomm® AI Engine Direct options see the QNN GPU backend and QNN HTP backend pages in the Qualcomm® AI Engine Direct Documentation.
Note that not all options are available for all backends.
Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. Using a model library with a newer SDK may result in undefined behavior. If you experience an error using an older model library with AI Hub, you may need to recompile the model library. The version of the QNN SDK used is visible on the AI Hub Compile Job details page.
Qualcomm® AI Engine Direct option |
Description |
---|---|
|
Specify asynchronous execution queue depth. |
|
The name of the graph to deserialize from a context binary. This argument is necessary if the context binary contains more than one graph. An error is generated if an invalid graph name is provided or multiple graphs are used. |
|
Sets the error reporting level. Options:
|
|
Amount of memory to be reserved for error information. Specified in KB. |
|
Sets the peak memory limit hint of a deserialized context in MB. |
|
Sets the default priority for graphs in this context. Options:
|
|
Set the GPU performance hint. Default is
|
|
OpenGL buffers will be used if set to true. |
|
Set the HTP performance mode. Default is
|
|
Options:
|
|
Specify the precision mode to use. Default is
|
|
When |
|
When |
|
The QNN GPU backend will use queue recording to improve performance. When |
|
For any graph where a convolution or convolution-like operation is followed by |
|
Define number of HVX threads to reserve and utilize for a particular graph. Default is |
|
Set an optimization level to balance prepare and execution. Options:
|
|
Specify in combination with
|
|
If no precision value is set, the QNN HTP backend assumes that the client expects to run a quantized network. When the precision value is set to
|
|
Run all Convolution operations using HMX instructions. Default is |
|
Set the amount of VTCM memory (in MB) to reserve and utilize. Specify |
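Finally, a hedged sketch of passing Qualcomm® AI Engine Direct options. Only the --qnn_options flag itself appears in this section; both sub-option names below are assumptions inferred from the row descriptions above.

```python
import qai_hub as hub

# Qualcomm AI Engine Direct settings via --qnn_options (semicolon-separated).
# Both sub-option names are assumptions; verify against the current reference.
profile_job = hub.submit_profile_job(
    model=hub.get_model("m0123456"),                  # placeholder model ID
    device=hub.Device("Samsung Galaxy S24 (Family)"),  # placeholder device
    options="--qnn_options context_priority=high;default_graph_htp_performance_mode=burst",
)
```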