API documentation

Core API

Core functionality of the API, available directly via qai_hub.

upload_dataset

Uploads a dataset that expires in 30 days.

upload_model

Uploads a model.

get_dataset

Returns a dataset for a given id.

get_datasets

Returns a list of datasets visible to you.

get_devices

Returns a list of available devices.

get_device_attributes

Returns the superset of available device attributes.

get_job

Returns a job for a given id.

get_jobs

Returns a list of jobs visible to you.

get_model

Returns a model for a given id.

get_models

Returns a list of models.

set_verbose

If true, API calls may print progress to standard output.

submit_compile_job

Submits a compile job.

submit_profile_job

Submits a profiling job.

submit_inference_job

Submits an inference job.

submit_compile_and_profile_jobs

Submits a compilation job and a profile job.
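
Together, these functions form the typical upload, compile, profile, and inference flow. The sketch below is a minimal illustration only: the model path, device name, and input name/shape are placeholders, helper methods such as get_target_model() and download_output_data() are assumed from the managed objects described later, and parameter names such as input_specs and options follow the published examples; check the per-function reference for exact signatures.

  import numpy as np
  import qai_hub as hub

  # Pick a target device (the device name here is a placeholder).
  device = hub.Device("Samsung Galaxy S23")

  # Compile a source model (e.g. a traced TorchScript file) for that device.
  compile_job = hub.submit_compile_job(
      model="mobilenet_v2.pt",                   # placeholder source model
      device=device,
      input_specs=dict(image=(1, 3, 224, 224)),  # assumed input name and shape
      options="--target_runtime tflite",
  )

  # Profile the compiled model on the device.
  profile_job = hub.submit_profile_job(
      model=compile_job.get_target_model(),      # assumed CompileJob accessor
      device=device,
  )

  # Run inference with user-provided inputs.
  inference_job = hub.submit_inference_job(
      model=compile_job.get_target_model(),
      device=device,
      inputs=dict(image=[np.random.rand(1, 3, 224, 224).astype(np.float32)]),
  )
  outputs = inference_job.download_output_data() # assumed InferenceJob accessor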

Managed objects

Classes returned by the API. These objects are managed by the API and should not be instantiated directly. For example, to construct a Model instance, call upload_model() or get_model().
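
In code, that looks like the following sketch (the ids are placeholders returned by earlier API calls):

  import qai_hub as hub

  # Managed objects are obtained from API calls, never constructed directly.
  model = hub.get_model("<model_id>")        # id returned by a prior upload_model()
  dataset = hub.get_dataset("<dataset_id>")  # id returned by a prior upload_dataset()
  job = hub.get_job("<job_id>")              # id of a previously submitted job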

Dataset

A dataset should not be constructed directly.

Device

Create a target device representation.

Model

Neural network model object.

Job

Job for a model and a device.

CompileJob

Compile job for a model, a set of input specs, and a device.

ProfileJob

Profile job for a model, a set of input specs, and a device.

InferenceJob

Inference job for a model, user provided inputs, and a device.

CompileJobResult

Compile Job result structure.

ProfileJobResult

Profile Job result structure.

InferenceJobResult

Inference Job result structure.

JobStatus

Status of a job.

SourceModelType

Set of supported input model types.

Exceptions

Exceptions thrown by the API:

Error

Base class for all exceptions explicitly thrown by the API.

UserError

Something in the user input caused a failure; you may need to adjust your input.

InternalError

Internal API failure; please contact ai-hub-support@qti.qualcomm.com for assistance.
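
A sketch of the usual handling pattern, assuming the exception classes are importable from the top-level qai_hub package:

  import qai_hub as hub

  try:
      model = hub.get_model("<model_id>")  # placeholder id
  except hub.UserError:
      # The input was invalid (for example, an unknown id); adjust it and retry.
      raise
  except hub.Error:
      # Any other failure explicitly raised by the API.
      raise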

Common Options

Options for submit_compile_job(), submit_profile_job(), and submit_inference_job():

Option

Description

--compute_unit <units>

Specifies the target compute unit(s). When used in a compile job, this can optimize the code generation of various layers. When used in a profile or inference job, this targets the specified compute unit(s).

Implicit fallback to CPU is determined by the target_runtime; TfLite and ONNX Runtime always include CPU fallback.

Options for <units> (comma-separated values are unordered):

  • all

    • Target all available compute units with priority NPU, then GPU, then CPU.

  • npu

    • Target NPU.

  • gpu

    • Target GPU.

  • cpu

    • Target CPU.

Default:

  • npu when the target runtime is qnn_lib_aarch64_android

  • all for any other target runtime

Example:

  • --compute_unit npu,cpu
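
These flags are passed to the job-submission functions as a single options string; a sketch, assuming the options keyword argument shown in the public examples:

  import qai_hub as hub

  profile_job = hub.submit_profile_job(
      model="<compiled_model_or_id>",           # placeholder
      device=hub.Device("Samsung Galaxy S23"),  # placeholder device name
      options="--compute_unit npu,cpu",
  )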

Compile Options

Options for submit_compile_job():

Compile option

Description

--target_runtime <runtime>

Overrides the default target runtime. The default target runtime is derived from the source model. Options for <runtime>:

  • tflite: TensorFlow Lite (.tflite)

  • qnn_lib_aarch64_android: Qualcomm® AI Engine Direct model library (.so) targeting AArch64 Android

  • qnn_context_binary: Qualcomm® AI Engine Direct context binary (.bin) targeting the hardware specified in the compile job.

  • onnx: ONNX Runtime (.onnx)

Default:

The source model dictates the default target.

  • .pt : tflite

  • .onnx : tflite

  • .aimet_onnx : tflite

  • .aimet_pt : qnn_lib_aarch64_android

Examples:

  • --target_runtime tflite

  • --target_runtime qnn_lib_aarch64_android

--output_names "<name>[,name]..."

Overrides the default output names. By default, the output names of a model are output_0, output_1 …. Options:

  • comma separated list of output names

Requirements:

  • When used, a name must be specified for each model output.

Default:

  • Output names of a model are output_0, output_1 ….

Examples:

  • --output_names "new_name_0,new_name_1"

--force_channel_last_input "<name>[,name]..."

--force_channel_last_output "<name>[,name]..."

Overrides the channel layout. By default, the compiler maintains the layout of the model’s input and output tensors. This results in additional transpose layers in the final model, most commonly at the beginning and end. These options instruct the compiler to generate code in such a way that the generated model can consume inputs in channel-last layout and/or produce outputs in channel-last layout. Options:

  • comma-separated list of tensor names (input names for --force_channel_last_input, output names for --force_channel_last_output)

Requirements:

  • This option cannot be used if the target runtime is ONNX. Note that --force_channel_last_output takes either the default names or the names from the --output_names option.

Default:

  • The compiler will maintain the layout of the model’s input and output tensors.

Examples:

  • --force_channel_last_input "image,indices"

  • --force_channel_last_output "output_0,output_1"

--quantize_full_type <type>

Quantizes an unquantized model to the specified type. It quantizes both activations and weights using a representative dataset. If no such dataset is provided, a randomly generated one is used; in that case, the generated model can be used only as a proxy for achievable performance, as it will not produce accurate results. Options:

  • int8: quantize activations and weights to int8

  • int16: quantize activations and weights to int16

  • w8a16: quantize weights to int8 and activations to int16 (recommended over int16)

  • w4a8: quantize weights to int4 and activations to int8

  • w4a16: quantize weights to int4 and activations to int16

Default:

  • No fake quantization.

Examples:

  • --quantize_full_type int8

--quantize_io [true|false]

Quantize the inputs and outputs when quantizing a model. Options:

  • true (or no argument)

  • false

Default:

  • Inputs and outputs are not quantized.

Examples:

  • --quantize_io

--quantize_io_type <type>

Specify the input/output type if --quantize_io is specified. This is currently only available for the following runtime and quantization type (--quantize_full_type):

  • TensorFlow Lite and int8: int8 (default), uint8

Default:

  • Type is determined based on the runtime and quantization type.

Examples:

  • --quantize_io --quantize_io_type uint8

--qnn_context_binary_vtcm <num>

Specify the amount in MB of VTCM memory to reserve and utilize.

  • Default: 4MB.

  • 0: Use the maximum amount of VTCM for a specific device.

  • This option can only be used along with --target_runtime qnn_context_binary

Examples:

  • --qnn_context_binary_vtcm 2

--qnn_context_binary_optimization_level <num>

Only applicable when compiling to a QNN context binary. Sets the graph optimization level, in the range 1 to 3.

  • Default: 2

  • 1: Fast preparation time, less optimal graph

  • 2: Longer preparation time, more optimal graph

  • 3: Longest preparation time, most likely even more optimal graph

  • This option can only be used along with --target_runtime qnn_context_binary

Examples:

  • --qnn_context_binary_optimization_level 3
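
A sketch combining several compile flags into one options string (model path, device name, and input spec are placeholders; with no representative dataset supplied, quantization uses randomly generated calibration data, so the result is a performance proxy only):

  import qai_hub as hub

  compile_job = hub.submit_compile_job(
      model="mobilenet_v2.pt",                   # placeholder source model
      device=hub.Device("Samsung Galaxy S23"),   # placeholder device name
      input_specs=dict(image=(1, 3, 224, 224)),  # assumed input name and shape
      options="--target_runtime tflite --quantize_full_type int8 --quantize_io",
  )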

Profile & Inference Options

Options for submit_profile_job() and submit_inference_job():

Profile option

Description

--dequantize_outputs [true|false]

Dequantize the output. Default is true. Options:

  • true

  • false

Default:

  • Output is dequantized

Examples:

  • --dequantize_outputs

--tflite_delegates <delegate>[,<delegate>]...

Specifies the TensorFlow Lite delegates to load and target. Multiple delegates can be specified. They will be used in the order specified. Specify the best delegate(s) for the device to run on the desired compute units. Options:

  • qnn

  • qnn-gpu

  • nnapi

  • nnapi-gpu

  • gpu

  • xnnpack

Default:

  • The selection of delegates is informed by the compute units specified.

Examples:

  • --tflite_delegates qnn,qnn-gpu

--tflite_options <option>[;<option>]...

Specify behavior for the TensorFlow Lite delegates. This option is specified as a semicolon-separated list of sub-options. The full set of sub-options is listed in the TensorFlow Lite Options, TensorFlow Lite Delegate Options for Qualcomm® AI Engine Direct, TensorFlow Lite Delegate Options for GPUv2, and TensorFlow Lite Delegate Options for NNAPI tables below.

Examples:

  • --tflite_options number_of_threads=4;qnn_htp_precision=kHtpQuantized

--qnn_options <option>[;<option>]...

Specify behavior when targeting a Qualcomm® AI Engine Direct model library. This option is specified as a semicolon-separated list of sub-options. The full set of sub-options is listed in the Qualcomm® AI Engine Direct Options table below.

Requirements:

  • --qnn_options is only applicable when used with --target_runtime qnn_lib_aarch64_android.

Examples:

  • --target_runtime qnn_lib_aarch64_android --qnn_options context_priority=HIGH;context_htp_performance_mode=BALANCED

--onnx_execution_providers <ep>[,<ep>]...

Specifies the ONNX Runtime execution providers to load and target. This option can be used with --target_runtime onnx. Multiple execution providers can be specified. Options:

  • qnn

  • qnn-gpu

Default:

  • The selection of execution providers is informed by the compute units specified.

Examples:

  • --onnx_execution_providers qnn,qnn-gpu
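
A sketch that targets the QNN delegate on a compiled TensorFlow Lite model and tunes it via sub-options (model and device are placeholders):

  import qai_hub as hub

  profile_job = hub.submit_profile_job(
      model="<compiled_tflite_model_or_id>",     # placeholder
      device=hub.Device("Samsung Galaxy S23"),   # placeholder device name
      options=(
          "--compute_unit npu "
          "--tflite_delegates qnn "
          "--tflite_options qnn_htp_performance_mode=kHtpBurst;number_of_threads=4"
      ),
  )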

Profile Options

Options for submit_profile_job():

Profile option

Description

--max_profiler_iterations=<num>

Specifies the maximum number of profile iterations. Fewer iterations may be run if their cumulative runtime is predicted to exceed the execution timeout defined by --max_profiler_time. Options:

  • <num> The maximum number of iterations to run.

Default:

  • 100 iterations

Examples:

  • --max_profiler_iterations=50

--max_profiler_time=<seconds>

Specifies the maximum amount of time to spend performing inference. At runtime, the number of iterations adapts to complete approximately within this limit, never exceeding the value specified by --max_profiler_iterations. For example, let’s say a profile job is submitted with --max_profiler_iterations=100 --max_profiler_time=10. This indicates that no more than 10 seconds should be spent in a loop profiling inference, allowing for up to 100 iterations if they can be completed within that time. The loop is terminated if the running average of inference times predicts that another iteration would cause the timeout to be exceeded.

  • <seconds> The soft limit of how long to spend performing inference on-device.

Requirements:

  • This value must be strictly positive and not exceed 600.

Default:

  • 600 seconds (10 minutes)

Examples:

  • --max_profiler_time=30
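
A sketch capping the profiling loop at 50 iterations or roughly 30 seconds, whichever is reached first (placeholders as before):

  import qai_hub as hub

  profile_job = hub.submit_profile_job(
      model="<compiled_model_or_id>",            # placeholder
      device=hub.Device("Samsung Galaxy S23"),   # placeholder device name
      options="--max_profiler_iterations=50 --max_profiler_time=30",
  )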

TensorFlow Lite Options

Sub-options for --tflite_options that are applicable to all delegates.

TensorFlow Lite option

Description

enable_fallback=[true|false]

Fall back to lower-priority delegates when one fails to prepare. When this is false, any delegate failure fails the entire job. Default is true.

invoke_interpreter_on_cold_load=[true|false]

Run the interpreter during cold load. This validates that delegates can execute successfully. If an error occurs, failed delegates will be eliminated and the model will be reloaded until a working configuration can be found. Subsequent loads will not re-run the interpreter. Default is false.

allow_fp32_as_fp16=[true|false]

Allow delegates to use reduced fp16 precision for fp32 operations. Delegate options may override this. Default is true.

force_opengl=[true|false]

Force the use of OpenGL when possible. This exists for debugging purposes. Default is false.

number_of_threads=<num>

Number of threads to use. Default is -1, which uses std::thread::hardware_concurrency() / 2.

release_dynamic_tensors=[true|false]

Force all intermediate dynamic tensors to be released once they are not used by the model. Please use this configuration with caution, since it might reduce the peak memory usage of the model at the cost of a slower inference speed. Default is false.

TensorFlow Lite Delegate Options for Qualcomm® AI Engine Direct

Sub-options for --tflite_options specific to the QNN delegate. For additional information on the QNN options see the QNN GPU backend and QNN HTP backend pages in the Qualcomm® AI Engine Direct Documentation. Note that not all options are available for all backends.

TfLite option (QNN)

Description

qnn_log_level=<option>

Set the log level for QNN. Default is kLogLevelWarn. Options:

  • kLogOff

  • kLogLevelError

  • kLogLevelWarn

  • kLogLevelInfo

  • kLogLevelVerbose

  • kLogLevelDebug

qnn_graph_priority=<option>

Set the underlying graph priority. Default: kQnnPriorityDefault. Options:

  • kQnnPriorityDefault

  • kQnnPriorityLow

  • kQnnPriorityNormal

  • kQnnPriorityNormalHigh

  • kQnnPriorityHigh

  • kQnnPriorityUndefined

qnn_gpu_precision=<option>

Set the precision for the GPU backend, which defines the optimization level of graph tensors that are neither input nor output tensors. Default is kGpuFp16. Options:

  • kGpuUserProvided

  • kGpuFp32

  • kGpuFp16

  • kGpuHybrid

qnn_gpu_performance_mode=<option>

Performance mode for the GPU. Higher performance consumes more power. Default is kGpuHigh. Options:

  • kGpuDefault

  • kGpuHigh

  • kGpuNormal

  • kGpuLow

qnn_dsp_performance_mode=<option>

DSP performance mode. Default is kDspBurst. Options:

  • kDspLowPowerSaver

  • kDspPowerSaver

  • kDspHighPowerSaver

  • kDspLowBalanced

  • kDspBalanced

  • kDspHighPerformance

  • kDspSustainedHighPerformance

  • kDspBurst

qnn_dsp_encoding=<option>

Specify the origin of the quantization parameters. Default is kDspStatic. Options:

  • kDspStatic: Use quantization parameters from the model.

  • kDspDynamic: Use quantization parameters from the model inputs.

qnn_htp_performance_mode=<option>

HTP performance mode. Default is kHtpBurst. Options:

  • kHtpLowPowerSaver

  • kHtpPowerSaver

  • kHtpHighPowerSaver

  • kHtpLowBalanced

  • kHtpBalanced

  • kHtpHighPerformance

  • kHtpSustainedHighPerformance

  • kHtpBurst

qnn_htp_precision=<option>

HTP precision mode. Only applicable on 8gen1 and newer. Default is kHtpFp16, when supported. Options:

  • kHtpQuantized

  • kHtpFp16

qnn_htp_optimization_strategy=<option>

Optimize for load time or inference time. Default is kHtpOptimizeForInference.

  • kHtpOptimizeForInference

  • kHtpOptimizeForPrepare

qnn_htp_use_conv_hmx=[true|false]

Using HMX instructions for convolutions may lead to better performance. However, convolutions with short depth and/or asymmetric weights could produce inaccurate results. Default is true.

qnn_htp_use_fold_relu=[true|false]

Folding ReLU may lead to better performance. For correct results, the quantization range of the convolution must be equal to or a subset of the ReLU operation's range. Default is false.

qnn_htp_vtcm_size=<size>

Set the VTCM size in MB. If not set, the default VTCM size is used. The value must be greater than 0; if it exceeds the VTCM available on the device, it is clamped to the VTCM size available for that device.

qnn_htp_num_hvx_threads=<num>

Set the number of HVX threads. The value must be greater than 0; if it exceeds the maximum number of HVX threads supported, it is clipped to that maximum.

TensorFlow Lite Delegate Options for GPUv2

Sub-options for --tflite_options specific to the GPUv2 delegate. The GPUv2 delegate is chosen with --tflite_delegates gpu. For additional information on the TensorFlow Lite GPU options see delegate_options.h.

TfLite option (GPUv2)

Description

gpu_inference_preference=<option>

Specify the compilation/runtime trade-off. Default is TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED. Options:

  • TFLITE_GPU_INFERENCE_PREFERENCE_FAST_SINGLE_ANSWER: The delegate will be used only once, therefore, bootstrap/init time should be taken into account.

  • TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED: Prefer maximizing the throughput. The same delegate will be used repeatedly on multiple inputs.

  • TFLITE_GPU_INFERENCE_PREFERENCE_BALANCED: Balance init latency and throughput. This option will result in slightly higher init latency than TFLITE_GPU_INFERENCE_PREFERENCE_FAST_SINGLE_ANSWER but should have inference latency closer to TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED.

gpu_inference_priority1=<option>

gpu_inference_priority2=<option>

gpu_inference_priority3=<option>

Set the top ordered priorities.

Default for priority1 is TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY. Default for priority2 is TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE. Default for priority3 is TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION. To specify maximum precision, use the following set for priority1-priority3: TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION, TFLITE_GPU_INFERENCE_PRIORITY_AUTO, TFLITE_GPU_INFERENCE_PRIORITY_AUTO. Options:

  • TFLITE_GPU_INFERENCE_PRIORITY_AUTO

  • TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION

  • TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY

  • TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE

gpu_max_delegated_partitions=<num>

A graph could have multiple partitions that can be delegated. This limits the maximum number of partitions to be delegated. Default is 1.

TensorFlow Lite Delegate Options for NNAPI

Sub-options for --tflite_options specific to the NNAPI delegate. For additional information on the NNAPI specific options see nnapi_delegate_c_api.h.

TensorFlow Lite option (NNAPI)

Description

nnapi_execution_preference=<option>

Power/performance trade-off. Default is kSustainedSpeed. Options:

  • kLowPower

  • kFastSingleAnswer

  • kSustainedSpeed

nnapi_max_number_delegated_partitions=<max>

Set the maximum number of partitions. A value less than or equal to zero means no limit. If delegating the full set of supported nodes would generate more partitions than this parameter allows, only <max> of them will actually be accelerated. Partitions are currently selected by sorting them in decreasing order of node count until the limit is reached. Default is 3.

nnapi_allow_fp16=[true|false]

Allow fp32 to run as fp16. Default is true.

Qualcomm® AI Engine Direct Options

Sub-options for --qnn_options. These options are applicable when the target runtime is qnn_lib_aarch64_android. For additional information on the Qualcomm® AI Engine Direct options see the QNN GPU backend and QNN HTP backend pages in the Qualcomm® AI Engine Direct Documentation. Note that not all options are available for all backends.

Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. Using a model library with a newer SDK may result in undefined behavior. If you experience an error using an older model library with AI Hub, you may need to recompile the model library. The version of the QNN SDK used is visible on the AI Hub Compile Job details page.

Qualcomm® AI Engine Direct option

Description

context_async_execution_queue_depth_numeric=<num>

Specify asynchronous execution queue depth.

context_enable_graphs=<name>[,<name>]...

The names of the graphs to deserialize from a context binary. All graphs are enabled by default. An error is generated if an invalid graph name is provided.

context_error_reporting_options_level=<option>

Sets the error reporting level. Options:

  • BRIEF

  • DETAILED

context_error_reporting_options_storage_limit=<kb>

Amount of memory to be reserved for error information. Specified in KB.

context_memory_limit_hint=<mb>

Sets the peak memory limit hint of a deserialized context in MB.

context_priority=<option>

Sets the default priority for graphs in this context. Options:

  • LOW

  • NORMAL

  • NORMAL_HIGH

  • HIGH

context_gpu_performance_hint=<option>

Set the GPU performance hint. Default is HIGH. Options:

  • HIGH: Maximize GPU clock frequencies.

  • NORMAL: Balance performance dependent upon power management.

  • LOW: Use lowest power consumption at the expense of inference latency.

context_gpu_use_gl_buffers=[true|false]

OpenGL buffers will be used if set to true.

context_htp_performance_mode=<option>

Set the HTP performance mode. Default is BURST. Options:

  • EXTREME_POWER_SAVER

  • LOW_POWER_SAVER

  • POWER_SAVER

  • HIGH_POWER_SAVER

  • LOW_BALANCED

  • BALANCED

  • HIGH_PERFORMANCE

  • SUSTAINED_HIGH_PERFORMANCE

  • BURST

default_graph_priority=<option>

Options:

  • LOW

  • NORMAL

  • NORMAL_HIGH

  • HIGH

default_graph_gpu_precision=<option>

Specify the precision mode to use. Default is USER_PROVIDED. Options:

  • FLOAT32: Convert tensor data types to FP32 and select kernels that use an FP32 accumulator.

  • FLOAT16: Convert tensor data types to FP16 and select kernels that use an FP16 accumulator where possible.

  • HYBRID: Convert tensor data types to FP16 and select kernels that use an FP32 accumulator.

  • USER_PROVIDED: Do not optimize tensor data types.

default_graph_gpu_disable_memory_optimizations=[true|false]

When true, each tensor in the model will be allocated unique memory and sharing is disabled.

default_graph_gpu_disable_node_optimizations=[true|false]

When true, operations will not be fused and will be kept separate.

default_graph_gpu_disable_queue_recording=[true|false]

The QNN GPU backend will use queue recording to improve performance. When true, queue recording is disabled.

default_graph_htp_disable_fold_relu_activation_into_conv=[true|false]

For any graph where a convolution or convolution-like operation is followed by Relu or ReluMinMax, the Relu is folded into the convolution operation. Default is false.

default_graph_htp_num_hvx_threads=<num>

Define the number of HVX threads to reserve and utilize for a particular graph. Default is 4.

default_graph_htp_optimization_type=<option>

Set an optimization level to balance prepare and execution. Options:

  • FINALIZE_OPTIMIZATION_FLAG

default_graph_htp_optimization_value=<num>

Specify in combination with default_graph_htp_optimization_type. Options for <num>:

  • 1: Faster preparation time, less optimal graph.

  • 2: Longer preparation time, more optimal graph

  • 3: Longest preparation time, most likely even more optimal graph.

default_graph_htp_precision=<option>

If no precision value is set, the QNN HTP backend assumes that the client expects to run a quantized network. When the precision value is set to FLOAT16, the QNN HTP backend will convert user provided float32 inputs to float16 and execute the graph with float16 math. Options:

  • FLOAT16

default_graph_htp_disable_short_depth_conv_on_hmx=[true|false]

Run all Convolution operations using HMX instructions. Default is false.

default_graph_htp_vtcm_size=<mb>

Set the amount of VTCM memory (in MB) to reserve and utilize. Specify 0 to use the maximum amount. Default is 4.