API documentation

Core API

Core functionality of the API, available directly via qai_hub.

upload_dataset

Uploads a dataset that expires in 30 days.

upload_model

Uploads a model.

get_dataset

Returns a dataset for a given id.

get_datasets

Returns a list of datasets visible to you.

get_devices

Returns a list of available devices.

get_device_attributes

Returns the superset of available device attributes.

get_job

Returns a job for a given id.

get_jobs

Returns a list of jobs visible to you.

get_model

Returns a model for a given id.

get_models

Returns a list of models.

set_verbose

If true, API calls may print progress to standard output.

submit_compile_job

Submits a compile job.

submit_profile_job

Submits a profiling job.

submit_inference_job

Submits an inference job.

submit_compile_and_profile_jobs

Submits a compilation job and a profile job.
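
Together, these functions form the typical upload, compile, profile, and inference flow. The sketch below is a minimal illustration only: the model path, device name, and input name/shape are placeholders, helper methods such as get_target_model() and download_output_data() are assumed from the managed objects described later, and parameter names such as input_specs and options follow the published examples; check the per-function reference for exact signatures.

  import numpy as np
  import qai_hub as hub

  # Pick a target device (the device name here is a placeholder).
  device = hub.Device("Samsung Galaxy S23")

  # Compile a source model (e.g. a traced TorchScript file) for that device.
  compile_job = hub.submit_compile_job(
      model="mobilenet_v2.pt",                   # placeholder source model
      device=device,
      input_specs=dict(image=(1, 3, 224, 224)),  # assumed input name and shape
      options="--target_runtime tflite",
  )

  # Profile the compiled model on the device.
  profile_job = hub.submit_profile_job(
      model=compile_job.get_target_model(),      # assumed CompileJob accessor
      device=device,
  )

  # Run inference with user-provided inputs.
  inference_job = hub.submit_inference_job(
      model=compile_job.get_target_model(),
      device=device,
      inputs=dict(image=[np.random.rand(1, 3, 224, 224).astype(np.float32)]),
  )
  outputs = inference_job.download_output_data() # assumed InferenceJob accessor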

Managed objects

Classes returned by the API. These objects are managed by the API and should not be instantiated directly. For example, to construct a Model instance, call upload_model() or get_model().
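
In code, that looks like the following sketch (the ids are placeholders returned by earlier API calls):

  import qai_hub as hub

  # Managed objects are obtained from API calls, never constructed directly.
  model = hub.get_model("<model_id>")        # id returned by a prior upload_model()
  dataset = hub.get_dataset("<dataset_id>")  # id returned by a prior upload_dataset()
  job = hub.get_job("<job_id>")              # id of a previously submitted job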

Dataset

A dataset should not be constructed directly.

Device

Create a target device representation.

Model

Neural network model object.

Job

Job for a model and a device.

CompileJob

Compile job for a model, a set of input specs, and a device.

ProfileJob

Profile job for a model, a set of input specs, and a device.

InferenceJob

Inference job for a model, user provided inputs, and a device.

CompileJobResult

Compile Job result structure.

ProfileJobResult

Profile Job result structure.

InferenceJobResult

Inference Job result structure.

JobStatus

Status of a job.

SourceModelType

Set of supported input model types.

Exceptions

Exceptions thrown by the API:

Error

Base class for all exceptions explicitly thrown by the API.

UserError

Something in the user input caused a failure; you may need to adjust your input.

InternalError

Internal API failure; please contact ai-hub-support@qti.qualcomm.com for assistance.
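
A sketch of the usual handling pattern, assuming the exception classes are importable from the top-level qai_hub package:

  import qai_hub as hub

  try:
      model = hub.get_model("<model_id>")  # placeholder id
  except hub.UserError:
      # The input was invalid (for example, an unknown id); adjust it and retry.
      raise
  except hub.Error:
      # Any other failure explicitly raised by the API.
      raise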

Common Options

Options for submit_compile_job(), submit_profile_job(), and submit_inference_job():

Option

Description

--compute_unit <units>

Specifies the target compute unit(s). When used in a compile job, this can optimize the code generation of various layers. When used in a profile or inference job, this targets the specified compute unit(s).

Implicit fallback to CPU is determined by the target_runtime; TfLite and ONNX Runtime always include CPU fallback.

Options for <units> (comma-separated values are unordered):

  • all

    • Target all available compute units with priority NPU, then GPU, then CPU.

  • npu

    • Target NPU.

  • gpu

    • Target GPU.

  • cpu

    • Target CPU.

Default:

  • npu when the target runtime is qnn_lib_aarch64_android

  • all for any other target runtime

Example:

  • --compute_unit npu,cpu
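
These flags are passed to the job-submission functions as a single options string; a sketch, assuming the options keyword argument shown in the public examples:

  import qai_hub as hub

  profile_job = hub.submit_profile_job(
      model="<compiled_model_or_id>",           # placeholder
      device=hub.Device("Samsung Galaxy S23"),  # placeholder device name
      options="--compute_unit npu,cpu",
  )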

Compile Options

Options for submit_compile_job():

Compile option

Description

--target_runtime <runtime>

Overrides the default target runtime. The default target runtime is derived from the source model. Options for <runtime>:

  • tflite: TensorFlow Lite (.tflite)

  • qnn_lib_aarch64_android: Qualcomm® AI Engine Direct model library (.so) targeting AArch64 Android

  • qnn_context_binary: Qualcomm® AI Engine Direct context binary (.bin) targeting the hardware specified in the compile job.

  • onnx: ONNX Runtime (.onnx)

Default:

The source model dictates the default target.

  • .pt : tflite

  • .onnx : tflite

  • .aimet_onnx : tflite

  • .aimet_pt : qnn_lib_aarch64_android

Examples:

  • --target_runtime tflite

  • --target_runtime qnn_lib_aarch64_android

--output_names "<name>[,name]..."

Overrides the default output names. By default, the output names of a model are output_0, output_1 …. Options:

  • comma separated list of output names

Requirements:

  • When used, a name must be specified for each model output.

Default:

  • Output names of a model are output_0, output_1 ….

Examples:

  • --output_names "new_name_0,new_name_1"

--force_channel_last_input "<name>[,name]..."

--force_channel_last_output "<name>[,name]..."

Overrides the channel layout. By default, the compiler maintains the layout of the model’s input and output tensors. This results in additional transpose layers in the final model, most commonly at the beginning and end. These options instruct the compiler to generate code in such a way that the generated model can consume inputs in channel-last layout and/or produce outputs in channel-last layout. Options:

  • comma-separated list of tensor names (input names for --force_channel_last_input, output names for --force_channel_last_output)

Requirements:

  • This option cannot be used if the target runtime is ONNX. Note that --force_channel_last_output takes either the default names or the names from the --output_names option.

Default:

  • The compiler will maintain the layout of the model’s input and output tensors.

Examples:

  • --force_channel_last_input "image,indices"

  • --force_channel_last_output "output_0,output_1"

--quantize_full_type <type>

Quantizes an unquantized model to the specified type. It quantizes both activations and weights using a representative dataset. If no such dataset is provided, a randomly generated one is used; in that case, the generated model can be used only as a proxy for achievable performance, as it will not produce accurate results. Options:

  • int8: quantize activations and weights to int8

  • int16: quantize activations and weights to int16

  • w8a16: quantize weights to int8 and activations to int16 (recommended over int16)

  • w4a8: quantize weights to int4 and activations to int8

  • w4a16: quantize weights to int4 and activations to int16

Default:

  • No fake quantization.

Examples:

  • --quantize_full_type int8

--quantize_io [true|false]

Quantize the inputs and outputs when quantizing a model. Options:

  • true (or no argument)

  • false

Default:

  • Inputs and outputs are not quantized.

Examples:

  • --quantize_io

--quantize_io_type <type>

Specify the input/output type if --quantize_io is specified. This is currently only available for the following runtime and quantization type (--quantize_full_type):

  • TensorFlow Lite and int8: int8 (default), uint8

Default:

  • Type is determined based on the runtime and quantization type.

Examples:

  • --quantize_io --quantize_io_type uint8

--qnn_context_binary_vtcm <num>

Specify the amount in MB of VTCM memory to reserve and utilize.

  • Default: 4MB.

  • 0: Use the maximum amount of VTCM for a specific device.

  • This option can only be used along with --target_runtime qnn_context_binary

Examples:

  • --qnn_context_binary_vtcm 2

--qnn_context_binary_optimization_level <num>

Only applicable when compiling to a QNN context binary. Sets the graph optimization level, in the range 1 to 3.

  • Default: 2

  • 1: Fast preparation time, less optimal graph

  • 2: Longer preparation time, more optimal graph

  • 3: Longest preparation time, most likely even more optimal graph

  • This option can only be used along with --target_runtime qnn_context_binary

Examples:

  • --qnn_context_binary_optimization_level 3
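
A sketch combining several compile flags into one options string (model path, device name, and input spec are placeholders; with no representative dataset supplied, quantization uses randomly generated calibration data, so the result is a performance proxy only):

  import qai_hub as hub

  compile_job = hub.submit_compile_job(
      model="mobilenet_v2.pt",                   # placeholder source model
      device=hub.Device("Samsung Galaxy S23"),   # placeholder device name
      input_specs=dict(image=(1, 3, 224, 224)),  # assumed input name and shape
      options="--target_runtime tflite --quantize_full_type int8 --quantize_io",
  )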

Profile & Inference Options

Options for submit_profile_job() and submit_inference_job():

Profile option

Description

--dequantize_outputs [true|false]

Dequantize the output. Default is true. Options:

  • true

  • false

Default:

  • Output is dequantized

Examples:

  • --dequantize_outputs

--tflite_delegates <delegate>[,<delegate>]...

Specifies the TensorFlow Lite delegates to load and target. Multiple delegates can be specified. They will be used in the order specified. Specify the best delegate(s) for the device to run on the desired compute units. Options:

  • qnn

  • qnn-gpu

  • nnapi

  • nnapi-gpu

  • gpu

  • xnnpack

Default:

  • The selection of delegates is informed by the compute units specified.

Examples:

  • --tflite_delegates qnn,qnn-gpu

--tflite_options <option>[;<option>]...

Specify behavior for the TensorFlow Lite delegates. This option is specified as a semicolon-separated list of sub-options. The full set of sub-options is listed in the TensorFlow Lite Options, TensorFlow Lite Delegate Options for Qualcomm® AI Engine Direct, TensorFlow Lite Delegate Options for GPUv2, and TensorFlow Lite Delegate Options for NNAPI tables below.

Examples:

  • --tflite_options number_of_threads=4;qnn_htp_precision=kHtpQuantized

--qnn_options <option>[;<option>]...

Specify behavior when targeting a Qualcomm® AI Engine Direct model library. This option is specified as a semicolon-separated list of sub-options. The full set of sub-options is listed in the Qualcomm® AI Engine Direct Options table below.

Requirements:

  • --qnn_options is only applicable when used with --target_runtime qnn_lib_aarch64_android.

Examples:

  • --target_runtime qnn_lib_aarch64_android --qnn_options context_priority=HIGH;context_htp_performance_mode=BALANCED

--onnx_execution_providers <ep>[,<ep>]...

Specifies the ONNX Runtime execution providers to load and target. This option can be used with --target_runtime onnx. Multiple execution providers can be specified. Options:

  • qnn

  • qnn-gpu

Default:

  • The selection of execution providers is informed by the compute units specified.

Examples:

  • --onnx_execution_providers qnn,qnn-gpu
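
A sketch that targets the QNN delegate on a compiled TensorFlow Lite model and tunes it via sub-options (model and device are placeholders):

  import qai_hub as hub

  profile_job = hub.submit_profile_job(
      model="<compiled_tflite_model_or_id>",     # placeholder
      device=hub.Device("Samsung Galaxy S23"),   # placeholder device name
      options=(
          "--compute_unit npu "
          "--tflite_delegates qnn "
          "--tflite_options qnn_htp_performance_mode=kHtpBurst;number_of_threads=4"
      ),
  )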

Profile Options

Options for submit_profile_job():

Profile option

Description

--max_profiler_iterations=<num>

Specifies the maximum number of profile iterations. Fewer iterations may be run if their cumulative runtime is predicted to exceed the execution timeout defined by --max_profiler_time. Options:

  • <num> The maximum number of iterations to run.

Default:

  • 100 iterations

Examples:

  • --max_profiler_iterations=50

--max_profiler_time=<seconds>

Specifies the maximum amount of time to spend performing inference. At runtime, the number of iterations adapts to complete approximately within this limit, never exceeding the value specified by --max_profiler_iterations. For example, let’s say a profile job is submitted with --max_profiler_iterations=100 --max_profiler_time=10. This indicates that no more than 10 seconds should be spent in a loop profiling inference, allowing for up to 100 iterations if they can be completed within that time. The loop is terminated if the running average of inference times predicts that another iteration would cause the timeout to be exceeded.

  • <seconds> The soft limit of how long to spend performing inference on-device.

Requirements:

  • This value must be strictly positive and not exceed 600.

Default:

  • 600 seconds (10 minutes)

Examples:

  • --max_profiler_time=30
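
A sketch capping the profiling loop at 50 iterations or roughly 30 seconds, whichever is reached first (placeholders as before):

  import qai_hub as hub

  profile_job = hub.submit_profile_job(
      model="<compiled_model_or_id>",            # placeholder
      device=hub.Device("Samsung Galaxy S23"),   # placeholder device name
      options="--max_profiler_iterations=50 --max_profiler_time=30",
  )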

TensorFlow Lite Options

Sub-options for --tflite_options that are applicable to all delegates.

TensorFlow Lite option

Description

enable_fallback=[true|false]

Fall back to lower-priority delegates when one fails to prepare. When this is false, any delegate failure fails the entire job. Default is true.

invoke_interpreter_on_cold_load=[true|false]

Run the interpreter during cold load. This validates that delegates can execute successfully. If an error occurs, failed delegates will be eliminated and the model will be reloaded until a working configuration can be found. Subsequent loads will not re-run the interpreter. Default is false.

allow_fp32_as_fp16=[true|false]

Allow delegates to use reduced fp16 precision for fp32 operations. Delegate options may override this. Default is true.

force_opengl=[true|false]

Force the use of OpenGL when possible. This exists for debugging purposes. Default is false.

number_of_threads=<num>

Number of threads to use. Default is -1, which uses std::thread::hardware_concurrency() / 2.

release_dynamic_tensors=[true|false]

Force all intermediate dynamic tensors to be released once they are not used by the model. Please use this configuration with caution, since it might reduce the peak memory usage of the model at the cost of a slower inference speed. Default is false.

TensorFlow Lite Delegate Options for Qualcomm® AI Engine Direct

Sub-options for --tflite_options specific to the QNN delegate. For additional information on the QNN options see the QNN GPU backend and QNN HTP backend pages in the Qualcomm® AI Engine Direct Documentation. Note that not all options are available for all backends.

TfLite option (QNN)

Description

qnn_log_level=<option>

Set the log level for QNN. Default is kLogLevelWarn. Options:

  • kLogOff

  • kLogLevelError

  • kLogLevelWarn

  • kLogLevelInfo

  • kLogLevelVerbose

  • kLogLevelDebug

qnn_graph_priority=<option>

Set the underlying graph priority. Default: kQnnPriorityDefault. Options:

  • kQnnPriorityDefault

  • kQnnPriorityLow

  • kQnnPriorityNormal

  • kQnnPriorityNormalHigh

  • kQnnPriorityHigh

  • kQnnPriorityUndefined

qnn_gpu_precision=<option>

Set the precision for the GPU backend, which defines the optimization level of graph tensors that are neither input nor output tensors. Default is kGpuFp16. Options:

  • kGpuUserProvided

  • kGpuFp32

  • kGpuFp16

  • kGpuHybrid

qnn_gpu_performance_mode=<option>

Performance mode for the GPU. Higher performance consumes more power. Default is kGpuHigh. Options:

  • kGpuDefault

  • kGpuHigh

  • kGpuNormal

  • kGpuLow

qnn_dsp_performance_mode=<option>

DSP performance mode. Default is kDspBurst. Options:

  • kDspLowPowerSaver

  • kDspPowerSaver

  • kDspHighPowerSaver

  • kDspLowBalanced

  • kDspBalanced

  • kDspHighPerformance

  • kDspSustainedHighPerformance

  • kDspBurst

qnn_dsp_encoding=<option>

Specify the origin of the quantization parameters. Default is kDspStatic. Options:

  • kDspStatic: Use quantization parameters from the model.

  • kDspDynamic: Use quantization parameters from the model inputs.

qnn_htp_performance_mode=<option>

HTP performance mode. Default is kHtpBurst. Options:

  • kHtpLowPowerSaver

  • kHtpPowerSaver

  • kHtpHighPowerSaver

  • kHtpLowBalanced

  • kHtpBalanced

  • kHtpHighPerformance

  • kHtpSustainedHighPerformance

  • kHtpBurst

qnn_htp_precision=<option>

HTP precision mode. Only applicable on 8gen1 and newer. Default is kHtpFp16, when supported. Options:

  • kHtpQuantized

  • kHtpFp16

qnn_htp_optimization_strategy=<option>

Optimize for load time or inference time. Default is kHtpOptimizeForInference.

  • kHtpOptimizeForInference

  • kHtpOptimizeForPrepare

qnn_htp_use_conv_hmx=[true|false]

Using HMX instructions for convolutions may lead to better performance. However, convolutions with short depth and/or asymmetric weights could produce inaccurate results. Default is true.

qnn_htp_use_fold_relu=[true|false]

Folding ReLU may lead to better performance. For correct results, the quantization range of the convolution must be equal to or a subset of the ReLU operation's range. Default is false.

qnn_htp_vtcm_size=<size>

Set the VTCM size in MB. If not set, the default VTCM size is used. The value must be greater than 0; if it exceeds the VTCM available on the device, it is clamped to the VTCM size available for that device.

qnn_htp_num_hvx_threads=<num>

Set the number of HVX threads. The value must be greater than 0; if it exceeds the maximum number of HVX threads supported, it is clipped to that maximum.

TensorFlow Lite Delegate Options for GPUv2

Sub-options for --tflite_options specific to the GPUv2 delegate. The GPUv2 delegate is chosen with --tflite_delegates gpu. For additional information on the TensorFlow Lite GPU options see delegate_options.h.

TfLite option (GPUv2)

Description

gpu_inference_preference=<option>

Specify the compilation/runtime trade-off. Default is TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED. Options:

  • TFLITE_GPU_INFERENCE_PREFERENCE_FAST_SINGLE_ANSWER: The delegate will be used only once, therefore, bootstrap/init time should be taken into account.

  • TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED: Prefer maximizing the throughput. The same delegate will be used repeatedly on multiple inputs.

  • TFLITE_GPU_INFERENCE_PREFERENCE_BALANCED: Balance init latency and throughput. This option will result in slightly higher init latency than TFLITE_GPU_INFERENCE_PREFERENCE_FAST_SINGLE_ANSWER but should have inference latency closer to TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED.

gpu_inference_priority1=<option>

gpu_inference_priority2=<option>

gpu_inference_priority3=<option>

Set the top ordered priorities.

Default for priority1 is TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY. Default for priority2 is TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE. Default for priority3 is TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION. To specify maximum precision, use the following set for priority1-priority3: TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION, TFLITE_GPU_INFERENCE_PRIORITY_AUTO, TFLITE_GPU_INFERENCE_PRIORITY_AUTO. Options:

  • TFLITE_GPU_INFERENCE_PRIORITY_AUTO

  • TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION

  • TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY

  • TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE

gpu_max_delegated_partitions=<num>

A graph could have multiple partitions that can be delegated. This limits the maximum number of partitions to be delegated. Default is 1.

TensorFlow Lite Delegate Options for NNAPI

Sub-options for --tflite_options specific to the NNAPI delegate. For additional information on the NNAPI specific options see nnapi_delegate_c_api.h.

TensorFlow Lite option (NNAPI)

Description

nnapi_execution_preference=<option>

Power/performance trade-off. Default is kSustainedSpeed. Options:

  • kLowPower

  • kFastSingleAnswer

  • kSustainedSpeed

nnapi_max_number_delegated_partitions=<max>

Set the maximum number of partitions. A value less than or equal to zero means no limit. If delegating the full set of supported nodes would generate more partitions than this parameter allows, only <max> of them will actually be accelerated. Partitions are currently selected by sorting them in decreasing order of node count until the limit is reached. Default is 3.

nnapi_allow_fp16=[true|false]

Allow fp32 to run as fp16. Default is true.

Qualcomm® AI Engine Direct Options

Sub-options for --qnn_options. These options are applicable when the target runtime is qnn_lib_aarch64_android. For additional information on the Qualcomm® AI Engine Direct options see the QNN GPU backend and QNN HTP backend pages in the Qualcomm® AI Engine Direct Documentation. Note that not all options are available for all backends.

Note that the Qualcomm® AI Engine Direct SDK does not guarantee model libraries will be ABI compatible with all versions of the SDK. Using a model library with a newer SDK may result in undefined behavior. If you experience an error using an older model library with AI Hub, you may need to recompile the model library. The version of the QNN SDK used is visible on the AI Hub Compile Job details page.

Qualcomm® AI Engine Direct option

Description

context_async_execution_queue_depth_numeric=<num>

Specify asynchronous execution queue depth.

context_enable_graphs=<name>[,<name>]...

The names of the graphs to deserialize from a context binary. All graphs are enabled by default. An error is generated if an invalid graph name is provided.

context_error_reporting_options_level=<option>

Sets the error reporting level. Options:

  • BRIEF

  • DETAILED

context_error_reporting_options_storage_limit=<kb>

Amount of memory to be reserved for error information. Specified in KB.

context_memory_limit_hint=<mb>

Sets the peak memory limit hint of a deserialized context in MB.

context_priority=<option>

Sets the default priority for graphs in this context. Options:

  • LOW

  • NORMAL

  • NORMAL_HIGH

  • HIGH

context_gpu_performance_hint=<option>

Set the GPU performance hint. Default is HIGH. Options:

  • HIGH: Maximize GPU clock frequencies.

  • NORMAL: Balance performance dependent upon power management.

  • LOW: Use lowest power consumption at the expense of inference latency.

context_gpu_use_gl_buffers=[true|false]

OpenGL buffers will be used if set to true.

context_htp_performance_mode=<option>

Set the HTP performance mode. Default is BURST. Options:

  • EXTREME_POWER_SAVER

  • LOW_POWER_SAVER

  • POWER_SAVER

  • HIGH_POWER_SAVER

  • LOW_BALANCED

  • BALANCED

  • HIGH_PERFORMANCE

  • SUSTAINED_HIGH_PERFORMANCE

  • BURST

default_graph_priority=<option>

Options:

  • LOW

  • NORMAL

  • NORMAL_HIGH

  • HIGH

default_graph_gpu_precision=<option>

Specify the precision mode to use. Default is USER_PROVIDED. Options:

  • FLOAT32: Convert tensor data types to FP32 and select kernels that use an FP32 accumulator.

  • FLOAT16: Convert tensor data types to FP16 and select kernels that use an FP16 accumulator where possible.

  • HYBRID: Convert tensor data types to FP16 and select kernels that use an FP32 accumulator.

  • USER_PROVIDED: Do not optimize tensor data types.

default_graph_gpu_disable_memory_optimizations=[true|false]

When true, each tensor in the model will be allocated unique memory and sharing is disabled.

default_graph_gpu_disable_node_optimizations=[true|false]

When true, operations will not be fused and will be kept separate.

default_graph_gpu_disable_queue_recording=[true|false]

The QNN GPU backend will use queue recording to improve performance. When true, queue recording is disabled.

default_graph_htp_disable_fold_relu_activation_into_conv=[true|false]

For any graph where a convolution or convolution-like operation is followed by Relu or ReluMinMax, the Relu is folded into the convolution operation. Default is false.

default_graph_htp_num_hvx_threads=<num>

Define the number of HVX threads to reserve and utilize for a particular graph. Default is 4.

default_graph_htp_optimization_type=<option>

Set an optimization level to balance prepare and execution. Options:

  • FINALIZE_OPTIMIZATION_FLAG

default_graph_htp_optimization_value=<num>

Specify in combination with default_graph_htp_optimization_type. Options for <num>:

  • 1: Faster preparation time, less optimal graph.

  • 2: Longer preparation time, more optimal graph

  • 3: Longest preparation time, most likely even more optimal graph.

default_graph_htp_precision=<option>

If no precision value is set, the QNN HTP backend assumes that the client expects to run a quantized network. When the precision value is set to FLOAT16, the QNN HTP backend will convert user provided float32 inputs to float16 and execute the graph with float16 math. Options:

  • FLOAT16

default_graph_htp_disable_short_depth_conv_on_hmx=[true|false]

Run all Convolution operations using HMX instructions. Default is false.

default_graph_htp_vtcm_size=<mb>

Set the amount of VTCM memory (in MB) to reserve and utilize. Specify 0 to use the maximum amount. Default is 4.