Running Inference

Running any model on mobile and edge devices with specialized hardware may differ from running it in its reference environment. For instance, while your PyTorch implementation runs inference at float32 precision, the target hardware may run computations in float16 or even int8. This can lead to numerical discrepancies, as well as underflows and overflows. Whether this has a detrimental effect on your results depends on your model and your data distribution.
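
As a quick illustration of the overflow and underflow risk, float16 cannot represent magnitudes above roughly 65504 or below its smallest subnormal of about 6e-8, so values that are unremarkable in float32 can saturate or vanish at lower precision. A minimal NumPy sketch, independent of any particular runtime:

import numpy as np

print(np.float32(1.0e5).astype(np.float16))   # inf: overflows float16's ~65504 maximum
print(np.float32(1.0e-8).astype(np.float16))  # 0.0: underflows float16's smallest subnormal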

Inference jobs provide a way to upload input data, run inference on real hardware, and download the output results. By comparing these results directly to your reference implementation, you can determine whether the optimized model works as expected. Inference is only supported for optimized models: models in source formats such as PyTorch and ONNX must first be compiled with submit_compile_job() or a similar API.

Running Inference with a TensorFlow Lite model

This example runs inference with a TensorFlow Lite model, SqueezeNet10.tflite.

import numpy as np

import qai_hub as hub

sample = np.random.random((1, 224, 224, 3)).astype(np.float32)


inference_job = hub.submit_inference_job(
    model="SqueezeNet10.tflite",
    device=hub.Device("Samsung Galaxy S23 Ultra"),
    inputs=dict(x=[sample]),
)

assert isinstance(inference_job, hub.InferenceJob)
inference_job.download_output_data()

  • The inputs for inference must be a dictionary whose keys are the feature names and whose values are the tensors. Each value can be either a list of numpy arrays or a single numpy array if there is only one data point, as shown in the sketch after this list.

  • inference_job is an instance of InferenceJob.
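
For example, both of the following forms are valid inputs for the model above; this is a minimal sketch reusing the sample array from the snippet above:

inputs_single = dict(x=sample)              # a single numpy array for one data point
inputs_multiple = dict(x=[sample, sample])  # a list of numpy arrays for several data points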

Multiple inference jobs can be launched at the same time by providing a list of Device objects to the submit_inference_job() API.
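
For instance, the sketch below targets two of the devices used elsewhere in this section in a single call; with multiple devices, the API typically returns one InferenceJob per device (this snippet reuses the SqueezeNet10.tflite model and sample input from above):

inference_jobs = hub.submit_inference_job(
    model="SqueezeNet10.tflite",
    device=[
        hub.Device("Samsung Galaxy S23"),
        hub.Device("Samsung Galaxy S23 Ultra"),
    ],
    inputs=dict(x=[sample]),
)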

Running Inference with a QNN Model Library & QNN Context Binary

This example compiles a TorchScript model (mobilenet_v2.pt) to either QNN model library or QNN context binary format, then runs inference on device with the compiled target model.

import numpy as np

import qai_hub as hub

sample = np.random.random((1, 3, 224, 224)).astype(np.float32)

compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23"),
    options="--target_runtime qnn_lib_aarch64_android",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23 Ultra"),
    inputs=dict(image=[sample]),
)
assert isinstance(inference_job, hub.InferenceJob)

The same workflow applies when targeting a QNN context binary instead of a model library:

import numpy as np

import qai_hub as hub

input_shape = (1, 3, 224, 224)
sample = np.random.random(input_shape).astype(np.float32)

compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23"),
    options="--target_runtime qnn_context_binary",
    input_specs=dict(image=input_shape),
)
assert isinstance(compile_job, hub.CompileJob)

inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23 Ultra"),
    inputs=dict(image=[sample]),
)
assert isinstance(inference_job, hub.InferenceJob)

Verify model accuracy on-device using an inference job

This example demonstrates how to validate the numerics of a QNN Model Library model on-device.

We reuse the model from the profiling example (mobilenet_v2.pt):

from typing import Dict, List
import torch
import qai_hub as hub

device_s23 = hub.Device(name="Samsung Galaxy S23 Ultra")
compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=device_s23,
    input_specs={"x": (1, 3, 224, 224)},
    options="--target_runtime qnn_lib_aarch64_android",
)

assert isinstance(compile_job, hub.CompileJob)
on_device_model = compile_job.get_target_model()

We can use this optimized .so model to run inference with real input data on a specific device. The input image used in this example, input_image1.jpg, can be downloaded from:

https://qaihub-public-assets.s3.us-west-2.amazonaws.com/apidoc/input_image1.jpg
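
For instance, you can fetch it with Python's standard library (a minimal sketch; any download method works):

import urllib.request

urllib.request.urlretrieve(
    "https://qaihub-public-assets.s3.us-west-2.amazonaws.com/apidoc/input_image1.jpg",
    "input_image1.jpg",
)
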
import numpy as np
from PIL import Image
# Convert the image to a numpy array of shape [1, 3, 224, 224]
image = Image.open("input_image1.jpg").resize((224, 224))
img_array = np.array(image, dtype=np.float32)

# Ensure correct layout (NCHW) and re-scale
input_array = np.expand_dims(np.transpose(img_array / 255.0, (2, 0, 1)), axis=0)

# Run inference using the on-device model on the input image
inference_job = hub.submit_inference_job(
    model=on_device_model,
    device=device_s23,
    inputs=dict(x=[input_array]),
)
assert isinstance(inference_job, hub.InferenceJob)

We can use this raw on-device output to generate class predictions and compare them to the reference implementation. You will need the ImageNet class labels for this, provided in imagenet_classes.txt.
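
If you do not already have a copy of the label file, one widely used version is published in the PyTorch hub repository; the URL below is an assumption about where that copy lives and is not part of AI Hub:

import urllib.request

urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt",  # assumed location
    "imagenet_classes.txt",
)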


# Get the on-device output
on_device_output: Dict[str, List[np.ndarray]] = inference_job.download_output_data()  # type: ignore

# Load the torch model and perform inference
torch_model = torch.jit.load("mobilenet_v2.pt")
torch_model.eval()

# Calculate probabilities for torch model
torch_input = torch.from_numpy(input_array)
torch_output = torch_model(torch_input)
torch_probabilities = torch.nn.functional.softmax(torch_output[0], dim=0)

# Calculate probabilities for the on-device output
output_name = list(on_device_output.keys())[0]
out = on_device_output[output_name][0]
on_device_probabilities = np.exp(out) / np.sum(np.exp(out), axis=1)

# Read the class labels for imagenet
with open("imagenet_classes.txt", "r") as f:
    categories = [s.strip() for s in f.readlines()]

# Print top five predictions for the on-device model
print("Top-5 On-Device predictions:")
top5_classes = np.argsort(on_device_probabilities[0], axis=0)[-5:]
for c in reversed(top5_classes):
    print(f"{c} {categories[c]:20s} {on_device_probabilities[0][c]:>6.1%}")

# Print top five prediction for torch model
print("Top-5 PyTorch predictions:")
top5_prob, top5_catid = torch.topk(torch_probabilities, 5)
for i in range(top5_prob.size(0)):
    print(
        f"{top5_catid[i]:4d} {categories[top5_catid[i]]:20s} {top5_prob[i].item():>6.1%}"
    )

The code above produces results that look like this:

Top-5 On-Device predictions:
968 cup                   71.3%
504 coffee mug            16.4%
967 espresso               7.8%
809 soup bowl              1.3%
659 mixing bowl            1.2%

Top-5 PyTorch predictions:
968 cup                   71.4%
504 coffee mug            16.1%
967 espresso               8.0%
809 soup bowl              1.4%
659 mixing bowl            1.2%

The on-device results are nearly equivalent to the reference implementation. This tells us the model did not suffer from a correctness regression and gives us confidence it will behave as expected once deployed.

To strengthen this confidence, consider extending the comparison to several images and using quantitative summaries, such as KL divergence or accuracy (if the labels are known). This also makes it easier to validate the model across all of your target devices.
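
As one example of such a summary, the KL divergence between the reference and on-device class probabilities from the snippet above can be computed directly with NumPy; this is a minimal sketch that, in practice, you would average over a representative set of images:

# KL(reference || on-device) over the ImageNet classes
p = torch_probabilities.detach().numpy()  # reference probabilities, shape (1000,)
q = on_device_probabilities[0]            # on-device probabilities, shape (1000,)
eps = 1e-12                               # guard against log(0)
kl_divergence = np.sum(p * np.log((p + eps) / (q + eps)))
print(f"KL divergence: {kl_divergence:.6f}")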

Running inference using a previously uploaded dataset and model

Analogous to models, AI Hub exposes an API that lets users upload data once and reuse it across jobs.

import numpy as np

import qai_hub as hub

data = dict(
    x=[
        np.random.random((1, 224, 224, 3)).astype(np.float32),
        np.random.random((1, 224, 224, 3)).astype(np.float32),
    ]
)
hub_dataset = hub.upload_dataset(data)

You can now run an inference job using the uploaded dataset. This example uses SqueezeNet10.tflite.

# Submit job
job = hub.submit_inference_job(
    model="SqueezeNet10.tflite",
    device=hub.Device("Samsung Galaxy S23 Ultra"),
    inputs=hub_dataset,
)
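
As in the earlier examples, the outputs can be downloaded once the job completes:

assert isinstance(job, hub.InferenceJob)
output_data = job.download_output_data()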