Running Inference
Running a model on mobile and edge devices with specialized hardware may differ from running it in its reference environment. For instance, while your PyTorch implementation runs inference at float32 precision, the target hardware may run computations in float16 or even int8. This can lead to numerical discrepancies, as well as the possibility of underflows and overflows. Whether this has a detrimental effect on your results depends on your model and your data distribution.
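As a minimal illustration of the effect (plain numpy, nothing Hub-specific), casting values to float16 shows both the rounding error and the overflow risk mentioned above:

import numpy as np

# Reduced precision rounds values and can overflow where float32 would not.
x = np.float32(0.1)
print(np.float16(x) - x)   # small rounding error introduced by the cast
print(np.float16(7e4))     # inf: 70000 exceeds the float16 range (~65504)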
Inference jobs provide you with a way to upload input data, run inference on real hardware, and download the output results. By comparing these results directly to your reference implementation, you can determine whether the optimized model works as expected. Inference is only supported for optimized models: models in source formats such as PyTorch and ONNX must first be compiled with submit_compile_job() or similar.
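For example, once the on-device output has been downloaded, it can be compared element-wise against the reference output. A minimal sketch, using placeholder arrays in place of the real reference and on-device results:

import numpy as np

# Placeholder arrays: in practice these come from your reference run and
# from InferenceJob.download_output_data(), respectively.
reference = np.random.random((1, 1000)).astype(np.float32)
on_device = reference + np.float32(1e-4) * np.random.standard_normal((1, 1000)).astype(np.float32)

print("within tolerance:", np.allclose(reference, on_device, atol=1e-3))
print("max abs difference:", np.abs(reference - on_device).max())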
Running Inference with a TensorFlow Lite model
This example uses a TensorFlow Lite model, SqueezeNet10.tflite, to run inference.
import numpy as np
import qai_hub as hub

sample = np.random.random((1, 224, 224, 3)).astype(np.float32)

inference_job = hub.submit_inference_job(
    model="SqueezeNet10.tflite",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    inputs=dict(x=[sample]),
)
assert isinstance(inference_job, hub.InferenceJob)
inference_job.download_output_data()
The inputs for inference must be a dictionary whose keys are the feature names and whose values are the tensors. Each value can be a list of numpy arrays, or a single numpy array if there is only one data point.
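For instance, both forms below describe the same single data point, and several data points are simply additional list entries (the feature name x follows the example above):

import numpy as np

sample = np.random.random((1, 224, 224, 3)).astype(np.float32)

# A list of arrays: one entry per data point.
inputs_as_list = dict(x=[sample])

# A single array: shorthand when there is only one data point.
inputs_as_array = dict(x=sample)

# Several data points for the same feature.
inputs_batch = dict(x=[sample, np.random.random((1, 224, 224, 3)).astype(np.float32)])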
inference_job is an instance of InferenceJob.
Multiple inference jobs can be launched at the same time by providing a list of Device objects to the submit_inference_job() API.
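For instance, a sketch that fans the same inputs out to two device families in one call; based on the statement above, the return value would then be one job per device:

import numpy as np
import qai_hub as hub

sample = np.random.random((1, 224, 224, 3)).astype(np.float32)

# One submission, several devices: a job is launched for each device in the list.
jobs = hub.submit_inference_job(
    model="SqueezeNet10.tflite",
    device=[
        hub.Device("Samsung Galaxy S23 (Family)"),
        hub.Device("Samsung Galaxy S24 (Family)"),
    ],
    inputs=dict(x=[sample]),
)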
Running Inference with a QNN Model Library & QNN Context Binary
This example compiles a TorchScript model (mobilenet_v2.pt) to QNN model library or QNN context binary format, then runs inference on device with the compiled target model. The first snippet targets a QNN model library; the second targets a QNN context binary.
import numpy as np
import qai_hub as hub

sample = np.random.random((1, 3, 224, 224)).astype(np.float32)

compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_lib_aarch64_android",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    inputs=dict(image=[sample]),
)
assert isinstance(inference_job, hub.InferenceJob)
The same workflow targets a QNN context binary by changing the --target_runtime option:

import numpy as np
import qai_hub as hub

input_shape = (1, 3, 224, 224)
sample = np.random.random(input_shape).astype(np.float32)

compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_context_binary",
    input_specs=dict(image=input_shape),
)
assert isinstance(compile_job, hub.CompileJob)

inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    inputs=dict(image=[sample]),
)
assert isinstance(inference_job, hub.InferenceJob)
Verify model accuracy on-device using an inference job
This example demonstrates how to validate the numerics of a QNN Model Library model on-device.
We reuse the model from the profiling example (mobilenet_v2.pt):
from typing import Dict, List

import torch

import qai_hub as hub

device_s23 = hub.Device(name="Samsung Galaxy S23 (Family)")

compile_job = hub.submit_compile_job(
    model="mobilenet_v2.pt",
    device=device_s23,
    input_specs={"x": (1, 3, 224, 224)},
    options="--target_runtime qnn_lib_aarch64_android",
)
assert isinstance(compile_job, hub.CompileJob)
on_device_model = compile_job.get_target_model()
We can use this optimized .so model and run inference on it with input data on a specific device. The input image used in this example can be downloaded: input_image1.jpg.
import numpy as np
from PIL import Image

# Convert the image to a numpy array of shape [1, 3, 224, 224]
image = Image.open("input_image1.jpg").resize((224, 224))
img_array = np.array(image, dtype=np.float32)

# Ensure correct layout (NCHW) and re-scale
input_array = np.expand_dims(np.transpose(img_array / 255.0, (2, 0, 1)), axis=0)

# Run inference using the on-device model on the input image
inference_job = hub.submit_inference_job(
    model=on_device_model,
    device=device_s23,
    inputs=dict(x=[input_array]),
)
assert isinstance(inference_job, hub.InferenceJob)
We can use this raw on-device output to generate class predictions and compare them to the reference implementation. You'll need the ImageNet class labels for this: imagenet_classes.txt.
# Get the on-device output
on_device_output: Dict[str, List[np.ndarray]] = inference_job.download_output_data()  # type: ignore

# Load the torch model and perform inference
torch_model = torch.jit.load("mobilenet_v2.pt")
torch_model.eval()

# Calculate probabilities for the torch model
torch_input = torch.from_numpy(input_array)
torch_output = torch_model(torch_input)
torch_probabilities = torch.nn.functional.softmax(torch_output[0], dim=0)

# Calculate probabilities for the on-device output
output_name = list(on_device_output.keys())[0]
out = on_device_output[output_name][0]
on_device_probabilities = np.exp(out) / np.sum(np.exp(out), axis=1)

# Read the class labels for imagenet
with open("imagenet_classes.txt", "r") as f:
    categories = [s.strip() for s in f.readlines()]

# Print top five predictions for the on-device model
print("Top-5 On-Device predictions:")
top5_classes = np.argsort(on_device_probabilities[0], axis=0)[-5:]
for c in reversed(top5_classes):
    print(f"{c} {categories[c]:20s} {on_device_probabilities[0][c]:>6.1%}")

# Print top five predictions for the torch model
print("Top-5 PyTorch predictions:")
top5_prob, top5_catid = torch.topk(torch_probabilities, 5)
for i in range(top5_prob.size(0)):
    print(
        f"{top5_catid[i]:4d} {categories[top5_catid[i]]:20s} {top5_prob[i].item():>6.1%}"
    )
The code above produces results that look like this:
Top-5 On-Device predictions:
968 cup 71.3%
504 coffee mug 16.4%
967 espresso 7.8%
809 soup bowl 1.3%
659 mixing bowl 1.2%
Top-5 PyTorch predictions:
968 cup 71.4%
504 coffee mug 16.1%
967 espresso 8.0%
809 soup bowl 1.4%
659 mixing bowl 1.2%
The on-device results are nearly equivalent to the reference implementation. This tells us the model did not suffer from a correctness regression and gives us confidence it will behave as expected once deployed.
To strengthen this confidence, consider expanding this check to several images and using quantitative summaries, such as measuring KL divergence or comparing accuracy (if the labels are known). This also makes it easier to validate across all of your target devices.
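As a rough sketch of such a summary, the KL divergence between the two probability vectors computed above can be measured directly (reusing torch_probabilities and on_device_probabilities from the snippet above):

# KL divergence between the reference (PyTorch) distribution and the
# on-device distribution, reusing the arrays computed above.
eps = 1e-12
p = torch_probabilities.detach().numpy()   # reference probabilities, shape (1000,)
q = on_device_probabilities[0]             # on-device probabilities, shape (1000,)
kl = np.sum(p * np.log((p + eps) / (q + eps)))
print(f"KL(reference || on-device): {kl:.6f}")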
Running inference using a previously uploaded dataset and model
Analogous to models, AI Hub exposes an API that lets you upload data once and reuse it across jobs.
import numpy as np
import qai_hub as hub

data = dict(
    x=[
        np.random.random((1, 224, 224, 3)).astype(np.float32),
        np.random.random((1, 224, 224, 3)).astype(np.float32),
    ]
)
hub_dataset = hub.upload_dataset(data)
You can now run an inference job using the uploaded dataset. This example uses SqueezeNet10.tflite.
# Submit job
job = hub.submit_inference_job(
    model="SqueezeNet10.tflite",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    inputs=hub_dataset,
)
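An uploaded dataset can also be reused in a later session; a minimal sketch, assuming the client exposes hub.get_dataset() and a dataset_id attribute on the uploaded dataset:

# Look the dataset up by its ID in a later session and reuse it
# (hub.get_dataset() and hub_dataset.dataset_id are assumptions here).
dataset = hub.get_dataset(hub_dataset.dataset_id)

job = hub.submit_inference_job(
    model="SqueezeNet10.tflite",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    inputs=dataset,
)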