PyCon India 2025

Optimising Deep Neural Inference for Edge Devices: Tools, Pipelines, and Techniques
2025-09-12, Room 5

Deploying real-time AI models on embedded Linux platforms such as the Raspberry Pi, Jetson Nano, or Rockchip-based boards is a growing need in industries like manufacturing, healthcare, and automotive. The challenges, however, are real: constrained compute, tight memory, and limited power budgets. This hands-on workshop walks you through the full lifecycle—designing, optimising, cross-compiling, and deploying lightweight CNNs for inference at the edge.
Participants will start with a base CNN (e.g., MobileNet or ShuffleNet), apply model compression techniques like pruning and quantisation, and then learn how to build optimised deployment pipelines using TensorFlow Lite and PyTorch Mobile. We'll also touch upon using NPU accelerators and real-time profiling to hit performance targets. By the end, participants will be able to deploy and benchmark a real model on an embedded Linux system.


Introduction - Motivation: Real-world use cases (smart cameras, IoT sensors, etc.)
- Overview of embedded platforms (Jetson, Rockchip, Pi)

Model Selection and Design - Lightweight CNNs: MobileNet, ShuffleNet, SqueezeNet
- Trade-offs: Accuracy vs. Latency vs. Size
- Hands-on: Load and inspect model performance
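To make the size trade-off concrete, here is a back-of-the-envelope sketch of weight memory per model. The parameter counts below are approximate published figures, not measurements from the workshop models:

```python
def model_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Estimate in-memory weight size in MiB (float32 = 4 bytes/param)."""
    return num_params * bytes_per_param / (1024 ** 2)

# Rough parameter counts for the lightweight CNNs discussed above
models = {
    "MobileNetV2": 3_500_000,
    "ShuffleNetV2 (1.0x)": 2_300_000,
    "SqueezeNet 1.1": 1_200_000,
}

for name, params in models.items():
    fp32 = model_size_mb(params)       # float32 weights
    int8 = model_size_mb(params, 1)    # after int8 quantisation
    print(f"{name}: {fp32:.1f} MiB fp32 -> {int8:.1f} MiB int8")
```

This is why quantisation matters on boards with a few hundred MB of usable RAM: the weight footprint alone shrinks roughly 4x going from float32 to int8.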

Model Optimisation Techniques - Quantisation (Post-training, Aware Training)
- Pruning & Knowledge Distillation
- Tools: PyTorch/TF model optimisation toolkits
- Hands-on: Apply optimisations to a CNN
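As an illustration of what post-training quantisation does under the hood, here is a minimal NumPy sketch of symmetric per-tensor int8 quantisation. Real toolkits (TFLite converter, PyTorch's quantisation API) additionally handle per-channel scales, activation ranges, and calibration data; this is only the core weight-mapping idea:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantisation: w ~= scale * q."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32) * 0.1
q, scale = quantize_int8(w)

# Round-trip error is bounded by half a quantisation step
err = np.abs(w - dequantize(q, scale)).max()
assert err <= scale / 2 + 1e-6
```

The bound on the final line is exactly the accuracy/size trade-off the workshop explores: a coarser scale (fewer distinct values) means a smaller, faster model but a larger worst-case weight error.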

Cross-Compiling & Toolchains - Toolchains for ARM: arm-linux-gnueabihf (32-bit), aarch64-linux-gnu (64-bit), Buildroot
- Docker-based emulation
- Hands-on: Compile an optimised TFLite model for ARM
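The Docker-based emulation step can be sketched as below. This is a hypothetical setup, not the workshop's exact commands: the `multiarch/qemu-user-static` and `arm64v8/debian` images are real, but the model and benchmark binary paths under `models/` are placeholders you would substitute with your own artefacts:

```shell
# Register QEMU binfmt handlers so the x86 host can execute aarch64 binaries
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Run an ARM64 container and invoke a (pre-built) TFLite benchmark binary
# against the quantised model produced in the previous section
docker run --rm --platform linux/arm64 \
    -v "$PWD/models:/models" \
    arm64v8/debian:bookworm \
    /models/benchmark_model --graph=/models/mobilenet_v2_int8.tflite
```

Emulated runs are useful for verifying that the binary and model load correctly on aarch64; latency numbers under QEMU are not representative of real hardware.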

Runtime & Deployment Pipelines - Frameworks: TensorFlow Lite, PyTorch Mobile
- Hardware acceleration: Coral Edge TPU, NPU on RK3588, Jetson's TensorRT
- Hands-on: Deploy to device or emulated ARM board
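Regardless of framework, a deployment pipeline has the same three stages: preprocess a frame, invoke the runtime, postprocess the output. The framework-agnostic skeleton below uses a stub in place of the runtime call (in the real pipeline, `invoke` would wrap a TFLite `Interpreter` or a PyTorch Mobile module; the stub "model" here is purely illustrative):

```python
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Normalise an HxWx3 uint8 frame and add a batch dimension."""
    x = frame.astype(np.float32) / 255.0
    return x[np.newaxis, ...]

def invoke(x: np.ndarray) -> np.ndarray:
    """Stub for the runtime call (e.g. interpreter.invoke() in TFLite).

    This placeholder just averages each channel so the pipeline runs
    end-to-end without a real model.
    """
    return x.mean(axis=(1, 2))

def postprocess(logits: np.ndarray) -> int:
    """Return the index of the highest-scoring class."""
    return int(np.argmax(logits, axis=-1)[0])

# A green-dominant frame should pick class index 1 (the G channel)
frame = np.zeros((224, 224, 3), dtype=np.uint8)
frame[..., 1] = 255
assert postprocess(invoke(preprocess(frame))) == 1
```

Keeping these stages as separate functions is what lets the hands-on session swap the CPU interpreter for an NPU-delegated one without touching the pre/post-processing code.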

Performance Benchmarking & Profiling - Tools: perf, htop, nvprof, TFLite Benchmark Tool
- Scheduling & resource management
- Hands-on: Profile and optimise inference latency

Wrap-up - Recap + Resources
- Where to go from here (TinyML, ONNX Runtime, AutoML for edge)

Case Study: Real-Time Object Identification for Drones (Jetson Nano / Raspberry Pi + NPU)

In many drone-based applications—such as agriculture, search and rescue, and surveillance—there’s a need for real-time object identification directly on the drone to avoid latency and connectivity issues associated with cloud inference.
Scenario
A lightweight drone is equipped with a Raspberry Pi 4B or Jetson Nano and a camera module. The goal is to identify specific objects on the ground—e.g., vehicles, people, or crops—using a real-time, optimised model running locally.

Problem
- The drone has constrained compute and battery capacity.
- The model must operate under real-time constraints (<50ms inference time).
- No internet connection during flight; the model must run fully offline.

Solution Covered in Workshop
Participants will replicate a scaled-down version of this use case:
- Use MobileNetV2 or TinyYOLO for object detection.
- Apply quantisation-aware training to reduce model size and power usage.
- Deploy the model using TensorFlow Lite or ONNX Runtime with NPU acceleration (if hardware available).
- Profile latency and FPS during emulated inference.
- Use frame-skipping and scheduling strategies to meet power constraints.
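One frame-skipping strategy can be sketched as a simple duty-cycle gate (the class name and interface here are illustrative, not from an existing library): the camera keeps producing frames, but inference only fires at a capped rate, trading detection freshness for power.

```python
import time

class FrameSkipper:
    """Allow inference at most once per period_s; skip frames in between.

    With a 30 fps camera and period_s=0.1, roughly 10 frames/s reach the
    model, cutting compute (and power draw) by about two thirds.
    """

    def __init__(self, period_s: float, clock=time.monotonic):
        self.period_s = period_s
        self.clock = clock          # injectable for testing
        self._next = 0.0

    def should_infer(self) -> bool:
        now = self.clock()
        if now >= self._next:
            self._next = now + self.period_s
            return True
        return False

# Usage inside the capture loop:
#   skipper = FrameSkipper(period_s=0.1)
#   for frame in camera:
#       if skipper.should_infer():
#           run_inference(frame)   # hypothetical inference call
```

Injecting the clock keeps the scheduling logic testable without a camera, and the same gate can be driven by a battery-level signal to throttle harder as charge drops.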
Hands-on Highlights
- Load sample aerial imagery or simulated video feed.
- Test an optimised model on a Jetson Nano or Pi.
- Compare CPU-only vs. NPU-accelerated inference times.


Target Audience

Intermediate

Prerequisites

Prior knowledge of ML and object detection methods is required.

Specialises in computer vision, time-series forecasting, and scalable MLOps frameworks with over five years of experience in ML engineering.
Develops and deploys production-grade ML solutions in automotive, telematics, and clean-tech domains.
Leads real-time inferencing platform development and ADAS calibration model deployment at Lytx, scaling across 100,000+ devices.
Designed predictive maintenance and anomaly detection models at Nunam Technologies, building MLOps pipelines with Kubeflow, KServe, and MLflow on AWS.
Contributes to open-source projects such as MLPerf-Tiny and delivers technical talks at major ML conferences.