edge ai on stm32 — comparing stm32n6570 and stm32mp257f
Comparative Edge AI project focused on deploying face and gesture recognition pipelines on STM32N6570-DK and STM32MP257F-DK2, with attention to latency, deployment complexity, and embedded constraints.
system_log // edge_ai_stm32
This project was a Master 2 academic team project focused on evaluating embedded AI deployment on two recent STMicroelectronics platforms: the STM32N6570-DK microcontroller and the STM32MP257F-DK2 microprocessor.
The goal was to compare two Edge AI execution strategies on real hardware through computer vision applications, while measuring practical constraints such as latency, runtime behavior, memory pressure, integration effort, and overall robustness.
mission_scope
We worked on two main application families:
- face detection and face recognition
- hand gesture recognition, counting 0 to 5 raised fingers
The project was designed as a comparative study between:
- an MCU-oriented approach with strong embedded constraints and NPU acceleration,
- an MPU-oriented approach offering more software flexibility through embedded Linux.
The initial specification targeted strong functional performance, including face detection precision above 90% and latency below 200 ms.
hardware_targets
stm32n6570-dk
The STM32N6570-DK is built around an Arm Cortex-M55 running at 800 MHz and integrates ST’s Neural-ART accelerator. It provides 4.2 MB of internal SRAM and targets constrained, low-power, real-time embedded execution.
This platform represents the microcontroller side of Edge AI: tighter memory, lower-level integration, and stronger emphasis on optimization.
stm32mp257f-dk2
The STM32MP257F-DK2 follows a different philosophy. It combines a Linux-capable MPU architecture with a more flexible software environment, external memory support, and a more powerful execution context for AI workloads.
This platform represents the microprocessor side of Edge AI: richer software tooling, easier high-level development, but also greater system complexity.
software_pipeline
The project relied on ST’s embedded AI ecosystem to move models from training environments to target hardware.
For the MCU workflow, the main toolchain included:
- STM32CubeMX
- STM32CubeIDE
- STM32CubeProgrammer
- X-CUBE-AI / ST Edge AI tools
For the MPU workflow, the project used:
- OpenSTLinux
- Python
- OpenCV
- ONNX Runtime
- VSINPUExecutionProvider for NPU acceleration when supported
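The provider list passed to ONNX Runtime determines whether inference runs on the NPU or silently falls back to CPU. The helper below is a hypothetical sketch (`select_providers` is not from the project code); only the provider names are real ONNX Runtime identifiers:

```python
def select_providers(available):
    """Prefer the VeriSilicon NPU provider, falling back to CPU execution."""
    preferred = ["VSINPUExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

# On target, this would be used roughly as follows (not executed here):
# import onnxruntime as ort
# session = ort.InferenceSession(
#     "model.onnx",
#     providers=select_providers(ort.get_available_providers()),
# )
```

Logging the providers actually selected at startup is a cheap way to catch the silent-CPU-fallback case during benchmarking.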
The deployment pipeline followed the usual sequence:
- train or select a model,
- export it in a compatible format,
- quantize and optimize it,
- integrate it on target,
- benchmark inference and validate behavior in real conditions.
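The quantization step in this sequence typically reduces weights to 8-bit integers. As an illustration of the underlying arithmetic (a generic sketch, not the exact scheme applied by ST's tools), symmetric per-tensor int8 quantization works like this:

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale
```

The reconstruction error is bounded by half a quantization step, which is why well-conditioned vision models usually tolerate int8 deployment with little accuracy loss.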
model_choices
face_applications
For face detection on the MPU, the selected model was BlazeFace, chosen for its lightweight architecture suited to real-time detection under Linux.
On the MCU, the implementation used two models:
- CenterFace for face detection,
- MobileFaceNet for face recognition.
This MCU pipeline ran face detection first, then performed identity matching by comparing facial embeddings against a reference database with cosine similarity.
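The matching step can be sketched as follows. This is an illustrative reimplementation with hypothetical names (`identify`, `reference_db`), not the project's actual code; it assumes comparable embedding vectors and an accept threshold tuned on a validation set:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding, reference_db, threshold=0.6):
    # reference_db maps identity name -> enrolled embedding vector.
    best_name, best_score = "unknown", -1.0
    for name, ref in reference_db.items():
        score = cosine_similarity(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return "unknown", best_score
    return best_name, best_score
```

The threshold trades false accepts against false rejects, so it matters as much as the embedding model itself.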
gesture_applications
Gesture recognition was intentionally implemented with two different strategies.
On the MPU, the team used a direct image classification approach based on MobileNetV2 adapted to 6 classes corresponding to 0 to 5 raised fingers.
On the MCU, the approach was more hybrid:
- palm detection,
- 21 hand landmarks extraction,
- geometric post-processing to infer the final finger count.
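The geometric post-processing step can be sketched like this, assuming MediaPipe-style 21-landmark indexing (wrist = 0, fingertips = 4, 8, 12, 16, 20) in image coordinates where y grows downward; the exact rules used in the project may differ:

```python
# Tip / PIP-joint index pairs for the four non-thumb fingers (MediaPipe indexing).
FINGER_PAIRS = [(8, 6), (12, 10), (16, 14), (20, 18)]

def count_fingers(landmarks, right_hand=True):
    """landmarks: list of 21 (x, y) points in normalized image coordinates."""
    count = 0
    # A finger is "raised" when its tip sits above its PIP joint (smaller y).
    for tip, pip in FINGER_PAIRS:
        if landmarks[tip][1] < landmarks[pip][1]:
            count += 1
    # The thumb extends sideways, so compare x of the tip against the IP joint.
    tip_x, ip_x = landmarks[4][0], landmarks[3][0]
    if (tip_x < ip_x) if right_hand else (tip_x > ip_x):
        count += 1
    return count
```

Rules like these assume a roughly upright hand, which is part of why landmark-based counting is robust per detection yet sensitive to hand orientation.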
This made the study richer: it compared not only two hardware targets but also two algorithmic strategies for embedded gesture understanding.
implementation_notes
One of the most valuable parts of the project was understanding how different embedded environments change the deployment experience.
On the STM32MP257F-DK2, Linux made development more modular and easier to debug. Camera access, preprocessing, ONNX model execution, and display logic could be managed in Python with a relatively comfortable software stack.
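As an illustration of what the Python-side preprocessing typically looks like (a generic sketch, not the project's exact code), the frame from OpenCV arrives as uint8 BGR in HWC layout and must be resized, rescaled, and transposed into the NCHW tensor an ONNX model expects:

```python
import numpy as np

def preprocess(frame, size=(128, 128)):
    """frame: HxWx3 uint8 BGR array (as returned by cv2.VideoCapture.read())."""
    h, w = frame.shape[:2]
    # Nearest-neighbour resize by index sampling (cv2.resize would be used in practice).
    ys = np.linspace(0, h - 1, size[1]).astype(int)
    xs = np.linspace(0, w - 1, size[0]).astype(int)
    resized = frame[ys][:, xs]
    # BGR -> RGB, scale to [0, 1], then HWC -> NCHW with a batch dimension.
    rgb = resized[..., ::-1].astype(np.float32) / 255.0
    return np.transpose(rgb, (2, 0, 1))[None, ...]
```

Input size, channel order, and normalization constants all depend on the exported model, so they have to be checked against the ONNX graph rather than assumed.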
On the STM32N6570-DK, integration was much lower level. The project required tighter control over buffers, tensor sizes, peripheral configuration, and the full inference loop. Camera acquisition, resizing, inference, and display all had to be organized inside a much more constrained embedded pipeline.
measured_results
| platform | application | estimated inference latency | measured FPS | qualitative validation |
|---|---|---|---|---|
| STM32MP257F-DK2 | Face detection | 25 ms | 18 | Very good |
| STM32MP257F-DK2 | Gesture recognition | 33 ms | 22 | Medium |
| STM32N6570-DK | Face detection + recognition | 129 ms | 6 | Very good |
| STM32N6570-DK | Gesture detection | 15 ms | 30 | Very good |
The MPU delivered fluid real-time execution and was especially effective for face detection.
The MCU delivered the best gesture pipeline in the project, running robustly at 30 FPS and proving that optimized AI workloads can execute effectively on a constrained microcontroller.
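One way to read these numbers: measured FPS reflects the whole loop (capture, preprocessing, inference, display), so the non-inference overhead per frame can be estimated directly from the table. A small sanity-check helper (hypothetical, using the figures above):

```python
def overhead_ms(inference_ms, measured_fps):
    # Total frame budget (1000 / FPS) minus inference time = everything else in the loop.
    return 1000.0 / measured_fps - inference_ms

# e.g. MPU face detection: 1000 / 18 - 25, roughly 30.6 ms per frame outside inference
```

For the MCU gesture pipeline the same arithmetic gives about 18 ms of non-inference work per frame, which shows that capture and display, not the model, set the 30 FPS ceiling.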
constraints_and_lessons
The project also revealed that software complexity matters as much as raw compute power.
Even though the MPU offers a richer environment, OpenSTLinux setup and deployment complexity consumed a significant part of the project effort. In contrast, the MCU required more low-level engineering but gave a more convincing result for some optimized embedded AI tasks.
Another important lesson came from the gesture model on the MPU. Because the dataset was collected in controlled conditions, the model generalized less well when lighting, background, or orientation changed.
The project originally included an audio module, but that part was dropped during execution in order to secure the vision deliverables and maintain the project timeline.
project_outcome
This project gave me practical experience across the full embedded AI chain:
- dataset preparation,
- model training and export,
- quantization,
- deployment on embedded targets,
- benchmark analysis,
- and system-level trade-offs between MCU and MPU platforms.
More importantly, it helped me understand that successful Edge AI is not only about model accuracy. It also depends on integration cost, runtime constraints, memory limits, tooling maturity, and deployment realism.