Embedded Systems

To Count or Not to Count: On the Usage of Performance Counter Metrics for Host to Embedded GPU Performance Estimation

by Alexander Louis-Ferdinand Jung, Moritz Reiber, Jannik Steinmetz, Konstantin Lübeck, and Oliver Bringmann
In MBMV 2026: 29th Workshop, 2026.

Keywords: Performance Estimation, Statistical Modeling, Artificial Intelligence, Deep Neural Networks, Embedded GPUs

Abstract

Artificial intelligence (AI) is undergoing a transition from cloud processing to local processing on embedded devices due to concerns regarding privacy, bandwidth limitations, and energy consumption. Consequently, AI models must be adapted to the constrained computing capabilities of embedded devices, such as embedded GPUs. This poses significant challenges, as the development and training of AI models still take place on workstation or server GPUs, with limited access to the hardware running the final AI function. Statistical performance estimation models offer insight into the runtime behavior of deep neural network (DNN) architectures. Furthermore, these estimation models can guide a neural architecture search (NAS) to prioritize the training of candidates that are compatible with the runtime constraints of the target hardware platform. However, implementing these statistical models requires either substantial training data or a certain degree of hardware knowledge to ensure the accuracy of the estimations. Benchmarking large quantities of training data is a time-consuming process, and knowledge about the hardware is not always available for commercial off-the-shelf (COTS) embedded GPUs. To address these issues, we investigate the use of hardware performance counter metrics collected on a host GPU as supplementary features employed by the performance estimator for the embedded GPU. In this paper, we examine different statistical performance estimation models and how the use of hardware performance counter metrics influences estimation accuracy for both whole DNNs and single convolution layers.
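The core idea of using host-side counter metrics as supplementary estimator features can be illustrated with a minimal sketch. Everything below is a synthetic assumption, not the paper's actual estimator, feature set, or data: the layer hyperparameters, the counter metrics, and the "true" runtimes are all randomly generated, and a plain least-squares linear model stands in for whatever statistical models the paper evaluates.

```python
import numpy as np

rng = np.random.default_rng(0)

n_layers = 200
# Hypothetical base features per convolution layer,
# e.g. output channels, kernel size, input resolution.
layer_feats = rng.uniform(1.0, 64.0, size=(n_layers, 3))
# Hypothetical supplementary features: host-GPU counter metrics,
# e.g. normalized DRAM throughput and SM occupancy.
counter_feats = rng.uniform(0.0, 1.0, size=(n_layers, 2))

# Synthetic "true" embedded-GPU runtime depending on both groups, plus noise.
true_w = np.array([0.5, 0.2, 0.1, 3.0, 1.5])
X = np.hstack([layer_feats, counter_feats])
y = X @ true_w + rng.normal(0.0, 0.1, size=n_layers)

def fit_linear(feats, y):
    """Least-squares linear estimator with an intercept term."""
    A = np.hstack([feats, np.ones((feats.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(feats, coef):
    return np.hstack([feats, np.ones((feats.shape[0], 1))]) @ coef

def mape(y_true, y_pred):
    """Mean absolute percentage error of the runtime estimates."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Compare an estimator built from layer features alone against one that
# additionally uses the host-GPU counter metrics.
for name, feats in [("layer-only", layer_feats), ("layer+counters", X)]:
    coef = fit_linear(feats, y)
    print(f"{name}: MAPE = {mape(y, predict(feats, coef)):.2f}%")
```

On this synthetic data the augmented feature set yields a lower estimation error, which mirrors the question the paper investigates on real DNN workloads: whether counter metrics measured on a host GPU carry signal about runtimes on a different, embedded GPU.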