Turning Up the Heat: 10,000 GMTI Tracks on a Raspberry Pi
Our earlier blog article 2000 GMTI Tracks on a Raspberry Pi contained results from an initial benchmarking test of the Hawkstream tracker. Following some optimisation work, the tracker can now achieve 10,000 simultaneous GMTI tracks on a resource-constrained platform. We use a Raspberry Pi 4 for testing since it is both convenient and reasonably representative of the compute power that may be available on widely deployed small systems.
Sources of improvement
The ability to achieve 10,000 tracks on a Pi is primarily down to the design of a new algorithm, which scales better than those in wide use in many existing systems. Since the original benchmark tests were performed, we have made some improvements to exploit this scalability, along with some other simple optimisations that are usually the first port of call when looking for speed-ups.
Enabling compiler optimisations
The simplest optimisation was to switch on the compiler optimisation flags. For C or C++ code compiled with GCC (the GNU Compiler Collection), this is just a matter of passing the -O3 flag on the command line when compiling.
On the Pi, the original 32-bit Raspbian OS was replaced with a 64-bit Ubuntu 24 installation, which permitted the use of a newer 64-bit compiler.
Parallel execution
In the original benchmarking, the main body of the code ran as a single execution thread. Given that even a Raspberry Pi has 4 cores available, restricting the tracker to a single thread means that only 25% of the available CPU can be used. Since then, the tracker has been modified to run multiple threads, with the workload split across cores. Some of the work that a tracker must do (such as calculating association costs) is “embarrassingly parallel”, meaning that it can easily be split across cores with little need for coordination. Sub-dividing the work contributed to a significant reduction in the latency between receiving new detections and updating the track output. A consequence of this is that Linux utilities such as top report higher peak CPU loadings, as more of the cores are being put to use.
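A minimal sketch of this kind of embarrassingly parallel split, using std::thread. The Point type and squared-distance cost are placeholders for illustration (the tracker's real state representation and cost function are not described here); the parallel structure is the point: each thread fills a contiguous block of rows of the cost matrix, so no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical 2-D state: each track/detection reduced to an (x, y) position.
struct Point { double x, y; };

// Illustrative association cost: squared Euclidean distance.
double cost(const Point& t, const Point& d) {
    double dx = t.x - d.x, dy = t.y - d.y;
    return dx * dx + dy * dy;
}

// Fill the tracks-by-detections cost matrix (row-major) in parallel.
// Each worker owns a contiguous range of track rows, so the threads
// never write to the same element and need no synchronisation.
std::vector<double> costMatrix(const std::vector<Point>& tracks,
                               const std::vector<Point>& dets,
                               unsigned nThreads) {
    std::vector<double> m(tracks.size() * dets.size());
    std::size_t chunk = (tracks.size() + nThreads - 1) / nThreads;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < nThreads; ++w) {
        std::size_t lo = w * chunk;
        std::size_t hi = std::min(tracks.size(), lo + chunk);
        workers.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                for (std::size_t j = 0; j < dets.size(); ++j)
                    m[i * dets.size() + j] = cost(tracks[i], dets[j]);
        });
    }
    for (auto& t : workers) t.join();
    return m;
}
```

In practice nThreads would be chosen from std::thread::hardware_concurrency() (4 on the Pi 4), and a long-lived thread pool would avoid the cost of spawning threads for every detection batch.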
Association Cost Function Improvements
Performance analysis showed that the cost function used to associate tracks with new detections was a significant CPU consumer. This makes sense, since for every batch of 10,000 new detections being matched against 10,000 existing tracks, this function is evaluated a total of 10,000 × 10,000 = 100,000,000 times. Simple improvements to this function halved the compute time required.
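The article does not detail which improvements were made, but one common trick for a function evaluated 10⁸ times per batch is to gate it: reject clearly implausible track/detection pairs with a couple of cheap comparisons before paying for the full cost evaluation. A hedged sketch, with an assumed gate radius and an illustrative distance-based cost:

```cpp
#include <cmath>
#include <limits>

// Assumed gate radius (illustrative value, units arbitrary).
constexpr double GATE = 50.0;
// Infinite cost means "cannot associate".
constexpr double NO_MATCH = std::numeric_limits<double>::infinity();

// Return the association cost for a track at (tx, ty) and a detection
// at (dx, dy), gating out far-apart pairs before the expensive part.
double gatedCost(double tx, double ty, double dx, double dy) {
    // Cheap rejection: two comparisons instead of a full distance.
    if (std::fabs(tx - dx) > GATE || std::fabs(ty - dy) > GATE)
        return NO_MATCH;
    // Full (illustrative) cost only for plausible pairs.
    double ex = tx - dx, ey = ty - dy;
    return std::sqrt(ex * ex + ey * ey);
}
```

When most of the 100,000,000 pairs fall outside the gate, almost all evaluations take the fast path, which alone can account for a large fraction of the reported speed-up.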
Test Results
Using the same methodology as before, the tests were re-run with the maximum track limit set at 10,000 and a feed of 10,000 fresh simulated radar detections every second.
Soak Testing on a Desktop PC
Under load, the CPU (AMD Ryzen 5 5500, 6 cores, 3.6 GHz) usage reported by top peaked at 111%, before settling down to around 85%. Since top reports up to 100% per logical CPU, the ceiling on this 6-core/12-thread machine is 1200%. The longest time to process a batch of updates was 135.8 ms, with around 120 ms typical. This compares with around 170 ms previously, running with 2,000 tracks and detections, so latency has reduced despite the increased track loading. It is likely that on this platform the maximum track count could be increased without further software modifications, given that it is currently operating well within its limits.
Soak Testing on a Raspberry Pi 4
Repeating the tests on the Pi, the CPU loading peaks at 288% (the theoretical maximum is 400%, corresponding to all 4 cores being busy). The peak update processing latency is 817 ms, with around 800 ms typical. This is a small increase from the previously measured 780 ms, but is still consistent with real-time processing. The tracker remains capable of keeping up with the real-time detection arrival rate (1 second interval), but there will be a noticeable delay in the processing of the latest updates. Further optimisations can likely reduce this latency. As would be expected, the gain from parallelisation is smaller on the Pi, given its lower number of processor cores and logical threads compared with the AMD Ryzen 5.
Dropping back to a 5,000 track limit, the CPU loading peaks at 76%, and the peak latency is 231 ms, which is well within acceptable bounds for most applications.
Conclusions/Further Work
Doing a first pass of optimisations has been sufficient to permit 10,000 simultaneous GMTI tracks. On the desktop AMD Ryzen 5, this is achieved comfortably. The Pi is also capable of 10,000 simultaneous tracks, but with less breathing space compared to the real-time arrival rate of detections, and there remains a noticeable latency in processing updates. However, the optimisations have permitted a 5x increase in the number of simultaneous tracks compared with the benchmarking baseline.
There are likely to be a number of further improvements available, particularly in areas which have not yet had any optimisation effort applied. Mapping the easily parallelised parts of the code to use an Nvidia GPU could also result in significant speedups, but these tend to be less widely available in currently operational platforms.
Given that the initial goal of 10,000 tracks has now been achieved, the focus of further efforts will likely switch to tracking accuracy improvements and integration with third party systems.
If you would like to find out more about this tracker software, to understand if it might be suitable for your application, please contact us to discuss it.