Speed Up Arduino With Clever Coding

We love Arduino here at Hackaday; they’ve probably done more to make embedded programming accessible than anything else in the history of the field. One thing the Arduino ecosystem is rarely praised for, however, is its speed. That’s where [Playduino] comes in, with his video (embedded below) that promises to make everyone’s favourite microcontroller run 50x faster.

You might be expecting an unstable overclocking setup, with swapped crystals, tweaked voltages and a hefty heat sink, but no! This is stock hardware. The 50x speedup comes from one simple hack: don’t use digitalWrite();

If you aren’t familiar, digitalWrite() is one of the key functions the Arduino environment gives you for driving its boards: specify a pin and a value (high or low), and it does the rest. It’s very easy to use, but it’s also very slow. [Playduino] takes a moment to show just how much is going on under the hood when you call digitalWrite(), then shows what you can do instead if you have a need for speed. (Hint: there’s no Arduino-provided code involved; hardware registers and the __asm keyword make an appearance.)
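To give a flavor of the difference, here’s a minimal sketch of direct port manipulation on an ATmega328P-based board such as the Uno, where Arduino pin 13 maps to bit 5 of PORTB. This is the general idea rather than [Playduino]’s exact code, and the register names and bit positions will differ on other boards:

```cpp
void setup() {
  DDRB |= (1 << DDB5);       // set PB5 (Arduino pin 13) as an output,
                             // the register-level pinMode(13, OUTPUT)
}

void loop() {
  // The slow way: digitalWrite() translates the pin number to a port
  // and bit at runtime, checks for an attached PWM timer, and
  // disables/restores interrupts on every call.
  digitalWrite(13, HIGH);
  digitalWrite(13, LOW);

  // The fast way: poke the hardware register directly. With a constant
  // bit position, avr-gcc compiles each of these down to a single
  // sbi/cbi instruction.
  PORTB |= (1 << PORTB5);    // pin 13 HIGH
  PORTB &= ~(1 << PORTB5);   // pin 13 LOW
}
```

Toggling a pin this way in a tight loop is where the 50x figure comes from: the register writes cost a couple of clock cycles each, while a digitalWrite() call burns dozens.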

If you learned embedded programming in an earlier era, this will probably seem glaringly obvious. If you, like so many of us, got your start inside the Arduino ecosystem, these closer-to-the-metal programming techniques could prove to be useful arrows in your quiver. Big thanks to [Stephan Walters] for the tip.

Of course, if you prefer to speed things up with hardware rather than software, you can overclock an Arduino (with liquid nitrogen, even).

C++ Design Patterns For Low-Latency Applications

Although performance optimization may seem to have lost its relevance in an era of ever-increasing hardware performance, there are still many good reasons to spend some time optimizing code. A recent preprint article by [Paul Bilokon] and [Burak Gunduz] of Imperial College London focuses specifically on low-latency patterns that are relevant for applications such as high-frequency trading (HFT). In HFT, small margins are compensated for by churning through absolutely massive volumes of trades, all of which relies on extremely low latency to gain every advantage. Although FPGA-based solutions are very common in HFT due to their low latency and high parallelism, C++ remains the main language used beyond FPGAs.

Many of the optimizations listed in the paper are quite obvious, such as prewarming the CPU caches, using constexpr, loop unrolling, and inlining. Other patterns are less so, such as hotpath versus coldpath. This overlaps with the branch reduction pattern, with both involving the separation of commonly and rarely executed code (like error handling and logging), which improves use of the CPU’s caches and prevents branch mispredictions, as the benchmarks (using Google Benchmark) clearly demonstrate. All design patterns can also be found in the GitHub repository.
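The gist of hotpath/coldpath reduces to something like the following sketch; the function names are ours for illustration rather than the paper’s, and the attributes are the GCC/Clang and C++20 spellings:

```cpp
#include <cstdio>
#include <stdexcept>

// Cold path: rarely executed error handling, kept out-of-line so it
// doesn't crowd the hot code out of the instruction cache.
[[gnu::noinline, gnu::cold]]
void reject_order(int qty) {
    std::fprintf(stderr, "rejected order, qty=%d\n", qty);
    throw std::runtime_error("invalid order quantity");
}

// Hot path: the common case runs straight through with a single,
// well-predicted branch; the C++20 [[unlikely]] attribute hints the
// compiler to lay the error case out away from the hot code.
int process_order(int qty, int price) {
    if (qty <= 0) [[unlikely]]
        reject_order(qty);
    return qty * price;  // the actual work stays small and inlinable
}
```

The same shape works for logging: hand the formatting off to a cold function and keep only the branch itself on the hot path.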

Other interesting tidbits are the impact of signed versus unsigned comparisons, of mixing floating-point datatypes, and of course lock-free programming using a ring buffer design. About the only things missing from the list appear to be aligned versus unaligned memory accesses and zero-copy optimizations, but those should be easy to implement and test alongside the other optimizations in the paper.
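The lock-free ring buffer boils down to something like this single-producer, single-consumer sketch (our own minimal rendition, not the repository’s code):

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Minimal lock-free SPSC ring buffer: one producer thread calls push(),
// one consumer thread calls pop(), and no locks or CAS loops are needed.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire))
            return false;  // buffer full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;  // buffer empty
        out = buf_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

A power-of-two capacity keeps the index wrap a cheap bitwise AND, and because only the producer writes head_ and only the consumer writes tail_, the acquire/release pairs are all the synchronization required.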

Using Valgrind To Analyze Code For Bottlenecks Makes Faster, Less Power-Hungry Programs

What is the right time to optimize code? This is a very good question, which usually comes down to two answers. The first answer is to have a good design for the code to begin with, because ‘optimization’ does not mean ‘fixing bad design decisions’. The second answer is that it should happen after the application has been sufficiently debugged and its developers are at risk of getting bored.

There should also be a goal for the optimization, based on what makes sense for the application. Does it need to process data faster? Should it send less data over the network or to disk? Shouldn’t one really have a look at that memory usage? And just what is going on inside those CPU caches that makes performance sometimes drop off a cliff on a single core?

All of this and more can be analyzed using tools from the Valgrind suite, including Cachegrind, Callgrind, DHAT and Massif.
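As a taste of the workflow, here’s a small, deliberately cache-hostile example with the standard invocations in the comments; the loop structure is our own illustration, and the cachegrind.out.<pid> file name varies per run:

```cpp
// Build and profile:
//   g++ -O2 -g matrix.cpp -o matrix
//   valgrind --tool=cachegrind ./matrix
//   cg_annotate cachegrind.out.<pid>
// Cachegrind's per-line D1 miss counts make the difference between
// these two loops obvious.
#include <cstdio>
#include <vector>

int main() {
    const int N = 1024;
    std::vector<int> m(N * N, 1);
    long sum = 0;

    // Cache-friendly: row-major traversal matches the memory layout.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += m[i * N + j];

    // Cache-hostile: column-major traversal strides N ints per access,
    // missing in the data cache on nearly every read for large N.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sum += m[i * N + j];

    std::printf("%ld\n", sum);
}
```

Cachegrind reports the cache miss rates, Callgrind adds call-graph costs, Massif tracks heap usage over time, and DHAT profiles allocation patterns.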

Keeping Those Cores Cool

Modern-day processors are designed with low power usage in mind, regardless of whether they are aimed at servers, desktop systems, or embedded applications. This essentially means that they sit in a low-power state when not doing any work (the idle loop), with some CPUs and microcontrollers turning off power to parts of the chip that are not being used. Consequently, the more the processor has to do, the more power it will use and the hotter it will get.
